- Case Study
- Open Access
The distance function effect on k-nearest neighbor classification for medical datasets
- Li-Yu Hu^{1},
- Min-Wei Huang^{2},
- Shih-Wen Ke^{3} and
- Chih-Fong Tsai^{4}
- Received: 4 May 2016
- Accepted: 28 July 2016
- Published: 9 August 2016
Abstract
Introduction
The k-nearest neighbor (k-NN) classifier is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It measures the distances between the test data and each of the training data to decide the final classification output.
Case description
Since the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for various medical domain problems. Therefore, the aim of this paper is to investigate whether the distance function can affect k-NN performance over different medical datasets. Our experiments are based on three types of medical datasets containing categorical, numerical, and mixed types of data, and four distance functions (Euclidean, cosine, Chi square, and Minkowsky) are each used in k-NN classification.
Discussion and evaluation
The experimental results show that the Chi square distance function is the best choice for the three different types of datasets. However, the cosine and Euclidean (and Minkowsky) distance functions perform the worst over the mixed type of datasets.
Conclusions
In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets including categorical, numerical, and mixed types of data, k-NN based on the Chi square distance function performs the best.
Keywords
- Pattern classification
- k-Nearest neighbor
- Euclidean distance
- Distance function
- Medical datasets
Background
The goal of pattern classification is to allocate an object represented by a number of measurements (i.e. feature vectors) into one of a finite set of classes. The k-nearest neighbor (k-NN) algorithm is one of the most widely used classification algorithms since it is simple and easy to implement. Moreover, it is usually used as the baseline classifier in many domain problems (Jain et al. 2000).
The k-NN algorithm is a non-parametric method, which is usually used for classification and regression problems. It is a type of lazy learning algorithm, for which no off-line training is needed. During the classification stage for a given testing example, the k-NN algorithm directly searches through all the training examples by calculating the distances between the testing example and all of the training data in order to identify its nearest neighbors and produce the classification output (Mitchell 1997).
In particular, the distance between two data points is decided by a similarity measure (or distance function), of which the Euclidean distance is the most widely used. In the literature, there are several other types of distance functions, such as the cosine similarity measure (Manning et al. 2008), Minkowsky (Batchelor 1978), correlation, and Chi square (Michalski et al. 1981). However, there is no comparative study examining the effect of the distance function on the performance of k-NN.
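To make the named measures concrete, the following is a minimal Python sketch of three of them. It is an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the cosine measure is turned into a distance as one minus the cosine similarity, and the Chi square distance uses one common form, the sum of (x - y)^2 / (x + y) over non-negative features, which may differ from the variant used in this paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def chi_square(a, b):
    """Assumed form: sum of (x - y)^2 / (x + y) over non-negative features.

    Feature pairs whose sum is zero are skipped to avoid division by zero.
    """
    total = 0.0
    for x, y in zip(a, b):
        if x + y > 0:
            total += (x - y) ** 2 / (x + y)
    return total

a, b = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)
print(euclidean(a, b))        # sqrt(14), about 3.74
print(cosine_distance(a, b))  # about 0, since the vectors are parallel
print(chi_square(a, b))       # 1/3 + 4/6 + 9/9, about 2.0
```

The example shows why the choice matters: the two vectors are far apart under the Euclidean distance but essentially identical under the cosine measure, which ignores vector length.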
Moreover, since real world datasets of medical domain problems can contain categorical (i.e. discrete), numerical (i.e. continuous), or both types of data, we believe that different distance functions should perform differently over different types of datasets. This is very important for relevant decision makers who need to identify the ‘best’ k-NN classifier for medical related problems. Therefore, the aim of this paper is to provide some guidelines about which distance function used in k-NN is the better choice for which type of medical dataset.
The rest of this paper is organized as follows. “Literature review” section defines the pattern classification problems, overviews the idea of k-NN classification, and briefly describes the five well known distance functions used in this paper. “Experiments” section presents the experimental setup and results. Finally, “Conclusion” section concludes this paper.
Literature review
Pattern classification
The goal of pattern classification is to allocate an object represented by a number of measurements (i.e. feature vectors) into one of a finite set of classes. Supervised learning can be thought of as learning by examples or learning with a teacher. The teacher has knowledge of the environment, which is represented by a set of input–output examples. In order to classify unknown patterns, a certain number of training samples are available for each class, and they are used to train the classifier (Mitchell 1997).
The problem of supervised pattern recognition can be stated as follows. We are given a training dataset in which each training example is composed of a number of input feature variables and a corresponding class label. An unknown function is learned over the training dataset to approximate the mapping between the input–output examples, so that it correctly classifies as many of the training data as possible.
k-Nearest neighbor classification
As k-NN does not require an off-line training stage, its main computation is the on-line ‘searching’ for the k nearest neighbors of a given testing example. Although using different k values is likely to produce different classification results, 1-NN is usually used as a benchmark for the other classifiers since it can provide reasonable classification performance in many pattern classification problems (Jain et al. 2000).
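The on-line search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study: the names `knn_predict` and `euclidean` and the toy data are assumptions made for the example, and the distance function is passed in as a parameter so that other measures can be substituted.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=1, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training examples."""
    # No training stage: every training example is compared with the query.
    scored = sorted(
        ((distance(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    # Ties in the vote go to the label that was seen first, i.e. the nearer one.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in two dimensions.
train_X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
train_y = ["a", "a", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=1))  # prints "a"
print(knn_predict(train_X, train_y, (3.8, 4.0), k=3))  # prints "b"
```

With k = 1 the function reduces to the 1-NN benchmark: the label of the single closest training example is returned.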
Distance functions
Experiments
Experimental setup
Datasets of three different attribute types are chosen from the UCI machine learning repository.^{2} They are categorical, numerical, and mixed attribute types of data, which contain 10, 17, and 10 datasets respectively. Moreover, each type of dataset contains different numbers of attributes, samples, and classes in order to figure out the effect of using different types of datasets with different missing rates on the final classification accuracy.
Dataset information
Dataset | No. of instances | No. of attributes | No. of classes |
---|---|---|---|
Categorical datasets | |||
Lymphograph | 148 | 18 | 4 |
Nursery | 12,960 | 8 | 11 |
Promoters | 106 | 58 | 2 |
SPECT | 267 | 22 | 2 |
Numerical datasets | |||
Blood | 748 | 5 | 2 |
Breast cancer | 286 | 9 | 2 |
Ecoli | 336 | 8 | 8 |
Pima | 768 | 8 | 2 |
Mixed datasets | |||
Acute | 120 | 6 | 2 |
Contraceptive | 1,473 | 9 | 3 |
Liver_disorders | 345 | 7 | 2 |
Statlog | 270 | 13 | 2 |
On the other hand, for k-NN classifier design, the k values are set from 1 to 15 for comparison. In addition, tenfold cross validation is used: each dataset is divided into ten folds so that, in each round, 90% of the data is used to train the k-NN classifier and the remaining 10% is used to test it. Specifically, the Euclidean distance, cosine similarity measure, Minkowsky, correlation, and Chi square distance functions are each used in the k-NN classifier.
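The evaluation protocol described above can be sketched as follows. This is a hedged illustration and not the experimental code of this study: the function names, the synthetic two-class data, the fixed random seeds, and the plain (non-stratified) fold assignment are all assumptions, and only the Euclidean distance is shown.

```python
import math
import random
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k, distance):
    """Majority vote among the k nearest (features, label) pairs in `train`."""
    nearest = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, distance, n_folds=10, seed=0):
    """Mean accuracy over n_folds rounds; each fold is the test set once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every n_folds-th example, giving near-equal fold sizes.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        correct = sum(knn_predict(train, x, k, distance) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds

# Synthetic stand-in for a dataset: two Gaussian clusters, 50 examples each.
rng = random.Random(1)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(50)]
data += [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(50)]

# Evaluate k = 1, ..., 15 as in the setup described above.
results = {k: cross_val_accuracy(data, k, euclidean) for k in range(1, 16)}
best_k = max(results, key=results.get)
print(best_k, results[best_k])
```

Repeating the loop with a different `distance` argument gives one accuracy curve per distance function, which is the comparison this setup is designed to produce.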
Experimental results
Results on categorical datasets
Results on numerical datasets
Results on mixed types of datasets
Further comparisons
Conclusions
In this paper, we hypothesize that since k-NN classification is based on measuring the distance between the test data and each of the training data, the chosen distance function can affect the classification accuracy. In addition, as different medical domain problem datasets usually contain different types of data, such as the categorical, numerical, and mixed types of data, these three types of data are considered in this paper.
Using four different distance functions (Euclidean, cosine, Chi square, and Minkowsky), our experimental results show that the Chi square distance function makes the k-NN classifier perform the best over the three different types of datasets. On the other hand, the Euclidean distance function performs reasonably well over the categorical and numerical datasets, but not over the mixed type of datasets.
Minkowsky distance is typically used with r being 1 or 2, where the former is sometimes known as the Manhattan distance and the latter is the Euclidean distance.
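The two special cases in this note can be checked with a short Python sketch. The function name `minkowski` and the example points are assumptions chosen for illustration; the formula is the standard r-th root of the sum of absolute coordinate differences raised to the power r.

```python
def minkowski(a, b, r):
    """Minkowsky distance of order r between two equal-length vectors."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))  # Manhattan distance: 3 + 4 = 7.0
print(minkowski(a, b, 2))  # Euclidean distance: sqrt(9 + 16) = 5.0
```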
Declarations
Authors’ contributions
LH: contributed to the research proposal, collected the experimental datasets, and prepared the “Background” section. MH: conducted the experiments. SK and CT: contributed to writing the rest of the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- Batchelor BG (1978) Pattern recognition: ideas in practice. Plenum Press, Berlin, Heidelberg
- Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
- Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37
- Manning CD, Raghavan P, Schutze H (2008) An introduction to information retrieval. Cambridge University Press, Cambridge
- Michalski RS, Stepp RE, Diday E (1981) A recent advance in data analysis: clustering objects into classes characterized by conjunctive concepts. In: Kanal LN, Rosenfeld A (eds) Progress in pattern recognition. North-Holland, Amsterdam, pp 33–56
- Mitchell T (1997) Machine learning. McGraw Hill, New York