Nearest neighbour classification techniques have been the topic of hundreds of papers over the past 40 years in the pattern recognition and statistical classification literature. An excellent survey of this area has recently been prepared by Dasarathy (1991). A surprising shortcoming of this extensive literature is that it gives little attention to the problem of selecting the optimal distance norm for determining nearest neighbours. In most papers, this issue is avoided by looking at the asymptotic performance as the number of training cases approaches infinity (in which case the metric is irrelevant). However, any reasonable learning method will converge to the Bayes optimal solution with infinite data, so it is the number of training cases required for a given level of performance that distinguishes learning methods.
The importance of an appropriate distance metric can be seen in the degradation in performance that often accompanies the addition of new input features. Each time an unimportant feature is added to the feature set and assigned a weight similar to that of an important feature, the quantity of training data needed increases by a multiplicative factor, since the training set must now cover all combinations of the important and unimportant values. It is easy to create exponential increases in training data requirements by adding only a few poorly weighted features. This is why nearest-neighbour algorithms have sometimes shown excellent performance (when appropriate features and metrics have been used), but also often show poor performance in comparison with other learning methods (when poor metrics are chosen, usually on the basis of equal weighting for each feature).
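To make this effect concrete, the following sketch (hypothetical data and function names, not drawn from any of the cited work) classifies a query point by its single nearest neighbour under a feature-weighted Euclidean distance; when the noise features receive the same weight as the single informative feature, the metric is dominated by noise and far more training cases are needed to recover the same accuracy.

\begin{verbatim}
import numpy as np

def weighted_nn_classify(query, train_X, train_y, weights):
    # Single-nearest-neighbour classification under a diagonal
    # (per-feature) weighting of the Euclidean distance.
    diffs = (train_X - query) * weights
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    return train_y[np.argmin(dists)]

# Toy problem: only the first feature determines the class;
# the remaining three are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

q = np.array([1.5, 0.0, 0.0, 0.0])
print(weighted_nn_classify(q, X, y, np.ones(4)))               # equal weighting
print(weighted_nn_classify(q, X, y, np.array([1., 0, 0, 0])))  # informed weighting
\end{verbatim}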
One reason for the strong interest in neural network learning methods, such as back propagation, is that they are able to select useful input features from high-dimensional input vectors. Therefore, they do not suffer from the ``curse of dimensionality'' of classical nearest neighbours, in which higher-dimensional inputs become less likely to provide accurate classifications with reasonable amounts of training data. This becomes even more important when it is necessary to take weighted combinations of noisy redundant inputs in order to produce the best classification. In these cases, it is even less likely that the initial assigned weights will be appropriate. In this paper, we use the same optimization techniques developed for other neural network methods, but apply them directly to determining relative feature weightings. The result is that equivalent or better generalization can be achieved while solving for far fewer parameters and gaining the other advantages of the nearest neighbour approach.
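The following sketch illustrates the general idea, although it is not the optimization procedure used in this paper: the per-feature weights of a Gaussian-kernel classifier are adjusted by crude finite-difference gradient descent on a leave-one-out error measure. The kernel width, step size, and error criterion are all illustrative assumptions.

\begin{verbatim}
import numpy as np

def loo_error(weights, X, y, sigma=1.0):
    # Leave-one-out squared error of a kernel-weighted-average
    # classifier under a diagonal (per-feature) similarity metric.
    n = X.shape[0]
    err = 0.0
    for i in range(n):
        d2 = (((X - X[i]) * weights) ** 2).sum(axis=1)
        k = np.exp(-d2 / (2 * sigma ** 2))
        k[i] = 0.0                     # exclude the held-out case
        p = (k * y).sum() / (k.sum() + 1e-12)
        err += (p - y[i]) ** 2
    return err / n

def optimize_weights(X, y, steps=50, lr=0.5, eps=1e-3):
    # Crude finite-difference gradient descent on the feature weights.
    w = np.ones(X.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(w)
        for j in range(len(w)):
            dw = np.zeros_like(w)
            dw[j] = eps
            grad[j] = (loo_error(w + dw, X, y) -
                       loo_error(w - dw, X, y)) / (2 * eps)
        w -= lr * grad
    return w
\end{verbatim}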
Research in the neural network field has recently been moving towards algorithms that interpolate between nearest neighbours. One of the most popular of these methods is the radial basis function (RBF) network (Broomhead & Lowe, 1988; Moody & Darken, 1989). This is quite similar to the classical Parzen window method of estimating probability density distributions (Duda & Hart, 1973), except that it uses somewhat fewer basis functions and adds a linear output layer of weights that are optimized during the learning process. However, neither the RBF nor Parzen window method provides any way to optimize the similarity metric. Therefore, they suffer from the same problem as standard nearest neighbours, in which performance will be good only when the appropriate feature weighting happens to be specified by the user. Poggio & Girosi (1989, 1990) have proposed extensions to the RBF method, which they call generalized RBFs and hyper basis functions, that optimize the centers of the basis functions and the global similarity metric. This provides a very flexible framework, but the large number of parameters (including those in the output layer) means that it is necessary to select some problem-specific subset of parameters to optimize and to determine some limited number of basis functions that is smaller than the number of training examples. In practice, this requires extensive cross-validation testing to determine the model size and the appropriate selection of free parameters, which is computationally very expensive and prevents the use of incremental learning as needed for biological models or on-line learning.
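As a point of reference, a bare-bones RBF network of this form might be sketched as follows (an illustrative assumption, not the implementation of any of the cited authors): Gaussian basis functions with a fixed isotropic width are placed on a chosen set of centers, and only the linear output-layer weights are fitted by least squares, so the similarity metric itself is never adjusted.

\begin{verbatim}
import numpy as np

def rbf_design(X, centers, sigma):
    # Activations of Gaussian basis functions with fixed width sigma.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_rbf(X, y, centers, sigma):
    # Only the linear output weights are optimized; the metric
    # implicit in the basis functions stays fixed.
    Phi = rbf_design(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def predict_rbf(Xq, centers, sigma, w):
    return rbf_design(Xq, centers, sigma) @ w
\end{verbatim}

Here the centers would typically be a subset of the training points or cluster centers, which is the sense in which the method uses somewhat fewer basis functions than the Parzen window approach.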
Previous research on the problem of optimizing a similarity metric includes the work of Atkeson (1991) on robot learning. He uses cross-validation to optimize not only a similarity metric but also other stabilization and cross-correlation terms. Similarly, the work of Wettschereck & Dietterich (1992) selects a similarity metric for the Wolpert approach to the NETtalk problem. Both of these methods use a distance-weighted interpolation kernel that has the property of giving infinite weight to training data that exactly matches the current input. This is clearly undesirable for noisy inputs, as is the case with most real-world problems. This paper instead makes use of a variable kernel method that provides better interpolation and approximation in the presence of noise. VSM learning is aimed at classification problems with many input features, whereas the more extensive correlation matrix fitting of Atkeson may be more appropriate for continuous-output problems based on low-dimensional inputs, as occurs in the problem of robot control.
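The difference between the two kinds of kernel can be seen in a small numerical sketch (illustrative values only): an inverse-distance kernel of the kind used for distance-weighted interpolation gives an exploding weight to an exact match, whereas a bounded kernel such as a Gaussian caps that weight, so that nearby, possibly noisy, training cases are still averaged together.

\begin{verbatim}
import numpy as np

def inverse_distance_weights(d):
    # An exact match (d == 0) receives an effectively infinite weight,
    # so a single noisy training case can dominate the prediction.
    return 1.0 / np.maximum(d, 1e-12) ** 2

def gaussian_weights(d, sigma):
    # A bounded kernel: an exact match gets the finite maximum weight
    # of 1, so neighbouring cases still contribute to the average.
    return np.exp(-d ** 2 / (2 * sigma ** 2))

d = np.array([0.0, 0.1, 0.2, 0.5])
print(inverse_distance_weights(d))      # first weight blows up
print(gaussian_weights(d, sigma=0.2))   # first weight is exactly 1.0
\end{verbatim}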
Cleveland & Devlin (1988) describe the LOESS method for locally weighted regression, in which a local weighting kernel is used to smooth multivariate data. However, they use a similarity metric that is proportional to the variance of each input feature rather than one optimized according to each feature's value in determining the output.
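A rough sketch of such a locally weighted linear fit is given below (hypothetical names and parameter choices; scaling each feature by its own standard deviation stands in for the variance-based metric described above, and depends only on the feature's spread rather than its usefulness in predicting the output).

\begin{verbatim}
import numpy as np

def local_linear_fit(xq, X, y, frac=0.3):
    # Locally weighted linear regression at query point xq, in the
    # spirit of LOESS.  Each feature is scaled by its own spread,
    # not by how relevant it is to the output.
    scale = np.where(X.std(axis=0) > 0, X.std(axis=0), 1.0)
    d = np.sqrt((((X - xq) / scale) ** 2).sum(axis=1))
    h = max(np.sort(d)[int(frac * len(d))], 1e-12)  # local bandwidth
    u = np.clip(d / h, 0.0, 1.0)
    w = (1 - u ** 3) ** 3                           # tricube weights
    A = np.hstack([np.ones((len(X), 1)), X])        # local linear model
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return np.r_[1.0, xq] @ beta
\end{verbatim}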