The similarity metric weights and the kernel width factor are optimized using the cross-validation procedure that has been widely adopted in neural network research. This minimizes the error that results when the output of each exemplar in the training set is predicted on the basis of the remaining data without that exemplar. As Atkeson (1991) has discussed, this is simple to implement with the nearest-neighbour method because it is trivial to ignore one data item when applying the interpolation kernel. This avoids some of the problems of overtraining found in many other neural network learning methods, which cannot so easily remove a single exemplar to measure the cross-validation error.
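As a concrete illustration of this leave-one-out computation, the following sketch evaluates the cross-validation error for a kernel-weighted nearest-neighbour classifier. The diagonally weighted Euclidean metric, the Gaussian kernel, the choice of K neighbours, and the setting of the kernel width from the mean neighbour distance are assumptions made for the example rather than definitions taken from this text.

```python
import numpy as np

def loo_cv_error(X, Y, weights, r, K=10):
    """Leave-one-out cross-validation error for a kernel-weighted
    nearest-neighbour classifier (illustrative sketch only).

    X: (N, D) input features;  Y: (N, C) one-hot correct outputs;
    weights: (D,) similarity-metric weights;  r: kernel width factor.
    """
    N = X.shape[0]
    error = 0.0
    for t in range(N):
        # Weighted Euclidean distance from exemplar t to every other exemplar.
        d = np.sqrt((((X - X[t]) * weights) ** 2).sum(axis=1))
        d[t] = np.inf                     # trivially ignore the held-out exemplar
        nbrs = np.argsort(d)[:K]          # its K nearest neighbours
        sigma = r * d[nbrs].mean()        # assumed variable kernel width
        k = np.exp(-d[nbrs] ** 2 / (2.0 * sigma ** 2))
        p = (k[:, None] * Y[nbrs]).sum(axis=0) / k.sum()  # output probabilities
        error += ((p - Y[t]) ** 2).sum()  # squared error against the known output
    return error
```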
The particular optimization technique that has been used is conjugate gradient (with the Polak-Ribiere update), because it is efficient even with large numbers of parameters and converges rapidly without the need to set convergence parameters. One important technique in applying it to this problem is that the set of neighbours of each exemplar is stored before each line search, and the same neighbours are used throughout the line search. This avoids introducing discontinuities into the error measure due to changes in the set of neighbours as the similarity metric changes, which could in turn lead to inappropriate choices of step size in the line search. A useful side effect is that this greatly speeds up the line search, since repeatedly finding the nearest neighbours would otherwise be the dominant cost. For the problems we have studied, the conjugate gradient method converges to a minimum error in about 5 to 20 iterations.
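The structure of this optimization loop might be sketched as follows; `error_and_grad` and `neighbours_for` are hypothetical helpers standing in for the error/gradient computation and the neighbour search, and the backtracking line search is a simplification of a full line search.

```python
import numpy as np

def optimize_weights(w0, error_and_grad, neighbours_for, n_iters=20):
    """Sketch of the outer loop: the neighbour sets are found once per iteration
    and held fixed through that iteration's line search, so the error stays
    smooth along the search direction.  `error_and_grad(w, nbrs)` and
    `neighbours_for(w)` are hypothetical helpers, not functions from the text."""
    w = np.asarray(w0, dtype=float).copy()
    direction, prev_g = None, None
    for _ in range(n_iters):
        nbrs = neighbours_for(w)              # cache the neighbour sets
        e, g = error_and_grad(w, nbrs)
        if prev_g is None:
            direction = -g                    # first iteration: steepest descent
        else:
            beta = max(0.0, g @ (g - prev_g) / (prev_g @ prev_g))  # Polak-Ribiere
            direction = -g + beta * direction
        step = 1.0                            # backtracking line search over the
        while step > 1e-8:                    # *fixed* neighbour sets
            e_new, _ = error_and_grad(w + step * direction, nbrs)
            if e_new < e:
                break
            step *= 0.5
        w = w + step * direction
        prev_g = g
    return w
```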
In order to apply the conjugate gradient optimization efficiently, it is necessary to compute the derivative of the cross-validation error with respect to the parameters being optimized. The cross-validation error E is defined as the sum over all training exemplars t and output labels i of the squared difference between the known correct output $o_{ti}$ and the probability $p_{ti}$ computed for that output label from the nearest neighbours of the exemplar:
\[
E = \sum_{t} \sum_{i} \left( p_{ti} - o_{ti} \right)^{2}.
\]
The derivative of this error can be computed with respect to each weight parameter $w_m$ of the similarity metric by applying the chain rule through the kernel values $k_j$ assigned to the nearest neighbours:
\[
\frac{\partial E}{\partial w_m} = \sum_{t} \sum_{i} 2\,(p_{ti} - o_{ti})\, \frac{\partial p_{ti}}{\partial w_m}
\]
where
\[
\frac{\partial p_{ti}}{\partial w_m} = \sum_{j} \frac{\partial p_{ti}}{\partial k_j}\, \frac{\partial k_j}{\partial w_m}
\]
and
\[
\frac{\partial p_{ti}}{\partial k_j} = \frac{n_{ji} - p_{ti}}{\sum_{l} k_l},
\]
in which $n_{ji}$ is 1 if neighbour j has output label i and 0 otherwise, and the remaining factor $\partial k_j / \partial w_m$ follows directly from the definitions of the weighted distance metric and the Gaussian kernel. The sum in this last expression does not depend on the particular neighbour j and can therefore be precomputed for the set of neighbours.
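A direct implementation of this gradient might look as follows. It is only a sketch: it assumes a diagonally weighted Euclidean metric, a Gaussian kernel whose width is r times the mean neighbour distance, and, for simplicity, it treats that width as constant when differentiating with respect to the weights.

```python
import numpy as np

def cv_error_gradient(X, Y, w, r, K=10):
    """Gradient of the leave-one-out error with respect to the similarity-metric
    weights, following the chain rule above.  Sketch only: it assumes a diagonal
    weighted Euclidean metric, a Gaussian kernel with width proportional to the
    mean neighbour distance, and treats that width as constant when
    differentiating."""
    N, D = X.shape
    grad = np.zeros(D)
    for t in range(N):
        diff2 = (X - X[t]) ** 2                  # squared feature differences
        d2 = diff2 @ (w ** 2)                    # squared weighted distances
        d2[t] = np.inf                           # ignore the held-out exemplar
        nbrs = np.argsort(d2)[:K]
        sigma = r * np.sqrt(d2[nbrs]).mean()
        k = np.exp(-d2[nbrs] / (2.0 * sigma ** 2))     # kernel values k_j
        ksum = k.sum()                           # precomputed: independent of j
        p = (k[:, None] * Y[nbrs]).sum(axis=0) / ksum  # probabilities p_i
        dk_dw = -k[:, None] * w * diff2[nbrs] / sigma ** 2   # dk_j/dw_m, (K, D)
        dp_dk = (Y[nbrs] - p) / ksum                          # dp_i/dk_j, (K, C)
        grad += 2.0 * ((p - Y[t]) @ dp_dk.T) @ dk_dw          # chain rule, (D,)
    return grad
```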
In order to optimize the parameter r determining the width of the Gaussian kernel, the same expressions can be used, substituting the derivative of the kernel with respect to r, $\partial k_j / \partial r$, for its derivative with respect to the weights.
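Under the same illustrative assumptions, and taking the kernel width to be proportional to r, the corresponding derivative with respect to r can be sketched as:

```python
import numpy as np

def cv_error_gradient_r(X, Y, w, r, K=10):
    """Derivative of the leave-one-out error with respect to the kernel width
    factor r.  Same illustrative assumptions as above; with sigma = r times the
    mean neighbour distance, dk_j/dr = k_j * d_j^2 / (r * sigma^2)."""
    N, D = X.shape
    dE_dr = 0.0
    for t in range(N):
        d2 = ((X - X[t]) ** 2) @ (w ** 2)        # squared weighted distances
        d2[t] = np.inf                           # ignore the held-out exemplar
        nbrs = np.argsort(d2)[:K]
        sigma = r * np.sqrt(d2[nbrs]).mean()
        k = np.exp(-d2[nbrs] / (2.0 * sigma ** 2))
        ksum = k.sum()
        p = (k[:, None] * Y[nbrs]).sum(axis=0) / ksum
        dk_dr = k * d2[nbrs] / (r * sigma ** 2)  # dk_j/dr under sigma = r * s
        dp_dk = (Y[nbrs] - p) / ksum             # dp_i/dk_j
        dE_dr += 2.0 * (p - Y[t]) @ (dp_dk.T @ dk_dr)
    return dE_dr
```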
As noted above, the error function has discontinuities whenever the set of nearest neighbours changes due to changing weights. This can lead to inappropriate selection of the conjugate gradient search direction, so the search direction should be restarted (i.e., switched to pure gradient descent for the current iteration) whenever the error or the gradient increases. In fact, simple gradient descent with line search seems to work well for this problem, with only a small increase in the number of iterations required as compared to conjugate gradient.
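The restart rule can be expressed as a small helper routine; the argument names are illustrative, and the routine simply drops back to steepest descent whenever the conditions above are met.

```python
import numpy as np

def search_direction(g, prev_g, prev_dir, e, prev_e):
    """Polak-Ribiere search direction with the restart rule described above:
    fall back to steepest descent on the first iteration or whenever the error
    or the gradient norm has grown (sketch; names are illustrative)."""
    restart = (
        prev_g is None
        or prev_e is None
        or e > prev_e
        or np.linalg.norm(g) > np.linalg.norm(prev_g)
    )
    if restart:
        return -g
    beta = max(0.0, g @ (g - prev_g) / (prev_g @ prev_g))
    return -g + beta * prev_dir
```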
One final improvement to the optimization process is to add a stabilizing term to the error measure E that can be used to prevent large weight changes when there is only a small amount of training data. This is less important than for most other neural network methods because of the smaller number of parameters, but it can still be useful for preventing overfitting to small samples of noisy training data. The following stabilizing term S is added to the cross-validation error E:
\[
S = c \sum_{m} \left( \ln w_m - \ln \tilde{w}_m \right)^{2}
\]
which has a derivative of
\[
\frac{\partial S}{\partial w_m} = \frac{2c \left( \ln w_m - \ln \tilde{w}_m \right)}{w_m}.
\]
This tends to keep the value of each weight $w_m$ as close as possible to the initial weight value $\tilde{w}_m$ assigned by the user prior to optimization. We have used a stabilization constant c chosen so that a one log-unit change in a weight carries a penalty equivalent to a complete misclassification of a single item of training data. This has virtually no effect when there is a large amount of training data (as in the NETtalk problem below) but will prevent large weight changes based on a statistically invalid sample for small data sets.
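A sketch of adding the stabilizing term and its gradient, using the quadratic-in-log form given above and leaving the constant c as a parameter to be scaled against the misclassification penalty:

```python
import numpy as np

def stabilized_error_and_grad(E, dE_dw, w, w_init, c=1.0):
    """Add the stabilizing term S = c * sum_m (ln w_m - ln w_init_m)^2 and its
    gradient to the cross-validation error.  Sketch only: the quadratic-in-log
    form matches the reconstruction above, and the constant c is left as a
    parameter rather than set to any particular value."""
    log_ratio = np.log(w) - np.log(w_init)
    S = c * np.sum(log_ratio ** 2)
    dS_dw = 2.0 * c * log_ratio / w
    return E + S, dE_dw + dS_dw
```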