7.6 Case-Based Reasoning
The previous methods tried to find a compact representation of the data that can be used for future prediction. In case-based reasoning, the training examples - the cases - are stored and accessed to solve a new problem. To get a prediction for a new example, the cases that are similar to, or close to, the new example are used to predict the value of its target features. This is at one extreme of the learning problem where, unlike decision trees and neural networks, relatively little work must be done offline and virtually all of the work is performed at query time.
Case-based reasoning can be used for classification and regression. It is also applicable when the cases are complicated, such as in legal cases, where the cases are complex legal rulings, and in planning, where the cases are previous solutions to complex problems.
If the cases are simple, one algorithm that works well is to use the k-nearest neighbors for some given number k. Given a new example, the k training examples that have the input features closest to that example are used to predict the target value for the new example. The prediction can be the mode, average, or some interpolation between the predictions of these k training examples, perhaps weighting closer examples more than distant examples.
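As a minimal sketch (not the book's code), k-nearest-neighbor prediction could be written in Python as follows, assuming training examples are given as (inputs, target) pairs and a distance function such as the weighted Euclidean distance defined below; the function and parameter names are illustrative:

```python
from collections import Counter

def knn_predict(query, training_examples, k, distance, regression=False):
    """Predict the target for query from its k nearest training examples.

    training_examples: a list of (inputs, target) pairs
    distance: a function of two input vectors returning a number
    """
    # Keep the k training examples whose inputs are closest to the query.
    neighbors = sorted(training_examples, key=lambda ex: distance(query, ex[0]))[:k]
    targets = [target for _, target in neighbors]
    if regression:
        # Regression: average the neighbors' target values.
        return sum(targets) / len(targets)
    # Classification: return the mode of the neighbors' target values.
    return Counter(targets).most_common(1)[0][0]
```

A distance-weighted variant would weight each neighbor's vote or value by, for example, the inverse of its distance to the query.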
For this method to work, a distance metric is required that measures the closeness of two examples. First define a metric for the domain of each feature, in which the values of the features are converted to a numerical scale that can be used to compare values. Suppose val(e,Xi) is a numerical representation of the value of feature Xi for the example e. Then (val(e1,Xi) - val(e2,Xi)) is the difference between examples e1 and e2 on the dimension defined by feature Xi. The Euclidean distance, the square root of the sum of the squares of the dimension differences, can be used as the distance between two examples. One important issue is the relative scales of different dimensions; increasing the scale of one dimension increases the importance of that feature. Let wi be a non-negative real-valued parameter that specifies the weight of feature Xi. The distance between examples e1 and e2 is then
d(e1,e2) = sqrt(∑i wi × (val(e1,Xi) - val(e2,Xi))²).
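In Python, for example, this weighted Euclidean distance can be sketched as follows, assuming each example is given as an equal-length sequence of numeric feature values val(e,Xi) and weights holds the wi:

```python
import math

def weighted_euclidean_distance(e1, e2, weights):
    """Weighted Euclidean distance between two examples.

    e1, e2: equal-length sequences of numeric feature values
    weights: the non-negative weight w_i for each feature X_i
    """
    return math.sqrt(sum(w * (x1 - x2) ** 2
                         for w, x1, x2 in zip(weights, e1, e2)))
```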
The feature weights can be provided as input. It is also possible to learn these weights. The learning agent can try to find a parameter setting that minimizes the error in predicting the value of each element of the training set, based on every other instance in the training set. This is called the leave-one-out cross-validation error measure.
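As a sketch, the leave-one-out error for a particular weight setting can be computed as follows, reusing the knn_predict and weighted_euclidean_distance sketches above; a weight-learning procedure would then search for the weights that minimize this value (the 0/1 loss here is an assumption appropriate for classification):

```python
def leave_one_out_error(training_examples, weights, k):
    """Count the misclassifications when each example is predicted from all the others."""
    def distance(a, b):
        return weighted_euclidean_distance(a, b, weights)
    errors = 0
    for i, (inputs, target) in enumerate(training_examples):
        # Predict this example's target using every other training example.
        others = training_examples[:i] + training_examples[i + 1:]
        prediction = knn_predict(inputs, others, k, distance)
        errors += prediction != target   # 0/1 loss; use squared error for regression
    return errors
```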
Suppose a learning agent wants to classify example e20, for which the author is unknown, the thread is a follow-up, the length is short, and it is read at home. First the learner tries to find similar cases. There is an exact match in example e11, so it may want to predict that the user does the same action as for example e11 and thus skips the article. It could also include other close examples.
Consider classifying example e19, where the author is unknown, the thread is new, the length is long, and it was read at work. In this case there are no exact matches. Consider the close matches. Examples e2, e8, and e18 agree on the features Author, Thread, and Where Read. Examples e10 and e12 agree on Thread, Length, and Where Read. Example e3 agrees on Author, Length, and Where Read. Examples e2, e8, and e18 predict Reads, but the other examples predict Skips. So what should be predicted? The decision-tree algorithm says that Length is the best predictor, and so e2, e8, and e18 should be ignored. For the sigmoid linear learning algorithm, the parameter values in Example 7.10 similarly predict that the reader skips the article. A case-based reasoning algorithm to predict whether the user will or will not read this article must determine the relative importance of the dimensions.
One of the problems in case-based reasoning is accessing the relevant cases. A kd-tree is a way to index the training examples so that training examples close to a given example can be found quickly. Like a decision tree, a kd-tree splits on input features, but at the leaves are subsets of the training examples. In building a kd-tree from a set of examples, the learner tries to find an input feature that partitions the examples into sets of approximately equal size and then builds kd-trees for the examples in each partition. This splitting stops when all of the examples at a leaf are the same. A new example can be filtered down the tree, as in a decision tree. Any exact matches will be found at the leaf it reaches. However, the examples at the leaves of the kd-tree could be quite distant from the example to be classified; they agree on the values of the features tested down the branch of the tree but could disagree on the values of all other features. The same tree can be used to search for examples that differ in one feature from the ones tested in the tree. See Exercise 7.16.
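The following is a minimal sketch of such a tree for examples with discrete feature values, splitting at each node on the feature whose partition is most balanced; the representation (each example a dict from feature names to values, leaves as lists of examples) is an illustrative assumption, not the book's implementation:

```python
def build_kdtree(examples, features):
    """Recursively partition the examples, splitting on one feature per node."""
    # Stop splitting when there is at most one example, no features remain,
    # or all of the examples at this node are the same.
    if len(examples) <= 1 or not features or all(ex == examples[0] for ex in examples):
        return examples                      # a leaf: a list of examples
    def largest_part(feature):
        values = [ex[feature] for ex in examples]
        return max(values.count(v) for v in set(values))
    # Choose the feature whose split is closest to equal-sized parts.
    feature = min(features, key=largest_part)
    remaining = [f for f in features if f != feature]
    children = {value: build_kdtree([ex for ex in examples if ex[feature] == value],
                                    remaining)
                for value in set(ex[feature] for ex in examples)}
    return (feature, children)               # an internal node

def find_leaf(tree, example):
    """Filter a new example down the tree, as in a decision tree."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children.get(example[feature], [])   # unseen value: empty leaf
    return tree
```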
Case-based reasoning is also applicable when the cases are more complicated, for example, when they are legal cases or previous solutions to planning problems. In this scenario, the cases can be carefully chosen and edited to be useful. Case-based reasoning can be seen as a cycle of the following four tasks.
- Retrieve: Given a new case, retrieve similar cases from the case base.
- Reuse: Adapt the retrieved cases to fit the new case.
- Revise: Evaluate the solution and revise it based on how well it works.
- Retain: Decide whether to retain this new case in the case base.
The revision can involve other reasoning techniques, such as using the proposed solution as a starting point to search for a solution, or a human could do the adaptation in an interactive system. Retaining can then save the new case together with the solution found.
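As a rough skeleton of this cycle (the retrieval, adaptation, evaluation, and retention tests are placeholders for whatever domain-specific reasoning or human interaction is used):

```python
def cbr_cycle(problem, case_base, retrieve, reuse, revise, worth_retaining):
    """One pass through the retrieve-reuse-revise-retain cycle."""
    cases = retrieve(problem, case_base)           # Retrieve: find similar cases
    solution = reuse(problem, cases)               # Reuse: adapt them to the new case
    solution = revise(problem, solution)           # Revise: evaluate and repair
    if worth_retaining(problem, solution, case_base):
        case_base.append((problem, solution))      # Retain: save the new case
    return solution
```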