Online Slides
November 21, 2002
These are slides from Computational Intelligence: A Logical Approach, Oxford University Press, 1998. Copyright © David Poole, Alan Mackworth, Randy Goebel and Oxford University Press, 1998-2002.
Learning
is the ability to improve one's behavior based on experience.
- The range of behaviors is expanded: the agent
can do more.
- The accuracy on tasks is improved: the agent can do
things better.
- The speed is improved: the agent can do things faster.
The following components are part of any learning problem:
- Task: the behavior or task that is being improved.
For example: classification, acting in an environment.
- Data: the experiences that are used to improve
performance on the task.
- Measure of improvement: how the improvement is measured.
For example: increased accuracy in prediction, new skills that were not
present initially, improved speed.
- The richer the representation, the more useful it is for
subsequent problem solving.
- The richer the representation, the more difficult it is to
learn.
- Supervised classification: given a set of pre-classified training examples,
classify a new instance.
- Unsupervised learning: find natural classes for examples.
- Reinforcement learning: determine what to do based on rewards and punishments.
- Analytic learning: reason faster using experience.
- Inductive logic programming: build richer models in terms of logic programs.
| Example | Action | Author | Thread | Length | Where |
| e1 | skips | known | new | long | home |
| e2 | reads | unknown | new | short | work |
| e3 | skips | unknown | old | long | work |
| e4 | skips | known | old | long | home |
| e5 | reads | known | new | short | home |
| e6 | skips | known | old | long | work |
We want to classify new examples on property Action based on
the examples' Author, Thread, Length, and Where.
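To make the later sketches concrete, the same training data can be written as a small Python structure (an illustrative encoding, not from the book; attribute and value names mirror the table):

examples = [
    {"action": "skips", "author": "known",   "thread": "new", "length": "long",  "where": "home"},  # e1
    {"action": "reads", "author": "unknown", "thread": "new", "length": "short", "where": "work"},  # e2
    {"action": "skips", "author": "unknown", "thread": "old", "length": "long",  "where": "work"},  # e3
    {"action": "skips", "author": "known",   "thread": "old", "length": "long",  "where": "home"},  # e4
    {"action": "reads", "author": "known",   "thread": "new", "length": "short", "where": "home"},  # e5
    {"action": "skips", "author": "known",   "thread": "old", "length": "long",  "where": "work"},  # e6
]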
Learning tasks can be characterized by the feedback
given to the learner.
- Supervised learning: what has to be learned is
specified for each example.
- Unsupervised learning: no classifications are given;
the learner has to discover
categories and regularities in the data.
- Reinforcement learning: feedback occurs
after a sequence of actions.
- The measure of success is not how well the agent performs on the
training examples, but how well the agent performs for new examples.
- Consider two agents:
- P claims
the negative examples seen are the only negative
examples. Every other instance is positive.
- N claims
the positive examples seen are the only positive
examples. Every other instance is negative.
- Both agents correctly classify every training example, but disagree on
every other example.
- The tendency to prefer one hypothesis over
another is called a bias.
- Saying a hypothesis is better than N's or P's hypothesis isn't
something that's obtained from the data.
- To have any inductive process make predictions on
unseen data, you need a bias.
- What constitutes a good bias is an empirical
question about which biases work best in practice.
- Given a representation and a bias, the problem of learning can be
reduced to one of search.
- Learning is search through the space of
possible representations for the representation or
representations that best fit the data, given the bias.
- These search spaces are typically prohibitively
large for systematic search, so hill climbing is used.
- A learning algorithm is made of a
search space, an evaluation function, and
a search method.
- Data isn't perfect:
- some of the attributes are assigned the wrong
value
- the attributes given are inadequate to
predict the classification
- there are examples with missing
attributes
- overfitting
occurs when a distinction appears in
the data, but doesn't appear in the unseen examples.
This occurs because of random correlations in the training
set.
- Find the best representation given the data.
- Delineate the class of consistent representations given the
data.
- Find a probability distribution over the representations given the
data.
- Representation is a decision tree.
- Bias is towards simple decision trees.
- Search through the space of decision trees, from simple decision
trees to more complex ones.
A decision tree is a tree where:
- The nonleaf nodes are labeled with
attributes.
- The arcs out of a node labeled with attribute
A are labeled with each of the possible values of the attribute
A.
- The leaves of the tree are labeled with classifications.
prop(Obj,user_action,skips) <-
    prop(Obj,length,long).
prop(Obj,user_action,reads) <-
    prop(Obj,length,short) & prop(Obj,thread,new).
prop(Obj,user_action,reads) <-
    prop(Obj,length,short) & prop(Obj,thread,old) &
    prop(Obj,author,known).
prop(Obj,user_action,skips) <-
    prop(Obj,length,short) & prop(Obj,thread,old) &
    prop(Obj,author,unknown).
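Read procedurally, this tree is just nested tests; a minimal Python rendering of the same classifier (function and key names are illustrative):

def user_action(example):
    # Root test: long articles are always skipped.
    if example["length"] == "long":
        return "skips"
    # Short articles: test the thread next.
    if example["thread"] == "new":
        return "reads"
    # Short articles on old threads: test the author.
    return "reads" if example["author"] == "known" else "skips"

Applied to the table above, user_action agrees with Action on all six examples.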
- Given some data, which decision tree should be generated?
A decision tree can represent any discrete function of the inputs.
- You need a bias. For example, prefer the smallest tree. Least depth?
Fewest nodes? Which trees are the best predictors of unseen data?
- How should you go about building a decision tree?
The space of decision
trees is too big for systematic search for the smallest decision tree.
- The input is a target attribute (the
Goal), a set of examples, and a set of attributes.
- Stop if all
examples have the same classification.
- Otherwise, choose an attribute to split on; for each value of this
attribute, build a subtree for those examples with that attribute value.
dtlearn(Goal, Examples, Attributes, DT)
given Examples and Attributes construct decision tree DT for Goal.
dtlearn(Goal, Exs, Atts, Val) <-
    all_examples_agree(Goal, Exs, Val).
dtlearn(Goal, Exs, Atts, if(Cond,YT,NT)) <-
    examples_disagree(Goal, Exs) &
    select_split(Goal, Exs, Atts, Cond, Rem_Atts) &
    split(Exs, Cond, Yes, No) &
    dtlearn(Goal, Yes, Rem_Atts, YT) &
    dtlearn(Goal, No, Rem_Atts, NT).
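A runnable Python sketch of the same recursive scheme, reusing the examples list from earlier. It splits on attribute = value conditions; select_split is deliberately naive here (take the first remaining condition), where a real learner would score candidates, for instance with information gain:

from collections import Counter

def dtlearn(goal, exs, conds):
    """Return a tree: a class value (leaf) or ((attr, val), yes_tree, no_tree)."""
    vals = {e[goal] for e in exs}
    if len(vals) == 1:                       # all examples agree: leaf
        return vals.pop()
    if not conds:                            # attributes inadequate: majority leaf
        return Counter(e[goal] for e in exs).most_common(1)[0][0]
    (attr, val), rest = conds[0], conds[1:]  # naive select_split: first condition
    yes = [e for e in exs if e[attr] == val]
    no = [e for e in exs if e[attr] != val]
    if not yes or not no:                    # split separates nothing: skip it
        return dtlearn(goal, exs, rest)
    return ((attr, val), dtlearn(goal, yes, rest), dtlearn(goal, no, rest))

# Splits are tried in the listed order.
tree = dtlearn("action", examples,
               [("length", "long"), ("thread", "new"), ("author", "known")])
# On the six examples, tree == (("length", "long"), "skips", "reads")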
- Attributes can have more than two values. This complicates the
trees.
- This assumes attributes are adequate to represent the
concept. You can return probabilities at leaves.
- Which attribute to split on isn't defined.
You want to choose the attribute that results in the smallest tree.
Information theory is often used as an evaluation function in hill climbing
(a sketch follows below).
- Overfitting is a problem.
- This algorithm gets into trouble overfitting the data. This occurs
with noise and correlations in the training
set that are not reflected in the data as a whole.
- To handle overfitting:
- You can restrict the splitting, so that you split only when the
split is useful.
- You can allow unrestricted splitting and prune the resulting
tree where it makes unwarranted distinctions.
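As promised above, here is what the information-theoretic evaluation function can look like: choose the split that most reduces the entropy of the goal attribute. A Python sketch under the same examples encoding as before (function names are illustrative):

import math
from collections import Counter

def entropy(goal, exs):
    # Entropy, in bits, of the goal attribute over the examples.
    counts = Counter(e[goal] for e in exs)
    n = len(exs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(goal, exs, attr, val):
    # Entropy reduction from splitting the examples on the test attr == val.
    yes = [e for e in exs if e[attr] == val]
    no = [e for e in exs if e[attr] != val]
    if not yes or not no:
        return 0.0  # a split that separates nothing gains nothing
    remainder = (len(yes) * entropy(goal, yes) +
                 len(no) * entropy(goal, no)) / len(exs)
    return entropy(goal, exs) - remainder

On the six examples, info_gain("action", examples, "length", "long") is about 0.92 bits, the full class entropy, because the split separates the reads from the skips perfectly.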
- These representations are inspired by neurons and their
connections in the
brain.
- Artificial neurons, or units, have inputs, and an output.
The output can be connected to the inputs of other units.
- The
output of a unit is a parameterized non-linear function of its inputs.
- Learning occurs by adjusting parameters to fit data.
- Neural networks can represent an approximation to any function.
- As part of neuroscience, in order to understand real neural
systems,
researchers are simulating the neural systems of simple
animals such as worms.
- It seems reasonable to try to build the
functionality of the brain via the mechanism of the brain (suitably
abstracted).
- The brain inspires new ways to think about computation.
- Neural networks provide a
different measure of simplicity as a learning bias.
- Feed-forward neural networks are the most common models.
- These are directed acyclic graphs: arcs run from input units toward
output units, with no cycles.
A unit with k inputs is like the parameterized logic program:
prop(Obj,output,V) <-
    prop(Obj,in_1,I_1) &
    prop(Obj,in_2,I_2) &
    ··· &
    prop(Obj,in_k,I_k) &
    V is f(w_0 + w_1×I_1 + w_2×I_2 + ··· + w_k×I_k).
- I_j are real-valued inputs.
- w_j are adjustable real parameters.
- f is an activation function.
A typical activation function is the sigmoid function:
f(x) = 1/(1 + e^(-x))        f'(x) = f(x)(1 - f(x))
- The values of the attributes are real numbers.
- Thirteen parameters w_0, ..., w_12 are real numbers.
- The attributes h_1 and h_2 correspond to the values of hidden units.
- There are 13 real numbers to be learned. The
hypothesis space is thus a 13-dimensional real space.
- Each point in this 13-dimensional space corresponds to a particular
logic program that predicts a value for reads given
known, new, short, and home.
predicted_prop(Obj,reads,V) <-
    prop(Obj,h_1,I_1) & prop(Obj,h_2,I_2) &
    V is f(w_0 + w_1×I_1 + w_2×I_2).
prop(Obj,h_1,V) <-
    prop(Obj,known,I_1) & prop(Obj,new,I_2) &
    prop(Obj,short,I_3) & prop(Obj,home,I_4) &
    V is f(w_3 + w_4×I_1 + w_5×I_2 + w_6×I_3 + w_7×I_4).
prop(Obj,h_2,V) <-
    prop(Obj,known,I_1) & prop(Obj,new,I_2) &
    prop(Obj,short,I_3) & prop(Obj,home,I_4) &
    V is f(w_8 + w_9×I_1 + w_10×I_2 + w_11×I_3 + w_12×I_4).
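A minimal Python rendering of this network's forward computation (the weight indexing w_0..w_12 follows the slides; the 0/1 encoding of the four input attributes is an assumption for illustration):

import math

def f(x):
    # Sigmoid activation function: f(x) = 1/(1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))

def predict_reads(w, known, new, short, home):
    # Forward pass of the two-hidden-unit network; w holds w_0..w_12.
    h1 = f(w[3] + w[4] * known + w[5] * new + w[6] * short + w[7] * home)
    h2 = f(w[8] + w[9] * known + w[10] * new + w[11] * short + w[12] * home)
    return f(w[0] + w[1] * h1 + w[2] * h2)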
- For particular values of the parameters w = w_0, ..., w_m and a set E of examples, the sum-of-squares error is
Error_E(w) = Σ_{e ∈ E} (p_e^w - o_e)^2
where
- p_e^w is the output predicted by the neural network
with parameter values w for example e
- o_e is the observed output for example e.
- The aim of neural network learning is, given a set of examples, to
find parameter settings that minimize the error.
- Back-propagation learning is
gradient descent search through the parameter space to minimize
the sum-of-squares error.
- Inputs:
- A network, including all units and their connections
- Stopping Criteria
- Learning Rate (constant of proportionality of gradient descent
search)
- Initial values for the parameters
- A set of classified training data
- Output: Updated values for the parameters
- Repeat
- evaluate the network on each example given the current parameter
settings
- determine
the derivative of the error for each parameter
- change each parameter in proportion
to its derivative
- until the stopping criterion is met
- At each iteration, update each parameter w_i:
w_i ← w_i - η · ∂Error/∂w_i
where η (eta) is the learning rate.
- You can compute the partial derivative:
- numerically: for small Δ, (Error(w_i + Δ) - Error(w_i)) / Δ
- analytically: using f'(x) = f(x)(1 - f(x)) and the chain rule.
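Putting the pieces together, here is a sketch of gradient descent using the numerical derivative (the analytic route via the chain rule is what back-propagation proper uses; this version trades speed for brevity). It assumes predict_reads from the sketch above and training data as (inputs, observed) pairs, e.g. ((1, 1, 0, 1), 0) for e1 with reads encoded as 0/1:

def sse(w, data):
    # Sum-of-squares error of the network over (inputs, observed) pairs.
    return sum((predict_reads(w, *ins) - obs) ** 2 for ins, obs in data)

def gradient_descent(w, data, eta=0.5, delta=1e-6, steps=100):
    # Estimate each partial derivative numerically, then step against it.
    w = list(w)
    for _ in range(steps):
        base = sse(w, data)
        grads = []
        for i in range(len(w)):
            w[i] += delta
            grads.append((sse(w, data) - base) / delta)  # (Error(w_i+Δ) - Error(w_i)) / Δ
            w[i] -= delta
        for i in range(len(w)):
            w[i] -= eta * grads[i]  # w_i <- w_i - η·∂Error/∂w_i
    return w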
| Parameter | Value (iter. 0) | Deriv (iter. 0) | Value (iter. 1) | Value (iter. 80) |
| w_0 | 0.2 | 0.768 | -0.18 | -2.98 |
| w_1 | 0.12 | 0.373 | -0.07 | 6.88 |
| w_2 | 0.112 | 0.425 | -0.10 | -2.10 |
| w_3 | 0.22 | 0.0262 | 0.21 | -5.25 |
| w_4 | 0.23 | 0.0179 | 0.22 | 1.98 |
| Error | 4.6121 | | 4.6128 | 0.178 |
Output is f(w_0 + w_1×I_1 + w_2×I_2).
| w_0 | w_1 | w_2 | Logic |
| -15 | 10 | 10 | and |
| -5 | 10 | 10 | or |
| 5 | -10 | -10 | nor |
A single unit can't represent xor.
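These weight settings are easy to check numerically, reusing f from the earlier sketch:

for w0, w1, w2, gate in [(-15, 10, 10, "and"),
                         (-5, 10, 10, "or"),
                         (5, -10, -10, "nor")]:
    for i1 in (0, 1):
        for i2 in (0, 1):
            print(f"{gate}({i1},{i2}) = {f(w0 + w1 * i1 + w2 * i2):.3f}")

For "and", the output is near 1 only when both inputs are 1; no single choice of (w_0, w_1, w_2) produces xor, since the output is monotone in each input.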
- It's easy for a neural network to represent "at least two of
I_1, ..., I_k are true":
| w_0 | w_1 | ··· | w_k |
| -15 | 10 | ··· | 10 |
This concept forms a large decision tree.
- Consider representing a conditional: "If
c then a else b":
- Simple in a decision tree.
- Needs a complicated neural network to represent (c & a)
or (¬c & b).
- Meaning is attached to
the input and output units.
- There is no a priori meaning
associated with the hidden units.
- What the hidden units
actually represent is something that's learned.