CS 536A -- Lecture 19 on Mar 22, 2001
Lecture by Anne Condon. Notes by Ho Sen Yung
Last Lecture
Class Prediction
- Metric for correlating genes with classes and with each other.
- Selecting informative genes
- Developing a class predictor: weighted voting (Golub et al.)
- Evaluation of class predictors --
Leave-one-out cross validation (LOOCV)
Let D={ (xi,li) } be a training set,
where xi's are samples and
li's are class labels.
A class predictor takes as input D and a query (sample) x
and returns a label l for x.
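The predictor interface above, combined with leave-one-out cross validation, can be sketched as follows. This is a minimal illustration, not code from the lecture: the names loocv_accuracy and majority_predictor are hypothetical, and the majority predictor is only a placeholder for a real algorithm.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Dataset = List[Tuple[Sample, str]]          # D = {(xi, li)}
Predictor = Callable[[Dataset, Sample], str]


def loocv_accuracy(data: Dataset, predict: Predictor) -> float:
    """Hold out each sample in turn, train on the rest, and
    return the fraction of held-out samples labelled correctly."""
    correct = 0
    for i, (x, label) in enumerate(data):
        training = data[:i] + data[i + 1:]   # D without the i-th sample
        if predict(training, x) == label:
            correct += 1
    return correct / len(data)


def majority_predictor(training: Dataset, x: Sample) -> str:
    """Placeholder predictor (an assumption): ignores x and returns
    the most common label in the training set."""
    labels = [label for _, label in training]
    return max(sorted(set(labels)), key=labels.count)
```

Any function with the same (D, x) -> label signature can be passed to loocv_accuracy in place of the placeholder.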
Algorithms
- Nearest Neighbour:
Labels x with the same label as its nearest neighbour in D.
- Clustering: Partitions samples into groups so as to
maximize similarities within a group and
maximize distances between groups. In the CAST clustering
algorithm, the trade-off between the two is controlled by a threshold parameter t.
Large Margin Classifiers:
- Support Vector Machines (hyperplane and beyond):
In the hyperplane case, it partitions the samples into 2 classes by using
a hyperplane that maximizes the margin, i.e. the sum of two distances
to the hyperplane: the distance from the closest gene expression vector
on one side, plus the distance from the closest gene expression vector
on the other side.
- Boosters:
Constructs a sequence of very simple classifiers
f1,f2,..., where
fi attempts to improve fi-1.
The final classifier is a weighted vote of fi's.
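As an illustration of the simplest algorithm in the list above, a nearest-neighbour predictor might look like the sketch below. The notes do not say which distance is used, so Euclidean distance is an assumption here (a correlation-based metric would fit the same interface), and the function name is hypothetical.

```python
import math


def nearest_neighbour_label(data, x):
    """Return the label of the training sample closest to x.

    data is a list of (sample, label) pairs; Euclidean distance
    is an assumption, not something fixed by the notes."""
    _, label = min(data, key=lambda pair: math.dist(pair[0], x))
    return label
```

For example, with data = [((0, 0), 'ALL'), ((10, 10), 'AML')], the query (1, 2) would be labelled 'ALL'.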
Clustering algorithm: CAST
While there are unclustered elements do
- Pick one unclustered element
- Add it to a new cluster C
- Repeat ADD and REMOVE until no change occurs:
  - ADD: Add an unclustered element v with maximum similarity to C
    if sim(v,C) > t |C|
    (where sim(v,C) is the sum of correlations with samples in C)
  - REMOVE: Remove an element u with minimum similarity from C
    if sim(u,C) < t |C|
- Add C to the set of clusters
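The loop above can be sketched in Python as follows. This is one reading of the pseudocode, not a reference implementation: corr is assumed to be a symmetric matrix of pairwise correlations, the seed element is chosen deterministically, a removed element is assumed to return to the unclustered pool, and the max_rounds cap and the rule that a cluster never drops below one element are safeguards added here.

```python
def sim(v, cluster, corr):
    """Sum of correlations between element v and the members of cluster."""
    return sum(corr[v][u] for u in cluster)


def cast(corr, t, max_rounds=1000):
    """Cluster elements 0..n-1 given a correlation matrix and threshold t."""
    unclustered = set(range(len(corr)))
    clusters = []
    while unclustered:
        seed = min(unclustered)              # "pick one": smallest index (assumption)
        unclustered.remove(seed)
        cluster = {seed}
        for _ in range(max_rounds):          # cap added to guarantee termination
            changed = False
            # ADD: closest unclustered element, if similar enough
            if unclustered:
                v = max(sorted(unclustered), key=lambda e: sim(e, cluster, corr))
                if sim(v, cluster, corr) > t * len(cluster):
                    cluster.add(v)
                    unclustered.remove(v)
                    changed = True
            # REMOVE: least similar member, if not similar enough
            if len(cluster) > 1:             # safeguard: never empty the cluster
                u = min(sorted(cluster), key=lambda e: sim(e, cluster, corr))
                if sim(u, cluster, corr) < t * len(cluster):
                    cluster.remove(u)
                    unclustered.add(u)
                    changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
    return clusters
```

On a toy matrix with two tightly correlated pairs (0.9 within a pair, 0.1 across pairs) and t = 0.5, this sketch returns the two pairs as separate clusters.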
Compatibility of a clustering is the number of pairs
that have the same label and are assigned to the same cluster plus the
number of pairs that have different labels and are assigned to different
clusters.
CAST does a binary search for a good t, i.e. one whose clustering has high compatibility.
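The compatibility score defined above can be computed directly by counting pairs. The sketch below assumes labels and cluster assignments are given as two parallel lists; the function name is hypothetical.

```python
from itertools import combinations


def compatibility(labels, cluster_ids):
    """Count pairs that agree: same label and same cluster,
    or different labels and different clusters."""
    score = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_label == same_cluster:
            score += 1
    return score
```

With 4 samples there are 6 pairs, so a clustering that matches the labels exactly scores 6.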
Predictor: CAST is run on D (with labels removed) and on x.
The label for x is the majority label in its cluster.
Large Margin Classifier: Boosters
A simple classifier is described by a gene g, a threshold t, and
a direction (< , >)
- A classifier outputs label (shown for one direction; the other direction swaps the two labels)
- 'ALL' if expression level of g in sample x is > t
- 'AML' if expression level of g in sample x is < t
Quality of a simple classifier, relative to a probability distribution on
training samples, is the weighted sum of correct predictions, where
weights are probabilities.
- f1 is an optimal classifier for the initial, equal
weights on training samples
- Reweight: give higher weights to samples incorrectly classified by
fi
- Use new weights to find an optimal classifier as fi+1
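The boosting steps above can be sketched as follows. The notes only say that misclassified samples get higher weights and that the final classifier is a weighted vote; the specific AdaBoost-style update and vote weights (alpha) used here are an assumption, as are all function names. Candidate thresholds are taken midway between consecutive observed expression values, which is also a choice made for this sketch.

```python
import math


def stump_predict(stump, x):
    """stump = (gene index g, threshold t, label if above t, label otherwise)."""
    g, t, high_label, low_label = stump
    return high_label if x[g] > t else low_label


def quality(stump, data, weights):
    """Weighted sum of correct predictions (weights sum to 1)."""
    return sum(w for (x, label), w in zip(data, weights)
               if stump_predict(stump, x) == label)


def best_stump(data, weights, labels=("ALL", "AML")):
    """Exhaustively search genes, thresholds and both directions."""
    candidates = []
    for g in range(len(data[0][0])):
        values = sorted({x[g] for x, _ in data})
        for t in ((a + b) / 2 for a, b in zip(values, values[1:])):
            for high, low in (labels, labels[::-1]):
                candidates.append((g, t, high, low))
    return max(candidates, key=lambda s: quality(s, data, weights))


def boost(data, rounds):
    n = len(data)
    weights = [1 / n] * n                      # initial, equal weights
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(data, weights)      # f_i for the current weights
        err = 1 - quality(stump, data, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # AdaBoost-style vote weight (assumption)
        ensemble.append((alpha, stump))
        # Raise weights of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(alpha if stump_predict(stump, x) != label else -alpha)
                   for (x, label), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(ensemble, x, labels=("ALL", "AML")):
    """Final classifier: weighted vote of the simple classifiers."""
    votes = {label: 0.0 for label in labels}
    for alpha, stump in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return max(votes, key=votes.get)
```

The exhaustive stump search is quadratic in the number of samples per gene, which is acceptable for the small sample counts typical of expression data sets but is not tuned for speed.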
Evaluation
Algorithms | % correct (Colon) | % correct (Ovarian)
---|---|---
Clustering | 88.7 | 42.9
Nearest Neighbour | 80.6 | 71.4
Support Vector Machine (linear hyperplane) | 77.4 | 67.9
Boosters (100 iterations) | 72.6 | 89.3
Class Discovery
Find classes into which unlabelled samples fit. This is clustering.
Golub et al.: Self-organising maps (SOM,1997)
- User specifies how many clusters (2 clusters, based on all 6817 genes)
- Iterates to find a (optimal?) set of centroids around
which the data cluster
One cluster contained 25 samples, 24 of which were ALL.
The other cluster contained mostly AML samples.
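The iterative centroid search of a self-organising map can be sketched as below. This is a generic one-dimensional SOM, not the exact procedure or parameters of Golub et al.: the learning-rate schedule, the Gaussian neighbourhood, the random initialisation from the samples, and all names are assumptions made for illustration.

```python
import math
import random


def nearest_node(centroids, x):
    """Index of the centroid closest to sample x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(centroids[k], x))


def train_som(samples, n_nodes=2, epochs=50, seed=0):
    """Fit n_nodes centroids arranged on a line to the samples."""
    rng = random.Random(seed)
    centroids = [list(s) for s in rng.sample(samples, n_nodes)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs          # decays towards 0, never reaches it
        rate = 0.5 * frac                  # learning rate
        sigma = frac                       # neighbourhood width on the node grid
        for x in rng.sample(samples, len(samples)):   # shuffled pass over the data
            winner = nearest_node(centroids, x)
            for k, c in enumerate(centroids):
                # Nodes near the winner on the grid move too, but less.
                h = math.exp(-((k - winner) ** 2) / (2 * sigma ** 2))
                for d in range(len(c)):
                    c[d] += rate * h * (x[d] - c[d])
    return centroids
```

After training, each sample is assigned to the cluster of its nearest centroid via nearest_node. Nothing in this procedure guarantees that the resulting centroids are optimal.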
Evaluation
- Class prediction was used to evaluate on 34 new samples:
30 accurate predictions, 3 errors, 3 unknown.
- Cross-validation
Reading Assignment: Clustering algorithms survey paper by Shamir and Sharan
SAGE (Serial Analysis of Gene Expression) 1995
Key ideas:
- A short sequence tag (10-14 bases) contains sufficient information to
uniquely identify a transcript (mRNA, cDNA, etc.)
- Sequence tags can be linked together to form long molecules that can be
cloned and sequenced
- Quantitation of the number of times a particular tag is observed provides
the expression level of the corresponding transcript
Two sources of error:
- Tags may not always uniquely identify transcripts
- One transcript may have more than one tag
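The counting idea and the two error sources above can be illustrated with a toy sketch. It is heavily simplified and rests on assumptions: the sequenced molecule is treated as a plain concatenation of fixed-length tags (real protocols delimit tags differently), and tag_to_transcripts is a hypothetical lookup table from tags to transcript names.

```python
from collections import Counter


def split_concatemer(concatemer, tag_length=10):
    """Cut a sequenced molecule into consecutive fixed-length tags
    (simplifying assumption: no spacers between tags)."""
    return [concatemer[i:i + tag_length]
            for i in range(0, len(concatemer) - tag_length + 1, tag_length)]


def expression_levels(tags, tag_to_transcripts):
    """Turn tag counts into per-transcript expression levels.

    Counts from several tags of the same transcript are summed.
    Tags that map to no transcript or to more than one are returned
    separately, since they cannot be attributed uniquely."""
    levels = Counter()
    unresolved = Counter()
    for tag, count in Counter(tags).items():
        transcripts = tag_to_transcripts.get(tag, [])
        if len(transcripts) == 1:
            levels[transcripts[0]] += count
        else:
            unresolved[tag] += count
    return dict(levels), dict(unresolved)
```

In a toy table where two different tags both map to geneA and a third tag maps to both geneB and geneC, the first two are summed into geneA's level while the third is reported as unresolved.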