CPSC 536A
Notes for March 13, 2001


Class 17: Gene Finding (2)


GENE FINDING


Hidden Markov Models (HMMs)

HMMs are graphical probabilistic models used in

    - modelling time series
    - speech recognition
    - optical character recognition
    - ion channel recordings

    In biology (bioinformatics), HMMs can be used for modeling:

    - coding/non-coding regions in DNA
    - protein binding sites in DNA
    - protein superfamilies

    Since the mid-1990s, HMMs combined with machine learning techniques have been used for modeling, aligning, and analysing DNA regions and protein families.

    HMMs are closely related to other formal models:

    - neural networks
    - stochastic grammars
    - Bayesian networks

    An HMM is given by:

    - a finite set of states (e.g., intron, exon)
    - a discrete alphabet of symbols (e.g., A, T, C, G)
    - a transition probability matrix (the "transition die")
    - an emission probability matrix (the "emission die")

    EXAMPLE

    A: Alphabet A={A,C,G,T}
    S: Set of States S={Sbegin, S1, S2, ..., Send}
    T=(tij): probability of transition from Si to Sj
    E=(eix): probability of generating letter x in state Si

    Fundamental questions:

    1. given a model H and an observation sequence O, what is the probability P(O|H) of generating O using H? (likelihood)
    2. given a model H and an observation sequence O, what is the most probable state path of H generating O? (decoding)
    3. given a model H and an observation sequence O, how can we update H, based on O, so that it explains O better? (learning)

    Naive computation of likelihood:

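    - for a single state path p, P(O,p | model) is the product of the transition and emission probabilities along p
    - P(O | model) = sum{all paths p} P(O,p | model)
    -> the number of paths grows exponentially with the length of O, so summing over all paths leads to exponential complexity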
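
    The naive computation can be sketched as follows. This is a minimal illustration, not code from the lecture: the two-state model and all probabilities are invented, and an initial distribution START stands in for Sbegin (there is no explicit end state).

```python
from itertools import product

# Hypothetical toy model; every number here is made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def joint_probability(obs, path):
    """P(O, path | model): product of start, transition and emission terms."""
    prob = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, sym in zip(path, path[1:], obs[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][sym]
    return prob


def naive_likelihood(obs):
    """P(O | model), summing over all len(STATES) ** len(obs) state paths."""
    return sum(joint_probability(obs, path)
               for path in product(STATES, repeat=len(obs)))


if __name__ == "__main__":
    print(naive_likelihood("GCGCATAT"))
```

    With 2 states and a sequence of length 8 this already enumerates 256 paths; the count doubles with every additional symbol.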

    Forward algorithm (for determining likelihood)

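    - key insight: 'reconstruct' the observation sequence step by step; we only need to compute/store, for each state s and position i, the probability of being in state s after emitting the prefix o_1 ... o_i
    - complexity: O(L * N^2) time for an observation sequence of length L and N states (L * N table entries, each computed from N predecessors)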
    (see [BML, Section 7.3] for details)
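
    A minimal sketch of the forward recursion, under the same assumptions as any toy example: the two-state model and its probabilities are invented, and an initial distribution START replaces Sbegin. Only one column of the table is kept at a time.

```python
# Hypothetical toy model; every number here is made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def forward_likelihood(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return P(O | model) in O(L * N^2) time."""
    # f[s] = P(o_1 .. o_i, state at position i is s)
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for sym in obs[1:]:
        f = {s: emit[s][sym] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


if __name__ == "__main__":
    print(forward_likelihood("GCGCATAT"))
```

    For long sequences the plain products underflow; practical implementations rescale each column or work with log probabilities.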

    Viterbi algorithm (for decoding)

    - key idea: 'reconstruct' the observation sequence step by step; for each state s, keep only the most probable state sequence leading to s that is consistent with o_1 ... o_i
    - DP algorithm very similar to the forward algorithm, but uses max instead of sum in each step (and stores back pointers to recover the path)
    - complexity: O(L * N^2) time for an observation sequence of length L and N states
    (see [BML, Section 7.3] for details)
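
    The decoding step can be sketched in the same way. Again the model and its numbers are invented for illustration, START replaces Sbegin, and log probabilities are used to avoid underflow (a choice of this sketch, not something the notes prescribe).

```python
import math

# Hypothetical toy model; every number here is made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (most probable state path, its log joint probability)."""
    # v[s] = log probability of the best path for the prefix that ends in s
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[i][s] = best predecessor of s at step i + 1
    for sym in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] + math.log(trans[r][s]))
            ptr[s] = best
            nxt[s] = v[best] + math.log(trans[best][s]) + math.log(emit[s][sym])
        back.append(ptr)
        v = nxt
    # Trace the stored pointers back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    path, logp = viterbi("GCGCATAT")
    print(path, logp)
```

    The only differences from the forward recursion are the max in place of the sum and the back pointers.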

    Learning Problem

    - EM (Expectation Maximization; for HMMs, the Baum-Welch algorithm)
    - Viterbi Learning
    (see also [BML, Chapter 7.3])
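
    Viterbi learning alternates between decoding the training sequence with the current model and re-estimating the parameters from counts along the decoded path. The sketch below shows only the re-estimation half and takes the decoded path as input; the function name, the pseudocount, and the example data are assumptions made for illustration.

```python
from collections import Counter


def reestimate(obs, path, states, alphabet, pseudocount=1.0):
    """Re-estimate transition and emission probabilities from one labeled path.

    Counts are smoothed with a pseudocount so that unseen events keep a
    non-zero probability (an assumption of this sketch).
    """
    trans_counts = Counter(zip(path, path[1:]))   # (from_state, to_state)
    emit_counts = Counter(zip(path, obs))         # (state, symbol)
    trans, emit = {}, {}
    for s in states:
        t_total = sum(trans_counts[(s, r)] + pseudocount for r in states)
        trans[s] = {r: (trans_counts[(s, r)] + pseudocount) / t_total
                    for r in states}
        e_total = sum(emit_counts[(s, x)] + pseudocount for x in alphabet)
        emit[s] = {x: (emit_counts[(s, x)] + pseudocount) / e_total
                   for x in alphabet}
    return trans, emit


if __name__ == "__main__":
    decoded = ["exon", "exon", "intron", "intron"]
    trans, emit = reestimate("ACGT", decoded, ("exon", "intron"), "ACGT")
    print(trans)
    print(emit)
```

    In a full training loop this step would be repeated, re-decoding with the updated parameters, until the decoded path stops changing.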


    Evaluation of Gene Finders

    Using HMMs for Gene Finding

    Individual Signal Detectors

    Probabilistic integration of various signals:
    - promoter regions
    - translation start & stop context sequences
    - reading frame periodicity
    - polyadenylation signals
    - intron splicing signals
    - compositional contrast between introns / exons
    - differences in nucleosome positioning signals
    - sequence determinants of topological domains (scaffold attachment regions, SARs)

    Two State-of-the-Art Gene Finders

    Genscan (Burge & Karlin, 1997):

    - based on probabilistic model of gene structure in human genomic sequence
    - emphasis on features recognised by the general transcriptional, splicing, and translational machinery, e.g., TATA box and cap site in eukaryotic promoters (rather than signals specific to particular genes)
    - does not use similarity search
    - overall model similar to generalised HMM (explicit state duration HMM)
    - uses explicitly double stranded genomic sequence model
    -> potential genes on both strands are analysed simultaneously
    - covers cases where input sequence contains no gene, partial gene, complete gene, multiple genes
    - uses weight matrix models (WMMs) and maximal dependency decomposition (MDD) to model functional signals
    - cannot handle overlapping transcription units
    - does not address alternative splicing

    signal models used by Genscan:
    - WMMs (weight matrix models) for transcriptional and translational signals (translation initiation, polyadenylation signals, TATA box, etc.); probabilities estimated from annotated GenBank data
    - maximal dependency decomposition (MDD) for splice signals (WMMs and weight array models, WAMs, are inadequate here)
      (MDD: a probabilistic composition of conditional WMMs)
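
    To make the WMM idea concrete: a weight matrix model treats the positions of a signal as independent and stores one nucleotide distribution per position. The sketch below is not Genscan's implementation; the training sites, the uniform background, the pseudocount, and the function names are all assumptions for illustration.

```python
import math


def build_wmm(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate one nucleotide distribution per position from aligned sites."""
    wmm = []
    for pos in range(len(sites[0])):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        wmm.append({a: c / total for a, c in counts.items()})
    return wmm


def log_odds(window, wmm, background=0.25):
    """Log2-odds score of a window under the WMM versus a uniform background."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(wmm, window))


def best_hit(seq, wmm):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(wmm)
    return max((log_odds(seq[i:i + width], wmm), i)
               for i in range(len(seq) - width + 1))


if __name__ == "__main__":
    # Made-up TATA-like training sites, not real annotated data.
    wmm = build_wmm(["TATAAA", "TATAAA", "TATATA", "TATAAG"])
    print(best_hit("GCGCTATAAAGCGC", wmm))
```

    Because each position is scored independently, a WMM cannot capture dependencies between positions, which is the limitation MDD addresses for splice signals.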

    exon models and non-coding state models used by Genscan:
    - probabilistic models based on conditional hexamer frequency
    - consistent reading frame is maintained throughout a gene

    HMMgene (Krogh, 1997):

    - different approach from Genscan:
    rather than modelling individual functional elements and combining them into one big model, the combined model is estimated directly from labeled sequence data
    - based on class HMMs (CHMMs: HMMs whose states are labeled and emit a symbol plus a label)
    - uses a clever machine learning algorithm that estimates the CHMM from sequence data such that the probability of the correct labeling is maximised
    important features:
    - emission probabilities of states can depend on the n previous symbols (higher-order emissions)
    - allows states to share emission/transition probabilities (tying)

    Evaluating Gene Finding Programs

    How do we know how good a gene finder is?

    - define performance measures for evaluation
    - test on standardised test sets of sequence data

    general performance measures:

    true positive TP: correctly predicted feature (e.g., exon/intron boundary)
    false positive FP: incorrectly predicted feature
    false negative FN: missed feature
    true negative TN: correctly predicted absence of a feature

    note:
    T = TP+FN, true number of features present
    P = TP+FP, number of features predicted

    sensitivity: SN = TP/T, correct predictions per feature
    specificity: SP = TP/P, correct predictions per prediction (this is the gene-finding convention; elsewhere this quantity is called precision)
    base-level: SN, SP for annotation of individual bases as coding or non-coding
    exon-level: SN, SP for complete exons (both splice sites must be correct)

    want: high specificity and sensitivity
    combined measures:
    - (SN+SP)/2
    - approximate correlation AC: ranges from -1 to 1, higher is better
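
    The base-level measures can be computed as in the following sketch. The annotation strings are invented ('C' = coding, 'N' = non-coding), and AC uses the definition from Burset and Guigo (1996), which these notes do not spell out: AC = 2 * (ACP - 0.5), where ACP averages TP/(TP+FN), TP/(TP+FP), TN/(TN+FP), and TN/(TN+FN). The sketch assumes all four denominators are non-zero.

```python
def confusion_counts(true_labels, pred_labels, positive="C"):
    """Count TP, FP, FN, TN over two equally long per-base annotations."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == positive and p == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def measures(tp, fp, fn, tn):
    """Return SN, SP (gene-finding sense), their average, and AC."""
    sn = tp / (tp + fn)          # TP / T
    sp = tp / (tp + fp)          # TP / P
    acp = 0.25 * (sn + sp + tn / (tn + fp) + tn / (tn + fn))
    return {"SN": sn, "SP": sp, "avg": (sn + sp) / 2, "AC": 2 * (acp - 0.5)}


if __name__ == "__main__":
    truth = "NNCCCCNNNN"       # made-up annotation
    predicted = "NCCCCNNNNN"   # made-up prediction, shifted by one base
    print(measures(*confusion_counts(truth, predicted)))
```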

    Results for comparing several gene finders (Rogic et al., 2000; Burset and Guigo, 1996):


    - HMMgene and Genscan typically better than the five other programs tested
    - quality of prediction varies with
      - exon length
      - exon type (initial, internal, terminal, single)
      - signal type (acceptor, donor, start, stop)
    - similar prediction quality for human and murine genes

    correlation between programs is not perfect, e.g., Genscan sometimes misses exons that HMMgene finds and vice versa
    -> combination of programs can yield improved prediction accuracy!

    Combining Gene Finding Programs

    various methods, here: Exon Union-Intersection (Rogic et al., 2000)

    Observations:

    - if either HMMgene or Genscan predicts an exon with a high score, the prediction is usually correct
    - if only one program predicts an exon, and with a low score, the prediction tends to be incorrect; if both programs predict it with a low score, it tends to be correct
    - for most of the false positives, only one program predicts the exon, and the probability score is low

    idea:

    - accept a prediction only if at least one program gives it a high score, or if _both_ programs predict it with a low score

    EUI algorithm:


    - consider all Genscan and HMMgene exons with probability score >= threshold p'
    - of these, predict all exons that are predicted by at least one program
    - consider all Genscan and HMMgene exons with probability score < threshold p'
    - of these, predict only exons that are predicted by both programs
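
    The four steps above can be sketched as set operations. This is an illustration under stated assumptions rather than the published implementation: exons are represented as (start, end) tuples, two predictions count as the same exon only if both coordinates match exactly, and the scores and threshold value are made up.

```python
def exon_union_intersection(genscan, hmmgene, threshold):
    """Combine two exon predictions with the EUI rule.

    genscan, hmmgene: dicts mapping (start, end) -> probability score.
    threshold: the score cut-off p' separating high from low scores.
    """
    high_g = {e for e, p in genscan.items() if p >= threshold}
    high_h = {e for e, p in hmmgene.items() if p >= threshold}
    low_g = set(genscan) - high_g
    low_h = set(hmmgene) - high_h
    # Union of the high-scoring exons, intersection of the low-scoring ones.
    return sorted((high_g | high_h) | (low_g & low_h))


if __name__ == "__main__":
    # Hypothetical predictions; coordinates and scores are invented.
    genscan = {(100, 200): 0.95, (300, 400): 0.40, (500, 600): 0.30}
    hmmgene = {(300, 400): 0.50, (700, 800): 0.20, (900, 1000): 0.90}
    print(exon_union_intersection(genscan, hmmgene, threshold=0.75))
```

    In this example the exons (500, 600) and (700, 800) are dropped: each is predicted by only one program and with a low score.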

    Results:


    false positives significantly reduced
    -> significantly increased specificity, sensitivity almost unaffected.

    Conclusions:

    - gene prediction is an increasingly relevant problem
    - it is hard to do in a fully automated way (in reality, lab work is required to check predictions)
    - complex probabilistic models integrating biological knowledge (signal detection) and computer science techniques (machine learning, algorithms) provide the basis for modern gene finders
    - empirical methods for evaluating algorithms can be used to improve prediction accuracy by combining gene finding programs
    - significant further progress needed to achieve fully automated gene finding with acceptable accuracy