(a) Briefly explain how the additive scores in a position weight matrix (such as the one in Problem 2a) can be obtained from probabilities of observing certain bases at a given position within a signal. (In your own words, max. 100 words, be as concise and accurate as possible.) [5 marks]
(b) Briefly explain how a WMM can be seen as a special case of an HMM, using the WMM from Problem 2a as an illustrative example. (In your own words, max. 100 words, be as concise and accurate as possible.) [10 marks]
(a) Consider the position weight matrix for a transcription factor binding site in yeast:
Pos. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 11 | 5 | 15 | 0 | 0 | 16 | 0 | 2 | 9 | 3 | 11 | 8 | 23 | 0 | 0 | 9 | 10 | 8 |
C | 1 | 6 | 0 | 0 | 23 | 1 | 12 | 3 | 5 | 4 | 5 | 3 | 0 | 23 | 0 | 0 | 3 | 5 |
G | 5 | 4 | 8 | 1 | 0 | 4 | 0 | 4 | 3 | 3 | 3 | 6 | 0 | 0 | 23 | 6 | 5 | 5 |
T | 6 | 8 | 0 | 22 | 0 | 2 | 11 | 14 | 6 | 13 | 4 | 6 | 0 | 0 | 0 | 8 | 5 | 5 |
Write a program that computes WMM scores for all windows (of the same length as the given weight matrix). Run this program on the following genomic yeast sequence (also available from http://www.cs.ubc.ca/~safari/cpsc445/hw4.in:
TAAGAAACCGGGACTTATATATTTATAAATATAAATCTAACTTAATTAATAATTTAAATAATATACTTTATATTTTATAAATAAAAATAATTAT AACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTATTTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAA AAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAACTAAATAAAATATATTATTAATAATATTATATAATATAATA ATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAATTTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAAT AAAAATATGTTATATATATATTAATAAATAAAAATAATTATAACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTAT TTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAAAAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAA CTAAATAAAATATATTATTAATAATATTATATAATATAATAATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAAT TTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAATAAAAATATGTTATATATATATTAATAATAAAAGGTAGTGAGGATTAAATAAA TTATATAATAATTATAACTCTTAATTATAAAATAAATATATATATATAT
The length of this sequence is 806 and, hence, there are 788 windows (starting at positions 1, 2, etc.). Write the window start positions (starting the position count at 1 for the first position in the sequence) and the scores in the output file, 4.out, such that each line contains a start position followed by a score value, separated by a single space. Plot the scores (y axis) over window start positions (x axis) using any software that you like (GNUPlot, MS EXCEL, ..) and mark up (by hand or electronically) the regions that intuitively represent good hits. [30 marks]
Important notes:
|
(b) Given a number of sequences for a signal, such as the transcription factor binding site above, explain a general method for computing a position weight matrix. Illustrate your method with pseudo-code. Notice that DNAs in the sequence might have different lengths. [10 marks]
Consider the following HMM, which models an intronic sequence with GC-rich regions. The model consists of 8 states:
The emission probabilities for the non-silent states are as follows:
Base | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
A | 0 | 0 | 0.35 | 0.60 | 0.50 | 0.75 | 0 | 1 |
C | 0 | 0 | 0.37 | 0.13 | 0.03 | 0.08 | 1 | 0 |
G | 1 | 0 | 0.18 | 0.14 | 0.44 | 0.10 | 0 | 0 |
T | 0 | 1 | 0.10 | 0.13 | 0.03 | 0.07 | 0 | 0 |
Please answer the following questions and show your work (for some parts, this may be easiest to do by writing short programs that compute the answers and print out the steps):
(a) What is the probability of observing GTAATACA along the state path p = B-S1-S2-S3-S5-S4-S6-S7-S8-E? [5 mark ]
(b) What is the probability of seeing GTCGCGCA, given the current HMM? (Show the steps of your calculation.) [15 marks]
(c) Use the backward algorithm to determine the most likely state in the HMM shown above for every position in the observation sequence GTCGCGCA? (Show the steps of your calculation.) [15 marks]
(d) Assume an expert biologist has given you the following profile matrices for the 5' and 3' ends of introns:
5' | +1 | +2 | +3 | +4 | +5 |
A | 0 | 0 | 0.35 | 0.08 | 0.15 |
C | 0 | 0 | 0.37 | 0.04 | 0.19 |
G | 1 | 0 | 0.18 | 0.81 | 0.20 |
T | 0 | 1 | 0.10 | 0.07 | 0.46 |
3' | -5 | -4 | -3 | -2 | -1 |
A | 0.75 | 0.26 | 0.06 | 0 | 1 |
C | 0.08 | 0.30 | 0.05 | 1 | 0 |
G | 0.10 | 0.31 | 0.84 | 0 | 0 |
T | 0.07 | 0.13 | 0.05 | 0 | 0 |
Extend the given HMM model based on this information. [10 marks]
Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).