Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty d=8) on the protein sequences DEWDHE and NDWEHK, using the BLOSUM50 matrix:
N | D | W | E | H | K | ||
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | |
D | 0 | 2 | 8 | 0 | 2 | 0 | 0 |
E | 0 | 0 | 4 | 5 | 6 | 2 | 1 |
W | ? | 0 | 0 | 19 | 11 | 3 | 0 |
D | 0 | 2 | 8 | ? | 21 | 13 | 5 |
H | 0 | 1 | 1 | 5 | 13 | 31 | 23 |
E | 0 | 0 | 3 | 0 | ? | ? | 32 |
(a) Complete the table, by replacing the ?'s with the appropriate numbers. [4 marks]
(b) From the table, give the optimal local alignment for these two sequences and its score. [6 marks]
(b) Align the following two alignments (i.e., profiles) of protein
sequences. Use the BLOSUM50 matrix and linear gap penalties with d=-8. [5 marks]
Hint: Make sure you understand the description of profile alignment in
Durbin et al. (page 146-147)
Profile 1:
RWCH
R - CY
Profile 2:
RIWY
RVW -
(c) Briefly explain the role of guide trees in progressive multiple
sequence alignment algorithms. What do the leaf and internal nodes of a guide
tree represent? [5 marks]
Hint: Review the section on progressive
alignment methods in Durbin et al.
Important notes:
|
Given two DNA sequences, we are interested in finding the best pair-wise
sequence alignment between them. The scoring function uses 2 for any match, -1
for any mismatch, and -2 for any match against a gap.
More specifically,
we are interested in the best local and global alignment, its corresponding
alignment score, the possibility of existing multiple optimal alignments, and,
if the answer is positive, the set of all optimal alignments.
The two DNA
sequences are given to you in file 2.in in two lines, each line corresponds to
one of the two DNA sequences. Each DNA sequence has length between 5 and
50.
Note that each part of the problem must be written in a separate
output file (this simplifies partly automated marking).
(a) (output: 2.o1) - 10 marks
In this file, write the optimal score
as an integer value.
(b) (output: 2.o2) - 10 marks
For this part, you should write the
dynamic programming matrix.
Assumem the first DNA has length m and the second
one has length n. Your output, 2.o2, should contain m+1 lines each containing
n+1 integer values separated by a single space character. The value j in the
line i corresponds the value in column j and row i of the dynamic programming
matrix.
(c) (output: 2.o3) - 10 marks
For this part, we are interested in
one optimal alignment. Write the best alignment of the two input sequences on
two lines. For any gap write a single character '-'. Notice that there might be
multiple optimal alignments and you are required to only report one.
(d) (output: 2.o4) - 10 marks
Are there multiple optimal
alignments? Write either 'YES' or 'NO' (all in capital letters) into the output
file.
(e) (output: 2.o5) - bonus question, no marks
In this part report
all possible optimal alignments. In the first line write the number of optimal
alignments and then put every alignment in two lines separated by an empty
line.
(f) (hand in this part with your written assignment) - 10 marks
Which parts of the program did you need to change in order to perform local
rather than global pairwise sequence alignment?
Examples for input and output file formats:
program 893960.cpp
2.in:
ATCAGAGTC
TTCAGTC
2.o1:
7
2.o2:
0 -2 -4
-6 -8 -10 -12 -14
-2 -1 -3 -5 -4 -6 -8 -10
-4 0 1 -1 -3 -5 -4 -6
-6
-2 -1 3 1 -1 -3 -2
-8 -4 -3 1 5 3 1 -1
-10 -6 -5 -1 3 7 5 3
-12 -8
-7 -3 1 5 6 4
-14 -10 -9 -5 -1 3 4 5
-16 -12 -8 -7 -3 1 5 3
-18 -14
-10 -6 -5 -1 3 7
2.o3:
ATCAGAGTC
TTC--AGTC
2.o4:
YES
2.o5:
3
ATCAGAGTC
TTC--AGTC
ATCAGAGTC
TTCA--GTC
ATCAGAGTC
TTCAG--TC
-------------------------------------------------------------
program 893960-local.cpp
2.in:
ATCAGAGTC
TTCAGTC
2.o1:
8
2.o2:
0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0
0 2 2 0 0 1 2 0
0 0 1 4
2 0 0 4
0 0 0 2 6 4 2 2
0 0 0 0 4 8 6 4
0 0 0 0 2 6 7 5
0 0 0 0
0 4 5 6
0 2 2 0 0 2 6 4
0 0 1 4 2 0 4 8
2.o3:
AGTC
AGTC
2.o4:
YES
2.o5:
4
AGTC
AGTC
TCAGAGTC
TCA--GTC
TCAGAGTC
TCAG--TC
TCAG
TCAG
>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein)
(Membrane glycoprotein)
MSSKTTPAPVYIWTADEAIKFLKEWNFSLGIILLFITIILQFGYTSRSMFVYVIKMIILWLMWPLTIILT
IFNCVYALNNVYLGLSIVFTIVAIIMWIVYFVNSIRLFIRTGSFWSFNPETNNLMCIDMKGTMYVRPIIE
DYHTLTVTIIRGHLYIQGIKLGTGYSWADLPAYMTVAKVTHLCTYKRGFLDRISDTSGFAVYVKSKVGNY
RLPSTQKGSGMDTALLRNNI
(b) Are the scores for the hits with bit scores between 80 and 200 significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [2 marks]
(c) Does the alignment between Human coronavirus and SARS proteins change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [4 marks]
(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to a 100? [4 marks]
(e) Based on the respective alignments, would you say that the Human coronovirus sequence is very similar to the sequence for the SARS matrix protein? Briefly justify your answer. [4 marks]
(f) Human coronavirus OC43 belongs to the group 2 of mammalian coronaviruses. One of the BLAST hits in the search you have performed is a matrix protein from a group 1 human coronavirus (229E). Which of these two matrix proteins belonging to two different groups of human coronaviruses is more similar to SARS matrix protein? (Hint: you will have to perform another BLAST search using the FASTA sequence of the group1 human coronavirus protein; click on the web link for that protein which will take you to its GenBank record.) [4 marks]