CPSC445 - Assignment 2 (covers Module 2)

released: Thu, 06/02/09; due: Tue, 06/02/21

1 Local Sequence Alignment [10 marks]

Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty d=8) on the protein sequences DEWDHE and NDWEHK, using the BLOSUM50 matrix:

		N	D	W	E	H	K
	0	0	0	0	0	0	0
D	0	2	8	0	2	0	0
E	0	0	4	5	6	2	1
W	?	0	0	19	11	3	0
D	0	2	8	?	21	13	5
H	0	1	1	5	13	31	23
E	0	0	3	0	?	?	32

(a) Complete the table, by replacing the ?'s with the appropriate numbers. [4 marks]

(b) From the table, give the optimal local alignment for these two sequences and its score. [6 marks]

2 Multiple Sequence Alignment [15 marks]

(a) Determine the sum-of-pairs scores for the following multiple sequence alignment of DNA sequences, using the scoring matrix in which a match gets a value of +4, a mismatch gets a value of -1, a (base,gap) pair gets a value of -2, and a (gap,gap) pair gets a value of 0. [5 marks]

GCAA
GT - A
C - - A

(b) Align the following two alignments (i.e., profiles) of protein sequences. Use the BLOSUM50 matrix and linear gap penalties with d=-8. [5 marks]
Hint: Make sure you understand the description of profile alignment in Durbin et al. (page 146-147)

Profile 1:
RWCH
R - CY

Profile 2:
RIWY
RVW -

(c) Briefly explain the role of guide trees in progressive multiple sequence alignment algorithms. What do the leaf and internal nodes of a guide tree represent? [5 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.

3 Programming: Optimal Local and Global Pairwise Sequence Alignment [50 marks]

Important notes:

Your programs should be written either in Java or C++.
When you are done, send an email to safari@cs.ubc.ca with subject 'CPSC445-hw2' and attach your programs sources.
The name of your programs should be [your-student-id].{c,cpp,java} and [your-student-id]-local.{c,cpp,java}, e.g. 80132322.cpp and 80132322-local.cpp. The first one finds a global alignment whereas the second one should find the best local alignment. Feel free to add any prefix to your file name in case needed. We are fine as long as your student id appears in the file name.
You programs should be well documented and you should explain the purpose of every function that you write.
Since the marking process is partly automated, you must strictly follow the format of all output files as specified below.

Given two DNA sequences, we are interested in finding the best pair-wise sequence alignment between them. The scoring function uses 2 for any match, -1 for any mismatch, and -2 for any match against a gap.

More specifically, we are interested in the best local and global alignment, its corresponding alignment score, the possibility of existing multiple optimal alignments, and, if the answer is positive, the set of all optimal alignments.

The two DNA sequences are given to you in file 2.in in two lines, each line corresponds to one of the two DNA sequences. Each DNA sequence has length between 5 and 50.

Note that each part of the problem must be written in a separate output file (this simplifies partly automated marking).

(a) (output: 2.o1) - 10 marks
In this file, write the optimal score as an integer value.

(b) (output: 2.o2) - 10 marks
For this part, you should write the dynamic programming matrix.
Assumem the first DNA has length m and the second one has length n. Your output, 2.o2, should contain m+1 lines each containing n+1 integer values separated by a single space character. The value j in the line i corresponds the value in column j and row i of the dynamic programming matrix.

(c) (output: 2.o3) - 10 marks
For this part, we are interested in one optimal alignment. Write the best alignment of the two input sequences on two lines. For any gap write a single character '-'. Notice that there might be multiple optimal alignments and you are required to only report one.

(d) (output: 2.o4) - 10 marks
Are there multiple optimal alignments? Write either 'YES' or 'NO' (all in capital letters) into the output file.

(e) (output: 2.o5) - bonus question, no marks
In this part report all possible optimal alignments. In the first line write the number of optimal alignments and then put every alignment in two lines separated by an empty line.

(f) (hand in this part with your written assignment) - 10 marks

Which parts of the program did you need to change in order to perform local rather than global pairwise sequence alignment?

Examples for input and output file formats:

program 893960.cpp

2.in:
ATCAGAGTC
TTCAGTC

2.o1:
7

2.o2:
0 -2 -4 -6 -8 -10 -12 -14
-2 -1 -3 -5 -4 -6 -8 -10
-4 0 1 -1 -3 -5 -4 -6
-6 -2 -1 3 1 -1 -3 -2
-8 -4 -3 1 5 3 1 -1
-10 -6 -5 -1 3 7 5 3
-12 -8 -7 -3 1 5 6 4
-14 -10 -9 -5 -1 3 4 5
-16 -12 -8 -7 -3 1 5 3
-18 -14 -10 -6 -5 -1 3 7

2.o3:
ATCAGAGTC
TTC--AGTC

2.o4:
YES

2.o5:
3
ATCAGAGTC
TTC--AGTC

ATCAGAGTC
TTCA--GTC

ATCAGAGTC
TTCAG--TC

-------------------------------------------------------------

program 893960-local.cpp

2.in:
ATCAGAGTC
TTCAGTC

2.o1:
8

2.o2:
0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0
0 2 2 0 0 1 2 0
0 0 1 4 2 0 0 4
0 0 0 2 6 4 2 2
0 0 0 0 4 8 6 4
0 0 0 0 2 6 7 5
0 0 0 0 0 4 5 6
0 2 2 0 0 2 6 4
0 0 1 4 2 0 4 8

2.o3:
AGTC
AGTC

2.o4:
YES

2.o5:
4
AGTC
AGTC

TCAGAGTC
TCA--GTC

TCAGAGTC
TCAG--TC

TCAG
TCAG

4 Similarity Search using BLAST (Hands-on Problem) [30 marks]

Run a Protein-Protein Blast search (BLASTP) at the NCBI web site in order to find proteins similar to the Matrix glycoprotein of Human coronavirus OC43 in the SWISSPROT database; the sequence of this protein in FASTA format is as follows:

>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein) (Membrane glycoprotein)
MSSKTTPAPVYIWTADEAIKFLKEWNFSLGIILLFITIILQFGYTSRSMFVYVIKMIILWLMWPLTIILT
IFNCVYALNNVYLGLSIVFTIVAIIMWIVYFVNSIRLFIRTGSFWSFNPETNNLMCIDMKGTMYVRPIIE
DYHTLTVTIIRGHLYIQGIKLGTGYSWADLPAYMTVAKVTHLCTYKRGFLDRISDTSGFAVYVKSKVGNY
RLPSTQKGSGMDTALLRNNI

Perform the search using the BLOSUM62 scoring matrix, with gap opening cost = 11, gap extension cost = 1, Expectation = 10 and word size = 3.

(a) Answer the following questions:

What is the number of hits reported by BLASTP? [2 marks]
What is the number of hits with a bit score between 80 and 200? [2 marks]
Which types of organisms do the proteins from the previous question belong to? (Hint: Use the taxonomy report link on the BLAST result page) [2 marks]
Some of the these organisms are parasitic. In which hosts are these organisms found? (This can be easily inferred from the taxonomy report information) [2 mark]
What type of disorders do the viruses cause whose proteins got a score between 80 and 200 in this search? [2 marks]
Give the alignment of the Human coronavirus matrix protein sequence with the SARS coronavirus sequence. Specify the number of identical residues and % sequence identity, the number of similar residues and % similarity (in BLAST terminology: % positive), and the number and % of gaps. [2 mark]

(b) Are the scores for the hits with bit scores between 80 and 200 significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [2 marks]

(c) Does the alignment between Human coronavirus and SARS proteins change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [4 marks]

(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to a 100? [4 marks]

(e) Based on the respective alignments, would you say that the Human coronovirus sequence is very similar to the sequence for the SARS matrix protein? Briefly justify your answer. [4 marks]

(f) Human coronavirus OC43 belongs to the group 2 of mammalian coronaviruses. One of the BLAST hits in the search you have performed is a matrix protein from a group 1 human coronavirus (229E). Which of these two matrix proteins belonging to two different groups of human coronaviruses is more similar to SARS matrix protein? (Hint: you will have to perform another BLAST search using the FASTA sequence of the group1 human coronavirus protein; click on the web link for that protein which will take you to its GenBank record.) [4 marks]

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
This assignment should take you no longer than about 4-5 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Feel always free to contact Holger, Baharak or Mohammad if you feel you need further help than can be provided by your fellow students.

		N	D	W	E	H	K
	0	0	0	0	0	0	0
D	0	2	8	0	2	0	0
E	0	0	4	5	6	2	1
W	?	0	0	19	11	3	0
D	0	2	8	?	21	13	5
H	0	1	1	5	13	31	23
E	0	0	3	0	?	?	32

		N	D	W	E	H	K
	0	0	0	0	0	0	0
D	0	2	8	0	2	0	0
E	0	0	4	5	6	2	1
W	?	0	0	19	11	3	0
D	0	2	8	?	21	13	5
H	0	1	1	5	13	31	23
E	0	0	3	0	?	?	32

		N	D	W	E	H	K
	0	0	0	0	0	0	0
D	0	2	8	0	2	0	0
E	0	0	4	5	6	2	1
W	?	0	0	19	11	3	0
D	0	2	8	?	21	13	5
H	0	1	1	5	13	31	23
E	0	0	3	0	?	?	32