CPSC 445 - Assignment 2

released: 2012/03/20, tue, 15:30; due: 2012/03/27, tue, 18:00

1 Multiple Sequence Alignment (Hands-on Problem) [20 marks]

Consider the following partial sequences from E. coli cloning vectors in FASTA format. (Source: http://www.cf.ac.uk/biosi/staff/ehrmann/tools/dnasequences.htm)

>pBR322
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pBR325
aggccatgtttgacagcttatcatcgataagctttaatgcggtagtttat
cacagttaaattgctaacgcagtcaggcaccgtgtatgaaatctaacaat
gcgctcatcgtcatcctcggcaccgtcaccctggatgctgtaggcatagg
cttggttatgccggtactgccgggcctcttgcgggatatcgtccattccg
>pBR327
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pACYC184
GAATTCCGGATGAGCATTCATCAGGCGGGCAAGAATGTGAATAAAGGCCG
GATAAAACTTGTGCTTATTTTTCTTTACGGTCTTTAAAAAGGCCGTAATA
TCCAGCTGAACGGTCTGGTTATAGGTACATTGAGCAACTGACTGAAATGC
CTCAAAATGTTCTTTACGATGCCATTGGGATATATCAACGGTGGTATATC
>pHSG575
TGATGTCCGGCGGTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGA
ACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGCACCTCAAAAAC
ACCATCATACACTAAATCACTAAGTTGGCAGCATCACCCGACGCACTTTG
CGCCGAATAAATACCTGTGACGGAAGATCACTTCGCAGAATAAATAAATC
>pGEX2T
acgttatcgactgcacggtgcaccaatgcttctggcgtcaggcagccatc
ggaagctgtggtatggctgtgcaggtcgtaaatcactgcataattcgtgt
cgctcaaggcgcactcccgttctggataatgttttttgcgccgacatcat
aacggttctggcaaatattctgaaatgagctgttgacaattaatcatcgg

(a) Use ClustalW2 (http://www.ebi.ac.uk/Tools/clustalw2/index.html) to obtain a multiple sequence alignment of these sequences. Report the multiple sequence alignment and the guide tree used for the alignment. [5 marks]

(b) Obtain another multiple sequence alignment for the same sequences using the progressive multiple sequence alignment program MULTI-LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). Report the multiple sequence alignment and the guide tree used for constructing it (the alignment is accessed by clicking a TextBrowser link and then the MFA multiple sequence alignment). [5 marks]

(c) Recalculate the MULTI-LAGAN alignment using the guide tree produced by ClustalW2. The phylogenetic tree can be entered as a string at the bottom of the MULTI-LAGAN form. MULTI-LAGAN only accepts a binary tree, and the result of ClustalW2 might contain a node with more than two children. If this happens, convert the tree into any binary tree. Report the resulting multiple sequence alignment and guide tree. [5 marks]
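
The conversion to a binary tree mentioned in (c) can be sketched as follows. This is only an illustration under assumptions: the tree is held as nested Python tuples of leaf names (a hypothetical representation, not the format of either tool), and the names binarize and to_parenthesized are made up for this example. The exact tree-string syntax that MULTI-LAGAN expects should be checked on its web form.

```python
def binarize(node):
    """Return a binary version of a tree given as nested tuples of leaf names."""
    if isinstance(node, str):            # a leaf: nothing to do
        return node
    children = [binarize(child) for child in node]
    if len(children) == 1:               # collapse a node with a single child
        return children[0]
    # Repeatedly group the first two children until only two remain.
    while len(children) > 2:
        children = [(children[0], children[1])] + children[2:]
    return (children[0], children[1])


def to_parenthesized(node, sep=","):
    """Render the nested-tuple tree as a parenthesized string (separator is an assumption)."""
    if isinstance(node, str):
        return node
    return "(" + sep.join(to_parenthesized(child, sep) for child in node) + ")"
```

Any grouping of the surplus children yields a valid binary tree; grouping the first two is just one arbitrary choice, which matches the "any binary tree" wording of the problem.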

(d) Comment on the differences between the multiple sequence alignments from (a), (b) and (c). Keep your answer as concise as possible. [5 marks]

2 Sum-of-pairs Scoring of Multiple Sequence Alignments (Programming Problem) [40 marks]

Important notes:

Your programs should be written either in C, Java or C++.
When you are done, send an email to hoos@cs.ubc.ca and joelaf@cs.ubc.ca with subject 'CPSC445-hw2' and attach your program sources.
Name your program file [your-student-id]-sp.{c,cpp,java}, e.g. 80132322-sp.cpp. Feel free to add a prefix to the file name if needed; any name is fine as long as your student ID appears in it. If you would like to attach a readme text file, name it [your-student-id]-readme.txt.
Your programs should be well documented and you should explain the purpose of every function that you write [up to 10 marks will be deducted for code that is not commented/documented].
Your programs should output their results to standard out (stdout).

We are interested in computing the sum-of-pairs score of a given alignment. Use the following scoring function: 4 points for a match, -1 point for a mismatch, -2 points for a base aligned with a gap, i.e. s(-,base) or s(base,-), and 0 points for a gap aligned with a gap, i.e. s(-,-).
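
Written out as a formula, the sum-of-pairs score adds the pairwise score s over every pair of rows in every column. The symbols N (number of sequences), L (number of alignment columns) and a_{i,c} (the character in row i, column c) are notation introduced here for illustration only:

```latex
\[
  \mathrm{SP}(A) \;=\; \sum_{c=1}^{L} \; \sum_{1 \le i < j \le N} s\bigl(a_{i,c},\, a_{j,c}\bigr),
  \qquad
  s(x, y) =
  \begin{cases}
     0 & \text{if } x \text{ and } y \text{ are both gaps} \\
    -2 & \text{if exactly one of } x, y \text{ is a gap} \\
     4 & \text{if } x \text{ and } y \text{ are the same base} \\
    -1 & \text{if } x \text{ and } y \text{ are different bases}
  \end{cases}
\]
```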

(a) (Hand in this part with your written assignment) Compute (by hand) the sum-of-pairs score for the following alignment using the above score.
[5 marks]

A-G
A--
TCG

Write a program that computes the sum-of-pairs score for an alignment. The input for your program will be a file named asst2.in containing an alignment. The alignment will be a set of sequences separated by line breaks. Each alignment will contain anywhere from 3 to 10 sequences, and each sequence will have a length of up to 500 bases.
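
To make the computation concrete, here is a minimal sketch in Python (submissions must still be written in C, Java or C++). The function names pair_score, sum_of_pairs and read_alignment are made up for this illustration, and converting the input to upper case is an assumption about how mixed-case input should be treated, not something the problem specifies.

```python
from itertools import combinations

GAP = "-"


def pair_score(x, y):
    """Score one pair of aligned characters with the scheme given above."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return -2
    return 4 if x == y else -1


def sum_of_pairs(rows):
    """Sum pair_score over every pair of rows in every column of the alignment."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*rows):                 # one tuple of characters per column
        for x, y in combinations(column, 2):  # each unordered pair of rows once
            total += pair_score(x, y)
    return total


def read_alignment(path):
    """Read one aligned sequence per non-empty line (upper-casing is an assumption)."""
    with open(path) as handle:
        return [line.strip().upper() for line in handle if line.strip()]
```

With these helpers, print(sum_of_pairs(read_alignment("asst2.in"))) would write the score to standard out. For the small made-up alignment AC- / A-G / TCG the columns score 2, 0 and 0, so the total is 2.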

(b) Using your program, compute the sum-of-pairs score for the alignment from part (a). [10 marks]

(c) Using your program, compute the sum-of-pairs score for the following alignment:

CTCT--CTCCACGGGC
CCAAA-ATTTACAGAC
CCCTAGGTTCGCAGAC
CCCTAATCCCGCAGGG
[10 marks]

(d) Compute the sum-of-pairs scores for the multiple sequence alignments from problems 1(a), 1(b) and 1(c). Can you make any additional comments about the success of these programs? Note: You will need to modify the output multiple sequence alignments of these programs before using them as input for your program.
[15 marks]
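
One way to do the modification mentioned in the note to (d) is to join the wrapped lines of each record of a FASTA-style (MFA) alignment into a single line per sequence. The sketch below assumes the tool output has already been saved as FASTA-style text; the function name fasta_to_rows is hypothetical, and other output formats would need their own handling.

```python
def fasta_to_rows(text):
    """Turn FASTA-style alignment text into (names, one-line aligned sequences)."""
    names, parts = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            names.append(line[1:].strip()) # header line starts a new record
            parts.append([])
        elif parts:
            # Sequence line: drop embedded spaces and append to the current record.
            parts[-1].append(line.replace(" ", ""))
    return names, ["".join(chunks) for chunks in parts]
```

Writing the returned sequences one per line gives a file in the input format described for problem 2.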

3 Scoring Models [10 marks]

The following questions should be answered after carefully reading sections 8.1 and 8.2 of Durbin et al.

(a) What is the Jukes-Cantor distance model and why is it more appropriate than a simple model that merely counts the number of mismatches? (<= 50 words, in your own words). [3 marks]

(b) Why might the 2-parameter Kimura model be even more appropriate than the Jukes-Cantor model? (<= 50 words, in your own words). [3 marks]

(c) All three of the above models are less than realistic. Give 3 reasons or examples of real-life cases that none of the three models would, or could, capture. [4 marks]
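
For orientation, the distance corrections commonly associated with the two models in (a) and (b) can be sketched as follows. The formulas are the standard closed forms as they are usually stated; the function and parameter names are made up for this illustration, and the textbook remains the authoritative source for notation.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(transitions, transversions):
    """Kimura 2-parameter distance from the fractions of sites showing
    transitions (A<->G, C<->T) and transversions (all other changes)."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("observed differences are too large for the correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)
```

Both corrected distances exceed the raw mismatch fraction, because repeated substitutions at the same site hide part of the change. When transversions are exactly twice as frequent as transitions, the two formulas agree.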

4 Phylogenetic Trees / Distance Based Methods [25 marks]

(a) Show all steps of the UPGMA algorithm as applied to the following five sequences, where the distance between two sequences is defined as the number of base positions in which they differ (for example, the first two sequences differ in 6 positions and therefore have a distance of 6). [10 marks]

GTTAAACATCTCCTC
GTGAAACAACATGAC
GTTAAACATGTGGAC
GCACGGAACTCGCCT
GTCTTACTGGCATGA
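
The distance used in (a) is the Hamming distance between equal-length sequences. A minimal sketch, with the hypothetical helper names hamming and distance_matrix, is:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in seqs] for a in seqs]
```

For the first two sequences above, hamming returns 6, in agreement with the example given in the problem statement.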

(b) Briefly describe the role of "arithmetic averaging" in UPGMA. (<= 50 words, in your own words) [5 marks]
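
The averaging step can be seen in a compact sketch of UPGMA. This is an illustration under assumptions, not a reference implementation: clusters are tracked by integer ids, ties between equally close pairs are broken by whichever pair is found first, and the function name upgma and its return format are made up for this example.

```python
def upgma(dist, labels):
    """Cluster by UPGMA; return (tree as nested tuples, list of merges with heights)."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, size)
    # Distances between current clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    merges = []
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda item: item[1])  # closest pair
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        k, next_id = next_id, next_id + 1
        for l in clusters:
            dil = d.pop((min(i, l), max(i, l)))
            djl = d.pop((min(j, l), max(j, l)))
            # Size-weighted average: equals the mean over all leaf pairs.
            d[(l, k)] = (ni * dil + nj * djl) / (ni + nj)
        del d[(i, j)]
        clusters[k] = ((ti, tj), ni + nj)
        merges.append((ti, tj, dij / 2))  # new node is placed at height d_ij / 2
    tree = next(iter(clusters.values()))[0]
    return tree, merges
```

Weighting by cluster size is what keeps each cluster distance equal to the plain average over all pairs of original sequences, rather than an average of averages.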

(c) Prove that Equation (7.2) from Durbin et al. gives the correct distances d_kl between a merged cluster C_k = C_i + C_j (where '+' denotes set union) and every other cluster C_l according to the general definition of distance between clusters as given in Equation (7.1). [10 marks]
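
For reference, the two equations are usually stated as follows. The notation here is reproduced from memory of the standard UPGMA definitions and should be checked against the textbook before use in the proof:

```latex
% Equation (7.1): distance between clusters as the average over all cross pairs
\[
  d_{ij} \;=\; \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}
\]
% Equation (7.2): update rule after merging C_i and C_j into C_k
\[
  d_{kl} \;=\; \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}
\]
```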

5 CHALLENGE PROBLEMS [no marks, but highly recommended to broaden and deepen your knowledge]

Special Note - In each assignment, there will be a Challenge Problems section. These problems are not for credit. However, they will give you a greater depth of knowledge in the subject matter for this section, and will help you prepare for, and impress in, the Final Oral Examination. If you have any questions regarding these problems, since some are quite open-ended in nature, feel free to discuss them with Holger or Joel.

Reading Questions: In graduate-style seminars, you will often be asked to read academic papers before class and have a set of questions prepared for discussion with your peers during the seminar period. These questions should not simply be limited to questions of the form, "I didn't understand X, how does it work?" but should demonstrate that you have made an effort to re-read and understand the paper to come up with more piercing, critical analysis, or suggestions for future improvements or directions to the work. Working out these questions will prepare you for such seminars and discussions in the future.

(a) Reading Question 1: Read the landmark paper, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice" ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/pdf/nar00046-0131.pdf ), and formulate two (2) reading questions according to the note above. Clarification questions are OK as long as you have made an effort to understand the material by discussing it with your peers. To get you started, try to describe and justify the four steps taken in the paper to improve the sensitivity of the multiple sequence alignment method.

(b) Reading Question 2: Read the paper, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," available here: ( http://nar.oxfordjournals.org/content/32/5/1792.full.pdf+html ). Describe and motivate the decisions the authors took for the selection of their scoring model.

(c) Reading Question 3: The paper, "How well do evolutionary trees describe genetic relationships among populations?," takes a critical view of the descriptive power of constructed trees from a biological perspective ( http://www.nature.com/hdy/journal/v102/n5/pdf/hdy2008136a.pdf ). Describe how the authors evaluate the results generated by tree construction methods, including UPGMA, through comparison with actual, biological, hereditary data. What trends do they find? What conclusions do they draw?

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
This assignment should take you no longer than about 2 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. List the names of all students you worked with.
Always feel free to contact Holger or Joel if you need more help than your fellow students can provide.