Using the Fitch algorithm, compute the total parsimony score under the Hamming distance and determine internal node labels.

How many different trees can be obtained?

As in the previous problem, the structure of the phylogenetic tree is already defined. Compute the total parsimony score using scoring matrix S and determine the internal nodes.

The scoring matrix $\mathbf{S}$ is defined as

$$ \mathbf{S} = \begin{array}{c c} & \begin{array} {@{} c c c @{}} \hphantom{.}\mathrm{A}\hphantom{.} & \hphantom{.}\mathrm{C}\hphantom{.} & \hphantom{.}\mathrm{G}\hphantom{.} & \hphantom{.}\mathrm{T}\hphantom{.} \end{array} \\ \begin{array}{c} \mathrm{A} \\ \mathrm{C} \\ \mathrm{G} \\ \mathrm{T} \end{array}\hspace{-1em} & \left[ \begin{array}{@{} c c c @{}} 0 & 2.5 & 1 & 2.5 \\ 2.5 & 0 & 2.5 & 1 \\ 1 & 2.5 & 0 & 2.5 \\ 2.5 & 1 & 2.5 & 0 \end{array} \right] \\ \mbox{} % Blank line to match column names so as to align the = vertically \end{array} $$

and the structure of the tree is following

How many different trees can be obtained?

What is the relationship between Fitch and Sankoff algorithm?

Assume that the * Jukes-Cantor model * holds, i.e., the probability of substitutions can be expressed by the following matrix

$$ \begin{array}{c c} & \begin{array} {@{} c c c @{}} \hphantom{.}\mathrm{A}\hphantom{.} & \hphantom{.}\mathrm{C}\hphantom{.} & \hphantom{.}\mathrm{G}\hphantom{.} & \hphantom{.}\mathrm{T}\hphantom{.} \end{array} \\ \begin{array}{c} \mathrm{A} \\ \mathrm{C} \\ \mathrm{G} \\ \mathrm{T} \end{array}\hspace{-1em} & \left[ \begin{array}{@{} c c c @{}} 0.7 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.7 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.7 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.7 \end{array} \right] \\ \mbox{} % Blank line to match column names so as to align the = vertically \end{array} $$ and that all bases are equally likely in the tree root. What is the likelihood of the following tree assuming that all edges have unit length?

[ Problem adapted from https://www.youtube.com/watch?v=6H8o0_RQleQ ]

First, start by downloading data that we will use.

bash

wget https://cw.fel.cvut.cz/wiki/_media/courses/bin/tutorials/dnaprimates.txt

- Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: an extensible file format for systematic information.
*Systematic biology*, 46(4), 590-621.

Now, switch to the R environment, and load the required libraries.

R

library("phangorn") library("ape")

The following code loads the file with sequences.

R

primates = read.phyDat("dnaprimates.txt", format="phylip", type="DNA")

To solve the small parsimony problem, we need to build a phylogenetic tree. To do that, the distance matrix needs to be calculated.

R

dm = dist.dna(as.DNAbin(primates))

R

treeUPGMA = upgma(dm) treeNJ = NJ(dm)

R

plot(treeUPGMA, main="UPGMA", cex = 0.8) plot(treeNJ, "unrooted", main="NJ", cex = 0.5)

To calculate the parsimony score, you can use the following commands.

R

parsimony(treeUPGMA, primates) parsimony(treeNJ, primates)

Which tree is better? Why? Is it truly the most parsimonious tree?

Clearly, we can pick a better tree out of the two. However, we need to be sure to find the best one. To do that call

R

optParsUPGMA = optim.parsimony(treeUPGMA, primates) optParsNJ = optim.parsimony(treeNJ, primates)

And again, we can plot the trees by

R

plot(optParsUPGMA, main="UPGMA", cex = 0.8) plot(optParsNJ, "unrooted", main="NJ", cex = 0.5)

Parsimony is not the only way how to evaluate a tree. We can calculate the probability of the tree given the data. The following code calculates the probability of the trees.

R

likeliNJ <- pml(treeNJ, primates) print(likeliNJ) likeliUPGMA <- pml(treeUPGMA, primates) print(likeliUPGMA)

R

optLikeliNJ <- optim.pml(likeliNJ, model = "JC", rearrangement = "stochastic") optLikeliUPGMA <- optim.pml(likeliUPGMA, model = "JC", rearrangement = "stochastic")

`model = “JC”`

tells R to use the Jukes-Cantor model for scoring the tree. There are many other models; they differ in the number of free parameters in the matrix of substitution probabilities. You may replace Jukes-Cantor model `TN93`

, which is named after Tamura-Nei and the year of publication.
Calling

R

logLik(optLikeliNJ) logLik(optLikeliUPGMA)

The last method that shows tree quality which we will mention today is different from the previous. The tree can be calculated if we omit some of the columns in all sequences. In other words, we sample the columns of the multiple-sequence alignment. In the next step, we can compare how often two nodes are in the same position in a tree after sampling and in the reference tree we evaluate. In R, the following single line command does the work for the neighbor-joining tree.

R

bs <- bootstrap.pml(optLikeliNJ, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0)) plotBS(midpoint(optLikeliNJ$tree), bs, p = 50, type="p")

courses/bin/tutorials/tutorial6.txt · Last modified: 2019/03/25 15:01 by rysavpe1