Search
Using the Fitch algorithm, compute the total parsimony score under the Hamming distance and determine internal node labels.
As in the previous problem, the structure of the phylogenetic tree is already defined. Compute the total parsimony score using scoring matrix S and determine the internal nodes.
The scoring matrix $\mathbf{S}$ is defined as
$$ \mathbf{S} = \begin{array}{c c} & \begin{array} {@{} c c c @{}} \hphantom{.}\mathrm{A}\hphantom{.} & \hphantom{.}\mathrm{C}\hphantom{.} & \hphantom{.}\mathrm{G}\hphantom{.} & \hphantom{.}\mathrm{T}\hphantom{.} \end{array} \\ \begin{array}{c} \mathrm{A} \\ \mathrm{C} \\ \mathrm{G} \\ \mathrm{T} \end{array}\hspace{-1em} & \left[ \begin{array}{@{} c c c @{}} 0 & 2.5 & 1 & 2.5 \\ 2.5 & 0 & 2.5 & 1 \\ 1 & 2.5 & 0 & 2.5 \\ 2.5 & 1 & 2.5 & 0 \end{array} \right] \\ \mbox{} % Blank line to match column names so as to align the = vertically \end{array} $$
and the structure of the tree is following
Assume that the Jukes-Cantor model holds, i.e., the probability of substitutions can be expressed by the following matrix
$$ \begin{array}{c c} & \begin{array} {@{} c c c @{}} \hphantom{.}\mathrm{A}\hphantom{.} & \hphantom{.}\mathrm{C}\hphantom{.} & \hphantom{.}\mathrm{G}\hphantom{.} & \hphantom{.}\mathrm{T}\hphantom{.} \end{array} \\ \begin{array}{c} \mathrm{A} \\ \mathrm{C} \\ \mathrm{G} \\ \mathrm{T} \end{array}\hspace{-1em} & \left[ \begin{array}{@{} c c c @{}} 0.7 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.7 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.7 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.7 \end{array} \right] \\ \mbox{} % Blank line to match column names so as to align the = vertically \end{array} $$ and that all bases are equally likely in the tree root. What is the likelihood of the following tree assuming that all edges have unit length?
[ Problem adapted from https://www.youtube.com/watch?v=6H8o0_RQleQ ]
First, start by downloading data that we will use.
bash
wget https://cw.fel.cvut.cz/wiki/_media/courses/bin/tutorials/dnaprimates.txt
Now, switch to the R environment, and load the required libraries.
R
library("phangorn") library("ape")
The following code loads the file with sequences.
primates = read.phyDat("dnaprimates.txt", format="phylip", type="DNA")
To solve the small parsimony problem, we need to build a phylogenetic tree. To do that, the distance matrix needs to be calculated.
dm = dist.dna(as.DNAbin(primates))
treeUPGMA = upgma(dm) treeNJ = NJ(dm)
plot(treeUPGMA, main="UPGMA", cex = 0.8) plot(treeNJ, "unrooted", main="NJ", cex = 0.5)
To calculate the parsimony score, you can use the following commands.
parsimony(treeUPGMA, primates) parsimony(treeNJ, primates)
Clearly, we can pick a better tree out of the two. However, we need to be sure to find the best one. To do that call
optParsUPGMA = optim.parsimony(treeUPGMA, primates) optParsNJ = optim.parsimony(treeNJ, primates)
And again, we can plot the trees by
plot(optParsUPGMA, main="UPGMA", cex = 0.8) plot(optParsNJ, "unrooted", main="NJ", cex = 0.5)
Parsimony is not the only way how to evaluate a tree. We can calculate the probability of the tree given the data. The following code calculates the probability of the trees.
likeliNJ <- pml(treeNJ, primates) print(likeliNJ) likeliUPGMA <- pml(treeUPGMA, primates) print(likeliUPGMA)
optLikeliNJ <- optim.pml(likeliNJ, model = "JC", rearrangement = "stochastic") optLikeliUPGMA <- optim.pml(likeliUPGMA, model = "JC", rearrangement = "stochastic")
model = “JC”
TN93
Calling
logLik(optLikeliNJ) logLik(optLikeliUPGMA)
The last method that shows tree quality which we will mention today is different from the previous. The tree can be calculated if we omit some of the columns in all sequences. In other words, we sample the columns of the multiple-sequence alignment. In the next step, we can compare how often two nodes are in the same position in a tree after sampling and in the reference tree we evaluate. In R, the following single line command does the work for the neighbor-joining tree.
bs <- bootstrap.pml(optLikeliNJ, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0)) plotBS(midpoint(optLikeliNJ$tree), bs, p = 50, type="p")
This tutorial is a combination of the following: