compile.sh
that compiles your codes
genefinding.sh
that calculates the required output
packages.txt
with a list of packages that need to be installed)
In the last homework, we used Markov models to classify a sequence as a $\mathrm{CpG}$-island. We already know that those islands precede genes. In this assignment, we will use hidden Markov models to find genes in the sequence.
We will use the Viterbi algorithm to find parts of a sequence, which are likely a gene. As you already know, each gene starts with a start codon and terminates with a stop codon. The number of nucleotides in between must be divisible by $3$. This fact can be used to create a hidden Markov model that searches for start codon, then accepts several internal codons terminated by a stop codon.
Description of the model is available on https://www.biostat.wisc.edu/~craven/776/hw3.html. A figure with the model is here. For learning, you need E. coli sequence and a training dataset. For testing, you should use this dataset.
You may assume that files sequenceOfEcoliStrainM54.txt
, train.seqs
, and test.seqs
will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.
Your code should output the following two files:
predictions.txt
with your predictions in the same format as train.seqs
and test.seqs
. One sequence (with possible multiple genes) per line.
accuracy.txt
with the quality of your predictions. Include five lines with the following values.
Your program should run in $\mathcal{O}(l \cdot |S|)$ time, where $l$ is the length of a sequence and $S$ is the set of states of the model. Submitting a code that does not compile may result in significant point reduction.
You can use any external library with an implementation of the Viterbi algorithm. If you need anything else, ask me in advance.
All positive credit should go to Mark Craven from University at Wisconsin. This assignment refers to his work https://www.biostat.wisc.edu/~craven/776/hw3.html.