Search
compile.sh
genefinding.sh
packages.txt
In the last homework, we used Markov models to classify a sequence as a $\mathrm{CpG}$-island. We already know that those islands precede genes. In this assignment, we will use hidden Markov models to find genes in the sequence.
We will use the Viterbi algorithm to find parts of a sequence, which are likely a gene. As you already know, each gene starts with a start codon and terminates with a stop codon. The number of nucleotides in between must be divisible by $3$. This fact can be used to create a hidden Markov model that searches for start codon, then accepts several internal codons terminated by a stop codon.
Description of the model is available on https://www.biostat.wisc.edu/~craven/776/hw3.html. A figure with the model is here. For learning, you need E. coli sequence and a training dataset. For testing, you should use this dataset.
You may assume that files sequenceOfEcoliStrainM54.txt, train.seqs, and test.seqs will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.
sequenceOfEcoliStrainM54.txt
train.seqs
test.seqs
Your code should output the following two files:
predictions.txt
accuracy.txt
Your program should run in $\mathcal{O}(l \cdot |S|^2)$ time, where $l$ is the length of a sequence and $S$ is the set of states of the model. Submitting a code that does not compile may result in significant point reduction.
You can use any external library with an implementation of the Viterbi algorithm, standard libraries for I/O, and/or common libraries as numpy. If you need anything else, ask me in advance.
All positive credit should go to Mark Craven from University at Wisconsin. This assignment refers to his work https://www.biostat.wisc.edu/~craven/776/hw3.html.