Table of Contents

Assignment 4: GENE FINDING

Motivation

In the last homework, we used Markov models to classify a sequence as a $\mathrm{CpG}$-island. We already know that those islands precede genes. In this assignment, we will use hidden Markov models to find genes in the sequence.

We will use the Viterbi algorithm to find parts of a sequence, which are likely a gene. As you already know, each gene starts with a start codon and terminates with a stop codon. The number of nucleotides in between must be divisible by $3$. This fact can be used to create a hidden Markov model that searches for start codon, then accepts several internal codons terminated by a stop codon.

Model, training, and testing

Description of the model is available on https://www.biostat.wisc.edu/~craven/776/hw3.html. A figure with the model is here. For learning, you need E. coli sequence and a training dataset. For testing, you should use this dataset.

Input and Output

You may assume that files sequenceOfEcoliStrainM54.txt, train.seqs, and test.seqs will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.

Your code should output the following two files:

Scoring

Your program should run in $\mathcal{O}(l \cdot |S|)$ time, where $l$ is the length of a sequence and $S$ is the set of states of the model. Submitting a code that does not compile may result in significant point reduction.

External libraries

You can use any external library with an implementation of the Viterbi algorithm. If you need anything else, ask me in advance.

References

All positive credit should go to Mark Craven from University at Wisconsin. This assignment refers to his work https://www.biostat.wisc.edu/~craven/776/hw3.html.