Assignment 4: GENE FINDING

15 points
Deadline: Wednesday, May 10th, 2024, 23:59
Late submission penalty: -1 point per day but no more than 12 points
Submit to BRUTE.
Work individually.
Submit
- your source codes in a language of your choice
- bash script compile.sh that compiles your codes
- a wrapper bash script genefinding.sh that calculates the required output
- your source codes must compile either
  - on Ubuntu machines in the lab, or
  - on a freshly installed Ubuntu machine (if this is the case, provide file packages.txt with a list of packages that need to be installed)
The problem is based on homework from a class by Mark Craven, see https://www.biostat.wisc.edu/~craven/776/hw3.html. If you believe that I have introduced a mistake into the project specification, please email petr.rysavy@fel.cvut.cz.
You can assume that the input is correct; the algorithm won't encounter any unknown parameters or misspelled input. Therefore you do not need to check for those.

Motivation

In the last homework, we used Markov models to classify a sequence as a $\mathrm{CpG}$-island. We already know that those islands precede genes. In this assignment, we will use hidden Markov models to find genes in the sequence.

We will use the Viterbi algorithm to find parts of a sequence, which are likely a gene. As you already know, each gene starts with a start codon and terminates with a stop codon. The number of nucleotides in between must be divisible by $3$. This fact can be used to create a hidden Markov model that searches for start codon, then accepts several internal codons terminated by a stop codon.

Model, training, and testing

Description of the model is available on https://www.biostat.wisc.edu/~craven/776/hw3.html. A figure with the model is here. For learning, you need E. coli sequence and a training dataset. For testing, you should use this dataset.

Input and Output

You may assume that files sequenceOfEcoliStrainM54.txt, train.seqs, and test.seqs will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.

Your code should output the following two files:

predictions.txt with your predictions in the same format as train.seqs and test.seqs. One sequence (with possible multiple genes) per line.
accuracy.txt with the quality of your predictions. Include five lines with the following values.
1. The number of predicted genes.
2. The number of predicted genes that have nonempty overlap with at least one of the ground-truth genes.
3. The total length of all genes that your model predicted.
4. The total length of the overlap between predicted genes and ground-truth genes.
5. The Jaccard index of predicted genes and ground-truth genes. Consider each gene as a set of nucleotide indices in the E.coli strain. More on Jaccard index may be found on this Wikipedia page.

Scoring

Your program should run in $\mathcal{O}(l \cdot |S|^2)$ time, where $l$ is the length of a sequence and $S$ is the set of states of the model. Submitting a code that does not compile may result in significant point reduction.

External libraries

You can use any external library with an implementation of the Viterbi algorithm, standard libraries for I/O, and/or common libraries as numpy. If you need anything else, ask me in advance.

References

All positive credit should go to Mark Craven from University at Wisconsin. This assignment refers to his work https://www.biostat.wisc.edu/~craven/776/hw3.html.

Table of Contents