Assignment 3: $\mathrm{CpG}$-ISLAND RECOGNITION

5 points
Deadline: Wednesday, May 3rd, 2023, 23:59
Late submission penalty: -0.25 point per day
Submit to BRUTE.
Work individually.
Submit
- your source codes in a language of your choice
- bash script compile.sh that compiles your codes
- a wrapper bash script cpg.sh that calculates the required output
- your source codes must compile either
  - on Ubuntu machines in the lab, or
  - on a freshly installed Ubuntu machine (if this is the case, provide file packages.txt with a list of packages that need to be installed)
Should you have any questions, email petr.rysavy@fel.cvut.cz.
You can assume that the input is correct; the algorithm won't encounter any unknown parameters or misspelled input. Therefore you do not need to check for those.

If you think that this homework is not for you, you might submit EXON SEARCH instead.

Motivation

$\mathrm{CG}$-content is the percentage of $\mathrm{C}$ and $\mathrm{G}$ bases in the genome. For humans, this content is approximately $42\,\%$. For any pair of consecutive nucleotides (in short dinucleotide), we can calculate their probability using the $\mathrm{CG}$-content. In DNA, not all dinucleotides are equally likely. For example, $\mathrm{C}$ followed by immediately $\mathrm{G}$ is not common as expected ($4.4\,\%$ in human).

The $\mathrm{C}$ nucleotide in $\mathrm{CG}$ dinucleotide often mutates to $\mathrm{T}$. This mutation is called methylation. However, methylation usually does not occur in parts of the genome called $\mathrm{CpG}$-islands. The $\mathrm{CpG}$-islands are located around starts of genes and are connected with the promoter regions.

Knowing that the $\mathrm{CpG}$-islands usually precede genes, finding those regions might be used for classification of genes. Hidden Markov Models come handy in this process. In this assignment, we will train a classifier capable of classification of $\mathrm{CpG}$-islands.

Formal Definition

Let $\{ (x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, N\}$ be a training set of sequences $x^{(i)}$ and their labels $y^{(i)}$, where $x^{(i)} \in \Sigma^+$ and $y^{(i)} \in \{ \mathrm{CpG}, \mathrm{null} \}$. Class $\mathrm{CpG}$ corresponds to $\mathrm{CpG}$-islands, class $\mathrm{null}$ corresponds to the rest of the genome.

Learn two Markov models approximating probability distributions $P(x \mid y)$. Assume the Markovian property, e.g., $$ P(x \mid \mathrm{CpG}) = P(x_1x_2\cdots x_l \mid \mathrm{CpG}) = P_\mathrm{CpG}(x_1)P_\mathrm{CpG}(x_2\mid x_1) P_\mathrm{CpG} (x_3 \mid x_2) \cdots P_\mathrm{CpG}(x_l \mid x_{l-1}) \cdot P_\mathrm{CpG}(\mathrm{end} \mid x_{l}). $$

Use the naive Bayes classifier to classify a new sequence $x$, i.e., the class is $\mathrm{CpG}$ iff $$P(\mathrm{CpG} \mid x) > P(\mathrm{null} \mid x). $$ Using the Bayes formula, the previous condition can be rewritten to $$ P(\mathrm{CpG}) \cdot P(x \mid \mathrm{CpG}) > P(\mathrm{null}) \cdot P(x \mid \mathrm{null}). $$

To learn the transitions probabilities, use the Bayes estimates with pseudocount of each class equal to $1$. Consider $\Sigma = \{ \mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{end} \}$. Then the apriori probability of symbol $a \in \Sigma$ is $$ P(a) = \frac{\mathop{\mathrm{count}}(a) + 1}{\mathop{\mathrm{count}}(\cdot) + 5}; $$ and the transition probabilities from $b$ to $a$ are estimated as $$ P(b \mid a) = \frac{\mathop{\mathrm{count}}(ab) + 1}{\mathop{\mathrm{count}}(a\cdot) + 5}. $$

Input and Output

Download the file with data: data.zip. Alternatively, you can download the data together with a teplate Python solution: sample_solution_3.zip. The data.zip file contains four files:

cpg_train.txt with a training dataset of $\mathrm{CpG}$-islands. Each line contains exactly one example.
null_train.txt with a training dataset of the rest of the genome. Each line contains exactly one example.
seqs_test.txt with test sequences in the same format as above.
classes_test.txt with the labeling of sequences from the seqs_test.txt. Class $\mathrm{CpG}$ corresponds to label $1$.

You may assume that those four files will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.

Your code should output the following two files:

predictions.txt with your predictions - $1$ or $0$ for each sequence. One prediction per line.
accuracy.txt with the quality of your predictions. Include five lines with the following values.
1. The number of correct predictions.
2. The number of wrong predictions.
3. The accuracy of your predictions.
4. The precision of your predictions.
5. The recall of your predictions.

More details on those statistics may be found on this Wikipedia page.

Scoring

Your program should run in linear time with respect to input and should consume only a linear amount of memory. Submitting a code that does not compile may result in significant point reduction.

External libraries

The alignment implementation must be solely your own work. However, you can use any external library for calculating the confusion matrix statistics. In Python, the pycm library may be handy, in Java the smile library provides some tools for calculating those statistics. If you need anything else, ask me in advance.

Hints

Be careful with sums and products of small numbers.
On the provided dataset, expect accuracy near to $100\,\%$.

Table of Contents