Table of Contents

Assignment 3: $\mathrm{CpG}$-ISLAND RECOGNITION

If you think that this homework is not for you, you might submit EXON SEARCH instead.

Motivation

$\mathrm{CG}$-content is the percentage of $\mathrm{C}$ and $\mathrm{G}$ bases in the genome. For humans, this content is approximately $42\,\%$. For any pair of consecutive nucleotides (in short dinucleotide), we can calculate their probability using the $\mathrm{CG}$-content. In DNA, not all dinucleotides are equally likely. For example, $\mathrm{C}$ followed by immediately $\mathrm{G}$ is not common as expected ($4.4\,\%$ in human).

The $\mathrm{C}$ nucleotide in $\mathrm{CG}$ dinucleotide often mutates to $\mathrm{T}$. This mutation is called methylation. However, methylation usually does not occur in parts of the genome called $\mathrm{CpG}$-islands. The $\mathrm{CpG}$-islands are located around starts of genes and are connected with the promoter regions.

Knowing that the $\mathrm{CpG}$-islands usually precede genes, finding those regions might be used for classification of genes. Hidden Markov Models come handy in this process. In this assignment, we will train a classifier capable of classification of $\mathrm{CpG}$-islands.

Formal Definition

Let $\{ (x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, N\}$ be a training set of sequences $x^{(i)}$ and their labels $y^{(i)}$, where $x^{(i)} \in \Sigma^+$ and $y^{(i)} \in \{ \mathrm{CpG}, \mathrm{null} \}$. Class $\mathrm{CpG}$ corresponds to $\mathrm{CpG}$-islands, class $\mathrm{null}$ corresponds to the rest of the genome.

Learn two Markov models approximating probability distributions $P(x \mid y)$. Assume the Markovian property, e.g., $$ P(x \mid \mathrm{CpG}) = P(x_1x_2\cdots x_l \mid \mathrm{CpG}) = P_\mathrm{CpG}(x_1)P_\mathrm{CpG}(x_2\mid x_1) P_\mathrm{CpG} (x_3 \mid x_2) \cdots P_\mathrm{CpG}(x_l \mid x_{l-1}) \cdot P_\mathrm{CpG}(\mathrm{end} \mid x_{l}). $$

Use the naive Bayes classifier to classify a new sequence $x$, i.e., the class is $\mathrm{CpG}$ iff $$P(\mathrm{CpG} \mid x) > P(\mathrm{null} \mid x). $$ Using the Bayes formula, the previous condition can be rewritten to $$ P(\mathrm{CpG}) \cdot P(x \mid \mathrm{CpG}) > P(\mathrm{null}) \cdot P(x \mid \mathrm{null}). $$

To learn the transitions probabilities, use the Bayes estimates with pseudocount of each class equal to $1$. Consider $\Sigma = \{ \mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{end} \}$. Then the apriori probability of symbol $a \in \Sigma$ is $$ P(a) = \frac{\mathop{\mathrm{count}}(a) + 1}{\mathop{\mathrm{count}}(\cdot) + 5}; $$ and the transition probabilities from $b$ to $a$ are estimated as $$ P(b \mid a) = \frac{\mathop{\mathrm{count}}(ab) + 1}{\mathop{\mathrm{count}}(a\cdot) + 5}. $$

Input and Output

Download the file with data: data.zip. Alternatively, you can download the data together with a teplate Python solution: sample_solution_3.zip. The data.zip file contains four files:

You may assume that those four files will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.

Your code should output the following two files:

More details on those statistics may be found on this Wikipedia page.

Scoring

Your program should run in linear time with respect to input and should consume only a linear amount of memory. Submitting a code that does not compile may result in significant point reduction.

External libraries

The alignment implementation must be solely your own work. However, you can use any external library for calculating the confusion matrix statistics. In Python, the pycm library may be handy, in Java the smile library provides some tools for calculating those statistics. If you need anything else, ask me in advance.

Hints