$\mathrm{CG}$-content is the percentage of $\mathrm{C}$ and $\mathrm{G}$ bases in a genome; in humans it is approximately $42\,\%$. From the $\mathrm{CG}$-content we can compute the expected probability of any pair of consecutive nucleotides (in short, a dinucleotide). In real DNA, however, not all dinucleotides are equally likely. For example, $\mathrm{C}$ immediately followed by $\mathrm{G}$ occurs considerably less often than the $4.4\,\%$ expected for the human genome.
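The expected figure quoted above follows directly from the $\mathrm{CG}$-content, assuming bases occur independently; a quick sanity check:

```python
# Expected CG-dinucleotide frequency under independence, given a
# CG-content of 42% (so C and G each occur with probability 0.21).
cg_content = 0.42
p_c = p_g = cg_content / 2
expected_cg = p_c * p_g
print(f"{expected_cg:.3f}")  # 0.044, i.e. about 4.4%
```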
The $\mathrm{C}$ nucleotide in a $\mathrm{CG}$ dinucleotide is often methylated, and methylated $\mathrm{C}$ has a tendency to mutate into $\mathrm{T}$. However, methylation is typically suppressed in parts of the genome called $\mathrm{CpG}$-islands. The $\mathrm{CpG}$-islands are located around the starts of genes and are associated with promoter regions.
Since $\mathrm{CpG}$-islands usually precede genes, finding these regions can serve as a signal for locating genes. Markov models come in handy in this process. In this assignment, we will train a classifier capable of recognizing $\mathrm{CpG}$-islands.
Let $\{ (x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, N\}$ be a training set of sequences $x^{(i)}$ and their labels $y^{(i)}$, where $x^{(i)} \in \Sigma^+$ and $y^{(i)} \in \{ \mathrm{CpG}, \mathrm{null} \}$. Class $\mathrm{CpG}$ corresponds to $\mathrm{CpG}$-islands, class $\mathrm{null}$ corresponds to the rest of the genome.
Learn two Markov models approximating the probability distributions $P(x \mid y)$. Assume the Markov property, i.e., $$ P(x \mid \mathrm{CpG}) = P(x_1x_2\cdots x_l \mid \mathrm{CpG}) = P_\mathrm{CpG}(x_1)P_\mathrm{CpG}(x_2\mid x_1) P_\mathrm{CpG} (x_3 \mid x_2) \cdots P_\mathrm{CpG}(x_l \mid x_{l-1}) \cdot P_\mathrm{CpG}(\mathrm{end} \mid x_{l}). $$
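In practice this product is evaluated as a sum of logarithms to avoid numerical underflow on long sequences. A minimal sketch, assuming the model is stored as plain dictionaries (the names `p_init` and `p_trans` are my own, not part of the assignment):

```python
import math

def markov_log_likelihood(seq, p_init, p_trans):
    """Log P(x | class) under a first-order Markov model.

    p_init  : dict mapping symbol -> P(symbol) at position 1
    p_trans : dict mapping (prev, next) -> P(next | prev);
              'end' is the terminal symbol from the factorization above.
    """
    logp = math.log(p_init[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(p_trans[(prev, nxt)])
    logp += math.log(p_trans[(seq[-1], "end")])  # termination term
    return logp
```

One such function is evaluated per class, with the parameters learned separately from `cpg_train.txt` and `null_train.txt`.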
Use the naive Bayes classifier to classify a new sequence $x$, i.e., the class is $\mathrm{CpG}$ iff $$P(\mathrm{CpG} \mid x) > P(\mathrm{null} \mid x). $$ Using Bayes' formula, the previous condition can be rewritten as $$ P(\mathrm{CpG}) \cdot P(x \mid \mathrm{CpG}) > P(\mathrm{null}) \cdot P(x \mid \mathrm{null}). $$
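The decision rule is then a comparison of two scores, again best done in log space. A sketch under my own naming: `log_lik_cpg` and `log_lik_null` are assumed to be functions returning $\log P(x \mid y)$ for each class (e.g. the Markov-model likelihoods), and the prior can be estimated from the class frequencies in the training set:

```python
import math

def classify(seq, log_lik_cpg, log_lik_null, prior_cpg=0.5):
    """Return 'CpG' iff P(CpG) * P(x|CpG) > P(null) * P(x|null)."""
    score_cpg = math.log(prior_cpg) + log_lik_cpg(seq)
    score_null = math.log(1.0 - prior_cpg) + log_lik_null(seq)
    return "CpG" if score_cpg > score_null else "null"
```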
To learn the transition probabilities, use Bayes estimates with a pseudocount of $1$ for each class. Consider $\Sigma = \{ \mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{end} \}$. Then the a priori probability of symbol $a \in \Sigma$ is $$ P(a) = \frac{\mathop{\mathrm{count}}(a) + 1}{\mathop{\mathrm{count}}(\cdot) + 5}; $$ and the transition probability from $a$ to $b$ is estimated as $$ P(b \mid a) = \frac{\mathop{\mathrm{count}}(ab) + 1}{\mathop{\mathrm{count}}(a\cdot) + 5}, $$ where $\mathop{\mathrm{count}}(ab)$ is the number of occurrences of $a$ immediately followed by $b$, and $\mathop{\mathrm{count}}(a\cdot)$ is the number of occurrences of $a$ followed by any symbol.
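The counting itself can be sketched as follows. This is only one reading of the estimates: I take $\mathop{\mathrm{count}}(a)$ for the a priori term to be the number of training sequences starting with $a$ (matching the $P(x_1)$ term in the factorization); adjust if the assignment defines it differently.

```python
from collections import Counter

ALPHABET = ["A", "C", "G", "T"]
SYMBOLS = ALPHABET + ["end"]  # full Sigma, |Sigma| = 5

def train_markov(seqs):
    """Bayes estimates with pseudocount 1 for one class of sequences.

    Returns (p_init, p_trans) where p_trans[(a, b)] = P(b | a).
    """
    # Assumption: count(a) = number of sequences beginning with a.
    init_counts = Counter(s[0] for s in seqs)
    pair_counts = Counter()
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            pair_counts[(prev, nxt)] += 1
        pair_counts[(s[-1], "end")] += 1  # terminal transition

    n = len(seqs)
    # Denominator uses +5 because |Sigma| = 5 (including 'end').
    p_init = {a: (init_counts[a] + 1) / (n + 5) for a in SYMBOLS}
    p_trans = {}
    for a in ALPHABET:  # 'end' has no outgoing transitions
        total = sum(pair_counts[(a, b)] for b in SYMBOLS)
        for b in SYMBOLS:
            p_trans[(a, b)] = (pair_counts[(a, b)] + 1) / (total + 5)
    return p_init, p_trans
```

Note that thanks to the pseudocounts, every transition probability is strictly positive even for pairs never seen in training, so the log-likelihoods are always defined.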
Download the file with the data: data.zip. Alternatively, you can download the data together with a template Python solution: sample_solution_3.zip. The data.zip file contains four files:
cpg_train.txt
null_train.txt
seqs_test.txt
classes_test.txt
You may assume that those four files will be located in the current working directory. I may use different sequences to test your code; the names will remain the same.
Your code should output the following two files:
predictions.txt
accuracy.txt
More details on those statistics may be found on this Wikipedia page.
Your program should run in time linear in the input size and should consume only a linear amount of memory. Submitting code that does not compile may result in a significant point reduction.
The classifier implementation must be solely your own work. However, you may use any external library for calculating the confusion-matrix statistics. In Python, the pycm library may come in handy; in Java, the Smile library provides some tools for calculating those statistics. If you need anything else, ask me in advance.