====== Assignment 3: $\mathrm{CpG}$-ISLAND RECOGNITION ======

  * 5 points
  * Deadline: <del>Wednesday, April 29th, 2020, 23:59</del> Thursday, April 30th, 2020, 23:59
  * Late submission penalty: -0.25 point per day
  * Submit to [[https://cw.felk.cvut.cz/brute/|BRUTE]].
  * Work individually.
  * Submit:
    * your source code in a language of your choice,
    * a bash script ''compile.sh'' that compiles your code,
    * a wrapper bash script ''cpg.sh'' that calculates the required output.
  * Your source code must compile either
    * on the Ubuntu machines in the lab, or
    * on a freshly installed Ubuntu machine (if this is the case, provide a file ''packages.txt'' with the list of packages that need to be installed).
  * Should you have any questions, email [[petr.rysavy@fel.cvut.cz|petr.rysavy@fel.cvut.cz]].
  * You can assume that the input is correct; the algorithm will not encounter any unknown parameters or misspelled input, so you do not need to check for those.

===== Motivation =====

$\mathrm{CG}$-content is the percentage of $\mathrm{C}$ and $\mathrm{G}$ bases in a genome; for humans, it is approximately $42\,\%$. For any pair of consecutive nucleotides (a //dinucleotide// for short), the $\mathrm{CG}$-content lets us calculate the probability expected if the bases were independent. In real DNA, however, not all dinucleotides are equally likely. For example, $\mathrm{C}$ followed immediately by $\mathrm{G}$ is less common than the expected $4.4\,\%$ in humans. The reason is that the $\mathrm{C}$ in a $\mathrm{CG}$ dinucleotide is often modified by [[https://en.wikipedia.org/wiki/Methylation|methylation]], and a methylated $\mathrm{C}$ tends to mutate to $\mathrm{T}$. Methylation, however, usually does not occur in parts of the genome called $\mathrm{CpG}$-islands. The $\mathrm{CpG}$-islands are located around the starts of genes and are connected with promoter regions. Since $\mathrm{CpG}$-islands usually precede genes, finding these regions can be used to locate genes. Hidden Markov models come in handy in this process.
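To see where the $4.4\,\%$ figure comes from: with a $\mathrm{CG}$-content of $42\,\%$, each of $\mathrm{C}$ and $\mathrm{G}$ occurs with probability $0.21$, so under independence the expected frequency of the $\mathrm{CG}$ dinucleotide would be $0.21 \cdot 0.21 \approx 4.4\,\%$. A quick sanity check:

```python
# Expected CG-dinucleotide frequency under independence,
# given a CG-content of 42% split evenly between C and G.
cg_content = 0.42
p_c = p_g = cg_content / 2       # P(C) = P(G) = 0.21
expected_cg = p_c * p_g          # 0.21 * 0.21 = 0.0441
print(f"{expected_cg:.1%}")      # → 4.4%
```

The observed frequency of $\mathrm{CG}$ in the human genome falls well below this independence baseline, which is exactly the signal the classifier exploits.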
In this assignment, we will train a classifier capable of recognizing $\mathrm{CpG}$-islands.

===== Formal Definition =====

Let $\{ (x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, N\}$ be a training set of sequences $x^{(i)}$ and their labels $y^{(i)}$, where $x^{(i)} \in \Sigma^+$ and $y^{(i)} \in \{ \mathrm{CpG}, \mathrm{null} \}$. Class $\mathrm{CpG}$ corresponds to $\mathrm{CpG}$-islands; class $\mathrm{null}$ corresponds to the rest of the genome.

Learn two Markov models approximating the probability distributions $P(x \mid y)$. Assume the Markovian property, i.e., for class $\mathrm{CpG}$ (and analogously for $\mathrm{null}$) $$ P(x \mid \mathrm{CpG}) = P(x_1x_2\cdots x_l \mid \mathrm{CpG}) = P_\mathrm{CpG}(x_1)P_\mathrm{CpG}(x_2\mid x_1) P_\mathrm{CpG} (x_3 \mid x_2) \cdots P_\mathrm{CpG}(x_l \mid x_{l-1}) \cdot P_\mathrm{CpG}(\mathrm{end} \mid x_{l}). $$

Use the naive Bayes classifier to classify a new sequence $x$, i.e., predict class $\mathrm{CpG}$ iff $$P(\mathrm{CpG} \mid x) > P(\mathrm{null} \mid x). $$ Using the Bayes formula, the previous condition can be rewritten as $$ P(\mathrm{CpG}) \cdot P(x \mid \mathrm{CpG}) > P(\mathrm{null}) \cdot P(x \mid \mathrm{null}). $$

To learn the transition probabilities, use Bayes estimates with a pseudocount of $1$ for each symbol. Consider $\Sigma = \{ \mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{end} \}$. Then the a priori probability of symbol $a \in \Sigma$ is $$ P(a) = \frac{\mathop{\mathrm{count}}(a) + 1}{\mathop{\mathrm{count}}(\cdot) + 5}; $$ and the transition probability from $a$ to $b$ is estimated as $$ P(b \mid a) = \frac{\mathop{\mathrm{count}}(ab) + 1}{\mathop{\mathrm{count}}(a\cdot) + 5}. $$

===== Input and Output =====

Download the file with the data: {{../data.zip}}. Alternatively, you can download the data together with a template Python solution: {{sample_solution_3.zip}}. The {{../data.zip}} file contains four files:

  * ''cpg_train.txt'' with a training dataset of $\mathrm{CpG}$-islands. Each line contains exactly one example.
  * ''null_train.txt'' with a training dataset of the rest of the genome. Each line contains exactly one example.
  * ''seqs_test.txt'' with test sequences in the same format as above.
  * ''classes_test.txt'' with the labels of the sequences from ''seqs_test.txt''. Class $\mathrm{CpG}$ corresponds to label $1$.

You may assume that these four files will be located in the current working directory. I may use different sequences to test your code; the file names will remain the same. Your code should output the following two files:

  * ''predictions.txt'' with your predictions - $1$ or $0$ for each sequence, one prediction per line.
  * ''accuracy.txt'' with the quality of your predictions. Include five lines with the following values:
    - the number of correct predictions,
    - the number of wrong predictions,
    - the accuracy of your predictions,
    - the precision of your predictions,
    - the recall of your predictions.

More details on these statistics can be found on [[https://en.wikipedia.org/wiki/Confusion_matrix|this Wikipedia page]].

===== Scoring =====

Your program should run in time linear in the input size and should consume only a linear amount of memory. Submitting code that does not compile may result in a significant point reduction.

===== External libraries =====

The classifier implementation must be solely your own work. However, you can use any external library for calculating the confusion-matrix statistics. In Python, the [[https://github.com/sepandhaghighi/pycm|pycm]] library may be handy; in Java, the [[https://github.com/haifengl/smile|smile]] library provides some tools for calculating those statistics. If you need anything else, ask me in advance.

===== Hints =====

  * Be careful with sums and products of small numbers; consider working with log-probabilities.
  * On the provided dataset, expect accuracy close to $100\,\%$.
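The pieces above can be sketched end to end in Python. This is only an illustrative outline, not the required solution: the function names are my own, I read $\mathop{\mathrm{count}}(a)$ in the a priori formula as the count of $a$ among the first symbols of the training sequences (one possible interpretation), and, per the hint about small numbers, all probabilities are compared in log-space.

```python
import math
from collections import Counter

END = "end"                        # virtual end-of-sequence symbol
SIGMA = ["A", "C", "G", "T", END]  # the alphabet Sigma from the assignment

def train(seqs):
    """Bayes estimates with pseudocount 1, mirroring the assignment formulas.
    count(a) is taken over first symbols of the training sequences (assumed)."""
    first = Counter(s[0] for s in seqs)
    trans, out = Counter(), Counter()
    for s in seqs:
        for a, b in list(zip(s, s[1:])) + [(s[-1], END)]:
            trans[a, b] += 1       # count(ab)
            out[a] += 1            # count(a.)
    log_first = {a: math.log((first[a] + 1) / (len(seqs) + 5)) for a in SIGMA}
    log_trans = {(a, b): math.log((trans[a, b] + 1) / (out[a] + 5))
                 for a in SIGMA for b in SIGMA}
    return log_first, log_trans

def log_likelihood(model, x):
    """log P(x|y) = log P(x1) + sum_k log P(x_{k+1}|x_k) + log P(end|x_l)."""
    log_first, log_trans = model
    lp = log_first[x[0]]
    for a, b in zip(x, x[1:]):
        lp += log_trans[a, b]
    return lp + log_trans[x[-1], END]

def classify(x, cpg, null, log_prior_cpg, log_prior_null):
    """1 iff P(CpG) * P(x|CpG) > P(null) * P(x|null), compared in log-space."""
    return int(log_prior_cpg + log_likelihood(cpg, x)
               > log_prior_null + log_likelihood(null, x))

def confusion_stats(pred, truth):
    """The five values required in accuracy.txt (1 = CpG is positive)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    correct, wrong = tp + tn, fp + fn
    return (correct, wrong, correct / (correct + wrong),
            tp / (tp + fp), tp / (tp + fn))
```

A typical driver would call ''train'' once per training file, estimate the class priors as $P(\mathrm{CpG}) = N_\mathrm{CpG} / N$ from the training set sizes, classify each line of ''seqs_test.txt'', and write the results to ''predictions.txt'' and ''accuracy.txt'' (one value per line). Both training and classification make a single pass over each sequence, so the sketch meets the linear time and memory requirement.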