====== Assignment 3: $\mathrm{CpG}$-ISLAND RECOGNITION ======

  * 5 points
  * Deadline: <del>Wednesday, April 29th, 2020, 23:59</del> Thursday, April 30th, 2020, 23:59
  * Late submission penalty: -0.25 point per day
  * Submit to [[https://cw.felk.cvut.cz/brute/|BRUTE]].
  * Work individually.
  * Submit:
    * your source code in a language of your choice,
    * a bash script ''compile.sh'' that compiles your code,
    * a wrapper bash script ''cpg.sh'' that calculates the required output.
  * Your source code must compile either
    * on the Ubuntu machines in the lab, or
    * on a freshly installed Ubuntu machine (if this is the case, provide a file ''packages.txt'' with the list of packages that need to be installed).
  * Should you have any questions, email [[petr.rysavy@fel.cvut.cz|petr.rysavy@fel.cvut.cz]].
  * You can assume that the input is correct; the algorithm will not encounter any unknown parameters or misspelled input, so you do not need to check for those.

===== Motivation =====

$\mathrm{CG}$-content is the percentage of $\mathrm{C}$ and $\mathrm{G}$ bases in a genome; for humans, it is approximately $42\,\%$. For any pair of consecutive nucleotides (a //dinucleotide// for short), the $\mathrm{CG}$-content lets us calculate the probability expected if the bases were independent. In real DNA, however, not all dinucleotides are equally likely. For example, $\mathrm{C}$ followed immediately by $\mathrm{G}$ is less common than the expected $4.4\,\%$ in humans. The reason is that the $\mathrm{C}$ in a $\mathrm{CG}$ dinucleotide is often modified by [[https://en.wikipedia.org/wiki/Methylation|methylation]], and a methylated $\mathrm{C}$ tends to mutate to $\mathrm{T}$. Methylation, however, usually does not occur in parts of the genome called $\mathrm{CpG}$-islands. The $\mathrm{CpG}$-islands are located around the starts of genes and are connected with promoter regions. Since $\mathrm{CpG}$-islands usually precede genes, finding these regions can be used to locate genes. Hidden Markov models come in handy in this process.
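To see where the $4.4\,\%$ figure comes from: with a $\mathrm{CG}$-content of $42\,\%$, each of $\mathrm{C}$ and $\mathrm{G}$ occurs with probability $0.21$, so under independence the expected frequency of the $\mathrm{CG}$ dinucleotide would be $0.21 \cdot 0.21 \approx 4.4\,\%$. A quick sanity check:

```python
# Expected CG-dinucleotide frequency under independence,
# given a CG-content of 42% split evenly between C and G.
cg_content = 0.42
p_c = p_g = cg_content / 2       # P(C) = P(G) = 0.21
expected_cg = p_c * p_g          # 0.21 * 0.21 = 0.0441
print(f"{expected_cg:.1%}")      # → 4.4%
```

The observed frequency of $\mathrm{CG}$ in the human genome falls well below this independence baseline, which is exactly the signal the classifier exploits.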
In this assignment, we will train a classifier capable of recognizing $\mathrm{CpG}$-islands.

===== Formal Definition =====

Let $\{ (x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, N\}$ be a training set of sequences $x^{(i)}$ and their labels $y^{(i)}$, where $x^{(i)} \in \Sigma^+$ and $y^{(i)} \in \{ \mathrm{CpG}, \mathrm{null} \}$. Class $\mathrm{CpG}$ corresponds to $\mathrm{CpG}$-islands; class $\mathrm{null}$ corresponds to the rest of the genome.

Learn two Markov models approximating the probability distributions $P(x \mid y)$. Assume the Markovian property, i.e., for class $\mathrm{CpG}$ (and analogously for $\mathrm{null}$) $$ P(x \mid \mathrm{CpG}) = P(x_1x_2\cdots x_l \mid \mathrm{CpG}) = P_\mathrm{CpG}(x_1)P_\mathrm{CpG}(x_2\mid x_1) P_\mathrm{CpG} (x_3 \mid x_2) \cdots P_\mathrm{CpG}(x_l \mid x_{l-1}) \cdot P_\mathrm{CpG}(\mathrm{end} \mid x_{l}). $$

Use the naive Bayes classifier to classify a new sequence $x$, i.e., predict class $\mathrm{CpG}$ iff $$P(\mathrm{CpG} \mid x) > P(\mathrm{null} \mid x). $$ Using the Bayes formula, the previous condition can be rewritten as $$ P(\mathrm{CpG}) \cdot P(x \mid \mathrm{CpG}) > P(\mathrm{null}) \cdot P(x \mid \mathrm{null}). $$

To learn the transition probabilities, use Bayes estimates with a pseudocount of $1$ for each symbol. Consider $\Sigma = \{ \mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{end} \}$. Then the a priori probability of symbol $a \in \Sigma$ is $$ P(a) = \frac{\mathop{\mathrm{count}}(a) + 1}{\mathop{\mathrm{count}}(\cdot) + 5}; $$ and the transition probability from $a$ to $b$ is estimated as $$ P(b \mid a) = \frac{\mathop{\mathrm{count}}(ab) + 1}{\mathop{\mathrm{count}}(a\cdot) + 5}. $$

===== Input and Output =====

Download the file with the data: {{../data.zip}}. Alternatively, you can download the data together with a template Python solution: {{sample_solution_3.zip}}. The {{../data.zip}} file contains four files:

  * ''cpg_train.txt'' with a training dataset of $\mathrm{CpG}$-islands. Each line contains exactly one example.
  * ''null_train.txt'' with a training dataset of the rest of the genome. Each line contains exactly one example.
  * ''seqs_test.txt'' with test sequences in the same format as above.
  * ''classes_test.txt'' with the labels of the sequences from ''seqs_test.txt''. Class $\mathrm{CpG}$ corresponds to label $1$.

You may assume that these four files will be located in the current working directory. I may use different sequences to test your code; the file names will remain the same. Your code should output the following two files:

  * ''predictions.txt'' with your predictions - $1$ or $0$ for each sequence, one prediction per line.
  * ''accuracy.txt'' with the quality of your predictions. Include five lines with the following values:
    - the number of correct predictions,
    - the number of wrong predictions,
    - the accuracy of your predictions,
    - the precision of your predictions,
    - the recall of your predictions.

More details on these statistics can be found on [[https://en.wikipedia.org/wiki/Confusion_matrix|this Wikipedia page]].

===== Scoring =====

Your program should run in time linear in the input size and should consume only a linear amount of memory. Submitting code that does not compile may result in a significant point reduction.

===== External libraries =====

The classifier implementation must be solely your own work. However, you can use any external library for calculating the confusion-matrix statistics. In Python, the [[https://github.com/sepandhaghighi/pycm|pycm]] library may be handy; in Java, the [[https://github.com/haifengl/smile|smile]] library provides some tools for calculating those statistics. If you need anything else, ask me in advance.

===== Hints =====

  * Be careful with sums and products of small numbers; consider working with log-probabilities.
  * On the provided dataset, expect accuracy close to $100\,\%$.
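The pieces above can be sketched end to end in Python. This is only an illustrative outline, not the required solution: the function names are my own, I read $\mathop{\mathrm{count}}(a)$ in the a priori formula as the count of $a$ among the first symbols of the training sequences (one possible interpretation), and, per the hint about small numbers, all probabilities are compared in log-space.

```python
import math
from collections import Counter

END = "end"                        # virtual end-of-sequence symbol
SIGMA = ["A", "C", "G", "T", END]  # the alphabet Sigma from the assignment

def train(seqs):
    """Bayes estimates with pseudocount 1, mirroring the assignment formulas.
    count(a) is taken over first symbols of the training sequences (assumed)."""
    first = Counter(s[0] for s in seqs)
    trans, out = Counter(), Counter()
    for s in seqs:
        for a, b in list(zip(s, s[1:])) + [(s[-1], END)]:
            trans[a, b] += 1       # count(ab)
            out[a] += 1            # count(a.)
    log_first = {a: math.log((first[a] + 1) / (len(seqs) + 5)) for a in SIGMA}
    log_trans = {(a, b): math.log((trans[a, b] + 1) / (out[a] + 5))
                 for a in SIGMA for b in SIGMA}
    return log_first, log_trans

def log_likelihood(model, x):
    """log P(x|y) = log P(x1) + sum_k log P(x_{k+1}|x_k) + log P(end|x_l)."""
    log_first, log_trans = model
    lp = log_first[x[0]]
    for a, b in zip(x, x[1:]):
        lp += log_trans[a, b]
    return lp + log_trans[x[-1], END]

def classify(x, cpg, null, log_prior_cpg, log_prior_null):
    """1 iff P(CpG) * P(x|CpG) > P(null) * P(x|null), compared in log-space."""
    return int(log_prior_cpg + log_likelihood(cpg, x)
               > log_prior_null + log_likelihood(null, x))

def confusion_stats(pred, truth):
    """The five values required in accuracy.txt (1 = CpG is positive)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    correct, wrong = tp + tn, fp + fn
    return (correct, wrong, correct / (correct + wrong),
            tp / (tp + fp), tp / (tp + fn))
```

A typical driver would call ''train'' once per training file, estimate the class priors as $P(\mathrm{CpG}) = N_\mathrm{CpG} / N$ from the training set sizes, classify each line of ''seqs_test.txt'', and write the results to ''predictions.txt'' and ''accuracy.txt'' (one value per line). Both training and classification make a single pass over each sequence, so the sketch meets the linear time and memory requirement.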