====== Tutorial 12 - Motif discovery ======

In this tutorial, we will practise the motif discovery algorithms learned during the motif finding lecture.

===== A simple problem =====

/** the solution is available in: tutorial_solved_example_motif_em_gibbs.pdf **/

Consider the following set of DNA sequences:

$$
\begin{array}{cccccc}
\mathrm{A} & \mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{G} & \mathrm{A}\\
\mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{C} & \mathrm{C} & \mathrm{A}\\
\mathrm{C} & \mathrm{G} & \mathrm{G} & \mathrm{A} & \mathrm{C} & \mathrm{T}\\
\end{array}
$$

The goal is to find a motif of length W=3. Answer the following questions:

  - What is the motif we are searching for? Are there any other plausible motifs in the sequence set? /** ACT; another candidate is CTG, but its matches (CTG, CTC and CGG) give a much weaker motif **/
  - How does its information content logo look? What is its average bit-score? /** ACT with 2 bits in every position, the average bit-score is thus 2 **/
  - Could the occurrence of such a motif in this sequence set be random? What is its E-value? /** the probability that we find some trigram in all three sequences is difficult to compute exactly, but it is obviously small **/ /** in the first sequence we draw 4 trigrams out of the 64 that exist (with probability 1), in the second sequence we have to hit one of them in 4 trials (1-(60/64)^4=0.23), and in the third sequence we have to hit the previous hit again in 4 trials (assuming only one previous hit, 1-(63/64)^4=0.061); the outcome is 0.0139. The problem is resampling (some trigrams could be drawn repeatedly and thus we generate fewer distinct trigrams) and the dependence among overlapping trigrams **/ /** see the experimental derivation in motif_eval_calc.R, the result was 0.014, a figure similar to the approximate estimate above **/

===== Task 1 =====

Your task is to use the **EM algorithm** and simulate its run. Start with a random motif position matrix $p$. Perform at least one E step, i.e., propose the expected values of the motif starting position matrix $Z$. Then, recalculate the motif position matrix $p$. Discuss the development of $p$.
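The following is a minimal sketch of one EM iteration on the three toy sequences above. It assumes the OOPS model (exactly one motif occurrence per sequence), a uniform background and a pseudocount of 0.25; it is not MEME itself, and all names and parameter values not stated above are illustrative assumptions.

<code python>
import numpy as np

SEQS = ["AACTGA", "ACTCCA", "CGGACT"]   # the toy sequences from above
ALPHABET = "ACGT"
W = 3                                   # motif width
PSEUDO = 0.25                           # pseudocount used in the M step (assumed value)

def em_iteration(p, background, seqs=SEQS):
    """One EM iteration under the OOPS model (one motif occurrence per sequence).

    p          -- W x 4 motif position matrix, rows sum to 1
    background -- length-4 background distribution
    Returns (Z, p_new): the expected start-position matrix and the re-estimated p.
    """
    n_starts = len(seqs[0]) - W + 1
    Z = np.zeros((len(seqs), n_starts))

    # E step: likelihood of each sequence given each motif start, normalized per sequence
    for i, s in enumerate(seqs):
        for j in range(n_starts):
            lik = 1.0
            for k, ch in enumerate(s):
                a = ALPHABET.index(ch)
                lik *= p[k - j, a] if j <= k < j + W else background[a]
            Z[i, j] = lik
        Z[i] /= Z[i].sum()

    # M step: expected nucleotide counts at each motif position, with pseudocounts
    counts = np.full((W, 4), PSEUDO)
    for i, s in enumerate(seqs):
        for j in range(n_starts):
            for k in range(W):
                counts[k, ALPHABET.index(s[j + k])] += Z[i, j]
    p_new = counts / counts.sum(axis=1, keepdims=True)
    return Z, p_new

# start from a random (normalized) motif matrix and a uniform background
rng = np.random.default_rng(0)
p0 = rng.random((W, 4))
p0 /= p0.sum(axis=1, keepdims=True)
Z, p1 = em_iteration(p0, np.full(4, 0.25))
print(np.round(Z, 2))
print(np.round(p1, 2))
</code>

Repeating the call with $p_1$, $p_2$, ... lets you watch how $p$ gradually concentrates on one motif, which is exactly what you should discuss in this task.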
==== MEME ====

/* MEME has a web server, but submitted jobs are queued and take a long time to process, which makes it unsuitable for the tutorial and, in fact, also for later use. Even a local installation takes too long for the tutorial, so it is best to only look at the MEME output during the class, see below. */

EM-based motif search is implemented in [[https://meme-suite.org/meme/doc/meme.html?man_type=web|MEME]]. It can be [[https://meme-suite.org/meme/doc/download.html|downloaded]] and [[https://meme-suite.org/meme/doc/install.html?man_type=web|installed locally]]. In our simple task, we will work with the following //tutorial_short.fasta// file:

<code>
>s1
AACTGA
>s2
ACTCCA
>s3
CGGACT
</code>

The MEME call could be:

<code>
meme tutorial_short.fasta -w 3 -dna -nmotifs 3
</code>

You can inspect its output {{:courses:bin:tutorials:meme.zip|here}}.

/* The motifs were found exactly as expected; the E-value for ACT is slightly higher, which could be caused by the ZOOPS model that was used. */

===== Task 2 =====

Your task is to use **Gibbs sampling** to learn the motif. Answer a couple of general questions first:

  - What is the state space? /* the set of initial motif positions in all the sequences */
  - How large is the state space? /* L-W+1 different initial positions in each sequence, altogether (L-W+1)^N, in our case 64 different states */
  - Describe the Markov chain we will deal with. /* The chain has the above-described states; each state has 12 possible transitions (3 sequences × 4 positions), but only 4 of them, those of the sequence currently being resampled, can be taken at each step. */
  - How many states will we need to visit? /* ideally, we would need to visit an infinite number of states; in practice we need to guarantee that we have reached convergence, i.e., the distribution of state frequencies does not change any more */
  - How will we eventually evaluate the walk through the states? /* we will study the relative frequency of the individual states in the walk; the best solution is provided by the most frequently visited state; alternatively, we can look at the motif bit-scores in the most frequent states (the sum of their position bit-scores) */

Start in a random state, i.e., a random configuration of motif starting positions. Perform at least two iterations = generate two more states. Discuss the sequence of sampled states and propose a further process that would lead to convergence. You can build on the graphs below (a small simulation sketch that reproduces a walk of this kind is given at the end of the page). The first graph shows the empirical state distribution in terms of the relative frequency of states visited during Gibbs sampling. The second graph depicts the average bit-scores of the individual states. The last graph illustrates how quickly the best motif is found (= visited for the first time):

{{ :courses:bin:tutorials:gibbs_state_freq.png|The relative state frequency in Gibbs sampling in our simple example.}}

/** We construct a Gibbs walk through our state space. The graph shows that the above-mentioned motifs ACT and CTG correspond to by far the most often visited states, 214 (ACT) and 321 (CTG). The graph was constructed from a sequence of 100,000 states generated by Gibbs sampling with the regularization constant = 0.25. The number of visits should match the state probability. Some of the 64 states were not visited even once; the algorithm has to show a strong preference for the important states, i.e., those with a large state probability. The distribution gets more even with a larger regularization constant. **/

{{ :courses:bin:tutorials:state_bit_scores.png|The average bit-score in the individual states.}}

/** The graph demonstrates that the average bit-scores (see information content logos) also have the ability to identify interesting motifs. In our case, 214 (ACT) and 321 (CTG) belong to the leading states. However, the differences among the states are relatively small. There are no states with a zero bit-score (the distribution of 4 nucleotides in 3 sequences can never be perfectly even at any position), and there are also states pretty close to the ACT motif that score as high as the CTG motif. Consequently, this measure would not be a good choice to drive Gibbs sampling. **/

{{ :courses:bin:tutorials:bit_score_iterations_wide.png|The average bit-score in the best state reached until the given iteration.}}

/** The graph shows five runs; in all of them, the best motif ACT was reached before the 32nd iteration. The graph builds on the idea that we are not interested in the state probability distribution, we only wish to find the best motif. **/

===== Reference =====

Both algorithms are described {{ :courses:bin:bin_motif.pdf|here}}.
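As a complement to Task 2, here is the small simulation sketch referenced above: a minimal Gibbs sampling walk over the 64 states of the toy problem. It reuses the walk length (100,000 steps) and the regularization constant (0.25) mentioned in the graph description, assumes a uniform background and a randomly chosen sequence to resample in each step; it is not the script that produced the graphs, and all names are illustrative.

<code python>
import numpy as np

SEQS = ["AACTGA", "ACTCCA", "CGGACT"]   # the toy sequences from the task
ALPHABET = "ACGT"
W = 3                                   # motif width
PSEUDO = 0.25                           # regularization constant, as in the graph above
N_ITER = 100_000                        # length of the walk, as in the graph above

rng = np.random.default_rng(1)
n_starts = len(SEQS[0]) - W + 1         # 4 possible motif starts per sequence

def profile_from(starts, skip):
    """Position probability matrix (with pseudocounts) from all sequences except `skip`."""
    counts = np.full((W, 4), PSEUDO)
    for i, s in enumerate(SEQS):
        if i == skip:
            continue
        for k in range(W):
            counts[k, ALPHABET.index(s[starts[i] + k])] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# random initial state = one motif start per sequence
starts = [int(x) for x in rng.integers(0, n_starts, size=len(SEQS))]
visits = {}

for _ in range(N_ITER):                 # runs in a few seconds
    i = int(rng.integers(len(SEQS)))    # sequence whose start is resampled in this step
    p = profile_from(starts, skip=i)
    # score every candidate start of sequence i against the profile (uniform 0.25 background)
    scores = np.array([
        np.prod([p[k, ALPHABET.index(SEQS[i][j + k])] / 0.25 for k in range(W)])
        for j in range(n_starts)
    ])
    starts[i] = int(rng.choice(n_starts, p=scores / scores.sum()))
    state = "".join(str(s + 1) for s in starts)   # e.g. "214" = 1-based starts 2, 1, 4
    visits[state] = visits.get(state, 0) + 1

# the most frequently visited states should correspond to the ACT (and CTG) motifs
for state, cnt in sorted(visits.items(), key=lambda kv: -kv[1])[:5]:
    print(state, cnt)
</code>

Counting the visits of each state, as done in the last loop, is exactly the statistic shown in the first graph of Task 2; the states 214 (ACT) and 321 (CTG) should dominate the output.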