Search
In this tutorial, we will practise the motif discovery algorithms learned during the motif finding lecture.
Consider the following set of DNA sequences: $$ \begin{array} \mathrm{A} & \mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{G} & \mathrm{A}\\ \mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{C} & \mathrm{C} & \mathrm{A}\\ \mathrm{C} & \mathrm{G} & \mathrm{G} & \mathrm{A} & \mathrm{C} & \mathrm{T}\\ \end{array} $$
The goal is to find a motif of length W=3. Answer the following questions:
Your task is to use the EM algorithm and simulate its run. Start with a random motif position matrix $p$. Perform at least one E step, i.e., propose the expected values of the motif starting position matrix $Z$. Then, recalculate the motif position matrix $p$. Discuss the development of $p$.
EM-based motif search is implemented in MEME. It can be downloaded and installed locally. In our simple task, we would work with the following tutorial_short.fasta file:
>s1 AACTGA >s2 ACTCCA >s3 CGGACT
The MEME call could be:
bash
meme tutorial_short.fasta -w 3 -dna -nmotifs 3
You can inspect its output here.
Your task is to use the Gibbs sampling to learn the motif. Answer a couple of general questions first:
Start in a random state, i.e., a random configuration of motif starting positions. Perform at least two iterations = generate other two states. Discuss the sequence of sampled states and propose a further process that would lead to convergence.
You can stem from the graphs below. The first one shows the empirical state distributions in terms of their relative frequency in Gibbs sampling. The second graph depicts the average bit-scores in the individual states. The last graph illustrates how quickly the best motif is found (= visited for the first time):
Both the algorithms were described here.