Tutorial 12 - Motif discovery

In this tutorial, we will practise the motif discovery algorithms learned during the motif finding lecture.

A simple problem

Consider the following set of DNA sequences: $$ \begin{array} \mathrm{A} & \mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{G} & \mathrm{A}\\ \mathrm{A} & \mathrm{C} & \mathrm{T} & \mathrm{C} & \mathrm{C} & \mathrm{A}\\ \mathrm{C} & \mathrm{G} & \mathrm{G} & \mathrm{A} & \mathrm{C} & \mathrm{T}\\ \end{array} $$

The goal is to find a motif of length W=3. Answer the following questions:

What is the motif we are searching for? Are there any other plausible motifs in the sequence set?
How does its information content logo look? What is its average bit-score?
Could occurrence of such a motif in this sequence set be random? What is its E-value?

Task 1

Your task is to use the EM algorithm and simulate its run. Start with a random motif position matrix $p$. Perform at least one E step, i.e., propose the expected values of the motif starting position matrix $Z$. Then, recalculate the motif position matrix $p$. Discuss the development of $p$.

MEME

EM-based motif search is implemented in MEME. It can be downloaded and installed locally. In our simple task, we would work with the following tutorial_short.fasta file:

>s1
AACTGA
>s2
ACTCCA
>s3
CGGACT

The MEME call could be:

bash

meme tutorial_short.fasta -w 3 -dna -nmotifs 3

You can inspect its output here.

Task 2

Your task is to use the Gibbs sampling to learn the motif. Answer a couple of general questions first:

What is the state space?
How large the state space is?
Describe the Markov chain we will deal with.
How many states will we need to visit?
How will we eventually evaluate the walk through states?

Start in a random state, i.e., a random configuration of motif starting positions. Perform at least two iterations = generate other two states. Discuss the sequence of sampled states and propose a further process that would lead to convergence.

You can stem from the graphs below. The first one shows the empirical state distributions in terms of their relative frequency in Gibbs sampling. The second graph depicts the average bit-scores in the individual states. The last graph illustrates how quickly the best motif is found (= visited for the first time):

Reference

Both the algorithms were described here.

Table of Contents