Tutorial 2 - de Bruijn and Overlap graphs, Velvet tutorial

Problems

  1. Construct a de Bruijn graph for genome TAATGCCATGGGATGTT and k=3. Identify all possible assemblies from the graph. Why is the resulting assembly not unique?
  2. Now do the same; however, set k=5.
  3. Consider reads TGGCA, GCATTGCAA, TGCAAT, CAATT, ATTTGAC.
    1. Assemble the reads using OLC (overlap-layout-consensus, i.e., the Hamiltonian approach).
    2. Assemble the reads using a de Bruijn graph with k=4.
    3. Assemble the reads using a de Bruijn graph with k=5.

Practical Example

In this tutorial, we are going to de-novo assembly a genome of an unknown organism. First, download the read data:

bash

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz
The read data were produced by a sequencer. The FASTQ file format is used. Look into the first file and note that 4 consecutive lines represent a single read. Because the file is zipped, we can view it using commands as zcat (be careful …), zless, zmore, etc.

bash

zless SRR292770_1.fastq.gz
The first line contains an identifier, starting with @. The second line contains the read itself; the third contains just +. Find out what is the meaning of the fourth line. You can use Wikipedia. How does the sequencing machine come up with the estimates?

Download and unpack the Velvet assembler. This algorithm was proposed here: https://doi.org/10.1101/gr.074492.107.

bash

wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar zxvf velvet_1.2.10.tgz

Now build the assembler.

bash

cd velvet_1.2.10
make MAXKMERLENGTH=60 OPENMP=1
cd ..

At this point, we are ready to run the assembly algorithm. Velvet first calculates hashes, using velveth command. Then velvetg command is used for deBruijn graph construction. Run

bash

./velvet_1.2.10/velveth
./velvet_1.2.10/velvetg
to find out about the usage of the commands. Remember that we use paired-end reads in two files. In this first experiment set hash length to 35 (i.e., k-mer size is 35). velvetg has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set -cov_cutoff 2.81. We are only interested in contigs long 200 base-pairs or more. Now assemble the genome yourself using appropriate commands.

You can find out how many contigs were produced by running

bash

cat <out_dir_35>/contigs.fa
This time, contigs are in FASTA format. In the next tutorial, we will find out which organism was assembled.

Change k and other settings of the Velvet assembler. Watch how they influence assembly results.

courses/bin/tutorials/tutorial2.txt · Last modified: 2018/02/28 11:03 by rysavpe1