====== Tutorial 2 - de Bruijn and Overlap graphs, Velvet tutorial ====== ===== Problems ===== - Construct a de Bruijn graph for genome ''TAATGCCATGGGATGTT'' and $k=3$. Identify all possible assemblies from the graph. Why is the resulting assembly not unique? - Now do the same; however, set $k=5$. - Consider reads ''TGGCA'', ''GCATTGCAA'', ''TGCAAT'', ''CAATT'', ''ATTTGAC''. - Assemble the reads using OLC (overlap-layout-consensus, i.e., the Hamiltonian approach). - Assemble the reads using a de Bruijn graph with $k=4$. - Assemble the reads using a de Bruijn graph with $k=5$. - Regarding the de Bruijn approach, which situation is better? How does the algorithm deal with ambiguous paths? ===== Practical Example ===== In this tutorial, we are going to de-novo assembly a genome of an unknown organism. First, download the read data: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz The read data were produced by a sequencer. The **FASTQ** file format is used. Look into the first file and note that 4 consecutive lines represent a single read. Because the file is zipped, we can view it using commands as ''zcat'' (be careful ...), ''zless'', ''zmore'', etc. zless SRR292770_1.fastq.gz The first line contains an identifier, starting with ''@''. The second line contains the read itself; the third contains just ''+''. Find out what is the meaning of the fourth line. You can use Wikipedia. How does the sequencing machine come up with the estimates? Download and unpack the Velvet assembler. This algorithm was proposed here: [[https://doi.org/10.1101/gr.074492.107|https://doi.org/10.1101/gr.074492.107]]. git clone https://github.com/dzerbino/velvet Now build the assembler. cd velvet_1.2.10 make MAXKMERLENGTH=60 OPENMP=1 cd .. At this point, we are ready to run the assembly algorithm. Velvet first calculates hashes, using ''velveth'' command. Then ''velvetg'' command is used for deBruijn graph construction. Run ./velvet_1.2.10/velveth ./velvet_1.2.10/velvetg to find out about the usage of the commands. Remember that we use paired-end reads in two files. In this first experiment set hash length to 35 (i.e., //k//-mer size is 35). ''velvetg'' has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set ''-cov_cutoff 2.81''. We are only interested in contigs long 200 base-pairs or more. Now assemble the genome yourself using appropriate commands. You can find out how many contigs were produced by running cat /contigs.fa This time, contigs are in **FASTA** format. Use BLAST to find out which organism was assembled. Change //k// and other settings of the Velvet assembler. Watch how they influence assembly results. ==== Visualization ==== To visualize the assembly, you can use [[http://rrwick.github.io/Bandage/|Bandage]] program. First, download the program wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip unzip Bandage_Ubuntu_dynamic_v0_8_1.zip To run it, call ./Bandage The program visualizes the de-Bruijn graph from the Velvet assembler, which you can find in the folder where you put the output from the velvet assembler. {{ :courses:bin:tutorials:bandage_full.png?nolink&800 |}} {{ :courses:bin:tutorials:bandage_misassembly.png?nolink&800 |}} In the graph, find contigs, that are not connected with the remains of the graph. Find likely assembly errors and repeats. Each block in the graph represents one contig. Another visualization tool is [[https://ics.hutton.ac.uk/tablet/|Tablet]]. Download and install it by typing wget https://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_21_02_08.sh chmod +x tablet_linux_x64_1_17_08_17.sh ./tablet_linux_x64_1_17_08_17.sh We have to tell velvet to store additional statistics about assembly. For this purpose, call ''velvetg'' again with an additional parameter '' -amos_file yes''. Next, open the ''.afg'' file in Tablet, for example by tablet your_output_directory/velvet_asm.afg You can use ''tablet'' to identify assembly errors as one below in the picture. {{ :courses:bin:tutorials:tablet_assembly_error.png?nolink&800 |}} How can you explain the following situation when a short subsequence of a contig has twice that high coverage as remains of the contig? {{ :courses:bin:tutorials:tablet_repeat.png?nolink&800 |}} More details about using the Velvet assembler and other tools can be found on [[https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/user/18/private/velvet-practical_part-1.pdf|ENA website]].