courses:bin:tutorials:tutorial2 [2019/02/25 12:58]
courses:bin:tutorials:tutorial2 [2024/02/09 10:17]
 +====== Tutorial 2 - de Bruijn and Overlap graphs, Velvet tutorial ​ ======
  
 +
 +===== Problems =====
 +
 +  - Construct a de Bruijn graph for genome ''​TAATGCCATGGGATGTT''​ and $k=3$. Identify all possible assemblies from the graph. Why is the resulting assembly not unique?
 +  - Now do the same; however, set $k=5$.
 +  - Consider reads ''​TGGCA'',​ ''​GCATTGCAA'',​ ''​TGCAAT'',​ ''​CAATT'',​ ''​ATTTGAC''​.
 +      - Assemble the reads using OLC (overlap-layout-consensus,​ i.e., the Hamiltonian approach).
 +      - Assemble the reads using a de Bruijn graph with $k=4$.
 +      - Assemble the reads using a de Bruijn graph with $k=5$.
 +      - Regarding the de Bruijn approach, which situation is better? How does the algorithm deal with ambiguous paths?
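 +
 +The two graph constructions used in the problems above can be sketched in a few lines of shell. This is only a minimal helper for checking hand-drawn graphs, not an assembler: the function names ''debruijn_edges'' and ''overlap_len'' are hypothetical, and the sketch assumes error-free sequences over a plain alphabet. It lists the edges; finding the paths (assemblies) is still left to you.

```shell
#!/bin/sh

# Print one de Bruijn edge per k-mer occurrence in sequence $1:
# the (k-1)-mer prefix of each k-mer points to its (k-1)-mer suffix.
debruijn_edges() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) {
            kmer = substr(s, i, k)
            print substr(kmer, 1, k - 1) " -> " substr(kmer, 2)
        }
    }'
}

# Print the length of the longest suffix of read $1 that equals a prefix
# of read $2 (the edge weight in an overlap graph); 0 means no overlap.
overlap_len() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = (length(a) < length(b)) ? length(a) : length(b)
        for (; n > 0; n--)
            if (substr(a, length(a) - n + 1) == substr(b, 1, n))
                break
        print n
    }'
}
```

 +For example, ''debruijn_edges TAATGCCATGGGATGTT 3 | sort | uniq -c'' lists every edge together with its multiplicity, and ''overlap_len TGGCA GCATTGCAA'' gives the overlap between the first two reads of the third problem.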
 +
 +===== Practical Example =====
 +
 +In this tutorial, we are going to assemble the genome of an unknown organism de novo. First, download the read data:
 +<​code|bash>​
 +wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz
 +wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz
 +</​code>​
 +The read data were produced by a sequencer and are stored in the **FASTQ** file format. Look into the first file and note that 4 consecutive lines represent a single read. Because the file is gzip-compressed, we can view it using commands such as ''zcat'' (be careful ...), ''zless'', ''zmore'', etc.
 +<​code|bash>​
 +zless SRR292770_1.fastq.gz
 +</​code>​
 +The first line contains an identifier, starting with ''@''. The second line contains the read itself; the third contains just ''+''. Find out what the fourth line means. You can use Wikipedia. How does the sequencing machine come up with the estimates?
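 +
 +Because every read occupies exactly 4 lines, simple line arithmetic is enough to inspect the file. The helpers below are a minimal sketch under that assumption (sequences not wrapped over several lines); the function names ''count_reads'' and ''read_sequences'' are hypothetical.

```shell
#!/bin/sh

# Number of reads in a gzip-compressed FASTQ file,
# assuming exactly 4 lines per read.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print only the sequence line (2nd line of each 4-line record).
read_sequences() {
    gzip -dc "$1" | awk 'NR % 4 == 2'
}

# Example usage (after the download above):
#   count_reads SRR292770_1.fastq.gz
#   read_sequences SRR292770_1.fastq.gz | head -n 3
```

 +Running ''count_reads'' on both downloaded files is a quick sanity check: for paired-end data the two counts should match.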
 +
 +Download and unpack the Velvet assembler. This algorithm was proposed here: [[https://​doi.org/​10.1101/​gr.074492.107|https://​doi.org/​10.1101/​gr.074492.107]].
 +<​code|bash>​
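 +# Alternative: download and unpack the 1.2.10 release tarball instead of cloning.
 +#wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
 +#tar zxvf velvet_1.2.10.tgz
 +# Clone into a directory named velvet_1.2.10 so that the paths used below stay valid.
 +git clone https://github.com/dzerbino/velvet velvet_1.2.10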
 +</​code>​
 +
 +Now build the assembler.
 +<​code|bash>​
 +cd velvet_1.2.10
 +make MAXKMERLENGTH=60 OPENMP=1
 +cd ..
 +</​code>​
 +
 +At this point, we are ready to run the assembly algorithm. Velvet first calculates hashes using the ''velveth'' command. Then the ''velvetg'' command is used for de Bruijn graph construction. Run
 +<​code|bash>​
 +./​velvet_1.2.10/​velveth
 +./​velvet_1.2.10/​velvetg
 +</​code>​
 +to find out about the usage of the commands. Remember that we use paired-end reads in two files. In this first experiment, set the hash length to 35 (i.e., the //k//-mer size is 35). ''velvetg'' has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set ''-cov_cutoff 2.81''. We are only interested in contigs of 200 base pairs or longer. Now assemble the genome yourself using appropriate commands.
 +
 +You can find out how many contigs were produced by running
 +<​code|bash>​
 +# Each contig starts with a '>' header line, so counting headers counts contigs.
 +grep -c '^>' <out_dir_35>/contigs.fa
 +</​code>​
 +This time, contigs are in **FASTA** format. Use BLAST to find out which organism was assembled.
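 +
 +A BLAST search needs only one reasonably long query, so it can help to pull the longest contig out of the FASTA file. The following is a minimal sketch; the function name ''longest_contig'' is hypothetical, and the sequence is printed on a single line regardless of how it was wrapped in the input.

```shell
#!/bin/sh

# Print the header and the sequence of the longest record in FASTA file $1.
longest_contig() {
    awk '
        # Close the current record and remember it if it is the longest so far.
        function finish() {
            if (length(seq) > length(best)) {
                best = seq
                best_id = id
            }
            seq = ""
        }
        /^>/ { finish(); id = $0; next }
        { seq = seq $0 }
        END {
            finish()
            if (best_id != "") {
                print best_id
                print best
            }
        }
    ' "$1"
}

# Example usage, with the output directory as a placeholder:
#   longest_contig your_output_directory/contigs.fa > query.fa
```

 +The resulting file can be pasted into the BLAST web form as the query sequence.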
 +
 +Change //k// and other settings of the Velvet assembler. Watch how they influence assembly results.
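 +
 +To compare runs with different settings, a one-line summary per ''contigs.fa'' is often enough. The helper below is a rough sketch, not part of Velvet; the function name ''contig_stats'' and the output format are assumptions made for illustration.

```shell
#!/bin/sh

# Print the number of contigs, their total length and the longest contig
# length for FASTA file $1.
contig_stats() {
    awk '
        # Close the current record and add its length to the totals.
        function finish() {
            if (len > 0) {
                n++
                total += len
                if (len > max) max = len
            }
            len = 0
        }
        /^>/ { finish(); next }
        { len += length($0) }
        END {
            finish()
            printf "contigs=%d total_bp=%d longest_bp=%d\n", n, total, max
        }
    ' "$1"
}

# Example usage, with one output directory per tested k as a placeholder:
#   for d in out_dir_*; do printf '%s: ' "$d"; contig_stats "$d/contigs.fa"; done
```

 +Fewer and longer contigs with a similar total length usually indicate a more contiguous assembly, but these numbers alone say nothing about misassemblies.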
 +
 +==== Visualization ====
 +
 +To visualize the assembly, you can use the [[http://rrwick.github.io/Bandage/|Bandage]] program. First, download the program:
 +
 +<​code|bash>​
 +wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
 +unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
 +</​code>​
 +To run it, call
 +<​code|bash>​
 +./Bandage
 +</​code>​
 +
 +The program visualizes the de Bruijn graph produced by the Velvet assembler. You can find the graph (Velvet writes it to a file named ''LastGraph'') in the folder where you put the Velvet output.
 +
 +{{ :​courses:​bin:​tutorials:​bandage_full.png?​nolink&​800 |}}
 +
 +{{ :​courses:​bin:​tutorials:​bandage_misassembly.png?​nolink&​800 |}}
 +
 +In the graph, find contigs that are not connected to the rest of the graph. Find likely assembly errors and repeats. Each block in the graph represents one contig.
 +
 +Another visualization tool is [[https://​ics.hutton.ac.uk/​tablet/​|Tablet]]. Download and install it by typing
 +<​code|bash>​
 +wget https://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_21_02_08.sh
 +# Make the downloaded installer executable and run it.
 +chmod +x tablet_linux_x64_1_21_02_08.sh
 +./tablet_linux_x64_1_21_02_08.sh
 +</​code>​
 +
 +We have to tell Velvet to store additional information about the assembly. For this purpose, call ''velvetg'' again with the additional parameter ''-amos_file yes''. Next, open the resulting ''.afg'' file in Tablet, for example by
 +<​code|bash>​
 +tablet your_output_directory/​velvet_asm.afg
 +</​code>​
 +
 +You can use ''tablet'' to identify assembly errors such as the one in the picture below.
 +
 +{{ :​courses:​bin:​tutorials:​tablet_assembly_error.png?​nolink&​800 |}}
 +
 +How would you explain the following situation, in which a short subsequence of a contig has twice the coverage of the rest of the contig?
 +
 +{{ :​courses:​bin:​tutorials:​tablet_repeat.png?​nolink&​800 |}}
 +
 +More details about using the Velvet assembler and other tools can be found on the [[https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/user/18/private/velvet-practical_part-1.pdf|ENA website]].
courses/bin/tutorials/tutorial2.txt · Last modified: 2024/03/28 16:16 by rysavpe1