Differences

This shows you the differences between two versions of the page.

--- courses:bin:tutorials:tutorial2 [2019/02/25 12:58]
+++ courses:bin:tutorials:tutorial2 [2024/02/09 10:17]
127.0.0.1 external edit
@@ Line 1: / Line 1: @@
+====== Tutorial 2 - de Bruijn and Overlap graphs, Velvet tutorial  ======
+===== Problems =====
+  - Construct a de Bruijn graph for genome ''TAATGCCATGGGATGTT'' and $k=3$. Identify all possible assemblies from the graph. Why is the resulting assembly not unique?
+  - Now do the same; however, set $k=5$.
+  - Consider reads ''TGGCA'', ''GCATTGCAA'', ''TGCAAT'', ''CAATT'', ''ATTTGAC''.
+      - Assemble the reads using OLC (overlap-layout-consensus, i.e., the Hamiltonian approach).
+      - Assemble the reads using a de Bruijn graph with $k=4$.
+      - Assemble the reads using a de Bruijn graph with $k=5$.
+      - Regarding the de Bruijn approach, which situation is better? How does the algorithm deal with ambiguous paths?
+===== Practical Example =====
+In this tutorial, we are going to de-novo assembly a genome of an unknown organism. First, download the read data:
+<code|bash>
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz
+</code>
+The read data were produced by a sequencer. The **FASTQ** file format is used. Look into the first file and note that 4 consecutive lines represent a single read. Because the file is zipped, we can view it using commands as ''zcat'' (be careful ...), ''zless'', ''zmore'', etc.
+<code|bash>
+zless SRR292770_1.fastq.gz
+</code>
+The first line contains an identifier, starting with ''@''. The second line contains the read itself; the third contains just ''+''. Find out what is the meaning of the fourth line. You can use Wikipedia. How does the sequencing machine come up with the estimates?
+Download and unpack the Velvet assembler. This algorithm was proposed here: [[https://doi.org/10.1101/gr.074492.107|https://doi.org/10.1101/gr.074492.107]].
+<code|bash>
+#wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
+#tar zxvf velvet_1.2.10.tgz
+git clone https://github.com/dzerbino/velvet
+</code>
+Now build the assembler.
+<code|bash>
+cd velvet_1.2.10
+make MAXKMERLENGTH=60 OPENMP=1
+cd ..
+</code>
+At this point, we are ready to run the assembly algorithm. Velvet first calculates hashes, using ''velveth'' command. Then ''velvetg'' command is used for deBruijn graph construction. Run
+<code|bash>
+./velvet_1.2.10/velveth
+./velvet_1.2.10/velvetg
+</code>
+to find out about the usage of the commands. Remember that we use paired-end reads in two files. In this first experiment set hash length to 35 (i.e., //k//-mer size is 35). ''velvetg'' has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set ''-cov_cutoff 2.81''. We are only interested in contigs long 200 base-pairs or more. Now assemble the genome yourself using appropriate commands.
+You can find out how many contigs were produced by running
+<code|bash>
+cat <out_dir_35>/contigs.fa
+</code>
+This time, contigs are in **FASTA** format. Use BLAST to find out which organism was assembled.
+Change //k// and other settings of the Velvet assembler. Watch how they influence assembly results.
+==== Visualization ====
+To visualize the assembly, you can use [[http://rrwick.github.io/Bandage/|Bandage]] program. First, download the program
+<code|bash>
+wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
+unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
+</code>
+To run it, call
+<code|bash>
+./Bandage
+</code>
+The program visualizes the de-Bruijn graph from the Velvet assembler, which you can find in the folder where you put the output from the velvet assembler.
+{{ :courses:bin:tutorials:bandage_full.png?nolink&800 |}}
+{{ :courses:bin:tutorials:bandage_misassembly.png?nolink&800 |}}
+In the graph, find contigs, that are not connected with the remains of the graph. Find likely assembly errors and repeats. Each block in the graph represents one contig.
+Another visualization tool is [[https://ics.hutton.ac.uk/tablet/|Tablet]]. Download and install it by typing
+<code|bash>
+wget https://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_21_02_08.sh
+chmod +x tablet_linux_x64_1_17_08_17.sh
+./tablet_linux_x64_1_17_08_17.sh
+</code>
+We have to tell velvet to store additional statistics about assembly. For this purpose, call ''velvetg'' again with an additional parameter '' -amos_file yes''. Next, open the ''.afg'' file in Tablet, for example by
+<code|bash>
+tablet your_output_directory/velvet_asm.afg
+</code>
+You can use ''tablet'' to identify assembly errors as one below in the picture.
+{{ :courses:bin:tutorials:tablet_assembly_error.png?nolink&800 |}}
+How can you explain the following situation when a short subsequence of a contig has twice that high coverage as remains of the contig?
+{{ :courses:bin:tutorials:tablet_repeat.png?nolink&800 |}}
+More details about using the Velvet assembler and other tools can be found on [[https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/user/18/private/velvet-practical_part-1.pdf|ENA website]].