CourseWare Wiki
Switch Term
Summer 2023 / 2024
Summer 2022 / 2023
Summer 2021 / 2022
Summer 2020 / 2021
Summer 2019 / 2020
Summer 2018 / 2019
Summer 2017 / 2018
Search
Log In
b232
courses
bin
tutorials
tutorial2
Differences
This shows you the differences between two versions of the page.
View differences:
Side by Side
Inline
Go
Link to this comparison view
Both sides previous revision
Previous revision
2024/03/28 16:16 rysavpe1
2024/02/09 10:17 external edit
Go
2024/03/28 16:16 rysavpe1
2024/02/09 10:17 external edit
Go
Last revision
Both sides next revision
courses:bin:tutorials:tutorial2 [2019/02/25 09:50]
courses:bin:tutorials:tutorial2 [2024/02/09 10:17]
127.0.0.1
external edit
Line 1:
Line 1:
+
====== Tutorial 2 - de Bruijn and Overlap graphs, Velvet tutorial ======
+
+
===== Problems =====
+
+
- Construct a de Bruijn graph for genome ''TAATGCCATGGGATGTT'' and $k=3$. Identify all possible assemblies from the graph. Why is the resulting assembly not unique?
+
- Now do the same; however, set $k=5$.
+
- Consider reads ''TGGCA'', ''GCATTGCAA'', ''TGCAAT'', ''CAATT'', ''ATTTGAC''.
+
- Assemble the reads using OLC (overlap-layout-consensus, i.e., the Hamiltonian approach).
+
- Assemble the reads using a de Bruijn graph with $k=4$.
+
- Assemble the reads using a de Bruijn graph with $k=5$.
+
- Regarding the de Bruijn approach, which situation is better? How does the algorithm deal with ambiguous paths?
+
+
===== Practical Example =====
+
+
In this tutorial, we are going to de-novo assembly a genome of an unknown organism. First, download the read data:
+
<code|bash>
+
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz
+
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz
+
</code>
+
The read data were produced by a sequencer. The **FASTQ** file format is used. Look into the first file and note that 4 consecutive lines represent a single read. Because the file is zipped, we can view it using commands as ''zcat'' (be careful ...), ''zless'', ''zmore'', etc.
+
<code|bash>
+
zless SRR292770_1.fastq.gz
+
</code>
+
The first line contains an identifier, starting with ''@''. The second line contains the read itself; the third contains just ''+''. Find out what is the meaning of the fourth line. You can use Wikipedia. How does the sequencing machine come up with the estimates?
+
+
Download and unpack the Velvet assembler. This algorithm was proposed here: [[https://doi.org/10.1101/gr.074492.107|https://doi.org/10.1101/gr.074492.107]].
+
<code|bash>
+
#wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
+
#tar zxvf velvet_1.2.10.tgz
+
git clone https://github.com/dzerbino/velvet
+
</code>
+
+
Now build the assembler.
+
<code|bash>
+
cd velvet_1.2.10
+
make MAXKMERLENGTH=60 OPENMP=1
+
cd ..
+
</code>
+
+
At this point, we are ready to run the assembly algorithm. Velvet first calculates hashes, using ''velveth'' command. Then ''velvetg'' command is used for deBruijn graph construction. Run
+
<code|bash>
+
./velvet_1.2.10/velveth
+
./velvet_1.2.10/velvetg
+
</code>
+
to find out about the usage of the commands. Remember that we use paired-end reads in two files. In this first experiment set hash length to 35 (i.e., //k//-mer size is 35). ''velvetg'' has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set ''-cov_cutoff 2.81''. We are only interested in contigs long 200 base-pairs or more. Now assemble the genome yourself using appropriate commands.
+
+
You can find out how many contigs were produced by running
+
<code|bash>
+
cat <out_dir_35>/contigs.fa
+
</code>
+
This time, contigs are in **FASTA** format. Use BLAST to find out which organism was assembled.
+
+
Change //k// and other settings of the Velvet assembler. Watch how they influence assembly results.
+
+
==== Visualization ====
+
+
To visualize the assembly, you can use [[http://rrwick.github.io/Bandage/|Bandage]] program. First, download the program
+
+
<code|bash>
+
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
+
unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
+
</code>
+
To run it, call
+
<code|bash>
+
./Bandage
+
</code>
+
+
The program visualizes the de-Bruijn graph from the Velvet assembler, which you can find in the folder where you put the output from the velvet assembler.
+
+
{{ :courses:bin:tutorials:bandage_full.png?nolink&800 |}}
+
+
{{ :courses:bin:tutorials:bandage_misassembly.png?nolink&800 |}}
+
+
In the graph, find contigs, that are not connected with the remains of the graph. Find likely assembly errors and repeats. Each block in the graph represents one contig.
+
+
Another visualization tool is [[https://ics.hutton.ac.uk/tablet/|Tablet]]. Download and install it by typing
+
<code|bash>
+
wget https://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_21_02_08.sh
+
chmod +x tablet_linux_x64_1_17_08_17.sh
+
./tablet_linux_x64_1_17_08_17.sh
+
</code>
+
+
We have to tell velvet to store additional statistics about assembly. For this purpose, call ''velvetg'' again with an additional parameter '' -amos_file yes''. Next, open the ''.afg'' file in Tablet, for example by
+
<code|bash>
+
tablet your_output_directory/velvet_asm.afg
+
</code>
+
+
You can use ''tablet'' to identify assembly errors as one below in the picture.
+
+
{{ :courses:bin:tutorials:tablet_assembly_error.png?nolink&800 |}}
+
+
How can you explain the following situation when a short subsequence of a contig has twice that high coverage as remains of the contig?
+
+
{{ :courses:bin:tutorials:tablet_repeat.png?nolink&800 |}}
+
+
More details about using the Velvet assembler and other tools can be found on [[https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/user/18/private/velvet-practical_part-1.pdf|ENA website]].
courses/bin/tutorials/tutorial2.txt
· Last modified: 2024/03/28 16:16 by
rysavpe1