TAATGCCATGGGATGTT
and k=3. Identify all possible assemblies from the graph. Why is the resulting assembly not unique?
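To check a hand-drawn graph, the edges can be listed mechanically. The sketch below assumes the common convention in which nodes are (k-1)-mers and every k-mer of the sequence contributes one edge; `debruijn_edges` is a hypothetical helper name, not part of any tool used in this tutorial.

```bash
# List de Bruijn graph edges for a sequence: one edge per k-mer,
# going from the k-mer's (k-1)-prefix to its (k-1)-suffix.
# Assumption: nodes are (k-1)-mers and edges are k-mers.
debruijn_edges() {
  awk -v s="$1" -v k="$2" 'BEGIN {
    for (i = 1; i + k - 1 <= length(s); i++) {
      kmer = substr(s, i, k)
      print substr(kmer, 1, k - 1) " -> " substr(kmer, 2)
    }
  }'
}

# Collapse repeated edges and show how often each one occurs.
debruijn_edges TAATGCCATGGGATGTT 3 | sort | uniq -c
```

Edges with a count above one come from k-mers that occur several times in the sequence. Such repeats are where a walk through the graph can branch, which is worth keeping in mind when answering the question about uniqueness.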
TGGCA, GCATTGCAA, TGCAAT, CAATT, ATTTGAC.
In this tutorial, we are going to assemble the genome of an unknown organism de novo. First, download the read data:
bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/SRR292770/SRR292770_2.fastq.gz
zcat (be careful …), zless, zmore, etc.
bash
zless SRR292770_1.fastq.gz
@. The second line contains the read itself; the third contains just +. Find out what the fourth line means; Wikipedia is a good place to start. How does the sequencing machine come up with the estimates?
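Once you have looked up the format, a short sketch like the one below can turn a quality string into numbers. It assumes the Phred+33 encoding (each character's ASCII code minus 33 is the quality score Q, with error probability 10^(-Q/10)); check that this matches your file before relying on it. `phred_scores` is a hypothetical helper and the input string is made up for illustration.

```bash
# Decode a FASTQ quality string into Phred scores.
# Assumption: Phred+33 encoding (score = ASCII code - 33).
phred_scores() {
  printf '%s' "$1" |
    od -An -v -tu1 |   # dump each byte as an unsigned decimal number
    awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
         END { print "" }'
}

# Made-up quality string, not taken from the downloaded reads.
phred_scores 'II5+#'
```

To apply it to real data, pass the fourth line of a record, for example the output of `zcat SRR292770_1.fastq.gz | sed -n '4p'`.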
Download and unpack the Velvet assembler. This algorithm was proposed here: https://doi.org/10.1101/gr.074492.107.
bash
wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar zxvf velvet_1.2.10.tgz
Now build the assembler.
bash
cd velvet_1.2.10
make MAXKMERLENGTH=60 OPENMP=1
cd ..
At this point, we are ready to run the assembly algorithm. Velvet first computes hashes with the velveth command. The velvetg command is then used to construct the de Bruijn graph. Run both commands without arguments to see their options:
bash
./velvet_1.2.10/velveth
./velvet_1.2.10/velvetg
velvetg has several options that can help it with graph construction. We know that the expected coverage of the sequencing experiment was 21. Set -cov_cutoff 2.81. We are only interested in contigs of 200 base pairs or longer. Now assemble the genome yourself using the appropriate commands.
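If you get stuck, the following sketch shows one possible shape of the two commands. It only prints them, so you can compare against your own attempt first. Several things here are assumptions rather than part of the exercise text: the reads are treated as paired-end short reads in two separate gzipped FASTQ files, k is set to 35, the output directory name `out_dir_35` is hypothetical, and `velvet_plan` is a made-up helper. Check each flag against the usage text printed by velveth and velvetg.

```bash
# Print (do not run) a velveth/velvetg command pair for a given k.
# Assumptions: paired-end reads in two separate .fastq.gz files,
# hypothetical output directory out_dir_<k>.
velvet_plan() {
  k="$1"
  out="out_dir_$k"
  echo "./velvet_1.2.10/velveth $out $k -fastq.gz -shortPaired -separate SRR292770_1.fastq.gz SRR292770_2.fastq.gz"
  echo "./velvet_1.2.10/velvetg $out -exp_cov 21 -cov_cutoff 2.81 -min_contig_lgth 200"
}

velvet_plan 35
```

Once the printed commands look right, `velvet_plan 35 | bash` would execute them. Keep k no larger than the MAXKMERLENGTH used when building Velvet.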
You can find out how many contigs were produced by running
bash
grep -c '^>' <out_dir_35>/contigs.fa
Change k and other settings of the Velvet assembler. Watch how they influence assembly results.
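When comparing several runs, it helps to reduce each contigs.fa to a couple of numbers. The sketch below is one way to do that with awk; `contig_stats` is a hypothetical helper, and it assumes a plain FASTA file in which header lines start with `>` and all other lines are sequence.

```bash
# Summarise a FASTA file: number of contigs and total assembled length.
# Assumption: header lines start with ">", all other lines are sequence.
contig_stats() {
  awk '/^>/ { n++; next }
       { total += length($0) }
       END { printf "contigs=%d total_bp=%d\n", n, total }' "$1"
}

# Small made-up FASTA file to show the output format.
demo=$(mktemp)
printf '>c1\nACGT\nAC\n>c2\nGGG\n' > "$demo"
contig_stats "$demo"
rm -f "$demo"
```

Running `contig_stats` on the contigs.fa of each output directory gives one line per assembly, which makes the effect of changing k easier to see.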