Assignment 2: SEQUENCE ASSEMBLY

15 points
Deadline: Wednesday, April 3rd, 2024, 23:59
Late submission penalty: -1 point per day, capped at 12 points
Submit to BRUTE.
Work individually.
Submit:
- a report describing your findings
- a bash script with the commands you used to generate all the results
This homework option is new this year; if you think something should be changed, please email petr.rysavy@fel.cvut.cz.

De-novo sequence assembly

Sequencing machines cannot read a whole nucleotide sequence at once. Instead, short fragments, called reads, are sequenced. In short-read sequencing, those fragments are tens to hundreds of nucleotides long; in long-read sequencing, reads span thousands of nucleotides. In this homework, we will go through practical sequence assembly in more detail than in class. However, the homework is more about finding the right tools and pipelines than about implementation (you do not need to implement anything at all; you only call existing tools). We will also experiment a bit with the outputs to see how well the tools work.

As the previous paragraph suggests, this homework is very open-ended, so your first task is to find input data. Get any raw real-world read data, either long-read or short-read. The data have to be suitable for de novo assembly. I recommend getting bacterial DNA (not a requirement). Briefly describe which species you obtained, where you downloaded the data from, the sequencer information, and essential characteristics such as the genome length. Hint: If you are not sure where to get such data, try NCBI SRA or its ENA counterpart. If you still do not know where to get data, an example is here: https://www.ebi.ac.uk/ena/browser/view/SRR747869. Feel free to use it, with a 1-point penalty.
Usually, you need to check the quality of the input data before doing the assembly. There are tools that visualize the read-length distribution, the quality of the sequenced reads, and similar characteristics. Find such a tool and present its outputs. Briefly describe what the plots show. Hint: It depends on the data you chose in the previous step, but maybe FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is relevant?
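To get a feeling for what such a quality-control tool summarizes, here is a minimal Python sketch that computes a read-length distribution and the mean base quality from FASTQ data. It is not a replacement for a real tool. It assumes the common four-lines-per-record FASTQ layout and the Phred+33 quality encoding (check what your sequencer actually produced); the function names and the toy reads are made up for illustration.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line, ignored here
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (assumption: Phred+33)."""
    return [ord(char) - offset for char in quality]


def summarize(handle):
    """Return (read-length histogram, mean Phred score over all bases)."""
    lengths = Counter()
    total_quality = 0
    total_bases = 0
    for _name, seq, qual in read_fastq(handle):
        lengths[len(seq)] += 1
        scores = phred_scores(qual)
        total_quality += sum(scores)
        total_bases += len(scores)
    mean_quality = total_quality / total_bases if total_bases else 0.0
    return dict(lengths), mean_quality


# Two hypothetical reads: 'I' decodes to Q40 and '!' to Q0 under Phred+33.
toy = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n!!!!II\n")
lengths, mean_quality = summarize(toy)
print(lengths)       # histogram: read length -> number of reads
print(mean_quality)  # mean Phred score over all bases
```

For a real file, pass an open file handle instead of the StringIO object. A dedicated tool additionally reports per-position quality, GC content, adapter content, and more, so use this only to understand the numbers in the plots.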
Assemble the reads using the Velvet assembler we saw in class.
Find another assembly algorithm of your choice. Reference the publication that describes it and briefly explain the basic idea of the assembler. Provide a link to the implementation. If you are not sure which algorithm to choose, feel free to use ABySS, SPAdes, or SSAKE.
Assemble the data using the other assembly algorithm.
Find a tool for assessing the quality of an assembly.
Compare the two assemblies from the previous steps. Analyze the results: which algorithm performed better, and why? What are the most common statistics used to assess the quality of an assembly?
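One statistic you will almost certainly meet is N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following Python sketch shows how it can be computed from a FASTA file of contigs. The helper names and the toy contigs are made up for illustration, and a real assessment tool reports many more statistics than this.

```python
def contig_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return N50: walk contigs from longest to shortest until half the total is covered."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Three hypothetical contigs of lengths 12, 3, and 5.
toy_fasta = ">c1\nACGTACGT\nACGT\n>c2\nACG\n>c3\nACGTA\n".splitlines()
lengths = contig_lengths(toy_fasta)
print(len(lengths), max(lengths), sum(lengths), n50(lengths))
```

Keep in mind that N50 only describes contiguity. An assembler that joins contigs incorrectly can achieve a high N50, so compare it together with other measures, such as the number of contigs and the total assembled length relative to the expected genome length.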
[Optional] Find a tool used for read filtering and repeat the steps with filtered data. What is the purpose of read cleaning?
Download the ART program (https://www.niehs.nih.gov/research/resources/software/biostatistics/art) and simulate artificial reads from the official assembly of the species of your choice.
Experiment with three or four sequencing settings and assemble the artificial reads. How does the quality of the assembly depend on the number of errors?
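When planning the simulated settings, it can help to estimate beforehand how many reads a given coverage corresponds to and how many erroneous bases per read to expect. The Python sketch below uses only the standard relations coverage = (number of reads × read length) / genome length and error probability = 10^(-Q/10) for a Phred score Q. The genome length, read length, and settings are hypothetical placeholders, not recommendations, and the sketch does not call ART.

```python
import math


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed so that reads * read_length / genome_length >= coverage."""
    return math.ceil(coverage * genome_length / read_length)


def phred_to_error(phred):
    """Per-base error probability for a Phred quality score."""
    return 10 ** (-phred / 10)


def expected_errors_per_read(read_length, error_rate):
    """Expected number of erroneous bases in one read, assuming a uniform error rate."""
    return read_length * error_rate


# Hypothetical setup: a 5 Mbp bacterial genome sequenced with 100 bp reads.
genome_length = 5_000_000
read_length = 100
for coverage, phred in [(10, 30), (20, 30), (20, 20), (20, 10)]:
    n_reads = reads_for_coverage(genome_length, read_length, coverage)
    errors = expected_errors_per_read(read_length, phred_to_error(phred))
    print(f"coverage {coverage}x, Q{phred}: {n_reads} reads, "
          f"about {errors:.2f} expected errors per read")
```

Varying one parameter at a time (for example, fixing the coverage and changing only the error level) makes it easier to attribute changes in assembly quality to the number of errors.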