Assignment 2: SEQUENCE ASSEMBLY

  • 15 points
  • Deadline: Wednesday, April 2nd, 2025, 23:59
  • Late submission penalty: -1 point per day, up to a maximum of 12 points
  • Submit to BRUTE.
  • Work individually.
  • Submit:
    • a report describing your findings
    • a bash script with the commands you used to generate all the results
  • This option for the homework was new in 2024; if you believe some changes should be made, please email petr.rysavy@fel.cvut.cz.

De novo sequence assembly

Sequencing machines cannot read a whole nucleotide sequence at once. Instead, they sequence short fragments called reads. In short-read sequencing, these fragments are tens to hundreds of nucleotides long; in long-read sequencing, reads span thousands of nucleotides. In this homework, we will go through practical sequence assembly in more detail than in class. However, the homework is more about finding the right tools and pipelines than about implementation (you do not need to implement anything at all; you only call existing tools). We will also experiment a bit with the outputs to see how well the tools work.
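To get a feeling for how long the reads in a given data set are, a few lines of awk are enough. The sketch below assumes an uncompressed FASTQ file with exactly four lines per record (no line-wrapped sequences); the function name `read_length_summary` is made up for this illustration.

```shell
#!/usr/bin/env bash

# Print the number of reads and their min / max / mean length.
# Assumption: $1 is a FASTQ file with 4 lines per record, so the
# sequence is always the 2nd line of each group of four.
read_length_summary() {
    awk 'NR % 4 == 2 {
             len = length($0)
             total += len
             n++
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END {
             if (n == 0) { print "reads=0"; exit }
             printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
         }' "$1"
}
```

For a gzipped file, bash process substitution can be used, for example `read_length_summary <(gzip -dc reads.fastq.gz)`, where `reads.fastq.gz` is a placeholder file name.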

  1. As the previous paragraph suggests, this homework is very open-ended, so your first task is to find input data. Get any raw real-world read data, either long-read or short-read. The data must be suitable for de novo assembly. I recommend bacterial DNA (not a requirement). Briefly describe which species you obtained, where you downloaded the data from, the sequencer information, and essential characteristics such as the genome length. The data can be downloaded from the NCBI SRA (see https://www.ncbi.nlm.nih.gov/sra ) or its ENA counterpart. If you still do not know where to get data, an example is here: https://www.ebi.ac.uk/ena/browser/view/SRR747869. Feel free to use it if you have no other preference.
  2. Usually, you need to check the quality of the input data before assembling it. There are tools that visualize the read-length distribution, the quality of the sequenced reads, and similar characteristics. Download the FastQC tool (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and briefly describe its outputs on the data you decided to use.
  3. Assemble the reads using the Velvet assembler we saw in class.
  4. Find another assembly algorithm of your choice. Reference the publication that describes it and briefly explain the basic idea of the assembler. Provide a link to the implementation. If you are not sure which algorithm to choose, feel free to use ABySS, SPAdes, or SSAKE. SPAdes is the recommended option.
  5. Assemble the data using the other assembly algorithm.
  6. Find a tool for assessing the quality of an assembly.
  7. Compare the previous two assemblies. Analyze the results: which algorithm performed better, and why? For the comparison, QUAST or assembly-stats can be used. Feel free to choose either. What are the most common statistics used to check the quality of an assembly? Discuss the outputs of the tool.
  8. [Optional] Find a tool used for read filtering and repeat the steps with filtered data. What is the purpose of read cleaning?
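
The steps above could be collected into a bash script along the following lines. This is only a rough sketch under several assumptions: the reads are single-end and stored in one gzipped FASTQ file, SPAdes is chosen as the second assembler, QUAST is chosen for the comparison, and all tools are already installed and on the PATH. The variable names, the default file name `reads.fastq.gz`, the k-mer length of 31, and the `run` helper are illustrative choices, not part of the assignment. Paired-end or long-read data need different options (see each tool's manual).

```shell
#!/usr/bin/env bash

# Hypothetical defaults; adjust to your own data.
READS="${READS:-reads.fastq.gz}"   # single-end reads (assumed file name)
KMER="${KMER:-31}"                 # Velvet k-mer length (must be odd)
OUT="${OUT:-results}"              # root directory for all outputs
DRY_RUN="${DRY_RUN:-0}"            # set to 1 to only print the commands

# Print a command and, unless DRY_RUN=1, execute it.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# The body runs in a subshell so that `set -e` stops the pipeline at
# the first failing tool without affecting the calling shell.
assembly_pipeline() (
    set -e
    # Step 2: quality report (FastQC needs an existing output directory).
    run mkdir -p "$OUT/fastqc"
    run fastqc -o "$OUT/fastqc" "$READS"
    # Step 3: Velvet, hashing with velveth and then assembly with velvetg.
    run velveth "$OUT/velvet" "$KMER" -short -fastq.gz "$READS"
    run velvetg "$OUT/velvet" -cov_cutoff auto -exp_cov auto
    # Step 5: second assembler (here SPAdes, single-end mode).
    run spades.py -s "$READS" -o "$OUT/spades"
    # Step 7: compare both sets of contigs.
    run quast.py -o "$OUT/quast" "$OUT/velvet/contigs.fa" "$OUT/spades/contigs.fasta"
)
```

Calling `DRY_RUN=1 assembly_pipeline` prints the six commands without running anything, which is a cheap way to check paths before starting a long assembly; calling `assembly_pipeline` alone runs them.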
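
One statistic that both QUAST and assembly-stats report is N50: the contig length L such that contigs of length at least L together cover at least half of the total assembly length. To make the definition concrete, here is a minimal sketch that computes it directly from a FASTA file of contigs. The function names `contig_lengths` and `n50` are made up for this illustration, and dedicated tools may handle edge cases (for example minimum contig length filters) differently.

```shell
#!/usr/bin/env bash

# Print one sequence length per FASTA record; sequences may be
# wrapped over several lines.
contig_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# N50: sort lengths from longest to shortest, then walk down the list
# until the running sum reaches at least half of the total length.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (2 * cum >= total) { print lens[i]; exit }
            }
        }'
}
```

For example, for contigs of lengths 8, 6, 4, and 2 (total 20), the longest contig alone covers only 8 nucleotides, while the two longest cover 14, which is at least half of 20, so the N50 is 6.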
courses/bin/assignments/hw2a.txt · Last modified: 2025/02/20 12:23 by rysavpe1