Assignment 2: BLAST ALIGNMENT

15 points (+2 extra points possible)
Deadline: Wednesday, April 5th, 2023, 23:59
Late submission penalty: -1 point per day but no more than 12 points
Submit to BRUTE.
Work individually.
Submit
- your source codes in a language of your choice
- bash script compile.sh that compiles your codes
- a wrapper bash script blast.sh that calculates the alignment
- your source codes must compile either
  - on Ubuntu machines in the lab, or
  - on a freshly installed Ubuntu machine (if this is the case, provide file packages.txt with a list of packages that need to be installed)
- A short report as explained below.
This option for the homework is new this year, if you believe that there should be some changes done, please email petr.rysavy@fel.cvut.cz.

Nucleotide BLAST algorithm

The BLAST algorithm is a common tool to search huge sequence databases. If you visit https://blast.ncbi.nlm.nih.gov/Blast.cgi, you can find out that in minutes the algorithm searches database that contains DNA sequences of many known organisms. The original paper https://dx.doi.org/10.1016%2FS0022-2836%2805%2980360-2 was published in the 1990s, nevertheless, there was a rapid development afterward.

Your task will be to implement a simplified version of the BLAST algorithm. Study how the algorithm works and implement your own version. Start by downloading the Human Genome from the Ensembl database https://www.ensembl.org/Homo_sapiens/Info/Index. The chromosomes will form our database. Next, find a sequence of interest, for example hemoglobin gene sequence HBA1 might be a good test sequence:

actcttctggtccccacagactcagagagaacccaccatggtgctgtctcctgccgacaa
gaccaacgtcaaggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcgga
ggccctggagaggtgaggctccctcccctgctccgacccgggctcctcgcccgcccggac
ccacaggccaccctcaaccgtcctggccccggacccaaaccccacccctcactctgcttc
tccccgcaggatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacct
gagccacggctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaa
cgccgtggcgcacgtggacgacatgcccaacgcgctgtccgccctgagcgacctgcacgc
gcacaagcttcgggtggacccggtcaacttcaaggtgagcggcgggccgggagcgatctg
ggtcgaggggcgagatggcgccttcctcgcagggcagaggatcacgcgggttgcgggagg
tgtagcgcaggcggcggctgcgggcctgggccctcggccccactgaccctcttctctgca
cagctcctaagccactgcctgctggtgaccctggccgcccacctccccgccgagttcacc
cctgcggtgcacgcctccctggacaagttcctggcttctgtgagcaccgtgctgacctcc
aaataccgttaagctggagcctcggtggccatgcttcttgccccttgggcctccccccag
cccctcctccccttcctgcacccgtacccccgtggtctttgaataaagtctgagtgggcg
gca

This protein is located on chromosome 16, therefore use at least chromosome 16 when searching for the sequence above.

Start with reviewing how blast algorithm works. You might want to read the original Atschul paper, gapped blast update or any other source explaining BLAST in detail. You might take inspiration from the very simplified version we did on tutorials.

Program blast.sh should accept the following arguments:

argument	meaning
`-e`	path to the score matrix in CSV format
`-p`	gap penalization (optional, depending on your version)
`-x`	gap extension penalization gap penalization (optional, depending on your version)
`-d`	list of database files in fasta format, for example, `-d chr1.fasta chr2.fasta`
`-k`	length of the word (you will probably need this one)
`-t`	threshold on the seed score (this one will be likely needed as well)

Any other arguments are good as long as you mention their meaning in the report. Once the user runs the blast.sh script, the program loads the database sequences and listens on standard input for query sequences. Once the query sequence is found in the database, the program outputs all matches containing the following information:

Name of the database file with the match,
range of the match in the database file,
range in the query,
time needed for the search.

Finally, write a short report (no longer than two pages!). Explain how your implementation works, what did you implement, whatnot. Explain where I can find various components of your code. Of course, your implementation does not need to be as perfect as the original one but nevertheless, we can compare the implementations. Did your algorithm find hemoglobin as expected? Did it find other genes from the globin family? What about other test sequences? How fast is your algorithm compared to the BLASTn server? How many chromosomes (or nucleotides) are you able to fit in your database if you are allowed to allocate only 4GB of memory? What obstacles did you have during your implementation?

External libraries

The alignment implementation must be solely your own work. However, you can use any external library of your choice for the following:

Command line arguments parsing. In Java, apache commons cli may be useful. In Python, the argparse library is handy.
Scoring matrix parsing. For example, Java: Apache commons csv, Python: csv
FASTA file reading. Java: jFasta or Picard tools, Python: biopython.
If you need anything else, ask me in advance.

Don't forget that your code needs to compile!

Table of Contents

Assignment 2: BLAST ALIGNMENT

Nucleotide BLAST algorithm

External libraries