====== Assignment 2: BLAST ALIGNMENT  ======

  * 15 points
  * Deadline: Wednesday, April 2nd, 2025, 23:59
  * Late submission penalty: -1 point per day but no more than 12 points
  * Submit to [[https://cw.felk.cvut.cz/brute/|BRUTE]].
  * Work individually.
  * Submit
     * your source codes in a language of your choice
     * bash script ''compile.sh'' that compiles your codes
     * a wrapper bash script ''blast.sh'' that calculates the alignment
     * your source codes must compile either 
        * on Ubuntu machines in the lab, or
        * on a freshly installed Ubuntu machine (if this is the case, provide file ''packages.txt'' with a list of packages that need to be installed)
     * A short report as explained below. 
  * This option for the homework is new this year, if you believe that there should be some changes done, please email [[petr.rysavy@fel.cvut.cz|petr.rysavy@fel.cvut.cz]].
  
===== Nucleotide BLAST algorithm =====

The BLAST algorithm is a common tool to search huge sequence databases. If you visit
[[https://blast.ncbi.nlm.nih.gov/Blast.cgi]], you can find out that in minutes the
algorithm searches database that contains DNA sequences of many known organisms. The
original paper [[https://dx.doi.org/10.1016%2FS0022-2836%2805%2980360-2]] was published
in the 1990s, nevertheless, there was a rapid development afterward.

Your task will be to implement a simplified version of the BLAST algorithm. Study how
the algorithm works and implement your own version. Start by downloading the Human
Genome from the Ensembl database [[https://www.ensembl.org/Homo_sapiens/Info/Index]].
The chromosomes will form our database. Next, find a sequence of interest, for example
[[https://www.ncbi.nlm.nih.gov/nuccore/NC_000016.10?strand=1&report=genbank&from=176680&to=177522|hemoglobin gene sequence HBA1]]
might be a good test sequence:
<code>actcttctggtccccacagactcagagagaacccaccatggtgctgtctcctgccgacaa
gaccaacgtcaaggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcgga
ggccctggagaggtgaggctccctcccctgctccgacccgggctcctcgcccgcccggac
ccacaggccaccctcaaccgtcctggccccggacccaaaccccacccctcactctgcttc
tccccgcaggatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacct
gagccacggctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaa
cgccgtggcgcacgtggacgacatgcccaacgcgctgtccgccctgagcgacctgcacgc
gcacaagcttcgggtggacccggtcaacttcaaggtgagcggcgggccgggagcgatctg
ggtcgaggggcgagatggcgccttcctcgcagggcagaggatcacgcgggttgcgggagg
tgtagcgcaggcggcggctgcgggcctgggccctcggccccactgaccctcttctctgca
cagctcctaagccactgcctgctggtgaccctggccgcccacctccccgccgagttcacc
cctgcggtgcacgcctccctggacaagttcctggcttctgtgagcaccgtgctgacctcc
aaataccgttaagctggagcctcggtggccatgcttcttgccccttgggcctccccccag
cccctcctccccttcctgcacccgtacccccgtggtctttgaataaagtctgagtgggcg
gca
</code>
This protein is located on chromosome 16, therefore use at least chromosome
16 when searching for the sequence above.

Start with reviewing how blast algorithm works. You might want to
read [[https://academic.oup.com/nar/article/25/17/3389/1061651|the original Atschul paper]],
[[https://academic.oup.com/nar/article/25/17/3389/1061651|gapped blast update]]
or any other source explaining BLAST in detail. You might take inspiration
from the very simplified version we did on tutorials.

Program ''blast.sh'' should accept the following arguments:
| argument | meaning |
| ''-e'' | path to the score matrix in CSV format |
| ''-p'' | gap penalization (optional, depending on your version) |
| ''-x'' | gap extension penalization gap penalization (optional, depending on your version) |
| ''-d'' | list of database files in fasta format, for example, ''-d chr1.fasta chr2.fasta'' |
| ''-k'' | length of the word (you will probably need this one) |
| ''-t'' | threshold on the seed score (this one will be likely needed as well) |
Any other arguments are good as long as you mention their meaning in the
report. Once the user runs the ''blast.sh'' script, the program loads the
database sequences and listens on standard input for query sequences. Once
the query sequence is found in the database, the program outputs all matches
containing the following information:
  * Name of the database file with the match,
  * range of the match in the database file,
  * range in the query,
  * time needed for the search.

Finally, write a short report (no longer than two pages!). Explain how your implementation
works, what did you implement, whatnot. Explain where I can find various components of
your code. Of course, your implementation does not need to be as perfect as the original
one but nevertheless, we can compare the implementations. Did your algorithm find hemoglobin
as expected? Did it find other genes from the globin family? What about other test sequences?
How fast is your algorithm compared to the BLASTn server? How many chromosomes (or nucleotides)
are you able to fit in your database if you are allowed to allocate only 4GB of memory? What
obstacles did you have during your implementation?

===== External libraries =====

The alignment implementation must be solely your own work. However, you can use any external library of your choice for the following:
  * Command line arguments parsing. In Java, [[https://mvnrepository.com/artifact/commons-cli/commons-cli/1.3.1|apache commons cli]] may be useful. In Python, the [[https://docs.python.org/3.7/howto/argparse.html|argparse]] library is handy.
  * Scoring matrix parsing. For example, Java: [[https://commons.apache.org/proper/commons-csv/|Apache commons csv]], Python: [[https://docs.python.org/3/library/csv.html|csv]]
  * FASTA file reading. Java: [[http://jfasta.sourceforge.net/|jFasta]] or [[http://broadinstitute.github.io/picard/|Picard tools]], Python: [[https://biopython.org/wiki/SeqIO|biopython]].
  * If you need anything else, ask me in advance.
Don't forget that your code needs to compile!