====== Assignment 2: BLAST ALIGNMENT ====== * 15 points * Deadline: Wednesday, April 3rd, 2024, 23:59 * Late submission penalty: -1 point per day but no more than 12 points * Submit to [[https://cw.felk.cvut.cz/brute/|BRUTE]]. * Work individually. * Submit * your source codes in a language of your choice * bash script ''compile.sh'' that compiles your codes * a wrapper bash script ''blast.sh'' that calculates the alignment * your source codes must compile either * on Ubuntu machines in the lab, or * on a freshly installed Ubuntu machine (if this is the case, provide file ''packages.txt'' with a list of packages that need to be installed) * A short report as explained below. * This option for the homework is new this year, if you believe that there should be some changes done, please email [[petr.rysavy@fel.cvut.cz|petr.rysavy@fel.cvut.cz]]. ===== Nucleotide BLAST algorithm ===== The BLAST algorithm is a common tool to search huge sequence databases. If you visit [[https://blast.ncbi.nlm.nih.gov/Blast.cgi]], you can find out that in minutes the algorithm searches database that contains DNA sequences of many known organisms. The original paper [[https://dx.doi.org/10.1016%2FS0022-2836%2805%2980360-2]] was published in the 1990s, nevertheless, there was a rapid development afterward. Your task will be to implement a simplified version of the BLAST algorithm. Study how the algorithm works and implement your own version. Start by downloading the Human Genome from the Ensembl database [[https://www.ensembl.org/Homo_sapiens/Info/Index]]. The chromosomes will form our database. Next, find a sequence of interest, for example [[https://www.ncbi.nlm.nih.gov/nuccore/NC_000016.10?strand=1&report=genbank&from=176680&to=177522|hemoglobin gene sequence HBA1]] might be a good test sequence: actcttctggtccccacagactcagagagaacccaccatggtgctgtctcctgccgacaa gaccaacgtcaaggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcgga ggccctggagaggtgaggctccctcccctgctccgacccgggctcctcgcccgcccggac ccacaggccaccctcaaccgtcctggccccggacccaaaccccacccctcactctgcttc tccccgcaggatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacct gagccacggctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaa cgccgtggcgcacgtggacgacatgcccaacgcgctgtccgccctgagcgacctgcacgc gcacaagcttcgggtggacccggtcaacttcaaggtgagcggcgggccgggagcgatctg ggtcgaggggcgagatggcgccttcctcgcagggcagaggatcacgcgggttgcgggagg tgtagcgcaggcggcggctgcgggcctgggccctcggccccactgaccctcttctctgca cagctcctaagccactgcctgctggtgaccctggccgcccacctccccgccgagttcacc cctgcggtgcacgcctccctggacaagttcctggcttctgtgagcaccgtgctgacctcc aaataccgttaagctggagcctcggtggccatgcttcttgccccttgggcctccccccag cccctcctccccttcctgcacccgtacccccgtggtctttgaataaagtctgagtgggcg gca This protein is located on chromosome 16, therefore use at least chromosome 16 when searching for the sequence above. Start with reviewing how blast algorithm works. You might want to read [[https://academic.oup.com/nar/article/25/17/3389/1061651|the original Atschul paper]], [[https://academic.oup.com/nar/article/25/17/3389/1061651|gapped blast update]] or any other source explaining BLAST in detail. You might take inspiration from the very simplified version we did on tutorials. Program ''blast.sh'' should accept the following arguments: | argument | meaning | | ''-e'' | path to the score matrix in CSV format | | ''-p'' | gap penalization (optional, depending on your version) | | ''-x'' | gap extension penalization gap penalization (optional, depending on your version) | | ''-d'' | list of database files in fasta format, for example, ''-d chr1.fasta chr2.fasta'' | | ''-k'' | length of the word (you will probably need this one) | | ''-t'' | threshold on the seed score (this one will be likely needed as well) | Any other arguments are good as long as you mention their meaning in the report. Once the user runs the ''blast.sh'' script, the program loads the database sequences and listens on standard input for query sequences. Once the query sequence is found in the database, the program outputs all matches containing the following information: * Name of the database file with the match, * range of the match in the database file, * range in the query, * time needed for the search. Finally, write a short report (no longer than two pages!). Explain how your implementation works, what did you implement, whatnot. Explain where I can find various components of your code. Of course, your implementation does not need to be as perfect as the original one but nevertheless, we can compare the implementations. Did your algorithm find hemoglobin as expected? Did it find other genes from the globin family? What about other test sequences? How fast is your algorithm compared to the BLASTn server? How many chromosomes (or nucleotides) are you able to fit in your database if you are allowed to allocate only 4GB of memory? What obstacles did you have during your implementation? ===== External libraries ===== The alignment implementation must be solely your own work. However, you can use any external library of your choice for the following: * Command line arguments parsing. In Java, [[https://mvnrepository.com/artifact/commons-cli/commons-cli/1.3.1|apache commons cli]] may be useful. In Python, the [[https://docs.python.org/3.7/howto/argparse.html|argparse]] library is handy. * Scoring matrix parsing. For example, Java: [[https://commons.apache.org/proper/commons-csv/|Apache commons csv]], Python: [[https://docs.python.org/3/library/csv.html|csv]] * FASTA file reading. Java: [[http://jfasta.sourceforge.net/|jFasta]] or [[http://broadinstitute.github.io/picard/|Picard tools]], Python: [[https://biopython.org/wiki/SeqIO|biopython]]. * If you need anything else, ask me in advance. Don't forget that your code needs to compile!