Search
compile.sh
blast.sh
packages.txt
The BLAST algorithm is a common tool to search huge sequence databases. If you visit https://blast.ncbi.nlm.nih.gov/Blast.cgi, you can find out that in minutes the algorithm searches database that contains DNA sequences of many known organisms. The original paper https://dx.doi.org/10.1016%2FS0022-2836%2805%2980360-2 was published in the 1990s, nevertheless, there was a rapid development afterward.
Your task will be to implement a simplified version of the BLAST algorithm. Study how the algorithm works and implement your own version. Start by downloading the Human Genome from the Ensembl database https://www.ensembl.org/Homo_sapiens/Info/Index. The chromosomes will form our database. Next, find a sequence of interest, for example hemoglobin gene sequence HBA1 might be a good test sequence:
actcttctggtccccacagactcagagagaacccaccatggtgctgtctcctgccgacaa gaccaacgtcaaggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcgga ggccctggagaggtgaggctccctcccctgctccgacccgggctcctcgcccgcccggac ccacaggccaccctcaaccgtcctggccccggacccaaaccccacccctcactctgcttc tccccgcaggatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacct gagccacggctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaa cgccgtggcgcacgtggacgacatgcccaacgcgctgtccgccctgagcgacctgcacgc gcacaagcttcgggtggacccggtcaacttcaaggtgagcggcgggccgggagcgatctg ggtcgaggggcgagatggcgccttcctcgcagggcagaggatcacgcgggttgcgggagg tgtagcgcaggcggcggctgcgggcctgggccctcggccccactgaccctcttctctgca cagctcctaagccactgcctgctggtgaccctggccgcccacctccccgccgagttcacc cctgcggtgcacgcctccctggacaagttcctggcttctgtgagcaccgtgctgacctcc aaataccgttaagctggagcctcggtggccatgcttcttgccccttgggcctccccccag cccctcctccccttcctgcacccgtacccccgtggtctttgaataaagtctgagtgggcg gca
Start with reviewing how blast algorithm works. You might want to read the original Atschul paper, gapped blast update or any other source explaining BLAST in detail. You might take inspiration from the very simplified version we did on tutorials.
Program blast.sh should accept the following arguments:
-e
-p
-x
-d
-d chr1.fasta chr2.fasta
-k
-t
Any other arguments are good as long as you mention their meaning in the report. Once the user runs the blast.sh script, the program loads the database sequences and listens on standard input for query sequences. Once the query sequence is found in the database, the program outputs all matches containing the following information:
Finally, write a short report (no longer than two pages!). Explain how your implementation works, what did you implement, whatnot. Explain where I can find various components of your code. Of course, your implementation does not need to be as perfect as the original one but nevertheless, we can compare the implementations. Did your algorithm find hemoglobin as expected? Did it find other genes from the globin family? What about other test sequences? How fast is your algorithm compared to the BLASTn server? How many chromosomes (or nucleotides) are you able to fit in your database if you are allowed to allocate only 4GB of memory? What obstacles did you have during your implementation?
The alignment implementation must be solely your own work. However, you can use any external library of your choice for the following:
Don't forget that your code needs to compile!