====== Tutorial 4 - BLAST, Star Alignment, Clustal Omega ======

===== Problem 1 - Multiple Sequence Alignment Score =====

Calculate the score of the following alignment
  - using the //sum-of-pairs// method (a match scores $+4$, a mismatch $-2$, an indel $-1$, and $s(\_,\_)=0$);
  - using the Shannon //entropy// method.

$$\begin{array}{l}
\mathtt{MQPILL\_G} \\
\mathtt{MLR\_LL\_G} \\
\mathtt{MK\_ILLL\_} \\
\mathtt{MPPVLLI\_}
\end{array}$$

Calculate the //consensus sequence//.

[Adapted from [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]] (no longer available).]

===== Problem 2 - STAR Alignment =====

Calculate a multiple sequence alignment using the star approach.

$$\begin{aligned}
s_1 &= \mathtt{CCTGCTGCAG} \\
s_2 &= \mathtt{GATGTGCCG} \\
s_3 &= \mathtt{GATGTGCAG} \\
s_4 &= \mathtt{CCGCTAGCAG} \\
s_5 &= \mathtt{CCTGTAGG}
\end{aligned}$$

A match scores $+1$; mismatches and indels score $-1$.

The following code may help you to calculate the pairwise alignments.
<code r>
# Install Biostrings only if it is not available yet.
# Note: in recent Bioconductor releases the pairwise alignment functions were
# moved to the pwalign package; if Biostrings reports this, install and load
# pwalign as well.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("Biostrings", quietly = TRUE))
  BiocManager::install("Biostrings")
library(Biostrings)

s1 <- "CCTGCTGCAG"
s2 <- "GATGTGCCG"
s3 <- "GATGTGCAG"
s4 <- "CCGCTAGCAG"
s5 <- "CCTGTAGG"

# Match +1, mismatch -1.
submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)

# Global alignment of every pair; with gapOpening = 0 and gapExtension = -1
# each gap position costs 1 and there is no extra penalty for opening a gap.
pairwiseAlignment(s1, s2, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s4, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
</code>

===== Problem 3 - CLUSTAL =====

Align the group

$$\begin{aligned}
s_1 &= \mathtt{ATTGCCATT\_\_} \\
s_2 &= \mathtt{ATC\_CAATTTT}
\end{aligned}$$

with the group

$$\begin{aligned}
s_3 &= \mathtt{ATGGCCATT} \\
s_4 &= \mathtt{ATCTTC\_TT}
\end{aligned}$$

using the approach of the CLUSTALW algorithm. Align the groups based on the two most similar sequences (one from each group), scoring matches as $+1$ and mismatches and gaps as $-1$. The respective guide tree is shown below.
{{ :courses:bin:tutorials:clustalw-guiding-tree.png?nolink&250 |}}

[Source: [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]] (no longer available).]

The following code may help you to decide which two sequences will guide the alignment.

<code r>
# Install Biostrings only if it is not available yet.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("Biostrings", quietly = TRUE))
  BiocManager::install("Biostrings")
library(Biostrings)

# The four sequences with the gaps of the group alignments removed.
s1 <- "ATTGCCATT"
s2 <- "ATCCAATTTT"
s3 <- "ATGGCCATT"
s4 <- "ATCTTCTT"

# Match +1, mismatch -1.
submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)

# Compare every sequence of the first group with every sequence of the second group.
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
</code>

===== Problem 4 - BLAST =====

Use the BLAST algorithm to find the local alignment of the query sequence
$$ \mathtt{IHNWALN} $$
in the database
$$ \mathtt{AFGIAAAHDWALNW}. $$
Use word length $k=3$, a threshold for high-scoring words $T=20$, and the [[http://rosalind.info/glossary/blosum62/|BLOSUM62 scoring matrix]].

===== Problem 5 - BLAST online =====

Use the [[https://blast.ncbi.nlm.nih.gov/Blast.cgi | NCBI BLAST page]] to find which species are likely to contain the following sequences in their DNA.
AAAACCGCTGATGAGCGTCGGTAAAGTACTGAATATGAACAACATCGCGGCAGCCGGCATGGTGGCAACGCTTGCCAACA ACATCCCGATGTTCGGCATGATGAAGCAGATGGATACCCGCGGCAAAGTCATCAACTGCGCCTTCGCCGTTTCCGCTGCT TTCGCCCTGGGCGACCACTTAGGCTTCGCCGCTGCCAACATGAACGCCATGATCTTCCCGATGATTGTCGGCAAGTTGAT CGGCGGCGTAACGGCGATTGGCGTGGCGATGATGCTGGTGCCAAAAGAAGACGCGACCGCGACTAAAACCGAAGCGGAGG CACAATCGTGAACACTCGCCAGCTATTGAGCGTCGGTATCGATATCGGCACCACCACCACCCAGGTGATTTTCTCCCACC TGGAGCTGGTTAACCGTGCGGCGGTGTCGCAGGTGCCGCGCTACGAATTCATTAAACGCGAAATTAGCTGGCAAAGTCCG GTGTTCTTTACCCCTGTCGATAAACAGGGCGGTTTAAAAGAAGCGGAACTGAAAACCTTAATACTCGAGCAATATCAGGC TGCGGGTATTGCGCCGGAAAGCGTTGATTCTGGTGCCATCATCATCACCGGTGAAAGCGCGAAAACCCGCAATGCTCGCC CGGCGGTGATGGCGCTCTCTCAATCGCTGGGGGATTTTGTCGTTGCCAGCGCCGGGCCGCACCTCGAATCCGTGATCGCC GGTCACGGAGCTGGGGCGCAAACCCTTTCTGAACAACGGCTGTGTCGGGTACTGAATATCGACATCGGCGGTGGCACCGC GAACTACGCCCTGTTCGATGCCGGAAAAATCAGCGGCACTGCCTGTCTCAACGTCGGTGGTCGCCTGCTGGAAACCGACA GCCAGGGGCGCGTGGTTTACGCTCATAAACCGGGGCAGATGATTGTGGATGAGTGCTTCGGTGCAGGCACTGACGCCCGT TCGCTGACCGGCGCGCAGCTGGTGCAGGTTACCCGGCGGATGGCGGCGCTGATTGTCGAAGTGATTGACGGAACGCTTTC GCCGCTCGCGCAGGCATTGATGCAAACCGGTTTGCTGCCCGCAGGTGTTACGCCCGAAATCATTACGCTTTCTGGAGGCG TGGGCGAATGTTATCGCCACCAGCCCGCCGACCCGTTCTGTTTTGCCGATATTGGCCCACTTCTGGCAACGGCGCTGCAT GACCATCCGCGCCTGCGTGAGATGAATGTGCAGTTTCCGGCGCAAACCGTACGCGCCACGGTGATTGGCGCGGGTGCACA TACCCTTTCGCTCTCTGGCAGCACAATCTGGCTGGAGGGCGTACAACTGCCGCTGCGCAATTTGCCGGTGGCGATCCCGA TTGATGAAACGGATCTGGTGAGTGCCTGGCAACAGGCGCTGCTTCAGCTGGATCTTGATCCCAAAACTGACGCGTACGTG CTGGCGCTTCCCGCCTCGCTGCCTGTGCGTTACGCCGCGGTACTGACGGTCATCAACGCGCTGGTCGATTTCGTCGCGCG TTTTCCGAATCCGCATCCCCTGCTGGTGGTGGCCGGGCAGGACTTTGGTAAAGCTCTGGGCATGTTGTTGCGCCCACAGC TACAACAACTCCCGTTGGCAGTCATTGACGAAGTGATTGTCCGCGCGGGGGACTATATCGACATTGGTACGCCTCTTTTT GGCGGATCGGTTGTGCCGGTGACGGTGAAATCACTCGCATTTCCTTCCTGAGGGAACGACTTATGAAACTAAAGACCACA TTGTTCGGCAATGTATATCAGTTTAAGGATGTAAAAGAGGTGCTGGCTAAAGCCAACGAACTGTGTTCGGGGGATGTGCT GGCAGGCGTTGCAGCGGCAAGTTCACAGGAGCGCGTGGCGGCAAAGCAGGTGTTGTCGGAAATGACCGTAGCGGACATCC 
GCAATAATCCGGTGATTGCCTATGAAGATGACTGCGTGACGCGGCTGATTCAGGACGATGTTAACGAAACGGCCTACAAC CCACAAGACGTCAAGTTTCCGGGCGGCGGCCAGATCGTTGGCGGAGTATACTTGCTGCCGCGCAGGGGCCCCAGGTTGGG TGTGCGCGCGGCAAGGAAAACTTCGGAGCGGTCACAGCCCCGTGGGAGACGCCAGCCCATCCCCAAAGATCGGCGTCCCA CTGGCAAGTCCTGGGGAAAACCAGGATACCCTTGGCCCTTATATGGGAACGAGGGGCTCGGCTGGGCAGGATGGCTCCTG TCCCCCCAGGGCTCCCGTCCCTCTTGGGGCCCCACTGACCCCCGGCGTAGGTCGCGCAATGTGGGTAAGGTCATCGACAC CCTAACGTGCGGCTTCGCCGACCTCATGGGGTACATCCCCGTCGTAGGCGCCCCGCTTGGCGGTGTCGCCAGAGCTCTCG CGCATGGCGTGAGGGCCCTGGAGGACGGGGTCAACTATGCAACAGGGAACTTACCCGGTTGCCCCTTTTCTATCTTCTTG CTGGCCCTACTGTCCTGCATCACCACTCCGGTCTCAGCTGCCCAGGTGAAAAACACCAGTGACATCTACATGGTGACTAA CGACTGTCCCAACAGCAGCATCACCTGGCAGCTTAGGGCCGCAGTCCTCCACGTCCCCGGATGTGTCCCGTGTGAGAAAG TGGGGAATACATCTCAGTGCTGGACGCCGGTCTCACCCAATGTGGCTGTGCAGCAACCCGGCGCCCTCACGCGGGGCTTG CGGACGCACATCGATATCGTTGTAATGTCCGCTACGCTCTGCTCCGCTCTCTATGTGGGGGACCTCTGCGGCGGGGTAAT GCTCGCGGCCCAGATATTCATCGTCTCGCCACAACACCACTGGTTCGTGCAAGAGTGCAATTGCTCCATCTACCCTGGTA CCATCACTGGTCACCGTATGGCATGGGACATGATGATGAACTGGTCGCCCACAGCTACCATGATCCTGGCGTACGCGACA CGTGTTCCCGAGGTCATCATAGACATCATTAGCGGGGCTCACTGGGGTGTCATGTTCGGCCTGGCCTACTTCTCTATGCA GGGAGCGTGGGCGAAGGTCGTTGTCATCCTCCTGCTGGCCGCTGGGGTGGACGCACATACCAACGTCATTGGGGCCCAGG TGGGGCGCACCGCCAGTAGCCTTAATAGCTTGTTCACCGTCGGCGCTAAGCAGAACATCCAGCTGATCAACTCCAATGGC AGTTGGCACATCAACCGCACTGCTCTGAACTGCAATGACTCTCTGAACACCGGCTTCCTCGCGTCCCTGTTCTACACCAA TCGCTTCAACTCGTCGGGATGCCCAGAACGTCTGGCATCCTGCCGTAGGATTGAGGCCTTCAGGATAGGATGGGGCACTC TGCAATATGAGCACAATGTCACCAATTCAGAGGATATGAGACCATACTGCTGGCATTATCCACCCAAACCTTGTGGTATA GTCCCCGCGAGGTCTGTGTGTGGCCCGGTGTACTGTTTCACACCCAGCCCAGTAGTAGTGGGCACGACCGACAGGCGTGG AGTGCCCACTTACACGTGGGGGGAGAATGAGACGGACGTCTTCCTACTGAACAGCACCCGGCCACCGCGGGGGTCATGGT TCGGCTGTACGTGGATGAACTCCACTGGCTTCACCAAGACTTGTGGCGCACCACCTTGCCGCATTAGAGCTGATTTCAAT GCCAGCACGGACCTGTTGTGCCCCACGGACTGTTTTAGGAAACACCCTGACGCCACTTACATCAAGTGTGGCTCCGGGCC CTGGCTCACGCCCAGATGCCTGGTCGACTACCCCTACAGGCTCTGGCACTACCCCTGCACAGTCAACTATAGCATCTTCA 
AGATAAGGATGTACGTGGGGGGGGTTGAACACAGGCTTACAGCTGCCTGTAACTTCACCCGCGGGGATCCTTGCAACTTG GATGACAGAGACAGAAGTCAACTGTCCCCCTTGTTGCACTCTACCACGGAGTGGGCCATCTTGCCCTGCACTTACTCTGA CCTGCCCGCCTTGTCGACCGGTCTCCTCCACCTCCACCAAAACATCGTGGACGTGCAATACATGTACGGCCTTTCACCAG CCGTCACGAAGTACATAGTCCGGTGGGAGTGGGTAGTGCTCTTGTTCCTGCTCTTGGCGGACGCCAGGGTCTGTGCCTGT GTATGGATGCTCATCCTGCTGGGCCAAGCCGAGGCAGCCCTAGAGAAGCTGGTTGTTTTGCACGCCGCGAGTGCGGCTGG CTGCAATGGCTTTCTATATTTCATCATCTTTTTCGTGGCTGCGTGGTGCATCAAGGGTCGAGTGGTCCCCTTGGCTACCT ATTCCCTCATCGGCCTATGGTCCTTCTTCCTACTGCTCCTAGCATTGCCTCAACAGGCTTATGCTTATGATGCAACTGTG CATGGACAAATAGGCGTGGCCCTGTTGGTGCTGCTCACCCTCTTTACACTCACCCCGGCATATAAGACCCTCCTGGGCCG GTGTCTGTGGTGGCTGTGCTATCTCCTGACCTTGGGAGAGGCCCTCGACCAGGAGTGGGCACCCTCCATGCAGGCGCGCG GTGGCCGGGATGGCATCATATGGGCTGCCACCATATTCTGCCCGGGTGTGGTGTTTGACATAACCAAGTGGCTTTTGGCG ATACTTGGACCTGGTTATCTCCTAAGAGATGCTTTGACACGCGTGCCGTATTTCGTCAGAGCCCACGCTCTGCTGAGAAT GTGCGCCATGGTGATGCACCTCGTGGGGGGTAAGTACGTCCAGATGGCGCTATTAACCCTTGGTAGGTGGACTGGCACTT ACATCTACGACCACCTCGCCCCCATGTCGGATTGGGCTGCCAGCGGCCTGCGGGACCTGGCGGTCGCTGTGGAACCTATC ATCTTCAGTCCGATGGAGAAAAAAGTCATCGTATGGGGAGCGGAGACAGCCGCGTGCGGGGACATCTTGCACGGACTTCC CGTGTCTGCTCGGCTTGGTCGAGAGATCCTTCTTGGCCCAGCTGACGGCTACACCTCTAAGGGGTGGAAGCTTCTTGCGC CTATCACTGCTTATGCCCAGCAGACACGAGGTCTCTTGGGCGCCATAGTGGTGAGCATGACAGGCCGTGACAAAACGGAA CAGGCCGGGGAGATCCAAGTCCTGTCCACGGTCACTCAGACCTTCCTCGGAACTACCATCTCAGGGGTCTTATGGACCGT CTACCACGGAGCTGGCAACAAGACCTTAGCCGGTTCGCGGGGCCCGGTCACGCAGATGTACTCCAGTGCCGAGGGAGACT TGGTGGGGTGGCCCAGTCCCCCCGGGACCAAATCCATGGAGCCGTGCACATGCGGAGCGGTCGACCTGTATCTGGTCACG CGGAACGCTGATGTCATCCCGGCTCGGAGACGCGGGGACAAGCGGGGAGCGTTGCTCTCCCCGAGACCTCTCTCGACCTT GAAGGGGTCCTCAGGGGGACCGGTGCTTTGCCCCAGGGGCCACGTTGTTGGGATCTTCCGGGCAGCCATATGCTCTCGGG GCGTGGCCAAGTCCATAGACTTCATCCCCGTTGAGATGCTTGACATCGTCACGCGCTCCCCCACCTTTACCGACAACAGC GCTTCTGTCTAGTTTTTATATGAAGATATTCCCATTTCCAATGACGGCCTCAAAGCAGTCCAAATATCCACTTGCAGATT ATAAGAAAAGAGTGTTTCAAAATTGCTCTATGAAAAGGGAAGTTTAACTCTGTGAGTTGAATGCAAACATCACAAAGAAG 
TTTCTGACAATGCTTCTGTCTAGTTTTTATTTATAGATATTTCCTTTTCCACCATAGGCCTCCAAGCTCTCCAAATGTCT GCTTGCAGATTCTACAAAAAAAGTGCTTCAAACCTGCTCTATCAAAAGAAAGGTTCAATTCTGTGAGTGGAATGCACACA TCACAAAGAGATTTCTGAGAATGATTCTGTCTAGTGTTTATGTGAAGATATCCCCTTTTCCAACGAAGGCCTCAAAGCGG TTCAAATATCCACTTGCAGATTCTGCAAAAAGAGTGCTTCAAAACTGCTCTATGAAGAGGTATGTTCAACCCTGTGATTT GAAAGCACACATCATAAAGTAGTTCCGAAGAATTATTCTGTCTGGTTTTTATATAATGATATTTCCTTTTCCATCATAGG CCTCAAAGCTCGCCATATGTCCACTTGAAGATTCTACAAAAAGACGGTTTCAAACCTGCTCTATGAAAAGAAAGGTTCAA CTCTGTGAGTTGAATGCACACATCACAAAGCAGTTTCTGAGAATGCTTCTGTCTAGTGTTTATGTGAAGATAATCCCGTT

All sequences come from different taxonomic kingdoms.

Where can you find the bacterium? Why is it so important to bioinformaticians and biologists?

How long is the portion of the viral DNA presented on this page? How does the length of its genome compare with the genomes of the other organisms?

Mammals have more than one chromosome. The range of chromosome counts can be found on [[https://en.wikipedia.org/wiki/List_of_organisms_by_chromosome_count|this Wikipedia page]]. Which chromosome does the sequence come from? What similarity does BLAST report? What might be the biological reason for this? How does this influence sequence assembly?

===== 6 - Assignment 2 - implement pairwise alignment =====

Work individually on the [[../assignments/hw2|second programming assignment]]. The deadline is March 20, 2019. Upload your solution to BRUTE.
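
Returning to Problem 1: once you have computed the scores by hand, a short base-R sketch such as the one below may help you check them column by column. It uses the scoring scheme stated in Problem 1. The helper names (''pair_score'', ''sp_column'', ''entropy_column'') are hypothetical and not part of any package. The entropy part rests on two assumptions that the problem statement does not fix: the logarithm is taken to base 2, and the gap symbol is counted as an ordinary character.

```r
# Alignment from Problem 1; "_" marks a gap.
aln <- c("MQPILL_G", "MLR_LL_G", "MK_ILLL_", "MPPVLLI_")

# Character matrix: rows are sequences, columns are alignment positions.
cols <- do.call(rbind, strsplit(aln, ""))

# Score of one pair of symbols under the scheme of Problem 1.
pair_score <- function(a, b) {
  if (a == "_" && b == "_") return(0)   # gap against gap
  if (a == "_" || b == "_") return(-1)  # indel
  if (a == b) 4 else -2                 # match / mismatch
}

# Sum-of-pairs score of one column: add up the scores of all unordered row pairs.
sp_column <- function(column) {
  pairs <- combn(length(column), 2)
  sum(apply(pairs, 2, function(p) pair_score(column[p[1]], column[p[2]])))
}

# Shannon entropy of one column.
# Assumptions: log base 2, and the gap is treated as an ordinary symbol.
entropy_column <- function(column) {
  p <- table(column) / length(column)
  -sum(p * log2(p))
}

sp_scores <- apply(cols, 2, sp_column)       # one sum-of-pairs score per column
entropies <- apply(cols, 2, entropy_column)  # one entropy value per column
```

Printing ''sp_scores'', ''sum(sp_scores)'' and ''entropies'' gives the per-column and total values to compare with your own calculation. If your course uses a different logarithm base or excludes gaps from the entropy, adjust ''entropy_column'' accordingly.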