====== Tutorial 4 - BLAST, Star Alignment, Clustal Omega ====== ===== Problem 1 - Multiple Sequence Alignment Score ===== Calculate the score of the following alignment - using //sum-of-pairs// method (match is $+4$, mismatch $-2$, indel $-1$ and $s(\_,\_)=0$); - using Shannon //entropy// method. $$\begin{array}{l} \mathtt{MQPILL\_G} \\ \mathtt{MLR\_LL\_G} \\ \mathtt{MK\_ILLL\_} \\ \mathtt{MPPVLLI\_} \end{array}$$ Calculate the //consensus sequence//. [Adapted from [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ] ===== Problem 2 - STAR Alignment ===== Calculate multiple sequence alignment using the star approach. $$\begin{aligned} s_1 &= \mathtt{CCTGCTGCAG} \\ s_2 &= \mathtt{GATGTGCCG} \\ s_3 &= \mathtt{GATGTGCAG} \\ s_4 &= \mathtt{CCGCTAGCAG} \\ s_5 &= \mathtt{CCTGTAGG} \end{aligned}$$ Match is for $+1$, mismatches and indels for $-1$. The following code may help you to calculate the pairwise alignments. source("https://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings) s1 <- "CCTGCTGCAG" s2 <- "GATGTGCCG" s3 <- "GATGTGCAG" s4 <- "CCGCTAGCAG" s5 <- "CCTGTAGG" submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE) pairwiseAlignment(s1, s2, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s1, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s2, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s3, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s3, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s4, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) ===== Problem 3 - CLUSTAL ===== Align group $$\begin{aligned} s_1 &= \mathtt{ATTGCCATT\_\_} \\ s_2 &= \mathtt{ATC\_CAATTTT} \end{aligned}$$ with group $$\begin{aligned} s_1 &= \mathtt{ATGGCCATT} \\ s_2 &= \mathtt{ATCTTC\_TT} \end{aligned}$$ using the approach of CLUSTALW algorithm. Align groups based on two most similar sequences considering matches for $+1$ and mismatches and gaps for $-1$. [Source [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ] The following code may help you to decide which two sequences will guide the alignment. source("https://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings) s1 <- "ATTGCCATT" s2 <- "ATCCAATTTT" s3 <- "ATGGCCATT" s4 <- "ATCTTCTT" submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE) pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE) ===== Problem 4 - BLAST ===== Use BLAST algorithm to find the local alignment of query sequence $$ \mathtt{IHNWALN} $$ in database $$ \mathtt{AFGIAAAHDWALNW}. $$ Use $k=3$, a threshold for high scoring words $T=20$, and [[http://rosalind.info/glossary/blosum62/|BLOSUM 62 scoring matrix]]. ===== Problem 5 - BLAST online ===== On the second tutorial, we assembled a sequence. If you did not obtain results in reasonable time, you could use {{courses:bin:tutorials:contigs.fa.txt| this file}}. Use [[https://blast.ncbi.nlm.nih.gov/Blast.cgi | NCBI BLAST page]] to find what species it is. You do not need to search for the whole sequence; one contig is enough.