====== Tutorial 4 - BLAST, Star Alignment, Clustal Omega ====== ===== Problem 1 - Multiple Sequence Alignment Score ===== Calculate the score of the following alignment - using //sum-of-pairs// method (match is $+4$, mismatch $-2$, indel $-1$ and $s(\_,\_)=0$); - using Shannon //entropy// method. $$\begin{array}{l} \mathtt{MQPILL\_G} \\ \mathtt{MLR\_LL\_G} \\ \mathtt{MK\_ILLL\_} \\ \mathtt{MPPVLLI\_} \end{array}$$ Calculate the //consensus sequence//. [Adapted from [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ] ===== Problem 2 - STAR Alignment ===== Calculate multiple sequence alignment using the star approach. $$\begin{aligned} s_1 &= \mathtt{CCTGCTGCAG} \\ s_2 &= \mathtt{GATGTGCCG} \\ s_3 &= \mathtt{GATGTGCAG} \\ s_4 &= \mathtt{CCGCTAGCAG} \\ s_5 &= \mathtt{CCTGTAGG} \end{aligned}$$ Match is for $+1$, mismatches and indels for $-1$. The following code may help you to calculate the pairwise alignments.


source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")

library(Biostrings)
s1 <- "CCTGCTGCAG"
s2 <- "GATGTGCCG"
s3 <- "GATGTGCAG"
s4 <- "CCGCTAGCAG"
s5 <- "CCTGTAGG"

submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)


pairwiseAlignment(s1, s2, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s4, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)

===== Problem 3 - CLUSTAL ===== Align group $$\begin{aligned} s_1 &= \mathtt{ATTGCCATT\_\_} \\ s_2 &= \mathtt{ATC\_CAATTTT} \end{aligned}$$ with group $$\begin{aligned} s_1 &= \mathtt{ATGGCCATT} \\ s_2 &= \mathtt{ATCTTC\_TT} \end{aligned}$$ using the approach of CLUSTALW algorithm. Align groups based on two most similar sequences considering matches for $+1$ and mismatches and gaps for $-1$. [Source [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ] The following code may help you to decide which two sequences will guide the alignment.


source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")

library(Biostrings)
s1 <- "ATTGCCATT"
s2 <- "ATCCAATTTT"
s3 <- "ATGGCCATT"
s4 <- "ATCTTCTT"


submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)


pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)

===== Problem 4 - BLAST ===== Use BLAST algorithm to find the local alignment of query sequence $$ \mathtt{IHNWALN} $$ in database $$ \mathtt{AFGIAAAHDWALNW}. $$ Use $k=3$, a threshold for high scoring words $T=20$, and [[http://rosalind.info/glossary/blosum62/|BLOSUM 62 scoring matrix]]. ===== Problem 5 - BLAST online ===== On the second tutorial, we assembled a sequence. If you did not obtain results in reasonable time, you could use {{courses:bin:tutorials:contigs.fa.txt| this file}}. Use [[https://blast.ncbi.nlm.nih.gov/Blast.cgi | NCBI BLAST page]] to find what species it is. You do not need to search for the whole sequence; one contig is enough.