====== Tutorial 4 - BLAST, Star Alignment, Clustal Omega ======
===== Problem 1 - Multiple Sequence Alignment Score =====
Calculate the score of the following alignment
- using //sum-of-pairs// method (match is $+4$, mismatch $-2$, indel $-1$ and $s(\_,\_)=0$);
- using Shannon //entropy// method.
$$\begin{array}{l} \mathtt{MQPILL\_G} \\ \mathtt{MLR\_LL\_G} \\ \mathtt{MK\_ILLL\_} \\ \mathtt{MPPVLLI\_} \end{array}$$
Calculate the //consensus sequence//.
[Adapted from [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ]
===== Problem 2 - STAR Alignment =====
Calculate multiple sequence alignment using the star approach.
$$\begin{aligned}
s_1 &= \mathtt{CCTGCTGCAG} \\
s_2 &= \mathtt{GATGTGCCG} \\
s_3 &= \mathtt{GATGTGCAG} \\
s_4 &= \mathtt{CCGCTAGCAG} \\
s_5 &= \mathtt{CCTGTAGG}
\end{aligned}$$
Match is for $+1$, mismatches and indels for $-1$.
The following code may help you to calculate the pairwise alignments.
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
s1 <- "CCTGCTGCAG"
s2 <- "GATGTGCCG"
s3 <- "GATGTGCAG"
s4 <- "CCGCTAGCAG"
s5 <- "CCTGTAGG"
submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)
pairwiseAlignment(s1, s2, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s4, s5, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
===== Problem 3 - CLUSTAL =====
Align group
$$\begin{aligned}
s_1 &= \mathtt{ATTGCCATT\_\_} \\
s_2 &= \mathtt{ATC\_CAATTTT}
\end{aligned}$$
with group
$$\begin{aligned}
s_1 &= \mathtt{ATGGCCATT} \\
s_2 &= \mathtt{ATCTTC\_TT}
\end{aligned}$$
using the approach of CLUSTALW algorithm. Align groups based on two most similar sequences considering matches for $+1$ and mismatches and gaps for $-1$.
[Source [[http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf]]. ]
The following code may help you to decide which two sequences will guide the alignment.
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
s1 <- "ATTGCCATT"
s2 <- "ATCCAATTTT"
s3 <- "ATGGCCATT"
s4 <- "ATCTTCTT"
submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0,gapExtension = -1, scoreOnly = FALSE)
===== Problem 4 - BLAST =====
Use BLAST algorithm to find the local alignment of query sequence
$$ \mathtt{IHNWALN} $$
in database
$$ \mathtt{AFGIAAAHDWALNW}. $$
Use $k=3$, a threshold for high scoring words $T=20$, and [[http://rosalind.info/glossary/blosum62/|BLOSUM 62 scoring matrix]].
===== Problem 5 - BLAST online =====
On the second tutorial, we assembled a sequence. If you did not obtain results in reasonable time, you could use {{courses:bin:tutorials:contigs.fa.txt| this file}}. Use [[https://blast.ncbi.nlm.nih.gov/Blast.cgi | NCBI BLAST page]] to find what species it is. You do not need to search for the whole sequence; one contig is enough.