Calculate the score of alignment
PRT———EINS
YRNWPSEEN-
Use BLOSUM62 scoring matrix and affine gap penalty with gap opening cost 11 and gap extension cost 1.
[adapted from http://rosalind.info/problems/gaff/ ]
Calculate a local alignment of sequences CGTGAATTCAT
and GACTTAC
. Consider $\mathrm{gap\_penalty} = -4$, $\mathrm{mismatch\_penalty} = -3$, and $\mathrm{match\_premium} = 5$.
Consider the following alignment of 5 sequences
AAI
SAL
TAL
TAV
AAL
Calculate the BLOSUM substitution matrix. Use scaling factor $\lambda = 0.5$. For an explanation how to calculate the matrix, you can visit this lecture (original source).
[ the example is taken from (original source, not working now) http://www.cs.columbia.edu/4761/assignments/assignment1/reference1.pdf ]
In bioinformatics and statistics, R programming language become popular. CLI can be downloaded from the R-project site . If you prefer GUI, you can install an open-source version of RStudio. Other sources are available as well.
We will use this language to calculate a local and a global alignment of two sequences. First, we will start with validating results we got in the second problem. BLOSUM matrix and other tools are located in library Biostrings, which is part of Bioconductor. To install the library and load it, call
R
source("https://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings)
R
sigma <- nucleotideSubstitutionMatrix(match = 5, mismatch = -3, baseOnly = TRUE)
R
sigma
The two sequences are
R
s1 <- "CGTGAATTCAT" s2 <- "GACTTAC"
R
pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = 0, gapExtension = -4, scoreOnly = FALSE, type="local")
gapOpening
and gapExtension
than we do.
To see help for the pairwiseAlignment
function, use
R
help(pairwiseAlignment)
Now we will test on two real-world sequences. Download them from the UniProt database
R
download.file("http://www.uniprot.org/uniprot/Q9CD83.fasta", "Q9CD83.fasta") download.file("http://www.uniprot.org/uniprot/A0PQ23.fasta", "A0PQ23.fasta")
read.fasta
from library 'seqinr
can be used. If you did not install this library yet, install it.
R
install.packages("seqinr")
R
library("seqinr") leprae <- read.fasta(file = "Q9CD83.fasta") ulcerans <- read.fasta(file = "A0PQ23.fasta")
R
getSequence(leprae) getAnnot(leprae) getName(leprae)
leprae
with ulcerans
to see the other bacteria. Again, we will need the Biostrings
package. Blosum matrix can be loaded using
R
data(BLOSUM50)
R
BLOSUM50
Global alignment can be calculated as
R
pairwiseAlignment(toupper(c2s(leprae[[1]])), toupper(c2s(ulcerans[[1]])), substitutionMatrix = BLOSUM50, gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)
To see more examples, please, check a script from the last year.