Tutorial 3 - Sequence alignment

Problem 1 - score of an alignment

Calculate the score of alignment

PRT———EINS
YRNWPSEEN-

Use BLOSUM62 scoring matrix and affine gap penalty with gap opening cost 11 and gap extension cost 1.

In some tools, the cost for a gap of size 1 is only the gap opening penalty, in others, it is gap opening penalty + gap extension penalty. In this tutorial, we will follow the notation from the lecture, which corresponds to the first case.

[adapted from http://rosalind.info/problems/gaff/ ]

Problem 2 - local sequence alignment

Calculate a local alignment of sequences CGTGAATTCAT and GACTTAC. Consider $\mathrm{gap\_penalty} = -4$, $\mathrm{mismatch\_penalty} = -3$, and $\mathrm{match\_premium} = 5$.

Problem 3 - BLOSUM matrix

Consider the following alignment of 5 sequences

AAI
SAL
TAL
TAV
AAL

Calculate the BLOSUM substitution matrix. Use scaling factor $\lambda = 0.5$. For an explanation how to calculate the matrix, you can visit this lecture (original source).

[ the example is taken from (original source, not working now) http://www.cs.columbia.edu/4761/assignments/assignment1/reference1.pdf ]

4 - a practical example in R

In bioinformatics and statistics, R programming language become popular. CLI can be downloaded from the R-project site . If you prefer GUI, you can install an open-source version of RStudio. Other sources are available as well.

A toy example

We will use this language to calculate a local and a global alignment of two sequences. First, we will start with validating results we got in the second problem. BLOSUM matrix and other tools are located in library Biostrings, which is part of Bioconductor. To install the library and load it, call

R

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

We need a matrix, where a match has value $5$, mismatch $-3$, and gap is $4$. We create the matrix by

R

sigma <- nucleotideSubstitutionMatrix(match = 5, mismatch = -3, baseOnly = TRUE)

To visualize the matrix, call simply

R

 sigma

The two sequences are

R

s1 <- "CGTGAATTCAT"
s2 <- "GACTTAC"

You can calculate and visualize the local alignment by

R

pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = 0, gapExtension = -4, scoreOnly = FALSE, type="local")

Note that this function uses a slightly different definition of gapOpening and gapExtension than we do.

To see help for the pairwiseAlignment function, use

R

 help(pairwiseAlignment)

Two real-world sequences

Now we will test on two real-world sequences. Download them from the UniProt database

R

download.file("http://www.uniprot.org/uniprot/Q9CD83.fasta", "Q9CD83.fasta")
download.file("http://www.uniprot.org/uniprot/A0PQ23.fasta", "A0PQ23.fasta")

Now we need to load the sequences into R. Function read.fasta from library 'seqinr can be used. If you did not install this library yet, install it.

R

install.packages("seqinr")

Now we can load the library and the sequences.

R

library("seqinr")

leprae <- read.fasta(file = "Q9CD83.fasta")
ulcerans <- read.fasta(file = "A0PQ23.fasta")

You can visualize the sequences using

R

getSequence(leprae)
getAnnot(leprae)
getName(leprae)

Replace leprae with ulcerans to see the other bacteria. Again, we will need the Biostrings package. Blosum matrix can be loaded using

R

 data(BLOSUM50)

and visualized by

R

 BLOSUM50

Global alignment can be calculated as

R

pairwiseAlignment(toupper(c2s(leprae[[1]])), toupper(c2s(ulcerans[[1]])), substitutionMatrix = BLOSUM50, gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)

To see more examples, please, check a script from the last year.

Table of Contents