Calculate the score of alignment

`PRT———EINS`

`YRNWPSEEN-`

Use BLOSUM62 scoring matrix and affine gap penalty with gap opening cost 11 and gap extension cost 1.

In some tools, the cost for a gap of size 1 is only the *gap opening penalty*, in others, it is *gap opening penalty + gap extension penalty*. In this tutorial, we will follow the notation from the lecture, which corresponds to the first case.

[adapted from http://rosalind.info/problems/gaff/ ]

Calculate a local alignment of sequences `CGTGAATTCAT`

and `GACTTAC`

. Consider $\mathrm{gap\_penalty} = -4$, $\mathrm{mismatch\_penalty} = -3$, and $\mathrm{match\_premium} = 5$.

Consider the following alignment of 5 sequences

`AAI`

`SAL`

`TAL`

`TAV`

`AAL`

Calculate the BLOSUM substitution matrix. Use scaling factor $\lambda = 0.5$. For an explanation how to calculate the matrix, you can visit this lecture.

[ the example is taken from http://www.cs.columbia.edu/4761/assignments/assignment1/reference1.pdf ]

In bioinformatics and statics, **R** programming language become popular. CLI can be downloaded from the R-project site . If you prefer GUI, you can install an open-source version of RStudio. Other sources are available as well.

We will use this language to calculate a local and a global alignment of two sequences. First, we will start with validating results we got in the second problem. BLOSUM matrix and other tools are located in library Biostrings, which is part of Bioconductor. To install the library and load it, call

R

source("https://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings)

R

sigma <- nucleotideSubstitutionMatrix(match = 5, mismatch = -3, baseOnly = TRUE)

R

sigma

The two sequences are

R

s1 <- "CGTGAATTCAT" s2 <- "GACTTAC"

R

pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = 0, gapExtension = -4, scoreOnly = FALSE, type="local")

`gapOpening`

and `gapExtension`

than we do.
To see help for the `pairwiseAlignment`

function, use

R

help(pairwiseAlignment)

Now we will test on two real-world sequences. Download them from the UniProt database

R

download.file("http://www.uniprot.org/uniprot/Q9CD83.fasta", "Q9CD83.fasta") download.file("http://www.uniprot.org/uniprot/A0PQ23.fasta", "A0PQ23.fasta")

`read.fasta`

from library `'seqinr`

can be used. If you did not install this library yet, install it.
R

install.packages("seqinr")

R

library("seqinr") leprae <- read.fasta(file = "Q9CD83.fasta") ulcerans <- read.fasta(file = "A0PQ23.fasta")

R

getSequence(leprae) getAnnot(leprae) getName(leprae)

`leprae`

with `ulcerans`

to see the other bacteria. Again, we will need the `Biostrings`

package. Blosum matrix can be loaded using
R

data(BLOSUM50)

R

BLOSUM50

Global alignment can be calculated as

R

pairwiseAlignment(toupper(c2s(leprae[[1]])), toupper(c2s(ulcerans[[1]])), substitutionMatrix = BLOSUM50, gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)

To see more examples, please, check a script from the last year.

Work individually on the second programming assignment. The deadline is on March 20, 2019. Upload in BRUTE.

courses/bin/tutorials/tutorial2.txt · Last modified: 2020/02/11 16:47 (external edit)