Calculate the score of the following alignment:
$$\begin{array}{l} \mathtt{MQPILL\_G} \\ \mathtt{MLR\_LL\_G} \\ \mathtt{MK\_ILLL\_} \\ \mathtt{MPPVLLI\_} \end{array}$$
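One common way to score a multiple alignment is the sum-of-pairs score: every pair of rows is scored column by column and the totals are added. The exercise does not state a scoring scheme, so the sketch below assumes one (match $+1$, mismatch $-1$, residue against gap $-1$, gap against gap $0$). The function name `sp_score` and its parameters are illustrative and not taken from the source.

```r
# Sum-of-pairs score of an alignment given as equal-length strings.
# ASSUMPTION: the default scores below are placeholders; substitute the
# scheme your course material prescribes.
sp_score <- function(aln, match = 1, mismatch = -1, gap = -1, gap_char = "_") {
  stopifnot(length(aln) >= 2, length(unique(nchar(aln))) == 1)
  # One row per sequence, one column per alignment position.
  cols <- do.call(rbind, strsplit(aln, ""))
  pairs <- combn(nrow(cols), 2)
  total <- 0
  for (p in seq_len(ncol(pairs))) {
    a <- cols[pairs[1, p], ]
    b <- cols[pairs[2, p], ]
    both_gap <- a == gap_char & b == gap_char
    one_gap <- xor(a == gap_char, b == gap_char)
    total <- total + sum(ifelse(both_gap, 0,
                         ifelse(one_gap, gap,
                         ifelse(a == b, match, mismatch))))
  }
  total
}

aln <- c("MQPILL_G", "MLR_LL_G", "MK_ILLL_", "MPPVLLI_")
sp_score(aln)
```

Protein alignments are often scored with a substitution matrix instead of a flat match/mismatch pair; in that case the innermost `ifelse` would be replaced by a matrix lookup.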
Calculate the consensus sequence.
[Adapted from http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf. ]
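A consensus sequence is usually read off a column-wise count of symbols. The following base R sketch builds such a count table for the alignment above. How to break ties and whether gaps count as symbols are conventions the exercise leaves open, so the code only reports the counts and the tied candidates. The helper name `column_profile` is hypothetical.

```r
# Count how often each symbol occurs in each alignment column.
column_profile <- function(aln) {
  stopifnot(length(unique(nchar(aln))) == 1)
  cols <- do.call(rbind, strsplit(aln, ""))
  symbols <- sort(unique(as.vector(cols)))
  counts <- sapply(seq_len(ncol(cols)), function(j) {
    vapply(symbols, function(s) sum(cols[, j] == s), integer(1))
  })
  # Rows are symbols, columns are alignment positions.
  matrix(counts, nrow = length(symbols),
         dimnames = list(symbols, as.character(seq_len(ncol(cols)))))
}

aln <- c("MQPILL_G", "MLR_LL_G", "MK_ILLL_", "MPPVLLI_")
profile <- column_profile(aln)
print(profile)

# Most frequent symbol(s) per column; ties are listed, not resolved.
candidates <- apply(profile, 2, function(x) {
  paste(rownames(profile)[x == max(x)], collapse = "/")
})
print(candidates)
```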
Calculate a multiple sequence alignment of the following sequences using the star approach.
$$\begin{aligned} s_1 &= \mathtt{CCTGCTGCAG} \\ s_2 &= \mathtt{GATGTGCCG} \\ s_3 &= \mathtt{GATGTGCAG} \\ s_4 &= \mathtt{CCGCTAGCAG} \\ s_5 &= \mathtt{CCTGTAGG} \end{aligned}$$
Score each match as $+1$ and each mismatch or indel as $-1$.
The following code may help you to calculate the pairwise alignments.
```r
# biocLite() has been retired; current Bioconductor releases install via BiocManager.
# On recent releases the alignment functions may live in the pwalign package;
# if Biostrings points you there, install and load pwalign as well.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)

s1 <- "CCTGCTGCAG"
s2 <- "GATGTGCCG"
s3 <- "GATGTGCAG"
s4 <- "CCGCTAGCAG"
s5 <- "CCTGTAGG"

# Match +1, mismatch -1; gaps cost 1 per position with no extra opening cost.
submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)

# All ten pairwise alignments.
pairwiseAlignment(s1, s2, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s3, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s4, s5, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
```
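In the star approach, the center is the sequence whose pairwise scores against all other sequences sum to the largest value. The sketch below shows that selection step in base R, so it runs without Bioconductor. It assumes global (Needleman-Wunsch) alignment with the exercise's scores of $+1$ for a match and $-1$ for a mismatch or indel; `nw_score` is a hypothetical helper written for this illustration, not part of Biostrings.

```r
# Global alignment score by dynamic programming (linear gap cost).
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- gap * (0:n)  # prefix of a aligned against gaps only
  D[1, ] <- gap * (0:m)  # prefix of b aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      D[i + 1, j + 1] <- max(D[i, j] + s,        # align x[i] with y[j]
                             D[i, j + 1] + gap,  # gap in b
                             D[i + 1, j] + gap)  # gap in a
    }
  }
  D[n + 1, m + 1]
}

seqs <- c(s1 = "CCTGCTGCAG", s2 = "GATGTGCCG", s3 = "GATGTGCAG",
          s4 = "CCGCTAGCAG", s5 = "CCTGTAGG")

# Symmetric matrix of pairwise scores; the diagonal stays 0.
n <- length(seqs)
scores <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    scores[i, j] <- scores[j, i] <- nw_score(seqs[[i]], seqs[[j]])
  }
}

# Center = largest row sum; which.max() keeps the first one on ties.
total <- rowSums(scores)
center <- names(total)[which.max(total)]
print(scores)
print(center)
```

The remaining sequences are then aligned to the center one at a time, and any gap introduced into the center is propagated to the sequences already in the alignment.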
Align the group
$$\begin{aligned} s_1 &= \mathtt{ATTGCCATT\_\_} \\ s_2 &= \mathtt{ATC\_CAATTTT} \end{aligned}$$
with the group
$$\begin{aligned} s_3 &= \mathtt{ATGGCCATT} \\ s_4 &= \mathtt{ATCTTC\_TT} \end{aligned}$$
using the approach of the CLUSTALW algorithm. Align the groups based on the two most similar sequences, one from each group, scoring matches as $+1$ and mismatches and gaps as $-1$. The respective guide tree is below.
[Source http://www.bii.a-star.edu.sg/docs/education/lsm5192_04/Multiple%20Sequence%20Alignment%20Progressive%20Approaches.pdf. ]
The following code may help you to decide which two sequences will guide the alignment.
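```r
# biocLite() has been retired; current Bioconductor releases install via BiocManager.
# On recent releases the alignment functions may live in the pwalign package;
# if Biostrings points you there, install and load pwalign as well.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)

# Group sequences with their alignment gaps removed.
s1 <- "ATTGCCATT"
s2 <- "ATCCAATTTT"
s3 <- "ATGGCCATT"
s4 <- "ATCTTCTT"

submatrix <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)

# Every pair with one sequence from each group.
pairwiseAlignment(s1, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s1, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s3, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
pairwiseAlignment(s2, s4, substitutionMatrix = submatrix, gapOpening = 0, gapExtension = -1, scoreOnly = FALSE)
```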
Use the BLAST algorithm to find the local alignment of the query sequence $$ \mathtt{IHNWALN} $$ in the database $$ \mathtt{AFGIAAAHDWALNW}. $$ Use a word length of $k=3$, a threshold for high-scoring words of $T=20$, and the BLOSUM62 scoring matrix.
Use the NCBI BLAST page to find which species are likely to contain the following sequences in their DNA.
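The seeding step of BLAST can be sketched as follows: split the query and the database into overlapping words of length $k$, score every query word against every database word, and keep the pairs that reach the threshold $T$. This is equivalent to building the neighborhood of each query word and looking for exact hits. The function names `kmers`, `word_score`, and `find_seeds` are hypothetical, and the scoring matrix is passed in as an argument. The demonstration uses a toy matrix (5 for identical letters, $-1$ otherwise), which is an assumption made only so the code runs on its own; it is not BLOSUM62.

```r
# All overlapping words of length k in a string.
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  substring(s, 1:(n - k + 1), k:n)
}

# Score two equal-length words position by position with matrix `sub`,
# whose row and column names are the residue letters.
word_score <- function(w1, w2, sub) {
  a <- strsplit(w1, "")[[1]]
  b <- strsplit(w2, "")[[1]]
  sum(sub[cbind(a, b)])
}

# Word pairs (query position, database position) scoring at least `threshold`.
find_seeds <- function(query, db, sub, k = 3, threshold = 20) {
  qw <- kmers(query, k)
  dw <- kmers(db, k)
  hits <- expand.grid(q_pos = seq_along(qw), db_pos = seq_along(dw))
  hits$q_word <- qw[hits$q_pos]
  hits$db_word <- dw[hits$db_pos]
  hits$score <- vapply(seq_len(nrow(hits)), function(i) {
    word_score(hits$q_word[i], hits$db_word[i], sub)
  }, numeric(1))
  hits[hits$score >= threshold, ]
}

# Toy matrix for demonstration only (NOT BLOSUM62).
toy <- matrix(-1, 26, 26, dimnames = list(LETTERS, LETTERS))
diag(toy) <- 5

seeds <- find_seeds("IHNWALN", "AFGIAAAHDWALNW", toy, k = 3, threshold = 15)
print(seeds)
```

To work the exercise itself, pass a real BLOSUM62 matrix as `sub` and keep the default `threshold = 20`. Biostrings has shipped one as a data set (`data(BLOSUM62)`), assuming your installed version still provides it. Each seed is then extended in both directions to obtain the local alignment.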
AAAACCGCTGATGAGCGTCGGTAAAGTACTGAATATGAACAACATCGCGGCAGCCGGCATGGTGGCAACGCTTGCCAACA ACATCCCGATGTTCGGCATGATGAAGCAGATGGATACCCGCGGCAAAGTCATCAACTGCGCCTTCGCCGTTTCCGCTGCT TTCGCCCTGGGCGACCACTTAGGCTTCGCCGCTGCCAACATGAACGCCATGATCTTCCCGATGATTGTCGGCAAGTTGAT CGGCGGCGTAACGGCGATTGGCGTGGCGATGATGCTGGTGCCAAAAGAAGACGCGACCGCGACTAAAACCGAAGCGGAGG CACAATCGTGAACACTCGCCAGCTATTGAGCGTCGGTATCGATATCGGCACCACCACCACCCAGGTGATTTTCTCCCACC TGGAGCTGGTTAACCGTGCGGCGGTGTCGCAGGTGCCGCGCTACGAATTCATTAAACGCGAAATTAGCTGGCAAAGTCCG GTGTTCTTTACCCCTGTCGATAAACAGGGCGGTTTAAAAGAAGCGGAACTGAAAACCTTAATACTCGAGCAATATCAGGC TGCGGGTATTGCGCCGGAAAGCGTTGATTCTGGTGCCATCATCATCACCGGTGAAAGCGCGAAAACCCGCAATGCTCGCC CGGCGGTGATGGCGCTCTCTCAATCGCTGGGGGATTTTGTCGTTGCCAGCGCCGGGCCGCACCTCGAATCCGTGATCGCC GGTCACGGAGCTGGGGCGCAAACCCTTTCTGAACAACGGCTGTGTCGGGTACTGAATATCGACATCGGCGGTGGCACCGC GAACTACGCCCTGTTCGATGCCGGAAAAATCAGCGGCACTGCCTGTCTCAACGTCGGTGGTCGCCTGCTGGAAACCGACA GCCAGGGGCGCGTGGTTTACGCTCATAAACCGGGGCAGATGATTGTGGATGAGTGCTTCGGTGCAGGCACTGACGCCCGT TCGCTGACCGGCGCGCAGCTGGTGCAGGTTACCCGGCGGATGGCGGCGCTGATTGTCGAAGTGATTGACGGAACGCTTTC GCCGCTCGCGCAGGCATTGATGCAAACCGGTTTGCTGCCCGCAGGTGTTACGCCCGAAATCATTACGCTTTCTGGAGGCG TGGGCGAATGTTATCGCCACCAGCCCGCCGACCCGTTCTGTTTTGCCGATATTGGCCCACTTCTGGCAACGGCGCTGCAT GACCATCCGCGCCTGCGTGAGATGAATGTGCAGTTTCCGGCGCAAACCGTACGCGCCACGGTGATTGGCGCGGGTGCACA TACCCTTTCGCTCTCTGGCAGCACAATCTGGCTGGAGGGCGTACAACTGCCGCTGCGCAATTTGCCGGTGGCGATCCCGA TTGATGAAACGGATCTGGTGAGTGCCTGGCAACAGGCGCTGCTTCAGCTGGATCTTGATCCCAAAACTGACGCGTACGTG CTGGCGCTTCCCGCCTCGCTGCCTGTGCGTTACGCCGCGGTACTGACGGTCATCAACGCGCTGGTCGATTTCGTCGCGCG TTTTCCGAATCCGCATCCCCTGCTGGTGGTGGCCGGGCAGGACTTTGGTAAAGCTCTGGGCATGTTGTTGCGCCCACAGC TACAACAACTCCCGTTGGCAGTCATTGACGAAGTGATTGTCCGCGCGGGGGACTATATCGACATTGGTACGCCTCTTTTT GGCGGATCGGTTGTGCCGGTGACGGTGAAATCACTCGCATTTCCTTCCTGAGGGAACGACTTATGAAACTAAAGACCACA TTGTTCGGCAATGTATATCAGTTTAAGGATGTAAAAGAGGTGCTGGCTAAAGCCAACGAACTGTGTTCGGGGGATGTGCT GGCAGGCGTTGCAGCGGCAAGTTCACAGGAGCGCGTGGCGGCAAAGCAGGTGTTGTCGGAAATGACCGTAGCGGACATCC 
GCAATAATCCGGTGATTGCCTATGAAGATGACTGCGTGACGCGGCTGATTCAGGACGATGTTAACGAAACGGCCTACAAC
CCACAAGACGTCAAGTTTCCGGGCGGCGGCCAGATCGTTGGCGGAGTATACTTGCTGCCGCGCAGGGGCCCCAGGTTGGG TGTGCGCGCGGCAAGGAAAACTTCGGAGCGGTCACAGCCCCGTGGGAGACGCCAGCCCATCCCCAAAGATCGGCGTCCCA CTGGCAAGTCCTGGGGAAAACCAGGATACCCTTGGCCCTTATATGGGAACGAGGGGCTCGGCTGGGCAGGATGGCTCCTG TCCCCCCAGGGCTCCCGTCCCTCTTGGGGCCCCACTGACCCCCGGCGTAGGTCGCGCAATGTGGGTAAGGTCATCGACAC CCTAACGTGCGGCTTCGCCGACCTCATGGGGTACATCCCCGTCGTAGGCGCCCCGCTTGGCGGTGTCGCCAGAGCTCTCG CGCATGGCGTGAGGGCCCTGGAGGACGGGGTCAACTATGCAACAGGGAACTTACCCGGTTGCCCCTTTTCTATCTTCTTG CTGGCCCTACTGTCCTGCATCACCACTCCGGTCTCAGCTGCCCAGGTGAAAAACACCAGTGACATCTACATGGTGACTAA CGACTGTCCCAACAGCAGCATCACCTGGCAGCTTAGGGCCGCAGTCCTCCACGTCCCCGGATGTGTCCCGTGTGAGAAAG TGGGGAATACATCTCAGTGCTGGACGCCGGTCTCACCCAATGTGGCTGTGCAGCAACCCGGCGCCCTCACGCGGGGCTTG CGGACGCACATCGATATCGTTGTAATGTCCGCTACGCTCTGCTCCGCTCTCTATGTGGGGGACCTCTGCGGCGGGGTAAT GCTCGCGGCCCAGATATTCATCGTCTCGCCACAACACCACTGGTTCGTGCAAGAGTGCAATTGCTCCATCTACCCTGGTA CCATCACTGGTCACCGTATGGCATGGGACATGATGATGAACTGGTCGCCCACAGCTACCATGATCCTGGCGTACGCGACA CGTGTTCCCGAGGTCATCATAGACATCATTAGCGGGGCTCACTGGGGTGTCATGTTCGGCCTGGCCTACTTCTCTATGCA GGGAGCGTGGGCGAAGGTCGTTGTCATCCTCCTGCTGGCCGCTGGGGTGGACGCACATACCAACGTCATTGGGGCCCAGG TGGGGCGCACCGCCAGTAGCCTTAATAGCTTGTTCACCGTCGGCGCTAAGCAGAACATCCAGCTGATCAACTCCAATGGC AGTTGGCACATCAACCGCACTGCTCTGAACTGCAATGACTCTCTGAACACCGGCTTCCTCGCGTCCCTGTTCTACACCAA TCGCTTCAACTCGTCGGGATGCCCAGAACGTCTGGCATCCTGCCGTAGGATTGAGGCCTTCAGGATAGGATGGGGCACTC TGCAATATGAGCACAATGTCACCAATTCAGAGGATATGAGACCATACTGCTGGCATTATCCACCCAAACCTTGTGGTATA GTCCCCGCGAGGTCTGTGTGTGGCCCGGTGTACTGTTTCACACCCAGCCCAGTAGTAGTGGGCACGACCGACAGGCGTGG AGTGCCCACTTACACGTGGGGGGAGAATGAGACGGACGTCTTCCTACTGAACAGCACCCGGCCACCGCGGGGGTCATGGT TCGGCTGTACGTGGATGAACTCCACTGGCTTCACCAAGACTTGTGGCGCACCACCTTGCCGCATTAGAGCTGATTTCAAT GCCAGCACGGACCTGTTGTGCCCCACGGACTGTTTTAGGAAACACCCTGACGCCACTTACATCAAGTGTGGCTCCGGGCC CTGGCTCACGCCCAGATGCCTGGTCGACTACCCCTACAGGCTCTGGCACTACCCCTGCACAGTCAACTATAGCATCTTCA AGATAAGGATGTACGTGGGGGGGGTTGAACACAGGCTTACAGCTGCCTGTAACTTCACCCGCGGGGATCCTTGCAACTTG 
GATGACAGAGACAGAAGTCAACTGTCCCCCTTGTTGCACTCTACCACGGAGTGGGCCATCTTGCCCTGCACTTACTCTGA CCTGCCCGCCTTGTCGACCGGTCTCCTCCACCTCCACCAAAACATCGTGGACGTGCAATACATGTACGGCCTTTCACCAG CCGTCACGAAGTACATAGTCCGGTGGGAGTGGGTAGTGCTCTTGTTCCTGCTCTTGGCGGACGCCAGGGTCTGTGCCTGT GTATGGATGCTCATCCTGCTGGGCCAAGCCGAGGCAGCCCTAGAGAAGCTGGTTGTTTTGCACGCCGCGAGTGCGGCTGG CTGCAATGGCTTTCTATATTTCATCATCTTTTTCGTGGCTGCGTGGTGCATCAAGGGTCGAGTGGTCCCCTTGGCTACCT ATTCCCTCATCGGCCTATGGTCCTTCTTCCTACTGCTCCTAGCATTGCCTCAACAGGCTTATGCTTATGATGCAACTGTG CATGGACAAATAGGCGTGGCCCTGTTGGTGCTGCTCACCCTCTTTACACTCACCCCGGCATATAAGACCCTCCTGGGCCG GTGTCTGTGGTGGCTGTGCTATCTCCTGACCTTGGGAGAGGCCCTCGACCAGGAGTGGGCACCCTCCATGCAGGCGCGCG GTGGCCGGGATGGCATCATATGGGCTGCCACCATATTCTGCCCGGGTGTGGTGTTTGACATAACCAAGTGGCTTTTGGCG ATACTTGGACCTGGTTATCTCCTAAGAGATGCTTTGACACGCGTGCCGTATTTCGTCAGAGCCCACGCTCTGCTGAGAAT GTGCGCCATGGTGATGCACCTCGTGGGGGGTAAGTACGTCCAGATGGCGCTATTAACCCTTGGTAGGTGGACTGGCACTT ACATCTACGACCACCTCGCCCCCATGTCGGATTGGGCTGCCAGCGGCCTGCGGGACCTGGCGGTCGCTGTGGAACCTATC ATCTTCAGTCCGATGGAGAAAAAAGTCATCGTATGGGGAGCGGAGACAGCCGCGTGCGGGGACATCTTGCACGGACTTCC CGTGTCTGCTCGGCTTGGTCGAGAGATCCTTCTTGGCCCAGCTGACGGCTACACCTCTAAGGGGTGGAAGCTTCTTGCGC CTATCACTGCTTATGCCCAGCAGACACGAGGTCTCTTGGGCGCCATAGTGGTGAGCATGACAGGCCGTGACAAAACGGAA CAGGCCGGGGAGATCCAAGTCCTGTCCACGGTCACTCAGACCTTCCTCGGAACTACCATCTCAGGGGTCTTATGGACCGT CTACCACGGAGCTGGCAACAAGACCTTAGCCGGTTCGCGGGGCCCGGTCACGCAGATGTACTCCAGTGCCGAGGGAGACT TGGTGGGGTGGCCCAGTCCCCCCGGGACCAAATCCATGGAGCCGTGCACATGCGGAGCGGTCGACCTGTATCTGGTCACG CGGAACGCTGATGTCATCCCGGCTCGGAGACGCGGGGACAAGCGGGGAGCGTTGCTCTCCCCGAGACCTCTCTCGACCTT GAAGGGGTCCTCAGGGGGACCGGTGCTTTGCCCCAGGGGCCACGTTGTTGGGATCTTCCGGGCAGCCATATGCTCTCGGG GCGTGGCCAAGTCCATAGACTTCATCCCCGTTGAGATGCTTGACATCGTCACGCGCTCCCCCACCTTTACCGACAACAGC
GCTTCTGTCTAGTTTTTATATGAAGATATTCCCATTTCCAATGACGGCCTCAAAGCAGTCCAAATATCCACTTGCAGATT ATAAGAAAAGAGTGTTTCAAAATTGCTCTATGAAAAGGGAAGTTTAACTCTGTGAGTTGAATGCAAACATCACAAAGAAG TTTCTGACAATGCTTCTGTCTAGTTTTTATTTATAGATATTTCCTTTTCCACCATAGGCCTCCAAGCTCTCCAAATGTCT GCTTGCAGATTCTACAAAAAAAGTGCTTCAAACCTGCTCTATCAAAAGAAAGGTTCAATTCTGTGAGTGGAATGCACACA TCACAAAGAGATTTCTGAGAATGATTCTGTCTAGTGTTTATGTGAAGATATCCCCTTTTCCAACGAAGGCCTCAAAGCGG TTCAAATATCCACTTGCAGATTCTGCAAAAAGAGTGCTTCAAAACTGCTCTATGAAGAGGTATGTTCAACCCTGTGATTT GAAAGCACACATCATAAAGTAGTTCCGAAGAATTATTCTGTCTGGTTTTTATATAATGATATTTCCTTTTCCATCATAGG CCTCAAAGCTCGCCATATGTCCACTTGAAGATTCTACAAAAAGACGGTTTCAAACCTGCTCTATGAAAAGAAAGGTTCAA CTCTGTGAGTTGAATGCACACATCACAAAGCAGTTTCTGAGAATGCTTCTGTCTAGTGTTTATGTGAAGATAATCCCGTT
Mammals have more than one chromosome; the range of chromosome counts can be found on this Wikipedia page. Which chromosome does the sequence come from? What similarity does BLAST report? What might be the biological reason for this? How does it influence sequence assembly?