Consider the following multiple sequence alignment.
$$
\begin{array}{ccccccccc}
\mathrm{A} & \mathrm{C} & \mathrm{D} & \mathrm{E} & \mathrm{F} & \mathrm{A} & \mathrm{C} & \mathrm{A} & \mathrm{F} \\
\mathrm{A} & \mathrm{F} & \mathrm{D} & \mathrm{A} & \mathrm{\_} & \mathrm{\_} & \mathrm{\_} & \mathrm{C} & \mathrm{F} \\
\mathrm{A} & \mathrm{\_} & \mathrm{\_} & \mathrm{E} & \mathrm{F} & \mathrm{D} & \mathrm{\_} & \mathrm{F} & \mathrm{C} \\
\mathrm{A} & \mathrm{C} & \mathrm{C} & \mathrm{E} & \mathrm{F} & \mathrm{\_} & \mathrm{\_} & \mathrm{A} & \mathrm{C} \\
\mathrm{A} & \mathrm{D} & \mathrm{D} & \mathrm{E} & \mathrm{F} & \mathrm{A} & \mathrm{A} & \mathrm{A} & \mathrm{F} \\
\end{array}
$$
Use the threshold $\theta = 2$ to ignore columns that contain too many gaps. Construct a profile HMM based on this multiple sequence alignment.
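The first step of the construction, deciding which columns become match states, can be sketched in a few lines of Python. This is only a sketch under an assumption: a column is kept as a match column when it contains fewer than $\theta$ gaps, so a column with exactly two gaps is treated as an insertion. If your convention is "at most $\theta$ gaps", change the comparison accordingly. The helper names (`gap_counts`, `match_columns`, `emission_counts`) are illustrative and not part of the assignment.

```python
from collections import Counter

# The alignment from the task, one string per row; "_" marks a gap.
ALIGNMENT = [
    "ACDEFACAF",
    "AFDA___CF",
    "A__EFD_FC",
    "ACCEF__AC",
    "ADDEFAAAF",
]
GAP = "_"


def gap_counts(rows):
    """Number of gap symbols in every alignment column."""
    return [sum(1 for ch in col if ch == GAP) for col in zip(*rows)]


def match_columns(rows, theta):
    """0-based indices of columns kept as match states.

    Assumed convention: a column is a match column if it has
    fewer than `theta` gaps.
    """
    return [i for i, g in enumerate(gap_counts(rows)) if g < theta]


def emission_counts(rows, columns):
    """Raw residue counts (gaps excluded) for the selected columns."""
    cols = list(zip(*rows))
    return [Counter(ch for ch in cols[i] if ch != GAP) for i in columns]


gaps = gap_counts(ALIGNMENT)
matches = match_columns(ALIGNMENT, theta=2)
print("gaps per column:", gaps)
print("match columns (1-based):", [i + 1 for i in matches])
for i, counts in zip(matches, emission_counts(ALIGNMENT, matches)):
    print(f"column {i + 1}: {dict(counts)}")
```

The raw counts still have to be turned into emission and transition probabilities, typically with pseudocounts so that unseen residues do not get probability zero.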
Imagine that you have found a new protein located in human heart muscle and some other strong muscles in the body. You believe that the protein has some connection with the ability of muscles to produce energy. Use either the Pfam database or the EBI HMMER search tool to identify the protein family to which the protein belongs.
The protein sequence is below.
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH PGDFGADAQGAMNKALELFRKDMASNYKELGFQG
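Search tools usually expect the query as one contiguous sequence or as a FASTA record. A minimal sketch of the clean-up step is shown below; the header `query_protein` and the line width of 60 are arbitrary choices made for this illustration.

```python
# The protein sequence as printed above, including the stray spaces.
RAW = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE "
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH "
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)


def clean_sequence(raw):
    """Remove all whitespace and normalise to upper case."""
    return "".join(raw.split()).upper()


def to_fasta(sequence, header="query_protein", width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


SEQUENCE = clean_sequence(RAW)
print(len(SEQUENCE), "residues")
print(to_fasta(SEQUENCE), end="")
```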
You may find the following alignment. Answer the questions below.
The quality of a search hit is quantified by the so-called $E$-value (sometimes called the Expect value). Find out what this quantity means. What is the difference between the $E$-value and the $p$-value? Based on the reported $E$-value, do you believe that the protein belongs to the family reported by the database search?
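As a numerical hint, the sketch below converts an $E$-value into a $p$-value under one common assumption: the number of chance hits scoring at least as well as the observed one follows a Poisson distribution with mean $E$, so the probability of seeing at least one such hit is $p = 1 - e^{-E}$. Whether this assumption matches the tool you used is something to check in its documentation; the function name is illustrative.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) assuming a Poisson count with mean E.

    Computed as -expm1(-E) = 1 - exp(-E), which stays accurate
    for very small E.
    """
    return -math.expm1(-e_value)


for e in (1e-50, 1e-3, 1.0, 10.0):
    print(f"E = {e:g}  ->  p = {p_value_from_e_value(e):.6g}")
```

For small values the two quantities are nearly equal, while for large values the $E$-value can exceed 1 and the $p$-value cannot.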
The protein is, in fact, not new. Use the UniProt BLAST tool to find the true name of the protein. Knowing the protein name, answer the following questions:
This protein was not chosen arbitrarily. A Nobel Prize was awarded in 1962 for research on this protein. Who received the prize and why? You will talk more about this field in the second part of the course.
There are many hidden Markov model tools capable of predicting introns and exons. Their power lies in capturing the semantics of genes, which is why they achieve good accuracy. Consider the sequence below (you can also download it as a FASTA file).
>NG_008301.1:5001-8050 Homo sapiens keratin 16 (KRT16), RefSeqGene on chromosome 17 AGTTAGGAGGGCCCCGCCTTCCCCAGCTGCATATAAAGGTCTCTGGGGTTGGAGGCAGCCACAGCACGCT CTCAGCCTTCCTGAGCACCTTTCCTTCTTTCAGCCAACTGCTCACTCGCTCACCTCCCTCCTTGGCACCA TGACCACCTGCAGCCGCCAGTTCACCTCCTCCAGCTCCATGAAGGGCTCCTGCGGCATCGGAGGCGGCAT CGGGGGCGGCTCCAGCCGCATCTCCTCCGTCCTGGCCGGAGGGTCCTGCCGTGCCCCCAGCACCTACGGG GGCGGCCTGTCTGTCTCCTCTCGCTTCTCCTCTGGGGGAGCCTGCGGGCTGGGGGGCGGCTATGGCGGTG GCTTCAGCAGCAGCAGCAGCTTTGGTAGTGGCTTCGGGGGAGGATATGGTGGTGGCCTTGGTGCTGGCTT CGGTGGTGGCTTGGGTGCTGGCTTTGGTGGTGGTTTTGCTGGTGGTGATGGGCTTCTGGTGGGCAGTGAG AAGGTGACCATGCAGAACCTCAATGACCGCCTGGCCTCCTACCTGGACAAGGTGCGTGCTCTGGAGGAGG CCAACGCCGACCTGGAAGTGAAGATCCGTGACTGGTACCAGAGGCAGCGGCCCAGTGAGATCAAAGACTA CAGTCCCTACTTCAAGACCATCGAGGACCTGAGGAACAAGGTGGGTGACTTTGGTGTATGGAGCACTGAG AGAGGCTGGGGCTACAGTGGCCCTTGGGATACCTCTTTTTAGCAATTACACTTTACAAACAGGGAGACTG GGCACCTTTGGGGAGTGGCCAGGATCACCCAGGGAAGTGGTAGCAGAGGGTCCCTTTTCAGTATCTCTGT GCCCGGACTGGGGCTGTTACCCTAAATCTCTTATTTCCTTCAAGGGTTCAGCTGCAAGTTCAGCTTCCCT GCCTTGGGCCCAGGAAGGGGGTGATCGGGATGGAGTGCATCCCTACGTAGCCTGAGCTGGTGGAGAAGGC ATGCCAGCCCTGCCAGCCAGAAGACTTCCAGATTTGGGGCGGTTCCTTTTGCCCCTTTCTGCCTTTCATG CTCAAGTAGTAAGGTCCTTGGCTGACCAGGGCTCCTGTCCTCCATCCCCACTCCAGATCATTGCGGCCAC CATTGAGAATGCGCAGCCCATTTTGCAGATTGACAATGCCAGGCTGGCAGCCGATGACTTCAGGACCAAG TGAGCAGCCAGCATGGTGGGCTGGGGGCAGAGGGCAAGGGACAAAGAGTGGGGCGGTCCACCCAGCAGGG CCAGCAGACCCCGAGCCTCAGAATCCTCAGGGCTGCAGCCTGAGGACCTGACCTCTGTCCTGCCAGGTAT GAGCATGAACTGGCCCTGCGGCAGACTGTGGAGGCCGACGTCAATGGCCTGCGCCGGGTGTTGGATGAGC TGACCCTGGCCAGGACTGACCTGGAGATGCAGATCGAAGGCCTGAAGGAGGAGCTGGCCTACCTGAGGAA GAACCACGAGGAGGTACGGTCGCTGCTGGCTTCCGGGGTGGGAGGCTGGTTTGGTGGGGTTGCCAGATGC ACCCAGGGCCAGGAGAGAAGTCTGCTGAACTGACCGCCTCCTGCCATCCCTTCCCAGGAGATGCTTGCTC TGAGAGGTCAGACCGGCGGAGATGTGAACGTGGAGATGGATGCTGCACCTGGCGTGGACCTGAGCCGCAT CCTGAATGAGATGCGTGACCAGTACGAGCAGATGGCAGAGAAAAACCGCAGAGACGCTGAGACCTGGTTC CTGAGCAAGGTGGGGCTCGGGCCCGCAGTGAGCCTGCAGCACTTCCCAGCTGGGGGCTTTGGGAGAGCCT 
CACCTTTCACTCTGCTTTCCTGCCTCAGACCGAGGAGCTGAACAAAGAAGTGGCCTCCAACAGCGAACTG GTACAGAGCAGCCGCAGTGAGGTGACGGAGCTCCGGAGGGTGCTCCAGGGCCTGGAGATTGAGCTGCAGT CCCAGCTCAGCATGGTATGAAGGACCCAGCACAGCAGCAGCCCCCAAGTCACCAGTAATGGCCACCACCC CCTCAAAAAGCCACAGTCTAGTTCCACCTTTCTTTTCTCAGGATGGGACCAGGGGACTCATGGGACCGTT ATATAGATAGAGAAACTAAGCCCTAGAATAGTGGGCTAGCTTTTCTCCATATTGTCTGGCCCATCAGTAC CCCAACTGGGATCAAAATCCAGGCATCTCTCAAAAAACATGCCCAGAGACCTGGAGGAACAGGAGTGACC ACCTCCATGGACTCTTTTTCTCTCTCTCACTTGCAGAAAGCATCCCTGGAGAACAGCCTGGAGGAGACCA AAGGCCGCTACTGCATGCAGCTGTCCCAGATCCAGGGACTGATTGGCAGTGTGGAGGAGCAGCTGGCCCA GCTACGCTGTGAGATGGAGCAGCAGAGCCAGGAGTACCAGATCTTGCTGGATGTGAAGACGCGGCTGGAG CAGGAGATTGCCACCTACCGCCGCCTGCTGGAGGGCGAGGATGCCCAGTGAGTCCCAGGCCCCTCAGTTC TGCCTCCCAGACCCTTTAGCCCCCCTGCTGCTCTCAGCACAACTGACTGCCCTGCTTTTTCTCTCCCACA GCCTTTCCTCCCAGCAAGCATCTGGCCAATCCTATTCTTCCCGCGAGGGTAAGGCTTCTGAGGCTCCCCG GCACTGCAGCCCCTCTGCCTGTTTCCATGGAGTGGGGGCTGGGCCCTTCTCCTCAGAGCTCCCAGCCCTC CCTTCTCCCTGCCCTGGAGTCAGCTTAGCTCTCAGACCCCTTCTCACCTCCTCTTCTCTCTCCCACAGTC TTCACCTCCTCCTCGTCCTCTTCGAGCCGTCAGACCCGGCCCATCCTCAAGGAGCAGAGCTCATCCAGCT TCAGCCAGGGCCAGAGCTCCTAGAACTGAGCTGCCTCTACCACAGCCTCCTGCCCACCAGCTGGCCTCAC CTCCTGAAGGCCCGGGTCAGGACCCTGCTCTCCTGGCGCAGTTCCCAGCTATCTCCCCTGCTCCTCTGCT GGTGGTGGGCTAATAAAGCTGACTTTCTGGTTGATGCAAA
The first tool we will use is HMMgene on http://www.cbs.dtu.dk/services/HMMgene/. Use this tool to classify introns and exons of the sequence above.
In the output, you may notice that the program claims you submitted two sequences, even though you submitted only one. Why does the program report two?
## gff-version 1
## date: Thu Apr 4 10:57:50 2019
## HMMgene1.1e (human model sim10gc.C.bsmod)
# SEQ: 3050 (+) A:593 C:922 G:903 T:632
HMMgene1.1e firstex 140 670
# ...
# ..........
# SEQ: 3050 (-) A:632 C:903 G:922 T:593
HMMgene1.1e exon_1 1824 1941 0.293 - 1
# ...
# ..........
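A useful observation about the two `SEQ` lines is that their base counts mirror each other: the A and T counts are swapped, and so are the C and G counts. The sketch below shows which operation on a DNA string produces exactly this pattern. The toy sequence is made up for illustration; only the two count tables are taken from the output above.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def base_counts(seq):
    """Counts of A, C, G and T in a DNA string."""
    counts = Counter(seq)
    return {base: counts[base] for base in "ACGT"}


def swap_counts(counts):
    """Base counts expected after reverse complementing."""
    return {"A": counts["T"], "C": counts["G"],
            "G": counts["C"], "T": counts["A"]}


toy = "AACCCGT"  # made-up example sequence
print(base_counts(toy), base_counts(reverse_complement(toy)))

# Counts copied from the (+) and (-) SEQ lines of the HMMgene output.
plus = {"A": 593, "C": 922, "G": 903, "T": 632}
minus = {"A": 632, "C": 903, "G": 922, "T": 593}
print(swap_counts(plus) == minus)
```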
How many introns and exons were predicted?
So far we do not know anything about the accuracy of the result. We shall, therefore, use a second tool to get additional information. Use GENSCAN on http://genes.mit.edu/GENSCAN.html to predict introns and exons for the same sequence.
Now compare the results. From the sequence annotation, you can see that the protein is a keratin, a structural building block of hair, nails, and the cytoskeleton. Use the NCBI nucleotide archive to search for the sequence and its annotation.
Compare the predicted introns and exons with the official annotation. How many differences did you find? What are the differences between the predicted peptide sequence and the one in the NCBI database?
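The comparison can be done by hand, but a short script makes it less error-prone. The sketch below assumes that exons are given as inclusive `(start, end)` coordinate pairs on the same strand; the coordinates in the example are placeholders invented for this illustration and are not the real predictions or the real annotation.

```python
def compare_exons(predicted, annotated):
    """Split exons into exact matches, prediction-only and annotation-only."""
    pred, ann = set(predicted), set(annotated)
    return sorted(pred & ann), sorted(pred - ann), sorted(ann - pred)


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def partial_matches(predicted, annotated):
    """Pairs of exons that overlap without having identical boundaries."""
    return [(p, a) for p in predicted for a in annotated
            if p != a and overlaps(p, a)]


# Placeholder coordinates, NOT real data.
predicted = [(100, 250), (400, 520), (700, 810)]
annotated = [(100, 250), (410, 520), (900, 950)]

exact, only_predicted, only_annotated = compare_exons(predicted, annotated)
print("exact matches:       ", exact)
print("only in prediction:  ", only_predicted)
print("only in annotation:  ", only_annotated)
print("shifted boundaries:  ", partial_matches(predicted, annotated))
```

Exons that overlap but differ in one boundary usually point to a misplaced splice site, while exons present on only one side are missed or spurious predictions.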
If you are interested in more details on how GENSCAN works, visit the last slides of this presentation.
Work individually on the fourth programming assignment. The deadline is May 13, 2020. Upload your solution to BRUTE.