====== Tutorial 8 - Hidden Markov models II. ======

===== Problem 1 - Profile HMM construction =====

Consider the following multiple-sequence alignment:

$$
\begin{array}{ccccccccc}
\mathrm{A} & \mathrm{C} & \mathrm{D} & \mathrm{E} & \mathrm{F} & \mathrm{A} & \mathrm{C} & \mathrm{A} & \mathrm{F} \\
\mathrm{A} & \mathrm{F} & \mathrm{D} & \mathrm{A} & \mathrm{\_} & \mathrm{\_} & \mathrm{\_} & \mathrm{C} & \mathrm{F} \\
\mathrm{A} & \mathrm{\_} & \mathrm{\_} & \mathrm{E} & \mathrm{F} & \mathrm{D} & \mathrm{\_} & \mathrm{F} & \mathrm{C} \\
\mathrm{A} & \mathrm{C} & \mathrm{C} & \mathrm{E} & \mathrm{F} & \mathrm{\_} & \mathrm{\_} & \mathrm{A} & \mathrm{C} \\
\mathrm{A} & \mathrm{D} & \mathrm{D} & \mathrm{E} & \mathrm{F} & \mathrm{A} & \mathrm{A} & \mathrm{A} & \mathrm{F}
\end{array}
$$

Use threshold $\theta = 2$ to ignore columns with too many gaps. Construct a profile HMM based on this multiple-sequence alignment.

===== Problem 2 - Pfam database search =====

Imagine that you have found a new protein located in human heart muscle and in some other strong muscles in the body. You believe that the protein is connected to the ability of muscles to produce energy. Use either the [[http://www.ebi.ac.uk/interpro/|Pfam database]] or the [[https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan|EBI HMMER search tool]] to identify the protein family to which the protein belongs.

{{ :courses:bin:tutorials:hmmer.png?nolink&400 |}}

The protein sequence is below.

  MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH
  PGDFGADAQGAMNKALELFRKDMASNYKELGFQG

You should obtain an alignment similar to the following. Answer the questions below.

{{ :courses:bin:tutorials:pfams_search.png?nolink&800 |}}

Which protein family does your protein belong to? What is this family responsible for, i.e., what is its purpose?

The quality of a search hit is quantified by the so-called //$E$-value// (sometimes called the Expect value). Find out what this quantity means.
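As a numerical aid for the $E$-value question, the following minimal sketch shows two commonly quoted relations. It assumes the Poisson model used in BLAST statistics, under which $p = 1 - e^{-E}$, and the HMMER convention that an $E$-value is the per-comparison $p$-value multiplied by the number of targets searched. The function names and the ''n_targets'' parameter are illustrative only and are not part of any tool's API.

```python
import math


def pvalue_from_evalue(e_value):
    """P(at least one chance hit) under a Poisson model: p = 1 - exp(-E)."""
    # expm1 keeps precision for very small E, where 1 - exp(-E) would round badly.
    return -math.expm1(-e_value)


def evalue_from_pvalue(p_value, n_targets):
    """Expected number of chance hits when n_targets comparisons are made.

    n_targets is a hypothetical name for the database size (number of
    sequences or models searched).
    """
    return p_value * n_targets


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(f"E = {e:g}  ->  p = {pvalue_from_evalue(e):.6g}")
    print(evalue_from_pvalue(1e-6, 20000))
```

For small values the two quantities are nearly equal, while for large $E$ the $p$-value saturates at 1. An $E$-value can exceed 1, but a $p$-value cannot.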
What is the difference between the $E$-value and the $p$-value? Based on the reported $E$-value, do you believe that the protein really belongs to the family reported by the database search?

The protein is, in fact, not new. Use the [[https://www.uniprot.org/blast/|UniProt BLAST]] tool to find the true name of the protein. Knowing the protein name, answer the following questions:

  * What is the purpose of the protein?
  * Do we need this protein, i.e., is a malfunction of this protein lethal?
  * Why is it good for organisms to have several proteins capable of the same thing? What do we call two proteins with the same origin?

This protein was not chosen arbitrarily. A Nobel Prize was awarded in 1962 for research on this protein. Who received the prize and why? You will talk more about this field in the second part of the course.

===== Problem 3 - Exon prediction =====

There are many hidden Markov model tools capable of predicting introns and exons. Their strength lies in modeling the semantics of genes, which gives them good accuracy. Consider the sequence below (you can also download it as a {{ :courses:bin:tutorials:hmm_input.txt |FASTA file}}).

>NG_008301.1:5001-8050 Homo sapiens keratin 16 (KRT16), RefSeqGene on chromosome 17 AGTTAGGAGGGCCCCGCCTTCCCCAGCTGCATATAAAGGTCTCTGGGGTTGGAGGCAGCCACAGCACGCT CTCAGCCTTCCTGAGCACCTTTCCTTCTTTCAGCCAACTGCTCACTCGCTCACCTCCCTCCTTGGCACCA TGACCACCTGCAGCCGCCAGTTCACCTCCTCCAGCTCCATGAAGGGCTCCTGCGGCATCGGAGGCGGCAT CGGGGGCGGCTCCAGCCGCATCTCCTCCGTCCTGGCCGGAGGGTCCTGCCGTGCCCCCAGCACCTACGGG GGCGGCCTGTCTGTCTCCTCTCGCTTCTCCTCTGGGGGAGCCTGCGGGCTGGGGGGCGGCTATGGCGGTG GCTTCAGCAGCAGCAGCAGCTTTGGTAGTGGCTTCGGGGGAGGATATGGTGGTGGCCTTGGTGCTGGCTT CGGTGGTGGCTTGGGTGCTGGCTTTGGTGGTGGTTTTGCTGGTGGTGATGGGCTTCTGGTGGGCAGTGAG AAGGTGACCATGCAGAACCTCAATGACCGCCTGGCCTCCTACCTGGACAAGGTGCGTGCTCTGGAGGAGG CCAACGCCGACCTGGAAGTGAAGATCCGTGACTGGTACCAGAGGCAGCGGCCCAGTGAGATCAAAGACTA CAGTCCCTACTTCAAGACCATCGAGGACCTGAGGAACAAGGTGGGTGACTTTGGTGTATGGAGCACTGAG AGAGGCTGGGGCTACAGTGGCCCTTGGGATACCTCTTTTTAGCAATTACACTTTACAAACAGGGAGACTG GGCACCTTTGGGGAGTGGCCAGGATCACCCAGGGAAGTGGTAGCAGAGGGTCCCTTTTCAGTATCTCTGT GCCCGGACTGGGGCTGTTACCCTAAATCTCTTATTTCCTTCAAGGGTTCAGCTGCAAGTTCAGCTTCCCT GCCTTGGGCCCAGGAAGGGGGTGATCGGGATGGAGTGCATCCCTACGTAGCCTGAGCTGGTGGAGAAGGC ATGCCAGCCCTGCCAGCCAGAAGACTTCCAGATTTGGGGCGGTTCCTTTTGCCCCTTTCTGCCTTTCATG CTCAAGTAGTAAGGTCCTTGGCTGACCAGGGCTCCTGTCCTCCATCCCCACTCCAGATCATTGCGGCCAC CATTGAGAATGCGCAGCCCATTTTGCAGATTGACAATGCCAGGCTGGCAGCCGATGACTTCAGGACCAAG TGAGCAGCCAGCATGGTGGGCTGGGGGCAGAGGGCAAGGGACAAAGAGTGGGGCGGTCCACCCAGCAGGG CCAGCAGACCCCGAGCCTCAGAATCCTCAGGGCTGCAGCCTGAGGACCTGACCTCTGTCCTGCCAGGTAT GAGCATGAACTGGCCCTGCGGCAGACTGTGGAGGCCGACGTCAATGGCCTGCGCCGGGTGTTGGATGAGC TGACCCTGGCCAGGACTGACCTGGAGATGCAGATCGAAGGCCTGAAGGAGGAGCTGGCCTACCTGAGGAA GAACCACGAGGAGGTACGGTCGCTGCTGGCTTCCGGGGTGGGAGGCTGGTTTGGTGGGGTTGCCAGATGC ACCCAGGGCCAGGAGAGAAGTCTGCTGAACTGACCGCCTCCTGCCATCCCTTCCCAGGAGATGCTTGCTC TGAGAGGTCAGACCGGCGGAGATGTGAACGTGGAGATGGATGCTGCACCTGGCGTGGACCTGAGCCGCAT CCTGAATGAGATGCGTGACCAGTACGAGCAGATGGCAGAGAAAAACCGCAGAGACGCTGAGACCTGGTTC CTGAGCAAGGTGGGGCTCGGGCCCGCAGTGAGCCTGCAGCACTTCCCAGCTGGGGGCTTTGGGAGAGCCT 
CACCTTTCACTCTGCTTTCCTGCCTCAGACCGAGGAGCTGAACAAAGAAGTGGCCTCCAACAGCGAACTG GTACAGAGCAGCCGCAGTGAGGTGACGGAGCTCCGGAGGGTGCTCCAGGGCCTGGAGATTGAGCTGCAGT CCCAGCTCAGCATGGTATGAAGGACCCAGCACAGCAGCAGCCCCCAAGTCACCAGTAATGGCCACCACCC CCTCAAAAAGCCACAGTCTAGTTCCACCTTTCTTTTCTCAGGATGGGACCAGGGGACTCATGGGACCGTT ATATAGATAGAGAAACTAAGCCCTAGAATAGTGGGCTAGCTTTTCTCCATATTGTCTGGCCCATCAGTAC CCCAACTGGGATCAAAATCCAGGCATCTCTCAAAAAACATGCCCAGAGACCTGGAGGAACAGGAGTGACC ACCTCCATGGACTCTTTTTCTCTCTCTCACTTGCAGAAAGCATCCCTGGAGAACAGCCTGGAGGAGACCA AAGGCCGCTACTGCATGCAGCTGTCCCAGATCCAGGGACTGATTGGCAGTGTGGAGGAGCAGCTGGCCCA GCTACGCTGTGAGATGGAGCAGCAGAGCCAGGAGTACCAGATCTTGCTGGATGTGAAGACGCGGCTGGAG CAGGAGATTGCCACCTACCGCCGCCTGCTGGAGGGCGAGGATGCCCAGTGAGTCCCAGGCCCCTCAGTTC TGCCTCCCAGACCCTTTAGCCCCCCTGCTGCTCTCAGCACAACTGACTGCCCTGCTTTTTCTCTCCCACA GCCTTTCCTCCCAGCAAGCATCTGGCCAATCCTATTCTTCCCGCGAGGGTAAGGCTTCTGAGGCTCCCCG GCACTGCAGCCCCTCTGCCTGTTTCCATGGAGTGGGGGCTGGGCCCTTCTCCTCAGAGCTCCCAGCCCTC CCTTCTCCCTGCCCTGGAGTCAGCTTAGCTCTCAGACCCCTTCTCACCTCCTCTTCTCTCTCCCACAGTC TTCACCTCCTCCTCGTCCTCTTCGAGCCGTCAGACCCGGCCCATCCTCAAGGAGCAGAGCTCATCCAGCT TCAGCCAGGGCCAGAGCTCCTAGAACTGAGCTGCCTCTACCACAGCCTCCTGCCCACCAGCTGGCCTCAC CTCCTGAAGGCCCGGGTCAGGACCCTGCTCTCCTGGCGCAGTTCCCAGCTATCTCCCCTGCTCCTCTGCT GGTGGTGGGCTAATAAAGCTGACTTTCTGGTTGATGCAAA

The first tool we will use is HMMgene, available at [[https://services.healthtech.dtu.dk/services/HMMgene-1.1/]]. Use this tool to classify the introns and exons of the sequence above.

{{ :courses:bin:tutorials:hmmgene.png?nolink&600 |}}

In the output, you may notice that the program claims you submitted two sequences, even though you provided only one. Why does the program report two of them?

  ## gff-version 1
  ## date: Thu Apr 4 10:57:50 2019
  ## HMMgene1.1e (human model sim10gc.C.bsmod)
  # SEQ: 3050 (+) A:593 C:922 G:903 T:632
  HMMgene1.1e firstex 140 670
  # ...
  # ..........
  # SEQ: 3050 (-) A:632 C:903 G:922 T:593
  HMMgene1.1e exon_1 1824 1941 0.293 - 1
  # ...
  # ..........

How many introns and exons were predicted?

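Counting exons and introns by hand is error-prone, so the following minimal sketch tallies them from GFF-style text. It assumes whitespace-separated GFF version 1 columns (seqname, source, feature, start, end, score, strand, frame) and that all exons on one strand belong to a single predicted gene. The sample rows are made-up numbers for illustration and are not real HMMgene output. The feature names ''lastex'' and ''singleex'' are assumptions, because only ''firstex'' and ''exon_1'' appear in the excerpt above.

```python
def is_exon(feature):
    # "firstex" and "exon_1" appear in the tutorial's excerpt; "lastex" and
    # "singleex" are assumed names for the remaining exon types.
    return feature in {"firstex", "lastex", "singleex"} or feature.startswith("exon")


def parse_exons(gff_text):
    """Return (strand, start, end) for every exon-like row; skip comments."""
    exons = []
    for line in gff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Need at least 7 columns, with numeric start/end in columns 4 and 5.
        if len(fields) < 7 or not (fields[3].isdigit() and fields[4].isdigit()):
            continue
        if is_exon(fields[2]):
            exons.append((fields[6], int(fields[3]), int(fields[4])))
    return exons


def introns_by_strand(exons):
    """Introns are the gaps between consecutive exons on the same strand."""
    result = {}
    for strand in sorted({s for s, _, _ in exons}):
        spans = sorted((a, b) for s, a, b in exons if s == strand)
        result[strand] = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
        ]
    return result


# Hypothetical rows, NOT real HMMgene output.
SAMPLE = """\
## gff-version 1
# SEQ: made-up example
seq1 demo firstex 10 50 0.9 + 0
seq1 demo exon_1 100 150 0.8 + 1
seq1 demo lastex 200 260 0.7 + 2
seq1 demo exon_1 300 340 0.3 - 1
"""

if __name__ == "__main__":
    exons = parse_exons(SAMPLE)
    print("exons:", len(exons))
    print("introns:", introns_by_strand(exons))
```

To use it on real output, paste the text returned by the prediction tool in place of ''SAMPLE'' and check that the column layout matches the assumption above.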
So far, we know nothing about the accuracy of the result. We shall, therefore, use a second tool to get additional information. Use GENSCAN at [[http://hollywood.mit.edu/GENSCAN.html]] to predict introns and exons for the same sequence.

{{ :courses:bin:tutorials:genscan.png?nolink&600 |}}

Now compare the results. From the sequence annotation, you can see that the protein is keratin, a building block of hair, nails, and the cytoskeleton. Use the [[https://www.ncbi.nlm.nih.gov/nuccore/|NCBI nucleotide archive]] to search for the sequence and its annotation. Compare the predicted introns and exons with the official annotation. How many differences did you find? What are the differences between the predicted peptide sequence and the one in the NCBI database?

{{ :courses:bin:tutorials:keratin.png?nolink&800 |}}

If you are interested in more details on how GENSCAN works, see the last slides of [[https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-096-algorithms-for-computational-biology-spring-2005/lecture-notes/lecture7.pdf|this presentation]].

===== Assignment 4 - Gene finding =====

Work individually on the [[../assignments/hw4|fourth programming assignment]]. The deadline is May 10th, 2022. Upload your solution to BRUTE.