Protein function prediction aims to assign biological or biochemical roles to proteins. Predictions can be based on sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. In this assignment you will focus purely on protein sequences and work under the assumption that proteins with similar sequences are likely to share function.
Protein function is a very general concept. In this assignment, you will interpret it as an annotation drawn from the closed vocabulary of terms contained in The Gene Ontology (GO). The Gene Ontology provides a taxonomical hierarchy of well-defined terms divided into three main categories: molecular function, biological process, and cellular component. Your task will be to assign appropriate terms from this ontology to proteins whose function (the set of annotating GO terms) is unknown (or rather temporarily hidden).
This assignment is a (very) simplified version of The Critical Assessment of protein Function Annotation (CAFA) challenge.
The input dataset is available here. It contains a FASTA file with 66,841 proteins and an annotation file with 386,197 links between proteins and GO terms. On average, each protein is annotated with approximately 6 terms. Only leaf annotations are provided, that is, only the most specific terms assigned to each protein. If a protein is annotated with a specific term, it is by default also annotated with all of that term's generalizations (ancestor terms); however, exceptions exist.
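The default propagation rule described above can be sketched as follows. This is a minimal illustration, not the assignment's reference code: the toy ontology `PARENTS`, the term identifiers, and the protein names are all made up, and the sketch ignores both the exceptions mentioned above and the different relationship types present in the real Gene Ontology.

```python
# Hypothetical toy ontology: each term maps to the set of its parent terms.
# The real Gene Ontology is a much larger DAG with several relation types.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:A1": {"GO:A"},
    "GO:AB": {"GO:A", "GO:B"},  # a term may have more than one parent
}


def ancestors(term, parents):
    """Return all strict ancestors of `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(leaf_annotations, parents):
    """Expand each protein's leaf terms with all of their ancestors."""
    full = {}
    for protein, terms in leaf_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[protein] = expanded
    return full


# Made-up leaf annotations in the spirit of the annotation file.
leaf_annotations = {"P1": {"GO:A1"}, "P2": {"GO:AB"}}
full_annotations = propagate(leaf_annotations, PARENTS)
```

With the propagated sets, a protein counts as a positive example for a general term whenever any of its leaf terms is a descendant of that term.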
For your analysis, work with only a small number of GO terms (3 or 4). When addressing a classification task, try to discriminate between GO terms and investigate whether distantly related GO terms (whose shortest path in the ontology passes through the root node) are easier to separate than closely related GO terms connected by an ancestor-descendant relationship.
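When choosing the 3 or 4 terms, it may help to check how each pair is related in the ontology. The sketch below uses a made-up toy ontology and a simplifying assumption: two terms are treated as "distant" when the root is their only common ancestor, which is a proxy for "the shortest path passes through the root" and may disagree with it in a DAG where terms share descendants. The function name `relationship` and its labels are hypothetical.

```python
# Hypothetical toy ontology: each term maps to the set of its parent terms.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:A1": {"GO:A"},
    "GO:A2": {"GO:A"},
}


def ancestors(term, parents):
    """Return all strict ancestors of `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relationship(a, b, parents, root):
    """Label a pair of terms by how they are connected in the ontology."""
    if a == b:
        return "identical"
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    if a in anc_b or b in anc_a:
        return "ancestor-descendant"
    # Assumption: "distant" means the root is the only shared ancestor.
    common = (anc_a | {a}) & (anc_b | {b})
    if common == {root}:
        return "distant"
    return "related"  # e.g. siblings sharing a non-root ancestor


pairs = {
    ("GO:A1", "GO:A"): relationship("GO:A1", "GO:A", PARENTS, "GO:root"),
    ("GO:A1", "GO:A2"): relationship("GO:A1", "GO:A2", PARENTS, "GO:root"),
    ("GO:A1", "GO:B"): relationship("GO:A1", "GO:B", PARENTS, "GO:root"),
}
```

Note that proteins annotated with a descendant term are, by default, also annotated with its ancestor, so an ancestor-descendant pair does not define disjoint classes without further design decisions.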
When searching for motifs, examine how the motifs identified for a certain GO term relate to its generalizations (ancestor terms), sibling terms and biologically distinct GO terms with no direct ontological link. This may help assess the specificity and transferability of the discovered motifs.
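One simple way to compare a motif across term sets is to measure the fraction of sequences in each set that contain it. The sketch below assumes the motif can be written as a regular expression; the motif `C..C`, the set names, and the sequences are invented for illustration and do not come from the dataset.

```python
import re


def motif_coverage(pattern, sequences):
    """Fraction of sequences containing at least one match of `pattern`."""
    if not sequences:
        return 0.0
    regex = re.compile(pattern)
    hits = sum(1 for seq in sequences if regex.search(seq))
    return hits / len(sequences)


# Hypothetical motif: two cysteines separated by any two residues.
MOTIF = r"C..C"

# Made-up protein sets standing in for the term of interest,
# a sibling term, and an unrelated term.
protein_sets = {
    "target term": ["MKCAACLL", "GGCTSCAA", "CPYCMM"],
    "sibling term": ["MKCAACLL", "GGGSSS"],
    "unrelated term": ["AAAAGG", "LLKKMM"],
}

coverage = {name: motif_coverage(MOTIF, seqs)
            for name, seqs in protein_sets.items()}
```

A motif with high coverage in the target set and low coverage elsewhere would be specific to that term, while similar coverage in ancestor or sibling sets would suggest it reflects a more general function.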
A guide to understanding the protein annotation file is outlined in this Python script. An additional R script demonstrates how to preprocess the FASTA file.
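For orientation, reading a FASTA file in plain Python can be sketched as below. This is not the linked script: the function name `read_fasta` and the example records are assumptions, and the header layout of the actual dataset may differ from the one shown.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example; for a real file, pass open(path) instead.
example = io.StringIO(">P1 hypothetical protein\nMKT\nAYI\n>P2\nGGS\n")
records = dict(read_fasta(example))
```

Because the function is a generator, it can stream through a large file without holding every sequence in memory at once.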
You are supposed to solve ONE OF THE FOLLOWING tasks:
* protein annotation based on sequential similarity
* motif discovery in functionally related protein sets
* protein annotation with recurrent neural networks OR transformers
The assignment is graded as follows:
* 4 points for experimental design
* 2 points for source code
* 4 points for final report