Assignment 5: PROTEIN FUNCTION PREDICTION
10 points.
Deadline: Wednesday, May 28th, 23:59
Late submission penalty: -1 point per day, but no more than 8 points.
Work individually.
Submit
Motivation
Protein function prediction aims to assign biological or biochemical roles to proteins. Prediction can be based on sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. In this assignment you will focus purely on protein sequences and work with the assumption that proteins with similar sequences are likely to share function.
Protein function is a very general concept. In this assignment, you will interpret it as an annotation that comes from a closed vocabulary of terms contained in The Gene Ontology. The Gene Ontology provides a taxonomical hierarchy of well-defined terms divided into three main categories: molecular function, biological process and cellular component. Your task will be to assign appropriate terms from this ontology to proteins whose function (the set of annotating GO terms) is unknown (or rather temporarily hidden).
This assignment is a (very) simplified version of The Critical Assessment of protein Function Annotation (CAFA) challenge.
The input dataset is here. It contains a FASTA file with 66,841 proteins and an annotation file with 386,197 links between proteins and GO terms. Each protein is annotated with approximately 6 terms on average. Only leaf annotations are provided (if a protein is annotated with a specific term, it is by default also annotated with all of its generalizations; however, exceptions exist).
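The leaf-only convention means that the full annotation of a protein has to be reconstructed from the ontology. A minimal sketch of this propagation follows, assuming the ontology is available as a child-to-parents mapping. The term names and the `toy_parents` structure are made up for illustration (they are not real GO identifiers), and the exceptions mentioned above are ignored.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as {child: [parents]}."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each protein's leaf terms with all of their ancestors."""
    full = {}
    for protein, terms in annotations.items():
        closed = set(terms)
        for term in terms:
            closed |= ancestors(term, parents)
        full[protein] = closed
    return full


# Toy ontology with hypothetical term names (not real GO identifiers).
toy_parents = {
    "binding": ["root"],
    "catalysis": ["root"],
    "ion_binding": ["binding"],
    "metal_ion_binding": ["ion_binding"],
}
toy_annotations = {"P1": {"metal_ion_binding"}, "P2": {"catalysis"}}

full = propagate(toy_annotations, toy_parents)
print(sorted(full["P1"]))
```

After propagation, general terms become much more abundant than specific ones, which matters when picking terms with enough proteins.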
Work with a couple of GO terms only (3 or 4). When solving a classification task, try to discriminate GO terms and see whether distant GO terms (the shortest ontology path between them contains the root node) separate better than related GO terms (ancestor relationship). When searching for motifs, try to figure out how well a set of motifs found for a certain GO term fits, for example, its relatives (ancestor or sibling ontology nodes) and a very different term (no direct link). A way to preprocess the FASTA file is outlined in this R file.
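For readers who prefer Python to the R file, the preprocessing can be sketched as follows. This is only an illustration under assumptions: the protein ID is taken to be the first whitespace-separated token of the FASTA header, and the annotations are assumed to be already loaded into a dictionary mapping protein IDs to sets of terms. The function names and the toy data are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (protein_id, sequence) pairs from a FASTA text stream."""
    protein_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            # Assumption: the ID is the first whitespace-separated header token.
            protein_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def sequences_for_term(term, annotations, sequences):
    """Select the sequences of all proteins annotated with `term`."""
    return {pid: seq for pid, seq in sequences.items()
            if term in annotations.get(pid, ())}


# Toy input standing in for the real FASTA and annotation files.
toy_fasta = io.StringIO(">P1 example\nMKT\nAYI\n>P2\nGAVL\n>P3\nMSTN\n")
sequences = dict(read_fasta(toy_fasta))
toy_annotations = {"P1": {"term_a"}, "P2": {"term_b"}, "P3": {"term_a", "term_b"}}

selected = sequences_for_term("term_a", toy_annotations, sequences)
print(selected)
```

Note that a protein such as P3 carries both terms; whether to drop such proteins or treat the task as multi-label is a design decision worth stating in the report.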
Tasks
You are supposed to solve ONE OF THE FOLLOWING tasks:
* protein annotation based on sequence similarity
pick a couple of sufficiently abundant GO terms,
get their protein sequences,
use BLAST to compute their pairwise sequence similarity,
use kNN to classify the proteins into classes defined by the previously picked GO terms,
do not forget to use cross-validation or at least a train/test split,
draw a conclusion, compare your results with expectations (a small literature review required)
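The kNN step of the first task can be sketched as below. The sketch assumes the BLAST results have already been parsed into a dictionary mapping (query, subject) pairs to a similarity value such as the bit score, with missing pairs (no hit) treated as zero. The function names, the toy proteins and the toy scores are all hypothetical, and the folds are formed by simple interleaving rather than proper stratification.

```python
from collections import Counter


def knn_predict(query, train_ids, labels, similarity, k=3):
    """Majority vote among the k training proteins most similar to `query`."""
    ranked = sorted(train_ids,
                    key=lambda pid: similarity.get((query, pid), 0.0),
                    reverse=True)
    votes = Counter(labels[pid] for pid in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validate(ids, labels, similarity, k=3, folds=3):
    """Accuracy over all proteins, each predicted while held out of training."""
    correct = 0
    for fold in range(folds):
        test_ids = ids[fold::folds]  # every `folds`-th protein forms one fold
        train_ids = [pid for pid in ids if pid not in test_ids]
        for query in test_ids:
            if knn_predict(query, train_ids, labels, similarity, k) == labels[query]:
                correct += 1
    return correct / len(ids)


# Toy data: two classes, high similarity within a class, low across classes.
labels = {"A1": "term_a", "A2": "term_a", "A3": "term_a",
          "B1": "term_b", "B2": "term_b", "B3": "term_b"}
ids = ["A1", "B1", "A2", "B2", "A3", "B3"]
similarity = {(p, q): (100.0 if labels[p] == labels[q] else 10.0)
              for p in ids for q in ids if p != q}

accuracy = cross_validate(ids, labels, similarity, k=3, folds=3)
print(accuracy)
```

Working directly on similarities avoids building an explicit feature space; the price is that BLAST must be run between test and training proteins only, so that no test labels leak into the neighbour search.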
* motif discovery in functionally related protein sets
install MEME for motif discovery,
pick a couple of sufficiently abundant GO terms,
get their protein sequences
use STREME to find motifs that match GO terms
example call searching for motifs that appear in GO1_train but not in GO2_train:
the motifs found will be stored into streme_out/streme.txt,
interpret the motifs, compare them, evaluate their scores,
use FIMO to detect the motifs in the test set
example call: fimo streme_out/streme.txt CAFA3_training_data/GO1_test.fasta
the motif scores will be stored into fimo_out/fimo.tsv,
use the scores to classify the proteins into classes defined by the previously picked GO terms
draw a conclusion, compare your results with expectations (a small literature review required).
R auxiliary functions available here
STREME and FIMO R calls, creation of an input file for supervised learning from FIMO score files (fimo.tsv),
do not hesitate to experiment with settings (e.g., STREME parameters thres and time), FIMO max aggregation etc.
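As a Python counterpart to the R helpers, turning a FIMO score file into an input for supervised learning with max aggregation might look like the sketch below. It assumes the tab-separated layout with the columns `motif_id`, `sequence_name` and `score` and with trailing comment lines starting with `#`; check this against the fimo.tsv your MEME version produces. The motif IDs, the toy rows and the choice of 0.0 for proteins without a reported match are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict


def max_scores(handle):
    """Return {sequence_name: {motif_id: best score}} from FIMO TSV text."""
    # Skip blank lines and the comment lines FIMO appends after the table.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    table = defaultdict(dict)
    for row in csv.DictReader(rows, delimiter="\t"):
        seq, motif, score = row["sequence_name"], row["motif_id"], float(row["score"])
        if score > table[seq].get(motif, float("-inf")):
            table[seq][motif] = score  # max aggregation over all occurrences
    return table


def feature_matrix(table, sequence_ids, motif_ids, missing=0.0):
    """One row per protein, one column per motif; `missing` marks no match."""
    return [[table.get(seq, {}).get(motif, missing) for motif in motif_ids]
            for seq in sequence_ids]


# Toy file content with hypothetical motif IDs M1 and M2.
toy_tsv = io.StringIO(
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "M1\t\tP1\t3\t8\t+\t12.5\t1e-5\t0.01\tACDEFG\n"
    "M1\t\tP1\t20\t25\t+\t9.0\t1e-4\t0.05\tACDEFA\n"
    "M2\t\tP2\t1\t6\t+\t7.5\t1e-4\t0.05\tGHIKLM\n"
    "\n"
    "# toy footer comment\n"
)
table = max_scores(toy_tsv)
X = feature_matrix(table, ["P1", "P2", "P3"], ["M1", "M2"])
print(X)
```

Other aggregations (sum of scores, number of occurrences) fit the same structure by changing the update inside `max_scores`.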
* protein annotation with recurrent neural networks OR transformers
pick a couple of sufficiently abundant GO terms,
get their protein sequences,
use a recurrent neural network to classify selected proteins
alternatively, perform the same classification with pre-trained protein language models
do not forget to use cross-validation or at least a train/test split,
draw a conclusion
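Before a recurrent network can process protein sequences, they have to be turned into fixed-length integer vectors. A minimal sketch of such an encoding follows; the index layout (0 for padding, 1 for any non-standard residue), the truncation to `max_len` and the names used are assumptions made for illustration, not requirements of the assignment. Pre-trained protein language models ship their own tokenizers, so this step applies to the self-trained RNN variant only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD, UNKNOWN = 0, 1                    # reserved indices (assumed layout)
INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len):
    """Map a sequence to `max_len` integers, truncating or right-padding."""
    ids = [INDEX.get(aa, UNKNOWN) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def encode_batch(sequences, max_len):
    """Encode several sequences into a rectangular list of lists."""
    return [encode(seq, max_len) for seq in sequences]


batch = encode_batch(["MKV", "MXKVLA"], max_len=5)
print(batch)
```

The resulting integers would typically be fed into an embedding layer, with the padding index masked so that it does not influence the recurrent state.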
Scoring
This is an experimental research assignment rather than a programming one. Please note that the main emphasis is on the experimental design and the report. The evaluation criteria therefore differ from those of the previous assignments.
* 4 points for experimental design
the overall pipeline fulfills the task requirements,
conceptual correctness of the solution,
meaningful evaluation (scores, classification measures).
* 2 points for source code
correctness, documentation, clarity, accessibility (ease of use with different FASTA inputs, GO terms, etc.).
* 4 points for final report
a description of your solution is needed, the individual steps should be briefly motivated,
evaluation (graphs, quality measures),
you can study e.g., the influence of train set size, the role of model parameters, relatedness between the GO terms,
choose a proper quality measure (accuracy, AUROC, precision, recall, F1, …)
review, scope of discussion that goes beyond your implementation
CAFA motivation,
interpretation of the graphs,
difference between classification that we do here and protein annotation that is expected in general,
the report is supposed to be self-contained (understandable without the source code).
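The quality measures named in the report requirements can be computed from the confusion counts of one GO term against the rest. The sketch below shows accuracy, precision, recall and F1 for a single positive class; the function name and the toy labels are hypothetical, and measures with an empty denominator are set to 0.0 by convention.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy predictions for two hypothetical classes.
y_true = ["term_a", "term_a", "term_a", "term_b", "term_b", "term_b"]
y_pred = ["term_a", "term_a", "term_b", "term_a", "term_b", "term_b"]

metrics = binary_metrics(y_true, y_pred, positive="term_a")
print(metrics)
```

With unbalanced GO terms, accuracy alone can be misleading, which is one reason to report precision and recall (or AUROC) alongside it.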