Assignment 5: PROTEIN FUNCTION PREDICTION

10 points.
Deadline: Wednesday, May 27th, 23:59
Late submission penalty: -1 point per day, up to a maximum of 8 points.
Submit to BRUTE.
Work individually.
Submit
- a PDF report that describes your solution,
- your source code in a language of your choice.

Motivation

Protein function prediction aims to assign biological or biochemical roles to proteins. Prediction can be based on sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. In this assignment you will focus purely on protein sequences and work with the assumption that proteins with similar sequences are likely to share function.

Protein function is a very general concept. In this assignment, you will interpret it as an annotation that comes from a closed vocabulary of terms contained in The Gene Ontology. The Gene Ontology provides a taxonomical hierarchy of well-defined terms divided into three main categories: molecular function, biological process, and cellular component. Your task will be to assign appropriate terms from this ontology to proteins whose function (the set of annotating GO terms) is unknown (or rather temporarily hidden).

This assignment is a (very) simplified version of The Critical Assessment of protein Function Annotation (CAFA) challenge.

Inputs

The input dataset is here. It contains a fasta file with 66,841 proteins as well as an annotation file with 386,197 links between proteins and GO terms. On average, each protein is annotated with approximately 6 terms. Only leaf annotations are provided: if a protein is annotated with a specific term, it is by default also annotated with all of its generalizations, although exceptions exist.

Work with only a few GO terms (3 or 4). When solving a classification task, try to discriminate between GO terms and see whether distant GO terms (the shortest ontology path between them contains the root node) are separated better than related GO terms (ancestor relationship). When searching for motifs, try to figure out how well a set of motifs found for a certain GO term fits, for example, its generalizations (ancestor or sibling ontology nodes) and a very different term (no direct link). A way to preprocess the fasta file is outlined in this R file.
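
The selection of sufficiently abundant GO terms can be sketched in Python as follows. This is a minimal illustration, not the referenced R preprocessing script: the function names are hypothetical, the annotation file is assumed to hold one tab-separated protein ID and GO term per line, and the identifiers in the toy data are placeholders.

```python
import io
from collections import Counter, defaultdict


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the identifier.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_annotations(handle):
    """Map each GO term to the set of proteins annotated with it.

    ASSUMPTION: one link per line, tab-separated, protein ID first, GO term second.
    """
    proteins_by_term = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            proteins_by_term[fields[1]].add(fields[0])
    return proteins_by_term


def most_abundant_terms(proteins_by_term, n):
    """Return the n GO terms that annotate the largest number of proteins."""
    counts = Counter({term: len(p) for term, p in proteins_by_term.items()})
    return [term for term, _ in counts.most_common(n)]


# Toy data with placeholder identifiers (not taken from the real dataset).
toy_fasta = io.StringIO(">P1 demo protein\nMKT\nAYI\n>P2\nMKV\n>P3\nGGS\n")
toy_links = io.StringIO("P1\tGO:0000001\nP2\tGO:0000001\nP3\tGO:0000002\n")

sequences = dict(read_fasta(toy_fasta))
proteins_by_term = read_annotations(toy_links)
selected = most_abundant_terms(proteins_by_term, 1)
print(selected, sorted(proteins_by_term[selected[0]]))
```

Because only leaf annotations are provided, counting links directly as above ignores proteins that carry a term only through one of its descendants; propagating annotations up the ontology first would be a separate step.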

Tasks

You are supposed to solve ONE OF THE FOLLOWING tasks:

* protein annotation based on sequential similarity

pick a couple of sufficiently abundant GO terms,
get their protein sequences,
use BLAST to find their sequential similarity,
- running Blast locally is recommended,
use kNN to classify the proteins into classes defined by the previously picked GO terms
- see this BIN lecture to understand BLAST-kNN,
do not forget to use cross-validation, or at least a train/test split
- leave-one-out CV is frequently applied with kNN,
draw a conclusion, compare your results with expectations (a small literature review required)
- hints: Zhou et al., Hamp et al.
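
One possible shape of the BLAST-kNN step with leave-one-out evaluation is sketched below in Python. It assumes an all-against-all search was already run and saved in BLAST tabular format (`-outfmt 6`, where column 12 is the bit score). The bit-score-weighted vote is only one of several reasonable voting schemes, and the function names, labels, and toy hits are hypothetical.

```python
from collections import defaultdict


def parse_blast_tabular(lines):
    """Collect bit scores from BLAST tabular output (-outfmt 6).

    Column 1 is the query ID, column 2 the subject ID, column 12 the bit score.
    """
    hits = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep the best-scoring alignment per (query, subject) pair.
        if bitscore > hits[query].get(subject, 0.0):
            hits[query][subject] = bitscore
    return hits


def knn_predict(query, hits, labels, k):
    """Predict a label by a bit-score-weighted vote of the k best hits.

    The query itself is excluded from its neighbours, which is what makes
    the evaluation below a leave-one-out scheme.
    """
    neighbours = [(score, subject)
                  for subject, score in hits.get(query, {}).items()
                  if subject != query and subject in labels]
    neighbours.sort(reverse=True)
    votes = defaultdict(float)
    for score, subject in neighbours[:k]:
        votes[labels[subject]] += score
    if not votes:
        return None  # no usable hit: the protein stays unclassified
    return max(votes, key=votes.get)


def leave_one_out_accuracy(hits, labels, k):
    """Fraction of proteins whose predicted label matches the true one."""
    correct = sum(knn_predict(p, hits, labels, k) == labels[p] for p in labels)
    return correct / len(labels)


def row(query, subject, bits):
    """Build one toy outfmt-6 line; all columns except the IDs and bit score are filler."""
    return "\t".join([query, subject, "90.0", "100", "10", "0",
                      "1", "100", "1", "100", "1e-20", str(bits)])


toy_hits = [row("A1", "A1", 200), row("A1", "A2", 150), row("A1", "B1", 30),
            row("A2", "A1", 150), row("A2", "A2", 210),
            row("B1", "B1", 190), row("B1", "B2", 120), row("B1", "A1", 30),
            row("B2", "B1", 120), row("B2", "B2", 180)]
labels = {"A1": "GO_A", "A2": "GO_A", "B1": "GO_B", "B2": "GO_B"}

hits = parse_blast_tabular(toy_hits)
print(leave_one_out_accuracy(hits, labels, k=1))
```

Proteins without any hit are returned as `None` and counted as errors here; whether to count them that way or report them separately is a design decision worth stating in the report.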

* motif discovery in functionally related protein sets

install MEME for motif discovery
- understand motif logos,
- learn about sequence motifs and motif discovery,
- do not forget to install (Perl) dependencies too (described in the installation guide),
pick a couple of sufficiently abundant GO terms,
get their protein sequences
- split them into train and test sets,
use STREME to find motifs that match GO terms
- example call searching for motifs that appear in GO1_train but not in GO2_train:
  - streme -p CAFA3_training_data/GO1_train.fasta -n CAFA3_training_data/GO2_train.fasta --protein -thres 0.1 -time 360
- the motifs found will be stored into streme_out/streme.txt,
- interpret the motifs, compare them, evaluate their scores,
use FIMO to detect the motifs in the test set
- example call: fimo streme_out/streme.txt CAFA3_training_data/GO1_test.fasta
- the motif scores will be stored into fimo_out/fimo.tsv,
use the scores to classify the proteins into classes defined by the previously picked GO terms
- you have to repeat the previous steps several times to get motif sets for all the classes, so a script is needed,
draw a conclusion, compare your results with expectations (a small literature review required).
- hints: Lu et al., Zhou et al.
R auxiliary functions are available here
- STREME and FIMO R calls, creation of an input file for supervised learning from FIMO score files (fimo.tsv),
- do not hesitate to experiment with settings (e.g., STREME parameters thres and time), FIMO max aggregation etc.
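
The FIMO max aggregation mentioned above can be sketched in Python as follows. This is a hedged illustration rather than the provided R helper: it assumes the usual fimo.tsv column names (`motif_id`, `sequence_name`, `score`), skips the comment lines that start with `#`, and fills motifs without a reported match with 0.0, which is an arbitrary choice. The function names and toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def max_scores_from_fimo(handle):
    """Return {sequence_name: {motif_id: best score}} from a FIMO TSV stream."""
    # Skip blank lines and comment lines starting with '#'.
    rows = (line for line in handle
            if line.strip() and not line.startswith("#"))
    best = defaultdict(dict)
    for rec in csv.DictReader(rows, delimiter="\t"):
        seq, motif = rec["sequence_name"], rec["motif_id"]
        score = float(rec["score"])
        # Max aggregation: keep only the best match of each motif per sequence.
        if score > best[seq].get(motif, float("-inf")):
            best[seq][motif] = score
    return best


def feature_matrix(best, sequence_names, motif_ids, missing=0.0):
    """One row per sequence, one column per motif, for a supervised learner."""
    return [[best.get(name, {}).get(motif, missing) for motif in motif_ids]
            for name in sequence_names]


# Toy rows with placeholder motif and protein names.
toy_fimo = io.StringIO(
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\t"
    "score\tp-value\tq-value\tmatched_sequence\n"
    "1-AAK\tSTREME-1\tP1\t3\t5\t.\t8.5\t1e-4\t0.01\tAAK\n"
    "1-AAK\tSTREME-1\tP1\t10\t12\t.\t11.2\t1e-5\t0.01\tAAK\n"
    "2-GGS\tSTREME-2\tP2\t1\t3\t.\t7.0\t1e-4\t0.02\tGGS\n"
    "\n"
    "# trailing comment line\n"
)

best = max_scores_from_fimo(toy_fimo)
matrix = feature_matrix(best, ["P1", "P2", "P3"], ["1-AAK", "2-GGS"])
print(matrix)
```

Test proteins that match no motif at all (P3 in the toy data) do not appear in fimo.tsv, so the list of sequence names has to come from the test fasta file rather than from the FIMO output.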

* protein annotation with recurrent neural networks OR transformers

pick a couple of sufficiently abundant GO terms,
get their protein sequences,
use a recurrent neural network to classify selected proteins
- classes given by (non-overlapping) GO terms,
- Protein Sequence Classification with deep learning,
- other relevant papers: Lee and Nguen, Yusuf et al.,
alternatively, perform the same classification with pre-trained protein language models
- protein language modeling,
- additional references: Deep learning with proteins, Language models of protein sequences,
do not forget to use cross-validation, or at least a train/test split,
draw a conclusion
- under this option, the final report can be simplified (only a brief summary required).
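
Before a recurrent network can read protein sequences, they are typically turned into fixed-length integer vectors. The Python sketch below shows one simple way to do that. The vocabulary layout (0 for padding, 1 for any non-standard residue), the truncation rule, and the function names are assumptions made for illustration; the network itself is left to the deep learning library of your choice.

```python
# The 20 standard amino acids; anything else maps to UNKNOWN.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNKNOWN = 0, 1
TOKEN = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len):
    """Integer-encode one sequence, truncating or right-padding to max_len."""
    ids = [TOKEN.get(aa, UNKNOWN) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def encode_batch(sequences, max_len):
    """Encode several sequences into equally long rows."""
    return [encode(s, max_len) for s in sequences]


batch = encode_batch(["ACX", "ACDEFG"], max_len=5)
print(batch)
```

Truncating long proteins discards information, so the chosen maximum length is itself a parameter whose influence can be examined.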

Scoring

This is an experimental and research assignment rather than a programming one. The main emphasis is on the experimental design and the report, so the evaluation differs from the previous assignments.

* 4 points for experimental design

the overall pipeline fulfills the task requirements,
conceptual correctness of the solution,
- the design avoids biases, overfitting, etc.
meaningful evaluation (scores, classification measures).

* 2 points for source code

correctness, documentation, clarity, accessibility (ease of use with different fasta inputs, GO terms, etc.).

* 4 points for final report

a description of your solution is needed, the individual steps should be briefly motivated,
- clarity of presentation of your solution,
evaluation (graphs, quality measures),
- you can study e.g., the influence of train set size, the role of model parameters, relatedness between the GO terms,
- choose a proper quality measure (accuracy, AUROC, precision, recall, F1, …)
review, scope of discussion that goes beyond your implementation
- CAFA motivation,
- interpretation of the graphs,
- difference between classification that we do here and protein annotation that is expected in general,
the report is supposed to be self-contained (understandable without the source code).
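
The quality measures listed above can be computed per class from the true and predicted labels. The following Python sketch uses the standard definitions of accuracy, precision, recall, and F1; the function names and toy labels are hypothetical, and a library implementation can be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for every class, treated one-vs-rest."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Undefined ratios (zero denominator) are reported as 0.0 here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels standing in for two GO-term classes.
y_true = ["GO_A", "GO_A", "GO_A", "GO_B", "GO_B"]
y_pred = ["GO_A", "GO_A", "GO_B", "GO_B", "GO_B"]

acc = accuracy(y_true, y_pred)
report = per_class_metrics(y_true, y_pred)
print(acc, report)
```

With imbalanced GO-term classes, accuracy alone can look good for a classifier that mostly predicts the majority class, which is one reason to report the per-class measures as well.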
