Assignment 5: PROTEIN FUNCTION PREDICTION

  • 10 points.
  • Deadline: Wednesday, May 27th, 23:59
  • Late submission penalty: -1 point per day, but no more than 8 points.
  • Submit to BRUTE.
  • Work individually.
  • Submit
    • a PDF report that describes your solution,
    • your source code in a language of your choice.

Motivation

Protein function prediction aims to assign biological or biochemical roles to proteins. Prediction can be based on sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. In this assignment you will focus purely on protein sequences and work with the assumption that proteins with similar sequences are likely to share function.

Protein function is a very general concept. In this assignment, you will interpret it as an annotation that comes from a closed vocabulary of terms contained in The Gene Ontology. The Gene Ontology provides a taxonomical hierarchy of well-defined terms divided into three main categories: molecular function, biological process and cellular component. Your task will be to assign appropriate terms from this ontology to proteins whose function (the set of annotating GO terms) is unknown (or rather temporarily hidden).

This assignment is a (very) simplified version of The Critical Assessment of protein Function Annotation (CAFA) challenge.

Inputs

The input dataset is available here. It contains a FASTA file with 66,841 proteins and an annotation file with 386,197 links between proteins and GO terms. On average, each protein is annotated with approximately 6 terms. Only leaf annotations are provided. If a protein is annotated with a specific term, it is by default also annotated with all of that term's generalizations (ancestor terms), although exceptions exist.
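A minimal sketch of how the two input files might be loaded. It assumes that the protein ID is the first token of each FASTA header and that the annotation file is tab-separated with the protein ID in the first column and the GO term in the second; both are assumptions that should be checked against the actual files. The function names and the toy data are made up for illustration.

```python
from collections import defaultdict
import io

def read_fasta(handle):
    """Return {protein_id: sequence} from an open FASTA file handle."""
    sequences, current_id, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited header token.
            current_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences

def read_annotations(handle):
    """Return {go_term: set of protein IDs}.

    Assumption: tab-separated lines, protein ID first, GO term second.
    """
    proteins_by_term = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            proteins_by_term[fields[1]].add(fields[0])
    return proteins_by_term

# Toy in-memory data standing in for the real files (IDs are placeholders).
fasta = io.StringIO(">P1 demo protein\nMKT\nAYI\n>P2\nMSLLK\n")
annot = io.StringIO("P1\tGO:0000001\tF\nP2\tGO:0000001\tF\nP2\tGO:0000002\tP\n")

seqs = read_fasta(fasta)
terms = read_annotations(annot)
# Rank GO terms by the number of annotated proteins to find abundant ones.
abundant = sorted(terms, key=lambda t: len(terms[t]), reverse=True)
```

With the real files, the handles would come from `open(...)` instead of `io.StringIO`, and `abundant` gives a starting point for picking sufficiently abundant GO terms.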

For your analysis, work with only a small number of GO terms (3 or 4). When addressing a classification task, try to discriminate between GO terms and investigate whether distantly related GO terms (whose shortest path in the ontology passes through the root node) are easier to separate than closely related GO terms connected by an ancestor-descendant relationship.
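The distinction between closely and distantly related terms can be sketched as follows, assuming the ontology is available as a child-to-parents mapping. The tiny ontology, the term names and the single `root` node are hypothetical; the real Gene Ontology has one root per main category and would be loaded from an ontology file.

```python
def ancestors(term, parents):
    """All proper ancestors of `term` in a DAG given as {child: [parents]}."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def relation(a, b, parents, root):
    """Classify the relationship between two terms of the same ontology."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    if a in anc_b or b in anc_a:
        return "ancestor-descendant"
    # Common ancestors other than the root mean the terms share a more
    # specific generalization than the root itself.
    if (anc_a & anc_b) - {root}:
        return "share a non-root ancestor"
    return "related only through the root"

# Hypothetical toy ontology (child -> list of parents).
toy_parents = {
    "binding": ["root"],
    "catalysis": ["root"],
    "ion binding": ["binding"],
    "protein binding": ["binding"],
    "metal ion binding": ["ion binding"],
    "kinase": ["catalysis"],
}

close = relation("metal ion binding", "ion binding", toy_parents, "root")
sibling_like = relation("metal ion binding", "protein binding", toy_parents, "root")
distant = relation("metal ion binding", "kinase", toy_parents, "root")
```

Here `close` is an ancestor-descendant pair, while `distant` is a pair whose only common ancestor is the root, which corresponds to the distantly related case described above.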

When searching for motifs, examine how the motifs identified for a certain GO term relate to its generalizations (ancestor terms), sibling terms and biologically distinct GO terms with no direct ontological link. This may help assess the specificity and transferability of the discovered motifs.

A guide to understanding the protein annotation file is outlined in this Python script. An additional R script demonstrates how to preprocess the FASTA file.

Tasks

You are supposed to solve ONE OF THE FOLLOWING tasks:

* protein annotation based on sequence similarity

  • pick a couple of sufficiently abundant GO terms,
  • get their protein sequences,
  • use BLAST to compute the pairwise sequence similarity of the proteins,
  • use kNN to classify the proteins into classes defined by the previously picked GO terms,
  • do not forget to use cross-validation, or at least a train/test split,
    • leave-one-out CV is frequently applied with kNN,
  • draw a conclusion, compare your results with expectations (a small literature review required)
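The kNN step above can be sketched as follows, assuming an all-vs-all BLAST search was saved in the default 12-column tabular format (`-outfmt 6`), where the bit score is the last column. The protein IDs, class labels and scores below are made up, and using the bit score as the similarity is one possible choice, not a requirement of the assignment.

```python
from collections import Counter, defaultdict

def parse_blast_tabular(lines):
    """Return {query: {subject: best bit score}} from BLAST -outfmt 6 lines."""
    scores = defaultdict(dict)
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue
        query, subject, bitscore = f[0], f[1], float(f[11])
        if query != subject:  # self-hits would leak the label in leave-one-out
            scores[query][subject] = max(bitscore, scores[query].get(subject, 0.0))
    return scores

def knn_predict(query, scores, labels, k=3):
    """Majority vote among the k labelled hits with the highest bit score.

    Returns None when the query has no hits. Vote ties go to the class
    that appears first among the top-ranked hits.
    """
    hits = [(s, subj) for subj, s in scores.get(query, {}).items()
            if subj in labels and subj != query]
    hits.sort(reverse=True)
    votes = Counter(labels[subj] for _, subj in hits[:k])
    return votes.most_common(1)[0][0] if votes else None

def leave_one_out_accuracy(scores, labels, k=3):
    """Predict every protein from all the others and count correct labels."""
    correct = sum(knn_predict(p, scores, labels, k) == labels[p] for p in labels)
    return correct / len(labels)

def row(q, s, bits):
    """Build one made-up outfmt 6 line; only IDs and bit score matter here."""
    return f"{q}\t{s}\t90.0\t100\t0\t0\t1\t100\t1\t100\t1e-30\t{bits}"

toy_hits = [
    row("A1", "A1", 200), row("A1", "A2", 150), row("A1", "A3", 120),
    row("A1", "B1", 30), row("A2", "A1", 150), row("A2", "A3", 110),
    row("A3", "A1", 120), row("A3", "A2", 110), row("B1", "B2", 140),
    row("B1", "B3", 100), row("B1", "A1", 30), row("B2", "B1", 140),
    row("B2", "B3", 90), row("B3", "B1", 100), row("B3", "B2", 90),
    row("B4", "A1", 80), row("B4", "A2", 70), row("B4", "B1", 20),
]
toy_labels = {"A1": "GO_A", "A2": "GO_A", "A3": "GO_A",
              "B1": "GO_B", "B2": "GO_B", "B3": "GO_B", "B4": "GO_B"}

blast_scores = parse_blast_tabular(toy_hits)
loo_accuracy = leave_one_out_accuracy(blast_scores, toy_labels, k=3)
```

In the toy data, `B4` is labelled `GO_B` but its best hits belong to `GO_A`, so it is the one protein that leave-one-out misclassifies. Proteins without any BLAST hit get `None` and count as errors here; how to treat them is a design decision worth stating in the report.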

* motif discovery in functionally related protein sets

  • install MEME for motif discovery
  • pick a couple of sufficiently abundant GO terms,
  • get their protein sequences
    • split them into train and test sets,
  • use STREME to find motifs that match GO terms
    • example call searching for motifs that appear in GO1_train but not in GO2_train:
      • streme -p CAFA3_training_data/GO1_train.fasta -n CAFA3_training_data/GO2_train.fasta --protein -thres 0.1 -time 360
    • the motifs found will be stored into streme_out/streme.txt,
    • interpret the motifs, compare them, evaluate their scores,
  • use FIMO to detect the motifs in the test set
    • example call: fimo streme_out/streme.txt CAFA3_training_data/GO1_test.fasta
    • the motif scores will be stored into fimo_out/fimo.tsv,
  • use the scores to classify the proteins into classes defined by the previously picked GO terms
    • you have to repeat the previous steps several times to get motif sets for all the classes; a script is needed,
  • draw a conclusion, compare your results with expectations (a small literature review required).
  • R auxiliary functions available here
    • STREME and FIMO R calls, creation of an input file for supervised learning from FIMO score files (fimo.tsv),
    • do not hesitate to experiment with settings (e.g., STREME parameters thres and time), FIMO max aggregation etc.
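Turning FIMO output into an input table for supervised learning might look like the sketch below. It assumes the standard `fimo.tsv` column names (`motif_id`, `sequence_name`, `score`) and uses max aggregation, i.e. the best score of each motif in each sequence. The motif IDs, protein IDs and scores are made up, and filling missing motif hits with 0.0 is an assumption, not something FIMO prescribes.

```python
import csv
import io
from collections import defaultdict

def fimo_max_scores(handle):
    """Return {sequence_name: {motif_id: max score}} from a fimo.tsv handle."""
    # Skip blank lines and the '#' comment lines found in the file.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    features = defaultdict(dict)
    for rec in csv.DictReader(rows, delimiter="\t"):
        seq, motif = rec["sequence_name"], rec["motif_id"]
        score = float(rec["score"])
        features[seq][motif] = max(score, features[seq].get(motif, float("-inf")))
    return features

def to_matrix(features, sequence_ids, motif_ids, missing=0.0):
    """One row per sequence, one column per motif.

    Sequences without a reported hit do not appear in fimo.tsv at all, so
    they get `missing` (an assumed default) in every column.
    """
    return [[features.get(s, {}).get(m, missing) for m in motif_ids]
            for s in sequence_ids]

# Made-up FIMO output for illustration.
toy_fimo = io.StringIO(
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore"
    "\tp-value\tq-value\tmatched_sequence\n"
    "1-MOTIFA\tSTREME-1\tP1\t5\t12\t+\t10.0\t1e-5\t0.01\tACDEFGHI\n"
    "1-MOTIFA\tSTREME-1\tP1\t40\t47\t+\t12.5\t1e-6\t0.01\tACDEFGHK\n"
    "2-MOTIFB\tSTREME-2\tP2\t3\t8\t+\t7.25\t1e-4\t0.05\tLMNPQR\n"
    "\n"
    "# trailing comment line\n"
)

fimo_features = fimo_max_scores(toy_fimo)
X = to_matrix(fimo_features, ["P1", "P2", "P3"], ["1-MOTIFA", "2-MOTIFB"])
```

Running this once per class-specific motif set and concatenating the columns gives one feature vector per test protein. Other aggregations (sum, hit count) can be swapped in where `max` is used.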

Figure: Motif (The MEME Suite)

* protein annotation with recurrent neural networks OR transformers

  • pick a couple of sufficiently abundant GO terms,
  • get their protein sequences,
  • use a recurrent neural network to classify selected proteins
  • alternatively, perform the same classification with pre-trained protein language models
  • do not forget to use cross-validation, or at least a train/test split,
  • provide a more detailed analysis of one selected relevant learning aspect, ideas can be:
    • what is the role of protein length and sequence truncation? which parts of the protein carry the most predictive information (N-terminus, C-terminus, central regions, domains),
    • visualization of the protein embeddings (PCA, t-SNE, UMAP), analyzing how the classifier separates the classes in this representation space,
    • effect of classifier head complexity (e.g. linear vs MLP) and training hyperparameters (learning rate, regularization) on performance,
    • which proteins tend to be classified correctly and which of them tend to be misclassified, investigate potential causes such as multifunctionality (number of GO terms), sequence similarity (cosine similarity among embeddings) or annotation ambiguity.
  • draw a conclusion
    • under this option, the final report does not have to include the literature review required in the previous tracks.
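For the truncation aspect mentioned above, a minimal sketch of how different parts of a protein could be fed to the model is given below. The function names, the three truncation modes and the integer encoding (0 for padding, one index per standard amino acid, one shared index for anything else) are illustrative assumptions rather than a prescribed preprocessing.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 0 is reserved for padding; unknown residues share one extra index.
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS) + 1

def truncate(sequence, max_len, mode="n_term"):
    """Keep at most `max_len` residues from the chosen part of the protein."""
    if len(sequence) <= max_len:
        return sequence
    if mode == "n_term":
        return sequence[:max_len]
    if mode == "c_term":
        return sequence[len(sequence) - max_len:]
    if mode == "center":
        start = (len(sequence) - max_len) // 2
        return sequence[start:start + max_len]
    raise ValueError(f"unknown truncation mode: {mode}")

def encode(sequence, max_len, mode="n_term"):
    """Truncate, map residues to integers and right-pad with zeros."""
    kept = truncate(sequence, max_len, mode)
    ids = [AA_INDEX.get(aa, UNKNOWN) for aa in kept]
    return ids + [0] * (max_len - len(ids))

demo = "ACDEFGHIKL"
n_part = truncate(demo, 4, "n_term")
c_part = truncate(demo, 4, "c_term")
mid_part = truncate(demo, 4, "center")
short_encoded = encode("ACX", 5)
```

Training the same classifier on `n_term`, `c_term` and `center` inputs of equal length is one simple way to compare which region carries the most predictive information.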

Scoring

This is an experimental and research assignment rather than a programming one. Please note that the main emphasis is on the experimental design and the report. The evaluation therefore differs from that of the previous assignments.

* 4 points for experimental design

  • the overall pipeline fulfills the task requirements,
  • conceptual correctness of the solution,
    • the design avoids biases, overfitting, etc.
  • meaningful evaluation (scores, classification measures).
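One small ingredient of an unbiased design is a reproducible split that keeps the class proportions the same in the train and test sets. The sketch below shows a stratified split with a fixed random seed; the function name, the default test fraction and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split {protein_id: class} into train and test ID lists, class by class."""
    by_class = defaultdict(list)
    for protein, label in sorted(labels.items()):
        by_class[label].append(protein)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # At least one test protein per class, otherwise the given fraction.
        n_test = max(1, int(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy labels: 10 proteins of one class, 6 of another.
toy_classes = {f"A{i}": "GO_A" for i in range(10)}
toy_classes.update({f"B{i}": "GO_B" for i in range(6)})

train_ids, test_ids = stratified_split(toy_classes)
```

A random split like this does not remove near-identical sequences shared between the train and test sets, so sequence redundancy may still inflate the scores and is worth discussing in the report.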

* 2 points for source code

  • correctness, documentation, clarity, accessibility (ease of utilization for different fasta inputs, GO terms, etc.).

* 4 points for final report

  • a description of your solution is needed, the individual steps should be briefly motivated,
    • clarity of presentation of your solution,
  • evaluation (graphs, quality measures),
    • you can study e.g., the influence of train set size, the role of model parameters, relatedness between the GO terms,
    • choose a proper quality measure (accuracy, AUROC, precision, recall, F1, …)
  • review, scope of discussion that goes beyond your implementation
    • CAFA motivation,
    • interpretation of the graphs,
    • difference between classification that we do here and protein annotation that is expected in general,
  • the report is supposed to be self-contained (understandable without the source code).
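The quality measures listed above can be computed directly from the true and predicted labels. The following sketch covers accuracy and per-class precision, recall and F1 with a macro average; the labels are made up, and returning 0.0 for an undefined ratio is a convention chosen here, not the only option.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} for every class in y_true."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Convention: undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

def accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 values."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Made-up labels for three GO-term classes.
true_labels = ["A", "A", "A", "B", "B", "C"]
pred_labels = ["A", "A", "B", "B", "B", "A"]

class_scores = per_class_scores(true_labels, pred_labels)
acc = accuracy(true_labels, pred_labels)
macro = macro_f1(true_labels, pred_labels)
```

With unbalanced GO-term classes, accuracy can look good even when a rare class is never predicted (class `C` in the toy data), which is why per-class and macro-averaged measures are worth reporting alongside it.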
courses/bin/assignments/hw5.txt · Last modified: 2026/06/03 17:02 by klema