Assignment 5: PROTEIN FUNCTION PREDICTION

Motivation

Protein function prediction aims to assign biological or biochemical roles to proteins. Predictions can be based on sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. In this assignment you will focus purely on protein sequences and work under the assumption that proteins with similar sequences are likely to share function.

Protein function is a very general concept. In this assignment, you will interpret it as an annotation drawn from the closed vocabulary of terms contained in the Gene Ontology (GO). The Gene Ontology provides a taxonomic hierarchy of well-defined terms divided into three main categories: molecular function, biological process, and cellular component. Your task will be to assign appropriate terms from this ontology to proteins whose function (the set of annotating GO terms) is unknown (or rather, temporarily hidden).

This assignment is a (very) simplified version of The Critical Assessment of protein Function Annotation (CAFA) challenge.

Inputs

The input dataset is here. It contains a FASTA file with 66,841 proteins as well as an annotation file with 386,197 links between proteins and GO terms. Each protein is annotated with approximately 6 terms on average. Only leaf annotations are provided: if a protein is annotated with a specific term, it is by default also annotated with all of that term's generalizations (although exceptions exist).
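The propagation of leaf annotations to their generalizations can be sketched as follows. This is a minimal illustration only: the toy ontology, the term names, and the helper functions `ancestors` and `propagate` are hypothetical, the real annotation file format is not assumed, and the exceptions mentioned above are ignored.

```python
# Hypothetical toy ontology: term -> set of direct parents (not real GO IDs).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
}


def ancestors(term, parents):
    """Return all proper ancestors of `term` (the term itself is excluded)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(leaf_annotations, parents):
    """Expand each protein's leaf terms with all of their ancestors."""
    full = {}
    for protein, terms in leaf_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[protein] = expanded
    return full


# Toy leaf annotations: protein ID -> set of most specific terms.
leaf = {"P1": {"dna_binding"}, "P2": {"kinase", "dna_binding"}}
full = propagate(leaf, PARENTS)
```

With real data, `PARENTS` would be built from the ontology itself, and the expanded sets would serve as labels whenever a more general term is the prediction target.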

Work with only a few GO terms (3 or 4). When solving a classification task, try to discriminate between GO terms and see whether distant GO terms (the shortest ontology path between them contains the root node) separate better than related GO terms (ancestor relationship). When searching for motifs, try to find out how well a set of motifs found for a certain GO term fits, for example, its generalizations (ancestor or sibling ontology nodes) and a very different term (no direct link). A way to preprocess the FASTA file is outlined in this R file.
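When picking the 3 or 4 terms, it may help to label each pair of terms as ancestor-related, otherwise related, or distant. The sketch below does this on a hypothetical toy ontology (the term names and the functions `closure` and `relation` are illustrative, not part of the assignment materials). It treats two terms as distant when their only shared ancestor is the root, which matches the "shortest path contains the root" criterion in a tree-shaped ontology and is only an approximation in a general DAG.

```python
# Hypothetical toy ontology: term -> set of direct parents (not real GO IDs).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}


def closure(term, parents):
    """Return `term` together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relation(a, b, parents, root="root"):
    """Classify a pair of terms as 'ancestor', 'related', or 'distant'."""
    closure_a = closure(a, parents)
    closure_b = closure(b, parents)
    if a in closure_b or b in closure_a:
        return "ancestor"  # one term generalizes the other
    if closure_a & closure_b == {root}:
        return "distant"   # the terms meet only at the root
    return "related"       # e.g. siblings sharing a non-root ancestor


pairs = {
    ("dna_binding", "binding"): relation("dna_binding", "binding", PARENTS),
    ("dna_binding", "rna_binding"): relation("dna_binding", "rna_binding", PARENTS),
    ("dna_binding", "kinase"): relation("dna_binding", "kinase", PARENTS),
}
```

Comparing classifier performance on a "distant" pair against an "ancestor" or "related" pair is one way to set up the comparison described above.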

Tasks

You are supposed to solve ONE OF THE FOLLOWING tasks:

* protein annotation based on sequential similarity

* motif discovery in functionally related protein sets

[Figure: Motif (The MEME Suite)]

* protein annotation with recurrent neural networks
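For the first task (annotation based on sequence similarity), one possible baseline is to transfer the GO terms of the most similar annotated protein to the query protein. The sketch below uses Jaccard similarity over k-mer sets as a simple, alignment-free stand-in for a proper sequence alignment score. The sequences, protein IDs, term names, and the function `predict_terms` are made up for illustration and do not come from the assignment dataset.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_terms(query, train_seqs, train_terms, k=3):
    """Transfer the terms of the most k-mer-similar annotated protein."""
    query_kmers = kmers(query, k)
    best_id = max(
        train_seqs,
        key=lambda pid: jaccard(query_kmers, kmers(train_seqs[pid], k)),
    )
    return best_id, train_terms[best_id]


# Toy training data: protein ID -> sequence, protein ID -> GO-like terms.
train_seqs = {"P1": "MKTAYIAKQR", "P2": "GSHMLEDPVV"}
train_terms = {"P1": {"dna_binding"}, "P2": {"kinase"}}

best_id, terms = predict_terms("MKTAYIAKQL", train_seqs, train_terms)
```

A natural extension is to vote over the several most similar proteins rather than a single nearest neighbor, and to evaluate on proteins whose annotations were held out.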

Scoring

This is an experimental, research-oriented assignment rather than a programming one. Please note that the main emphasis is on the experimental design and the report, so the evaluation differs from that of the previous assignments.

* 4 points for experimental design

* 2 points for source code

* 4 points for final report