# Tutorial 12 - Protein function prediction using propositionalization

In this tutorial, we will look into an unconventional method of protein function prediction. We assume a labeled set of proteins (i.e., a supervised learning scenario) with their structure represented relationally. We will search for patterns in the relational data that could be decisive for the protein function.

Fig. 1: Relational patterns found to be decisive for DNA-binding function by Andrea Szaboová in her dissertation thesis

## Introduction

In classical machine learning, input samples are represented as fixed-length vectors of numbers. There is no natural way to represent proteins or peptides this way, but one commonly used method is to select a set of measurable properties of the protein to use as protein features, for example: protein mass, dipole moment, net charge, mean radius, chain length, et cetera. This has been applied with some success. Here we explore the possibility of additionally using automatically constructed features that describe the presence of local substructures in the peptide.

### Example: Protein relational representation

Let us have the following predicates, which will enable us to encode some information about the structure of a peptide:

• aminoAcid/1 denoting whether something is an amino-acid residue in a peptide.
• peptideBond/2 defining the ordering of the residues in the peptide chain. (primary structure)
• type/2 indicating that an amino-acid residue is of a certain type, such as lysine. (primary structure)
• distance/3 indicating the distance between two amino-acid residues [Å]. (local tertiary structure)

A peptide representation is a set of ground atoms on these predicates (technically a Herbrand model), for example:

P = { aminoAcid(a1), aminoAcid(a2), aminoAcid(a3), aminoAcid(a4), aminoAcid(a5), type(a1, lys), type(a2, cys), type(a3, trp), type(a4, gly), type(a5, ile), peptideBond(a1, a2), peptideBond(a2, a3), peptideBond(a3, a4), peptideBond(a4, a5), distance(a1, a2, 6), distance(a2, a1, 6), distance(a2, a3, 8), distance(a3, a2, 8), distance(a1, a3, 10), distance(a3, a1, 10), distance(a3, a4, 8), distance(a4, a3, 8), distance(a4, a5, 10), distance(a5, a4, 10) }

A relational feature is a logical formula, in our case a non-ground conjunction of positive literals, for example:

F = aminoAcid(X) ∧ distance(X, Y, 8) ∧ type(Y, his).

In such a case, checking whether the feature holds for the protein reduces to checking whether F θ-subsumes P, in other words whether there is a substitution θ such that Fθ ⊆ P. That defines a Boolean feature. Additionally, we can count how many such substitutions (groundings) exist, to get a more informative integer-valued feature.
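A brute-force grounding counter can make this concrete. The sketch below (our own illustration, not part of TreeLiker) encodes the peptide P and feature F from above as Python tuples and simply tries every substitution of constants for variables; the helper name `count_groundings` is ours:

```python
from itertools import product

# Peptide P from the example, as a set of ground atoms (predicate, args...).
P = {
    ("aminoAcid", "a1"), ("aminoAcid", "a2"), ("aminoAcid", "a3"),
    ("aminoAcid", "a4"), ("aminoAcid", "a5"),
    ("type", "a1", "lys"), ("type", "a2", "cys"), ("type", "a3", "trp"),
    ("type", "a4", "gly"), ("type", "a5", "ile"),
    ("peptideBond", "a1", "a2"), ("peptideBond", "a2", "a3"),
    ("peptideBond", "a3", "a4"), ("peptideBond", "a4", "a5"),
    ("distance", "a1", "a2", 6), ("distance", "a2", "a1", 6),
    ("distance", "a2", "a3", 8), ("distance", "a3", "a2", 8),
    ("distance", "a1", "a3", 10), ("distance", "a3", "a1", 10),
    ("distance", "a3", "a4", 8), ("distance", "a4", "a3", 8),
    ("distance", "a4", "a5", 10), ("distance", "a5", "a4", 10),
}

# Feature F = aminoAcid(X) ∧ distance(X, Y, 8) ∧ type(Y, his).
# Convention: an argument starting with an uppercase letter is a variable.
F = [("aminoAcid", "X"), ("distance", "X", "Y", 8), ("type", "Y", "his")]

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def count_groundings(feature, model):
    """Count substitutions theta with feature·theta ⊆ model (brute force)."""
    variables = sorted({t for atom in feature for t in atom[1:] if is_var(t)})
    constants = sorted({t for atom in model for t in atom[1:]}, key=str)
    count = 0
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))
        grounded = [tuple(theta.get(t, t) for t in atom) for atom in feature]
        if all(atom in model for atom in grounded):
            count += 1
    return count

print(count_groundings(F, P))   # 0: P contains no histidine, so F is false for P
F2 = [("aminoAcid", "X"), ("distance", "X", "Y", 8), ("type", "Y", "trp")]
print(count_groundings(F2, P))  # 2: X ∈ {a2, a4}, Y = a3
```

Note that F happens not to hold for this particular P (there is no `his` residue), whereas the variant F2 admits two groundings. Real propositionalization systems use far more efficient subsumption tests than this exponential enumeration.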

The RelF algorithm generates a set of such features from a template in polynomial time, in a way that makes the features non-redundant in a certain sense, and also computes the values of these features for the samples. These can then be used on their own or along with the measurement-based features mentioned earlier. Detailed instructions on how to use the templates can be found in Section 4 of the TreeLiker manual.

## TreeLiker Installation

In this tutorial, we will use the TreeLiker software, which implements the RelF propositionalization algorithm, to predict whether peptides have antibacterial properties.

After unzipping, you should be able to run it by either the batch script or the following command:

```sh
java -Xmx1G -cp TreeLikerGUI.jar app.gui.main.Main
```

## TreeLiker instructions for tutorial

The dataset we are going to use is a dataset of peptide structures along with the information whether or not they exhibit antimicrobial activity. Determining such properties is important in the development of novel antibiotics.

1. Load the project: navigate to examples_treelikergui/peptides and open the directory. WARNING: This currently does not seem to work with newer Java versions.
2. Click “Add New Dataset”. Check that the “Data directory” field contains the correct path – it should be examples_treelikergui/peptides/dataset. Keep the format set to “pseudo-prolog with class label”.
3. Navigate to the “Template” tab. Make sure you understand what the template means; if not, you can keep the default for now. You are later encouraged to tweak it and observe what happens. The template should be residue[1](+a0,!aaType), residue(-a0,#aa), next[1](+a0,-a1), dist(+a0, !aa, -a1, !aa, #dist), residue[1](+a1,#aa)
4. Navigate to “Pattern Search” tab. Select RelF algorithm. Run “Search!”. Wait for the computation to finish, it may take several minutes.
5. Navigate to the “Found Patterns” tab. Look at the discovered patterns. Can you see how they were generated from the template? How do you interpret the Chi^2 and Information Gain values?
6. Now the propositionalization phase is complete. We will use the feature counts as attributes in classical machine learning; TreeLiker uses the WEKA framework for that. Navigate to the “Training” tab.
1. Choose a learning algorithm, try “Linear SVM” first and others later.
2. Check 10-fold cross validation for a more accurate quality estimate.
3. Run “Start”. It may again take several minutes, depending on the selected algorithm and CV setting.
4. Interpret the output: What is the accuracy of the learned model? Compare these values to the performance measures of the other algorithms. Do the relational features provide enough information to classify better than random? Could you do better with a modified template?
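TreeLiker delegates this training step to WEKA, but the cross-validation logic it uses can be sketched in a few lines of Python. The sketch below is our own illustration under loud assumptions: the feature table is synthetic (a stand-in for TreeLiker's propositional output) and the classifier is a trivial 1-nearest-neighbour rather than a linear SVM, just to show how a 10-fold accuracy estimate is computed:

```python
import random

def one_nn_predict(train, x):
    """Predict the label of x as the label of its nearest training row."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda row: dist(row[0], x))[1]

def cross_validate(data, k=10, seed=0):
    """Estimate classifier accuracy by k-fold cross-validation."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]          # k disjoint test folds
    correct = 0
    for i, test_fold in enumerate(folds):
        # Train on everything outside fold i, evaluate on fold i.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        correct += sum(one_nn_predict(train, x) == y for x, y in test_fold)
    return correct / len(rows)

# Synthetic stand-in for the propositional table: rows of relational
# feature counts with a binary class label (antimicrobial or not).
rng = random.Random(1)
data = [([rng.randint(5, 9), rng.randint(0, 3)], 1) for _ in range(30)] + \
       [([rng.randint(0, 2), rng.randint(0, 3)], 0) for _ in range(30)]

print(cross_validate(data))  # accuracy estimate, well above the 0.5 chance level here
```

Averaging over the k held-out folds is what makes the estimate more reliable than a single train/test split, at the cost of training k models.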

## Exercises on relational features

1. Give an informal definition of “propositionalization”.
2. Suppose you have a relational sample aminoAcid(a1), aminoAcid(a2), type(a1, lysine), type(a2, tryptophan), distance(a1, a2, 8). How would you add the information that this is part of an alpha helix?
3. Let us have a feature aminoAcid(A) ∧ aminoAcid(B) ∧ type(A, lysine) ∧ distance(A, B, 8). Determine how many groundings of this feature each of the following samples admits:
• aminoAcid(a1), aminoAcid(a2), type(a1, lysine), distance(a1, a2, 8)
• aminoAcid(a1), aminoAcid(a2), type(a1, lysine), distance(a2, a1, 8)
• aminoAcid(a1), aminoAcid(a2), type(a2, lysine), distance(a1, a2, 8)
• aminoAcid(a1), aminoAcid(a2), aminoAcid(a3), type(a1, cysteine), type(a2, tryptophan), type(a3, lysine), distance(a2, a1, 6), distance(a3, a2, 8)
• aminoAcid(a1), aminoAcid(a2), aminoAcid(a3), type(a1, cysteine), type(a2, lysine), type(a3, lysine), distance(a2, a1, 8), distance(a3, a2, 8)
4. Determine which of the features F1–F5 are redundant given the following dataset. A feature is called redundant if there is another feature which covers a superset of positive examples and another feature which covers a superset of negative examples:

|           | F1 | F2 | F3 | F4 | F5 | Class |
|-----------|----|----|----|----|----|-------|
| Example 1 | 1  | 1  | 1  | 1  | 1  | 1     |
| Example 2 | 1  | 1  | 0  | 0  | 1  | 1     |
| Example 3 | 0  | 1  | 1  | 0  | 0  | 0     |
| Example 4 | 1  | 0  | 1  | 1  | 1  | 0     |
5. Consider the following template: aminoAcid(-a) ∧ type(+a,#t) ∧ distance(+a,-b,#num) ∧ aminoAcid(+b) ∧ type(+b,#t). Which of the following features could it have generated?
• aminoAcid(X) ∧ distance(X, Y, 8) ∧ type(Y, histidine)
• aminoAcid(X) ∧ type(Y, X)
• aminoAcid(X) ∧ aminoAcid(Y) ∧ aminoAcid(Z) ∧ distance(X, Y, 8) ∧ distance(X, Z, 10)
• aminoAcid(X) ∧ aminoAcid(Y) ∧ aminoAcid(Z) ∧ distance(X, Y, 8) ∧ distance(Y, Z, 10)