In this tutorial, we look at an unconventional method of protein function prediction. We assume we have a labeled set of proteins (i.e. a supervised learning scenario) whose structure is represented relationally. We will search the relational data for patterns that could be decisive for protein function.
Fig. 1: Relational patterns found to be decisive for DNA binding function by Andrea Szaboová in her dissertation thesis
In classical machine learning, input samples are represented as fixed-length vectors of numbers. There is no natural way to represent proteins or peptides this way, but one commonly used method is to select a set of measurable properties of the protein to use as features, for example protein mass, dipole moment, net charge, mean radius, or chain length. This has been applied with some success. Here we explore the possibility of also adding automatically constructed features that describe the presence of local substructures in the peptide.
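As a rough illustration of the measurement-based representation, the sketch below turns a peptide sequence into a fixed-length numeric vector. The function name `property_features` and the choice of three features are assumptions made for this example, not part of TreeLiker or the dataset. The net charge is a deliberately crude count (+1 for lysine and arginine, -1 for aspartate and glutamate) that ignores histidine, the termini, and pH.

```python
# Hypothetical helper; the charge rule is a deliberate simplification.
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def property_features(sequence):
    """Return [chain length, crude net charge, fraction of charged residues]."""
    seq = sequence.upper()
    positives = sum(residue in POSITIVE for residue in seq)
    negatives = sum(residue in NEGATIVE for residue in seq)
    length = len(seq)
    charged_fraction = (positives + negatives) / length if length else 0.0
    return [float(length), float(positives - negatives), charged_fraction]


# Lys-Cys-Trp-Gly-Ile written in one-letter codes.
features = property_features("KCWGI")
print(features)  # [5.0, 1.0, 0.2]
```

Every peptide maps to a vector of the same length regardless of its size, which is what a classical learner needs. It is also why local structural detail is lost.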
Let us have the following predicates, which will enable us to encode some information about the structure of a peptide:
A peptide representation is a set of ground atoms on these predicates (technically a Herbrand model), for example:
P = {aminoAcid(a1), aminoAcid(a2), aminoAcid(a3), aminoAcid(a4), aminoAcid(a5), type(a1, lys), type(a2, cys), type(a3, trp), type(a4, gly), type(a5, ile), peptideBond(a1, a2), peptideBond(a2, a3), peptideBond(a3, a4), peptideBond(a4, a5), distance(a1, a2, 6), distance(a2, a1, 6), distance(a2, a3, 8), distance(a3, a2, 8), distance(a1, a3, 10), distance(a3, a1, 10), distance(a3, a4, 8), distance(a4, a3, 8), distance(a4, a5, 10), distance(a5, a4, 10)}
A relational feature is a logical formula, in our case a non-ground conjunction of positive literals, for example:
F = aminoAcid(X) ∧ distance(X, Y, 8) ∧ type(Y, his).
In that case, checking whether the feature holds for the protein reduces to checking whether F θ-subsumes P, in other words whether there is a substitution θ such that Fθ ⊆ P. This defines a Boolean feature. We can also count how many such substitutions (groundings) exist, which gives a more informative integer-valued feature.
The polynomial-time RelF algorithm generates a set of such features from a template so that the features are non-redundant in a certain sense, and it also computes the values of these features for the samples. The features can then be used on their own or along with the measurement-based features mentioned earlier. Detailed instructions on how to use the templates can be found in section 4 of the TreeLiker manual.
In this tutorial, we will use the TreeLiker software, which implements the RelF propositionalization algorithm, to predict whether or not peptides have antibacterial properties.
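The subsumption check and the grounding count can be sketched in a few lines of Python. This is a naive backtracking search written only for illustration: it is not the RelF algorithm and not TreeLiker code, and the names `is_variable` and `substitutions` are made up for this example. Atoms are encoded as tuples, and terms starting with an uppercase letter are treated as variables.

```python
# Ground atoms of the example peptide P, encoded as (predicate, arg, ...).
P = {
    ("aminoAcid", "a1"), ("aminoAcid", "a2"), ("aminoAcid", "a3"),
    ("aminoAcid", "a4"), ("aminoAcid", "a5"),
    ("type", "a1", "lys"), ("type", "a2", "cys"), ("type", "a3", "trp"),
    ("type", "a4", "gly"), ("type", "a5", "ile"),
    ("peptideBond", "a1", "a2"), ("peptideBond", "a2", "a3"),
    ("peptideBond", "a3", "a4"), ("peptideBond", "a4", "a5"),
    ("distance", "a1", "a2", 6), ("distance", "a2", "a1", 6),
    ("distance", "a2", "a3", 8), ("distance", "a3", "a2", 8),
    ("distance", "a1", "a3", 10), ("distance", "a3", "a1", 10),
    ("distance", "a3", "a4", 8), ("distance", "a4", "a3", 8),
    ("distance", "a4", "a5", 10), ("distance", "a5", "a4", 10),
}


def is_variable(term):
    """Treat strings starting with an uppercase letter as variables."""
    return isinstance(term, str) and term[:1].isupper()


def substitutions(feature, atoms, index=0, theta=None):
    """Yield every substitution theta such that feature*theta is a subset of atoms."""
    theta = {} if theta is None else theta
    if index == len(feature):
        yield dict(theta)
        return
    literal = feature[index]
    for atom in atoms:
        if atom[0] != literal[0] or len(atom) != len(literal):
            continue
        extended = dict(theta)
        consistent = True
        for term, value in zip(literal[1:], atom[1:]):
            if is_variable(term):
                # Bind the variable, or check it against an earlier binding.
                if extended.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from substitutions(feature, atoms, index + 1, extended)


# F from the text: there is no histidine in P, so F does not subsume P.
F = [("aminoAcid", "X"), ("distance", "X", "Y", 8), ("type", "Y", "his")]
# A variant that asks for tryptophan instead of histidine.
G = [("aminoAcid", "X"), ("distance", "X", "Y", 8), ("type", "Y", "trp")]

count_F = len(list(substitutions(F, P)))
count_G = len(list(substitutions(G, P)))
print(count_F > 0, count_F)  # False 0
print(count_G > 0, count_G)  # True 2
```

For G the two groundings are X = a2, Y = a3 and X = a4, Y = a3. A brute-force search like this can take exponential time for general conjunctions, so it serves only to make the definition concrete.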
Download the TreeLiker-GUI software from: http://ida.felk.cvut.cz/treeliker/download/binaries_treelikergui.zip or indirectly from http://ida.felk.cvut.cz/treeliker/TreeLiker.html.
After unzipping, you should be able to run it by either the batch script or the following command:
```sh
java -Xmx1G -cp TreeLikerGUI.jar app.gui.main.Main
```
The dataset we are going to use contains peptide structures together with information on whether or not they exhibit antimicrobial activity. Determining such properties is important in the development of novel antibiotics.
examples_treelikergui/peptides
examples_treelikergui/peptides/dataset
residue[1](+a0,!aaType),residue(-a0,#aa), next[1](+a0,-a1), dist(+a0, !aa, -a1, !aa, #dist), residue[1](+a1,#aa)
The official TreeLiker tutorial/manual: http://ida.felk.cvut.cz/treeliker/download/treeliker.pdf
For more details on protein function prediction using propositionalization, see the dissertation theses of Andrea Szaboová and Ondřej Kuželka.
The peptides dataset is from: Cherkasov, A., Jankovic, B. Application of Inductive QSAR Descriptors for Quantification of Antibacterial Activity of Cationic Polypeptides. Molecules 2004, 9, 1034-1052.