This is an old revision of the document!

Bonus task: Protein folding

This task is a slightly modified version of the MATLAB Protein Folding contest. The task description is taken from there, and is modified for Python.

Download protein_folding.zip archive, containing both

folding.py helper library, and
testsuite.txt file containing the amono acid chains to be folded.

Introduction

Your task is to solve the problem of folding of proteins, which are synthesized in the cell as long chains of amino acid building blocks, which automatically fold down into more compact working shapes. Exactly how they do this, and how they do it quickly and consistently, is poorly understood.

The problem is so notoriously difficult that even dramatic simplifications are difficult to solve. Here we use a stripped-down version of the protein-folding problem introduced by Lau and Dill in 1989.

Task description

This task concerns a simplified two-dimensional version of the protein folding problem. Given the amino acid sequence of a simplified protein, your job is to determine a final “optimally folded” configuration. Each potential arrangement of the protein sequence is associated with a free energy level, which we call the “result” of your folding algorithm. An optimal folding (and there may be more than one) corresponds to the configuration with the lowest possible result, which can be thought of as a minimum energy configuration. In our simplified world, there are only two kinds of amino acids: hydrophobic amino acids (denoted by 1) and hydrophilic amino acids (denoted by 0). Since proteins exist in a watery environment, the hydrophobic amino acids prefer to cling to each other, compact and apart from the water as much as possible.

Implementation note: You should solve this task in Python. To simplify certain actions, we provide a library called folding.py containing several functions that can be used during the solution of the task. The implementation uses complex numbers to refer to certain points in a plane, and also to designate the bending directions. It is of course possible to use a different formalism, but then all the work is up to you.

Sequence of amino acids

The sequence of amino acids will be just a sequence of 0s and 1s:

>>> a = [0, 1, 1, 1, 1, 0]

The above list represents the sequence depicted below: hydrophobic amino acids are shown in red and hydrophilic amino acids are shown in blue.

Bending the sequence

Think of the amino acids that make up the protein as links in a chain. Each link may bend by a multiple of 90 degrees, but the distance between links may not change (and is assumed to be 1). No self-intersections are permitted, and the final result is constrained to lie in a single plane. Thus you can think of the chain as a line snaking through a Manhattan-style grid, one amino acid at a time. This is sometimes referred to as a self-avoiding walk.

We will refer to the complex plane to simplify the configuration of any given chain. The folding sequence your solver returns will define the turn taken at each link by means of the complex numbers 1 (east), 1j (north), -1 (west), and -1j (south). If your solver returns the configuration sequence

>>> c = [1, 1, 1, 1, 1]

for the sequence of amino acids a defined above, it represents the straight line configuration shown in the diagram above. The configuration sequence is necessarily one element shorter than the sequence itself, since each element defines the relationship between two amino acids.

Function positions() from module folding will compute the positions of individual amino acids in the complex plane, given the configuration. It assumes that the first amino acid is always anchored at 0. The position of each amino acid actually given by the cumulative sum of the configuration.

>>> from folding import positions
>>> p = positions(c)
>>> p
[0, 1, 2, 3, 4, 5]

So, c = [1, 1, 1, 1, 1] can be interpreted as: “from your starting location, go east five consecutive times.”

To introduce a small turn in the configuration, we can put a northward turn (given by 1j) in the third element of the configuration vector c. We expect to see a picture like the one below.

This would translate to Python code as:

>>> c = [1, 1, 1j, 1, 1]
>>> p = positions(c)
>>> p
[0, 1, 2, (2+1j), (3+1j), (4+1j)]

Folding

Using function fold() from module folding, you can create a new configuration by applying a folding vector to an old configuration. Complex arithmetic makes the folding calculations straightforward. It is just sufficient to multiply both vectors element by element.

If we want to fold the entire right half of the original sequence up by 90 degrees, we can multiply the original c vector by a fold vector f that bends it like so

>>> from folding import fold
>>> c = [1, 1, 1, 1, 1]
>>> f = [1, 1, 1j, 1j, 1j]
>>> c2 = fold(c, f)
>>> c2 
[1, 1, 1j, 1j, 1j]
>>> p2 = positions(c2)
>>> p2
[0, 1, 2, (2+1j), (2+2j), (2+3j)]

Thereby making the protein shown below.

By applying successive folding vectors, your code should try to minimize the result. Here is one final configuration vector that tries to minimize the fold energy by bringing the four red amino acids close together. We apply yet another 90 degree turn to the last two elements.

>>> f2 = [1, 1, 1, 1j, 1j]
>>> c3 = fold(c2,f2)
>>> c3
[1, 1, 1j, (-1+0j), (-1+0j)]
>>> p3 = positions(c3)
>>> p3
[0, 1, 2, (2+1j), (1+1j), 1j]

Scoring

The goal of this task is to find such a folding configuration for all amino acid sequences in the provided dataset such that the average free energy across all resulting proteins is minimized.

The free energy for a single protein is calculated by totaling the distance between all pairs of hydrophobic amino acids. Your best configuration will have the hydrophobic amino acids in the most compact mass. Suppose you have the last sequence shown above. We ignore the location of the hydrophilic amino acids, since they do not contribute to the result, and then total up the distance matrix for the remaining elements.

$$ E = 1 + 1 + 1 + 1 + \sqrt2 + \sqrt2 $$

The above can be achieved by using the score() function from module folding. It computes the sum of distances for hydrophobic amino acids in sequence a given configuration c:

>>> from folding import score
>>> score(a, c3)
6.82842712474619

Important note: function score() does not check the validity of the configuration, i.e. it does not check whether the configuration is self-avoiding. You must check that yourself.

The overall quality (result) of your algorithm is judged using the average free energy across the whole data set. Compare your result with the table of winners (column “results”) of the MATLAB Contest, or with the online scores at the statistics page.

Data

We provide a dataset with various amino acids sequences to be folded into minimal energy configurations. This dataset will also be used for the task evaluation.

Evaluation

You can get up to 10 points for solving this task. Additional 5 points can be awarded if the results will be outstanding.

The task is evaluated manually, during a discussion with the teacher. We assume that the student shall be able to explain the algorithm and demonstrate her method, i.e. reading the data, result visualisation, and computation of the score for the whole dataset. In case of non-trivial dataset (and a result close to the table of winners) the task will be evaluated with the full 10 points. Additional up to 5 points for reaching results comparable with the best results.

Suggestions for elaboration

First and utmost, read the specs carefully and make sure you understand them.

Decide what formalism you will use. Implement a vector class for representation of amino acid sequences and configurations, which also allows to compute the cumulative sum of the vector (to compute the positions based on configuration), the elementwise multiplication (for folding), and computation of the free energy. You can use the formalism chosen for module folding, or you can choose a different one. Module folding is implemented in pure Python; maybe using numpy library would be better.

Your solver (maybe a function named solve(a)) shall take a sequence of amino acids as its input, and shall provide configuration c (sequence of length len(a)-1) representing the final configuration of protein. You should make sure that the resulting configuration is self-avoiding, i.e. when positions of amino acids are computed based on the configuration, no position is attended twice or more times.

It is highly desirable to create a visualization tool which would display your protein configurations graphically. It may be e.g. a function visualize(a,c) taking the amino acid sequence and the configuration as arguments.

Table of Contents