====== Bonus task: Protein folding ====== This task is a slightly modified version of the [[http://www.mathworks.com/matlabcentral/contest/contests/11/rules.html|MATLAB Protein Folding contest]]. The task description is taken from there, and is modified for Python. Download ''{{:courses:be5b33prg:homeworks:protein_folding.zip|protein_folding.zip}}'' archive, containing both * ''{{:courses:be5b33prg:homeworks:folding.py|folding.py}}'' helper library, and * ''{{:courses:be5b33prg:homeworks:testsuite.txt|testsuite.txt}}'' file containing the amono acid chains to be folded. ===== Introduction ===== Your task is to solve the problem of folding of proteins, which are synthesized in the cell as long chains of amino acid building blocks, which automatically fold down into more compact working shapes. Exactly how they do this, and how they do it quickly and consistently, is poorly understood. {{ :courses:a4b99rph:cviceni:folding-small.gif?200|Protein Folding}} The problem is so notoriously difficult that even dramatic simplifications are difficult to solve. Here we use a stripped-down version of the protein-folding problem introduced by Lau and Dill in 1989. ===== Task description ===== This task concerns a simplified two-dimensional version of the protein folding problem. Given the amino acid sequence of a simplified protein, your job is to determine a final "optimally folded" configuration. Each potential arrangement of the protein sequence is associated with a free energy level, which we call the "result" of your folding algorithm. An optimal folding (and there may be more than one) corresponds to the configuration with the lowest possible result, which can be thought of as a minimum energy configuration. In our simplified world, there are only two kinds of amino acids: hydrophobic amino acids (denoted by 1) and hydrophilic amino acids (denoted by 0). Since proteins exist in a watery environment, the hydrophobic amino acids prefer to cling to each other, compact and apart from the water as much as possible. **Implementation note:** You should solve this task in Python. To simplify certain actions, we provide a library called ''folding.py'' containing several functions that can be used during the solution of the task. The implementation uses complex numbers to refer to certain points in a plane, and also to designate the bending directions. It is of course possible to use a different formalism, but then all the work is up to you. ==== Sequence of amino acids ==== The sequence of amino acids will be just a sequence of ''0''s and ''1''s: >>> a = [0, 1, 1, 1, 1, 0] The above list represents the sequence depicted below: hydrophobic amino acids are shown in red and hydrophilic amino acids are shown in blue. {{:courses:a4b99rph:cviceni:folding1.gif?200|Protein 1}} ==== Bending the sequence ==== Think of the amino acids that make up the protein as links in a chain. Each link may bend by a multiple of 90 degrees, but the distance between links may not change (and is assumed to be 1). No self-intersections are permitted, and the final result is constrained to lie in a single plane. Thus you can think of the chain as a line snaking through a Manhattan-style grid, one amino acid at a time. This is sometimes referred to as a self-avoiding walk. We will refer to the complex plane to simplify the configuration of any given chain. The folding sequence your solver returns will define the turn taken at each link by means of the complex numbers ''1'' (east), ''1j'' (north), ''-1'' (west), and ''-1j'' (south). If your solver returns the configuration sequence >>> c = [1, 1, 1, 1, 1] for the sequence of amino acids ''a'' defined above, it represents the straight line configuration shown in the diagram above. The configuration sequence is necessarily one element shorter than the sequence itself, since each element defines the relationship between two amino acids. Function ''positions()'' from module ''folding'' will compute the positions of individual amino acids in the complex plane, given the configuration. It assumes that the first amino acid is always anchored at 0. The position of each amino acid actually given by the cumulative sum of the configuration. >>> from folding import positions >>> p = positions(c) >>> p [0, 1, 2, 3, 4, 5] So, ''c = [1, 1, 1, 1, 1]'' can be interpreted as: "from your starting location, go east five consecutive times." To introduce a small turn in the configuration, we can put a northward turn (given by ''1j'') in the third element of the configuration vector ''c''. We expect to see a picture like the one below. {{ :courses:a4b99rph:cviceni:folding2.gif?200 |Protein 2}} This would translate to Python code as: >>> c = [1, 1, 1j, 1, 1] >>> p = positions(c) >>> p [0, 1, 2, (2+1j), (3+1j), (4+1j)] ==== Folding ==== Using function ''fold()'' from module ''folding'', you can create a new configuration by applying a folding vector to an old configuration. Complex arithmetic makes the folding calculations straightforward. It is just sufficient to multiply both vectors element by element. If we want to fold the entire right half of the original sequence up by 90 degrees, we can multiply the original ''c'' vector by a fold vector ''f'' that bends it like so >>> from folding import fold >>> c = [1, 1, 1, 1, 1] >>> f = [1, 1, 1j, 1j, 1j] >>> c2 = fold(c, f) >>> c2 [1, 1, 1j, 1j, 1j] >>> p2 = positions(c2) >>> p2 [0, 1, 2, (2+1j), (2+2j), (2+3j)] Thereby making the protein shown below. {{:courses:a4b99rph:cviceni:folding3.gif?100|Protein 3}} By applying successive folding vectors, your code should try to minimize the result. Here is one final configuration vector that tries to minimize the fold energy by bringing the four red amino acids close together. We apply yet another 90 degree turn to the last two elements. >>> f2 = [1, 1, 1, 1j, 1j] >>> c3 = fold(c2,f2) >>> c3 [1, 1, 1j, (-1+0j), (-1+0j)] >>> p3 = positions(c3) >>> p3 [0, 1, 2, (2+1j), (1+1j), 1j] {{ :courses:a4b99rph:cviceni:folding4.gif?100 |Protein 4}} ==== Scoring ==== The goal of this task is to find such a folding configuration for all amino acid sequences in the provided dataset such that the average free energy across all resulting proteins is minimized. The free energy for a single protein is calculated by totaling the distance between all pairs of hydrophobic amino acids. Your best configuration will have the hydrophobic amino acids in the most compact mass. Suppose you have the last sequence shown above. We ignore the location of the hydrophilic amino acids, since they do not contribute to the result, and then total up the distance matrix for the remaining elements. $$ E = 1 + 1 + 1 + 1 + \sqrt2 + \sqrt2 $$ The above can be achieved by using the ''score()'' function from module ''folding''. It computes the sum of distances for hydrophobic amino acids in sequence ''a'' given configuration ''c'': >>> from folding import score >>> score(a, c3) 6.82842712474619 **Important note:** function ''score()'' does not check the validity of the configuration, i.e. it does not check whether the configuration is self-avoiding. You must check that yourself. The overall quality (result) of your algorithm is judged using the average free energy across the whole data set. Compare your result with the [[http://www.mathworks.com/matlabcentral/contest/contests/11.html|table of winners]] (column "results") of the MATLAB Contest, or with the online scores at the [[http://www.mathworks.com/matlabcentral/contest/contests/11/statistics.html|statistics]] page. ===== Data ===== We provide a {{:courses:be5b33prg:homeworks:testsuite.txt|dataset with various amino acids sequences}} to be folded into minimal energy configurations. This dataset will also be used for the task evaluation. ===== Evaluation ===== You can get up to 10 points for solving this task. Additional 5 points can be awarded if the results will be outstanding. The task is evaluated manually, during a discussion with the teacher. We assume that the student shall be able to explain the algorithm and demonstrate her method, i.e. reading the data, result visualisation, and computation of the score for the whole dataset. In case of non-trivial dataset (and a result close to the table of winners) the task will be evaluated with the full 10 points. Additional up to 5 points for reaching results comparable with the best results. ===== Suggestions for elaboration ===== First and utmost, read the specs carefully and make sure you understand them. * Decide what formalism you will use. Implement a vector class for representation of amino acid sequences and configurations, which also allows to compute the cumulative sum of the vector (to compute the positions based on configuration), the elementwise multiplication (for folding), and computation of the free energy. You can use the formalism chosen for module ''folding'', or you can choose a different one. Module ''folding'' is implemented in pure Python; maybe using ''[[http://www.numpy.org/|numpy]]'' library would be better (not part of Python standard library, must be installed separately). * It is highly desirable to create a visualization tool which would display your protein configurations graphically. It may be e.g. a function ''visualize(a,c)'' taking the amino acid sequence and the configuration as arguments. Consider using e.g. the ''[[http://matplotlib.org|matplotlib]]'' plotting library (not part of Python standard library, must be installed separately). * Your solver (maybe a function named ''solve(a)'') shall take a sequence of amino acids as its input, and shall provide configuration ''c'' (sequence of length ''len(a)-1'') representing the final configuration of protein. You should make sure that the resulting configuration is self-avoiding, i.e. when positions of amino acids are computed based on the configuration, no position is attended twice or more times.