====== Tutorial 11 - Protein structure prediction ======

===== Recap =====

Make sure you can answer the following questions:
  * Describe the levels of protein structure. /* proteins exhibit a hierarchical organization of structure, starting with the linear sequence of amino acids (primary structure), which folds into local arrangements (secondary structure), then into the overall three-dimensional shape (tertiary structure), and finally, multiple subunits can assemble to form a functional protein complex (quaternary structure) */
  * How do we represent and store them? /* primary = FASTA = a header line starting with a ">" symbol, followed by the sequence itself; secondary = DSSP format (Dictionary of Protein Secondary Structure) = a sequence of one-letter codes, where each letter corresponds to a specific secondary structure element (H: alpha-helix, B: beta-bridge, E: beta-strand, T: turn, ...); tertiary and quaternary = PDB file (also XML) = information about the atoms, residues, secondary structure, and spatial coordinates of the protein */
  * Explain the meaning of the following terms when used for genes: analog, homolog, paralog, ortholog and xenolog. /* analog - arises by convergent evolution from unrelated ancestors, i.e. by homoplasy; xenolog - arises by horizontal transfer, e.g. from a bacterium to a eukaryote */
  * What is a protein ligand? /* typically a small molecule that forms a complex with a biomolecule, usually a protein; often a signaling molecule that binds to the binding site of the target protein through weak molecular interactions, which is why ligand binding is mostly reversible; binding of a ligand to a receptor protein usually changes its conformation and thereby determines the biological function of the protein */
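
As a small illustration of how a primary sequence is stored, the following sketch reads FASTA-formatted text (a header line starting with ">" followed by sequence lines) into a dictionary. The function name ''parse_fasta'' and the header label ''query_peptide'' are made up for this example; the sequence is a peptide used later in this tutorial.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record: the header label is invented, the sequence is
# wrapped over two lines as FASTA files often do.
example = """>query_peptide
HYLCKYVINAIPPTLTAKIH
FRPELPAERNQLIQRLA
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

Real FASTA headers often carry database identifiers and descriptions after the ">"; this sketch keeps the whole header line as the key.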

{{:courses:bin:tutorials:pdb-file-format-600x321.png?400|PDB file}} /* A PDB file also contains a header (structure name, authors, remarks about the structure), which is not shown here; the displayed lines describe individual atoms. The first line is ATOM number 1, a nitrogen that is part of aspartic acid and belongs to chain L (a chain is a linear sequence of AAs joined by covalent bonds, labeled A, B, ...; chains within one protein can bind to each other non-covalently, ..., i.e. some proteins consist of multiple chains (multimeric, cleaved, or discontinuous)). The atom belongs to the first amino acid, and its spatial coordinates in angstroms (10^-10 m, i.e. a tenth of a nanometer) are .... The second line is the second atom of the same amino acid, the alpha carbon, hence CA. */

PDB file format from https://lammpstube.com
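
The ATOM records shown in the figure use fixed column positions, so they can be read by slicing rather than by splitting on whitespace. The sketch below assumes the standard PDB column layout; the function name ''parse_atom_line'' is made up, and the coordinate, occupancy and B-factor values in the example record are invented for illustration (they are not taken from the figure).

```python
def parse_atom_line(line):
    """Split one fixed-width PDB ATOM/HETATM record into named fields."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11, atom number
        "atom_name": line[12:16].strip(), # columns 13-16, e.g. N, CA
        "res_name": line[17:20].strip(),  # columns 18-20, e.g. ASP
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26, residue number
        "x": float(line[30:38]),          # columns 31-38, angstroms
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }


# Example record assembled field by field so the column widths are explicit.
# All numeric values are hypothetical.
line = (
    "ATOM  "      # record name
    + "    1"     # atom serial number
    + " "
    + " N  "      # atom name
    + " "         # alternate location indicator
    + "ASP"       # residue name
    + " "
    + "L"         # chain identifier
    + "   1"      # residue sequence number
    + " "         # insertion code
    + "   "
    + "   4.060"  # x
    + "   7.307"  # y
    + "   5.186"  # z
    + "  1.00"    # occupancy
    + " 51.58"    # temperature factor
    + " " * 10
    + " N"        # element symbol
)

atom = parse_atom_line(line)
print(atom)
```

For real work, an established parser (for example the one in Biopython) is a safer choice than hand-written slicing, since it also handles headers, multiple models and malformed lines.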


===== Homology modeling - protein structure prediction exercise =====

A simple, although not always reliable, way to discover the secondary structure of a peptide sequence is to look up a protein with a similar primary sequence in a database. Let us try this! The task is to obtain the secondary structure of the following peptide sequence: ''HYLCKYVINAIPPTLTAKIHFRPELPAERNQLIQRLA''
  - Go to [[https://blast.ncbi.nlm.nih.gov/Blast.cgi]] and click "Protein blast".
  - Enter the sequence and enter "Homo sapiens (taxid:9606)" as organism.
  - Click the blast button and wait. This may take up to several minutes. 
  - Look for the best matching protein. It should be "monoamine oxidase A". /********** we are not looking for a partial match; the whole protein is 527 AAs long **********/
  - Enter this protein name into [[https://www.uniprot.org/uniprot/|UniProt]]. /********** note that only the third hit in the list has the correct length and can be used for the alignment **********/
  - Check whether the result has a secondary structure annotation and find the position corresponding to the BLAST match. /******* easiest to find under Structure -> Feature viewer; the match covers positions 263-297 (already visible in BLAST), a segment containing 2 alpha helices and 2 beta-sheet strands *******/

Use the procedure described above to learn as much as possible about the following peptide sequence: ''TEYAINKLRQLYVLRC''.

<note tip> A hint: the sequence is a part of a frequent [[https://en.wikipedia.org/wiki/Protein_domain|protein domain]]. </note>
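
The last step of the procedure, mapping the matched region onto a per-residue secondary structure annotation, can be sketched in a few lines. This is only an illustration under simplifying assumptions: the match is exact (BLAST also reports inexact matches), the annotation is a string of one-letter codes with "-" standing for "no assigned element", and the protein, annotation and peptide below are toy data, not a real database entry. The function names are made up.

```python
from itertools import groupby


def locate(peptide, protein):
    """Return the 1-based (start, end) of an exact peptide match, or None."""
    idx = protein.find(peptide)
    if idx < 0:
        return None
    return idx + 1, idx + len(peptide)


def segments(ss, offset=1):
    """Collapse a per-residue code string into (code, start, end) runs."""
    out = []
    pos = offset
    for code, run in groupby(ss):
        n = len(list(run))
        out.append((code, pos, pos + n - 1))
        pos += n
    return out


# Toy data (hypothetical): a short "protein" and its per-residue annotation,
# H = alpha-helix, E = beta-strand, T = turn, - = none.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
ss = "--HHHHHHHH--EEEEE---TT"
peptide = "AKQRQISFVK"

start, end = locate(peptide, protein)
matched_ss = ss[start - 1:end]          # annotation of the matched region
print(start, end, matched_ss)
print(segments(matched_ss, offset=start))
```

The same idea applies to the UniProt feature viewer: once the start and end positions of the BLAST match are known, only the annotated helices and strands that overlap this interval are relevant.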

/**
 * It is a part of the SH2 domain.
 *  The SH2 domain (Src-homology 2 domain) is a structural domain found to varying degrees in all eukaryotic organisms; it characteristically binds phosphorylated tyrosine (phosphotyrosine, pY). It is part of many proteins in the cell, especially signaling proteins. It is also part of the Src oncogene, which can cause cancerous growth.
 * It was taken from: https://www.pnas.org/doi/10.1073/pnas.011577898, Fig.1 (alphaA and betaB, the first two proteins JAK1 and JAK2 merged).
 * BLASTP finds: Tyrosine-protein kinase JAK1, total length 1154, match with positions 446-466 (important to know where to search for structures).
 *  JAK1 is a tyrosine kinase, i.e. an enzyme from the protein kinase group that catalyzes the transfer of a phosphate group (phosphorylation) from nucleoside triphosphates (mostly ATP) to the amino acid tyrosine in proteins. Phosphorylation is the most common post-translational modification of proteins and plays an important role in regulating many cellular signaling pathways.
 * UniProt Tyrosine-protein kinase JAK1 record:
 *  Molecule processing: check that the length is the same,
 *  Secondary structure: 446 ... helix starts, 463 ... beta sheet starts,
 *  Domains and Repeats: 439 – 544 ... SH2.
 */
===== Automated protein folding =====
[[https://www.cgl.ucsf.edu/chimerax/docs/user/tools/esmfold.html|ESMFold]] is a deep learning-based method developed for predicting protein tertiary structures from amino acid sequences. In the following steps, we will show how the method works. We will use one of the existing ESMFold predictions and verify it against the PDB database.

  - Access the public ESMFold webserver [[https://esmatlas.com/resources?action=fold|here]]. /* A method and server from META/Facebook known to be faster than AlphaFold2 */
  - Use the first available example in ESMFold: [[https://en.wikipedia.org/wiki/PETase|plastic degradation protein PETase]]. /* >PETase
MGSSHHHHHHSSGLVPRGSHMRGPNPTAASLEASAGPFTVRSFTVSRPSGYGAGTVYYPTNAGGTVGAIAIVPGYTARQSSIKWWGPRLASHGFVVITIDTNSTLDQPSSRSSQQMAALRQVASLNGTSSSPIYGKVDTARMGVMGWSMGGGGSLISAANNPSLKAAAPQAPWDSSTNFSSVTVPTLIFACENDSIAPVNSSALPIYDSMSRNAKQFLEINGGSHSCANSGNSNQALIGKKGVAWMKRFMDNDTRYSTFACENPNSTRVSDFRTANCSLEDPAANKARKEAELAAATAEQ*/
  - Inspect and download the 300-amino-acid primary sequence of PETase.
  - Predict/retrieve, visualize and download the ESMFold tertiary structure prediction (a PDB file). /* the response is very fast because the structure is precomputed; the page also serves as the atlas page, where many structures are already precomputed */
  - Validate the overall quality of the PDB file with [[https://prosa.services.came.sbg.ac.at/prosa.php|ProSA]]. /* the analysis compares our protein with other known proteins, so no reference structure is needed yet; the protein scores well, e.g. its z-score of –6 confirms that the structure is comparable to known real proteins of similar size */
  - Find the PETase enzyme in the PDB database using [[https://www.rcsb.org/search/advanced/sequence|sequence search]]. /* We use the sequence search and find a whole range of PETases; we orient ourselves by rank, length, etc. */
  - Use the highest-scoring hit; it should be [[https://www.rcsb.org/structure/5XJH|5XJH Crystal structure of PETase from Ideonella sakaiensis]].
  - Compare the predicted structure with the X-ray crystallography one available in the PDB database. Employ [[https://www.rcsb.org/alignment|Pairwise Structure Alignment]] available through the PDB site.
  - Use the 5XJH code for the PDB internal structure and upload the PDB file downloaded from ESMFold (set the chain to A and let residues go from 1 to 300).
  - See the match (also shown in the right figure below). 
    * The structures match well visually. 
    * [[https://en.wikipedia.org/wiki/Root_mean_square_deviation_of_atomic_positions|RMSD]] is the root-mean-square distance between corresponding atoms in the two structures after they have been aligned. Its value is 0.57 Å, which is below the threshold of 1 Å for high-quality models.
    * Also, the [[https://en.wikipedia.org/wiki/Template_modeling_score|TM-score]] of 0.99 is close to its maximum value of 1. As a rule, scores below 0.20 correspond to randomly chosen unrelated proteins, whereas structures with a score above 0.5 have roughly the same fold.
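
To make the two scores above more concrete, here is a minimal sketch of how they are computed once the structures have already been superposed and the residues paired. It does not perform the superposition itself, and a real TM-score is the maximum over all superpositions, so the function below only evaluates one fixed alignment. The function names and the toy coordinates are made up; the distance scale d0 follows the usual TM-score definition, d0 = 1.24 (L - 15)^(1/3) - 1.8, with L the length of the reference protein.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between paired, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: per-residue distances (angstroms) of the aligned pairs,
    target_length: number of residues in the reference protein.
    """
    if target_length < 19:
        # below this length d0 from the formula would not be positive
        raise ValueError("target_length too small for the d0 formula")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Toy C-alpha coordinates (hypothetical): the model is the reference
# shifted by 1 angstrom along z, so every pair is exactly 1 angstrom apart.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(rmsd(reference, model))

# A perfect match of all 300 residues versus a uniform 2 angstrom deviation.
perfect = tm_score([0.0] * 300, 300)
shifted = tm_score([2.0] * 300, 300)
print(perfect, shifted)
```

Unlike RMSD, the TM-score normalizes by the reference length and down-weights large distances, so a few badly placed residues do not dominate the result.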

{{ :courses:bin:tutorials:5xjh_merged.png?600 |PETase structure and its alignment with prediction}}

===== Independent work =====

Predict and validate your own protein structure. The output to be reported: 
  - a PDB protein of interest (you could start with the fragment from the homology section above),
  - a predicted structure,
  - a validation that reports the match between the experimental and predicted structure.

Test the limits of applicability of the prediction. You can focus on large proteins (such as [[https://www.rcsb.org/structure/4AKG|dynein motor protein 4AKG]] with multiple interacting domains), disordered proteins, proteins with novel folds, or membrane proteins. However, it becomes more and more difficult to find proteins for which prediction fails. See for example [[https://www.ebi.ac.uk/training/online/courses/alphafold/an-introductory-guide-to-its-strengths-and-limitations/strengths-and-limitations-of-alphafold/|the strengths and limitations of AlphaFold2]].
/* the 4AKG example has a sequence length of 2696 AAs; the ESMFold webserver accepts sequences only up to 400 AAs; AlphaFold3 takes 9 minutes to compute the structure, and the structure alignment in PDB also takes several minutes; the result is solid, with, among other metrics, RMSD 7.12 and TM 0.73 */
/* it turns out to be fairly difficult to find a protein with a completely wrong prediction; we found e.g. 8Q1H, chain 2, with TM 0.77, chosen because it is a new protein with a structure that looks unconventional at first sight */
/* another option is the 2 proteins reported as the worst in CASP15, T1122 and T1131, both with GDT below 50% (see the figure with performance curves in the lecture, taken from Kryshtafovych et al.: Critical assessment of methods of protein structure prediction (CASP)—Round XV., Proteins: Structure, Function, and Bioinformatics, 2023.); according to https://predictioncenter.org/casp15/domains_summary.cgi the first can be found in PDB as 8BBT, the second is unfortunately missing there; ESMFold reached a TM-score of 0.46, AlphaFold3 0.51 */

Resources (in addition to those linked above):
   - [[https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/ESMFold.ipynb|ESMFold Colab]] for protein structure prediction,
   - [[https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.2.0/AlphaFold2.ipynb|AlphaFold2 Colab]] for protein structure prediction,
   - [[https://zhanggroup.org/TM-align/|TM-align]] for protein structure alignment,
   - [[https://alphafoldserver.com|AlphaFold3 server]] for structure predictions containing proteins, DNA, RNA, ligands, ions (beware of a queuing system).
     /* AlphaFold3 is not well suited for predicting RNA secondary structure, because it returns a 3D structure and cannot be switched to another mode. Its output, e.g. for the RNA from the lecture, therefore cannot be directly compared with the output of the Nussinov algorithm, Mfold, or MXfold2 ... As for proteins, AlphaFold3 returns CIF files, which can also be compared in PDB; it returns 4 folds, of which the first is apparently the best. */
 

===== References =====

Abramson et al.: Accurate structure prediction of biomolecular interactions with AlphaFold 3, Nature, 2024 ([[https://www.nature.com/articles/s41586-024-07487-w_reference.pdf|pdf preprint]]).

Poleksic: Algorithms for optimal protein structure alignment, Bioinformatics, 2009 ([[https://academic.oup.com/bioinformatics/article/25/21/2751/228079|online]]).

/* Learn more about [[https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/protein_superposition/|protein superposition]] whose goal is to rotate and translate one protein structure (the "moving" structure) so that it best matches another protein structure (the "reference" structure) in three-dimensional space. Basic steps are: 1) selection of atoms: choose the atoms to be used for alignment, commonly the backbone C-alpha atoms, although sometimes all heavy atoms or specific residues are used, and match these atoms in both proteins, 2) initial alignment: based on sequence or structural motifs to provide a starting point for further refinement, 3) optimization: optimize the superposition by minimizing the RMSD or maximizing the TM-score; this typically involves iterative procedures to find the best rotation and translation that align the structures; Singular Value Decomposition (SVD) can also be used to find a rotation and translation which will minimise this distance (take the coordinate matrices of both proteins, center them, compute their covariance matrix, perform SVD on it, and use the matrices V and U to transform one protein onto the other), 4) evaluation: evaluate the quality of the superposition using metrics like RMSD and TM-score. Visual inspection using molecular visualization tools can also be helpful to understand the structural alignment. */