Difference between revisions of "Template:Greece/PSP"

 
(11 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<html>
 
<html>
 
<head>
 
<head>
<style>
 
#protein_prediction{ position: absolute; width: 800px; height: 800px; min-width:100px; min-height:100px; display: block; }
 
</style>
 
  
 +
<!-- MathJax Library -->
 +
<script src="https://2015.igem.org/common/MathJax-2.5-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
  
 
<!-- Collapsable -->
 
<!-- Collapsable -->
 +
<script type='text/javascript' src='https://2017.igem.org/Template:Greece/expanding_boxes_functionality?action=raw&ctype=text/javascript'></script>
 +
<link href='https://2017.igem.org/Template:Greece/expanding_boxes_stylesheet?action=raw&ctype=text/css' rel='stylesheet'>
 +
<!-- End of collapsable -->
 +
 +
<!-- Google Fonts -->
 +
<link href="https://fonts.googleapis.com/css?family=Poiret+One" rel="stylesheet">
 +
 +
    <link rel="stylesheet" type="text/css" href="https://2017.igem.org/Template:Greece/quorum_sensing_style?action=raw&ctype=text/css" />
 +
    <script type="text/javascript" src="https://2017.igem.org/Template:Greece/quorum_sensing_scripts?action=raw&ctype=text/javascript"></script>
 +
   
 +
</head>
 +
<body>
 +
 
<script>
 
<script>
$(document).ready(function(){
+
var close = document.getElementsByClassName("closebtn");
var acc = document.getElementsByClassName("accordion");
+
 
var i;
 
var i;
  
for (i = 0; i < acc.length; i++) {
+
for (i = 0; i < close.length; i++) {
  acc[i].onclick = function() {
+
    close[i].onclick = function(){
    this.classList.toggle("active");
+
        var div = this.parentElement;
    var panel = this.nextElementSibling;
+
        div.style.opacity = "0";
    if (panel.style.maxHeight){
+
        setTimeout(function(){ div.style.display = "none"; }, 600);
      panel.style.maxHeight = null;
+
 
+
      $('#display_box').height(function (index, height) {
+
        panel.id == 'panel_1' ? return (height - 7700) : return(height - 6820);
+
      });
+
    } else {
+
      panel.style.maxHeight = panel.scrollHeight + "px";
+
     
+
      $('#display_box').height(function (index, height) {
+
        panel.id == 'panel_1' ? return (height + 7700) : return(height + 6820);
+
      });
+
 
     }
 
     }
  }
 
 
}
 
}
 +
</script>
  
});
+
<style>
 +
  #HQ_page p{ font-size: 18px; line-height: 1.7em; }
  
</script>
+
  #article_box{ position: absolute; min-width: 700px; width: 85%; left: 8%; text-align: center; margin: 50px 0px 0px 0px; }
<link href='https://2017.igem.org/Template:Greece/expanding_boxes_stylesheet?action=raw&ctype=text/css' rel='stylesheet'>
+
<!-- End of collapsable -->
+
  #article_content{ margin: 50px 0px 0px 0px; }
 +
  #header{ display: none; font-size: 25px; font-weight: bold; }
  
<!-- Google Fonts -->
+
  .sections{ margin: 80px 0px 0px 0px; text-align: justify; }
<link href="https://fonts.googleapis.com/css?family=Poiret+One" rel="stylesheet">
+
  .sub_headers{ font-family: 'Poiret One', cursive; font-size: 40px; margin: 0px 0px 25px 0px; text-align: center; }
 
+
  .sub_sections, .sub_sections p{ font-size: 18px; line-height: 1.5;}
</head>
+
  .sub_images{ max-width: 75%; max-height: 75%; width: auto; height: auto; }
<body>
+
</style>
  
 
<div id='article_box'>
 
<div id='article_box'>
   <article id='static_content'>
+
   <article id='general_content'>
 
<header><strong class='sub_headers' style='font-size: 55px;'>Protein Structure Prediction</strong></header>
 
<header><strong class='sub_headers' style='font-size: 55px;'>Protein Structure Prediction</strong></header>
 
</article>
 
 
 
 +
<!-- First Collapsable -->
 +
<button id='Section1_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Why do we need to predict the tertiary structure of proteins? - MOTIVATION</button>
 +
<div id='panelPSP1' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_1' style='margin: 0px;' class='sections'>
 +
<p>Proteins consist of amino acid chains that fold in 3-dimensional space in ways we do not yet completely comprehend. Each protein chain spans from a handful of amino acids to more than a thousand residues, rendering the problem of determining the relationship between primary and tertiary structure immensely complex . </p>
 +
<p>In our project, we employed a mutated fimH modifiedby adding the RPMrel peptide in order to achieve selective adhesion of our E. coli strains on cancer cells. In particular, we the fimH sequence was altered by substituting 1 (Proline to Glutamine) and inserting 28 new residues (RPMrel, SpyTag, HisTag). We would like to explore how these changes affect the 3-dimensional structure of the fimH protein. </p>
 +
<p>To this end, we embarked on a journey to create an artificial neural network model that, given a protein's primary structure, can yield sufficient information to reconstruct the full 3-dimensional structure.</p>
 +
<p>The idea is that our model can provide insight as to how the tertiary structure changes after the modifications, in that we can compare the native structure of the protein with the structure of the protein's modified sequence.</p>
 +
<p>Moreover, such a model would be invaluable as it could predict structures de novo for proteins for which we do not yet have structural information. Such a protein is Apoptin, the toxin we used as our classifier's output in order to induce selective cell death.</p>
 +
<p>Therefore, our motivation is 2-fold:</p>
 +
<ul>
 +
<li style='list-style:none'><p>to study how a protein's structure changes after altering its primary sequence profile (in our case FimH)</p></li>
 +
<li style='list-style:none'><p>to predict the de novo tertiary protein structures based solely on primary structure information (in our case Apoptin)</p></li>
 +
</ul>
 +
<p>What we propose is an end-to-end pipeline that solely requires the amino acid sequence of the protein (primary structure) as input, predicts its corresponding secondary structure & contact map and finally recreates and visualizes the full 3-dimensional tertiary structure.</p></br>
 +
<div style='text-align: center'><img alt='' tittle='' class='sub_images' src='https://static.igem.org/mediawiki/2017/a/a3/Greekom_PSPworkflow.png' /></div>
 +
</section>
 +
</article>
 +
</div>
 +
<!-- Second Collapsable -->
 +
<button id='Section2_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>From Amino acids to 0s and 1s - ENCODING PROTEINS</button>
 +
<div id='panelPSP2' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_2' style='margin: 0px;' class='sections'>
 +
 +
<p>In order to unravel the protein sequence-to-structure mystery one inevitably needs to summon the power of neural networks due to the aforementioned complexity and data volume. To fully exploit the generalization power of neural networks, we need to design a concise yet unabridged representation for our data, the protein sequences and their corresponding tertiary structures.</p>
 +
<p>To make use of the primary and secondary structure sequence information we map the one-letter Amino acid codes to integers in range 1-20 and we use 0 to signify unknown encountered residues. We have now a vocabulary of 21 words that we project to a high-dimensional vector through the use of word embeddings [<a href='#ref1_psp' class='reference'>1</a>], a technique widely used in natural language processing. Similarly, regarding the secondary structure, we have a vocabulary of 8 words that is mapped to high-dimensional vectors through embedding.</a> 
 +
<p>A common representation of a protein's tertiary structure is the contact map. This N x N matrix (N the number of residues in the chain) encompasses all the required information to reconstruct the 3-dimensional structure. Every (x,y) point on the map is interpreted as a contact probability between the residues x and y. </p>
 +
<div style='text-align: center'><table style='position: relative; left: 50%; margin-right: -50%; transform: translate(-50%, 0%);'>
 +
<tr>
 +
<td>
 +
<div style='text-align: center'><img alt='' title='' class='sub_images' src='https://static.igem.org/mediawiki/2017/7/76/Greekom_PSPfimhnativecon9distance.png' /></div>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td>The fimH contact map with distance cutoff 9 Å recreated from chain A of PDB id 2VCO</td>
 +
</tr>
 +
</table></div>
 +
<p>We can see that the majority of contacts span around the diagonal and that is only natural as it means that residues which are at a closer distance have a higher probability of forming a contact than distant ones. In general, secondary structures form patterns on the contact map, e.g. helices are adjacent to the diagonal, and provide a foundation for the prediction of the most difficult long-range contacts. That being said, we first need to obtain information about the secondary structure and subsequently use that to assist in contact prediction.</p>
 +
<p>Since, we have access to ground truth for every protein sequence in our dataset we can employ supervised learning approaches. Secondary structure prediction is a classification problem of 8 classes, where we need to predict the secondary structure class of every residue, while providing context information for residues of the same sequence. Contact map prediction can be interpreted as a regression problem, where one needs to predict the contact probabilities for every pair of residues in the chain. </p>
  
<!-- Collapsable Section 1 -->
+
</section>
 +
</article>
 +
</div>
 +
<!-- Third Collapsable -->
 +
<button id='Section3_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Gaining local structural insight - SECONDARY STRUCTURE PREDICTION MODEL</button>
 +
<div id='panelPSP3' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_3' style='margin: 0px;' class='sections'>
 +
 +
<p>The first step, given the protein's primary structure is to make an educated guess about its secondary structure. In 2017, we don't even need to do that. We can have machine learning models do that for us.</p>
 +
<p>Although secondary structure provides valuable information for the local structural entities along the chain, it is not enough on its own to produce the contact map. Therefore, we can interpret this step, in terms of machine learning terminology, as an internal representation of the input that is passed on deeper in the network to assist contact prediction. However, since secondary structure has a semantic meaning of its own in the eyes of biologists, we treat the two models separately as it is standard practice in the literature.</p>
 +
<p>Our Secondary Structure Predictor consists of a wide bidirectional recurrent layer, followed by a number of fully connected layers and the output layer. The model trains on primary sequence samples and predicts the secondary structure class for every Amino acid.</p>
 +
<p>In a nutshell, the sequence passes through the recurrent layer and the model “remembers” what has already happened in the sequence -in terms of secondary structures- to provide context for every subsequent residue that it sees. The bigger the size of the recurrent layer the greater the memory capacity of the network and, in turn, its predictive power. </p>
 +
<p>The additional layers that follow progressively reduce the representation dimension until we are left with a 1-hot vector (1 x 8) that indicates the most prevalent class choice. In the end, the full predicted sequence corresponds to all secondary structure segments.</p>
  
<button id='Section1_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Why do we need to predict the tertiary structure of proteins? - MOTIVATION</button>
+
<div style='text-align: center'><table style='position: relative; left: 50%; margin-right: -50%; transform: translate(-50%, 0%);'>
<div id='panel_1' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
+
<tr>
<article id='article_content'>
+
<td>
<!-- Part One -->
+
<div style='text-align: center'><img alt='' title='' class='sub_images' src='https://static.igem.org/mediawiki/2017/5/55/Greekom_PSPss.png' /></div>
<section id='section_1' style='margin: 0px;' class='sections'>
+
</td>
  <article id='sub_article_1'>
+
</tr>
<header id='sub_header_1' class='sub_headers'>Chemical interaction model description</header>
+
</br>
+
<table>
+
 
<tr>
 
<tr>
<td><img id='image_1' class='sub_images' src /></td>
+
<td>Top Row: primary structure sequence of one-letter Amino acid codes</td>
 
</tr>
 
</tr>
<tr>
+
<tr>
<td><i>Quorum sensing network. The arrows imply chemical reactions.</i></td>
+
<td>Bottom Row: secondary structure sequence of one-letter class of the 8 classes</td>
</tr>
+
</tr>
</table>
+
</table></div>
</br>
+
<section id='sub_section_1' class='sub_sections'>
+
the protein LuxI, which produces AHL, and LuxR, which binds it and then, activated, goes on to induce the “DNA”. The “DNA” species refers to plasmids carrying the Lux regulatory system. Its induction by (LuxR.AHL)<sub>2</sub> marks the off->on transition; when most of the DNA is in the (uninduced) “DNA' form, the switch is off; when most of it is in the “DNA.(LuxR.AHL)<sub>2</sub>” form, the switch is on.</p>
+
</section>
+
    </br>
+
<table>
+
<tr>
+
<td><img id='image_2' class='sub_images' src='https://static.igem.org/mediawiki/2017/c/cd/Greekom_syncytium.png' /></td>
+
</tr>
+
<tr>
+
<td>Dilution mechanism that keeps the growth rate and the cell density constant. These are the conditions modelled in [<a href='#ref1'>1</a>].</td>
+
</tr>
+
</table>
+
</br>
+
<section id='sub_section_2' class='sub_sections'>
+
+
<p>We built this first model to better understand the constraints and basic properties of the physical system, before going into greater depth in the following sections. A caveat to our extrapolation is that cellular metabolism could be significantly altered when cells enter a stationary growth phase, impacting the core QS functionality. We keep this in mind, but have found no way to account for it.</p>
+
<p>In the following section, we present exploratory simulations of the system's behaviour.</p>
+
</section>
+
  </article>
+
</section>
+
<!-- Part Two -->
+
<section id='section_3' class='sections'>
+
  <article id='sub_article_3'>
+
<header id='sub_header_3' class='sub_headers'>Results (no custom growth model)</header>
+
</br>
+
+
</br>
+
+
+
+
            </br>
+
<section id='sub_section_7' class='sub_sections'>
+
<p>The growth model impacts the QS system greatly. As is evident in [figure <a href='#figure7'>7</a> vs <a href='#figure8'>8</a>], while the growth rate is high (r=1.5) quorum sensing is difficult to achieve. This corroborates the result in [<a href='#figure3'>figure 3</a>]. As the volumes increase and the growth curve remains the same, more AHL has to be produced to achieve the same concentration, which takes more time. At an extreme volume of $\SI{5}{\milli\liter$, in [<a href='#figure9'>figure 9</a>] QS still happens, but much later, at 40 hours. This volume is significant, because it is the volume of a small petri dish, which we would like to simulate with the diffusive model to compare the results.</p>
+
</section>
+
  </article>
+
</section>
+
  </article>
+
</div>
+
  
<button id='Section2_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Diffusive model</button>
+
</section>
<div id='panel_2' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
+
</article>
  <article>
+
</div>
<!-- Part One -->
+
<!-- Fourth Collapsable -->
<section id='section_6' style='margin:0px' class='sections'>
+
<button id='Section4_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Inferring the contacts - CONTACT MAP PREDICTION MODEL</button>
  <article id='sub_article_10'>
+
<div id='panelPSP4' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
<header id='sub_header_10' class='sub_headers'>Chemical interaction model description</header>
+
<article id='article_content'>
</br>
+
<!-- Part One -->
<table>
+
<section id='section_4' style='margin: 0px;' class='sections'>
<tr>
+
<td><img id='image_10' class='sub_images' src /></td>
+
<p>Our end goal is to predict contacts between residues in a pairwise fashion. For a sequence of N Amino acids we want to derive a N x N map, where each position shows the probability of the two residues being in contact.</p>
</tr>
+
<p>At first, we need to build a N x N map, hereafter referred to as “tensor”, that will serve as the input layer to our network and will be updated after each epoch as the network learns to solve the regression task. The tensor captures the semantic relationship between residues and constantly changes as training progresses.</p>
<tr>
+
<p>In the previous section, we mentioned how we use embeddings to map our single-letter Amino acid codes to high-dimensional vectors. That is, for a primary structure sequence of N residues, we end up with a N x m matrix, where m is the size of the embedding. </p>
<td></td>
+
<p>In order to exploit the additional information we have for every Amino acid about its secondary structure classification, we employ, once again, word embeddings of size k that we stack to our previous m-dimensional vector. Now, every Amino acid is encoded as a concatenation of the two embeddings with a resulting vector of size m + k.</p>
</tr>
+
<p>The secondary structure information may be incorporated directly as a result of our Secondary Structure Predictor model or - for proteins with known structures- can be provided directly as additional input. However, the latter can only be of use for model evaluation as, in practice, we are more interested to predict structures de novo.</p>
</table>
+
<p>The next step is to create a representation for every pair of Amino acids in order to create the full N x N tensor. For a pair (x,y) of Amino acids we concatenate their respective embeddings thus creating a new vector of size 2 * (m + k). As an alternative, we can perform an element-wise operation (e.g. multiplication) between the two distinct residue embeddings and maintain the resulting pair embedding dimension to m + k. </p>
</br>
+
<p>Now that we have constructed our tensor, we add several 2-dimensional convolutional layers to scan every Amino acid pair on our map for contacts. When training is complete, the resulting map is compared to the native contact map by specifying a probability threshold that distinguishes contacts from non-contacts.</p>
<section id='sub_section_10' class='sub_sections'>
+
<p>We've modelled a bacterial population in a well-stirred liquid culture so far. Without the “growth model”, we either model a stationary population or one that grows at a steady rate, its density maintained constant by compensating dilution. With the growth model, an initial inoculation grows to the environment's carrying capacity, modelling a bacterial colonization of a new environment.</p>
+
<p>Bacteria growing on a surface are packed very closely together, but the AHL they produce is free to leave their immediate surroundings and diffuse into the surrounding area. Diffusion is a well-described physical phenomenon and this model aims to couple the diffusive process with the chemical interactions of AHL inside millions of independent bacterial cells that are geometrically defined. There doesn't seem to be any case of chemical gradients affecting diffusion of AHL, therefore our model is concerned only with its concentration.</p>
+
<p>The primary goal was to simulate bacteria growing on the surface of an agar plate, as these conditions are easy to recreate experimentally and thus provide verification to our model. Our collaboration with iGEM Columbia was meant to enable these experiments, but unfortunately material shortages only allowed us to experiment in liquid cultures.</p>
+
</section>
+
</section>
+
<!-- Part Two -->
+
<section id='section_7' style='margin:0px' class='sections'>
+
  <article id='sub_article_11'>
+
<header id='sub_header_11' class='sub_headers'>Model Description</header>
+
<section id='sub_section_11' class='sub_sections'>
+
<p>We start with Fick's laws of diffusion [<a href='#ref6'>6</a>]. In the simplest case of isotropic media without mass transport phenomena or external potentials, the driving force of diffusion is the concentration gradient and the diffusion coefficient is a constant real number. Thus, the general diffusion equation takes the form of the simpler heat equation:</p>
+
$$\frac{\partial{\mathit{[AHL]}}}{\partial{t}} = D \nabla^2\mathit{[AHL]}$$
+
<p>It is a parabolic partial differential equation (PDE) in space and time. To specify a solvable problem based on such an equation, many more ingredients are needed: </p>
+
<ul>
+
<li>a geometry</li>
+
<li>initial conditions</li>
+
<li>boundary conditions</li>
+
</ul>
+
<p>To actually solve it, we furthermore need to select a solution algorithm, which requires its own ingredients.</p>
+
<p>If we allowed the diffusion coefficient to vary in space, we'd have a more general form of diffusion. Adding a production term $q$ to the right side, it becomes:</p>
+
$$\frac{\partial{\mathit{[AHL]}}}{\partial{t}} = \nabla \cdot \left(D \nabla \mathit{[AHL]} \right) + q$$
+
</section>
+
  </article>
+
</section>
+
<!-- Part Three -->
+
<section id='section_8' style='margin:0px' class='sections'>
+
  <article id='sub_article_12'>
+
<header id='sub_header_12' class='sub_headers'>Geometry</header>
+
<section id='sub_section_1' class='sub_sections'>
+
<p>The first design decision is to express the problem's geometry. At a first glance at the task at hand, to model bacteria growing on an agar plate, one might assume a top-down 2D perspective, with the AHL diffusing across the surface away from the bacteria. The diffusion of AHL is inherently a 3D phenomenon though, and this perspective couldn't easily incorporate the effects of diffusion along the height of the agar gel. In the end, we decided to model the entire 3D volume of an agar plate, with the bacteria at the top of the agar. The agar forms a short cylinder (a disk), surrounded by plastic on 3 sides and air on top. The cylinder is in the order of millimetres in height and centimetres across. An E. Coli cell is about 1μm -- a huge difference in scale! This difference makes the problem quite difficult to solve in practice.</p>
+
<p>An important simplification at this point is to assume axial symmetry around the axis of the cylinder, thus making the problem tractable. We express the PDE in cylindrical coordinates;  after eliminating the angular coefficients of the derivatives, we are left with:</p>
+
$$\rho \frac{\partial{\mathit{[AHL]}}}{\partial{t}} = D \left(\frac{\partial}{\partial{\rho}}\left(\rho \frac{\partial{\mathit{[AHL]}}}{\partial{\rho}}\right) + \frac{\partial}{\partial{z}}\left(\rho \frac{\partial{\mathit{[AHL]}}}{\partial{z}}\right)\right) + \rho q$$
+
<p>If we now transform $\rho \mapsto x$ & $z \mapsto y$, thus having:</p>
+
$$x \frac{\partial{\mathit{[AHL]}}}{\partial{t}} = \nabla \cdot \left(xD \nabla \mathit{[AHL]} \right) + x q$$
+
<p>This is identical to the diffusion equation above, where the time coefficient is $x$, the diffusion coefficient $xD$ and the production coefficient $xq$. Therefore, we'll solve this problem on a 2D vertical cross-section of the cylinder, whose solutions are the same as the initial equation on the full 3D cylinder. The final geometry is shown in [<a href='#figure10'>figure 10</a>]. The bacteria are the small red rectangles shown in the zoomed-in image.</p>
+
</section>
+
    </br>
+
<table>
+
<tr class='image_sector'>
+
<td><img id='image_11' alt='Large Agar' title='Large Agar' class='sub_images' src='https://static.igem.org/mediawiki/2017/d/d9/Greekom_mesh_largeAgar_full.png' /></td>
+
<td><img id='image_11a' alt='Zoom Bacteria' title='Zoom Bacteria' class='sub_images' src='https://static.igem.org/mediawiki/2017/7/7f/Greekom_mesh_largeAgar_zoomBacteria.png' /></td>
+
</tr>
+
<tr class='caption_sector'>
+
<td>Left: The geometry on which the diffusion PDE is solved. It represents an axisymmetric 3D cylindrical geometry: an agar plate. The left side is the cylinder’s axis, the right side is the rim. The top is the cylinder’s surface. On the top near the axis there are some bacteria. Since this perspective is a cross-section of the agar plate, the bacteria actually occupy a small disk on the surface of the agar near the axis (every shape in this geometry should be rotated around the axis to imagine its 3D representation).</br>
+
Right: Each red rectangle represents an E. Coli cell. The bacteria are organized in orderly rings and layers with no spaces between them (maximum density). In this case, there are 600 rings of bacteria and 4 layers. [FOOTNOTE]{Due to constraints with the mesh generation, this isn’t exactly the case. The reality is more complicated, but it simulates bacteria packed closely together. Notice that each red rectangle has a blue rectangle next to it. Only 1 in 3 red rectangles actually interacts with the AHL, the rest is inert geometry. Thus, 1 cell covers the space of 6 rectangles on the same layer, plus 6 more on the layer below. The cell’s AHL output is multiplied by the number of bacteria it replaces, thus in fact concentrating the production of this entire region on 1 cell. This should be a slight source of error though, because the diffusion coefficient is large.} The blue lines are the mesh, the solver’s spatial discretization. Observe how the mesh around the bacteria is very orderly, but also rather coarse (compared to the feature size). The loss in accuracy in this area is intentional: our model inherently can’t resolve concentration differences inside each bacterium’s cytoplasm, therefore a finer mesh would not provide extra information, only modelling artifacts -- and much more computation time!
+
</td>
+
</tr>
+
</br>
+
  </article>
+
</section>
+
<!-- Part Five -->
+
<section id='section_9' style='margin:0px' class='sections'>
+
  <article id='sub_article_13'>
+
<header id='sub_header_13' class='sub_headers'>Boundary and initial conditions</header>
+
<section id='sub_section_13' class='sub_sections'>
+
<p>Boundary conditions specify what happens to the concentration (Dirichlet BC), or to its gradient (Neumann BC), at the geometric boundaries. There's a lot of those: the edges of the agar, as well as the edges of the rectangles that represent a ring of bacteria. A Neumann BC takes the form $$\nabla \mathit{[AHL]} \cdot \hat{n} = q$$, where $q$ is the flux through the boundary. $\hat{n}$ is the unit vector normal to the boundary.</p>
+
<p>The edges of an agar plate are all reflecting boundaries, because AHL can't diffuse through them: plastic walls at the sides and bottom, air at the top. Thus, at the boundary $q=0$: no AHL goes through.</p>
+
<p>A boundary condition on the edge of the bacteria could simulate the semipermeable cell membrane, but unfortunately the current version of Matlab doesn't accommodate conditions on internal boundaries. To evaluate the significance of this restriction we've run a test simulation ([<a href='#figure11'>figure 11</a>]).</p>
+
<p>The initial conditions are described by the [AHL] at each point in the geometry at $t=0$. Here, they are 0.</p>
+
</section>
+
  </article>
+
</section>
+
<!-- Part Six -->
+
<section id='section_10' style='margin:0px' class='sections'>
+
  <article id='sub_article_14'>
+
<header id='sub_header_14' class='sub_headers'>Solution algorithm</header>
+
<section id='sub_section_1' class='sub_sections'>
+
<p>We solve the PDE with the finite element method (FEM). This method discretizes the continuous space into small elements by overlaying a mesh ([<a href='#figure10'>figure 10</a>]) and transforming the continuous geometry into a graph. Each node is a dependent variable. The time-dependent PDE is transformed in this manner into a large system of ODEs, with 1 equation for each node (because the boundary conditions are Neumann).</p>
+
</section>
+
  </article>
+
</section>
+
<!-- Part Seven -->
+
<section id='section_11' style='margin:0px' class='sections'>
+
  <article id='sub_article_15'>
+
<header id='sub_header_15' class='sub_headers'>Coupling AHL diffusion and localized chemical interactions</header>
+
<section id='sub_section_15' class='sub_sections'>
+
<p>By and large the most interesting piece of this puzzle is how to couple the spatially-oblivious chemical interaction network described before with the diffusion. [AHL] has become a spatial field. Each ring/layer of bacteria (see footnote in [<a href='#figure10'>figure 10</a>]) is an autonomous agent that interacts with the locally available AHL. A few things are evident: AHL concentrations at points on cytoplasms depend both on diffusion and on chemical interactions, and many more dependent variables are required, to store the concentration of every species at every bacterium.</p>
+
<p>A bacterium can be thought as a dynamic system with all the other chemical species as its state and local AHL as its input signal. The dynamic system can be seen as a function of the input and the previous state. Multiple dynamic systems can also be seen as a single function, because the function takes a point in space as input and knows which individual system to feed the input to. Since it can be seen as a function, it can readily be plugged into the equation above as the production coefficient - problem solved!</p>
+
<p>… solved, but for the ODE solver which fails miserably at such a convoluted, nested system of equations! The alternative that has been successfully implemented is to augment the system of node equations. To the equations produced by the spatial discretization a new set of equations for every bacterium involved is added.</p>
+
<p>The bacteria expect a single value for the concentration of AHL in the cytoplasm, but as can be seen in [<a href='#figure10'>figure 10</a>] to each bacterium correspond 6 mesh nodes. For simplicity, the value of [AHL] that the bacteria see is the mean of those nodes.</p>
+
<p>The d[AHL] produced from the bacterial equations also has to be distributed correctly on the mesh nodes. What we want to simulate when distributing the bacterial output is a constant source on the entire cytoplasm. There is a certain complexity in how the finite element method formulates the node equations and the nodes are not all equivalent, so we can't simply split the d[AHL] into many parts and give one to each node. Instead, we rely on the FEM algorithm to solve the spatial distribution problem for us by assigning to the bacterial geometries a constant production coefficient of 1, then hijack it by multiplying the resulting node increment coefficients by the d[AHL] produced by the bacteria.</p>
+
<p>Much is assumed or reverse-engineered in order to arrive at this coupling mechanism. To verify that the model is still on track, we run test simulations with a single bacterium and a small surrounding space. The results should be similar to (but not exactly the same with) the non-diffusive model. Indeed, there is close agreement ([<a href='#figure11'>figure 11</a>]).</p>
+
</section>
+
    </br>
+
<table>
+
<tr class='image_sector'>
+
<td><img id='image_12' alt='No wall d212' title='No wall d212' class='sub_images' src='https://static.igem.org/mediawiki/2017/2/2f/Greekom_fewcell_testSingleCell_noWall_V%3D212e-7.png' /></td>
+
<td><img id='image_12a' alt='Few cell' title='Few cell' class='sub_images' src='https://static.igem.org/mediawiki/2017/b/b0/Greekom_fewcell_testSinglecell_wall%3D4dot5e-7_V%3D212e-7.png' /></td>
+
<td><img id='image_12b' alt='' title='' class='sub_images' src='https://static.igem.org/mediawiki/2017/3/37/Greekom_fewcell_testSinglecell_singlecell_V%3D2dot12e-7.png' /></td>
+
</tr>
+
<tr class='caption_sector'>
+
<td>The spatial equilibrium model for 1 bacterium in a $\SI{2.12e-4}{\nano\liter}$ volume.</td>
+
<td>Same conditions, but simulated with the diffusive model. A diffusive barrier simulates the cell wall. Good agreement with the equilibrium model.</td>
+
<td>Diffusive model without the cell wall. Again, quite similar to the case with the wall, but much simpler to scale up. This bacterial model is used in the larger geometries.</td>
+
</tr>
+
</table>
+
  </article>
+
</section>
+
<!-- Part Eight -->
+
<section id='section_12' style='margin:0px' class='sections'>
+
  <article id='sub_article_16'>
+
<header id='sub_header_16' class='sub_headers'>Implementing Growth</header>
+
<section id='sub_section_16' class='sub_sections'>
+
<p>Growth requires adding new bacteria to the geometry, but the finite element method doesn't accommodate such changes. Consequently, the solution has to be stopped and restarted every time a new bacterium is added. This would be computationally prohibitive.</p>
+
<p>Instead, the complete growth curve is precalculated and then <strong>quantized</strong> adaptively to levels corresponding to adding many rings of bacteria at the same time, possibly adding millions of new bacteria at each growth step ([<a href='#figure12'>figure 12</a>]). The final number of bacteria generally depends on the nutrients provided by the growth medium and, for an agar plate with few added nutrients, is expected to be around $10^8.9$ bacteria [<a href='#ref2'>2</a>]. To further mimic the way bacterial colonies grow, once there are enough bacteria the older cells in the center die.</p>
+
<p>Bacterial growth constantly dilutes the cytoplasm - a process which heavily affects QS ([<a href='#figure3'>figure 3</a>]). Here, the growth model is implemented in large discrete steps. In each step, the ratio of existing to new bacteria is calculated and a dilution performed on the existing ones, in order to keep the quantity of every chemical species constant during the growth step, apart from the DNA, which cancels its dilution with replication. This discrete dilution event creates minor artifacts on the simulations that manifest as discontinuities on the graphs.</p>
+
</section>
+
    </br>
+
<table>
+
<tr class='image_sector'>
+
<td><img id='image_13' alt='' title='' class='sub_images' src='https://static.igem.org/mediawiki/2017/3/36/Greekom_fewcell_tinyGrowth_quantGC.png' /></td>
+
<td><img id='image_13a' alt='' title='' class='sub_images' src='' /></td>
+
</tr>
+
<tr class='caption_sector'>
+
<td></td>
+
<td></td>
+
</tr>
+
</table>
+
</br>
+
  </article>
+
</section>
+
<!-- Part Nine -->
+
<section id='section_12' style='margin:0px' class='sections'>
+
  <article id='sub_article_17'>
+
<header id='sub_header_17' class='sub_headers'>Results</header>
+
<section id='sub_section_17' class='sub_sections'>
+
<p>The basic question that we want to answer at this point is if and when will the bacterial colony exhibit QS behavior on an agar disk - faster or later than in a liquid culture with the same number of bacteria? We present 2 simulations that differ in scale. The first is a tiny agar disk measuring 3.4mm across and .551mm in depth, for a volume of 5μL ([<a href='#figure13'>figure 13</a>]). The second is much larger in scope and computational effort: agar in a small petri dish 34mm in diameter and 5.51mm in depth. The larger scope allows direct experimental evaluation of the model.</p>
+
</section>
+
    </br>
+
                            <video id="qs_video" muted="muted" preload="auto" controls>
+
                        <source src="https://static.igem.org/mediawiki/2017/9/9c/Greekom_fewcellvideogrowthOn.mp4" type="video/mp4">
+
                            </video>
+
</br>
+
<section id='sub_section_8' class='sub_sections'>
+
<p>The tiny disk is inoculated with 7e4 bacteria, which gradually grow to a final size of 8.12e6 bacteria over 11 growth steps (see the [video][here]). Quorum sensing is indeed triggered early under these conditions, at 11 hours ([<a href='#figure14'>figure 14</a>][here]).</p>
+
</section>
+
    </br>
+
<img id='image_14' alt='Final Geometry' title='Final Geometry' class='sub_images' src='https://static.igem.org/mediawiki/2017/a/ac/Greekom_fewcell_tinyGrowth_final_geometry.png' />
+
</br>
+
    </br>
+
<img id='image_14a' alt='Final Geometry small' title='Final Geometry small' class='sub_images' src />
+
</br>
+
                                <img id='image_14b' alt='Tiny Grow bact 106' title='Tiny Grow bact 106' class='sub_images' src='https://static.igem.org/mediawiki/2017/c/c7/Greekom_fewcell_tinyGrowthbact106.png' />
+
  </article>
+
</section>
+
  </article>
+
</div>
+
  
<script>
+
</section>
/* Resize the graph svg image */
+
</article>
var mySVG = document.getElementById("paper_no_growth");
+
</div>
 +
<!-- Fifth Collapsable -->
 +
<button id='Section5_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>Recreating proteins in three dimensions - TERTIARY STRUCTURE RECONSTRUCTION</button>
 +
<div id='panelPSP5' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_5' style='margin: 0px;' class='sections'>
 +
<p>The final step of the process is to retrieve the 3d structure from the predicted contact map. To achieve that, we feed the contact map to <a href='http://bioinformatics.cs.unibo.it/FT-COMAR/'>FT-COMAR</a> ], a tool that is able to recreate the tertiary structure by reading a contact map with values 1 (contact), 0 (no contact), -1 (uncertain) [<a href='#ref2_psp' class='reference'>2</a>]. In our implementation, a contact is declared uncertain if the predicted probability is between 0.3 and 0.6. The resulting 3-dimensional information is written to a file that we can load to <a href='https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html'>iCin3D</a> to visualize and interact with the reconstructed structure.</p>
 +
</section>
 +
</article>
 +
</div>
 +
<!-- Sixth Collapsable -->
 +
<button id='Section6_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>What have we learned? - PRELIMINARY RESULTS</button>
 +
<div id='panelPSP6' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_6' style='margin: 0px;' class='sections'>
 +
 +
<p>Due to limited access to computational resources, we were not able to train models of adequate performance to assess the changes in FimH tertiary structure or visualize the estimated structure of Apoptin in a reliable way. Although the models' performance were monotonously improving we could not reach convergence before the wiki freeze, as we were restricted by shallow architectures and small mini-batch sizes. </p>
 +
<p>For example, for the Secondary Structure Predictor we obtained a 27% Q8 accuracy using a relatively shallow 2-layer architecture. Q8 accuracy measures the percent of residues for which 8-state secondary structure is correctly predicted. For the Contact Map Predictor, we set a probability threshold of 0.45 to determine whether two residues are in contact and the model correctly predicted 43% of all contacts.</p>
 +
<p>We will continue to work towards improving the model performances as well as attempt different approaches to solve the contact map.</p>
 +
<p>However, for the sake of completeness, we employed a publicly available tool that offers similar functionality but different approach. RaptorX-Contact-Predict is a web server tool that predicts the contact map and provides direct visualization of the resulting tertiary structure through JSmol.</p>
  
mySVG.addEventListener("load",function() {
+
<div style='text-align: center'><img alt='' title='' class='sub_images' src='https://static.igem.org/mediawiki/2017/b/b2/Greekom_pspapoptin.gif' /></div>
  svgDoc = mySVG.contentDocument;
+
  svgDoc.getElementsByTagName('svg')[0].setAttribute('height', '800px');
+
  svgDoc.getElementsByTagName('svg')[0].setAttribute('width', '100%');
+
  svgDoc.getElementsByTagName('svg')[0].setAttribute('viewBox', '0 0 1920 1000');
+
  console.log(svgDoc);
+
  
}, false);
+
<div style='text-align: center'><table>
</script>
+
<tr>
<script>
+
<td>
$(document).scroll(function() {
+
<div style='text-align: center'><img alt='' title='' class='sub_image' src='https://static.igem.org/mediawiki/2017/5/58/Greekom_PSPcomparison.png' /></div>
    var y = $(document).scrollTop(), //get page y value
+
</td>
        label_elem = $('#label');
+
</tr>
        toggle_elem = $('#toggling');
+
<tr>
        alert_elem = $('.alert');
+
<td>Contact Map comparison of fimH wild type and our modified fimH. Black dots signify common contacts that exist in both structures, green dots the unique contacts in wild type fimH and the magenta dots the unique contacts that exist in our modified fimH sequence. The locations where the RPMrel and tags were inserted are evident in the contact map in places where there exist only magenta dots along the diagonal.</td>
    if(y >= 899.20){ if( y < 4230){
+
</tr>
        label_elem.addClass('label_after');
+
</table></div>
        toggle_elem.addClass('toggling_after');
+
 
        alert_elem.addClass('alert_after');
+
<table>
    }} else {
+
<tr>
        label_elem.removeClass('label_after');
+
<td>
        toggle_elem.removeClass('toggling_after');
+
<video id="video" autoplay='autoplay' muted="muted" preload="auto" loop="loop">
        alert_elem.removeClass('alert_after');
+
<source src="https://static.igem.org/mediawiki/2017/b/bb/Greekom_PSPfimhvideo.mp4" type="video/mp4">
    }
+
</video>
});
+
</td>
</script>
+
</tr>
 +
<tr>
 +
<td>The top left structure corresponds to wild type fimH, the bottom left to our modified fimH. To the right, we see the two structures superimposed, with wild type fimH shown in grey.</td>
 +
</tr>
 +
</table>
 +
</section>
 +
</article>
 +
</div>
 +
<!-- Seventh Collapsable -->
 +
<button id='Section7_collapsable' class='accordion' style='font-size: 20px; letter-spacing: 1.5px;'>A Deep Dive into Specifics - IMPLEMENTATION DETAILS</button>
 +
<div id='panelPSP7' class='panel' style='text-align: justify; padding: 20px 0px 0px 0px;'>
 +
<article id='article_content'>
 +
<!-- Part One -->
 +
<section id='section_7' style='margin: 0px;' class='sections'>
 +
<header><strong class='sub_headers'>Data</strong></header>
 +
<p>Our dataset is a set of 10932 proteins from the PDB database, that were selected based on certain criteria using the <a href='http://dunbrack.fccc.edu/Guoli/pisces_download.php'>PISCES</a> server. Specifically, the percentage identity cutoff is 60%, the resolution cutoff is 1.8 angstroms, and the R-factor cutoff is 0.25. </p>
 +
<p>We observed that 9754 out of 10932 proteins have some missing residues, for which there is no structural information available. In all such cases, the missing residues were discarded and the sequence truncated. 21 proteins were excluded from the dataset as there was no available secondary structure information whatsoever. </p>
 +
<p>We split the dataset of 10911 remaining proteins, into 8728 for training and 2183 for validation. In addition, we tested our Secondary Structure Predictor to the benchmark dataset CB513 in order to compare our approach with existing methods.</p>
 +
</section>
 +
<!-- Part Two -->
 +
<section id='section_7a' style='margin: 0px;' class='sections'>
 +
<header><strong class='sub_headers'>Labels</strong></header>
 +
<p>For the Secondary Structure Predictor the required label for every Amino acid in the sequence is the corresponding secondary structure information. Every Amino acid belongs to one of the 8 classes [<a href='#ref3_psp' class='reference'>3</a>]. Hence, every protein sequence is paired with its label, a sequence of equal length consisting of the class id for every Amino acid.</p>
 +
<p>To derive secondary structure labels for our dataset, we employed STRIDE, an algorithm designed for the assignment of protein secondary structure elements given the atomic coordinates of the protein, as defined by X-ray crystallography, protein NMR, or another protein structure determination method [<a href='#ref4_psp' class='reference'>4</a>].</p>
 +
<p>For the Contact Map Predictor the required label is the native contact map. To obtain the contact map we need the structural information provided in the PDB file of every protein sample, that is the atomic coordinates of every Amino acid. The contact map is constructed by computing the Cb distance for every pair of Amino acids in the chain and ,subsequently, if that distance satisfies a specified threshold [<a href='#ref5_psp' class='reference'>5</a>] the pair is considered to be in contact. </p>
 +
<p>To extract the native contact map for every PDB file, we implemented a group of helper functions that allow for the generation of contact maps of different distance cutoffs and distance types, namely Ca, Cb and Ca + Cb. </p>
 +
</section>
 +
<!-- Part three -->
 +
<section>
 +
<header><strong class='sub_headers'>Model Architecture</strong></header>
 +
<p>The Secondary Structure Predictor model consists of an embedding layer that maps the one hot Amino acid representations to 100-dimensional vectors, a bidirectional LSTM layer of 180 cells and a fully connected layer of 140 neurons with linear activation function. Finally, we have the output softmax layer of 8 neurons that points to the most probable class choice. The training loss is calculated according to the formula of categorical cross entropy.</p>
 +
<p>The Contact Map Predictor model has a tensor input that is feeded to 6 2D-convolutional layers. After each convolutional layer we enforce batch normalization to prevent exploding gradients. The output layer that computes the probability of contact has a linear activation function.</p>
 +
</section>
 +
<!-- Part four -->
 +
<section>
 +
<header><strong class='sub_headers'>Frameworks</strong></header>
 +
<p>The models were developed in Python using the Theano and Keras frameworks. </p>
 +
</section>
 +
</article>
 +
</div>
 +
 
 +
<!-- References -->
 +
<div id='references'>
 +
<article>
 +
<header><strong class='sub_headers'>References</strong></header>
 +
<section style='text-align:justify' class='sub_sections'>
 +
[<span id='ref1_psp'>1</span>] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).</br>
 +
[<span id='ref2_psp'>2</span>] Vassura, M., Margara, L., Di Lena, P., Medri, F., Fariselli, P., & Casadio, R. (2008). FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10), 1313-1315.</br>
 +
[<span id='ref3_psp'>3</span>] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers, 22(12), 2577-2637.</br>
 +
[<span id='ref4_psp'>4</span>] Frishman, D., & Argos, P. (1995). Knowledge‐based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566-579.</br>
 +
[<span id='ref5_psp'>5</span>] Duarte, J. M., Sathyapriya, R., Stehr, H., Filippis, I., & Lappe, M. (2010). Optimal contact definition for reconstruction of contact maps. BMC bioinformatics, 11(1), 283.</br>
 +
</section>
 +
</article>
 +
</div>
 +
 
 +
</div>
  
 
</body>
 
</body>
 
</html>
 
</html>

Latest revision as of 17:52, 14 December 2017

Protein Structure Prediction

Proteins consist of amino acid chains that fold in 3-dimensional space in ways we do not yet completely comprehend. Each protein chain spans from a handful of amino acids to more than a thousand residues, rendering the problem of determining the relationship between primary and tertiary structure immensely complex .

In our project, we employed a mutated fimH modifiedby adding the RPMrel peptide in order to achieve selective adhesion of our E. coli strains on cancer cells. In particular, we the fimH sequence was altered by substituting 1 (Proline to Glutamine) and inserting 28 new residues (RPMrel, SpyTag, HisTag). We would like to explore how these changes affect the 3-dimensional structure of the fimH protein.

To this end, we embarked on a journey to create an artificial neural network model that, given a protein's primary structure, can yield sufficient information to reconstruct the full 3-dimensional structure.

The idea is that our model can provide insight as to how the tertiary structure changes after the modifications, in that we can compare the native structure of the protein with the structure of the protein's modified sequence.

Moreover, such a model would be invaluable as it could predict structures de novo for proteins for which we do not yet have structural information. Such a protein is Apoptin, the toxin we used as our classifier's output in order to induce selective cell death.

Therefore, our motivation is 2-fold:

  • to study how a protein's structure changes after altering its primary sequence profile (in our case FimH)

  • to predict the de novo tertiary protein structures based solely on primary structure information (in our case Apoptin)

What we propose is an end-to-end pipeline that solely requires the amino acid sequence of the protein (primary structure) as input, predicts its corresponding secondary structure & contact map and finally recreates and visualizes the full 3-dimensional tertiary structure.


In order to unravel the protein sequence-to-structure mystery one inevitably needs to summon the power of neural networks due to the aforementioned complexity and data volume. To fully exploit the generalization power of neural networks, we need to design a concise yet unabridged representation for our data, the protein sequences and their corresponding tertiary structures.

To make use of the primary and secondary structure sequence information we map the one-letter Amino acid codes to integers in range 1-20 and we use 0 to signify unknown encountered residues. We have now a vocabulary of 21 words that we project to a high-dimensional vector through the use of word embeddings [1], a technique widely used in natural language processing. Similarly, regarding the secondary structure, we have a vocabulary of 8 words that is mapped to high-dimensional vectors through embedding.

A common representation of a protein's tertiary structure is the contact map. This N x N matrix (N the number of residues in the chain) encompasses all the required information to reconstruct the 3-dimensional structure. Every (x,y) point on the map is interpreted as a contact probability between the residues x and y.

The fimH contact map with distance cutoff 9 Å recreated from chain A of PDB id 2VCO

We can see that the majority of contacts span around the diagonal and that is only natural as it means that residues which are at a closer distance have a higher probability of forming a contact than distant ones. In general, secondary structures form patterns on the contact map, e.g. helices are adjacent to the diagonal, and provide a foundation for the prediction of the most difficult long-range contacts. That being said, we first need to obtain information about the secondary structure and subsequently use that to assist in contact prediction.

Since, we have access to ground truth for every protein sequence in our dataset we can employ supervised learning approaches. Secondary structure prediction is a classification problem of 8 classes, where we need to predict the secondary structure class of every residue, while providing context information for residues of the same sequence. Contact map prediction can be interpreted as a regression problem, where one needs to predict the contact probabilities for every pair of residues in the chain.

The first step, given the protein's primary structure is to make an educated guess about its secondary structure. In 2017, we don't even need to do that. We can have machine learning models do that for us.

Although secondary structure provides valuable information for the local structural entities along the chain, it is not enough on its own to produce the contact map. Therefore, we can interpret this step, in terms of machine learning terminology, as an internal representation of the input that is passed on deeper in the network to assist contact prediction. However, since secondary structure has a semantic meaning of its own in the eyes of biologists, we treat the two models separately as it is standard practice in the literature.

Our Secondary Structure Predictor consists of a wide bidirectional recurrent layer, followed by a number of fully connected layers and the output layer. The model trains on primary sequence samples and predicts the secondary structure class for every Amino acid.

In a nutshell, the sequence passes through the recurrent layer and the model “remembers” what has already happened in the sequence -in terms of secondary structures- to provide context for every subsequent residue that it sees. The bigger the size of the recurrent layer the greater the memory capacity of the network and, in turn, its predictive power.

The additional layers that follow progressively reduce the representation dimension until we are left with a 1-hot vector (1 x 8) that indicates the most prevalent class choice. In the end, the full predicted sequence corresponds to all secondary structure segments.

Top Row: primary structure sequence of one-letter Amino acid codes
Bottom Row: secondary structure sequence of one-letter class of the 8 classes

Our end goal is to predict contacts between residues in a pairwise fashion. For a sequence of N Amino acids we want to derive a N x N map, where each position shows the probability of the two residues being in contact.

At first, we need to build a N x N map, hereafter referred to as “tensor”, that will serve as the input layer to our network and will be updated after each epoch as the network learns to solve the regression task. The tensor captures the semantic relationship between residues and constantly changes as training progresses.

In the previous section, we mentioned how we use embeddings to map our single-letter Amino acid codes to high-dimensional vectors. That is, for a primary structure sequence of N residues, we end up with a N x m matrix, where m is the size of the embedding.

In order to exploit the additional information we have for every Amino acid about its secondary structure classification, we employ, once again, word embeddings of size k that we stack to our previous m-dimensional vector. Now, every Amino acid is encoded as a concatenation of the two embeddings with a resulting vector of size m + k.

The secondary structure information may be incorporated directly as a result of our Secondary Structure Predictor model or - for proteins with known structures- can be provided directly as additional input. However, the latter can only be of use for model evaluation as, in practice, we are more interested to predict structures de novo.

The next step is to create a representation for every pair of Amino acids in order to create the full N x N tensor. For a pair (x,y) of Amino acids we concatenate their respective embeddings thus creating a new vector of size 2 * (m + k). As an alternative, we can perform an element-wise operation (e.g. multiplication) between the two distinct residue embeddings and maintain the resulting pair embedding dimension to m + k.

Now that we have constructed our tensor, we add several 2-dimensional convolutional layers to scan every Amino acid pair on our map for contacts. When training is complete, the resulting map is compared to the native contact map by specifying a probability threshold that distinguishes contacts from non-contacts.

The final step of the process is to retrieve the 3d structure from the predicted contact map. To achieve that, we feed the contact map to FT-COMAR ], a tool that is able to recreate the tertiary structure by reading a contact map with values 1 (contact), 0 (no contact), -1 (uncertain) [2]. In our implementation, a contact is declared uncertain if the predicted probability is between 0.3 and 0.6. The resulting 3-dimensional information is written to a file that we can load to iCin3D to visualize and interact with the reconstructed structure.

Due to limited access to computational resources, we were not able to train models of adequate performance to assess the changes in FimH tertiary structure or visualize the estimated structure of Apoptin in a reliable way. Although the models' performance were monotonously improving we could not reach convergence before the wiki freeze, as we were restricted by shallow architectures and small mini-batch sizes.

For example, for the Secondary Structure Predictor we obtained a 27% Q8 accuracy using a relatively shallow 2-layer architecture. Q8 accuracy measures the percent of residues for which 8-state secondary structure is correctly predicted. For the Contact Map Predictor, we set a probability threshold of 0.45 to determine whether two residues are in contact and the model correctly predicted 43% of all contacts.

We will continue to work towards improving the model performances as well as attempt different approaches to solve the contact map.

However, for the sake of completeness, we employed a publicly available tool that offers similar functionality but different approach. RaptorX-Contact-Predict is a web server tool that predicts the contact map and provides direct visualization of the resulting tertiary structure through JSmol.

Contact Map comparison of fimH wild type and our modified fimH. Black dots signify common contacts that exist in both structures, green dots the unique contacts in wild type fimH and the magenta dots the unique contacts that exist in our modified fimH sequence. The locations where the RPMrel and tags were inserted are evident in the contact map in places where there exist only magenta dots along the diagonal.
The top left structure corresponds to wild type fimH, the bottom left to our modified fimH. To the right, we see the two structures superimposed, with wild type fimH shown in grey.
Data

Our dataset is a set of 10932 proteins from the PDB database, that were selected based on certain criteria using the PISCES server. Specifically, the percentage identity cutoff is 60%, the resolution cutoff is 1.8 angstroms, and the R-factor cutoff is 0.25.

We observed that 9754 out of 10932 proteins have some missing residues, for which there is no structural information available. In all such cases, the missing residues were discarded and the sequence truncated. 21 proteins were excluded from the dataset as there was no available secondary structure information whatsoever.

We split the dataset of 10911 remaining proteins, into 8728 for training and 2183 for validation. In addition, we tested our Secondary Structure Predictor to the benchmark dataset CB513 in order to compare our approach with existing methods.

Labels

For the Secondary Structure Predictor the required label for every Amino acid in the sequence is the corresponding secondary structure information. Every Amino acid belongs to one of the 8 classes [3]. Hence, every protein sequence is paired with its label, a sequence of equal length consisting of the class id for every Amino acid.

To derive secondary structure labels for our dataset, we employed STRIDE, an algorithm designed for the assignment of protein secondary structure elements given the atomic coordinates of the protein, as defined by X-ray crystallography, protein NMR, or another protein structure determination method [4].

For the Contact Map Predictor the required label is the native contact map. To obtain the contact map we need the structural information provided in the PDB file of every protein sample, that is the atomic coordinates of every Amino acid. The contact map is constructed by computing the Cb distance for every pair of Amino acids in the chain and ,subsequently, if that distance satisfies a specified threshold [5] the pair is considered to be in contact.

To extract the native contact map for every PDB file, we implemented a group of helper functions that allow for the generation of contact maps of different distance cutoffs and distance types, namely Ca, Cb and Ca + Cb.

Model Architecture

The Secondary Structure Predictor model consists of an embedding layer that maps the one hot Amino acid representations to 100-dimensional vectors, a bidirectional LSTM layer of 180 cells and a fully connected layer of 140 neurons with linear activation function. Finally, we have the output softmax layer of 8 neurons that points to the most probable class choice. The training loss is calculated according to the formula of categorical cross entropy.

The Contact Map Predictor model has a tensor input that is feeded to 6 2D-convolutional layers. After each convolutional layer we enforce batch normalization to prevent exploding gradients. The output layer that computes the probability of contact has a linear activation function.

Frameworks

The models were developed in Python using the Theano and Keras frameworks.

References
[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[2] Vassura, M., Margara, L., Di Lena, P., Medri, F., Fariselli, P., & Casadio, R. (2008). FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10), 1313-1315.
[3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
[4] Frishman, D., & Argos, P. (1995). Knowledge‐based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566-579.
[5] Duarte, J. M., Sathyapriya, R., Stehr, H., Filippis, I., & Lappe, M. (2010). Optimal contact definition for reconstruction of contact maps. BMC bioinformatics, 11(1), 283.