Team:Greece/Model

Quorum Sensing

Motivation

Can we predict if and when our engineered bacteria will invade their surrounding tissue in order to transfer the classifier plasmids?
Tumor invasion is mediated by a quorum sensing mechanism controlling the production of invasin and listeriolysin, the machinery that actuates it. It is a critical module of the integrated system. This basic safety- and efficacy-minded question motivates a closer examination of the system we call “quorum sensing”. It has been thoroughly examined in the literature [1] and by previous iGEM teams (ETH 2014, KU Leuven 2015, etc.), but under the assumption of spatial equilibrium, for example bacteria growing in a well-stirred bioreactor. But what happens with bacteria growing on top of a surface, such as the colon epithelium? Are the existing models adequate to describe this population-level phenomenon, or does the distinctly different diffusive environment lead to unexpected behaviour?
To answer these questions, we’ve produced modular MATLAB code that can serve future iGEM teams concerned with how diffusion interacts with other biological mechanisms.

Overview

Quorum sensing (QS) is the phenomenon of coordinated behaviour triggered by high bacterial population density. The Lux operon, responsible for quorum sensing, is a leaky switch. When it’s turned off, LuxR and LuxI are produced very slowly. LuxI produces a small signalling molecule in the homoserine lactone family (simply called “AHL” hereafter). LuxR is activated by it and in turn induces the Lux operon, in a positive feedback loop. The entire system thus functions as a latch-on switch; once the concentration of AHL enters a critical area, it increases rapidly and reaches a much higher equilibrium.
The name “quorum sensing” is most often used to describe the Lux operon’s regulatory system, but does the operon itself always function as that name implies? Does it turn on only when the bacterial population density is high? To answer this question, consider a thought experiment. In the shady forest underbrush, on top of a wax-covered leaf, there is a minuscule water droplet. Measuring no more than 0.1 nL, it has carried with it a few nutrients, and a few bacterial cells, say 10, happen to grow inside it. Even though the bacterial density is not very high, the limited space could “fill up” with the AHL produced by only a handful of bacteria. In that case, the Lux operon wouldn’t be triggered by the bacterial density per se, but by the restrictive diffusive environment, which gives merit to a term such as “diffusion sensing” [3,4,5]. These alternative hypotheses provide extra reasons to explore the system through simulation.
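The latch-like behaviour described above can be illustrated with a deliberately minimal toy model. The sketch below is not the reaction network of [1]: it collapses the whole system into one dimensionless AHL level with a leaky basal production, a cooperative positive feedback and a first-order loss. The function names, all parameter values and the `crowding` factor (standing for cells per unit volume, whether raised by a dense population or by a confined space) are assumptions chosen only for illustration.

```python
def ahl_rate(a, crowding, leak=0.05, induced=1.0, half_sat=1.0, loss=1.0):
    """Net rate of change of a dimensionless AHL level `a` (toy model)."""
    # Cooperative, saturating positive feedback: induced production rises with AHL.
    feedback = induced * a**2 / (half_sat**2 + a**2)
    # Production scales with crowding; loss is first order in AHL.
    return crowding * (leak + feedback) - loss * a


def settle(crowding, a0=0.0, dt=0.01, t_end=100.0):
    """Integrate with forward Euler from a0 and return the final AHL level."""
    a = a0
    for _ in range(int(t_end / dt)):
        a += dt * ahl_rate(a, crowding)
    return a


if __name__ == "__main__":
    for crowding in (1.0, 2.0, 3.0):
        print(f"crowding = {crowding:.1f} -> final AHL level = {settle(crowding):.3f}")
```

In this toy, the system stays in its low, leaky state for small `crowding` and jumps to a much higher equilibrium once the low state disappears, which is the qualitative behaviour of a latch-on switch.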
Our modelling starts with a closer look at previous work. We reimplement the deterministic model described in [1], vary its parameters, and add extensions such as diffusive losses and growth models [2] to get a better grasp of the phenomenon. This model describes continuous liquid cultures, where the bacteria are kept in a constant exponential growth phase by continuous dilution of the culture medium. These conditions, amenable though they may be to experimental validation, arguably reflect neither the natural conditions in the colon nor the conditions our experiment entails. The most significant deviations arise from two facts of the natural environment we want to simulate: the microbial community originates from a small initial population colonizing a new environment (thus requiring a custom growth model), and instead of developing in the full volume of a liquid medium, it colonizes the solid, permeable surface of the colon epithelium (thus requiring the simulation of diffusion). Our model therefore comprises 3 parts:

1. A network of chemical reactions in fixed-volume, well-stirred conditions that models the production and consumption of AHL as well as the system’s regulatory elements.
2. A custom growth model that evolves an initial inoculation to the environment's assumed carrying capacity.
3. A diffusive model for the evolution of AHL spatial distribution.

The “spatial equilibrium model” is built of the first 2 parts, while the “diffusive model” incorporates the last. Previous models of quorum sensing that we have studied only supply the first part.

Availability

The accompanying code is available on GitHub: https://github.com/igem-greece-2017

Chemical reaction model description

Figure 1: Quorum sensing network. The arrows imply chemical reactions.

This model takes the form of a network of chemical reactions that simulate intracellular processes at the population level. Although the same processes seem anything but deterministic upon a closer look at the single cell, the individual variations can be assumed to be independent and identically distributed for each cell and their averaging eliminates the variability at the population level. Consequently, these chemical reactions are simulated as ordinary differential equations.

The centerpiece is AHL. AHL is the small, freely-diffusing molecule that mediates cell-to-cell communication: when AHL levels are low, the quorum sensing switch is turned off; when they’re high, the switch is turned on. The rest of the interactions concern the protein LuxI, which produces AHL, and LuxR, which binds it and then, activated, goes on to induce the “DNA”. The “DNA” species refers to plasmids carrying the Lux regulatory system. Its induction by (LuxR.AHL)₂ marks the off-to-on transition: when most of the DNA is in the (uninduced) “DNA” form, the switch is off; when most of it is in the “DNA.(LuxR.AHL)₂” form, the switch is on.

Figure 2: Dilution mechanism that keeps the growth rate and the cell density constant. These are the conditions modelled in [1] (adapted from [1]).

Supplementary to the core QS functionality, DNA undergoes duplication and AHL undergoes diffusion between 2 spaces: the “internal” (intracellular) and the “external”. To understand the latter process, consider all the bacterial cytoplasms conjoined in a single volume separated by a membrane from the outside.

This model was originally built to simulate an entire bacterial population living in constant exponential growth conditions in a constant volume with continuous dilution. We assume, however, that these conditions do not significantly influence the cell’s internal mechanisms, and extend the approach to other living conditions, adjusting only the DNA duplication and AHL diffusion processes.

We built this first model to better understand the constraints and basic properties of the physical system, before going into greater depth in the following sections. A caveat to our extrapolation is that cellular metabolism could be significantly altered when cells enter a stationary growth phase, impacting the core QS functionality. We keep this in mind, but have found no way to account for it.

In the following section, we present exploratory simulations of the system's behaviour.

Results (no custom growth model)

Figure 3: Evolution of the QS dynamics when the dilution protocol described in [1] is implemented. Dividing bacteria constantly dilute their cytoplasm, severely slowing down QS. Culture volume: 0.2nL

Figure 4: Same culture volume (0.2nL), but without the dilution protocol and with the bacteria in the stationary phase

First, we simulate the system exactly as specified in the source material (figure 3), with the cells in a constant exponential growth phase and their density (and therefore their number) held constant. This implies constant dilution, which affects all chemical species apart from the DNA, whose replication exactly compensates for it. Because there are only 100 bacteria with a total cytoplasmic volume of 1.7e-4 nL in a 0.2 nL culture, AHL increases very slowly and the QS switch doesn’t toggle within 25 hours.
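As a quick check of scale, using only the figures quoted above, the cytoplasm makes up a very small fraction of the culture:

$$\frac{V_{cyt}}{V_{culture}} = \frac{1.7 \times 10^{-4}\ \mathrm{nL}}{0.2\ \mathrm{nL}} = 8.5 \times 10^{-4}, \qquad \frac{V_{cyt}}{100\ \mathrm{cells}} = 1.7 \times 10^{-6}\ \mathrm{nL} = 1.7\ \mathrm{fL\ per\ cell}$$

If AHL equilibrates between the internal and external spaces, whatever the cells produce is therefore spread over a volume roughly 1200 times larger than the cytoplasm that made it, which is consistent with the slow rise seen here.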

Without growing, and therefore without diluting to keep the bacteria at a constant density, the quorum sensing transition is triggered at 15 hours (figure 4) (t=0 refers to the time when the bacteria have adapted to their environment and begin producing AHL). This is evident from the onset of the sharp drop in uninduced “DNA”, as well as from a wrinkle in the AHL graph. This wrinkle is telltale: as the DNA is induced, the production rate of LuxI and LuxR increases. LuxI increases more slowly than LuxR, however, resulting in a transient drop in AHL as more of it is captured by LuxR. A little later, LuxI catches up and AHL levels increase faster.


Figure 5: Same conditions as in figure 4, changing only the culture volume. Left: 0.1nL Right: 0.4nL

Keeping all the conditions the same as in figure 4 and changing only the culture volume makes its effect on QS evident (figure 5). When the total volume is halved (0.1nL), QS occurs at 10 hours. When it is doubled (0.4nL), QS occurs at 23 hours (but the transition is more gradual). Under these conditions, the QS triggering time increases roughly linearly with the total volume.
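The relationship between volume and triggering time can be checked against the three values reported for figures 4 and 5. The following is a minimal sketch of an ordinary least-squares line through those points; the helper `linear_fit` and the variable names are ours, and with only three points this is a sanity check rather than a fitted law.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Culture volume (nL) and QS triggering time (h), as reported for figures 4 and 5.
volumes_nl = [0.1, 0.2, 0.4]
trigger_h = [10.0, 15.0, 23.0]

intercept, slope = linear_fit(volumes_nl, trigger_h)
# Deviation of each reported time from the fitted line, in hours.
residuals = [t - (intercept + slope * v) for v, t in zip(volumes_nl, trigger_h)]

if __name__ == "__main__":
    print(f"t_QS ~ {intercept:.1f} h + {slope:.1f} h/nL * V")
    print("residuals (h):", [round(r, 2) for r in residuals])
```

The residuals stay under half an hour, so a straight line describes these three points well, although the slight flattening at 0.4nL suggests the dependence is not exactly linear.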

Growth model description

How does a bacterial population colonize a solid surface? Are the dynamics similar to the liquid media?

According to [2], the growth dynamics are indeed very similar, the greatest difference being a more gradual transition between the exponential and stationary phases. We use the growth model III from [2]:

$$\frac{dN}{dt} = r N \left(1 - \left(\frac{N}{N_{max}}\right)^{m}\right) \left(1 - \left(\frac{N_{min}}{N}\right)^{n}\right)$$

Figure 6: Population growth on an agar surface. The population grows exponentially from a small inoculum to the environment’s carrying capacity after a short lag period. This is the full growth model, but in the simulations we disregard the lag phase. Its duration can’t be modelled precisely and, more importantly, we don’t expect the bacteria to actively express the Lux operon at that phase.

The model can be seen as a transition function between 2 population levels. The steepness of the transition, $r$, depends primarily on temperature and to a lesser extent on nutrient levels; $m$ and $n$ are mostly fixed, and $N_{min}$ is a parameter without a clear physical significance that only affects the duration of the lag phase. Here we ignore the lag phase, so $N_{min}=0$. In our simulations, we use the best estimates of these parameters for T = 30 °C and an environment with relatively few resources (agar with 1/25 the usual nutrient levels) based on [2]: r=1.5 [1/h], m=0.52, n=3.5.

The new equation is appended to the system that expresses the chemical reactions and supplies a variable dilution loss that depends on the variable growth rate. The simulations with the growth model keep a constant culture volume, like the previous ones, but allow the bacterial density to increase, without any external dilution of the entire culture. The bacterial equations nevertheless experience the same dilution term, which results from the cytoplasm constantly expanding during the growth phase. As the population transitions into the stationary phase, the dilution, following the growth rate, slows down as well, allowing the concentrations to increase freely.
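The growth equation and the dilution rate it supplies can be sketched in a few lines of Python (this is an illustration, not the team's MATLAB implementation). The values of r, m and n are those quoted above; the carrying capacity of 10^8.9 cells and the inoculum of 8.2e4 cells are borrowed from the diffusive simulations described later on this page and are assumptions for this example. The integrator is a plain fixed-step RK4.

```python
R = 1.5            # maximum specific growth rate [1/h], from the text
M = 0.52           # shapes the approach to the carrying capacity, from the text
N_EXP = 3.5        # lag-phase exponent, from the text (has no effect once N_MIN = 0)
N_MIN = 0.0        # lag phase ignored
N_MAX = 10 ** 8.9  # assumed carrying capacity [cells]


def growth_rate(n):
    """dN/dt for growth model III."""
    return R * n * (1.0 - (n / N_MAX) ** M) * (1.0 - (N_MIN / n) ** N_EXP)


def dilution_rate(n):
    """Specific growth rate (1/N) dN/dt, i.e. the cytoplasmic dilution rate [1/h]."""
    return growth_rate(n) / n


def simulate(n0, t_end, dt=0.01):
    """Integrate the growth curve with classical RK4; returns (times, cells)."""
    times, cells = [0.0], [n0]
    n = n0
    for step in range(1, int(round(t_end / dt)) + 1):
        k1 = growth_rate(n)
        k2 = growth_rate(n + 0.5 * dt * k1)
        k3 = growth_rate(n + 0.5 * dt * k2)
        k4 = growth_rate(n + dt * k3)
        n += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        times.append(step * dt)
        cells.append(n)
    return times, cells


if __name__ == "__main__":
    times, cells = simulate(n0=8.2e4, t_end=30.0)
    for hour in (0, 5, 10, 20, 30):
        n = cells[hour * 100]  # 100 steps per hour at dt = 0.01
        print(f"t = {hour:2d} h  N = {n:.3e}  dilution = {dilution_rate(n):.3f} 1/h")
```

The dilution rate starts close to r and falls towards zero as the population approaches the carrying capacity, which is what lets the intracellular concentrations rise freely in the stationary phase.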

Results (with growth model)

Figure 7: QS curves with growth for increasing culture volume. The final time is larger for the last plots. Culture volume left to right: 4μL,16μL,64μL,512μL,1024μL

Figure 8: QS curves with a reduced growth rate. Culture volume: 4μL. Compared to figure 7, QS is indeed triggered later (since here there are fewer bacteria at equal times), but at an earlier growth phase, before the transition to the stationary phase.

Figure 9: QS curves at an extreme colony volume: 5000μL. This volume is about the same as the agar in a small petri dish, which will become a useful reference for the diffusive model.

The growth model impacts the QS system greatly. As is evident from comparing figures 7 and 8, while the growth rate is high (r=1.5) quorum sensing is difficult to achieve. This corroborates the result in figure 3. As the volume increases and the growth curve remains the same, more AHL has to be produced to achieve the same concentration, which takes more time. At an extreme volume of 5mL (figure 9), QS still happens, but much later, at 40 hours. This volume is significant because it is the volume of a small petri dish, which we would like to simulate with the diffusive model to compare the results.

We’ve modelled a bacterial population in a well-stirred liquid culture so far. Without the “growth model”, we either model a stationary population or one that grows at a steady rate, its density maintained constant by compensating dilution. With the growth model, an initial inoculation grows to the environment’s carrying capacity, modelling a bacterial colonization of a new environment.

Bacteria growing on a surface are packed very closely together, but the AHL they produce is free to leave their immediate surroundings and diffuse into the surrounding area. Diffusion is a well-described physical phenomenon and this model aims to couple the diffusive process with the chemical reactions of AHL inside millions of independent bacterial cells that are geometrically defined. There doesn’t seem to be any case of electrochemical gradients affecting AHL diffusion; therefore, our model is concerned only with its concentration.

The primary goal was to simulate bacteria growing on the surface of an agar plate, because these conditions are easy to recreate experimentally and can therefore be used to verify our model. Afterwards, we intended to extend the model to growth on the cell line monolayer we use in our experiments as a colorectal cancer model. Our collaboration with iGEM Columbia was meant to enable these experiments, but unfortunately material shortages only allowed us to experiment in liquid cultures.

Model Description

We start with Fick’s laws of diffusion. In the simplest case of isotropic media without mass transport phenomena or external potentials, the driving force of diffusion is the concentration gradient and the diffusion coefficient is a constant real number. Thus, the general diffusion equation takes the form of the simpler heat equation:

$$\frac{\partial{\mathit{[AHL]}}}{\partial{t}} = D \nabla^{2} \mathit{[AHL]}$$

It is a parabolic partial differential equation (PDE) in space and time. To specify a solvable problem based on such an equation, many more ingredients are needed:

a geometry
initial conditions
boundary conditions

To actually solve it, we furthermore need to select a solution algorithm, which requires its own ingredients.

If we allowed the diffusion coefficient to vary in space, we’d have a more general form of diffusion. Adding a production term $q$ to the right side, it becomes:

$$\frac{\partial{\mathit{[AHL]}}}{\partial{t}} = \nabla \cdot \left(D \nabla \mathit{[AHL]}\right) + q$$

Geometry

The first design decision is how to express the problem’s geometry. At first glance, to model bacteria growing on an agar plate one might assume a top-down 2D perspective, with the AHL diffusing across the surface away from the bacteria. The diffusion of AHL is inherently a 3D phenomenon though, and this perspective couldn’t easily incorporate the effects of diffusion along the height of the agar gel. In the end, we decided to model the entire 3D volume of an agar plate, with the bacteria at the top of the agar. The agar forms a short cylinder (a disk), surrounded by plastic at the sides and bottom and by air on top. The cylinder is on the order of millimetres in height and centimetres across. An E. coli cell is about 1μm in size: a huge difference in scale! This difference makes the problem quite difficult to solve in practice.

An important simplification at this point is to assume axial symmetry around the axis of the cylinder, which makes the problem tractable. We express the PDE in cylindrical coordinates; after dropping the angular derivatives and multiplying through by $\rho$, we are left with:

$$\rho \frac{\partial{\mathit{[AHL]}}}{\partial{t}} = D \left(\frac{\partial}{\partial{\rho}}\left(\rho \frac{\partial{\mathit{[AHL]}}}{\partial{\rho}}\right) + \frac{\partial}{\partial{z}}\left(\rho \frac{\partial{\mathit{[AHL]}}}{\partial{z}}\right)\right) + \rho q$$

If we now rename $\rho \mapsto x$ and $z \mapsto y$, we obtain:

$$x \frac{\partial{\mathit{[AHL]}}}{\partial{t}} = \nabla \cdot \left(xD \nabla \mathit{[AHL]} \right) + x q$$

This has the same form as the general diffusion equation above, with time coefficient $x$, diffusion coefficient $xD$ and production coefficient $xq$. Therefore, we solve this problem on a 2D vertical cross-section of the cylinder; its solutions are the same as those of the original equation on the full 3D cylinder. The final geometry is shown in figure 10. The bacteria are the small red rectangles shown in the zoomed-in image.
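For reference, the step from the general diffusion equation to the axisymmetric form can be written out explicitly. With no dependence on the angle, the divergence term in cylindrical coordinates reduces to

$$\nabla \cdot \left(D \nabla \mathit{[AHL]}\right) = \frac{1}{\rho} \frac{\partial}{\partial{\rho}}\left(\rho D \frac{\partial{\mathit{[AHL]}}}{\partial{\rho}}\right) + \frac{\partial}{\partial{z}}\left(D \frac{\partial{\mathit{[AHL]}}}{\partial{z}}\right)$$

Multiplying the whole equation by $\rho$, and using the fact that $\rho$ does not depend on $z$ (so it can be moved inside the $z$-derivative), gives

$$\rho \frac{\partial{\mathit{[AHL]}}}{\partial{t}} = \frac{\partial}{\partial{\rho}}\left(\rho D \frac{\partial{\mathit{[AHL]}}}{\partial{\rho}}\right) + \frac{\partial}{\partial{z}}\left(\rho D \frac{\partial{\mathit{[AHL]}}}{\partial{z}}\right) + \rho q$$

which, for a constant $D$, is the axisymmetric equation given above.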

Figure 10 left: The geometry on which the diffusion PDE is solved. It represents an axisymmetric 3D cylindrical geometry: an agar plate. The left side is the cylinder’s axis, the right side is the rim. The top is the cylinder’s surface. On the top near the axis there are some bacteria. Since this perspective is a cross-section of the agar plate, the bacteria actually occupy a small disk on the surface of the agar near the axis (every shape in this geometry should be rotated around the axis to imagine its 3D representation).

Figure 10 right: Each red rectangle represents an E. coli cell. The bacteria are organized in orderly rings and layers with no spaces between them (maximum density). In this case, there are 600 rings of bacteria and 4 layers**. The blue lines are the mesh, the solver’s spatial discretization. Observe how the mesh around the bacteria is very orderly, but also rather coarse (compared to the feature size). The loss in accuracy in this area is intentional: our model inherently can’t resolve concentration differences inside each bacterium’s cytoplasm, therefore a finer mesh would not provide extra information, only modelling artifacts and much more computation time!

** Due to constraints with the mesh generation, this isn’t exactly the case. The reality is more complicated, but it simulates bacteria packed closely together. Notice that each red rectangle has a blue rectangle next to it. Only 1 in 3 red rectangles actually interacts with the AHL; the rest is inert geometry. Thus, 1 cell covers the space of 6 rectangles on the same layer, plus 6 more on the layer below. The cell’s AHL output is multiplied by the number of bacteria it replaces, in effect concentrating the production of this entire region on 1 cell. This should be only a slight source of error, because the diffusion coefficient is large.

Boundary and initial conditions

Boundary conditions specify what happens to the concentration (Dirichlet BC), or to its gradient (Neumann BC), at the geometric boundaries. There are many of these: the edges of the agar, as well as the edges of the rectangles that each represent a ring of bacteria. A Neumann BC takes the form $\nabla \mathit{[AHL]} \cdot \hat{n} = q$, where $\hat{n}$ is the unit vector normal to the boundary and $q$ here denotes the flux through the boundary (not the production term used above).

The edges of an agar plate are all reflecting boundaries, because AHL can’t diffuse through them: plastic walls at the sides and bottom, air at the top. Thus, at the boundary $q=0$: no AHL goes through.

A boundary condition on the edge of the bacteria could simulate the semipermeable cell membrane, but unfortunately the current version of MATLAB doesn’t accommodate conditions on internal boundaries. To evaluate the significance of this restriction, we’ve run a test simulation (figure 11).

The initial conditions are described by the [AHL] at each point in the geometry at $t=0$. Here, they are 0.

Solution algorithm

We solve the PDE with the finite element method (FEM). This method discretizes the continuous space into small elements by overlaying a mesh (figure 10) and transforming the continuous geometry into a graph. Each node is a dependent variable. The time-dependent PDE is transformed in this manner into a large system of ODEs, with 1 equation for each node (because the boundary conditions are Neumann).
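The idea of turning a PDE into one ODE per node can be shown on a much simpler problem. The sketch below swaps the finite element method for 1D finite differences on a uniform grid (a simpler technique than the one used here, chosen only to keep the example short). It keeps the two features that matter for this model: a production term and reflecting (zero-flux) ends. The function names, the dimensionless units and the forward Euler time stepping are assumptions for illustration.

```python
def diffusion_rhs(u, d_coef, dx, source):
    """du/dt at every node for 1D diffusion with zero-flux (reflecting) ends."""
    n = len(u)
    dudt = []
    for i in range(n):
        # Mirror the end nodes so that no flux crosses the walls.
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        dudt.append(d_coef * (left - 2.0 * u[i] + right) / dx**2 + source[i])
    return dudt


def integrate(u0, d_coef, dx, source, dt, steps):
    """Advance the node ODEs with forward Euler (stable for dt <= dx**2 / (2 * d_coef))."""
    u = list(u0)
    for _ in range(steps):
        dudt = diffusion_rhs(u, d_coef, dx, source)
        u = [ui + dt * di for ui, di in zip(u, dudt)]
    return u


if __name__ == "__main__":
    # A pulse at the left wall spreads out; with reflecting walls nothing is lost.
    start = [1.0] + [0.0] * 49
    end = integrate(start, d_coef=1.0, dx=1.0, source=[0.0] * 50, dt=0.25, steps=2000)
    print(f"total before = {sum(start):.6f}, total after = {sum(end):.6f}")
```

Because the walls reflect, the total amount only changes through the source term, which is a convenient check on any implementation of the boundary conditions.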

Coupling AHL diffusion and localized chemical interactions

By far the most interesting piece of this puzzle is how to couple the spatially-oblivious chemical reaction network described above with the diffusion. [AHL] has become a spatial field. Each ring/layer of bacteria (see footnote in figure 10) is an autonomous agent that interacts with the locally available AHL. A few things are evident: AHL concentrations at points inside cytoplasms depend both on diffusion and on chemical interactions, and many more dependent variables are required to store the concentration of every species at every bacterium.

A bacterium can be thought of as a dynamical system with all the other chemical species as its state and local AHL as its input signal. The dynamical system can be seen as a function of the input and the previous state. Multiple dynamical systems can also be seen as a single function, because the function takes a point in space as input and knows which individual system to feed the input to. Since it can be seen as a function, it can readily be plugged into the equation above as the production coefficient - problem solved!

… solved, but for the ODE solver, which fails miserably at such a convoluted, nested system of equations! The alternative that has been successfully implemented is to augment the system of node equations: a new set of equations for every bacterium involved is added to the equations produced by the spatial discretization.

The bacteria expect a single value for the concentration of AHL in the cytoplasm, but as can be seen in figure 10, 6 mesh nodes correspond to each bacterium. For simplicity, the value of [AHL] that a bacterium sees is the mean over those nodes.

The d[AHL] (change in [AHL]) produced from the bacterial equations also has to be distributed correctly on the mesh nodes. What we want to simulate when distributing the bacterial output is a constant source on the entire cytoplasm. There is a certain complexity in how the finite element method formulates the node equations and the nodes are not all equivalent, so we can’t simply split the d[AHL] into many parts and give one to each node. Instead, we rely on the FEM algorithm to solve the spatial distribution problem for us by assigning to the bacterial geometries a constant production coefficient of 1, then hijack it by multiplying the resulting node increment coefficients by the d[AHL] produced by the bacteria.
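The two directions of this coupling (field to bacterium, bacterium to field) can be sketched as follows. This is a schematic Python illustration, not the MATLAB code: the function names are ours, and `unit_source_weights` is a hypothetical stand-in for the per-node increments that the FEM assembly would produce for a production coefficient of 1 on the bacterium's geometry.

```python
def sensed_ahl(field, node_ids):
    """AHL level a bacterium sees: the mean of the field over its mesh nodes."""
    return sum(field[i] for i in node_ids) / len(node_ids)


def add_bacterial_output(node_rates, node_ids, unit_source_weights, d_ahl):
    """Spread a bacterium's d[AHL] over its nodes.

    unit_source_weights[i] is the increment node i would receive from a unit
    production coefficient (hypothetical values in this example); scaling it
    by d_ahl turns the unit source into the bacterium's actual output.
    """
    for i in node_ids:
        node_rates[i] += unit_source_weights[i] * d_ahl
    return node_rates


if __name__ == "__main__":
    field = [0.0, 1.0, 2.0, 3.0]            # [AHL] at four mesh nodes
    nodes = [1, 2, 3]                        # nodes belonging to one bacterium
    weights = {1: 0.25, 2: 0.5, 3: 0.25}     # made-up unit-source increments
    print("sensed [AHL]:", sensed_ahl(field, nodes))
    print("node rates:", add_bacterial_output([0.0] * 4, nodes, weights, d_ahl=4.0))
```

The nodes do not receive equal shares: the weights carry the information about how the discretization treats each node, which is why the output is scaled rather than split evenly.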

Much is assumed or reverse-engineered in order to arrive at this coupling mechanism. To verify that the model is still on track, we run test simulations with a single bacterium and a small surrounding space. The results should be similar to (but not exactly the same as) those of the non-diffusive model. Indeed, there is close agreement (figure 11).

Figure 11:

A: The spatial equilibrium model for 1 bacterium in a 2.12E-4 nL volume.
B: Same conditions, but simulated with the diffusive model. A diffusive barrier simulates the cell wall. Good agreement with the equilibrium model.
C: Diffusive model without the cell wall. Again, quite similar to the case with the wall, but much simpler to scale up. This bacterial model is used in the larger geometries.

Implementing Growth

Growth requires adding new bacteria to the geometry, but the finite element method doesn’t accommodate such changes. Consequently, the solution has to be stopped and restarted every time a new bacterium is added. This would be computationally prohibitive.

Instead, the complete growth curve is precalculated and then quantized adaptively to levels corresponding to adding many rings of bacteria at the same time, possibly adding millions of new bacteria at each growth step (figure 12). The final number of bacteria generally depends on the nutrients provided by the growth medium and, for an agar plate with few added nutrients, is expected to be around $10^{8.9}$ bacteria [2]. To further mimic the way bacterial colonies grow, once there are enough bacteria the older cells in the center die.
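A quantization of this kind can be sketched as a step function that always holds the largest allowed population level not exceeding the continuous curve. How the levels are chosen adaptively (whole rings of bacteria) is not reproduced here; the level list, the function names and the toy numbers below are assumptions for illustration only.

```python
from bisect import bisect_right


def quantize_curve(cells, levels):
    """Replace each value of a growth curve by the largest allowed level not above it.

    `levels` must be sorted in ascending order; values below the first level
    are clamped to it.
    """
    return [levels[max(bisect_right(levels, n), 1) - 1] for n in cells]


def mean_relative_error(cells, quantized):
    """Mean of |N_quantized - N| / N over the curve."""
    return sum(abs(q - n) / n for n, q in zip(cells, quantized)) / len(cells)


if __name__ == "__main__":
    curve = [100, 150, 220, 400, 790, 1000]   # toy growth curve (cells)
    levels = [100, 200, 400, 800, 1000]       # toy allowed population sizes
    steps = quantize_curve(curve, levels)
    print("quantized:", steps)
    print(f"mean relative error: {mean_relative_error(curve, steps):.3f}")
```

Each change of level corresponds to one stop-and-restart of the solver, so the choice of levels trades quantization error against computation time.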

Bacterial growth constantly dilutes the cytoplasm - a process which heavily affects QS (figure 3). Here, the growth model is implemented in large discrete steps. In each step, the ratio of existing to new bacteria is calculated and a dilution performed on the existing ones, in order to keep the quantity of every chemical species constant during the growth step, apart from the DNA, which cancels its dilution with replication. This discrete dilution event creates minor artifacts on the simulations that manifest as discontinuities on the graphs.
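One reading of this discrete dilution step is sketched below: when the colony grows from `n_old` to `n_new` cells, every per-cell concentration is scaled by `n_old / n_new` so that the total amount over the colony is unchanged, except for the DNA, whose replication keeps its concentration constant. The function name, the species names and the `replicated` argument are hypothetical; in the full model every DNA form would have to be listed there.

```python
def dilute_for_growth(concentrations, n_old, n_new, replicated=("DNA",)):
    """Apply one discrete growth step to per-cell concentrations.

    Every species is scaled by n_old / n_new, so that concentration times the
    number of cells is the same before and after the step; species listed in
    `replicated` keep their concentration unchanged.
    """
    factor = n_old / n_new
    return {
        name: value if name in replicated else value * factor
        for name, value in concentrations.items()
    }


if __name__ == "__main__":
    state = {"AHL": 2.0, "LuxR": 4.0, "DNA": 1.0}
    print(dilute_for_growth(state, n_old=100, n_new=400))
```

A large ratio between consecutive growth levels produces a large jump in concentration, which is the source of the discontinuities mentioned above.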


Figure 12 left: The growth curve (blue) of a bacterial colony on a tiny agar disk (3.4mm across) without the lag phase. The simulation follows the quantized version of the growth curve (red).
Figure 12 right: Quantized growth curve for a small agar disk 34mm across. Notice the much higher final population. It simulates growth on agar with few added nutrients as described in [2]. Mean relative quantization error: 4.9%

Results

The basic question that we want to answer at this point is if and when the bacterial colony will exhibit QS behavior on an agar plate: sooner or later than in a liquid culture with the same number of bacteria? We present 2 simulations that differ in scale. The first is a tiny agar plate measuring 3.4mm across and 0.551mm in depth, for a volume of 5μL (figure 13). The second is much larger in scope and computational effort: agar in a small petri dish 34mm in diameter and 5.51mm in depth. The larger scope allows direct experimental evaluation of the model, but to make it more computationally friendly the growth happens in a more spread-out fashion than normal and the colony appears to “walk” across the agar.
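The two volumes follow directly from the cylinder dimensions, with $V = \pi r^{2} h$ and the radius being half the stated diameter:

$$\pi (1.7\ \mathrm{mm})^{2} (0.551\ \mathrm{mm}) \approx 5.0\ \mathrm{mm^{3}} = 5\ \mu\mathrm{L}, \qquad \pi (17\ \mathrm{mm})^{2} (5.51\ \mathrm{mm}) \approx 5.0 \times 10^{3}\ \mathrm{mm^{3}} = 5\ \mathrm{mL}$$

The second simulation is thus a tenfold scale-up in every linear dimension and a thousandfold scale-up in volume.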

Progression of [AHL] dynamics on the tiny agar plate of simulation 1. Left: spatial distribution of [AHL] on the agar cylinder as it evolves in time. Same geometry as in figure 13 left (cylinder axis on the left, rim on the right, surface on top). During the first 6 sec (video time), the dynamics are dominated by the colony’s outwards growth. This expansion phase prevents AHL from reaching critical concentrations and delays QS. After the colony growth slows down, the spatial maximum in [AHL] stops moving and the QS transition quickly follows. Right top: total amount of AHL in the agar plate. Right bottom: maximum [AHL] from the left graph at each time step.

The tiny disk is inoculated with 7e4 bacteria, which gradually grow to a final size of 8.12e6 bacteria over 11 growth steps (video). Quorum sensing is indeed triggered early under these conditions, at 11 hours (figure 14).

The small agar plate (which, despite its name, is the large simulation!) is inoculated with 8.2e4 bacteria that grow over the course of 15 hours to become 5.1e8. They achieve quorum sensing at 20 hours, after filling a 5mL agar disk with 1.6pmol AHL (see video). This is a huge amount in comparison to the previous simulation.
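To put the 1.6pmol in perspective, it can be converted into a mean concentration over the whole agar disk:

$$\frac{1.6\ \mathrm{pmol}}{5\ \mathrm{mL}} = \frac{1.6 \times 10^{-12}\ \mathrm{mol}}{5 \times 10^{-3}\ \mathrm{L}} = 3.2 \times 10^{-10}\ \mathrm{M} = 0.32\ \mathrm{nM}$$

This is only the average over the plate. The concentration that the bacteria actually experience is the local one near the colony, which is higher than this mean for as long as AHL has not spread evenly through the agar.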

Conclusions

A comparison between the two simulations at 5mL volumes is telling. The growth curve is the same in both by design: same inoculations, same growth rates and same final populations. In the surface growth case however, as AHL diffuses slowly from the bacteria to the rest of the agar plate, it has the chance to achieve higher local concentrations. That’s why quorum sensing is achieved at 20 hours, versus 41 for the spatial equilibrium model.

This observation has significant repercussions on the applicability of our project. Bacteria that manage to colonize a solid surface seem to have a much higher probability of achieving quorum sensing under conditions of competitive growth, where their total population size will be limited by resource availability, such as in the colon. Of even greater practical importance, it shows that transfections to monolayers of Caco-2 cells in the lab, which we use as a cancer model, by quorum sensing bacteria have a reasonable chance of success.

The simulations were designed in this way to allow for easy experimental validation. Unfortunately, material shortages prevented the realization of the suggested experiments and the experimental validation of our model is deferred.


Figure 13 left: The geometry of the first simulation on the tiny agar disk with the bacterial colony fully grown. On the left is the disk’s axis and on the right is the rim. The bacteria sit on the disk’s surface.
Figure 13 right: Geometry for the second simulation. Notice the much larger agar disk. The large difference in scale makes the bacteria invisible at this zoom level. Their position can be inferred from the fine mesh around them.

Figure 14: Quorum sensing is achieved at 11 hours by an E. coli bacterium that is part of a colony growing on a tiny agar plate. This bacterium was created at 3 hours and lived until the end. Because of its central position, it is the first bacterium in the colony to transition into quorum sensing.

Figure 16: Close look at a bacterium from simulation 2. It is created at 9 hours and the levels of [AHL] around it are rising rapidly, because the colony has just expanded to its area. At 20 hours [DNA.(LuxR.AHL)2] begins to rise rapidly, signifying the quorum sensing transition. It is one of the first bacteria in the colony to transition into QS, thanks to its central position.

Progression of [AHL] dynamics on the small agar plate of simulation 2. Left: spatial distribution of [AHL] on the agar cylinder as it evolves in time. Same geometry as in figure 13 right (cylinder axis on the left, rim on the right, surface on top). During the first 10 sec (video time), the dynamics are dominated by the colony’s outwards growth. This expansion phase prevents AHL from reaching critical concentrations and delays QS. After the colony growth slows down, the spatial maximum in [AHL] stops moving and grows slowly. In this case, there are many more bacteria occupying a much larger area than in simulation 1. At 14 sec the quorum sensing transition begins, but the visualization appears quite different from that of simulation 1. One reason for this is that the larger area the bacteria occupy produces a phase difference between them, with those that are further away from the [AHL] maximum taking more time to reach QS. The transition is therefore more gradual. After 16 sec, most bacteria have transitioned and the rapid increase in AHL production is evident from the sharpening spatial distribution. Right top: total amount of AHL in the agar plate. Right bottom: maximum [AHL] from the left graph at each time step.

References

[1] Weber, M. & Buceta, J. (2013). Dynamics of the quorum sensing switch: stochastic and non-stationary effects. BMC Systems Biology, 7:6.
[2] Fujikawa, H. & Morozumi, S. (2005). Modeling surface growth of Escherichia coli on agar plates. Applied and Environmental Microbiology, Dec. 2005, Pages 7920–7926.
[3] Trovato, A., Seno, F., Zanardo, M., Alberghini, S., Tondello, A. & Squartini, A. (2014). Quorum vs. diffusion sensing: a quantitative analysis of the relevance of absorbing or reflecting boundaries. FEMS Microbiology Letters, Volume 352, Issue 2, Pages 198–203.
[4] Redfield, R. J. (2002). Is quorum sensing a side effect of diffusion sensing? Trends in Microbiology, Volume 10, Issue 8, Pages 365–370.
[5] West, S. A., Winzer, K., Gardner, A. & Diggle, S. P. (2012). Quorum sensing and the confusion about diffusion. Trends in Microbiology, Volume 20, Issue 12, Pages 586–594. https://doi.org/10.1016/j.tim.2012.09.004

pANDORRA Design Algorithm

“If you know yourself and you know your enemy, you shouldn’t fear the result of a thousand battles”
-Sun Tzu

If knowledge is power, then modeling is the key to a successful engineering endeavour. Our modeling of the biophysical characteristics of the classifier circuits that follows in this section is largely based upon the exciting work of research teams led by Professors Z. Xie and Y. Benenson [1, 2]. The latter’s newest paper [2] faces the same questions: it analyzes the modeling process and optimizes the classifier’s logical expression. Standing on their shoulders, we expand upon their work by analyzing architectural variations of the classifier circuits, which add an additional optimization axis.

Architecture of a cell-type classifier

pANDORRA is based on a family of layered genetic circuits that aim to biophysically implement logical expressions involving miRNA molecules in 2 basic roles: upregulated or downregulated (in the target cells with respect to the control). The family is characterized by a basic structure emanating from these two roles. The target gene (the system’s output) is controlled by a promoter and RNA interference (RNAi) from the downregulated miRNAs and forms the lower layer of the system. The upper layer comprises target sites for the upregulated miRNAs and an inhibitor for the lower layer.

This basic architecture can be significantly improved with a double-inversion module for the upper layer [1]. It takes the form of a middle layer with a promoter that is activated by the upper layer’s product and in turn produces the inhibitor for the lower layer. This extra part significantly improves the system’s efficiency.

But how do we define the system’s efficiency and how do we optimize its architecture, out of the many options left open by the previous definitions? And after finding the optimal architecture, how do we choose the optimal miRNA targets to separate particular cell groups? These are the basic questions we set out to answer by modeling the pANDORRA system.

Modeling classifiers

The main task of a classifier system, given a group of target and control cells, is to produce large amounts of output when inside a target cell and little output when inside a control cell. Ideally, it should function as a logical function: maximum output expression in all target cells and zero in all controls. We will therefore judge its efficiency by how close it is to the ideal. We define the fold change of a concrete classifier instance as the average output margin between the two groups.

The circuit’s layered design naturally guides the modeling effort in two paths: estimating each layer’s output and the effects this output has on the next layer. Basic elements of the model therefore would be relationships between the concentrations of the involved regulatory species and the outputs of the regulated elements.

Take the RNAi mechanism: let P_max be the unregulated (maximum) output concentration and C* the miRNA concentration that results in a 50% reduction of the output. Then the relationship between the output [P] and the input [miRNA] is:

[P] = P_{max} \frac{1}{1 + \frac{[miRNA]}{C^{*}}}

Next question: if P is regulated by 2 miRNAs together, how do they interact? Following the same logic as in [2], we assume that they behave additively. If each miRNA has a C_i* concentration of half repression and n is the number of miRNAs, the cumulative regulation is:

[P] = P_{max} \frac{1}{1 + \sum_{i=1}^{n} \frac{[miRNA_{i}]}{C_{i}^{*}}}

If there are multiple target sites for a particular miRNA we can assume that their effect on interference will also be additive, so if there are r_i target repeats for miRNA_i, the output will be:

[P] = P_{max} \frac{1}{1 + \sum_{i=1}^{n} r_{i} \frac{[miRNA_{i}]}{C_{i}^{*}}}

To simplify the following expressions, we can define the interference on a scale of 0 to 1 based on the above:

RNAi = \frac{1}{1 + \sum_{i=1}^{n} r_{i} \frac{[miRNA_{i}]}{C_{i}^{*}}}
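The additive interference term above can be sketched in a few lines of Python. This is only an illustration under the stated additivity assumption; the function name `rnai_factor` and the example concentrations are hypothetical and not taken from our code base.

```python
def rnai_factor(mirna_levels, half_repression, repeats=None):
    """Fraction of output remaining (between 0 and 1) under additive RNAi.

    mirna_levels:     concentration of each miRNA
    half_repression:  C* of each miRNA (concentration giving 50% repression)
    repeats:          number of target-site repeats per miRNA (defaults to 1 each)
    """
    if repeats is None:
        repeats = [1] * len(mirna_levels)
    load = sum(r * c / c_half
               for r, c, c_half in zip(repeats, mirna_levels, half_repression))
    return 1.0 / (1.0 + load)


# Hypothetical numbers: one miRNA present exactly at its C*.
single = rnai_factor([100.0], [100.0])         # one target site
doubled = rnai_factor([100.0], [100.0], [2])   # two tandem repeats
```

With one site the output is halved, and with two repeats it drops to one third, which is what the additive term r_i [miRNA_i] / C_i* predicts.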

These observations can provide the output of the classifier’s upper layer. A valid design is not limited to 1 OR gate, therefore the total output (let’s assume the output is rtTA) is the sum of each gate’s output:

[rtTA] = rtTA_{max} \sum_{g=1}^{m} RNAi_{g}

rtTA is an inducer of the pTRE promoter and its induction can be modelled on a scale of 0 to 1 with the use of the reaction’s dissociation constant K_d as follows:

A = \frac{[rtTA]}{K_{d} + [rtTA]}

The second layer can be regulated in 2 ways: by rtTA through its promoter and by RNA interference, like the other layers. The first regulation is transcriptional, whereas the second is post-transcriptional. Let’s assume the outputs of the second layer can be the protein LacI and the miRNA FF4. LacI will be affected by both regulatory mechanisms; in our particular implementation, though, FF4 will only be affected by the transcriptional regulation due to the FF4 being produced. Their levels would then be:

[LacI] = [LacI]_{max} \, A(rtTA) \, RNAi(\text{up\_miR})

[FF4] = [FF4]_{max} \, A(rtTA)
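As a rough sketch, the upper and middle layers can be chained as follows. The saturating form x / (K_d + x) used for the induction term A, all function names and all numerical values are assumptions chosen for illustration, not fitted parameters.

```python
def rnai(levels, half_conc):
    """Additive RNA interference: fraction of output remaining (0..1)."""
    return 1.0 / (1.0 + sum(c / h for c, h in zip(levels, half_conc)))


def induction(rtta, k_d):
    """Assumed saturating induction of pTRE by rtTA, on a 0..1 scale."""
    return rtta / (k_d + rtta)


def middle_layer(gates, up_levels, up_half, rtta_max, k_d, laci_max, ff4_max):
    """Return ([rtTA], [LacI], [FF4]) for one cell.

    gates: one (levels, half_conc) pair per OR gate of the upper layer.
    up_levels, up_half: miRNAs targeting the LacI transcript (hypothetical).
    """
    rtta = rtta_max * sum(rnai(levels, half) for levels, half in gates)
    a = induction(rtta, k_d)
    laci = laci_max * a * rnai(up_levels, up_half)  # transcriptional and RNAi
    ff4 = ff4_max * a                               # transcriptional only
    return rtta, laci, ff4


# Hypothetical cell: one gate whose miRNA is absent, so the gate is fully on.
rtta, laci, ff4 = middle_layer(
    gates=[([0.0], [1.0])], up_levels=[3.0], up_half=[1.0],
    rtta_max=100.0, k_d=100.0, laci_max=10.0, ff4_max=8.0)
```

In this toy case rtTA sits exactly at K_d, so the promoter is half induced; LacI is further reduced by interference while FF4 is not.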

The final layer’s output applies the exact same logic:

[Output] = [Output]_{max} \, RNAi(FF4, \text{down\_miR}) \, A(LacI)

This biophysical model expresses an input-output relationship between the miR inputs and the classifier output. Given an miRNA expression dataset, we can calculate the expected output for each sample and evaluate the performance of a concrete classifier.

Classifier design process

Circuit optimization algorithm

pANDORRA’s goal is to discriminate 2 groups of cells. If there is an miRNA expression dataset for these cells, we need to find the optimal circuit inputs that maximize output in cancer cells and minimize it in control cells. Thanks to the evaluation function derived above, if we have a way to generate candidate circuits, we can employ a search method to optimize the logical circuit.

The search space is huge and looking at every possible combination of circuit inputs is impractical; the search is guided instead by a genetic algorithm. The search begins at the simplest circuits: those with a single up- or downregulated miRNA. At each search step, the pool of candidate circuits is evaluated and only the best-performing ones are kept. These are then used as a seed to propose new circuits by randomly recombining them or by adding single miRNAs to them. This process is repeated until the best-performing circuit hasn’t been dethroned for enough steps [2]. The parallels between this process and natural selection and genetic recombination are obvious, justifying the name “genetic algorithm”.
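The loop described above can be sketched as follows. This is a toy illustration and not the implementation of [2] or our own code: circuits are reduced to plain sets of signed inputs, and `toy_score`, `TARGET`, the miRNA names, the pool size and the patience are all hypothetical stand-ins for the model-based evaluation function and its settings.

```python
import random


def genetic_search(inputs, score, pool_size=8, patience=50, seed=0):
    """Toy genetic search over circuits, each a frozenset of signed inputs."""
    rng = random.Random(seed)
    pool = [frozenset([name]) for name in inputs]  # single-input circuits
    best, best_score, stale = None, float("-inf"), 0
    while stale < patience:
        # Evaluate the pool (duplicates removed) and keep the top performers.
        pool = sorted(dict.fromkeys(pool), key=score, reverse=True)[:pool_size]
        top_score = score(pool[0])
        if top_score > best_score:
            best, best_score, stale = pool[0], top_score, 0
        else:
            stale += 1  # the current best was not dethroned this step
        children = []
        for _ in range(pool_size):
            a, b = rng.choice(pool), rng.choice(pool)
            children.append(a | b)                     # recombine two circuits
            children.append(a | {rng.choice(inputs)})  # add a single miRNA
        pool = pool + children
    return best, best_score


# Hypothetical stand-in for the biophysical evaluation function:
# reward inputs that belong to a hidden "ideal" circuit, penalize the rest.
TARGET = frozenset({"miR-A", "miR-B", "~miR-C"})


def toy_score(circuit):
    return len(circuit & TARGET) - len(circuit - TARGET)


INPUTS = ["miR-A", "miR-B", "~miR-C", "miR-D", "~miR-E", "miR-F"]
best, best_score = genetic_search(INPUTS, toy_score)
```

In a real run the score would be the fold change computed from the expression dataset, and the stopping patience trades run time against the risk of stopping at a local optimum.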

Optimizing the logical circuit therefore requires selecting an architecture first (to have a concrete evaluation function). However, if there are many candidate architectures it can’t generally be known a priori which one will perform best on a particular dataset. Different architectures can be sensitive to different levels of miRNA input - imagine using a particularly strong repressor in the middle layer: low concentrations would be sufficient to completely knock out the output. The only way this classifier would produce output is if the levels of the upregulated miRNAs are exceedingly high on an absolute scale. This architecture wouldn’t work well to recognize cancer cells whose upregulated miRNAs are expressed in lower levels. Therefore, deciding on an architecture is also an optimization problem that should be solved for a specific dataset.

Contrary to the huge search space in circuit optimization, the biologically realizable architectures are few and each can be examined closely. To optimize across the architectural axis then, one need only find the best classifier circuit given the current architecture’s evaluation function. In the following section, we examine the various architectures in more detail.

Classifier Architectures

The first proposed architecture is straightforward: it accommodates upregulated miR on the top layer, downregulated miR directly on the output layer and couples the upregulated ones with 2 parallel mechanisms in the middle layer: transcriptional repression of the output layer by LacI and post-transcriptional inhibition by FF4.

The second architecture removes the parallel inhibition mechanism by FF4 and loosens the coupling of low levels of upregulated miR to the output. At the same time, by adding upregulated miR targets at the middle layer, it increases the coupling of high levels of upregulated miR to the output.

By removing upregulated miR targets on the upper layer, the third architecture allows greater LacI leakage when the upregulated miR are high.

The fourth architecture is very similar to the third, only simpler to implement.

The fifth architecture again combines the complementary couplings provided by LacI and FF4.

By reestablishing the upper layer, we get architecture six.

And by removing LacI we get architecture seven.

Transcriptional vs Post-transcriptional Repression

The basic variable element among these architectures is the method of double inversion. The two middle layer coupling mechanisms we explore, LacI and FF4, are used individually or combined. To assess the significance of this decision, we explore the input-output relationship for the corresponding architectures by scanning a pair of input upregulated miRNAs connected together in the same OR gate along the biologically plausible range they can cover.

There is a stark difference in their behaviour. LacI maps low counts of both miRNAs to relatively low output levels (OUT_max = 10000) and high counts of either one to high output levels, thus implementing the OR gate. Its transition occurs very rapidly at relatively low miRNA counts, while its on/off ratio is 20. FF4, on the other hand, is even more effective at mapping low levels of both miRNAs to low output levels and high levels of both to high output, but maps high levels of only one of them to middle output, therefore implementing an arithmetic AND gate more than a logical OR gate. Its transition is much more gradual than LacI’s, leaving a large zone of uncertain output, while its overall on/off ratio is 15. It also appears that FF4’s presence masks the effect of LacI.

AND gates are implemented more successfully, at least for upregulated miRNAs. Although the overall on/off ratio remains the same, the output is more consistent with a logical expression when one miRNA is high and the other is low.

Overall this analysis shows that LacI is more appropriate to implement logic gates. However, it is not a silver bullet. Its transition occurs at a point that might be inappropriate for some datasets, leaving even the miRNAs that are “low” on the positive side of the output.

miRNA Target Tandem Repeats

An interesting possibility is the adaptation of the number of target sites where each miRNA can attach to cause interference. It has been shown by [3] that there is little interaction between multiple miRNA and RISC complexes attaching next to each other, allowing us to model the overall dynamics additively. This assumption is challenged when there are low copy numbers of the participating molecules, but let’s follow it as a working hypothesis. Additive RNAi dynamics implies that, given an interference strength x by some miRNA for 1 repeat of the target site, if there are multiple repeats the interference becomes n*x.

To examine the effect of target site repeats, we’ll use an example dataset and circuit on the previous architectures.

Our Caco-2 and healthy dataset was obtained from the work by Cummins et al., 2006 [4]. The miRNA expression data regarding the HEK-293 and A549 cell lines were retrieved from the mammalian microRNA expression atlas [5].

Architecture & Circuit Optimization

We use the methods and insights derived so far to implement classifiers for our lab experiments. The experimental goal is to prove the ability of an optimized classifier to target cells from the Caco-2 cell line versus a healthy tissue control. Due to practical considerations, the controls on the actual experiments were the HEK-293 and A549 cell lines.

Running the circuit optimization algorithm on a dataset comprising Caco-2 and healthy samples [4] for the various architectures suggests the following circuits for each:

- where “cMargin” is the logarithm of the fold change

Apparently the best architectures are indeed the ones that use LacI in the middle layer and don’t use FF4 (which overrules LacI’s influence).

The LacI architectures are optimal with a very small number of inputs. The others have more inputs, but due to the way the optimization works, many of those inputs might be of only marginal significance. It is then in our best interest from a designer's perspective to attempt to simplify the proposed circuits - a process called “pruning”. Taking cues from LacI, we attempt to prune the other architectures to the same small circuit, which is a subset of all the larger circuits.

Experiments are expensive and models are useful lies; acknowledging this statement, it is also useful to put the results of our model in a broader perspective provided by the literature and increase our chances of a successful experiment. miR-21 was included due to its exceptionally high expression level in most cancer cell lines but not in the majority of healthy and immortalized tissues [2]. While examining miR-373, we notice it shares the same seed sequence with miR-372. The optimization algorithm takes this similarity into account at a preprocessing step, adding the expressions of miR-372 and miR-373 together and reporting only miR-372.

Putting all these elements together leads us to propose the following circuit for the wet lab experiments:

Architecture: 2 (LacI + miR targets at middle layer)
Circuit:
hsa-mir-372 OR hsa-mir-373
AND
hsa-mir-21
AND
~hsa-mir-143
AND
~hsa-mir-145
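Read as an ideal logical function, the proposed circuit can be written down directly. The sketch below only illustrates the Boolean expression; the function name and the two example expression profiles are hypothetical and are not taken from the dataset.

```python
def ideal_circuit(high):
    """Ideal Boolean output; `high` maps miRNA name -> True if highly expressed."""
    return ((high["hsa-mir-372"] or high["hsa-mir-373"])
            and high["hsa-mir-21"]
            and not high["hsa-mir-143"]
            and not high["hsa-mir-145"])


# Hypothetical profiles, for illustration only.
target_like = {"hsa-mir-372": True, "hsa-mir-373": False, "hsa-mir-21": True,
               "hsa-mir-143": False, "hsa-mir-145": False}
control_like = {"hsa-mir-372": False, "hsa-mir-373": False, "hsa-mir-21": True,
                "hsa-mir-143": True, "hsa-mir-145": True}

target_output = ideal_circuit(target_like)
control_output = ideal_circuit(control_like)
```

The biophysical model replaces each of these sharp Boolean terms with a graded response, which is why the fold change depends on the chosen architecture.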

Re-evaluation of the proposed circuit

Although we’ve determined architecture 2 to be optimal, we reevaluate the proposed circuit across all the architectures. The resulting logarithm of the fold change is:

arch1	arch2	arch3	arch4	arch5	arch6	arch7
1.45	2.69	2.64	2.65	0.90	1.67	0.77

Architecture 2 retains the same high fold change with the proposed circuit.

Implementation problems and fallback

We designed experiments to implement this final classifier circuit in the lab. The promoter bearing LacO operator sites, CAGop, was ordered to be synthesized, but unfortunately synthesis proved very difficult due to the high GC content and the construct arrived on the 31st of October. Necessity forced us to fall back to an alternate architecture for which we already had the Parts: architecture 7 (FF4). Thanks to the way we had structured the circuit elements, we also had some leeway in choosing which miRNA targets to actually include: miR-21, miR-372 OR miR-373, or both. We reevaluate these 3 options for architecture 7 and also compute the efficiency of the circuits on datasets with HEK-293 and A549 as controls, apart from healthy tissue samples.

Expected fold change logarithm for healthy samples:

mir21 & ~mir143 & ~mir145	0.835
(mir372 | mir373) & ~mir143 & ~mir145	0.881
mir21 & (mir372 | mir373) & ~mir143 & ~mir145	0.766

Expected fold change logarithm for the HEK-293 cell line:

mir21 & ~mir143 & ~mir145	1.572
(mir372 | mir373) & ~mir143 & ~mir145	1.038
mir21 & (mir372 | mir373) & ~mir143 & ~mir145	1.054

Expected fold change logarithm for the A549 cell line:

mir21 & ~mir143 & ~mir145	0.620
(mir372 | mir373) & ~mir143 & ~mir145	1.138
mir21 & (mir372 | mir373) & ~mir143 & ~mir145	1.068

References

[1] Xie, Z., Wroblewska, L., Prochazka, L., Weiss, R., & Benenson, Y. (2011). Multi-input RNAi-based logic circuit for identification of specific cancer cells. Science, 333(6047), 1307-1311.

[2] Mohammadi, P., Beerenwinkel, N., & Benenson, Y. (2017). Automated Design of Synthetic Cell Classifier Circuits Using a Two-Step Optimization Strategy. Cell Systems, 4(2), 207-218.

[3] Schmitz, U., Lai, X., Winter, F., Wolkenhauer, O., Vera, J., & Gupta, S. K. (2014). Cooperative gene regulation by microRNA pairs and their identification using a computational workflow. Nucleic acids research, 42(12), 7539-7552.

[4] Cummins, J. M., He, Y., Leary, R. J., Pagliarini, R., Diaz, L. A., Sjoblom, T., ... & Raymond, C. K. (2006). The colorectal microRNAome. Proceedings of the National Academy of Sciences of the United States of America, 103(10), 3687-3692.

[5] Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., ... & Lin, C. (2007). A mammalian microRNA expression atlas based on small RNA library sequencing. Cell, 129(7), 1401-1414.

Protein Structure Prediction

Proteins consist of amino acid chains that fold in 3-dimensional space in ways we do not yet completely comprehend. Each protein chain spans from a handful of amino acids to more than a thousand residues, rendering the problem of determining the relationship between primary and tertiary structure immensely complex.

In our project, we employed a mutated fimH modified by adding the RPMrel peptide in order to achieve selective adhesion of our E. coli strains on cancer cells. In particular, the fimH sequence was altered by substituting 1 residue (Proline to Glutamine) and inserting 28 new residues (RPMrel, SpyTag, HisTag). We would like to explore how these changes affect the 3-dimensional structure of the fimH protein.

To this end, we embarked on a journey to create an artificial neural network model that, given a protein's primary structure, can yield sufficient information to reconstruct the full 3-dimensional structure.

The idea is that our model can provide insight as to how the tertiary structure changes after the modifications, in that we can compare the native structure of the protein with the structure of the protein's modified sequence.

Moreover, such a model would be invaluable as it could predict structures de novo for proteins for which we do not yet have structural information. Such a protein is Apoptin, the toxin we used as our classifier's output in order to induce selective cell death.

Therefore, our motivation is 2-fold:

to study how a protein's structure changes after altering its primary sequence profile (in our case FimH)
to predict the de novo tertiary protein structures based solely on primary structure information (in our case Apoptin)

What we propose is an end-to-end pipeline that solely requires the amino acid sequence of the protein (primary structure) as input, predicts its corresponding secondary structure & contact map and finally recreates and visualizes the full 3-dimensional tertiary structure.

In order to unravel the protein sequence-to-structure mystery one inevitably needs to summon the power of neural networks due to the aforementioned complexity and data volume. To fully exploit the generalization power of neural networks, we need to design a concise yet unabridged representation for our data, the protein sequences and their corresponding tertiary structures.

To make use of the primary and secondary structure sequence information, we map the one-letter Amino acid codes to integers in the range 1-20 and we use 0 to signify unknown residues. We now have a vocabulary of 21 words, each of which we project to a high-dimensional vector through the use of word embeddings [1], a technique widely used in natural language processing. Similarly, regarding the secondary structure, we have a vocabulary of 8 words that is mapped to high-dimensional vectors through embedding.
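The integer encoding can be sketched as below. The alphabetical ordering of the 20 standard residues and the names `AMINO_ACIDS`, `AA_TO_INT` and `encode_sequence` are assumptions for illustration; any fixed ordering works as long as 0 is reserved for unknown residues.

```python
# 20 standard residues; the alphabetical ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: index + 1 for index, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Map one-letter codes to integers 1-20, with 0 for unknown residues."""
    return [AA_TO_INT.get(residue, 0) for residue in sequence.upper()]


encoded = encode_sequence("ACXY")  # "X" is not a standard residue
```

The resulting integers are the indices that the embedding layer looks up to obtain one vector per residue.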

A common representation of a protein's tertiary structure is the contact map. This N x N matrix (N the number of residues in the chain) encompasses all the required information to reconstruct the 3-dimensional structure. Every (x,y) point on the map is interpreted as a contact probability between the residues x and y.

The fimH contact map with distance cutoff 9 Å recreated from chain A of PDB id 2VCO

We can see that the majority of contacts span around the diagonal and that is only natural as it means that residues which are at a closer distance have a higher probability of forming a contact than distant ones. In general, secondary structures form patterns on the contact map, e.g. helices are adjacent to the diagonal, and provide a foundation for the prediction of the most difficult long-range contacts. That being said, we first need to obtain information about the secondary structure and subsequently use that to assist in contact prediction.

Since we have access to ground truth for every protein sequence in our dataset, we can employ supervised learning approaches. Secondary structure prediction is a classification problem of 8 classes, where we need to predict the secondary structure class of every residue, while providing context information for residues of the same sequence. Contact map prediction can be interpreted as a regression problem, where one needs to predict the contact probabilities for every pair of residues in the chain.

The first step, given the protein's primary structure is to make an educated guess about its secondary structure. In 2017, we don't even need to do that. We can have machine learning models do that for us.

Although secondary structure provides valuable information for the local structural entities along the chain, it is not enough on its own to produce the contact map. Therefore, we can interpret this step, in machine learning terminology, as an internal representation of the input that is passed on deeper in the network to assist contact prediction. However, since secondary structure has a semantic meaning of its own in the eyes of biologists, we treat the two models separately, as is standard practice in the literature.

Our Secondary Structure Predictor consists of a wide bidirectional recurrent layer, followed by a number of fully connected layers and the output layer. The model trains on primary sequence samples and predicts the secondary structure class for every Amino acid.

In a nutshell, the sequence passes through the recurrent layer and the model “remembers” what has already happened in the sequence -in terms of secondary structures- to provide context for every subsequent residue that it sees. The bigger the size of the recurrent layer the greater the memory capacity of the network and, in turn, its predictive power.

The additional layers that follow progressively reduce the representation dimension until we are left with a 1-hot vector (1 x 8) that indicates the most prevalent class choice. In the end, the full predicted sequence corresponds to all secondary structure segments.

Top Row: primary structure sequence of one-letter Amino acid codes

Bottom Row: secondary structure sequence of one-letter class of the 8 classes

Our end goal is to predict contacts between residues in a pairwise fashion. For a sequence of N Amino acids we want to derive a N x N map, where each position shows the probability of the two residues being in contact.

At first, we need to build a N x N map, hereafter referred to as “tensor”, that will serve as the input layer to our network and will be updated after each epoch as the network learns to solve the regression task. The tensor captures the semantic relationship between residues and constantly changes as training progresses.

In the previous section, we mentioned how we use embeddings to map our single-letter Amino acid codes to high-dimensional vectors. That is, for a primary structure sequence of N residues, we end up with a N x m matrix, where m is the size of the embedding.

In order to exploit the additional information we have for every Amino acid about its secondary structure classification, we employ, once again, word embeddings of size k that we stack to our previous m-dimensional vector. Now, every Amino acid is encoded as a concatenation of the two embeddings with a resulting vector of size m + k.

The secondary structure information may be incorporated directly as a result of our Secondary Structure Predictor model or - for proteins with known structures- can be provided directly as additional input. However, the latter can only be of use for model evaluation as, in practice, we are more interested to predict structures de novo.

The next step is to create a representation for every pair of Amino acids in order to create the full N x N tensor. For a pair (x,y) of Amino acids we concatenate their respective embeddings thus creating a new vector of size 2 * (m + k). As an alternative, we can perform an element-wise operation (e.g. multiplication) between the two distinct residue embeddings and maintain the resulting pair embedding dimension to m + k.
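A minimal sketch of this pairing step is shown below, using plain Python lists so that the shapes are easy to follow. The function name `pair_tensor`, the `mode` argument and the tiny example vectors are hypothetical; in practice the per-residue vectors would come from the two embeddings.

```python
def pair_tensor(residue_vectors, mode="concat"):
    """Build an N x N grid holding one feature vector per residue pair.

    residue_vectors: N vectors of size m + k (sequence and structure embeddings).
    mode: "concat" gives pair vectors of size 2 * (m + k),
          "multiply" gives element-wise products of size m + k.
    """
    tensor = []
    for vx in residue_vectors:
        row = []
        for vy in residue_vectors:
            if mode == "concat":
                row.append(list(vx) + list(vy))
            elif mode == "multiply":
                row.append([a * b for a, b in zip(vx, vy)])
            else:
                raise ValueError("mode must be 'concat' or 'multiply'")
        tensor.append(row)
    return tensor


# Hypothetical embeddings for N = 3 residues with m + k = 2.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
concat = pair_tensor(vectors, "concat")
product = pair_tensor(vectors, "multiply")
```

Concatenation keeps the order of the pair (the entry for (x, y) differs from the one for (y, x)), whereas the element-wise product is symmetric and half the size.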

Now that we have constructed our tensor, we add several 2-dimensional convolutional layers to scan every Amino acid pair on our map for contacts. When training is complete, the resulting map is compared to the native contact map by specifying a probability threshold that distinguishes contacts from non-contacts.

The final step of the process is to retrieve the 3D structure from the predicted contact map. To achieve that, we feed the contact map to FT-COMAR [2], a tool that is able to recreate the tertiary structure by reading a contact map with values 1 (contact), 0 (no contact), -1 (uncertain). In our implementation, a contact is declared uncertain if the predicted probability is between 0.3 and 0.6. The resulting 3-dimensional information is written to a file that we can load into iCn3D to visualize and interact with the reconstructed structure.
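The conversion from predicted probabilities to the three-valued map can be sketched as follows. Treating the interval boundaries 0.3 and 0.6 as uncertain, and the function name `to_ternary`, are assumptions made for this illustration; writing the map to a file in the format the reconstruction tool expects is not shown.

```python
def to_ternary(prob_map, low=0.3, high=0.6):
    """Convert contact probabilities to 1 (contact), 0 (no contact), -1 (uncertain)."""
    ternary = []
    for row in prob_map:
        new_row = []
        for p in row:
            if p > high:
                new_row.append(1)
            elif p < low:
                new_row.append(0)
            else:
                new_row.append(-1)  # between low and high, boundaries included
        ternary.append(new_row)
    return ternary


# Hypothetical 2 x 2 excerpt of a predicted map.
labels = to_ternary([[0.9, 0.45], [0.1, 0.6]])
```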

Due to limited access to computational resources, we were not able to train models of adequate performance to assess the changes in FimH tertiary structure or visualize the estimated structure of Apoptin in a reliable way. Although the models' performance was monotonically improving, we could not reach convergence before the wiki freeze, as we were restricted to shallow architectures and small mini-batch sizes.

For example, for the Secondary Structure Predictor we obtained a 27% Q8 accuracy using a relatively shallow 2-layer architecture. Q8 accuracy measures the percent of residues for which 8-state secondary structure is correctly predicted. For the Contact Map Predictor, we set a probability threshold of 0.45 to determine whether two residues are in contact and the model correctly predicted 43% of all contacts.
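For concreteness, Q8 accuracy can be computed as in the sketch below. The function name and the two short label strings are made up for illustration and are not model output.

```python
def q8_accuracy(predicted, native):
    """Percentage of residues whose 8-state class is predicted correctly."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(p == t for p, t in zip(predicted, native))
    return 100.0 * matches / len(native)


# Hypothetical 8-residue example using one-letter class codes.
accuracy = q8_accuracy("HHHEECCC", "HHHEETTC")
```

Six of the eight residues agree in this example, giving 75%.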

We will continue to work towards improving the model performances as well as attempt different approaches to solve the contact map.

However, for the sake of completeness, we employed a publicly available tool that offers similar functionality but a different approach. RaptorX-Contact-Predict is a web server tool that predicts the contact map and provides direct visualization of the resulting tertiary structure through JSmol.

Contact Map comparison of fimH wild type and our modified fimH. Black dots signify common contacts that exist in both structures, green dots the unique contacts in wild type fimH and the magenta dots the unique contacts that exist in our modified fimH sequence. The locations where the RPMrel and tags were inserted are evident in the contact map in places where there exist only magenta dots along the diagonal.

The top left structure corresponds to wild type fimH, the bottom left to our modified fimH. To the right, we see the two structures superimposed, with wild type fimH shown in grey.

Data

Our dataset is a set of 10932 proteins from the PDB database, that were selected based on certain criteria using the PISCES server. Specifically, the percentage identity cutoff is 60%, the resolution cutoff is 1.8 angstroms, and the R-factor cutoff is 0.25.

We observed that 9754 out of 10932 proteins have some missing residues, for which there is no structural information available. In all such cases, the missing residues were discarded and the sequence truncated. 21 proteins were excluded from the dataset as there was no available secondary structure information whatsoever.

We split the dataset of 10911 remaining proteins into 8728 for training and 2183 for validation. In addition, we tested our Secondary Structure Predictor on the benchmark dataset CB513 in order to compare our approach with existing methods.

Labels

For the Secondary Structure Predictor the required label for every Amino acid in the sequence is the corresponding secondary structure information. Every Amino acid belongs to one of the 8 classes [3]. Hence, every protein sequence is paired with its label, a sequence of equal length consisting of the class id for every Amino acid.

To derive secondary structure labels for our dataset, we employed STRIDE, an algorithm designed for the assignment of protein secondary structure elements given the atomic coordinates of the protein, as defined by X-ray crystallography, protein NMR, or another protein structure determination method [4].

For the Contact Map Predictor the required label is the native contact map. To obtain the contact map we need the structural information provided in the PDB file of every protein sample, that is, the atomic coordinates of every Amino acid. The contact map is constructed by computing the Cb distance for every pair of Amino acids in the chain; if that distance satisfies a specified threshold [5], the pair is considered to be in contact.

To extract the native contact map for every PDB file, we implemented a group of helper functions that allow for the generation of contact maps of different distance cutoffs and distance types, namely Ca, Cb and Ca + Cb.
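A stripped-down version of such a helper might look like the sketch below. It assumes the relevant atom coordinates (for example Cb) have already been parsed from the PDB file into one (x, y, z) tuple per residue; the function name, the default 9 Å cutoff and the example coordinates are illustrative assumptions rather than our actual helper code.

```python
import math


def contact_map(coords, cutoff=9.0):
    """Binary N x N contact map from one (x, y, z) coordinate per residue.

    Two residues are in contact when their distance is at most `cutoff`
    (in angstroms). The diagonal is trivially 1.
    """
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical coordinates for three residues along a line.
example = contact_map([(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Changing `cutoff`, or passing Ca instead of Cb coordinates, gives the other map variants mentioned above.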

Model Architecture

The Secondary Structure Predictor model consists of an embedding layer that maps the one hot Amino acid representations to 100-dimensional vectors, a bidirectional LSTM layer of 180 cells and a fully connected layer of 140 neurons with linear activation function. Finally, we have the output softmax layer of 8 neurons that points to the most probable class choice. The training loss is calculated according to the formula of categorical cross entropy.
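For reference, the standard definitions of the softmax output and of the categorical cross-entropy loss are reproduced here. The symbols z (pre-activation of the output layer), p (predicted class probability), y (one-hot label) and N (number of residues) are notation introduced for this note, and averaging over residues is a common convention that we assume rather than a detail confirmed for our training setup.

```latex
p_{t,c} = \frac{e^{z_{t,c}}}{\sum_{c'=1}^{8} e^{z_{t,c'}}},
\qquad
L = -\frac{1}{N} \sum_{t=1}^{N} \sum_{c=1}^{8} y_{t,c} \, \log p_{t,c}
```

Here t runs over the residues of a sequence and c over the 8 secondary structure classes, so only the probability assigned to the true class of each residue contributes to the loss.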

The Contact Map Predictor model has a tensor input that is fed to six 2D convolutional layers. After each convolutional layer we apply batch normalization to prevent exploding gradients. The output layer that computes the probability of contact has a linear activation function.

Frameworks

The models were developed in Python using the Theano and Keras frameworks.

References

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[2] Vassura, M., Margara, L., Di Lena, P., Medri, F., Fariselli, P., & Casadio, R. (2008). FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10), 1313-1315.
[3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
[4] Frishman, D., & Argos, P. (1995). Knowledge‐based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566-579.
[5] Duarte, J. M., Sathyapriya, R., Stehr, H., Filippis, I., & Lappe, M. (2010). Optimal contact definition for reconstruction of contact maps. BMC bioinformatics, 11(1), 283.

	The mesh generated by Comsol Multiphysics is a physics-controlled mesh with finer elements. The tetrahedral space discretization is visible on the mesh plot. We present here two mesh plots; the first shows the tetrahedral meshing of the intestinal geometry and the second one depicts the interior region where the tumor meshing is visible.