Difference between revisions of "Team:William and Mary/Model"

Revision as of 02:16, 2 November 2017

Abstract

By constructing a mechanistic model of our degradation system, we were able to rigorously analyze our timeseries datasets using Bayesian Parameter Inference and obtain parameter distributions. We then used this analysis to feed back into our predictive saturation model, which will inform future teams on how to best use our inducible protease and collection of PDTs.

Overview

In our results, we demonstrated a modular system of protein degradation tags (PDTs) that confer changes in the speed of a genetic circuit. However, all of these measurements were performed in a system containing only one type of tagged protein. As future iGEM teams move towards designing larger and more complex circuits, our initial measurements of speed control in a single gene could potentially become less relevant. This is because all pdts within a circuit share the same protease, so tagging more and more proteins with pdts will cause the degradation burden placed on the protease to increase. This effect is called loading, and in the case where the concentration of tagged protein is higher than the maximal effective degradation rate of the Lon in the system, we say that the proteases are saturated (Fig 1). In a circuit operating in a saturated regime, tagging an additional protein with a pdt will confer a heavily reduced effect on its speed compared to performing the same tagging in an unsaturated regime.

Figure 1: As the concentration of tagged protein (green) increases for a given concentration of Lon (blue), the tagged protein will eventually reach a sufficiently high concentration that all of the Lon proteins in the cell are operating at their maximal effective rate (left, middle panel). By increasing the concentration of tagged protein past this point, the effective degradation strength of the Lon will decrease, as the Lon is already degrading as many proteins as it can (right panel). In this case we say that the protease is saturated. Measurements of degradation rate or speed control that were obtained in an unsaturated scenario will not necessarily apply directly to a saturated scenario. Images adapted from [1].

This presents an issue for teams who wish to make use of our existing characterization of the pdt parts in a larger circuit context, as the presence of loading and saturation will prevent them from being able to directly apply our measurements into their system. In order to address this issue, we decided to develop a mathematical model of protease saturation to investigate how loading effects influence the magnitude of the speed control provided by our pdt system.

Figure 2. Our ODE model of protease saturation. Load is represented as a generic protein (F) that competes with RFP (R) for degradation by Lon (L). Tuning the kinetic parameters associated with F, relative to the parameters associated with R and L, represents the amount of load in the circuit. We explicitly account for the bound Lon-protein complexes (C) as species in the model so that the sequestration effects which drive loading and saturation can be directly interrogated. Please see the technical modeling report [[HYPERLINK TO WUKUN AND CEDAR’S PDF]] for more detailed information about our model’s derivation and analysis.

Saturation models for protease activity which are very similar to our own have previously been developed and studied extensively in the literature [2]. However, our investigation is innovative in its specific focus on the effect of saturation on protease-driven speed control, which is an area which lacks extensive grounding in the literature. Using our model, we aimed to develop the beginnings of a theoretical framework for understanding how speed control effects change in magnitude as a function of protease load.

One of the more difficult aspects of using kinetic models like the one in Figure 2 to describe genetic processes is that these expressions require many parameter values, some of which cannot easily be determined from the literature. Even with parameters for which literature values exist, there are often a large number of implicit assumptions involved in the process of plugging in such a parameter into an ODE model. In order to be as confident as possible that the insights that we draw from our modeling investigation would be directly relevant to future teams, we decided to perform parameter estimation on our datasets.

Popular methods for parameter estimation, such as linear and non-linear regression, return point estimates for parameter values that potentially obscure rich information that relies on the correlations between distributions of parameters. However, in aiming to develop a groundwork for a theoretical understanding of speed control, we felt that we could not afford to lose information which could potentially lead to valuable insights. Therefore, we decided to perform a Bayesian parameter estimation to determine our model.

Bayes’ theorem can used to turn a mathematical model (which is converted into a statistical model) into a probability distribution for your parameters, known as the posterior distribution [3]. The posterior distribution defines how probable it is that a parameter takes a specific value based on the data observed and any prior information available about the parameter distributions (previous iGEM characterization, published literature, etc.). This method has a clear advantage over regression and maximum likelihood estimation approaches, as it allows you to estimate the entire probability distributions for parameter values instead of a single point estimate. In addition, you can visualize all covariation across parameters by plotting their joint parameter distributions, which leads to additional insights into your experimental system.

To estimate the posterior distributions of our parameters we use Markov chain Monte Carlo (MCMC). Markov chain Monte Carlo approaches rely on random walkers that traverse parameter space, taking each step with a probability proportional to the likelihood that those parameters explain your data. As the walkers take steps across the parameter space they sample the underlying posterior distribution, and their relative abundance across parameter space is used to measure how probable it is for a given set of to explain your data. The specific implementation used for MCMC has large implications for how long the sampler has to run in order to obtain good estimates of the posterior distribution for your parameters. When looking for the best implementation to use for performing our parameter estimation we wanted an open-source, well-documented package that had previously been used to successfully identify parameters from biological models. Emcee (commonly known as “the MCMC hammer”) is a Python package developed for efficient MCMC parameter estimation [4]. Emcee is an affine-invariant MCMC method that is extremely fast, flexible, and incredibly well-documented which lowers the barrier to entry for future iGEM teams allowing them to also move into Bayesian parameter estimation.

A necessary component of any MCMC algorithm is the ability of the sampler (Emcee) to assess the likelihood of a given set of parameters given the data you observed. An extremely fast way to evaluate the parameters is by using an analytical solution to your system of ODEs, however quite frequently these can’t be found or are extremely difficult and time-consuming to identify. Another more general approach is to use the parameters at a given point to simulate your data using your underlying ODE model and then check to see how far off your simulated data is from the ground truth. We use a least-square error model to calculate the error between our simulated trajectories and the ground truth reference.

Given that we have a method of randomly sampling parameter values (Emcee), and a penalty function (LSE), we now need an efficient method to simulate our ODE models. Given that we are using Emcee, we were constrained to Python ODE solvers. We also wanted to choose an extremely fast and flexible modeling framework for rapid integration across different possible models. We selected to use the open-source BioSCRAPE package for simulation [5]. BioSCRAPE has been used to successfully identify parameters from biological datasets, supports SBML input files, can simulate deterministic and stochastic models, and is easy to install and use.

Figure 3: Kinetic diagram of a simple RFP protein production and decay model.

Overview P9

Overview P10

Protease Saturation Model

Protease 1

@@ Line 37: / Line 37: @@
 <div style='padding-top: 0px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Abstract P
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >By constructing a mechanistic model of our degradation system, we were able to rigorously analyze our timeseries datasets using Bayesian Parameter Inference and obtain parameter distributions. We then used this analysis to feed back into our predictive saturation model, which will inform future teams on how to best use our inducible protease and collection of PDTs.
 </div>
 <div style="padding-top: 40px;"></div>
@@ Line 49: / Line 49: @@
 <div style='padding-top: 0px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P1
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >In our <a href='https://2017.igem.org/Team:William_and_Mary/Results' style='text-decoration: underline;'><u>results</u></a>, we demonstrated a modular system of protein degradation tags (PDTs) that confer changes in the speed of a genetic circuit. However, all of these measurements were performed in a system containing only one type of tagged protein. As future iGEM teams move towards designing larger and more complex circuits, our initial measurements of speed control in a single gene could potentially become less relevant. This is because all pdts within a circuit share the same protease, so tagging more and more proteins with pdts will cause the degradation burden placed on the protease to increase. This effect is called loading, and in the case where the concentration of tagged protein is higher than the maximal effective degradation rate of the Lon in the system, we say that the proteases are saturated (Fig 1). In a circuit operating in a saturated regime, tagging an additional protein with a pdt will confer a heavily reduced effect on its speed compared to performing the same tagging in an unsaturated regime.
 </div>
-<div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P2
+<div style='padding-top: 40px;'></div>
+<center>
+<figure style='padding-left: px;'>
+<img src='https://static.igem.org/mediawiki/2017/2/2f/T--William_and_Mary--mcmc-sat1.png' width = "50%"/>
+<figcaption><div style='padding-left: 20%;padding-right:20%; padding-top: 15px; color: #808080; font-size: 14px;'>Figure 1: As the concentration of tagged protein (green) increases for a given concentration of Lon (blue), the tagged protein will eventually reach a sufficiently high concentration that all of the Lon proteins in the cell are operating at their maximal effective rate (left, middle panel). By increasing the concentration of tagged protein past this point, the effective degradation strength of the Lon will decrease, as the Lon is already degrading as many proteins as it can (right panel). In this case we say that the protease is saturated. Measurements of degradation rate or speed control that were obtained in an unsaturated scenario will not necessarily apply directly to a saturated scenario. Images adapted from [1].
+</div></figcaption>
+</figure>
+</center>
+<div style='padding-top: 40px;'></div>
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' > This presents an issue for teams who wish to make use of our existing characterization of the pdt parts in a larger circuit context, as the presence of loading and saturation will prevent them from being able to directly apply our measurements into their system. In order to address this issue, we decided to develop a mathematical model of protease saturation to investigate how loading effects influence the magnitude of the speed control provided by our pdt system.
+</div>
+<div style='padding-top: 40px;'></div>
+<center>
+<figure style='padding-left: px;'>
+<img src='https://static.igem.org/mediawiki/2017/b/b4/T--William_and_Mary--satmodel.png' width = "50%"/>
+<figcaption><div style='padding-left: 20%;padding-right:20%; padding-top: 15px; color: #808080; font-size: 14px;'>Figure 2. Our ODE model of protease saturation. Load is represented as a generic protein (F) that competes with RFP (R) for degradation by Lon (L). Tuning the kinetic parameters associated with F, relative to the parameters associated with R and L, represents the amount of load in the circuit. We explicitly account for the bound Lon-protein complexes (C) as species in the model so that the sequestration effects which drive loading and saturation can be directly interrogated. Please see the technical modeling report [[HYPERLINK TO WUKUN AND CEDAR’S PDF]] for more detailed information about our model’s derivation and analysis.
+</div></figcaption>
+</figure>
+</center>
+<div style='padding-top: 40px;'></div>
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >
+Saturation models for protease activity which are very similar to our own have previously been developed and studied extensively in the literature [2]. However, our investigation is innovative in its specific focus on the effect of saturation on protease-driven speed control, which is an area which lacks extensive grounding in the literature. Using our model, we aimed to develop the beginnings of a theoretical framework for understanding how speed control effects change in magnitude as a function of protease load.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P3
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >
+One of the more difficult aspects of using kinetic models like the one in Figure 2 to describe genetic processes is that these expressions require many parameter values, some of which cannot easily be determined from the literature. Even with parameters for which literature values exist, there are often a large number of implicit assumptions involved in the process of plugging in such a parameter into an ODE model. In order to be as confident as possible that the insights that we draw from our modeling investigation would be directly relevant to future teams, we decided to perform parameter estimation on our datasets.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P4
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Popular methods for parameter estimation, such as linear and non-linear regression, return point estimates for parameter values that potentially obscure rich information that relies on the correlations between distributions of parameters. However, in aiming to develop a groundwork for a theoretical understanding of speed control, we felt that we could not afford to lose information which could potentially lead to valuable insights. Therefore, we decided to perform a Bayesian parameter estimation to determine our model.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P5
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Bayes’ theorem can used to turn a mathematical model (which is converted into a statistical model) into a probability distribution for your parameters, known as the posterior distribution [3]. The posterior distribution defines how probable it is that a parameter takes a specific value based on the data observed and any prior information available about the parameter distributions (previous iGEM characterization, published literature, etc.). This method has a clear advantage over regression and maximum likelihood estimation approaches, as it allows you to estimate the entire probability distributions for parameter values instead of a single point estimate. In addition, you can visualize all covariation across parameters by plotting their joint parameter distributions, which leads to additional insights into your experimental system.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P6
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >To estimate the posterior distributions of our parameters we use Markov chain Monte Carlo (MCMC). Markov chain Monte Carlo approaches rely on random walkers that traverse parameter space, taking each step with a probability proportional to the likelihood that those parameters explain your data. As the walkers take steps across the parameter space they sample the underlying posterior distribution, and their relative abundance across parameter space is used to measure how probable it is for a given set of to explain your data. The specific implementation used for MCMC has large implications for how long the sampler has to run in order to obtain good estimates of the posterior distribution for your parameters. When looking for the best implementation to use for performing our parameter estimation we wanted an open-source, well-documented package that had previously been used to successfully identify parameters from biological models. Emcee (commonly known as “the MCMC hammer”) is a Python package developed for efficient MCMC parameter estimation [4]. Emcee is an affine-invariant MCMC method that is extremely fast, flexible, and incredibly well-documented which lowers the barrier to entry for future iGEM teams allowing them to also move into Bayesian parameter estimation.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P7
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >A necessary component of any MCMC algorithm is the ability of the sampler (Emcee) to assess the likelihood of a given set of parameters given the data you observed. An extremely fast way to evaluate the parameters is by using an analytical solution to your system of ODEs, however quite frequently these can’t be found or are extremely difficult and time-consuming to identify. Another more general approach is to use the parameters at a given point to simulate your data using your underlying ODE model and then check to see how far off your simulated data is from the ground truth. We use a least-square error model to calculate the error between our simulated trajectories and the ground truth reference.
 </div>
 <div style='padding-top: 15px;'></div>
-<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Overview P8
+<div style = 'padding-right: 14%; padding-left: 14%; text-indent: 50px;line-height: 25px;' >Given that we have a method of randomly sampling parameter values (Emcee), and a penalty function (LSE), we now need an efficient method to simulate our ODE models. Given that we are using Emcee, we were constrained to Python ODE solvers. We also wanted to choose an extremely fast and flexible modeling framework for rapid integration across different possible models. We selected to use the open-source BioSCRAPE package for simulation [5]. BioSCRAPE has been used to successfully identify parameters from biological datasets, supports SBML input files, can simulate deterministic and stochastic models, and is easy to install and use.
 </div>
 <div style='padding-top: 40px;'></div>
@@ Line 85: / Line 122: @@
 <figure style='padding-left: px;'>
 <img src='https://static.igem.org/mediawiki/2017/6/64/T--William_and_Mary--MCMC-F1.png' width = "50%"/>
-<figcaption><div style='padding-left: 20%;padding-right:20%; padding-top: 15px; color: #808080; font-size: 14px;'>Figure 1:
+<figcaption><div style='padding-left: 20%;padding-right:20%; padding-top: 15px; color: #808080; font-size: 14px;'>Figure 3: Kinetic diagram of a simple RFP protein production and decay model.
 </div></figcaption>
 </figure>