By constructing a mechanistic model of our degradation system, we were able to rigorously analyze our timeseries datasets using Bayesian Parameter Inference and obtain parameter distributions. We then used this analysis to feed back into our predictive saturation model, which will inform future teams on how to best use our inducible protease and collection of PDTs.
Overview
In our results, we demonstrated a modular system of protein degradation tags (PDTs) that confer changes in the speed of a genetic circuit. However, all of these measurements were performed in a system containing only one type of tagged protein. As future iGEM teams move towards designing larger and more complex circuits, our initial measurements of speed control in a single gene could potentially become less relevant. This is because all pdts within a circuit share the same protease, so tagging more and more proteins with pdts will cause the degradation burden placed on the protease to increase. This effect is called loading, and in the case where the concentration of tagged protein is higher than the maximal effective degradation rate of the Lon in the system, we say that the proteases are saturated (Fig 1). In a circuit operating in a saturated regime, tagging an additional protein with a pdt will confer a heavily reduced effect on its speed compared to performing the same tagging in an unsaturated regime.
This presents an issue for teams who wish to make use of our existing characterization of the pdt parts in a larger circuit context, as the presence of loading and saturation will prevent them from being able to directly apply our measurements into their system. In order to address this issue, we decided to develop a mathematical model of protease saturation to investigate how loading effects influence the magnitude of the speed control provided by our pdt system.
Saturation models for protease activity which are very similar to our own have previously been developed and studied extensively in the literature [2]. However, our investigation is innovative in its specific focus on the effect of saturation on protease-driven speed control, which is an area which lacks extensive grounding in the literature. Using our model, we aimed to develop the beginnings of a theoretical framework for understanding how speed control effects change in magnitude as a function of protease load.
One of the more difficult aspects of using kinetic models like the one in Figure 2 to describe genetic processes is that these expressions require many parameter values, some of which cannot easily be determined from the literature. Even with parameters for which literature values exist, there are often a large number of implicit assumptions involved in the process of plugging in such a parameter into an ODE model. In order to be as confident as possible that the insights that we draw from our modeling investigation would be directly relevant to future teams, we decided to perform parameter estimation on our datasets.
Popular methods for parameter estimation, such as linear and non-linear regression, return point estimates for parameter values that potentially obscure rich information that relies on the correlations between distributions of parameters. However, in aiming to develop a groundwork for a theoretical understanding of speed control, we felt that we could not afford to lose information which could potentially lead to valuable insights. Therefore, we decided to perform a Bayesian parameter estimation to determine our model.
Bayes’ theorem can used to turn a mathematical model (which is converted into a statistical model) into a probability distribution for your parameters, known as the posterior distribution [3]. The posterior distribution defines how probable it is that a parameter takes a specific value based on the data observed and any prior information available about the parameter distributions (previous iGEM characterization, published literature, etc.). This method has a clear advantage over regression and maximum likelihood estimation approaches, as it allows you to estimate the entire probability distributions for parameter values instead of a single point estimate. In addition, you can visualize all covariation across parameters by plotting their joint parameter distributions, which leads to additional insights into your experimental system.
To estimate the posterior distributions of our parameters we use Markov chain Monte Carlo (MCMC). Markov chain Monte Carlo approaches rely on random walkers that traverse parameter space, taking each step with a probability proportional to the likelihood that those parameters explain your data. As the walkers take steps across the parameter space they sample the underlying posterior distribution, and their relative abundance across parameter space is used to measure how probable it is for a given set of to explain your data. The specific implementation used for MCMC has large implications for how long the sampler has to run in order to obtain good estimates of the posterior distribution for your parameters. When looking for the best implementation to use for performing our parameter estimation we wanted an open-source, well-documented package that had previously been used to successfully identify parameters from biological models. Emcee (commonly known as “the MCMC hammer”) is a Python package developed for efficient MCMC parameter estimation [4]. Emcee is an affine-invariant MCMC method that is extremely fast, flexible, and incredibly well-documented which lowers the barrier to entry for future iGEM teams allowing them to also move into Bayesian parameter estimation.
A necessary component of any MCMC algorithm is the ability of the sampler (Emcee) to assess the likelihood of a given set of parameters given the data you observed. An extremely fast way to evaluate the parameters is by using an analytical solution to your system of ODEs, however quite frequently these can’t be found or are extremely difficult and time-consuming to identify. Another more general approach is to use the parameters at a given point to simulate your data using your underlying ODE model and then check to see how far off your simulated data is from the ground truth. We use a least-square error model to calculate the error between our simulated trajectories and the ground truth reference.
Given that we have a method of randomly sampling parameter values (Emcee), and a penalty function (LSE), we now need an efficient method to simulate our ODE models. Given that we are using Emcee, we were constrained to Python ODE solvers. We also wanted to choose an extremely fast and flexible modeling framework for rapid integration across different possible models. We selected to use the open-source BioSCRAPE package for simulation [5]. BioSCRAPE has been used to successfully identify parameters from biological datasets, supports SBML input files, can simulate deterministic and stochastic models, and is easy to install and use.
For any parameter estimation problem a key first step is to ascertain whether or not the parameters in your model are identifiable using the data you have available to you [6]. This is especially true for estimation in biological models where typically there are many fewer observed variables (fluorescent proteins, etc.) than parameters (binding rates, production rates, degradation rates, etc.). As an initial check of our MCMC sampling routine, we generated a simple model of gene expression where protein is produced at some rate, Beta, and is degraded at some rate, Gamma (Figure 3). We then picked parameters to generate a ground truth, benchmark simulation. We then used this ground truth as a reference to see if MCMC could find the true parameters with sufficient accuracy to recreate the observed timecourse. For this parameter estimation we used 100 walkers (50 times the number of parameters), 1000 burn steps, and 100 stored steps. As shown in Figure 4, we are able to identify the parameters accurately, and, in addition, identify a strong positive correlation between Beta (production) and Gamma (degradation). This fits well with our understanding of the simple protein model because steady state is determined by the ratio of Beta/Gamma, so as beta increases so too must gamma in order to obtain the original steady state. We then used parameter sets identified by MCMC to simulate trajectories following the ODEs derived from Figure 1. The simulated traces from estimated parameters exactly coincide with the ground truth simulation (Figure 5). We have now demonstrated that our MCMC approach can not only identify the most likely values for kinetic parameters in a mechanistic model of gene expression, but also can elucidate relationships between these parameters that would have been lost by other parameter estimation techniques.
While our MCMC was able to identify parameters for the simple protein expression model shown in Figure 3, we wanted to determine whether or not our Lon ODE model was identifiable (Figure 6). As we are only measuring one output (RFP fluorescence), it’s possible that MCMC sampling cannot differentiate between values like the amount of Lon vs. the affinity of the degradation tag for Lon. We took the same approach as above to first check whether our Lon model was identifiable with only a single observable output. To do that we again simulated a reference timecourse, this time using the model shown in Figure 6, with known parameters and attempted to identify the parameters using MCMC. We now have several free parameters (7), so we have increased the number of walkers (350) and increased the number of burn steps (2000). As shown in Figure 7 we were able to accurately identify the parameters in the Lon-PDT model using only timecourse fluorescent measurements of a single reporter gene.
After confirming our model was identifiable using simulated data, we applied the same MCMC sampling routine to our timecourse fluorescence measurements. We had our MCMC model fit both our weakest tag case (set by K_min) and our strongest tag case (set by K_max) simultaneously, where the only thing it could vary for between two observed datasets was the strength of K. We were able to identify tight posterior distributions for all parameters (values provided in Table 1). In addition, the joint distributions of Beta_Lon and Alpha_Lon and K_min and K_max revealed more information than a single point estimate would allow (Figure 8 and Figure 9).
We found that Alpha_Lon and Beta_Lon are negatively correlated (r = -0.8, p=0), suggesting that an important quantity in the model is the total amount of Lon (set by the product of Alpha_Lon and Beta_Lon). Our results from the joint distribution of K_min and K_max are strongly positively correlated (r=1, p=0), which strengthens our intuition of the model. As the total amount of Lon is fixed for each pair of K_min and K_max tested, if there is an increased amount of Lon both K_min and K_max values will be lower, and if there is a reduced amount of Lon both K_min and K_max will be higher.
Parameter
Units
MCMC estimate
Beta_R
nM/min
5.22/td>
Alpha_R
1/min
3.21
Beta_L
nM/min
0.27
Alpha_L
1/min
0.32
Gamma_dilution
1/min
0.03
Gamma_RNA
1/min
0.03
K_max
1/nM*min
2.3e-05
K_min
1/nM*min
3.9e-03
Table 1: Parameter estimates from MCMC fitting. All MCMC estimates are the median of full posterior distribution. As all of our parameters are fit well, we believe these estimates are much more informative for our saturation model than pulling values from the literature.