In order to better ensure that our measurements performed on a single reporter gene will be translatable to teams who want to use our pdt system in larger, more complex circuits, we developed a model of protease-driven speed control which accounts for loading and saturation effects on the protease. We then performed Bayesian parameter estimation on our measured datasets with Markov Chain Monte Carlo sampling to parametrize our model specifically to our pdt system. We then used the model to investigate the relationship between protease saturation and speed control, drawing insights that will inform future teams how to better incorporate pdts into their genetic circuit designs.
Overview
In our results, we demonstrated a modular system of protein degradation tags (PDTs) that confer changes in the speed of a genetic circuit. However, all of these measurements were performed in a system containing only one type of tagged protein. As future iGEM teams move towards designing larger and more complex circuits, our initial measurements of speed control in a single gene could potentially become less relevant. This is because all pdts within a circuit share the same protease, so tagging more and more proteins with pdts will cause the degradation burden placed on the protease to increase. This effect is called loading, and in the case where the concentration of tagged protein is higher than the maximal effective degradation rate of the Lon in the system, we say that the proteases are saturated (Fig 1). In a circuit operating in a saturated regime, tagging an additional protein with a pdt will confer a heavily reduced effect on its speed compared to performing the same tagging in an unsaturated regime.
This presents an issue for teams who wish to make use of our existing characterization of the pdt parts in a larger circuit context, as the presence of loading and saturation will prevent them from being able to directly apply our measurements into their system. In order to address this issue, we decided to develop a mathematical model of protease saturation to investigate how loading effects influence the magnitude of the speed control provided by our pdt system.
Saturation models for protease activity which are very similar to our own have previously been developed and studied extensively in the literature [2]. However, our investigation is innovative in its specific focus on the effect of saturation on protease-driven speed control, which is an area which lacks extensive grounding in the literature. Using our model, we aimed to develop the beginnings of a theoretical framework for understanding how speed control effects change in magnitude as a function of protease load.
One of the more difficult aspects of using kinetic models like the one in Figure 2 to describe genetic processes is that these expressions require many parameter values, some of which cannot easily be determined from the literature. Even with parameters for which literature values exist, there are often a large number of implicit assumptions involved in the process of plugging in such a parameter into an ODE model. In order to be as confident as possible that the insights that we draw from our modeling investigation would be directly relevant to future teams, we decided to perform parameter estimation on our datasets.
Popular methods for parameter estimation, such as linear and non-linear regression, return point estimates for parameter values that potentially obscure rich information that relies on the correlations between distributions of parameters. However, in aiming to develop a groundwork for a theoretical understanding of speed control, we felt that we could not afford to lose information which could potentially lead to valuable insights. Therefore, we decided to perform a Bayesian parameter estimation to determine our model.
Bayesian Parameter Estimation with MCMC
Bayes’ theorem can used to turn a mathematical model (which is converted into a statistical model) into a probability distribution for your parameters, known as the posterior distribution [3]. The posterior distribution defines how probable it is that a parameter takes a specific value based on the data observed and any prior information available about the parameter distributions (previous iGEM characterization, published literature, etc.). This method has a clear advantage over regression and maximum likelihood estimation approaches, as it allows you to estimate the entire probability distributions for parameter values instead of a single point estimate. In addition, you can visualize all covariation across parameters by plotting their joint parameter distributions, which leads to additional insights into your experimental system.
To estimate the posterior distributions of our parameters we use Markov chain Monte Carlo (MCMC). Markov chain Monte Carlo approaches rely on random walkers that traverse parameter space, taking each step with a probability proportional to the likelihood that those parameters explain your data. As the walkers take steps across the parameter space they sample the underlying posterior distribution, and their relative abundance across parameter space is used to measure how probable it is for a given set of to explain your data. The specific implementation used for MCMC has large implications for how long the sampler has to run in order to obtain good estimates of the posterior distribution for your parameters. When looking for the best implementation to use for performing our parameter estimation we wanted an open-source, well-documented package that had previously been used to successfully identify parameters from biological models. Emcee (commonly known as “the MCMC hammer”) is a Python package developed for efficient MCMC parameter estimation [4]. Emcee is an affine-invariant MCMC method that is extremely fast, flexible, and incredibly well-documented which lowers the barrier to entry for future iGEM teams allowing them to also move into Bayesian parameter estimation.
A necessary component of any MCMC algorithm is the ability of the sampler (Emcee) to assess the likelihood of a given set of parameters given the data you observed. An extremely fast way to evaluate the parameters is by using an analytical solution to your system of ODEs, however quite frequently these can’t be found or are extremely difficult and time-consuming to identify. Another more general approach is to use the parameters at a given point to simulate your data using your underlying ODE model and then check to see how far off your simulated data is from the ground truth. We use a least-square error model to calculate the error between our simulated trajectories and the ground truth reference.
Given that we have a method of randomly sampling parameter values (Emcee), and a penalty function (LSE), we now need an efficient method to simulate our ODE models. Given that we are using Emcee, we were constrained to Python ODE solvers. We also wanted to choose an extremely fast and flexible modeling framework for rapid integration across different possible models. We selected to use the open-source BioSCRAPE package for simulation [5]. BioSCRAPE has been used to successfully identify parameters from biological datasets, supports SBML input files, can simulate deterministic and stochastic models, and is easy to install and use.
For any parameter estimation problem a key first step is to ascertain whether or not the parameters in your model are identifiable using the data you have available to you [6]. This is especially true for estimation in biological models where typically there are many fewer observed variables (fluorescent proteins, etc.) than parameters (binding rates, production rates, degradation rates, etc.). As an initial check of our MCMC sampling routine, we generated a simple model of gene expression where protein is produced at some rate, Beta, and is degraded at some rate, Gamma (Figure 3). We then picked parameters to generate a ground truth, benchmark simulation. We then used this ground truth as a reference to see if MCMC could find the true parameters with sufficient accuracy to recreate the observed timecourse. For this parameter estimation we used 100 walkers (50 times the number of parameters), 1000 burn steps, and 100 stored steps. As shown in Figure 4, we are able to identify the parameters accurately, and, in addition, identify a strong positive correlation between Beta (production) and Gamma (degradation). This fits well with our understanding of the simple protein model because steady state is determined by the ratio of Beta/Gamma, so as beta increases so too must gamma in order to obtain the original steady state. We then used parameter sets identified by MCMC to simulate trajectories following the ODEs derived from Figure 1. The simulated traces from estimated parameters exactly coincide with the ground truth simulation (Figure 5). We have now demonstrated that our MCMC approach can not only identify the most likely values for kinetic parameters in a mechanistic model of gene expression, but also can elucidate relationships between these parameters that would have been lost by other parameter estimation techniques.
While our MCMC was able to identify parameters for the simple protein expression model shown in Figure 3, we wanted to determine whether or not our Lon ODE model was identifiable (Figure 6). As we are only measuring one output (RFP fluorescence), it’s possible that MCMC sampling cannot differentiate between values like the amount of Lon vs. the affinity of the degradation tag for Lon. We took the same approach as above to first check whether our Lon model was identifiable with only a single observable output. To do that we again simulated a reference timecourse, this time using the model shown in Figure 6, with known parameters and attempted to identify the parameters using MCMC. We now have several free parameters (7), so we have increased the number of walkers (350) and increased the number of burn steps (2000). As shown in Figure 7 we were able to accurately identify the parameters in the Lon-PDT model using only timecourse fluorescent measurements of a single reporter gene.
After confirming our model was identifiable using simulated data, we applied the same MCMC sampling routine to our timecourse fluorescence measurements. We had our MCMC model fit both our weakest tag case (set by K_min) and our strongest tag case (set by K_max) simultaneously, where the only thing it could vary for between two observed datasets was the strength of K. We were able to identify tight posterior distributions for all parameters (values provided in Table 1). In addition, the joint distributions of Beta_Lon and Alpha_Lon and K_min and K_max revealed more information than a single point estimate would allow (Figure 8 and Figure 9).
We found that Alpha_Lon and Beta_Lon are negatively correlated (r = -0.8, p=0), suggesting that an important quantity in the model is the total amount of Lon (set by the product of Alpha_Lon and Beta_Lon). Our results from the joint distribution of K_min and K_max are strongly positively correlated (r=1, p=0), which strengthens our intuition of the model. As the total amount of Lon is fixed for each pair of K_min and K_max tested, if there is an increased amount of Lon both K_min and K_max values will be lower, and if there is a reduced amount of Lon both K_min and K_max will be higher.
Parameter
Units
MCMC estimate
Beta_R
nM/min
5.22
Alpha_R
1/min
3.21
Beta_L
nM/min
0.27
Alpha_L
1/min
0.32
Gamma_dilution
1/min
0.03
Gamma_RNA
1/min
0.03
K_max
1/(nM*min)
2.3e-05
K_min
1/(nM*min)
3.9e-03
Table 1: Parameter estimates from MCMC fitting. All MCMC estimates are the median of full posterior distribution. As all of our parameters are fit well, we believe these estimates are much more informative for our saturation model than pulling values from the literature.
By rigorously analyzing our timecourse degradation experiments using bayesian parameter estimation, we were able to obtain parameter values that better informed our saturation model. Further, we hope that future iGEM teams will build on this work and move away from regression-based approaches and towards the rich world of Bayesian inference for gene expression models. The Python-Emcee-BioSCRAPE pipeline is a completely open-source approach to performing sophisticated analysis in a fast and efficient way and will help to better inform predictive models that guide the next generation of synthetic biology.
Protease Saturation Model
Now that we have obtained direct estimates of our kinetic parameters using MCMC, we can now use our model to make assessments about the effect of saturation on our speed measurements, with the confidence that we have maximized our potential to obtain insights that will be directly relevant to the use of our pdt system.
In order to investigate the effect of saturation on speed, we added in a load protein into our ODE model that effectively sequesters Lon away from RFP, decreasing its ability to degrade RFP, and hence reducing the change in RFP’s expression speed. By varying the production rate of this load protein over a large range of values, we can observe the system’s transition from an unsaturated state to a saturated state where the addition of a pdt does not confer any significant change in RFP’s speed. Based on this understanding of saturation, we formulated our intuitive understanding into a quantitative measure by defining the Fold Increase in Speed for a given choice of parameters as the ratio between the time to half max for a simulated untagged RFP and the time to half max for a simulated RFP tagged with pdt A. In the limit of this ratio approaching 1, there is no difference in speed between the two pdt conditions so the system is completely saturated (Figure 10).
We first investigated the saturation effect induced by a load protein with kinetic parameter values which are identical to that of the RFP (Fig 10). We observed a sharp transition from the unsaturated state to the saturated state, which occurred when the production rate of the load protein is equal to that of the RFP. This is an interesting result as it implies that, at least in this specification of the system where the load protein is biochemically very similar to the reporter protein, that the damping of speed change by protease saturation is closer to an all-or-none effect than a gradual loading of the protease.
We then proceeded to ask whether tuning other properties of the load protein would alter the nature of this transition. In particular, we investigated the effect of varying the affinity of the load protein to Lon in the above simulation. This affinity is a parameter which can be directly tuned during the design of the circuit by selecting a particular pdt. Interestingly, we found that varying the affinity of the load protein by an order of magnitude in either direction did not significantly vary the location of the saturation transition, but that it instead controlled the sharpness of the transition (Figure 11).
We did find, however, that the location of the transition would shift as a function of the proteolysis rate of the load protein (Figure 12). If the proteolysis rate is lower, the Lon remains bound to the load protein for a longer time, effectively amplifying the sequestration of the Lon away from RFP. Because of this, a lower concentration of load protein with a low proteolysis rate is able to induce a level of protease saturation comparable to a higher concentration of another load protein with a high proteolysis rate. This result implies that teams who wish to use our pdt system to tag large, bulky proteins such as dCas9 should take caution and perform additional characterization on the load levels of their circuit components in the initial phases of their circuit design, to ensure that our measurements of speed control will translate accurately to their circuit.
Finally, we wanted to determine whether the functional quantity driving the saturation transition was explicitly dependent on the number of types of load proteins, or whether the transition was driven entirely by the total load placed on the protease regardless of the number of proteins making up that load. To address this distinction, we added a second load protein into our model and simulated the result of the three-way competition between RFP and the two load protein for degradation by Lon (Figure 13).
We then compared the system with two load proteins to a system with one load protein with an affinity 2X stronger than each of the individuals. We found that the saturation transition was very similar in the two conditions (Fig 14). This suggests that, at least in this simple case comparing one against two load proteins, there is no intrinsic additional property conferred by the inclusion of an additional protein in the circuit, and that teams should be able to design their circuits by accounting for total load on their protease rather than having to enumerate each type of protein individually.
Discussion
Our modeling and analysis was focused to achieve a better theoretical grounding of the effect of load and protease saturation on the speed control provided by our pdt system. By integrating our saturation model with bayesian parameter estimation from the experimental data we collected, we were able to provide design considerations for future iGEM team who will build upon our work, building increasingly complex circuits that rely on our effective speed increases.
Our hope is that future iGEM teams will not only use our pdt system to control the speed of their genes and the dynamical properties of their circuits, but also will both use and build upon our theoretical framework for understanding the effect of saturation on protease-driven speed control.
References
[1] Cameron DE, Collins JJ. Tunable protein degradation in bacteria. Nature biotechnology. 2014 Dec 1;32(12):1276-81.
[2] McBride, Cameron, and Domitilla Del Vecchio. "Analyzing and exploiting the effects of protease sharing in genetic circuits." Proceedings of the 20th World Congress of International Federation of Automatic Control (IFAC). 2017
[3] Jaynes, Edwin T. Probability theory: The logic of science. Cambridge university press, 2003.
[4] Foreman-Mackey, Daniel, et al. "emcee: the MCMC hammer." Publications of the Astronomical Society of the Pacific 125.925 (2013): 306.
[5] Swaminathan, Anandh, Victoria Hsiao, and Richard M. Murray. "Quantitative Modeling of Integrase Dynamics Using a Novel Python Toolbox for Parameter Inference in Synthetic Biology." bioRxiv (2017): 121152.
[6] Swaminathan, Anandh, and Richard M. Murray. "Identification of Markov Chains From Distributional Measurements and Applications to Systems Biology." IFAC Proceedings Volumes 47.3 (2014): 4400-4405.
[7] Gur E, Sauer RT. Evolution of the ssrA degradation tag in Mycoplasma: specificity switch to a different protease. Proceedings of the Natural Academy of Sciences. 2008 Oct 21; 105(42): 16113-8.