By constructing a mechanistic model of our degradation system, we were able to rigorously analyze our timeseries datasets using Bayesian Parameter Inference and obtain parameter distributions. We then used this analysis to feed back into our predictive saturation model, which will inform future teams on how to best use our inducible protease and collection of PDTs.
Overview
In our results, we demonstrated a modular system of protein degradation tags (PDTs) that confer changes in the speed of a genetic circuit. However, all of these measurements were performed in a system containing only one type of tagged protein. As future iGEM teams move towards designing larger and more complex circuits, our initial measurements of speed control in a single gene could potentially become less relevant. This is because all pdts within a circuit share the same protease, so tagging more and more proteins with pdts will cause the degradation burden placed on the protease to increase. This effect is called loading, and in the case where the concentration of tagged protein is higher than the maximal effective degradation rate of the Lon in the system, we say that the proteases are saturated (Fig 1). In a circuit operating in a saturated regime, tagging an additional protein with a pdt will confer a heavily reduced effect on its speed compared to performing the same tagging in an unsaturated regime.
This presents an issue for teams who wish to make use of our existing characterization of the pdt parts in a larger circuit context, as the presence of loading and saturation will prevent them from being able to directly apply our measurements into their system. In order to address this issue, we decided to develop a mathematical model of protease saturation to investigate how loading effects influence the magnitude of the speed control provided by our pdt system.
Saturation models for protease activity which are very similar to our own have previously been developed and studied extensively in the literature [2]. However, our investigation is innovative in its specific focus on the effect of saturation on protease-driven speed control, which is an area which lacks extensive grounding in the literature. Using our model, we aimed to develop the beginnings of a theoretical framework for understanding how speed control effects change in magnitude as a function of protease load.
One of the more difficult aspects of using kinetic models like the one in Figure 2 to describe genetic processes is that these expressions require many parameter values, some of which cannot easily be determined from the literature. Even with parameters for which literature values exist, there are often a large number of implicit assumptions involved in the process of plugging in such a parameter into an ODE model. In order to be as confident as possible that the insights that we draw from our modeling investigation would be directly relevant to future teams, we decided to perform parameter estimation on our datasets.
Popular methods for parameter estimation, such as linear and non-linear regression, return point estimates for parameter values that potentially obscure rich information that relies on the correlations between distributions of parameters. However, in aiming to develop a groundwork for a theoretical understanding of speed control, we felt that we could not afford to lose information which could potentially lead to valuable insights. Therefore, we decided to perform a Bayesian parameter estimation to determine our model.
Bayes’ theorem can used to turn a mathematical model (which is converted into a statistical model) into a probability distribution for your parameters, known as the posterior distribution [3]. The posterior distribution defines how probable it is that a parameter takes a specific value based on the data observed and any prior information available about the parameter distributions (previous iGEM characterization, published literature, etc.). This method has a clear advantage over regression and maximum likelihood estimation approaches, as it allows you to estimate the entire probability distributions for parameter values instead of a single point estimate. In addition, you can visualize all covariation across parameters by plotting their joint parameter distributions, which leads to additional insights into your experimental system.
To estimate the posterior distributions of our parameters we use Markov chain Monte Carlo (MCMC). Markov chain Monte Carlo approaches rely on random walkers that traverse parameter space, taking each step with a probability proportional to the likelihood that those parameters explain your data. As the walkers take steps across the parameter space they sample the underlying posterior distribution, and their relative abundance across parameter space is used to measure how probable it is for a given set of to explain your data. The specific implementation used for MCMC has large implications for how long the sampler has to run in order to obtain good estimates of the posterior distribution for your parameters. When looking for the best implementation to use for performing our parameter estimation we wanted an open-source, well-documented package that had previously been used to successfully identify parameters from biological models. Emcee (commonly known as “the MCMC hammer”) is a Python package developed for efficient MCMC parameter estimation [4]. Emcee is an affine-invariant MCMC method that is extremely fast, flexible, and incredibly well-documented which lowers the barrier to entry for future iGEM teams allowing them to also move into Bayesian parameter estimation.
A necessary component of any MCMC algorithm is the ability of the sampler (Emcee) to assess the likelihood of a given set of parameters given the data you observed. An extremely fast way to evaluate the parameters is by using an analytical solution to your system of ODEs, however quite frequently these can’t be found or are extremely difficult and time-consuming to identify. Another more general approach is to use the parameters at a given point to simulate your data using your underlying ODE model and then check to see how far off your simulated data is from the ground truth. We use a least-square error model to calculate the error between our simulated trajectories and the ground truth reference.
Given that we have a method of randomly sampling parameter values (Emcee), and a penalty function (LSE), we now need an efficient method to simulate our ODE models. Given that we are using Emcee, we were constrained to Python ODE solvers. We also wanted to choose an extremely fast and flexible modeling framework for rapid integration across different possible models. We selected to use the open-source BioSCRAPE package for simulation [5]. BioSCRAPE has been used to successfully identify parameters from biological datasets, supports SBML input files, can simulate deterministic and stochastic models, and is easy to install and use.