Team:Uppsala/Model

Modeling

Summary

The three enzymes in our crocin pathway are relatively newly discovered and thus poorly characterized, with no structures available. In addition, we were only able to find enzymatic parameters concerning one of the enzymes, UGTCs2 (1). Via homology modeling we could draw the conclusion that UGTCs2 and CaCCD2 are monomers while CsADH2946 is a tetramer. We could also ensure that the N-terminal was facing outward from the protein and could safely put a His-tag there for purification. Furthermore, the extensive 100 ns MD simulation of our models verified that the models were stable in solvent and that we could further work with the models to estimate the binding affinity between enzyme and substrate. Steered molecular dynamics simulations showed that CsADH2946 is specific to its substrate crocetin dialdehyde and that it has an affinity in the µM range to the substrate. The estimation of K_M from the experimental activity measurements also showed that CsADH2946 has a high affinity towards the substrate crocetin dialdehyde, in agreement with Steered Molecular Dynamics results.

Homology Modeling of the Uncharacterized Enzymes

One of the first things we noticed when looking at our selected enzymes was that no three-dimensional structures were available and the enzymes were poorly characterized in general. To be able to get structural information we turned to homology modeling.

The idea of homology modeling is to construct a 3D-structure by mapping the amino acid sequence onto a template – i.e. a known structure of a related, homologous protein through sequence alignment. For this to work you need a template with a solved 3D-structure. The quality of the model is determined by the alignment with, and, the structure of the template. Thus, ideally, you would like a high-resolution structure with high sequence identity.

The tool we used to generate our homology models was SwissDock /2/, a fully automated protein structure homology-modeling server (2, 3, 4). We plugged in the sequences of our chosen enzymes ((CaCCD2), (CsADH2946) and (UGTCs2)) and started modeling. The templates were chosen based upon quality parameters such as Global Mean Quality Estimation, QMEAN, resolution of the crystal structure and sequence identity (5). After choosing the best templates we obtained the models detailed in figure 1 and the models along with the quality scoring are summarized in table 1. We could make two immediate observations. The CsADH2946 model seemed the most promising one, quality-wise with GMQE close to 1 and higher QMEAN being better. In addition, looking at the N-terminals (the blue ends in figure 1) we could see that they are stretched out, poking outwards from the protein. This was a good indication that we could put a His-tag at this end, with no complications. Another discovery we made was that the second step enzyme CsADH2946 is most likely a tetramer. This information was helpful when purifying the enzyme and going forth with molecular dynamics.

Table 1. The enzymes modeled, which templates were used and quality assessment by means of GMQE and QMEAN. GMQE ranges between 0 and 1, where one is good and zero is bad. As for QMEAN the higher the score, the better(1).

Figure 1. Homology models of the three enzymes selected for the zeaxanthin-crocin pathway. The first and the third enzyme are monomers, while CsADH2946 is a tetramer. The homology models were created using SWISS-MODEL /2/.

Molecular Dynamics in GROMACS

– The Art of Putting Digital Molecules in Digital Boxes of Water

To assess the models and prepare them for further characterization we used GROMACS /1/ to simulate our enzymes in saline water for a total of 100 ns. This lets us assess their stability and obtain new models that are closer to their native conformation which would be the most probable state of the enzymes during the activity measurements in our wet lab.

How We Simulated Our Enzymes

Molecular dynamics is a simulation approach where the potential energies of a molecular system is parameterized using a predefined force-field. A trajectory of the atoms of the system, that is how the atoms move over time, can then be simulated by numerically intergrating Newton's equations of motion.

The MD simulations were performed based on previously defined protocols, which were nicely summarized by Justin Lemkul in a tutorial (6). All the steps until the production MD can be done on practically any system while the production MD will require a supercomputer to reduce time. Next follows a short description of the main steps in the protocols as well as some tips and general comments.

Define Box and Solvate

The protein has to be put in a box containing solvent where the simulation will take place. We chose a cubic box with at least 1 nm between the protein and the box. The simulation is run with periodic boundary conditions, meaning that if the molecule touches the boundary it will appear on the other side of the box.

Add Ions

We neutralized our enzymes and added ions to a concentration of 0.15 M to get closer to physiological conditions (7).

Energy Minimization

In this step the energy of the system is minimized using a steepest descent algorithm. This is done to remove any conformations that are unfavourable in terms of energy. The potential energy of our system was calculated using the CHARMM27 force-field in this and all later steps of our simulations.

Equilibration

Equilibration is done to relax the system into an equilibrated state. It starts of by doing a simulation with constant volume, temperature (310 K) and number of particles. Once the first one is finnished, another simulation with constant temperature (310 K), pressure (1 bar) and a constant amount of particles is run. The system is also equilibrated in the beginning of the production simulation, but by performing a more refined equilibration protocol in advance, simulation time is saved.

Production Molecular Dynamics

Once the system was equilibrated production MD was carried out with constant pressure (1 bar), temperature (310 K) and number of particles. The systems were simulated for 100 ns on a supercomputer with 24 processors running in parallel. The time it takes for the simulation to complete is dependant on the amount of particles present. Because CsADH2946 is a tetramer with the highest amount of atoms in the simulated system, it took the longest time to finnish (approximately one month).

The Resulting Stable Enzyme Models

The result we got from the 100 ns stability simulations in GROMACS show a trajectory of each enzyme wiggling around trying to equilibrate into a low energy state. We created movies of these simulations that can be viewed in figure 2.

Figure 2. These videos are the result of simulating CaCCD2 (leftmost), CsADH2946 (middle) and UGTCs2 (rightmost) for 100 ns in saline water.

The movies all confirmed that our homology models are good in the sense that they are not melting. Along with the movies we received RMSD plots. The RMSD plot describes the distances between the atoms of the initial conformation going into the simulation and the atoms of the different conformational states the enzyme develops over the 100 ns. As can be seen in figure 3, The RMSD curve for each enzyme model converge towards one low, stable value. This shows that the models equilibrate to a stable state with low energy in the solvent and indicates that our homology models are in fact good models to represent our enzymes.

Figure 3. RMSD plot of a 100 ns stability simulation in GROMACS of homology models to the enzymes CaCCD2, CsADH2946 and UGTCs2.

Steered Molecular Dynamics (Pulling) with CsADH2946 and Crocetin Dialdehyde

Since CsADH2946 had great advancements in the lab, has the best homology model score of our enzymes and also has the RMSD which converges fastest at the lowest value showing that it is a stable structure, we decided to focus the rest of our computer power to do Steered Molecular Dynamics (pulling simulations) with this enzyme. A pulling simulation between the protein and a substrate allows us to estimate the dissociation constant K_d which can be compared to K_M. This is very useful to compare to the experimentally estimated K_M for CsADH2946 in order to validate both the simulation and experiments.

Docking

To be able to execute pulling simulations we had to pick the ligands that were going to be pulled in the simulations. The obvious ligand of choice was crocetin dialdehyde, which is converted to crocetin by CsADH2946. To be able to verify that CsADH2946 specifically binds crocetin dialdehyde, another ligand, the common acetaldehyde was chosen to be able to compare the enzyme's specificity. Acetaldehyde was run in a separate simulation to crocetin dialdehyde.

Before the pulling simulations can be started, it is required to select an initial configuration where the ligand is bound to the active site of the enzyme. This position was found by docking the ligands with one subunit from the CsADH2946 tetramer. The location of the active site was found by looking at the template molecule for the homology modeling (PDB entry 4fqf) (8). The active site of this homology template can be found at a loop containing three cysteines (C301, C302 and C303). In our homology model that corresponded to C337, C338 and C339, and docking was therefore performed in the patch of the protein where this loop can be found. The docking of both ligands were done using SwissDock (9, 10), where the structure used for CsADH2946 was an equilibrated conformation 40 ns into the production MD simulation. The best docked structure where the ligand was in close contact with the triple cystein loop was then chosen as starting configuration for the pulling simulation.

Pulling

Steered molecular dynamics (pulling) is a way of estimating the binding affinity between a protein and a ligand. We based our simulations on a previously defined protocol by Justin Lemkul (11). Before starting the simulation we needed to select a trajectory in which the ligands were to be pulled. The choice of trajectory was based on a path that was not obstructed by any residues. After being equilibrated for 200 ps, the pulling simulation was run for 8000 picoseconds pulling the ligands at a velocity of 0.25 nm/ns. The results from the simulation can be seen in the movies below (figure 4) and in figure 5.

Figure 4. Results of the pulling simulations of one subunit of CsADH2946 with crocetin dialdehyde (leftmost) and acetaldehyde (rightmost). The molecules were pulled with a velocity of 0.25 nm/ns.

These indicated that more force is required to be able to pull out crocetin dialdehyde than acetaldehyde from the active site of CsADH2946 as the peaks in figure 5 can be interpreted as energy barriers being crossed. This can also be seen in the movies where crocetin dialdehyde gets stuck in two positions. Thus, the CsADH2946 model demonstrates specificity towards crocetin dialdehyde.

Figure 5: Pulling force applied to crocetin dialdehyde and acetaldehyde from one subunit of our CsADH2946 homology model. For crocetin dialdehyde there are two peaks which indicates energy barriers where more force is required to pull crocetin dialdehyde out. For acetaldehyde there is no clear peak indicating that our model is not able to bind acetaldehyde as strongly as crocetin dialdehyde.

Using the Pulling Simulation to Estimate the Dissociation Constant of Crocetin Dialdehyde

The resulting maximum force applied to crocetin dialdehyde could then be used to roughly calculate the dissociation constant. To do this the following equation was used where 𝜟G is Gibbs free energy, R is the gas constant, T the temperature and K_d the dissociation constant:

\Delta G = -RTln(K_{d})

What we need is 𝜟G to be able to calculate the dissociation constant. To get 𝜟G the maximum force (284.2160 kJ/(mol*nm)) was first converted into pN. The resulting force was 471.9628 pN. 𝜟G was then calculated by extrapolating data from a study where they plotted the maximum force against their experimental 𝜟G values (12). Calculation of K_d using the equation above yielded a value of 4.9321 µM.

Thus we are able to compare the dissociation constant from our molecular dynamics simulations to the value of K_M (~20.7842 µM) we estimated in our Bayesian inference model using our experimental data. Under the equilibrium approximation where substrate and enzyme complex are in an instantaneous equilibrium K_M and K_d are equal. From our experiments and simulations we indeed see that the estimated K_M and K_d are similar considering the fact that the dissociation constant has been calculated from an energy. This further shows that the molecular dynamics simulation is accurate, validating the result that CsADH2946 is specific to crocetin dialdehyde.

Experimental Estimation of K_M for CsADH2946

Since the kinetic parameters for CsADH2946 have never been determined, we wanted to characterize the Michaelis-Menten kinetics of the reaction and estimate K_M of our CsADH2946 using the activity measurement results. The substrate crocetin dialdehyde has low water solubility and will therefore partially precipitate when coming in contact with water. Quantitative kinetic characterization of such a system is especially complicated since the substrate concentration at the beginning of the experiment, S₀, is not known. We have therefore developed a probabilistic method that is capable of estimating Michaelis-Menten model parameters as well as the error associated with them for such systems and we demonstrate its use on our enzyme CsADH2946. The MATLAB-script we created can be downloaded here.

Enzymatic reactions are often modeled with Michaelis-Menten kinetics. In this model, the reaction rate is described by the differential equation:

(1)

where [C] and [S] are the concentrations of product and substrate, respectively. V_max is the maximum production rate of the system at saturated substrate concentration and K_M is the substrate concentration when the reaction rate is half of V_max. Reaction rate is not directly measured in the assay, so we used the experimental product concentrations Ĉ(t) to estimate the Michaelis-Menten parameters. This was done by integrating equation 1 numerically and fitting the integrated C(K_M,V_max,S₀,t) to the experimentally measured product concentrations, Ĉ(t).

The product concentrations, Ĉ(t), were estimated at every timepoint from the experimentally measured absorbance spectra. Fitting was then performed by minimizing the sum of the squared differences (SSD) between the model prediction, C (from integration) and the experimentally measured progress curve, Ĉ(t). These fitting methods are good at finding the parameters that describe the data best but give no information about the uncertainty in the parameters described by the data. Use of methods minimizing the SSD is further complicated when the substrate concentration at the beginning of the experiment, S₀, is not known, introducing further uncertainty in the data and demonstrating the need for good error estimation of the model parameters.

We developed a probabilistic method for estimating the kinetic parameters and the uncertainty in them. The method computes and integrates the probability density P(K_M,V_max,S₀,σ) as a function of all model parameters. We set the notation θ = (K_M,V_max,S₀,σ), where σ is the model noise. In this method we used Bayesian statistical inference to estimate the parameters from the experimental data. In short that means calculating P(θ|Ĉ); the probability of the model parameters, given the experimental data. We estimated this distribution from the experimental data and the mean and variance of each parameter could be calculated as:

(2)

The posterior probability P(θ|Ĉ) was calculated using Bayes theorem:

(3)

where the probability of the data P(Ĉ) is a normalization constant, the prior P(θ) describes your expectations of the model parameters before observing any data and the likelihood P(Ĉ|θ) is the probability of the data given the parameters.

The likelihood is a function of the product concentration C(K_M,V_max,S₀,t) calculated from the Michaelis-Menten model. We modelled the deviation between the experimentally measured concentrations and model product concentration C as independent and identically normally distributed at every time-point. This gave us the likelihood

(4)

where σ is the standard deviation of the normal distribution describing the deviation between the data and the model. With a flat prior (a uniform prior where all model parameters are equally likely before observing any data) the posterior P(θ|Ĉ) can simply be written as

(5)

(6)

where Z is a normalization constant.

To estimate the mean and variance of our four model parameters we calculated the integrals in equation 2 and 3, i.e., numerically integrating the likelihood L(θ) over the four dimensions of the parameters. It is a problem that is very numerically complex if L(θ) is to be evaluated for all values of θ. Therefore, we implemented a version of importance sampling, where we only evaluate L(θ) were it is large enough to contribute to the integral. We start at the maximum likelihood set of parameter θ_max (found by SSD fitting) where L(θ) is maximized and take discrete steps in θ until L(θ) is essentially 0. Therefore we do not have to compute L(θ) for all values of θ, but only at the values that actually contribute to the integral. A visual of this is displayed in figure 6. We saved the values of L(θ) of all these points in four dimensional parameter space and the marginalized distribution for the individual parameters could then be calculated. For example for K_M.

(7)

The mean and error estimate of K_M and the other parameters was then obtained by calculating the mean and standard deviation of the marginalized distribution. A plot of the marginalized distribution for each parameter can be seen in Figure 7.

Figure 6. Illustration of importance sampling of L(θ). Only samples of L(θ) that are large enough to contribute to the integral are evaluated (the blue area). Therefore, the starting value is θ_max where L(θ) is maximized and discretes steps are taken until the values of L(θ) are close to 0 and does not contribute to the integral.

Figure 7. Plot of the marginalized distribution for each parameter estimated. A. V_max B. K_M C. S₀ D. Model noise.

The Resulting Estimated Kinetic Parameters

So, the kinetic parameters for CsADH2946 were estimated using this probabilistic method where the experimental data was fitted to the model. As you can see in Figure 8, we could compute a curve (blue) with the estimated parameters that fit the experimental data. This means that our parameters are good approximations that can be used to describe the actual data. We obtained K_M = 20.7842 µM ± 3.5264 with standard diviation.

Figure 8. Plot of experimental product concentration Ĉ against time (red) and a modelled curve using estimated Michaelis-Menten kinetic parameters to best fit the experimental data (blue). The values of the estimated kinetic parameters obtained by the algorithm are also displayed in the plot. K_M = 20.7842 µM ± 3.5264 with standard diviation.

To evaluate our estimated parameters, we first compared the values obtained by our algorithm with values of the parameters calculated with maximum likelihood. These were similar, which shows that our values are reasonable. To further analyse and validate our algorithm, we ran our algorithm on a simulated a progress curve based on the estimated parameters but with added normally distributed noise. The fitting of the simulated noisy data using our algorithm should return similar parameter values as the initial experimental data generated. As can be seen in figure 9, the parameters from the fitting of the simulated noisy data did return similar parameter values. We also tried to compare our results with parameters estimated from fitting the experimental data in DynaFit. However, DynaFit was unable to return a good model fit, possibly because of our unknown substrate concentration in the beginning of the experiment.

Figure 9. Plot of simulated data based on the estimated parameters with added normally distributed noise (red) and a modelled curve using our algorithm to best fit the simulated noisy data (blue). The parameters from the fitting of the simulated noisy data returns similar parameter values as the parameters returned when fitting the experimental data.

The K_M estimated for CsADH2946 is low, which indicates a high affinity towards the substrate crocin dialdehyde. This is also agrees with the pulling simulation, where the enzyme showed specificity towards this substrate. We compared the results with the binding of acetaldehyde and validated that CsADH2946 has specificity against crocetin dialdehyde since a higher force was required to pull this substrate from the active site. These results greatly support our choice in selecting CsADH2946 as an enzyme candidate in our zeaxanthin-crocin pathway.

References

(1) Moraga AR, Nohales PF, Pérez JAF, Gómez-Gómez L. Glucosylation of the saffron apocarotenoid crocetin by a glucosyltransferase isolated from Crocus sativus stigmas. Planta. 2004 Oct 1;219(6):955–66.

(2) Biasini M, Bienert S, Waterhouse A, Arnold K, Studer G, Schmidt T, et al. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Research. 2014 Jul 1;42(W1):W252–8.

(3) Kiefer F, Arnold K, Künzli M, Bordoli L, Schwede T. The SWISS-MODEL Repository and associated resources. Nucleic Acids Research. 2009 Jan 1;37(suppl_1):D387–92.

(4) Arnold K, Bordoli L, Kopp J, Schwede T. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics. 2006 Jan 15;22(2):195–201.

(5) Expasy.org [homepage on the Internet]. Basel: Swiss Institute of Bioinformatics; [cited 30 September 2017] Available from: https://swissmodel.expasy.org/docs/help

(6) Vt.edu [homepage on the Internet]. Virginia: Department of Biochemistry; 2015 [cited 1 October 2017]. Available from: http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin/gmx-tutorials/lysozyme/

(7) Search BioNumbers - The Database of Useful Biological Numbers. online: http://bionumbers.hms.harvard.edu/search.aspx?log=y&task=searchbytrmorg&trm=Salt&org=. Accessed October 30, 2017.

(8) Lang BS, Gorren ACF, Oberdorfer G, Wenzl MV, Furdui CM, Poole LB, et al. Vascular Bioactivation of Nitroglycerin by Aldehyde Dehydrogenase-2. J Biol Chem. 2012 Nov 2;287(45):38124–34.

(9) Grosdidier A, Zoete V, Michielin O. SwissDock, a protein-small molecule docking web service based on EADock DSS. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W270-277.

(10) Grosdidier A, Zoete V, Michielin O. Fast docking using the CHARMM force field with EADock DSS. J Comput Chem. 2011 Jul 30;32(10):2149–59.

(11) Umbrella Sampling [Internet]. [cited 2017 Oct 29]. Available from: http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin/gmx-tutorials/umbrella/index.html

(12) Tam NM, Nguyen MT, Ngo ST. Evaluation of the absolute affinity of neuraminidase inhibitor using steered molecular dynamics simulations. Journal of Molecular Graphics and Modelling. 2017 Oct 1;77(Supplement C):137–42.

Further Reading

/1/ http://www.gromacs.org/

/2/ https://swissmodel.expasy.org/