Revision as of 00:27, 1 November 2017

Software Validation

From AiGEM to the bench and back

Deep learning is an extremely powerful method for representation learning. With the AiGEM (Artificial intelligence for Genetic Evolution Mimicking) suite, we set out to harness this power in the context of proteins. By capturing the complex relation between protein sequence and protein function, we intend to shift the directed evolution process in silico. Our software tool GAIA (Genetic Artificially Intelligent Algorithm) is the evolving component, interfaced with the deep neural network DeeProtein. To validate both parts of the AiGEM suite, we first demonstrate that the DeeProtein classification score of in silico evolved \(\beta\)-lactamases correlates with the minimum inhibitory concentration of carbenicillin. Second, we reprogram a \(\beta\)-glucuronidase into a galactosidase in silico and assess the enzyme kinetics in the wet lab.

Evolution of \(\beta\)-Lactamases

Motivation

The deep neural network DeeProtein represents the heart of our AiGEM software suite. DeeProtein was trained on about 8 million sequences of the UniProt database to grasp the complex sequence-to-function relation in proteins. It performs multi-label classification of a sequence into 886 classes of gene ontology (GO) terms of the molecular function GO graph. As gene ontology terms are labels for protein functionality, we hypothesize that it is possible to assess protein activity based on the representation of the sequence-to-function relation learned by DeeProtein. To back this claim, we set out to comprehensively validate the DeeProtein classification score. We applied GAIA, interfaced with a DeeProtein variant specifically trained on \(\beta\)-lactamases, to predict a set of \(\beta\)-lactamase sequences matching the following criterion: the set should contain a broad range of variants with higher and lower DeeProtein classification scores compared to the wildtype.
As a measure of enzyme activity, we determined the minimum inhibitory concentration (MIC) of carbenicillin for each candidate in the set. We demonstrate a correlation between the MIC of carbenicillin and the average DeeProtein classification score of the screened candidates. Further, we improve the performance of our deep neural network by incorporating the generated wet-lab data into our training process.

Experimental Design - Software

We ran GAIA separately for each mutative window for up to 1000 generations in a single mutation mode: The rate of mutation was limited to one amino acid substitution per generation with one initial mutation. Additionally, we ran GAIA in double mutation mode over the combined frames for the same number of generations.
For each generation, the top five suggested candidates were saved and added to the pool of suggested sequences. Subsequently, we selected a subset of the proposed sequence pool for wet lab validation under the premise of covering a broad activity spectrum. Thus, we selected variants scoring higher than the wildtype and mutants scoring lower than the wildtype.
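
The single mutation mode described above can be pictured as a simple loop. The following is only a minimal sketch and not GAIA's actual implementation: the names `toy_score`, `mutate_once` and `evolve` are hypothetical, the scoring function is a runnable stand-in for the DeeProtein classification score, and the rule that the best candidate of a generation seeds the next one is an assumption that the text does not state.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for the DeeProtein classification score."""
    # Fraction of hydrophobic residues; chosen only to make the sketch runnable.
    return sum(seq.count(aa) for aa in "AILMFVW") / len(seq)


def mutate_once(seq, window, rng):
    """Return a copy of seq with one substitution inside window (0-based, end exclusive)."""
    start, end = window
    pos = rng.randrange(start, end)
    choices = [aa for aa in AMINO_ACIDS if aa != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def evolve(wildtype, window, score, generations=1000, offspring=20, keep=5, seed=0):
    """One substitution per generation; the top `keep` candidates are saved each time."""
    rng = random.Random(seed)
    parent = mutate_once(wildtype, window, rng)  # one initial mutation
    pool = {}
    for _ in range(generations):
        candidates = {mutate_once(parent, window, rng) for _ in range(offspring)}
        # Sort by descending score, breaking ties alphabetically for reproducibility.
        ranked = sorted(candidates, key=lambda s: (-score(s), s))
        for cand in ranked[:keep]:
            pool[cand] = score(cand)
        parent = ranked[0]  # assumption: the best candidate seeds the next generation
    return pool


# Arbitrary toy sequence and window, not a real beta-lactamase fragment.
toy_wildtype = "MKTAYIAKQRQISFVKSHFSRQ"
toy_pool = evolve(toy_wildtype, window=(5, 12), score=toy_score, generations=50)
print(len(toy_pool), "candidate sequences collected")
```

Restricting substitutions to a window keeps all residues outside it identical to the wildtype, which mirrors the mutative windows used here; the wet-lab subset would then be chosen from the collected pool.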

Table 1: Selected single and double mutants proposed by GAIA. Of the pool of proposed \(\beta\)-lactamase sequences we selected 9 single and 16 double mutations for wet-lab validation.

Mutation	Fragment	Classification_score
G236L	F	0.56584185
G216C_P217F	E	0.56683183
L218D	E	0.56554008
L167G_P164L	D	0.56678349
P164G	D	0.56673431
P164H	D	0.56748658
D129L_N130I	C	0.56734717
D129L	C	0.56697899
E102F	B	0.56683093
Y103D	B	0.56666845
K71W	A	0.56605697
M67G	A	0.56762165
M67G_F70G	A	0.56777865
K71W_G236L	A+F	0.56503928
M67G_P164G	A+D	0.5675621
M67G_P164H	A+D	0.56818455
E102F_P164H	B+D	0.56749099
E102F_P164G	B+D	0.56676418
E102F_G236L	B+F	0.5658384
E102F_M67G	B+A	0.56763196
D129L_P164G	C+D	0.56688529
D129L_M67G	C+A	0.56781197
D129L_L218L	C+E	0.56570756
L218D_P164G	E+D	0.56537378
L218D_K71W	E+A	0.56467879
WT	None	0.56685823

Results - Software

We calculated a MIC-score for all candidates from the measured data points by first thresholding the measured relative OD600 values at 0.8. As the OD600 was measured in a plate reader, it can only be seen as a relative measure of growth. Thus, we consider any OD600 below 0.8 as inhibited (0) and higher ODs as growing (1). From the thresholded values, we then calculated the MIC-score as the number of consecutive observed data points until the first carbenicillin concentration at which the OD fell below 0.8. Next, the DeeProtein scores were averaged within the respective MIC-score intervals, and the mean DeeProtein score was plotted against the MIC of carbenicillin.
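
The thresholding and binning steps can be written down in a few lines. This is a minimal sketch under stated assumptions: the function names and the variant names in the usage example are hypothetical, the OD600 readings are assumed to be ordered by increasing carbenicillin concentration, and a reading of exactly 0.8 is treated as growing.

```python
from collections import defaultdict

OD_THRESHOLD = 0.8  # relative OD600 threshold from the text


def mic_score(relative_od600):
    """Count consecutive growing wells, ordered by increasing carbenicillin concentration."""
    score = 0
    for od in relative_od600:
        if od < OD_THRESHOLD:  # inhibited (0): stop at the first such concentration
            break
        score += 1  # growing (1)
    return score


def mean_score_per_bin(mic_scores, classifier_scores):
    """Average the classification scores of all variants that share a MIC-score."""
    bins = defaultdict(list)
    for name, mic in mic_scores.items():
        bins[mic].append(classifier_scores[name])
    return {mic: sum(vals) / len(vals) for mic, vals in sorted(bins.items())}


# Made-up readings for illustration only, not measured data.
example_ods = {
    "variant_a": [1.0, 0.95, 0.9, 0.4, 0.2],
    "variant_b": [1.0, 0.3, 0.9, 0.1, 0.1],
}
example_mic = {name: mic_score(ods) for name, ods in example_ods.items()}
print(example_mic)
```

Note that growth reappearing after the first inhibited well (as in `variant_b`) does not raise the score, because only the consecutive run of growing wells is counted.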

Figure 2: The DeeProtein classification score for screened \(\beta\)-lactamase variants correlates with the MIC of Carbenicillin.

The average DeeProtein classification scores assigned to samples in the MIC-score bins are depicted as black dots. The red line is the fitted linear model. Samples assigned a high classification score tend to sustain higher carbenicillin concentrations, whereas a low classification score is assigned to variants with a low MIC.
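
How the linear model in the figure was fitted is not stated. Assuming an ordinary least-squares fit of the mean classification score (y) against the MIC-score (x), the line and its correlation coefficient could be computed as in the sketch below; `fit_line` and `pearson_r` are hypothetical helper names and the numbers in the usage example are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up bin averages for illustration only, not the measured values.
toy_mic_bins = [0, 1, 2, 3]
toy_mean_scores = [0.10, 0.30, 0.45, 0.70]
print(fit_line(toy_mic_bins, toy_mean_scores), pearson_r(toy_mic_bins, toy_mean_scores))
```

A positive slope together with a correlation coefficient close to 1 would correspond to the trend described in the caption.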

Reprogramming of \(\beta\)-Glucuronidase

Motivation

The main objective of the AiGEM software suite is to improve directed evolution experiments. As protein space is tremendous in size and impossible to explore with brute-force or random-walk methods, a directed evolution tool needs to reduce the combinatorial complexity of the protein space. Using DeeProtein, we learned the complex relation between protein sequence and protein function, and thus the properties of the thin manifold of functional protein sequences. To harness this learned representation in a generative approach, we developed our directed evolution tool GAIA (Genetically Artificially Intelligent Algorithm). GAIA deploys the pre-trained DeeProtein models as a scoring function to assess the class probability of a given protein function during the evolution process. We hypothesize that maximizing the class probability of the goal protein function by introducing amino acid substitutions into the entry sequence shifts its function towards the goal term.
To demonstrate the capabilities of GAIA, we set out to reprogram the E. coli \(\beta\)-glucuronidase (GUS) towards \(\beta\)-galactosidase (GAL) activity in silico. Sequences were predicted by GAIA, and the enzyme kinetics were subsequently assessed in the wet lab.

Experimental Setup - Software

We prepared our experiments by performing equilibration molecular dynamics simulations on the wildtype GUS and a known variant introduced by Matsumura et al. [matsumura2001vitro]. Based on these simulations, we determined three mutative windows on the GUS sequence. Limiting mutations to certain sequence windows was necessary to facilitate cloning of the mutants in the wet lab. Subsequently, the GUS with its defined mutative regions was submitted to GAIA with the objective of maximizing the \(\beta\)-galactosidase-activity GO term (GO:0004565). GAIA was run for 1000 generations, and the top five candidates of every generation were added to the candidate library. From this library, we picked five candidates for testing in the wet lab.

Equilibration Molecular Dynamics Simulations

To assess the effects of the introduced mutations on the \(\beta\)-glucuronidase structure, we performed equilibration molecular dynamics (MD) simulations on the wildtype and the mutated variant introduced by Matsumura et al. [matsumura2001vitro]. All simulations were performed in OpenMM [eastman2010openmm] with the amber99 force field [pearlman1995amber]. The B-chain of the \(\beta\)-glucuronidase (PDB code: 3lpg) served as the wildtype and as the basis for in silico mutagenesis. First, all selenomethionines in the sequence were corrected to methionines and the ligand was excluded. Subsequently, the mutations D508G, T509A, S557P, N566S and K568T were introduced in PyMOL [delano2002pymol] to obtain the described GUS variant [matsumura2001vitro].
The apo structure was protonated at pH 6.5 using the Modeller class of the OpenMM library; then the protein was solvated in a cubic box of 10 nm with the TIP3P water model. The system was then neutralized with sodium and chloride ions.

Figure 2: Introduced mutations have little effect on the protein folding in equilibration MD

Superimposed structures for the wildtype \(\beta\)-glucuronidase (A) (PDB code: 3lpg) and the mutated version (B) introduced by Matsumura et al. [matsumura2001vitro]. In both depictions, the grey structure is the first frame of the equilibration trajectory and the blue structure is the protein after equilibration MD. The catalytic residues are colored yellow, and the symbolic ligand (not included in the MD simulations) is colored green. The Matsumura mutations are colored magenta. The grey and blue structures differ only slightly; thus, the introduced mutations did not affect the overall protein folding.

Subsequently, the system was heated from 0 to 300 K in 100 K intervals. Each interval was NPT-simulated for 10,000 steps with a step size of 2 fs. After the heating was completed, the system was NPT-simulated for another 10,000,000 steps (20 ns) for equilibration. The last frame of the resulting trajectory was then aligned to the input structure (figure [/]) and the global root-mean-square deviation of atomic positions (RMSD) was calculated (table 2).
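
As a quick check of the simulated times implied by these settings, the step counts can be converted into physical time. The sketch below assumes that the three heating stages target 100 K, 200 K and 300 K, which is one reading of "100 K intervals"; the variable and function names are hypothetical.

```python
STEP_FS = 2  # integration step size in femtoseconds, from the text


def simulated_time_fs(n_steps, step_fs=STEP_FS):
    """Physical time covered by n_steps integration steps, in femtoseconds."""
    return n_steps * step_fs


heating_targets_k = [100, 200, 300]  # assumption: one stage per 100 K interval
heating_fs = [simulated_time_fs(10_000) for _ in heating_targets_k]
equilibration_fs = simulated_time_fs(10_000_000)

print(sum(heating_fs) / 1_000, "ps of heating in total")      # 60.0 ps
print(equilibration_fs / 1_000_000, "ns of equilibration")    # 20.0 ns
```

Each heating stage therefore covers 20 ps, and the 10,000,000 equilibration steps correspond to the 20 ns quoted in the text.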

Table 2: Calculated RMSD values after equilibration for the wildtype \(\beta\)-glucuronidase and the Matsumura variant.

Structure	RMSD (Å)
Wildtype (3lpg)	1.412
Matsumura mutant	1.728

As the RMSD values are comparable, the mutations suggested by Matsumura et al. [matsumura2001vitro] do not severely disrupt the protein folding. Thus, we consider the regions where these mutations were introduced to be mutable.
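
For reference, the global RMSD between two structures with \(N\) matched atoms at positions \(x_i\) and \(y_i\) is \(\sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert x_i - y_i \rVert^2}\). The sketch below computes this quantity for coordinates that are assumed to be superimposed already; it does not perform the alignment itself, the function name `rmsd` is hypothetical, and the coordinates in the usage example are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both structures need the same non-zero number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two made-up atoms; the second one is displaced by 2 along z.
first_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
last_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(first_frame, last_frame))
```

The result carries the unit of the input coordinates, so coordinates given in Å yield an RMSD in Å as in Table 2.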

Definition of Mutation Sites

To facilitate the cloning process in the wet lab and accelerate production, we limited the mutagenesis of the glucuronidase sequence to three patches (Figure [/]). The defined patches are located around the active site, with fragments A and C partly forming the enzymatic pocket. All positions mutated by Matsumura are contained in the defined patches.

Figure [/]: Patches defined for mutagenesis on the \(\beta\)-glucuronidase sequence.

Displayed is the wildtype \(\beta\)-glucuronidase; the catalytic residues are colored yellow. The sequence patches defined for mutagenesis are colored as follows: Fragment A (351-371) in green, Fragment B (506-512) in blue and Fragment C (548-568) in magenta. All patches are located close to the active site, where fragments A and C partly form the pocket.
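
The statement that all Matsumura positions fall inside the defined patches can be checked directly from the numbers given above. The sketch assumes that the mutation names and the patch boundaries use the same residue numbering and that the boundaries are inclusive; the helper names are hypothetical.

```python
import re

# Inclusive residue ranges of the three patches, as listed in the figure caption.
PATCHES = {"A": (351, 371), "B": (506, 512), "C": (548, 568)}
MATSUMURA_MUTATIONS = ["D508G", "T509A", "S557P", "N566S", "K568T"]


def position(mutation):
    """Extract the residue number from a mutation name such as 'D508G'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unexpected mutation format: {mutation}")
    return int(match.group(2))


def patch_of(mutation, patches=PATCHES):
    """Return the name of the patch containing the mutated position, or None."""
    pos = position(mutation)
    for name, (start, end) in patches.items():
        if start <= pos <= end:
            return name
    return None


for mutation in MATSUMURA_MUTATIONS:
    print(mutation, patch_of(mutation))
```

Under these assumptions, D508G and T509A fall into Fragment B and S557P, N566S and K568T into Fragment C, while Fragment A contains none of the Matsumura positions.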

Results

Hi I bims, 1 result.

@@ Line 14: / Line 14: @@
 <h1 id="bLac">Evolution of \(\beta\)-Lactamases</h1>
 <h3>Motivation</h3>
-The deep neural network DeeProtein is the heart of our AiGEM software suite. DeeProtein was trained on ~8 million sequences of the uniprot database to grasp the complex sequence to function relation in proteins. It is able to categorize a sequence multimodal into 886 classes of gene-ontology (GO) terms of the molecular function GO graph. As gene ontology terms are labels for protein functionality, we hypothesize that is is possible to assert protein activity based on the learned representation of the sequence to function relation in DeeProtein. To back this claim we set out for a comprehensive validation of the DeeProtein classification score. We applied GAIA, interfaced with a DeeProtein variant specifially trained on \(\beta\)-lactamases, to predict a set of \(\beta\)-lactamase sequences matching the following criteria: The set should contain a broad range of variants with higher and lower DeeProtein classification scores compared to the wildtype.<br>
+The deep neural network DeeProtein represents the heart of our AiGEM software suite. DeeProtein was trained on about 8 million sequences of the UniProt database to grasp the complex sequence to function relation in proteins. It is able to categorize a sequence multimodal into 886 classes of gene-ontology (GO) terms of the molecular function GO graph. As gene ontology terms are labels for protein functionality, we hypothesize that it is possible to assert protein activity based on the learned representation of the sequence to function relation in DeeProtein. To back this claim we set out for a comprehensive validation of the DeeProtein classification score. We applied GAIA, interfaced with a DeeProtein variant specifically trained on \(\beta\)-lactamases, to predict a set of \(\beta\)-lactamase sequences matching the following criterion: The set should contain a broad range of variants with higher and lower DeeProtein classification scores compared to the wildtype.<br>
-As a measure for enzyme activity we assert the minimum inhibitory concentration (MIC) of carbenicillin for each candidate in the set. We demonstrate a correlation between the MIC of carbenicillin and the average DeeProtein classification score of the screened candidates. Further we improve the performance of out deep neural network by incorporating the generated wetlab data into our training process.
+As a measure for enzyme activity we state the minimal inhibitory concentration (MIC) of carbenicillin for each candidate in the set. We demonstrate a correlation between the MIC of carbenicillin and the average DeeProtein classification score of the screened candidates. Further we improve the performance of our deep neural network by incorporating the generated wet-lab data into our training process.
 }}}}
      {{Heidelberg/templateus/Contentsection|
 {{#tag:html|
 <h3>Experimental Design - Software</h3>
-We ran GAIA seperately for each mutative window for up to 1000 generations in a singl mutation mode: The mutationrate was limited to 1 amino acid substitution per generation with 1 initial mutation. Additionally we ran GAIA in double mutation mode over the combined frames for the same number of generations.<br>
+We ran GAIA separately for each mutative window for up to 1000 generations in a single mutation mode: The rate of mutation was limited to one amino acid substitution per generation with one initial mutation. Additionally, we ran GAIA in double mutation mode over the combined frames for the same number of generations.<br>
-    For each generation the top five suggested candidates were saved and added to the pool of suggested sequences. Subsequently we selected a subset of the proposed sequence pool for wet lab validation under the premise of covering a broad acitvity sepctrum. Thus we selected variants scoring higher than the wildtype and mutants scoring lower than the wildtype.
+For each generation, the top five suggested candidates were saved and added to the pool of suggested sequences. Subsequently, we selected a subset of the proposed sequence pool for wet lab validation under the premise of covering a broad activity spectrum. Thus, we selected variants scoring higher than the wildtype and mutants scoring lower than the wildtype.
              {{Heidelberg/templateus/Tablebox|
 Table 1: Selected single and double mutants proposed by GAIA. |
@@ Line 168: / Line 168: @@
 </tbody></table>
                  }}|
-Of the pool of proposed \(\beta\)-lactamase sequences we selected 9 single and 16 double mutations for wetlab validation.
+Of the pool of proposed \(\beta\)-lactamase sequences we selected 9 single and 16 double mutations for wet-lab validation.
 }}
 }}}}
@@ Line 174: / Line 174: @@
 {{#tag:html|
 <h3>Results - Software</h3>
-We calculated a MIC-score from the measured datapoints for all candidates, by first applying a threshold on the measured relative OD600s at 0.8. As the OD600 was measured in a platereader it can only be seen as relative meassure for the growth. Thus we consider any OD600 below 0.8 as inhibited (0) and higher ODs as growing (1). From the thresholded values we then calculated the MIC-score as the number of consecutive observed datapoints until the first Carbenicilline concentration where the OD fell below 0.8. Next the DeeProtein scores were avergaged among the respective MIC-score intervals and the mean DeeProtein score was plotted against the MIC of carbenicilline.
+We calculated a MIC-score from the measured data points for all candidates, by first applying a threshold on the measured relative OD600s at 0.8. As the OD600 was measured in a plate reader it can only be seen as relative measure of growth. Thus, we consider any OD600 below 0.8 as inhibited (0) and higher ODs as growing (1). From the thresholded values, we then calculated the MIC-score as the number of consecutive observed data points until the first carbenicillin concentration where the OD fell below 0.8. Next, the DeeProtein scores were averaged among the respective MIC-score intervals and the mean DeeProtein score was plotted against the MIC of carbenicillin.
        {{Heidelberg/templateus/Imagesection|
        https://static.igem.org/mediawiki/2017/a/a9/T--Heidelberg--2017_DeeProtein_ClassifierVSmic.svg |
        Figure 2: The DeeProtein classification score for screened \(\beta\)-lactamase variants correlates with the MIC of Carbenicillin. |
-       The avergage DeeProtein classification scores assigned to samples in the MIC-score bins are depicted as black dots. The red line is the fitted linear model. Samples assigned with a high classification score tend to sustain higher carbenicillin concentrations, whereas a low classification score is assigned to variants with a low MIC.
+       he average DeeProtein classification scores assigned to samples in the MIC-score bins are depicted as black dots. The red line is the fitted linear model. Samples assigned with a high classification score tend to sustain higher carbenicillin concentrations, whereas a low classification score is assigned to variants with a low MIC.
        }}
      }}}}
@@ Line 193: / Line 193: @@
 <h1 id="GUS2GAL">Reprogrammation of \(\beta\)-Glucuronidase</h1>
 <h3>Motivation</h3>
-    The main objective of the AiGEM software suite is the improvement of directed evolution experiments. As the protein space is tremendous in size and impossible to assert with brute force or random walk methods a directed evolution tool needs to reduce the combinatory complexity of the protein space. With <a href="https://2017.igem.org/Team:Heidelberg/Software/DeeProtein#Representation">DeeProtein</a> we learned the complex relation between protein sequence and protein function, thus the properties of the thin manifold of functional protein sequences. To harness this learned representation in a <a href="https://2017.igem.org/Team:Heidelberg/Software/GAIA#Generative">generative approach</a> we developed our directed evolution tool <a href="https://2017.igem.org/Team:Heidelberg/Software/GAIA">GAIA</a> (Genetically Artificially Intelligent Algorithm). GAIA deploys the pretrained DeeProtein models as scoring function, to assert the class probability of a certain protein function distribution during the evolution process. We hypothesize that by maximization of the class probability of the goal protein function through introduction of amino acid substitutions on the entry sequence its function gets shifted towards the goal term.<br>
+  The main objective of the AiGEM software suite is the improvement of directed evolution experiments. As the protein space is tremendous in size and impossible to assert with brute force or random walk methods, a directed evolution tool needs to reduce the combinatory complexity of the protein space. Using <a href="https://2017.igem.org/Team:Heidelberg/Software/DeeProtein#Representation">DeeProtein</a>, we learned the complex relation between protein sequence and protein function, thus the properties of the thin manifold of functional protein sequences. To harness this learned representation in a <a href="https://2017.igem.org/Team:Heidelberg/Software/GAIA#Generative">generative approach</a> we developed our directed evolution tool <a href="https://2017.igem.org/Team:Heidelberg/Software/GAIA">GAIA</a> (Genetically Artificially Intelligent Algorithm). GAIA deploys the pre-trained DeeProtein models as scoring function, to assert the class probability of a certain protein function distribution during the evolution process. We hypothesize that by maximization of the class probability of the goal protein function through introduction of amino acid substitutions on the entry sequence its function gets shifted towards the goal term.<br>
-To demonstrate the capabilities of GAIA we set out to reprogramm the <i>E. Coli</i> \(\beta\)-glucuronidase (GUS) towards \(\beta\)-galactosidase (GAL) activity <i>in silico</i>. Sequences were predicted by GAIA and subsequenlty the enzyme kinetics were asserted in the wet-lab.
+To demonstrate the capabilities of GAIA we set out to reprogram the E. Coli \(\beta\)-glucuronidase (GUS) towards \(\beta\)-galactosidase (GAL) activity <i>in silico</i>. Sequences were predicted by GAIA and subsequently the enzyme kinetics were asserted in the wet-lab.
      }}}}
 {{Heidelberg/templateus/Contentsection|
    {{#tag:html|
 <h3>Experimental Setup - Software</h3>
-We prepared out experiments by performing <a href="https://2017.igem.org/Team:Heidelberg/Validation/#MD">equilibration molecular dynamics</a> simulations on the wildtype and a known variant <x-ref>matsumura2001vitro</x-ref>. Based on equilibration molecular dynamics simulations of the wildtype GUS and the mutant introduced by Matsumura et al.<x-ref>matsumura2001vitro</x-ref>, we determined three mutative windows on the GUS sequence. The limitation of mutations to certain sequence windows was necessary to facilitate the cloning procedure of the mutants in the wetlab. Subsequently the GUS with its defined mutative regions was submitted to GAIA with the objective of maximization of the \(\beta\)-galactosidase-activity GO-term (GO:0004565). GAIA was run for 1000 generations and the top five candidates of every generation were added to the candiate library. Subsequently we picked 5 candidates from the library to be tested in the wetlab.
+We prepared our experiments by performing <a href="https://2017.igem.org/Team:Heidelberg/Validation/#MD">equilibration molecular dynamics</a> simulations on the wildtype and a known variant <x-ref>matsumura2001vitro</x-ref>. Based on equilibration molecular dynamics simulations of the wildtype GUS and the mutant introduced by Matsumura et al.<x-ref>matsumura2001vitro</x-ref>, we determined three mutative windows on the GUS sequence. The limitation of mutations to certain sequence windows was necessary to facilitate the cloning procedure of the mutants in the wet-lab. Subsequently, the GUS with its defined mutative regions was submitted to GAIA with the objective of maximization of the \(\beta\)-galactosidase-activity GO-term (GO:0004565). GAIA was run for 1000 generations and the top five candidates of every generation were added to the candidate library. Thus, we picked five candidates from the library to test them in the wet-lab.
          }}}}

Difference between revisions of "Team:Heidelberg/Validation"
