# Team:Heidelberg/Validation

Software Validation
From AiGEM to the bench and back
In this subproject, we evaluated the performance of AiGEM, our Artificial Intelligence for Genetic Evolution Mimicking software suite, in the wet lab. In particular, we wanted to investigate (i) the sensitivity of AiGEM in discriminating well-functioning from less-functioning protein mutants and (ii) its power to generate functionality de novo. To this end, we designed two simple but highly informative experiments. (i) Using GAIA, we first generated a set of ~30 $\beta$-lactamase single and double coding mutants and computed their corresponding protein label scores. In parallel, we evaluated the efficiency of our $\beta$-lactamase variants in the wet lab by determining the minimal inhibitory concentration (MIC) of the antibiotic. Despite the high degree of sequence similarity between the variants (~99 %), we observed a noticeable correlation between the enzymatic activity (MIC) measured in the wet lab and the protein label scores computed by AiGEM. Of note, some of the high-scoring $\beta$-lactamase variants clearly outperformed the wildtype $\beta$-lactamase in the tolerated antibiotic concentration, indicating improved catalytic activity. (ii) To demonstrate AiGEM's ability to generate functionality de novo, we then performed in silico evolution of novel $\beta$-galactosidases from a $\beta$-glucuronidase parent. We cloned and purified seven of the high-scoring, in silico evolved $\beta$-glucuronidase offspring, one of which showed $\beta$-galactosidase activity similar to that of the wildtype $\beta$-galactosidase in a quantitative PNPG colorimetric assay. We conclude that the AiGEM software suite has great potential for in silico directed evolution and can considerably accelerate the directed evolution of proteins towards desired functions.

# Evolution of $\beta$-Lactamases

## Motivation

The deep neural network DeeProtein is the heart of our AiGEM software suite. DeeProtein was trained on about 8 million sequences from the UniProt database to capture the complex sequence-to-function relation in proteins. It performs multi-label classification of a sequence into 886 classes, each a gene ontology (GO) term of the molecular function GO graph. As gene ontology terms are labels for protein functionality, we hypothesize that protein activity can be assessed based on the representation of the sequence-to-function relation learned by DeeProtein. To back this claim, we set out to comprehensively validate the DeeProtein classification score. We applied GAIA, interfaced with a DeeProtein variant specifically trained on $\beta$-lactamases, to predict a set of $\beta$-lactamase sequences matching the following criterion: the set should contain a broad range of variants with higher and lower DeeProtein classification scores than the wildtype.
As a measure of enzyme activity, we determined the minimal inhibitory concentration (MIC) of carbenicillin for each candidate in the set. We demonstrate a correlation between the MIC of carbenicillin and the average DeeProtein classification score of the screened candidates. Furthermore, we improve the performance of our deep neural network by incorporating the generated wet-lab data into our training process.

## Experimental Design - Software

### Definition of Mutation Sites

To facilitate the cloning process in the wet lab, and because of our limited resources, we defined patches on the $\beta$-lactamase sequence in which to introduce mutations. The patches were defined after manual inspection of the structure in PyMOL, with the aim of covering a broad spectrum of residues. Thus, we selected patches forming the pocket as well as central structural elements of the $\beta$-lactamase enzyme.
Figure 1: Defined mutation sites on the $\beta$-lactamase sequence
Displayed is the wildtype $\beta$-lactamase (PDB code: 1M40); the catalytic residues are colored in yellow. The sequence patches defined for mutagenesis are colored as follows: green, Fragment A (69-73); light blue, Fragment B (104-107); magenta, Fragment C (129-133); light green, Fragment D (166-171); purple, Fragment E (215-220); blue, Fragment F (237-245). All patches are located close to the active site, where fragments A, C, E and F partly form the pocket.

### GAIA preferences and Variant selection

We ran GAIA separately for each mutative window for up to 1000 generations in a single mutation mode: The rate of mutation was limited to one amino acid substitution per generation with one initial mutation. Additionally, we ran GAIA in double mutation mode over the combined frames for the same number of generations.
For each generation, the top five suggested candidates were saved and added to the pool of suggested sequences. Subsequently, we selected a subset of the proposed sequence pool for wet lab validation under the premise of covering a broad activity spectrum. Thus, we selected variants scoring higher than the wildtype and mutants scoring lower than the wildtype. A table displaying the scores for each mutation is depicted at the end of the page.
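
The selection scheme described above (one substitution per generation inside a mutative window, top five candidates of each generation added to a pool) can be illustrated with a toy loop. This is a minimal sketch and not GAIA itself: `toy_score` is a hypothetical stand-in for the DeeProtein classification score, the parent sequence and the window are made up, and the number of offspring per generation is an assumption.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical placeholder for the DeeProtein score (NOT the real model):
    the fraction of hydrophobic residues in the sequence."""
    return sum(seq.count(aa) for aa in "AVILMFWY") / len(seq)


def mutate_in_window(seq, window, rng):
    """Return a copy of seq with exactly one substitution inside window.

    window is a (start, end) pair of 0-based indices, end exclusive.
    """
    start, end = window
    pos = rng.randrange(start, end)
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def evolve(parent, window, generations, offspring=20, keep=5, seed=0):
    """Collect the top `keep` candidates of every generation into a pool."""
    rng = random.Random(seed)
    pool = []
    current = parent
    for _ in range(generations):
        children = {mutate_in_window(current, window, rng)
                    for _ in range(offspring)}
        # Sort by descending score; ties are broken alphabetically so the
        # result does not depend on set iteration order.
        ranked = sorted(children, key=lambda s: (-toy_score(s), s))
        pool.extend(ranked[:keep])
        current = ranked[0]  # the best child seeds the next generation
    return pool


parent = "MSTKDEGQNRSTKDEGQNRS"  # made-up sequence, not a real beta-lactamase
window = (5, 10)                 # made-up mutative window
pool = evolve(parent, window, generations=10)
print(len(pool), pool[-1])
```

Because every substitution is drawn inside the window, all pooled sequences are identical to the parent outside of it, which mirrors the reason the windows were introduced: mutations stay confined to regions that are easy to clone.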

### Experimental Design - Wet-lab

#### Construct Design

To create a plasmid that is compatible with other BioBrick parts, we cloned the $\beta$-lactamase cassette into the pSB1C3 backbone. The ampicillin resistance gene was obtained from pSB1A3 and introduced into the BioBrick backbone via Golden Gate cloning. The BsaI restriction site located inside the $\beta$-lactamase coding sequence had previously been removed to facilitate fast and easy cloning. The second antibiotic resistance cassette, derived from the pSB1C3 backbone, is important because it enabled us to select for the correct constructs even when the $\beta$-lactamase gene was inactive or its activity heavily reduced. The mutations we cloned were located within six different windows. The variants were cloned by Golden Gate assembly: the different parts of the plasmid were PCR-amplified with the mutations placed in primer extensions, and the mutated plasmid was subsequently assembled. In this way, we generated single, double and triple mutants of the gene.

#### Determination of the Minimal Inhibitory Concentration

The activity conferred by an antibiotic resistance gene can be determined via the minimal inhibitory concentration (MIC). The MIC is the lowest concentration of an inhibiting agent at which the growth of a microorganism, in our case E. coli, is inhibited. At the MIC50, for instance, the growth of a culture is reduced by fifty percent. Typical methods to determine MICs are broth or agar dilutions. The principle is as simple as it is effective: the microorganism is cultured in liquid medium or on plates with different concentrations of the inhibiting agent [JM10291]. With this method, the concentrations that are tolerated by the organism and those that inhibit growth can be determined. However, such an assay gives no information about the kind of toxicity, e.g. whether the chemical is cytotoxic or cytostatic.
In our case, we want to evaluate differences in the activity of a well-characterized protein. Consequently, such a dilution method is ideal. It is easy to set up, very robust, and gives exactly the information needed for the characterization.
To start the assay, colonies were picked into LB medium with 100 µg/ml chloramphenicol but without carbenicillin and grown to stationary phase for 20 h. Then, deep-well plates with 500 µl LB broth per well were prepared. The medium was supplemented with 100 µg/ml chloramphenicol and, for each mutant, eight wells with different carbenicillin concentrations ranging from 10 to 1280 µg/ml were prepared. Each well was inoculated with 2 µl of the preculture. The new cultures were incubated for another 8 h. Finally, the OD600 of each well was measured. This measurement allowed us to assess the growth ability of each variant.
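
The text gives only the end points of the carbenicillin range (10 and 1280 µg/ml) and the number of wells (eight). A two-fold series fits those numbers exactly (10 × 2^7 = 1280), so the sketch below assumes one; the intermediate concentrations are therefore an assumption, not a reported value. The function name is illustrative.

```python
def twofold_series(lowest, n_wells):
    """Concentrations of a two-fold dilution series, lowest first."""
    return [lowest * 2 ** i for i in range(n_wells)]


# Assumed layout: eight wells per mutant, starting at 10 µg/ml.
carbenicillin_ug_per_ml = twofold_series(10, 8)
print(carbenicillin_ug_per_ml)  # [10, 20, 40, 80, 160, 320, 640, 1280]

# Dilution of the preculture on inoculation: 2 µl into 500 µl of medium.
inoculum_ul, medium_ul = 2, 500
dilution_factor = (inoculum_ul + medium_ul) / inoculum_ul
print(dilution_factor)  # 251.0
```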

## Results

### Results - Wet-lab

The MICs of 37 cloned variants, a positive control carrying the wildtype $\beta$-lactamase and a negative control without an ampicillin resistance were tested within a carbenicillin concentration range of 10 µg/ml to 1280 µg/ml. For the OD600 measurements, a value of 0.08 was set as the cutoff between bacteria that were resistant to a specific concentration and those that were not. The negative control did not grow under any condition. Thus, the lowest tested concentration of 10 µg/ml was already above the MIC of E. coli. Thirteen of the 37 tested variants did not show any significant $\beta$-lactamase activity and did not grow at any carbenicillin concentration. As expected, the majority, eleven of thirteen, were double and triple mutants, whereas only two single mutations led to the loss of enzyme activity. Interestingly, these two single mutations, D129L and K71W, appear in 10 of the 13 catalytically dead mutants, which underlines their importance for enzyme activity. Furthermore, it is noteworthy that one double mutant, D129L/L218D, has a MIC above 640 µg/ml. This suggests that L218D can compensate for the negative effect of D129L.
Twelve variants had a MIC within the range tested in our assay. This shows that it is possible to gradually modulate enzyme activity with the help of our software. Another 11 mutants, as well as the wildtype $\beta$-lactamase, could grow under all conditions. Interestingly, only half of the variants with high $\beta$-lactamase activity (6 of 12) were single mutants, whereas the remaining five variants had two or three mutations. Within this set, the mutations M67G and P164G/H are very prominent. These exchanges do not have drastic effects on enzyme activity. However, these mutations also appear in variants with weak or no activity.
As a result, we decided to perform a second test for these candidates at higher antibiotic concentrations. The three highest carbenicillin concentrations of the first data set were included and the range was extended to a maximum concentration of 19.2 mg/ml. While several candidates turned out to have weaker activity than the wildtype protein, five of them showed improved properties. The two best candidates could even grow at 19.2 mg/ml. These variants contained one point mutation each, E102F and M67G. These most beneficial mutations also appear in many of our better candidates, which underlines their functional relevance.
Figure 2: Growth Behaviour of $\beta$-Lactamase Mutants under Different Carbenicillin Concentrations
The plot shows the OD600 of the different $\beta$-lactamase mutants at different carbenicillin concentrations. The pool of mutants is very heterogeneous. Some proteins are not active at all, some show different degrees of activity, and 11 mutants, as well as the wildtype enzyme, can grow under all conditions.
Figure 3: Growth Behaviour of $\beta$-Lactamase Mutants under Elevated Carbenicillin Concentrations
The plot depicts the $OD_{600}$ of the $\beta$-lactamase mutants that survived at 1280 µg/ml in the first assay, under more stringent conditions. The majority of these $\beta$-lactamases have a slightly weaker activity than the wildtype, but we identified five variants with a higher enzymatic activity.

### Results - Software

We calculated a MIC-score from the measured data points for all candidates by first applying a threshold of 0.08 to the measured relative $OD_{600}$. As the $OD_{600}$ was measured in a plate reader, it can only be seen as a relative measure of growth. Thus, we consider any $OD_{600}$ below 0.08 as inhibited (0) and higher ODs as growing (1). From the thresholded values, we then calculated the MIC-score as the number of consecutive data points observed before the first carbenicillin concentration at which the $OD_{600}$ fell below 0.08. Next, the DeeProtein scores were averaged within the respective MIC-score intervals, and the mean DeeProtein score was plotted against the MIC of carbenicillin for all double mutants. The average DeeProtein score correlates with the MIC with a correlation coefficient of 0.6, as displayed in Fig. 4.
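
The thresholding, MIC-score and binning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the OD600 readings are made up, the function names are ours, wells are assumed to be ordered from the lowest to the highest carbenicillin concentration, and the Pearson coefficient is an assumption about which correlation coefficient was meant.

```python
import math

OD_CUTOFF = 0.08  # threshold stated in the text


def mic_score(od_values, cutoff=OD_CUTOFF):
    """Number of consecutive wells with growth, counted from the lowest
    carbenicillin concentration until the first OD600 below the cutoff."""
    score = 0
    for od in od_values:
        if od < cutoff:
            break
        score += 1
    return score


def mean_score_per_bin(mic_scores, model_scores):
    """Average the model scores of all variants sharing a MIC-score."""
    bins = {}
    for mic, score in zip(mic_scores, model_scores):
        bins.setdefault(mic, []).append(score)
    return {mic: sum(v) / len(v) for mic, v in sorted(bins.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up OD600 readings for three hypothetical variants, eight wells each.
od_readings = {
    "variant_1": [0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    "variant_2": [0.61, 0.55, 0.40, 0.12, 0.03, 0.02, 0.01, 0.01],
    "variant_3": [0.65, 0.63, 0.60, 0.58, 0.52, 0.41, 0.30, 0.15],
}
mic_scores = {name: mic_score(ods) for name, ods in od_readings.items()}
print(mic_scores)  # {'variant_1': 0, 'variant_2': 4, 'variant_3': 8}
```

Given the per-variant MIC-scores and classification scores, `mean_score_per_bin` yields one point per bin, and `pearson` applied to the bin keys and bin means gives the kind of coefficient reported in the text.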
Figure 4: The DeeProtein classification score for screened $\beta$-lactamase variants correlates with the MIC of Carbenicillin.
The average DeeProtein classification scores assigned to samples in the MIC-score bins are depicted as black dots. The red line is the fitted linear model. Samples assigned a high classification score tend to tolerate higher carbenicillin concentrations, whereas a low classification score is assigned to variants with a low MIC.
To improve the predictive power of DeeProtein, we subsequently incorporated the collected data into our training set, with the aim of better capturing the effect of single amino acid substitutions on enzyme activity.

## Discussion

In the course of the $\beta$-lactamase subproject, we related the DeeProtein classification score to the MIC as a measure of enzyme activity. As we obtained a positive correlation, we further demonstrate that, by applying the AiGEM software suite, it is possible to gradually modulate enzyme activity in $\beta$-lactamases.
Based on the mutations suggested by GAIA for positive and negative evolution, we generated a set of mutants spanning the whole spectrum of activity. We obtained catalytically dead candidates as well as many proteins with intermediate activity. Most importantly, however, we also obtained a set of five improved variants, some of them exceeding the wildtype MIC by far. As our experimental results correspond to the averaged DeeProtein classification scores for each MIC-score bin, we state that the scores of the underlying DeeProtein model can be applied as a measure of activity in single and double mutants. However, as all variants were correctly classified as $\beta$-lactamases, the assigned scores all fell within a narrow interval. To address this issue, we incorporated the collected data from the single and double mutations into our training set, forming an integrative cycle of software application, wet-lab data collection and software improvement.
The overall set of mutants revealed four different amino acid substitutions that seem to have drastic effects on protein function. D129L and K71W lead to a complete loss of activity. In combination with other mutations, too, no or only weak activity could be detected. M67G and E102F, in contrast, were very prominent in the improved $\beta$-lactamase subset. The negative effect of K71W can easily be explained: a lysine in the enzymatic pocket is replaced by a tryptophan, disrupting both pocket shape and polarity. In addition, the much larger tryptophan hinders proper access of the substrates. The D129L substitution, in contrast, entails an increase in hydrophobicity in that area, potentially leading to the loss of a hydrogen bond. GAIA assigned the lower scores to candidates containing one of these two mutations, underlining its potential to classify protein function.

# Reprogramming of $\beta$-Glucuronidase

### Motivation

The main objective of the AiGEM software suite is the improvement of directed evolution experiments. As the protein space is tremendous in size and impossible to explore with brute-force or random-walk methods, a directed evolution tool needs to reduce the combinatorial complexity of the protein space. Using DeeProtein, we learned the complex relation between protein sequence and protein function, and thus the properties of the thin manifold of functional protein sequences. To harness this learned representation in a generative approach, we developed our directed evolution tool GAIA (Genetically Artificially Intelligent Algorithm). GAIA deploys the pre-trained DeeProtein models as a scoring function to assess the class probability of a certain protein function during the evolution process. We hypothesize that maximizing the class probability of the goal protein function by introducing amino acid substitutions into the entry sequence shifts its function towards the goal term.
To demonstrate the capabilities of GAIA, we set out to reprogram the E. coli $\beta$-glucuronidase (GUS) towards $\beta$-galactosidase (GAL) activity in silico. Sequences were predicted by GAIA and the enzyme kinetics were subsequently assessed in the wet lab.

### Experimental Setup - Software

We prepared our experiments by performing equilibration molecular dynamics simulations on the wildtype GUS and a known variant introduced by Matsumura et al. [matsumura2001vitro]. Based on these simulations, we determined three mutative windows on the GUS sequence. Limiting mutations to certain sequence windows was necessary to facilitate the cloning of the mutants in the wet lab. Subsequently, the GUS with its defined mutative regions was submitted to GAIA with the objective of maximizing the $\beta$-galactosidase activity GO term (GO:0004565). GAIA was run for 1000 generations and the top five candidates of every generation were added to the candidate library. From this library, we picked five candidates to test in the wet lab.

### Equilibration Molecular Dynamics Simulations

To assess the effects of the introduced mutations on the $\beta$-glucuronidase structure, we performed equilibration molecular dynamics (MD) simulations on the wildtype and the mutated variant introduced by Matsumura et al. [matsumura2001vitro]. All simulations were performed in OpenMM [eastman2010openmm] with the amber99 force field [pearlman1995amber]. The B-chain of the $\beta$-glucuronidase (PDB code: 3lpg) served as the wildtype and as the basis for in silico mutagenesis. First, all selenomethionines in the sequence were changed to methionines and the ligand was removed. Subsequently, the mutations D508G, T509A, S557P, N566S and K568T were introduced in PyMOL [delano2002pymol] to obtain the described GUS variant [matsumura2001vitro].
The apo structure was protonated at pH 6.5 using the Modeller class of the OpenMM library, and the protein was then solvated in a cubic box of 10 nm with the TIP3P water model. The system was then neutralized with sodium and chloride ions.
Figure 5: Introduced mutations have little effect on the protein fold in equilibration MD
Superimposed structures for the wildtype $\beta$-glucuronidase (A) (PDB code: 3lpg) and the mutated version (B) introduced by Matsumura [matsumura2001vitro]. In both depictions the grey structure is the first frame of the equilibration trajectory and the blue structure is the protein after equilibration MD. The catalytic residues are colored yellow, the symbolic ligand (not included in the MD simulations) green. The Matsumura mutations are colored magenta. The grey and blue structures differ only slightly; thus the introduced mutations did not affect the overall protein fold.
Subsequently, the system was heated from 0 to 300 K in 100 K intervals. Each interval was simulated in the NPT ensemble for 10,000 steps with a step size of 2 fs. After heating was completed, the system was simulated in the NPT ensemble for another 10,000,000 steps (20 ns) for equilibration. The last frame of the resulting trajectory was then aligned to the input structure (Fig. 5) and the global root-mean-square deviation of atomic positions (RMSD) was calculated (Tab. 1).
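
As a quick check of the numbers above, the sketch below converts step counts into simulated time and shows how an RMSD between two coordinate sets is computed. It is a minimal illustration, not the simulation code: the function names are ours, the heating schedule is assumed to consist of three stages (100, 200 and 300 K), and the RMSD helper assumes the two structures have already been superimposed, so it performs no fitting itself.

```python
import math

TIMESTEP_FS = 2  # step size stated in the text


def simulated_time_ns(steps, timestep_fs=TIMESTEP_FS):
    """Simulated time in nanoseconds for a number of integration steps."""
    return steps * timestep_fs / 1e6  # 1 ns = 1e6 fs


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) positions.

    Assumes the structures are already superimposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Assumption: "0 to 300 K in 100 K intervals" means three heating stages.
heating_targets_k = [100, 200, 300]
heating_ns = len(heating_targets_k) * simulated_time_ns(10_000)
equilibration_ns = simulated_time_ns(10_000_000)
print(heating_ns, equilibration_ns)

# Toy coordinates (in Angstrom), only to show the call.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```

With a 2 fs step, each 10,000-step heating stage covers 20 ps, and the 10,000,000-step equilibration covers the 20 ns stated in the text.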

Table 1: Calculated RMSD values after equilibration for the wildtype $\beta$-glucuronidase and the Matsumura variant. As the RMSD values are comparable, the mutations suggested by Matsumura [matsumura2001vitro] have no devastating effect on the protein fold. Thus, we consider the regions where the mutations were introduced as mutable.

| Structure | RMSD (Å) |
| --- | --- |
| Wildtype (3lpg) | 1.412 |
| Matsumura mutant | 1.728 |

### Definition of Mutation Sites

To facilitate the cloning process in the wet lab and accelerate production, we limited the mutagenesis of the glucuronidase sequence to three patches (Fig. 6). The defined patches are located around the active site, with fragments A and C partly forming the enzymatic pocket. All positions mutated by Matsumura are contained in the defined patches.
Figure 6: Patches defined for mutagenesis on the $\beta$-glucuronidase sequence.
Displayed is the wildtype $\beta$-glucuronidase (PDB code: 3lpg); the catalytic residues are colored in yellow. The sequence patches defined for mutagenesis are colored as follows: green, Fragment A (351-371); blue, Fragment B (506-512); magenta, Fragment C (548-568). All patches are located close to the active site, where fragments A and C partly form the pocket.

### Variant Selection

As our resources were limited, it was not feasible for us to synthesize all variants suggested by GAIA. Thus, we had to carefully select the variants to be assessed in the wet lab. Matsumura et al. [matsumura2001vitro] reported a set of single amino acid substitutions and a reprogrammed GUS with 5 amino acid substitutions. To compare the predictions made by GAIA with the variants reported by Matsumura, we incorporated single amino acid mutations at positions matching those of the Matsumura variants. For two of these positions, D508 and T509, we additionally assembled the exact variants reported by Matsumura et al. for direct comparison with the predictions made by GAIA.
For the sequence prediction, we ran GAIA over 2000 generations and constrained the mutation sites to the determined sequence areas. The number of mutations per sequence was limited to 10, with an initial mutation rate of 10. The mutation rate decayed linearly by 1 mutation every 300 generations to facilitate convergence, and backmutations were allowed throughout the whole run.
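
The decay schedule described above can be written down in a few lines. This is a sketch of one plausible reading, not GAIA's implementation: the function name is ours, the rate is assumed to drop in whole-number steps at generations 300, 600 and so on, and the lower bound of 1 is an assumption that is never reached within 2000 generations.

```python
def mutation_rate(generation, initial=10, decay_every=300, floor=1):
    """Mutations per sequence at a given (0-based) generation.

    Starts at `initial` and loses one mutation every `decay_every`
    generations; `floor` is an assumed lower bound.
    """
    return max(floor, initial - generation // decay_every)


schedule = {g: mutation_rate(g) for g in (0, 299, 300, 900, 1999)}
print(schedule)  # {0: 10, 299: 10, 300: 9, 900: 7, 1999: 4}
```

Under this reading, the run still introduces four mutations per sequence in its final generations, so the rate decays but does not reach single substitutions within 2000 generations.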
Next, the predicted variants were manually curated and checked for steric conflicts. The single amino acid substitutions were selected with respect to the overall mutation frequencies at these positions.
The resulting set of variants is listed in Tab. 2.

Table 2: Set of synthesized GUS variants. We synthesized a set of 7 GUS variants. To allow a comparison with the power of in silico evolution by GAIA, the set comprises single amino acid mutations at the positions reported by Matsumura et al. [matsumura2001vitro]. These positions are D508, T509, K568 and N556. For two of these positions, D508 and T509, we additionally assembled the exact variants reported by Matsumura et al. for direct comparison with the predictions made by GAIA.

| Variant | Fragment |
| --- | --- |
| D508V | B |
| D508G | B |
| T509L | B |
| T509A | B |
| K568G | C |
| N556S | C |
| V355M, F357D, N358L, G364L, D508L, T509A, F551G, D553F, F554E, G565L | A+B+C |

#### Position 509

A plot of the calculated mutation frequencies at position 509 is shown in Fig. 7. For position 509, GAIA suggests the introduction of hydrophobic residues. This corresponds to the amino acid exchange at position 509 reported by Matsumura et al. [matsumura2001vitro], where the threonine was exchanged for an alanine. As isoleucine and leucine are similar in charge, weight and polarity, we selected the most frequent substitution for position 509: leucine.
Figure 7: Substitution rates over 2000 generations of GAIA evolution for position 509.
GAIA selectively introduced hydrophobic residues at position 509. This is consistent with the mutant T509A reported by Matsumura et al. As isoleucine and leucine are similar in charge, weight and polarity, we selected the most frequently introduced amino acid, leucine.

### Experimental Design - Wet-lab

#### Cloning of different GUS Variants

An expression plasmid of the wildtype GUS was obtained from Addgene. In this plasmid, GUS is expressed under a lac promoter and a His-tag is attached to the C-terminus of the protein. First, the wildtype GAL was cloned into the same plasmid. Second, different mutants of the glucuronidase were created. In this way, seven different mutants were generated.

#### Expression and Purification

After cloning, one colony of each transformation was used to inoculate an overnight preculture at 37 °C and 220 rpm. The main cultures were inoculated with the respective volume of the overnight cultures to a starting density of 0.01. The inducer IPTG was added once the cultures reached an OD600 of 0.6-0.8, and the cultures were then incubated for 20 hours at 25 °C and 220 rpm. Cells were centrifuged and lysed as described in the protocol. Lysates were loaded onto a Ni-NTA cartridge and the proteins were affinity-purified over several washing steps. Purified proteins were obtained after elution with 500 mM imidazole solution followed by dialysis. The protein concentration was determined using both a spectrophotometer and a Bradford assay.

#### Activity Assay

To determine the activity of the different mutants, an assay with p-nitrophenyl-galactopyranoside (PNPG) as substrate was set up. PNPG is a well-suited substrate for GAL: the enzyme cleaves the p-nitrophenyl group from the galactopyranoside, and the yellow color of the product can be measured at 420 nm. The enzyme was added to different concentrations of the substrate, ranging from 0.047 mM to 3 mM final concentration, and the reaction was followed over 45 minutes at 21 °C.
From the data generated in this assay, the Michaelis-Menten kinetic parameters were calculated via nonlinear regression.
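
To illustrate what such a nonlinear regression does, the sketch below fits the Michaelis-Menten equation $v = V_{max} \cdot [S] / (K_{M} + [S])$ to synthetic data using only the Python standard library. It is not the fitting procedure used for this project, which is not described in detail: the function names, the synthetic rates and the search bracket for $K_{M}$ are assumptions, and the substrate series assumes two-fold dilutions from 3 mM (seven points, the lowest being about 0.047 mM).

```python
import math


def michaelis_menten(s, vmax, km):
    """Reaction rate at substrate concentration s."""
    return vmax * s / (km + s)


def best_vmax(substrate, rates, km):
    """For a fixed K_M the model is linear in V_max, so the least-squares
    V_max has a closed form."""
    f = [s / (km + s) for s in substrate]
    return sum(v * fi for v, fi in zip(rates, f)) / sum(fi * fi for fi in f)


def sse(substrate, rates, km):
    """Sum of squared residuals at a given K_M, using the best V_max."""
    vmax = best_vmax(substrate, rates, km)
    return sum((v - michaelis_menten(s, vmax, km)) ** 2
               for s, v in zip(substrate, rates))


def fit_michaelis_menten(substrate, rates, km_low=1e-3, km_high=10.0,
                         iterations=100):
    """Golden-section search over K_M, assuming a single minimum of the
    residual sum of squares inside the bracket [km_low, km_high]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = km_low, km_high
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if sse(substrate, rates, c) < sse(substrate, rates, d):
            b = d
        else:
            a = c
    km = (a + b) / 2
    return best_vmax(substrate, rates, km), km


# Assumed two-fold dilution series from 3 mM down to about 0.047 mM.
substrate_mM = [3 / 2 ** i for i in range(7)]
# Synthetic, noise-free rates in arbitrary units (not measured data).
true_vmax, true_km = 1.0, 0.35
rates = [michaelis_menten(s, true_vmax, true_km) for s in substrate_mM]

vmax, km = fit_michaelis_menten(substrate_mM, rates)
print(round(vmax, 3), round(km, 3))
```

The turnover number then follows as $V_{max}$ divided by the enzyme concentration in the assay. In practice a library routine for nonlinear least squares would normally be used; the hand-written search here only keeps the example self-contained.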

## Results

### Results - Wet-lab

Once the enzymes were cloned and purified, kinetic assays were performed to determine their activity. We tested the mutants as well as both wildtype enzymes, GUS and GAL, as controls. As expected, GUS showed no enzyme activity at all (Fig. 8, 11); no significant product formation could be measured. Among the inactive proteins were two mutants previously published by Matsumura et al. [matsumura2001vitro]. As the published activity increase was only 1.5-2 fold and our enzyme concentration was relatively low, it is no surprise that we could not detect these activities. For the wildtype GAL we determined a $K_{M}$ value of $0.3466\ mM$ and a $k_{cat}$ value of $49.38\ \frac{1}{s}$. The $K_{M}$ is higher and the $k_{cat}$ is lower than comparable values in the literature (see Juers et al. [JUERSETAL..2003]), which is probably mainly due to the reaction conditions, such as assay buffer and temperature. The kinetic parameters that were determined are listed in Tab. 4.

Table 4: Kinetic Key Values of wildtype GAL (GAL_wt) and the best GUS mutant T509L

| Protein | $K_{M}$ [mM] | $k_{cat}$ [$\frac{1}{s}$] | $k_{cat}/K_{M}$ [$\frac{1}{M \cdot min}$] |
| --- | --- | --- | --- |
| GAL_wt | 0.3466 | 49.38 | 8.5E6 |
| GUS_T509L | 0.1834 | 25.02 | 8.1E6 |
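
The third column of Table 4 can be cross-checked from the other two: dividing the turnover number by the Michaelis constant (converted from mM to M) and multiplying by 60 s/min gives a catalytic efficiency in 1/(M·min). The sketch below does this conversion with the tabulated values; the function name is ours, and the results agree with the table to within about 1 %.

```python
def catalytic_efficiency_per_M_min(kcat_per_s, km_mM):
    """Catalytic efficiency k_cat / K_M in 1/(M*min).

    kcat_per_s: turnover number in 1/s
    km_mM:      Michaelis constant in mM
    """
    km_M = km_mM * 1e-3              # mM -> M
    return kcat_per_s / km_M * 60    # 1/(M*s) -> 1/(M*min)


gal_wt = catalytic_efficiency_per_M_min(49.38, 0.3466)
gus_t509l = catalytic_efficiency_per_M_min(25.02, 0.1834)
print(f"GAL_wt:    {gal_wt:.2e}")
print(f"GUS_T509L: {gus_t509l:.2e}")
```
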
Figure 8: Colorimetric Assay for the Determination of GUS Mutant Activity
The photo was taken after 45 min reaction time. One can clearly see that the wildtype GUS as well as most of the mutants show no activity. GUS_T509L, in contrast, has an intense yellow color.
Figure 9: Time Course Measurement of Galactosidase Activity
The product concentration is plotted against the substrate concentration for various time points after reaction initiation. The time points are given in seconds. Both enzymes, the wildtype GAL and GUS_T509L, show strong enzymatic activity on PNPG.
Figure 10: Determination of Catalytic Constants
For the last time point that was measured, 2640 s, product concentration is plotted versus substrate concentration. $K_{M}$ and $k_{cat}$ were determined via nonlinear regression. They indicate a high overall activity of GUS_T509L.
Figure 11: Comparison Between the Wildtype Proteins and GUS_T509L
Our assay demonstrated that the wildtype GUS has no activity on a GAL substrate. The mutant predicted by GAIA however, exhibits extraordinary enzymatic activity on the GAL substrate.

## Discussion

With the help of our AiGEM software suite, we successfully reprogrammed a $\beta$-glucuronidase (GUS) variant towards $\beta$-galactosidase (GAL) activity. We ran equilibration molecular dynamics simulations on the wildtype and the variants reported by Matsumura et al. [matsumura2001vitro]. The RMSD values of the simulated structures compared to the respective input structures suggest that the amino acid exchanges introduced by Matsumura do not entail structural changes. Thus, we liberally defined the mutative windows for GAIA, with a constant central amino acid to facilitate the cloning procedure. As the definition of mutative windows reduces the degrees of freedom for GAIA, the outputs are biased by this constraint. Despite this constraint, the GAIA scores converged within 2000 generations, albeit with lower overall scores than with unconstrained mutation.
The sequences suggested by GAIA were subsequently curated manually in order to select single-substitution variants. The selection was based on the mutation frequencies over all 2000 generations and an additional check for steric conflicts in PyMOL.
The single amino acid substitution T509L was suggested by GAIA with the highest frequency at that position (see Fig. 7) and was selected based on this substitution frequency. The mutant T509L displayed a significantly higher activity than all other assessed variants, including the variant T509A reported by Matsumura [matsumura2001vitro]. GAIA correctly identified the potential of a substitution with a hydrophobic residue at position 509, suggesting both the Matsumura (T509A) and the improved (T509L) mutants.
We assessed the kinetics of the T509L variant in a colorimetric assay and compared them to those of the wildtype GAL and GUS. The mutant had a $k_{cat}$ of $25.02\ \frac{1}{s}$ and a $K_{M}$ of $0.1834\ mM$. Its activity is not only significantly higher than that of the tested wildtype GUS, which lacked any activity, but also close to the activity of the wildtype GAL. We thus state that, with the help of the AiGEM software suite, we were able to reprogram a $\beta$-glucuronidase towards decent $\beta$-galactosidase function fully in silico.
Under our experimental conditions, none of the variants reported by Matsumura displayed any significant activity. The reasons for this observation may lie in the different experimental conditions, which differ in buffer, temperature and substrate concentration. Furthermore, Matsumura reported a maximum increase in activity of 2 log-scales compared to the wildtype for all their single amino acid substitutions. Such a slight increase in activity is not measurable under our experimental conditions.
We demonstrated the capabilities of the AiGEM software suite in a real-world application of functionality transfer by fully in silico evolution of $\beta$-glucuronidases. However, this very specific functionality transfer between closely related enzymes is only the first step on the path towards a universal in silico evolution tool. We began taking the next steps by incorporating the data collected in the wet lab into the training set for our model. For a comprehensive, universal approach, however, much more data is needed. With automated directed evolution experiments becoming more easily applicable and more common, we hope this shortage of data will be overcome in the near future.

# Appendix

Table 1: Selected single and double mutants proposed by GAIA. Of the pool of proposed $\beta$-lactamase sequences we selected 9 single and 16 double mutations for wet-lab validation.

| Mutation | Fragment | Classification score |
| --- | --- | --- |
| G236L | F | 0.56584185 |
| G216C_P217F | E | 0.56683183 |
| L218D | E | 0.56554008 |
| L167G_P164L | D | 0.56678349 |
| P164G | D | 0.56673431 |
| P164H | D | 0.56748658 |
| D129L_N130I | C | 0.56734717 |
| D129L | C | 0.56697899 |
| E102F | B | 0.56683093 |
| Y103D | B | 0.56666845 |
| K71W | A | 0.56605697 |
| M67G | A | 0.56762165 |
| M67G_F70G | A | 0.56777865 |
| K71W_G236L | A+F | 0.56503928 |
| M67G_P164G | A+D | 0.5675621 |
| M67G_P164H | A+D | 0.56818455 |
| E102F_P164H | B+D | 0.56749099 |
| E102F_P164G | B+D | 0.56676418 |
| E102F_G236L | B+F | 0.5658384 |
| E102F_M67G | B+A | 0.56763196 |
| D129L_P164G | C+D | 0.56688529 |
| D129L_M67G | C+A | 0.56781197 |
| D129L_L218L | C+E | 0.56570756 |
| L218D_P164G | E+D | 0.56537378 |
| L218D_K71W | E+A | 0.56467879 |
| WT | None | 0.56685823 |