GAIA
Genetic artificially intelligent algorithm
Here we present GAIA (Genetic Artificially Intelligent Algorithm), an innovative software tool able to fast-forward the directed evolution of protein sequences in silico. GAIA is a genetic algorithm responsible for sequence mutation and selection, interfaced with DeeProtein, a pretrained deep neural network that provides functional protein sequence classification. GAIA achieves in silico directed evolution of proteins by iterative random amino acid substitution followed by selection via maximization of the DeeProtein classification score for the desired target functionality. By fine-tuning DeeProtein to the specific evolutionary task, GAIA can be optimized to facilitate virtually any functionality transfer on a given input protein sequence. GAIA thereby provides, by pure computation, the evolutionary stepping stones critical to the success of directed evolution experiments.
Introduction
Directed evolution of protein sequences is a field of increasing importance. With the growing use of enzymes in the chemical and biotechnological industries, evolved enzymes tailored to particular reactions or environments have become highly sought after. Yet conventional directed evolution experiments are arduous and time-consuming, consisting of multiple rounds of library generation (exploration) followed by selection of the fittest candidates (exploitation). Unlike for other similarly expensive experiments, to this day there is no computational tool dedicated to the a priori simplification and speedup of directed evolution approaches. This is because, from a computational perspective, the task of directed evolution is highly complex: an average protein has a length of 350 amino acid residues, and with an alphabet of 20 amino acids the number of possible sequences (\(20^{350}\)) exceeds the number of particles in the universe (estimated at \(10^{89}\)) by far. Simple brute-force algorithms therefore do not succeed here; prior knowledge, intelligent feature selection and dimensionality reduction are crucial to focus on the profitable mutations. Handcrafted features, however, would be tailored to a particular directed evolution task and would need to be refitted and tuned before they could be applied to other directed evolution experiments. A more convenient solution could be provided by deep learning, which allows the properties of the manifold of the functional sequence space to be learned. With that knowledge, the tremendous combinatorial space is reduced to a processable size.
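The size comparison above can be checked with a few lines of arithmetic. The following minimal Python sketch uses only the figures quoted in the text (350 residues, 20 amino acids, \(10^{89}\) particles); the variable names are illustrative.

```python
import math

SEQUENCE_LENGTH = 350    # average protein length quoted in the text
ALPHABET_SIZE = 20       # size of the amino acid alphabet
PARTICLE_EXPONENT = 89   # particle count as quoted in the text: 10^89

# Work in log10 space, because 20^350 does not fit into a float.
sequence_space_exponent = SEQUENCE_LENGTH * math.log10(ALPHABET_SIZE)

# Orders of magnitude by which the sequence space exceeds the particle count.
excess_orders = sequence_space_exponent - PARTICLE_EXPONENT

print(f"20^350 is about 10^{sequence_space_exponent:.1f}")
print(f"That is about {excess_orders:.0f} orders of magnitude above 10^89")
```

Even testing one sequence per particle would therefore cover only a vanishing fraction of the space, which is why an informed scoring function is needed to guide the search.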
Generative Modeling
As we demonstrated with DeeProtein, we were able to learn an abstract protein representation encoding sequence and features. Generative modeling exploits learned data distributions to sample new artificial data points from that distribution (figure 1). Sampling from trained representations is possible by calculating the gradients with respect to the inputs while leaving the weights untouched. This form of generative modeling is applied in style transfer. To circumvent its obstacles by mimicking nature's concept of evolution, we decided to implement a genetic algorithm for sequence generation and attach a deep neural network model as a scoring function to control sequence selection. For this task we exploited our learned protein representation in the form of a deep residual neural network.
Algorithm
Genetic Component
For sequence generation we implemented a genetic algorithm, as it is inspired by the biological principle of evolution. The algorithm starts from an entry sequence, introduces mutations and scores the resulting sequences with a deep neural network. High-scoring sequences are retained and used as starting points for the next generation. Within one generation a pool of 100 sequences is considered. To commence, the input sequence is duplicated to fill the generation pool, and a user-defined number of initial mutations is introduced on each copy. Additionally, the last highest-scoring sequence is kept as a reference in the pool. The mutation rate decreases with increasing generation number to facilitate convergence. To introduce a mutation, the position and the new amino acid are determined randomly. The position to be mutated is drawn from a uniform distribution over the sequence (either the complete sequence or a subset of interest). The new amino acid is drawn from a distribution based on the E. coli codon usage to mimic the biological environment and optimize expression. This has the additional benefit that rare amino acids, which tend to affect the DeeProtein score to a greater extent due to their higher information content, do not become artificially overrepresented.
Subsequently, each sequence in the generation pool is scored by the deep learning classification model for protein functionality. From the classification scores a GAIA-internal score (see below) is calculated, which considers the goal functionality and any undesirable functionalities. In addition, a term is added that accounts for the BLOSUM62 distance between the sequence and the original entry sequence. The generation is then ranked by score and the top 5 candidates are retained to build the next generation; all other sequences are discarded. The next generation is constructed from the retained candidates in a 50:20:20:10 ratio depending on the sequence rank.
This cycle is repeated until the GAIA scores converge. An overview of the GAIA algorithm is provided in figure 2.
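The mutate, score and select cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GAIA implementation: `toy_score` and `TARGET` are hypothetical stand-ins for the DeeProtein-based GAIA score, the codon degeneracy of the standard genetic code is a placeholder for the E. coli codon-usage weights, the mutation schedule is invented, and a fixed number of generations replaces the convergence check. The text retains five candidates but lists four shares, so the sketch draws offspring only from the top four ranks, in proportion to the shares.

```python
import random

# Codons per amino acid in the standard genetic code; used here only as a
# placeholder for the E. coli codon-usage weights mentioned in the text.
AA_WEIGHTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}
AMINO_ACIDS = list(AA_WEIGHTS)
WEIGHTS = list(AA_WEIGHTS.values())


def mutate(seq, n_mutations, positions, rng):
    """Substitute residues at n_mutations uniformly drawn positions."""
    residues = list(seq)
    for _ in range(n_mutations):
        pos = rng.choice(positions)                           # uniform position
        residues[pos] = rng.choices(AMINO_ACIDS, WEIGHTS)[0]  # weighted residue
    return "".join(residues)


def evolve(entry_seq, score_fn, generations=30, pool_size=100,
           initial_mutations=5, n_keep=5, shares=(50, 20, 20, 10),
           positions=None, seed=0):
    """Return the best sequence and the best score of every generation."""
    rng = random.Random(seed)
    positions = list(positions) if positions is not None else list(range(len(entry_seq)))
    best, parents, history = entry_seq, [entry_seq], []
    for generation in range(generations):
        # Hypothetical schedule: one mutation fewer per generation, minimum 1.
        n_mut = max(1, initial_mutations - generation)
        # Assumption: only as many parents as there are shares get offspring.
        active = parents[:len(shares)]
        chosen = rng.choices(active, weights=shares[:len(active)], k=pool_size - 1)
        # The last best sequence stays in the pool unmutated as reference.
        pool = [best] + [mutate(p, n_mut, positions, rng) for p in chosen]
        ranked = sorted(pool, key=score_fn, reverse=True)
        parents, best = ranked[:n_keep], ranked[0]
        history.append(score_fn(best))
    return best, history


TARGET = "MKTAYIAKQR"  # hypothetical target, only for the stand-in score


def toy_score(seq):
    """Stand-in for the GAIA score: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)


best_seq, score_history = evolve("A" * len(TARGET), toy_score)
print(best_seq, score_history[-1])
```

Because the previous best sequence is always carried over unmutated, the best score in this sketch can never decrease from one generation to the next. Passing a restricted `positions` list limits mutagenesis to a subset of interest.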
GAIA-Score
The GAIA score is calculated as: $$ S = (\sum_{g}^G g_{weight} \cdot g_{logit} - \sigma (g_{variance}) - \sum_{a}^A a_{weight} \cdot a_{logit} + \sigma (a_{variance})) \cdot \frac{1}{\sum_{g}^G} - b_{weight} \cdot b_{score} $$ where \( G \) is the specified set of goal terms and \( A \) the specified set of terms to avoid. The BLOSUM weight \(b_{weight}\) is determined as: $$ b_{weight} = \frac{2}{11 \cdot l} $$ with sequence length \(l\), and the BLOSUM score \(b_{score}\) as: $$ b_{score} = \sum_{i}^l B_{62}(m_{i}, r_{i}) $$ where \(B_{62}\) is the BLOSUM62 matrix, \(r_{i}\) the residue at position \(i\) in the original sequence, and \(m_{i}\) the residue at the same position in the mutated sequence.
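The BLOSUM term \(b_{weight} \cdot b_{score}\) can be computed directly from the two definitions above. The following Python sketch is an illustration only: the function names are hypothetical, and the matrix is a small excerpt of BLOSUM62 covering just the residues of the toy sequences, whereas a real run would need the full 20 x 20 matrix. The classification part of the score is left out because it depends on the DeeProtein outputs.

```python
# Small excerpt of the symmetric BLOSUM62 matrix, sufficient for the toy
# sequences below. A real application needs all 20 x 20 entries.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("G", "G"): 6, ("S", "S"): 4,
    ("A", "G"): 0, ("A", "S"): 1, ("G", "S"): 0,
}


def substitution_score(a, b, matrix):
    """Look up the score for residues a and b in a symmetric matrix."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def blosum_term(mutated, original, matrix):
    """Return b_weight * b_score for two sequences of equal length."""
    if len(mutated) != len(original):
        raise ValueError("sequences must have equal length")
    b_weight = 2 / (11 * len(original))
    b_score = sum(substitution_score(m, r, matrix)
                  for m, r in zip(mutated, original))
    return b_weight * b_score


original = "AGSA"
mutated = "AGSG"  # one substitution: A -> G at the last position

term_identical = blosum_term(original, original, BLOSUM62_EXCERPT)
term_mutated = blosum_term(mutated, original, BLOSUM62_EXCERPT)
print(term_identical, term_mutated)
```

In this toy case the unmutated sequence gives the larger term, since identical residues score highest in the matrix.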
Deep Learning Model
As the scoring function for the genetic algorithm we apply a DeeProtein model (in its ResNet30 architecture) as described here. To better capture the sequence space of the evolutionary task, we recommend fine-tuning the pretrained, broad network to a narrower sequence space. In our case we fine-tuned the model applied in the validation process on the narrow space of beta-glucuronidase-related functions and beta-lactamase-related sequences, respectively. Fine-tuning a pretrained model to a specific task is a common technique in deep learning, as it is much easier to carry out than training a new network for the specific task from scratch. Furthermore, it is reasonable to add data obtained from wet lab experiments to this fine-scale training to improve classification performance and force the model to recognize the relevant positions or features in greater detail. The proposed GAIA system can therefore be improved through this recursive engineering cycle.
Objectives
To verify the concept of GAIA and to demonstrate that a deep learning-driven genetic algorithm can help synthetic biologists in the context of protein engineering and directed evolution, we focused on two model proteins to evaluate key properties. We applied GAIA to beta-lactamase and beta-glucuronidase enzymes with different objectives. While the objective for beta-lactamase was the improvement of the existing functionality and the generation of a library with a broad activity spectrum, the objective for beta-glucuronidase was reprogramming to a new functionality.
in silico Results
Selectivity in mutation sites
Before we set out to evolve beta-lactamases, we assessed the performance of GAIA with comprehensive metrics. The distribution of mutation rates over the residues is an important factor in directed evolution experiments, both determining the output and driving the evolution process. As this distribution is unknown and context-specific, it cannot be approximated universally by computational tools. In GAIA the DeeProtein component is deployed to score the candidates naively generated by the genetic algorithm. Thus the latent knowledge in the deep neural network mimics the hidden distribution of mutation rates. As neural networks are a black-box method, where the reasons for the internal states are extremely complex and very hard to disentangle, we performed comprehensive tests to shed light on the hidden distribution by assessing the effect of each mutation position-wise.
Impact of Mutagenesis on GAIA-Score
To investigate the relation between the GAIA score and the number of introduced mutations, we performed random mutagenesis studies and plotted the resulting scores of the candidates. The score decreases sigmoidally with the number of introduced mutations. This suggests that our model tolerates up to 20 mutations before the scores decrease strongly. The threshold for positive predictions is crossed at about 50 mutations. To generate this plot, GAIA was run in completely random mode, introducing a new mutation every generation until saturation. In every generation the scores were averaged over a set of 100 candidates.
Functionality Transfer
To assess the in silico evolution capabilities of GAIA in a real-world application, we set out to reprogram the E. coli beta-glucuronidase to beta-galactosidase activity. A comprehensive report of the software validation process can be found here. We prepared the in silico evolution by performing equilibration molecular dynamics simulations on the wild-type GUS and the GUS variant suggested by Matsumura et al.
Table 1: Defined sequence patches open for mutagenesis. The three sequence patches defined for the functionality transfer in beta-glucuronidase. Fragments were determined after equilibration MD simulations and structure assessment in PyMOL.
Fragment | Positions | Constant Residue |
---|---|---|
A | 351-371 | G362 |
B | 506-512 | None |
C | 548-568 | G559 |
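The patches in Table 1 translate into a list of positions at which mutations may be introduced. The following Python sketch shows one way to derive that list; the names `PATCHES` and `mutable_positions` are hypothetical, and the sketch assumes that the positions in the table are 1-based and inclusive and that a constant residue is simply excluded from mutagenesis.

```python
# Patches from Table 1: fragment -> (first, last, constant positions).
# Assumption: positions are 1-based and the ranges are inclusive.
PATCHES = {
    "A": (351, 371, {362}),   # G362 is kept constant
    "B": (506, 512, set()),   # no constant residue
    "C": (548, 568, {559}),   # G559 is kept constant
}


def mutable_positions(patches):
    """Return the sorted 1-based positions that are open for mutagenesis."""
    positions = []
    for first, last, constant in patches.values():
        positions.extend(p for p in range(first, last + 1) if p not in constant)
    return sorted(positions)


positions = mutable_positions(PATCHES)

# 0-based indices, as needed for indexing into a Python sequence string.
zero_based = [p - 1 for p in positions]

print(len(positions), positions[0], positions[-1])
```

A list like `zero_based` corresponds to the "subset of interest" from which the genetic algorithm draws the position to be mutated.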