GAIA
Genetic artificially intelligent algorithm
Here we present GAIA (Genetic Artificially Intelligent Algorithm), a software tool for directed in silico evolution of protein sequences. GAIA is composed of a genetic algorithm responsible for sequence mutation and propagation, and a pre-trained deep neural network providing functional classification of protein sequences. A protein is evolved by maximizing the classification score for the target functionality on an entry sequence through generation-wise introduction of amino acid substitutions. By fine-tuning the neural network to the specific evolutionary task, GAIA can be applied to virtually any functionality transfer.
Introduction
Directed evolution of protein sequences is a field of increasing importance. Especially with the exploitation of enzymes in the chemical and biotechnological industries, evolved enzymes tailored to certain reactions or environments have become highly sought after. However, conventional directed protein evolution experiments are an arduous, time-consuming task, consisting of multiple rounds of library generation (exploration) with subsequent selection of the fittest candidates (exploitation). Unlike for other similarly expensive experiments, to this day there is no computational tool available dedicated to simplifying and speeding up directed evolution approaches. That is because, from a computational perspective, the task of directed evolution is highly complex: an average protein has a length of 350 amino acid residues, and considering an alphabet of 20 amino acids, the number of combinatorial options (\(20^{350}\)) exceeds the number of particles in the universe (estimated at \(10^{89}\)) by far. Thus simple brute-force algorithms do not succeed here; instead, prior knowledge, intelligent feature selection and dimensionality reduction are crucial to focus on the profitable mutations. Handcrafted features, however, would be tailored to a particular directed evolution task and thus need to be fitted and tuned before they could be applied to other directed evolution experiments. A more convenient solution can be provided by deep learning, which allows learning the properties of the manifold of the functional sequence space. With that knowledge, the tremendous combinatorial space is reduced to a processable size.
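To make the size of this search space concrete, the exponent can be converted to base 10 (a back-of-the-envelope conversion, not part of the original text):
$$ 20^{350} = 10^{350 \cdot \log_{10} 20} \approx 10^{350 \cdot 1.301} \approx 10^{455} \gg 10^{89} $$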
Generative Modeling
As we demonstrated in DeeProtein, we were able to learn an abstract protein representation encoding sequence and features. Generative modeling exploits learned data distributions to sample new artificial data points from that distribution (figure 1). Sampling from trained representations is possible through calculation of the gradients with respect to the inputs, leaving the weights untouched. While this form of generative modelling is applied in style transfer, it is difficult to apply directly to discrete protein sequences. To circumvent these obstacles we decided to implement a genetic algorithm for sequence generation and attach a deep neural network model as a scoring function to control sequence selection. For this task we exploited our learned protein representation in the form of a deep residual neural network.
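For illustration, the input-gradient idea mentioned above can be sketched as follows. This is not DeeProtein's actual code: the checkpoint name, input shape and optimization settings are assumptions, and the snippet only shows the generic technique of ascending the input while the weights stay fixed.

```python
import tensorflow as tf

# Hypothetical pre-trained classifier mapping a one-hot encoded sequence
# (shape (1, L, 20)) to functional-term logits; not DeeProtein's real API.
model = tf.keras.models.load_model("deeprotein_classifier.h5")

def ascend_inputs(x, target_idx, steps=100, lr=0.1):
    """Maximize the logit of one target class by gradient ascent on the
    input representation while keeping the trained weights untouched."""
    x = tf.Variable(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            logits = model(x, training=False)
            score = logits[0, target_idx]     # logit of the desired function
        grad = tape.gradient(score, x)
        x.assign_add(lr * grad)               # update the input, not the weights
    return x.numpy()
```

In practice the optimized input is continuous and has to be mapped back onto discrete residues, which is one of the obstacles that motivated the genetic algorithm instead.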
Algorithm
Genetic Component
For sequence generation we implemented a genetic algorithm. The algorithm starts from an entry sequence, introduces mutations and scores the resulting sequences with a deep neural network. High-scoring sequences are retained and used as starting points for the next generation. Within one generation a pool of 100 sequences is considered. To commence, the input sequence is duplicated to fill the generation pool and a user-defined number of initial mutations is introduced on each copy. The mutation rate decreases with increasing generation number to facilitate convergence. In order to introduce a mutation, the position and the introduced amino acid are determined randomly. While the position is drawn from a uniform distribution over the sequence, the amino acid is drawn from a distribution considering the E. coli codon usage to mimic the biological environment and optimize expression.

Subsequently each sequence in the generation pool is scored by the deep learning classification model for protein functionality. Depending on the classification scores, a GAIA-internal score (see below) is calculated, considering the goal functionality and any undesirable functionalities. Further, a term is added accounting for the BLOSUM62 distance between the sequence and the original entry sequence. The generation is then ranked by score and the top 5 candidates are retained to build the next generation; all other sequences are discarded. The next generation is constructed from the retained candidates in a 50:20:20:10 ratio depending on the sequence rank. This cycle is repeated until the GAIA scores converge. An overview of the GAIA algorithm is provided in figure 2.
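A compact sketch of this loop is shown below. The pool size, top-rank selection and the 50:20:20:10 rebuild ratio are taken from the description above; the amino acid weights, the scoring function and all names are placeholders rather than GAIA's actual interface.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Uniform placeholder weights; GAIA draws residues according to E. coli codon usage.
AA_WEIGHTS = [1.0] * len(AMINO_ACIDS)

def mutate(seq, n_mutations):
    """Introduce n substitutions: position drawn uniformly, residue drawn by weight."""
    seq = list(seq)
    for _ in range(n_mutations):
        pos = random.randrange(len(seq))
        seq[pos] = random.choices(AMINO_ACIDS, weights=AA_WEIGHTS)[0]
    return "".join(seq)

def next_generation(pool, score_fn, n_mutations, shares=(50, 20, 20, 10)):
    """Rank the pool by score, keep the top candidates and rebuild the pool
    of 100 sequences in the 50:20:20:10 ratio, then mutate every copy."""
    ranked = sorted(pool, key=score_fn, reverse=True)
    parents = []
    for seq, share in zip(ranked, shares):
        parents.extend([seq] * share)
    return [mutate(p, n_mutations) for p in parents]

# Toy usage with a dummy scoring function standing in for the neural network.
entry = "MSIQHFRVALIPFFAAFCLPVFA"
dummy_score = lambda s: sum(a == b for a, b in zip(s, entry))
pool = [mutate(entry, 5) for _ in range(100)]            # initial pool of 100 copies
for generation in range(20):
    pool = next_generation(pool, dummy_score, n_mutations=max(1, 5 - generation // 5))
```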
GAIA-Score
The GAIA score is calculated as: $$ S = (\sum_{g}^G g_{weight} \cdot g_{logit} - \sigma (g_{variance}) - \sum_{a}^A a_{weight} \cdot a_{logit} + \sigma (a_{variance})) \cdot \frac{1}{\sum_{g}^G} - b_{weight} \cdot b_{score} $$ where \( G \) is the specified set of goal terms and \( A \) the specified set of terms to avoid. The BLOSUM weight is determined as: $$ \frac{2}{11 \cdot l} $$ with sequence length \(l\), and the BLOSUM score as: $$ \sum_{i}^l B_{62}(m_{i}, r_{i}) $$ where \(B_{62}\) is the BLOSUM62 matrix, \(r_{i}\) the residue at position \(i\) in the original sequence, and \(m_{i}\) the residue at the same position in the mutated sequence.
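A small sketch of the BLOSUM term as defined above, using Biopython's BLOSUM62 matrix (the function name is ours, and the assumption that both sequences are equal-length and position-aligned follows from GAIA introducing only substitutions):

```python
from Bio.Align import substitution_matrices

B62 = substitution_matrices.load("BLOSUM62")

def blosum_term(original, mutant):
    """BLOSUM score sum_i B62(m_i, r_i) scaled by the weight 2 / (11 * l)."""
    assert len(original) == len(mutant)
    l = len(original)
    score = sum(B62[m, r] for m, r in zip(mutant, original))
    return (2 / (11 * l)) * score   # enters the GAIA score as b_weight * b_score

print(blosum_term("MKTAYIAKQR", "MKTAYIAKQW"))  # example call
```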
Deep Learning Model
In combination with the genetic algorithm we apply a DeeProtein (ResNet30) model as described here. In order to better capture the sequence space of the evolutionary task we recommend fine-tuning the pretrained, broad network on a narrower sequence space. For instance, we fine-tuned the model applied in the validation process on the narrow space of beta-glucuronidase-related functions and beta-lactamase-related sequences, respectively. Fine-tuning a pretrained model to a specific task is a common technique, as it is much easier to carry out than training a new network for the specific task from scratch.
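A generic fine-tuning sketch in Keras-style TensorFlow is shown below. It is illustrative only: DeeProtein's actual architecture, checkpoint format and training setup are not reproduced here, and the checkpoint name, number of frozen layers and hyperparameters are assumptions.

```python
import tensorflow as tf

# Load a pretrained, broad classifier (hypothetical checkpoint name).
model = tf.keras.models.load_model("deeprotein_resnet30.h5")

# Freeze the lower residual blocks so the general sequence features are kept,
# and only retrain the upper layers on the narrow, task-specific sequence space.
for layer in model.layers[:-4]:
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")

# narrow_x / narrow_y: one-hot encoded sequences and labels of the narrow
# sequence space (e.g. beta-lactamase related sequences); placeholders here.
# model.fit(narrow_x, narrow_y, epochs=5, batch_size=32)
```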
Results
We applied GAIA in the context of beta-lactamase and beta-glucuronidase enzymes with different objectives. While the objective in the beta-lactamase context was the improvement of the existing functionality and the generation of a library with a broad activity spectrum, the objective in the beta-glucuronidase context was the reprogramming to a new functionality.
Selectivity in Mutation Sites
Before we set out to the evolution of beta-lactamases we assessed the performance of GAIA by comprehensive metrics. The distribution of mutation rates over the residues is an important factor in directed evolution experiments, both determining the output and driving the evolution process. As this distribution is unknown and context-specific, it cannot be approximated universally in computational tools. In GAIA the DeeProtein component is deployed to score the candidates naively generated by the genetic algorithm. Thus the latent knowledge in the deep neural network mimics the hidden distribution of mutation rates. As neural networks are a black-box method whose internal states are extremely complex and very hard to disentangle, we performed comprehensive tests to shed light on this hidden distribution by assessing the effect of each mutation position-wise.

[Figure: heatmap of the impact on the GAIA score of every possible amino acid substitution in the beta-lactamase sequence. Sequence position is depicted on the x-axis, the performed substitution on the y-axis; the impact is color-coded from yellow (positive) to red (negative). Positive impact occurs in small patches across the whole sequence, with a cluster of positively impacting residues at positions 242 to 246. In contrast, strong negative impact for all kinds of substitutions is detected for residues 255-263; interestingly, these residues form a beta-sheet through the center of the protein. Negative impact is also detected selectively for the introduction of aspartic acid and glutamic acid at the sequence patches 68-81 and 180-200.]
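The position-wise effect assessment underlying this heatmap can be sketched as an exhaustive single-substitution scan (a minimal illustration; `gaia_score` stands in for the DeeProtein-based scoring call and is not GAIA's real interface):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def substitution_scan(sequence, gaia_score):
    """Score every possible single substitution and return the score change
    relative to the unmutated sequence as a (20, L) matrix for a heatmap."""
    baseline = gaia_score(sequence)
    impact = np.zeros((len(AMINO_ACIDS), len(sequence)))
    for pos in range(len(sequence)):
        for row, aa in enumerate(AMINO_ACIDS):
            mutant = sequence[:pos] + aa + sequence[pos + 1:]
            impact[row, pos] = gaia_score(mutant) - baseline
    return impact

# Toy scoring function for demonstration; in GAIA this is the neural network score.
impact = substitution_scan("MSIQHFRVALIPFFAAFCLPVFA", lambda s: -s.count("D") - s.count("E"))
print(impact.shape)  # (20, sequence length)
```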
Impact of Mutagenesis on GAIA-Score
To investigate the relation between the GAIA score and the number of introduced mutations, we performed random mutagenesis studies and plotted the resulting scores of the candidates. The score decreases sigmoidally with the number of introduced mutations. This suggests that our model tolerates up to 20 mutations before the scores decrease strongly. The threshold for positive predictions is crossed at about 50 mutations. For the generation of this plot GAIA was run in completely random mode, introducing a new mutation every generation until saturation. In every generation the scores were averaged over a set of 100 candidates.
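This mutation-load experiment can be reproduced in outline as follows (a self-contained sketch; pool size and the one-mutation-per-generation schedule come from the text, while the scoring function is a toy stand-in for the neural network):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def add_mutation(seq):
    """Introduce one additional random substitution (uniform position and residue)."""
    pos = random.randrange(len(seq))
    return seq[:pos] + random.choice(AMINO_ACIDS) + seq[pos + 1:]

def score_vs_mutation_load(entry, score_fn, max_mutations=100, pool_size=100):
    """Average the score over a pool of 100 candidates for each mutation count."""
    pool = [entry] * pool_size
    averages = []
    for n in range(1, max_mutations + 1):
        pool = [add_mutation(seq) for seq in pool]   # one new mutation per generation
        averages.append((n, sum(map(score_fn, pool)) / pool_size))
    return averages

curve = score_vs_mutation_load("MSIQHFRVALIPFFAAFCLPVFA", lambda s: s.count("F"))
```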
Functionality Transfer
To assess the in silico evolution capabilities of GAIA in a real-world application, we set out to reprogram the E. coli beta-glucuronidase to beta-galactosidase activity. A comprehensive report of the software validation process can be found here. We prepared the in silico evolution by performing equilibration molecular dynamics simulations on the wildtype GUS and the GUS variant suggested by Matsumura et al. The sequence patches opened for mutagenesis are listed in table 1; a sketch of how such constraints can be encoded follows after the table.
Table 1: Defined sequence patches open for mutagenesis. The three defined sequence patches for the functionality transfer in beta-glucuronidase. Fragments were determined after equilibration MD simulations and structural assessment in PyMOL.
Fragment | Positions | Constant Residue
-------- | --------- | ----------------
A        | 351-371   | G362
B        | 506-512   | None
C        | 548-568   | G559
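A minimal way to encode these patch constraints for an in silico mutagenesis run is sketched below. This is purely illustrative: GAIA's actual configuration format is not shown here, and the helper name is hypothetical; only the positions and constant residues are taken from table 1.

```python
# Mutable sequence patches from table 1 (1-based, inclusive) and residues that
# must stay fixed within them.
PATCHES = {"A": (351, 371), "B": (506, 512), "C": (548, 568)}
CONSTANT = {362: "G", 559: "G"}

def allowed_positions(patches=PATCHES, constant=CONSTANT):
    """Return the set of 1-based positions where substitutions may be introduced."""
    positions = set()
    for start, end in patches.values():
        positions.update(range(start, end + 1))
    return positions - set(constant)

print(sorted(allowed_positions())[:5])  # first few mutable positions
```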