Team:Heidelberg/Software/GAIA

GAIA

Genetic artificially intelligent algorithm

Directed evolution of protein sequences is an arduous task, often requiring multiple rounds of library generation and selection as well as the application of different surrogate objective functionalities as evolutionary stepping stones. To this day, no computational tool is available that reduces the number of required cycles of exploration and selection, minimizes the required library complexity, and increases the possible step size of one round of directed evolution. Here we present GAIA (Genetic Artificially Intelligent Algorithm), a software tool for directed **in silico** evolution of protein sequences. GAIA is composed of a genetic algorithm, responsible for sequence mutation and propagation, and a pretrained deep neural network that provides functional protein sequence classification. A protein is evolved by maximizing the functionality classification score of an entry sequence through the generation-wise introduction of amino acid substitutions. By fine-tuning the neural network to the specific evolutionary task, GAIA can be applied to virtually any functionality transfer. To facilitate its application, we provide a comprehensive data preprocessing pipeline as well as metrics and tools for validation.

Graphical abstract: https://static.igem.org/mediawiki/2017/8/88/T--Heidelberg--2017_modelling-graphical-abstract.svg

Introduction

Directed evolution of protein sequences is a field of increasing importance. Especially with the exploitation of enzymes in the chemical and biotechnological industry, evolved enzymes tailored to certain reactions or environments have become highly sought after. Yet conventional directed protein evolution experiments are an arduous, time-consuming task, consisting of multiple rounds of library generation (exploration) with subsequent selection of the fittest candidates (exploitation). Unlike for other similarly expensive experiments, to this day there is no computational tool dedicated to simplifying and speeding up directed evolution approaches. The reason is that, from a computational perspective, the task of directed evolution is more than complex: an average protein has a length of 350 amino acid residues, and with an alphabet of 20 amino acids the number of possible sequences (\(20^{350}\)) exceeds the number of particles in the universe (estimated at \(10^{89}\)) by far. Simple brute-force algorithms therefore do not succeed here. Instead, prior knowledge, intelligent feature selection and dimensionality reduction are crucial to focus on the profitable mutations. Handcrafted features, however, would be tailored to a particular directed evolution task and would thus need to be fitted and tuned before they could be applied to other directed evolution experiments. A more convenient solution could be provided by deep learning, which makes it possible to learn the properties of the manifold of the functional sequence space. With that knowledge, the tremendous combinatorial space is reduced to a processable size.
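The size comparison above can be checked with a few lines of Python. This is only an illustrative calculation, not part of GAIA. The variable names are our own, and the figures (350 residues, 20 amino acids, \(10^{89}\) particles) are the ones quoted in the text.

```python
import math

SEQ_LENGTH = 350         # average protein length quoted in the text
ALPHABET_SIZE = 20       # number of amino acids in the alphabet
PARTICLES_EXPONENT = 89  # particle estimate quoted in the text: 10^89

# Python integers have arbitrary precision, so 20**350 is computed exactly.
n_sequences = ALPHABET_SIZE ** SEQ_LENGTH

# Order of magnitude: log10(20^350) = 350 * log10(20)
log10_sequences = SEQ_LENGTH * math.log10(ALPHABET_SIZE)

# How many orders of magnitude the sequence space exceeds the particle estimate.
excess_orders = log10_sequences - PARTICLES_EXPONENT

print(f"20^{SEQ_LENGTH} has {len(str(n_sequences))} decimal digits")
print(f"log10(20^{SEQ_LENGTH}) = {log10_sequences:.1f}")
print(f"orders of magnitude above 10^{PARTICLES_EXPONENT}: {excess_orders:.0f}")
```

The sequence space is therefore not just larger than the particle estimate but larger by several hundred orders of magnitude, which is why exhaustive enumeration is ruled out.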

Algorithm

Genetic Part

For sequence generation we implemented a genetic algorithm. The algorithm starts from an entry sequence, introduces mutations and scores the resulting sequences with a deep neural network. High-scoring sequences are retained and used as starting points for the next generation. Within one generation a pool of 100 sequences is considered. To begin, the input sequence is duplicated to fill the generation pool, and a user-defined number of initial mutations is introduced into each copy. The mutation rate decreases with increasing generation number to facilitate convergence. To introduce a mutation, both the position and the introduced amino acid are determined randomly. While the position is drawn from a uniform distribution over the sequence, the amino acid is drawn from a distribution that reflects E. coli codon usage, in order to mimic the biological environment and optimize expression. Subsequently, each sequence in the generation pool is scored by the deep learning classification model for protein functionality. From the classification scores a GAIA-internal score is calculated [IMAGE: SCORE FORMULA], which considers the goal functionality and any undesirable functionalities. In addition, a term is added that accounts for the BLOSUM62 distance between the sequence and the original entry sequence. The generation is then ranked by score and the top 5 candidates are retained to build the next generation, while all other sequences are discarded. The next generation is constructed from the retained candidates in a 50:20:20:10 ratio depending on the sequence rank. This cycle is repeated until the GAIA scores converge.
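The cycle described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GAIA implementation. The names `mutate`, `toy_score` and `evolve` are hypothetical, the amino acid weights are uniform placeholders rather than E. coli codon usage, the scoring function is a toy stand-in for the neural network and the GAIA score (whose formula is not given in the text), Hamming distance replaces the BLOSUM62 term, and the mutation schedule and stopping rule are invented. The text retains five candidates but gives a four-part ratio, so the sketch assigns the shares to the four top-ranked candidates.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# ASSUMPTION: uniform placeholder weights; GAIA uses E. coli codon usage.
AA_WEIGHTS = [1.0] * len(AMINO_ACIDS)

POOL_SIZE = 100                      # sequences per generation (from the text)
N_RETAINED = 5                       # top candidates kept (from the text)
OFFSPRING_SHARES = (50, 20, 20, 10)  # rank-dependent ratio (from the text)


def mutate(seq, n_mutations, rng):
    """Substitute n_mutations residues at uniformly drawn positions."""
    residues = list(seq)
    for _ in range(n_mutations):
        pos = rng.randrange(len(residues))
        residues[pos] = rng.choices(AMINO_ACIDS, weights=AA_WEIGHTS, k=1)[0]
    return "".join(residues)


def toy_score(seq, entry, target, distance_weight=0.1):
    """HYPOTHETICAL stand-in for the GAIA score, not the real formula."""
    # Stand-in for the classifier: fraction of positions matching a target.
    functionality = sum(a == b for a, b in zip(seq, target)) / len(target)
    # Stand-in for the BLOSUM62 distance term: normalized Hamming distance.
    distance = sum(a != b for a, b in zip(seq, entry)) / len(entry)
    return functionality - distance_weight * distance


def evolve(entry, score_fn, max_generations=50, initial_mutations=5,
           patience=10, seed=0):
    """Run the mutate, score, rank and retain cycle; return the best hit."""
    rng = random.Random(seed)
    # Fill the pool with mutated copies of the entry sequence.
    pool = [mutate(entry, initial_mutations, rng) for _ in range(POOL_SIZE)]
    best_seq, best_score = entry, score_fn(entry)
    stale = 0
    for generation in range(max_generations):
        ranked = sorted(pool, key=score_fn, reverse=True)
        retained = ranked[:N_RETAINED]
        top_score = score_fn(retained[0])
        if top_score > best_score:
            best_seq, best_score, stale = retained[0], top_score, 0
        else:
            stale += 1
            if stale >= patience:  # ASSUMED convergence criterion
                break
        # ASSUMED schedule: one mutation fewer every 10 generations.
        n_mut = max(1, initial_mutations - generation // 10)
        # zip stops after four parents, so the fifth gets no offspring here.
        pool = [mutate(parent, n_mut, rng)
                for parent, share in zip(retained, OFFSPRING_SHARES)
                for _ in range(share)]
    return best_seq, best_score


# Usage with a random entry sequence and a hypothetical functional target.
demo_rng = random.Random(1)
entry = "".join(demo_rng.choices(AMINO_ACIDS, k=60))
target = mutate(entry, 15, demo_rng)


def score(seq):
    return toy_score(seq, entry, target)


best, best_score = evolve(entry, score)
print(f"entry score {score(entry):.3f}, best score {best_score:.3f}")
```

Passing the scoring function as an argument keeps the genetic part independent of the model, so the toy scorer could be replaced by a call to a trained classifier without changing the loop.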

Deep Learning Classification Modelling

Results

Selectivity in residues to mutate

Our generative model proved to be selective in the residues it mutates within a given window.

Score Correlates with Activity

Functionality Transfer

References