SELEX Optimization with Machine Learning

This project was designed as a way to help the wetlab improve the efficiency of their DNAzyme. Systematic evolution of ligands by exponential enrichment (SELEX) is an in vitro evolutionary selection process that was used to find high binding affinity sequences for a specific region in E. coli, where the DNAzyme acts to autofluoresce. Higher DNAzyme binding affinities would lead to higher observable rates of autofluorescence- serving as a potential of E. coli detection in hospitals. SELEX is an expensive, difficult, time-consuming process, with all of these increasing with each successive generation being run.

Genetic algorithms for sequence optimization


This project sought to simulate this SELEX optimization process in silico using a genetic algorithm coupled with a neural network. A genetic algorithm is an optimization process inspired by natural selection.

Our genetic algorithm (GA) aimed to serve as a computational aid in the design of high-affinity DNA aptamers to a certain target ligand, based on the concept of the systematic evolution of ligands by exponential enrichment (SELEX). From a pool of sequenced aptamers, our GA selects a specified proportion of the most abundant aptamers and creates random cross-overs, insertions, deletions and single nucleotide polymorphisms among these parental aptamers to give rise to a new set of progeny aptamers. The number of progeny returned by our GA is user-defined. These in silico predicted aptamers can then be synthesized and have their efficacies against the target ligand assayed in vitro. Aptamers which show the greatest affinities for the target ligand will be enriched for and then sequenced. These sequencing reads can be assembled and fed back into our GA pipeline, to generate a new set of aptamers, which theoretically have better affinities to the target ligand than their parental generation. In this way, several iterations of SELEX and our GA can be conducted to yield high specificity aptamers for any target ligand.

The code for this project, and more detailed documentation, is available in our Github repository.

Mechanics of the genetic algorithm

The optimization process began with a population of solutions (DNAzyme sequences) to the problem of focus (finding high-binding affinity DNAzyme sequences). These solutions normally require string encoding, but since the solutions are DNA sequences, this has already been completed. Next the top x% (usually 10) are taken from this solution population and are bred together to get an entirely new population of solutions.

Figure 1: The genetic algorithm selects individuals for breeding using some fitness function.

The breeding involves crossover, mutation and in our case, filtering.

Breeding is when two members of the 10% subset are picked and the first part of one member is randomly combined with the second part of another member:

  • If AAAAA and TTTTTT are bred, then a random number between 1 and 5 (length of the shorter sequence) is chosen and the sequences are crossed over at that point.
  • If 3 were chosen as the random crossover point then the corresponding child sequences would be AAATTT.

  Mutations may occur in the algorithm where, for each base, there is a 2% chance (can be varied) where that base will randomly switch to another base:

  • If AAATTT were to be mutated, then a possible outcome is AAGTTT.

The filtering step involved checking whether a sequence contained a hairpin (palindromic sequence that can bind to itself), as hairpins are removed from the population. Functionally, hairpins can interfere with binding as they increase the likelihood of self-binding, so sequences with hairpin structures were removed during each generation.

Figure 2: Selected sequences are bred together by crossing them over at a random point, and then mutated by changing bases at random.

Running this GA, with a simple fitness function (% of bases that are A's), showed significant fitness improvements, even over a short number of generations (< 10).

Neural networks as fitness functions

The next phase of the project was to implement a neural network. We built a LSTM network and trained it on SELEX data from the wet lab to build a model to predict the fitness of a DNA sequence. Through this method, multiple generations can be run and the top scoring individuals can be selected for breeding.

Due to the nature of long sequence data (the initial solution population had an average length of 290-300 bases), building an accurate model proved difficult. Using neural nets to complete regression on sequence data was challenging. However despite these struggles, we were able to build a functional LSTM- with limited ability to accurately predict binding affinity to a useful degree. The model was trained under multiple parameter sets, with various training datasets, but ultimately due to the nature of the problem and the quality/amount of the training data we were ultimately unable to produce a fully functional model for predicting binding affinity.

Despite this we were able to prove the validity of the GA + NN strategy for generating higher quality solutions. We generated a data set of short 15-30bp sequences, where fitness was calculated as abs(65-melting temperature)/65. Sixty-five degrees is often thought of as an ideal melting temperature for primers during PCR. As this is a very simple metric, that can easily be calculated for any sequence, we used this fitness measurement as a way to test the validity of this strategy at all. First, a dataset of 50000 random sequences was generated, along with their fitness values. Next, our LSTM model was trained on this data, making it able to predict melting temperature of a sequence with less than a 2% error on average. Here error was defined as abs(predicted fitness - actual fitness)*100/actual fitness, fitness being defined for melting temperature above. As the starting fitness of the initial population was rather high (80%), a 3000 sample subset was taken (avg fitness ~54%). The LSTM was integrated into the GA and allowed to run for between 1-90 generations. After even 10 generations, the solution populations were becoming fixed at 1-2 unique solutions (out of ~1700-2100 total solutions)- but for any number of generations there was a fitness increase of either 5% for 1 generation and 20-35% increase past 10 generations. Varying the input data set would likely result in more variation at higher generation numbers. Taking solutions from every generation, there were 71 unique solutions, starting from a set of 3000 initial members in the solution population.

The error in predictions, while higher than when testing the LSTM on the validation level, was lower than the actual aptamer prediction error. Ultimately, this model provides a proof of concept that integrating a genetic algorithm and machine learning model can be used to optimize sequences in silico.The most challenging problem ahead is finding ways to increase the accuracy of the ML model, either by optimizing the model parameters (batch size, epochs, architecture etc.), finding ways to better extract data sets, using different techniques (random forest) or any combination.

A genetic algorithm is an excellent optimization tool, but it can only work insofar as it has an accurate fitness estimation function, provided in our case, by a neural network.

Next Steps

Next steps include experimenting with other machine learning techniques, as well as designing SELEX experiments to obtain data specifically tailored to serving as good training sets (high variety, high volume) Also, extracting more out of our own data, to try and expand the feature set (adding values for GC%, length, or other feature that can be extracted from a sequences) can help increase the model's accuracy, or at least have it indicate further heuristics to guide fitness estimation.