Introduction
Deep Learning in general
While the idea of applying a stack of layers composed of computational nodes to estimate complex functions originated in the late 1950s rosenblatt1958perceptron, it was not until the 1990s, when the first convolutional neural networks were introduced LeCun1990Handwritten, that artificial neural networks were successfully applied to real-world classification tasks. With the beginning of this decade and the massive increase in broadly available computing power, the advent of Deep Learning began. Groundbreaking work by Krizhevsky et al. in image classification Krizhevsky2012ImageNet paved the way for many applications in image, video, sound and natural language processing. There has also been successful work on biological and medical data alipanahi2015predicting, kadurin2017cornucopia.
Powerful function approximator to untangle the complex relation between sequence and function
Artificial neural networks are powerful function approximators, able to untangle complex relations in the input data space. However, it was the introduction of convolutional neural networks LeCun1990Handwritten that made deep learning such a powerful method. Convolutional neural networks rely on trainable filters, or kernels, to extract the valuable information from the input space. The application of trainable kernels for feature extraction has been demonstrated to be extremely powerful in representation learning oquab2014learning, detection lee2009unsupervised and classification Krizhevsky2012ImageNet tasks. A convolutional neural network can thus extract the information present in the input space and encode the input in a compressed representation. Handcrafted feature extraction thus becomes obsolete.
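To illustrate how a kernel extracts a local feature, the following minimal sketch slides a single filter along a one-hot encoded amino acid sequence. The kernel weights are set by hand here to respond to the hypothetical motif "GK"; in a convolutional neural network they would be learned during training. The function names and the toy sequence are assumptions made for this illustration only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS]
            for aa in sequence]


def conv1d(encoded, kernel):
    """Slide a kernel (width x 20 weights) along the sequence.

    Uses stride 1 and no padding, so the output has
    len(encoded) - len(kernel) + 1 entries.
    """
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        activations.append(sum(
            weight * value
            for kernel_row, input_row in zip(kernel, window)
            for weight, value in zip(kernel_row, input_row)
        ))
    return activations


# Hand-set kernel of width 2 (illustrative only, not a trained filter):
# position 0 responds to G, position 1 responds to K.
kernel = [
    [1.0 if letter == "G" else 0.0 for letter in AMINO_ACIDS],
    [1.0 if letter == "K" else 0.0 for letter in AMINO_ACIDS],
]

activations = conv1d(one_hot("MAGKST"), kernel)
print(activations)  # the activation peaks at the window covering "GK"
```

The activation is highest where the window matches the motif, which is the sense in which a kernel acts as a feature detector.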
Applied models and Architecture
Protein representation learning
The protein space is extremely complex. The amino acid alphabet comprises 20 basic letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in **de novo** protein sequence generation. Attempts at protein sequence classification have been made with CNNs szalkai2017near as well as with recurrent neural networks liu2017deep with good success, however without the possibility of generative modelling.
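The size of the sequence space can be made concrete with a short calculation. The sketch below uses only the two numbers given in the text (20 letters, length 500) and counts the sequences of exactly that length; the variable names are chosen for illustration.

```python
import math

ALPHABET_SIZE = 20  # basic amino acid letters, as stated in the text
LENGTH = 500        # average protein length, as stated in the text

# Python integers have arbitrary precision, so the count is exact.
n_sequences = ALPHABET_SIZE ** LENGTH

# Order of magnitude: log10(20^500) = 500 * log10(20)
order_of_magnitude = LENGTH * math.log10(ALPHABET_SIZE)

print(f"20^500 is roughly 10^{order_of_magnitude:.1f}")
print(f"number of decimal digits: {len(str(n_sequences))}")
```

The result is on the order of 10^650 possible sequences of length 500, which is why exhaustive enumeration is out of the question and a learned representation of the functional subset is attractive.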
To find the optimal feature representation of proteins we apply and test various representation techniques.
Protein sequence embedding
A protein representation first described by Asgari et al. is prot2vec asgari2015continuous. The technique originates in natural language processing and is based on the word2vec model mikolov2013efficient, originally developed to derive vectorized word representations. Applied to proteins, a word is defined as a k-mer of 3 amino acid residues. A protein sequence can thus be represented as the sum of the vectors of all its internal k-mers. Interesting properties have been described in the resulting vector space, for example clustering of hydrophobic and hydrophilic k-mers and sequences asgari2015continuous. However, there are limitations to the prot2vec model, the most important being the loss of information on the sequence order. This has been addressed by applying the continuous bag of words model with a paragraph embedding kimothi2016distributed. However, training is extremely slow in this case, as a protein sequence itself is embedded in the paragraph context, where a paragraph is a larger set of protein sequences (e.g. SwissProt-DB). Furthermore, new protein sequences cannot be added to the embedding, as the paragraph context may not change.
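The summed k-mer representation and its loss of order information can be sketched as follows. This is an illustration under stated assumptions, not the prot2vec implementation: k-mers are taken as overlapping windows with step 1, and the lookup table holds random toy vectors standing in for trained word2vec embeddings. All names are hypothetical.

```python
import random


def kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence (step 1; an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence, kmer_vectors, dim, k=3):
    """Represent a sequence as the element-wise sum of its k-mer vectors."""
    total = [0.0] * dim
    for kmer in kmers(sequence, k):
        vector = kmer_vectors.get(kmer)
        if vector is None:
            continue  # k-mer not in the (toy) vocabulary, skip it
        for j, value in enumerate(vector):
            total[j] += value
    return total


# Two different sequences that contain exactly the same 3-mers.
seq_a = "AACAAGAA"
seq_b = "AAGAACAA"

# Toy embedding table: random vectors, NOT trained embeddings.
DIM = 4
rng = random.Random(0)
toy_vectors = {kmer: [rng.uniform(-1, 1) for _ in range(DIM)]
               for kmer in set(kmers(seq_a))}

vec_a = embed_sequence(seq_a, toy_vectors, DIM)
vec_b = embed_sequence(seq_b, toy_vectors, DIM)

print(sorted(kmers(seq_a)) == sorted(kmers(seq_b)))  # same k-mer content
print(vec_a)
print(vec_b)  # equal to vec_a up to floating-point rounding
```

Because addition is commutative, any two sequences with the same k-mer content receive the same vector regardless of the order in which the k-mers occur, which is the limitation described above.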
Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. Therefore, we applied a word2vec model mikolov2013efficient to k-mers of length 3 with an embedding dimension of 100. As the quality of the representation estimate scales with the number of training samples, we trained our model on the whole UniProt database (Release 8/2017, apweiler2004uniprot), composed of over 87 million sequences.
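With a corpus of this size, the sequences are best streamed rather than loaded into memory. The following sketch shows one possible way to turn a FASTA-formatted stream into k-mer "sentences", the kind of tokenized input a word2vec implementation consumes. It assumes FASTA input and overlapping 3-mers; the preprocessing actually used for the model is not specified in the text, and the two demo entries are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def kmer_sentences(handle, k=3):
    """Yield one list of overlapping k-mers per sequence, one at a time."""
    for _, sequence in read_fasta(handle):
        yield [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Made-up two-entry FASTA stream standing in for a UniProt download.
demo = io.StringIO(
    ">TOY1 hypothetical entry\n"
    "MKVL\n"
    "AAG\n"
    ">TOY2 hypothetical entry\n"
    "GIV\n"
)

sentences = list(kmer_sentences(demo))
print(sentences)
```

Since the generators hold only one sequence at a time, memory use stays constant in the number of sequences. Note that with 20 letters there are at most 20^3 = 8000 distinct 3-mers, so the vocabulary stays small even for tens of millions of sequences.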
Results
Fig. 1a: Numeric solution calculated with the explicit Euler approach.
Logarithmic plot of the concentrations of all E. coli populations cE, uninfected E. coli ceu, infected E. coli cei, phage-producing E. coli cep and M13 phage cP.