{{Heidelberg/abstract|
NONE|
Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is greatly determined by structure, simple sequence-based classification is extremely difficult, and manual feature extraction combined with conventional machine learning models is not satisfyingly applicable in this context. With the advent of deep learning, however, a possible solution to this fundamental problem appears to be within reach. Here we present DeeProtein, a multi-label model for protein function prediction from raw sequence data. DeeProtein was trained on 10 million protein sequences representing more than 800 protein classes. It classifies with high confidence, achieving 99% AUC under the ROC metric on the validation set, with an average F1 score of 78%. To better understand the protein sequence space, DeeProtein comprises a word2vec embedding for amino acid 3-mers, generated on the complete UniProt database.
}}
{{Heidelberg/templateus/Contentsection|
DeeProtein
Deep learning for protein sequences
Introduction
The idea of applying a stack of layers composed of computational nodes to estimate complex functions originates in the 1960s. Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space.
Protein representation learning
The protein space is extremely complex. The amino acid alphabet comprises 20 basic letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions, but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs, and handcrafted feature extractors also exist for protein sequences.
To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture, based on the ResNet architecture (He et al., 2016), to classify protein sequences into functions. The motivation behind applying a residual neural network is improved training stability and a facilitated learning process. As the identity is added to the output of a residual block through a shortcut, the risk of vanishing gradients and accuracy saturation is reduced.
We define a residual block as a set of two convolutional layers, a convolutional layer with kernel size 1 to squeeze the channels (Iandola et al., 2016), and a pooling layer. Furthermore, a batch normalisation layer is attached to every convolutional layer. The input of a residual block is added to its output after being resized by a pooling layer and padded on the channels. A schematic view of a residual block is depicted in Figure 1, together with an overview of the overall architecture. The model consists of an input processing layer, 30 residual blocks and two independent convolutional output layers.
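As a minimal sketch of this block structure, the following assumes a TensorFlow/Keras implementation; the filter counts, kernel sizes, activation placement and pooling parameters are illustrative placeholders, not the trained model's exact values.

<pre>
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, conv_filters=128, squeeze_filters=64, pool_size=2):
    """Two convolutional layers, a kernel-size-1 channel-squeeze convolution
    and a pooling layer; the identity is pooled and channel-padded before
    being added back to the block output."""
    y = layers.Conv1D(conv_filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)   # batch norm attached to every conv
    y = layers.ReLU()(y)
    y = layers.Conv1D(conv_filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # kernel-size-1 convolution squeezes the channel dimension
    y = layers.Conv1D(squeeze_filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling1D(pool_size)(y)
    # shortcut: resize the identity by pooling, then zero-pad its channels
    # (assumes the input has at most squeeze_filters channels)
    shortcut = layers.MaxPooling1D(pool_size)(x)
    pad = squeeze_filters - shortcut.shape[-1]
    if pad > 0:
        shortcut = layers.Lambda(
            lambda t: tf.pad(t, [[0, 0], [0, 0], [0, pad]]))(shortcut)
    return layers.Add()([y, shortcut])

# usage: one block applied to a one-hot encoded 1000-residue window
inputs = tf.keras.Input(shape=(1000, 20))
outputs = residual_block(inputs)
</pre>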
Label selection
Function labels were defined by the gene ontology (GO) annotation of each sequence; a sketch of how such a label set can be derived follows below.
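The sketch counts GO-term occurrences across the annotated sequences and keeps only sufficiently represented terms; the minimum count of 50 and the input format (one comma-separated GO list per sequence) are assumptions for illustration, not the pipeline's actual parameters.

<pre>
from collections import Counter

def select_go_labels(annotation_lines, min_count=50):
    """Keep GO terms that occur at least `min_count` times.
    Each line is assumed to hold one sequence's comma-separated GO terms."""
    counts = Counter()
    for line in annotation_lines:
        counts.update(term.strip() for term in line.split(",") if term.strip())
    return sorted(term for term, n in counts.items() if n >= min_count)

# example with three annotated sequences, threshold lowered for demonstration
lines = ["GO:0003824,GO:0016787", "GO:0003824", "GO:0003824,GO:0005515"]
print(select_go_labels(lines, min_count=2))  # ['GO:0003824']
</pre>

Data preprocessing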
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17). Subsequently, all sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded; the average sequence had 1.3 labels assigned.
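A minimal sketch of this encoding step, assuming NumPy and a fixed 20-letter amino acid alphabet (the alphabet ordering is an assumption):

<pre>
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 letters
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq, window=1000):
    """One-hot encode a protein sequence, clipping or zero-padding it
    to a fixed window of `window` residues."""
    seq = seq[:window]                      # clip long sequences
    onehot = np.zeros((window, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        if aa in AA_INDEX:                  # unknown residues stay all-zero
            onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot                           # zero-padded up to `window`

print(encode_sequence("MKTAYIAKQR").shape)  # (1000, 20)
</pre>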
Results
The performance of the network was assessed on an exclusive validation set of 4425 sequences. For each GO label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8%, with an average F1 score of 78% (Figure 1).
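For reference, these multi-label metrics can be computed with scikit-learn roughly as follows; the toy arrays, the 0.5 decision threshold and the micro-averaging choice are assumptions for illustration:

<pre>
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# y_true: binary label matrix (n_sequences x n_go_terms)
# y_score: predicted per-class probabilities of the same shape
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.3], [0.8, 0.6, 0.2]])

auc = roc_auc_score(y_true, y_score, average="micro")
f1 = f1_score(y_true, (y_score > 0.5).astype(int), average="micro")
print(f"ROC AUC: {auc:.3f}, F1: {f1:.3f}")
</pre>

Wet Lab Validation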
To assess the value of DeeProtein in a sequence activity evaluation context, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First, we predicted a set of 25 single and double mutant beta-lactamase variants with both higher and lower scores than the wild type, and subsequently assessed their activity in the wet lab. In order to derive a measure for enzyme activity, we investigated the minimum inhibitory concentration (MIC) of Carbenicillin for all predicted mutants. The MIC was assessed by OD600 measurement in Carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements, the MIC score was calculated as the first Carbenicillin concentration where the OD fell below a threshold of 0.08. Next, the classification scores were averaged for each MIC score and plotted against the Carbenicillin concentration (Figure 2).
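A minimal sketch of this scoring rule, assuming OD600 readings ordered by ascending Carbenicillin concentration; the example values are made up:

<pre>
import numpy as np

def mic_score(od_values, threshold=0.08):
    """Return the index of the first Carbenicillin concentration at which
    the OD600 reading falls below the threshold (the MIC score)."""
    below = np.asarray(od_values) < threshold
    return int(np.argmax(below)) if below.any() else None  # None: no inhibition

# OD600 readings of one mutant across ascending Carbenicillin concentrations
od = [0.52, 0.43, 0.21, 0.05, 0.03]
print(mic_score(od))  # 3 -> the fourth concentration tested
</pre>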
Protein Sequence Embedding
A protein representation first described by Asgari et al. is prot2vec. We intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. Therefore, we applied a word2vec model to amino acid 3-mers derived from the complete UniProt database.
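A minimal sketch of such a 3-mer embedding, assuming gensim's word2vec implementation; the vector size, context window and the toy sequences are illustrative placeholders:

<pre>
from gensim.models import Word2Vec

def to_3mers(seq):
    """Split a protein sequence into overlapping amino acid 3-mers."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

# toy corpus; the actual embedding was trained on the complete UniProt database
sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MSHHWGYGKHNGPEHWHKDFPI"]
corpus = [to_3mers(s) for s in sequences]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["MKT"][:5])  # embedding vector of one 3-mer
</pre>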