{{Heidelberg/main|
DeeProtein| Deep learning for protein sequences| https://static.igem.org/mediawiki/2017/b/ba/T--Heidelberg--2017_Background_Tunnel1.jpg%7C color|Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is mostly determined by structure, sequence-based classification is difficult, and manual feature extraction combined with conventional machine learning models has not yielded satisfying results. However, with the advent of deep learning, and especially representation learning, the obstacle of linking sequences to a function without further structural information can be overcome. Here we present DeeProtein, a deep convolutional neural network for multi-label protein sequence classification on functional Gene Ontology terms. We trained our model on a subset of the UniProt database and achieved an area under the ROC curve (AUC) of 99% on our validation set with an average F1 score of 78%.
{{Heidelberg/templateus/Contentsection|
Introduction
The idea of applying a stack of layers composed of computational nodes to estimate complex functions originated in the 1960s. Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space.
Protein representation learning
The protein space is extremely complex. The amino acid alphabet comprises 20 standard letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs. Handwritten feature extractors for protein sequences also exist.
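To give a sense of scale for the combinatorial complexity mentioned above, the following minimal sketch counts the possible sequences for the figures stated in the text (20 letters, 500 residues). The function names are illustrative and not part of DeeProtein.

```python
import math

ALPHABET_SIZE = 20      # standard amino acids, as stated in the text
SEQUENCE_LENGTH = 500   # average protein length, as stated in the text


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of exactly `length` residues."""
    return alphabet_size ** length


def log10_space_size(alphabet_size: int, length: int) -> float:
    """Base-10 logarithm of the sequence space size (avoids huge integers)."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    size = sequence_space_size(ALPHABET_SIZE, SEQUENCE_LENGTH)
    # Python integers are arbitrary precision, so the digits can be counted directly.
    print("decimal digits:", len(str(size)))
    print("log10(size):   ", round(log10_space_size(ALPHABET_SIZE, SEQUENCE_LENGTH), 3))
```

The count is 20^500, roughly 10^650, which is why exhaustive enumeration is out of the question and a learned representation of the functional subspace is attractive.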
Convolutional Approach
To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture to classify protein sequences by function. Function labels were defined by the Gene Ontology (GO) annotation.
Data preprocessing
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17). Subsequently, all sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded; the average sequence had 1.3 labels assigned.
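The encoding step described above can be sketched as follows. This is a minimal illustration rather than the DeeProtein preprocessing code. The alphabet ordering, the all-zero row for non-standard residues, and the two-term GO vocabulary in the usage example are assumptions made for this example.

```python
from typing import Dict, List, Sequence

# Assumption: alphabetical ordering of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 1000  # window length stated in the text


def one_hot_sequence(seq: str, window: int = WINDOW) -> List[List[int]]:
    """Encode a sequence as a window x 20 matrix, clipping or zero-padding."""
    encoded: List[List[int]] = []
    for residue in seq[:window]:  # clip sequences longer than the window
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue.upper())
        if idx is not None:  # assumption: unknown residues stay all-zero
            row[idx] = 1
        encoded.append(row)
    while len(encoded) < window:  # zero-pad shorter sequences
        encoded.append([0] * len(AMINO_ACIDS))
    return encoded


def multi_hot_labels(go_terms: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a set of GO terms as a 0/1 vector over a fixed label vocabulary."""
    term_index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for term in go_terms:
        if term in term_index:
            vector[term_index[term]] = 1
    return vector


if __name__ == "__main__":
    matrix = one_hot_sequence("MKV")
    print(len(matrix), len(matrix[0]))  # rows x columns of the encoded window
    # Illustrative two-term vocabulary, not the label set used by the model.
    print(multi_hot_labels(["GO:0003824"], ["GO:0003824", "GO:0005515"]))
```

Because a sequence can carry more than one GO term, the label vector may contain several ones, which is what makes the task multi-label.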
Results
The performance of the network was assessed on an exclusive validation set of 4425 sequences. For each GO label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8% with an average F1 score of 78% (Figure 1).
Wet Lab Validation
To assess the value of DeeProtein for evaluating sequence activity, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First we predicted a set of 25 single and double mutant beta-lactamase variants with both higher and lower scores than the wild type and subsequently assessed their activity in the wet lab. In order to derive a measure for enzyme activity, we investigated the minimum inhibitory concentration (MIC) of carbenicillin for all predicted mutants. The MIC was determined by OD600 measurement in carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements, the MIC score was calculated as the first carbenicillin concentration at which the OD fell below a threshold of 0.08. Next, the classification scores were averaged for each MIC score and then plotted against the carbenicillin concentration (Figure 2).
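The MIC-score calculation and the per-MIC averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis script used for Figure 2. It assumes "first" means the lowest concentration in ascending order, it reports variants that never fall below the threshold as `None`, and all numbers in the usage example are made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

OD_THRESHOLD = 0.08  # OD600 threshold stated in the text


def mic_score(readings: List[Tuple[float, float]],
              threshold: float = OD_THRESHOLD) -> Optional[float]:
    """Return the lowest concentration whose OD600 is below the threshold.

    `readings` holds (concentration, OD600) pairs in any order.
    Returns None if growth is never inhibited in the tested range.
    """
    for concentration, od in sorted(readings):  # ascending concentration
        if od < threshold:
            return concentration
    return None


def mean_score_per_mic(
        variants: List[Tuple[float, Optional[float]]]) -> Dict[float, float]:
    """Average classification scores over all variants sharing a MIC score.

    `variants` holds (classification_score, mic_score) pairs.
    """
    grouped: Dict[float, List[float]] = defaultdict(list)
    for score, mic in variants:
        if mic is not None:  # skip variants without a measurable MIC
            grouped[mic].append(score)
    return {mic: sum(scores) / len(scores)
            for mic, scores in sorted(grouped.items())}


if __name__ == "__main__":
    # Made-up readings: (carbenicillin concentration, OD600), units arbitrary.
    example = [(0.0, 0.90), (50.0, 0.50), (100.0, 0.05), (200.0, 0.04)]
    print(mic_score(example))
    print(mean_score_per_mic([(0.9, 100.0), (0.7, 100.0), (0.2, 50.0)]))
```

Averaging the scores within each MIC group yields one point per tested concentration, which is the quantity plotted against concentration in the text.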
Protein Sequence Embedding
A protein representation first described by Asgari et al. is prot2vec. Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. Therefore we applied a word2vec model