DeeProtein
Deep learning for protein sequences
Introduction
The idea of applying a stack of layers composed of computational nodes to estimate complex functions originates in the 1960s. Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space.
Protein representation learning
The protein space is extremely complex. The amino acid alphabet comprises 20 basic letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous: for a length of 500 alone there are 20^500, or roughly 10^650, possible sequences. As with images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs as well as with recurrent neural networks, with good success, however without the possibility of generative modelling. Handcrafted feature extractors for protein sequences also exist.
Convolutional Approach
To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture to classify protein sequences into functions. Function labels were thereby defined by the gene ontology (GO) annotation.
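As an illustration of this approach, the sketch below shows a minimal fully convolutional multi-label classifier over one-hot encoded sequences in Keras. The layer count, filter numbers, kernel widths and label count are placeholder assumptions, not the actual DeeProtein configuration:

# Minimal sketch of a fully convolutional GO-term classifier.
# All hyperparameters (filter counts, kernel widths, N_LABELS) are
# illustrative placeholders, not the DeeProtein ResNet30 configuration.
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, ALPHABET_SIZE = 1000, 20   # sequence window and amino acid alphabet
N_LABELS = 1000                     # placeholder for the size of the GO label set

inputs = tf.keras.Input(shape=(SEQ_LEN, ALPHABET_SIZE))      # one-hot encoded sequence
x = layers.Conv1D(64, 7, padding="same", activation="relu")(inputs)
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
x = layers.Conv1D(256, 3, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)                           # pool over sequence positions
outputs = layers.Dense(N_LABELS, activation="sigmoid")(x)    # one sigmoid per GO term (multi-label)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

Because every label gets its own sigmoid output, a sequence can be assigned several GO terms at once, matching the multi-label nature of GO annotation.

Data preprocessing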
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17). Subsequently, all sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded; the average sequence had 1.3 labels assigned.
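A minimal sketch of this encoding step in NumPy (the alphabet ordering and the silent skipping of non-standard residues are our assumptions):

# One-hot encode a protein sequence over the 20-letter amino acid
# alphabet and clip or zero-pad it to a fixed window of 1000 residues.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 1000

def encode(sequence):
    """Return a (WINDOW, 20) array with one row per residue position."""
    encoded = np.zeros((WINDOW, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:WINDOW]):    # clip sequences longer than the window
        if aa in AA_INDEX:                          # assumption: ignore non-standard residues
            encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded                                  # shorter sequences remain zero-padded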
Results
The performance of the network was assessed on an exclusive validation set of 4425 sequences. For each GO label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8%, with an average F1 score of 78% (Figure 1).
Figure 1: Performance on the validation set after the completed training process. A) The receiver operating characteristic (ROC) curve for DeeProtein ResNet30; the area under the ROC curve is 99%. B) The precision-recall curve for DeeProtein ResNet30.
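The averaging scheme behind these numbers is not detailed here; as a sketch, micro-averaged ROC AUC and F1 can be computed with scikit-learn as follows (y_true and y_score are random placeholders standing in for the validation labels and network outputs):

# Sketch: micro-averaged ROC AUC and F1 for a multi-label classifier.
# y_true and y_score are placeholders, not the real validation data.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

n_samples, n_labels = 4425, 1000                       # n_labels is a placeholder
y_true = np.random.randint(0, 2, size=(n_samples, n_labels))
y_score = np.random.rand(n_samples, n_labels)

auc = roc_auc_score(y_true.ravel(), y_score.ravel())   # micro-averaged ROC AUC
f1 = f1_score(y_true, (y_score > 0.5).astype(int), average="micro")
print("ROC AUC: %.3f  F1: %.3f" % (auc, f1))

Wet Lab Validation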
To assess the value of DeeProtein in a sequence-activity evaluation context, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First we predicted a set of 25 single- and double-mutant beta-lactamase variants with both higher and lower scores than the wild type, and subsequently assessed their activity in the wet lab. In order to derive a measure for enzyme activity, we investigated the minimum inhibitory concentration (MIC) of Carbenicillin for all predicted mutants. The MIC was determined by OD600 measurement in Carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements, the MIC-score was calculated as the first Carbenicillin concentration at which the OD fell below a threshold of 0.08. Next, the classification scores were averaged for each MIC-score and plotted against the Carbenicillin concentration (Figure 2).
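The MIC-score rule described above can be stated compactly in code (a sketch; function and variable names and the example values are illustrative):

# MIC-score: the first Carbenicillin concentration at which the OD600
# reading falls below the 0.08 threshold. Inputs must be paired and
# sorted by ascending concentration.
def mic_score(concentrations, od600_readings, threshold=0.08):
    for conc, od in zip(concentrations, od600_readings):
        if od < threshold:
            return conc
    return None  # growth at every tested concentration

# Illustrative example: the MIC-score here is 50.
print(mic_score([0, 10, 25, 50, 100], [0.35, 0.21, 0.12, 0.05, 0.04]))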
Protein Sequence Embedding
A protein representation first described by Asgari et al. is prot2vec. We thus intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding, and therefore applied a word2vec model to protein sequences.
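As a minimal sketch of such an embedding in the spirit of prot2vec (the use of gensim, the overlapping 3-mers and all hyperparameters below are our assumptions, not necessarily the original setup):

# Train a word2vec (skip-gram) model on protein sequences split into
# overlapping 3-mers; gensim and the hyperparameters are assumptions.
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]  # placeholder corpus
corpus = [to_kmers(seq) for seq in sequences]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
vector = model.wv["MKT"]    # 100-dimensional embedding of the 3-mer "MKT"

Each 3-mer then has a dense vector, and a whole sequence can be embedded, for example, by averaging the vectors of its 3-mers.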