Line 42: | Line 42: | ||
{{Heidelberg/templateus/Imagesection| | {{Heidelberg/templateus/Imagesection| | ||
URL | | URL | | ||
− | Figure 1: Performance on the validation set after completed training process. | | + | Figure 1: Performance on the validation set after completed training process.| |
− | A) The reciever operating characteristic (ROC) curve for DeeProtein ResNet30. The area under the ROC curve is 99%. B) The precision recall curve for DeeProtein ResNet30. | + | A) The reciever operating characteristic (ROC) curve for DeeProtein ResNet30. The area under the ROC curve is 99%. B) The precision recall curve for DeeProtein ResNet30. |
}} | }} | ||
}}}} | }}}} | ||
Line 68: | Line 68: | ||
We visualized our 100 dimensional embedding through PCA dimensionality reduction as shown in fig 3. Highlighted in sequence are all kmers containing a certain amino acid. Clear clusters can be observed for the aminoacids Cysteine (top right corner), Lysine (top left corner), Tryptophane (center right), Glutamate (center left), Proline (bottom) and Arginine (center) even after dimensionality reduction. In contrast, for aminoacids like Glycine, Serine and Valine are distributed over the whole kmer space. | We visualized our 100 dimensional embedding through PCA dimensionality reduction as shown in fig 3. Highlighted in sequence are all kmers containing a certain amino acid. Clear clusters can be observed for the aminoacids Cysteine (top right corner), Lysine (top left corner), Tryptophane (center right), Glutamate (center left), Proline (bottom) and Arginine (center) even after dimensionality reduction. In contrast, for aminoacids like Glycine, Serine and Valine are distributed over the whole kmer space. | ||
{{Heidelberg/templateus/Imagesection| | {{Heidelberg/templateus/Imagesection| | ||
− | https:// | + | https://static.igem.org/mediawiki/2017/7/71/T--Heidelberg--2017_170827mod-aas.gif | |
Figure 3: The 3mer sequence space reduced on two dimensions by PCA. | | Figure 3: The 3mer sequence space reduced on two dimensions by PCA. | | ||
Highlighted are particular kmers containing a ceratin amino acid in sequential order. For the aminoacids Cysteine (top right corner), Lysine (top left corner), Tryptophane (center right), Glutamate (center left), Proline (bottom) and Arginine (center) clear clusters are observable. Others, like Glycine and Serine are distrbuted over the whole kmer space. | Highlighted are particular kmers containing a ceratin amino acid in sequential order. For the aminoacids Cysteine (top right corner), Lysine (top left corner), Tryptophane (center right), Glutamate (center left), Proline (bottom) and Arginine (center) clear clusters are observable. Others, like Glycine and Serine are distrbuted over the whole kmer space. |
Revision as of 13:06, 29 October 2017
DeeProtein
Deep learning for protein sequences
Introduction
The idea of applying a stack of layers composed of computational nodes to estimate complex functions originated in the 1960s. Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space.
Protein sequence classification has been attempted with CNNs szalkai2017near as well as with recurrent neural networks liu2017deep with good success, but without the possibility of generative modelling.
Handwritten feature extractors also exist for protein sequences bandyopadhyay2005efficient saeys2007review. Along with support vector machines, they were applied in protein-protein interaction prediction hamp2015evolutionary as well as in protein family classification cai2003svm leslie2002spectrum. However, they are outperformed by trainable approaches applying CNNs szalkai2017near or word2vec models leslie2002spectrum. To find the optimal feature representation of proteins, we applied a word2vec embedding as well as a convolutional neural network.
Protein representation learning
The protein space is extremely complex. The amino acid alphabet has 20 basic letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. As with images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the distribution of proteins with a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs.
Convolutional Approach
To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture to classify protein sequences to functions. Function labels were defined by the gene ontology (GO) annotation.
Data preprocessing
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17). Subsequently, all sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded, where the average sequence had 1.3 labels assigned.
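The encoding step described above can be sketched as follows. This is a minimal illustration and not the DeeProtein implementation: the alphabet order, the all-zero handling of non-standard residues, and the names `one_hot_encode`, `encode_labels` and `label_index` are assumptions made for the example. Only the window of 1000 residues and the clip-or-pad behaviour are taken from the text.

```python
import numpy as np

# Assumed alphabet: the 20 standard amino acids, ordered by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 1000  # window length stated in the text


def one_hot_encode(sequence, window=WINDOW):
    """Return a (window, 20) array; positions past the sequence end stay zero."""
    encoded = np.zeros((window, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, residue in enumerate(sequence[:window]):  # clip long sequences
        idx = AA_INDEX.get(residue)
        if idx is not None:  # assumption: non-standard residues become zero rows
            encoded[pos, idx] = 1.0
    return encoded


def encode_labels(go_terms, label_index):
    """Encode a set of GO terms as a 0/1 vector (several entries may be 1)."""
    vec = np.zeros(len(label_index), dtype=np.float32)
    for term in go_terms:
        vec[label_index[term]] = 1.0
    return vec
```

Because a sequence can carry more than one GO label (1.3 on average), the label vector in this sketch may contain several ones, one per assigned term.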
Results
The performance of the network was assessed on a held-out validation set of 4425 sequences. For each GO label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8% with an average F1 score of 78% (Figure 1).
Wet Lab Validation
To assess the value of DeeProtein for evaluating sequence activity, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First we predicted a set of 25 single and double mutant beta-lactamase variants with both higher and lower scores than the wildtype, and subsequently assessed their activity in the wet lab. In order to derive a measure for enzyme activity, we investigated the minimum inhibitory concentration (MIC) of Carbenicillin for all predicted mutants. The MIC was determined by OD600 measurement in Carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements, the MIC-score was calculated as the first Carbenicillin concentration where the OD fell below a threshold of 0.08. Next, the classification scores were averaged for each MIC-score and then plotted against the Carbenicillin concentration (Figure 2).
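The MIC-score calculation and the subsequent averaging can be illustrated with a short sketch. It assumes that the OD600 readings of each mutant are available as a mapping from Carbenicillin concentration to OD value; the function names `mic_score` and `mean_score_per_mic` and the data layout are hypothetical. Only the threshold of 0.08 comes from the text.

```python
from collections import defaultdict

OD_THRESHOLD = 0.08  # threshold stated in the text


def mic_score(od_by_concentration, threshold=OD_THRESHOLD):
    """Return the lowest concentration whose OD600 is below the threshold.

    od_by_concentration maps Carbenicillin concentration -> measured OD600.
    Returns None if the OD never falls below the threshold.
    """
    for conc in sorted(od_by_concentration):
        if od_by_concentration[conc] < threshold:
            return conc
    return None


def mean_score_per_mic(mutants):
    """Average classification scores over all mutants sharing a MIC-score.

    mutants is an iterable of (classification_score, mic_score) pairs.
    """
    grouped = defaultdict(list)
    for score, mic in mutants:
        grouped[mic].append(score)
    return {mic: sum(scores) / len(scores) for mic, scores in grouped.items()}
```

The dictionary returned by `mean_score_per_mic` holds one averaged classification score per MIC-score, which is the quantity plotted against concentration.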
Protein Sequence Embedding
A protein representation first described by Asgari et al. is prot2vec. Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. Therefore we applied a word2vec model