Team:Heidelberg/Software/DeeProtein

     {{Heidelberg/abstract|

         https://static.igem.org/mediawiki/2017/d/d2/T--Heidelberg--2017_DeeProtein_GA.jpg|
         Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is largely determined by structure, a simple sequence-based classification is extremely difficult. Thus, manual feature extraction along with conventional machine learning models is not satisfyingly applicable in this context. With the advent of deep learning, however, a possible solution to this fundamental problem appears to be within reach. Here we present DeeProtein, a multi-label model for protein function prediction from raw sequence data. DeeProtein was trained on ~10 million protein sequences representing 886 protein classes. It classifies confidently, achieving an area under the receiver operating characteristic curve of 99% on the validation set and an average F1 score of 78%. To better understand the protein sequence space, DeeProtein comprises a word2vec embedding for amino acid 3-mers, generated on the complete UniProt database.
 
     }}
 
{{Heidelberg/formblank|{{#tag:html|
 
{{Heidelberg/templateus/Contentsection|

{{#tag:html|
<h1>Introduction</h1>
 
While the idea of applying a stack of layers composed of computational nodes to estimate complex functions dates back to the late 1950s <x-ref>rosenblatt1958perceptron</x-ref>, it was not until the 1990s, when the first convolutional neural networks were introduced <x-ref>LeCun1990Handwritten</x-ref>, that artificial neural networks were successfully applied to real-world classification tasks. With the beginning of this decade and the massive increase in broadly available computing power, the advent of deep learning began. Groundbreaking work by Krizhevsky et al. in image classification <x-ref>Krizhevsky2012ImageNet</x-ref> paved the way for many applications in image, video, sound and natural language processing. There has also been successful work on biological and medical data <x-ref>alipanahi2015predicting</x-ref>, <x-ref>kadurin2017cornucopia</x-ref>.<br>
 
}}}}
<h3>Background</h3>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|

Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space <x-ref>cybenko1989approximation</x-ref>. However, it was the convolutional neural networks proposed in the early 1990s <x-ref>LeCun1990Handwritten</x-ref> that made deep learning practical. Convolutional neural networks rely on trainable filters or kernels to extract valuable information (features) from the input space. The application of trainable kernels for feature extraction has proven extremely powerful in representation learning <x-ref>oquab2014learning</x-ref>, detection <x-ref>lee2009unsupervised</x-ref> and classification <x-ref>Krizhevsky2012ImageNet</x-ref> tasks. Similar to the visual cortex of mammals, convolutional neural networks comprise different layers of abstraction. While the lower layers detect simple properties like edges and corners, higher layers assemble the features from the lower layers and detect more complex shapes <x-ref>lecun2015deep</x-ref>. With increasing depth, the layers have a larger receptive field and are thus able to combine more signals from the layers below <x-ref>lecun2015deep</x-ref>. A convolutional neural network can thus be trained to extract the relevant information present in the input space and encode the input in a compressed representation. Handcrafted, human-designed feature extraction thus becomes obsolete. Often a convolutional neural network is complemented by a small fully connected (dense) network part that processes the extracted features to perform the actual classification or detection task <x-ref>Krizhevsky2012ImageNet</x-ref>.<br>
   <h1>Protein representation learning</h1>
 
   The protein space is extremely complex. The amino acid alphabet comprises 20 basic letters and an average protein has a length of about 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in <i>de novo</i> protein sequence generation. Attempts at protein sequence classification have been made with CNNs <x-ref>szalkai2017near</x-ref> as well as with recurrent neural networks <x-ref>liu2017deep</x-ref> with good success, however without the possibility of generative modelling.<br>
   Handcrafted feature extractors also exist for protein sequences <x-ref>bandyopadhyay2005efficient</x-ref><x-ref>saeys2007review</x-ref>. Along with support vector machines they were applied in protein-protein interaction prediction <x-ref>hamp2015evolutionary</x-ref> as well as in protein family classification <x-ref>cai2003svm</x-ref><x-ref>leslie2002spectrum</x-ref>. However, they are outperformed by trainable approaches applying CNNs <x-ref>szalkai2017near</x-ref> or word2vec models <x-ref>leslie2002spectrum</x-ref>.<br>
       }}}}

<h2>Network Architecture</h2>
{{Heidelberg/templateus/Contentsection|
      {{#tag:html|

To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture to classify protein sequences into functions, based on the ResNet architecture <x-ref>he2016deep</x-ref>. The motivation behind applying a residual neural network is improved training stability and a facilitated learning process. As the identity is added to the output of a residual block through a shortcut connection, the risk of vanishing gradients and accuracy saturation is reduced.<br>
       We define a residual block as a set of two convolutional layers, a convolutional layer with kernel size 1 to squeeze the channels <x-ref>Iandola2016SqueezeNet</x-ref> and a pooling layer. Furthermore, a batch normalisation layer is attached to every convolutional layer. The input of a residual block is added to its output after being resized by a pooling layer and padded on the channels. A schematic view of a residual block is depicted in Figure 2, together with an overview of the overall architecture in Figure 1. The model consists of an input processing layer, 30 residual blocks and two independent convolutional output layers.<br>
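       The block described above can be sketched in a few lines of code. The following tf.keras version is a minimal illustration, not the exact DeeProtein implementation: the ReLU activations, the "same" padding and the zero-padding of the shortcut channels are assumptions.<br>
<pre>
# Illustrative tf.keras sketch of a DeeProtein-style residual block.
# Kernel sizes, the 1x1 squeeze convolution, batch normalisation and pooling
# follow the description above; activations and shortcut padding are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    shortcut = x
    # two 1-d convolutions with kernel size 3, each with a batch norm attached
    y = layers.Conv1D(channels, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(channels, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # 1x1 convolution to squeeze the channels, then pooling with stride 2
    y = layers.Conv1D(channels, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPool1D(pool_size=2, strides=2)(y)
    # shortcut: pool the block input to the same length, zero-pad its channels
    shortcut = layers.MaxPool1D(pool_size=2, strides=2)(shortcut)
    missing = channels - shortcut.shape[-1]
    if missing > 0:
        shortcut = layers.Lambda(
            lambda t: tf.pad(t, [[0, 0], [0, 0], [0, missing]]))(shortcut)
    return layers.Add()([y, shortcut])
</pre>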
 
       {{Heidelberg/templateus/Imagesection|
 
 
       https://static.igem.org/mediawiki/2017/5/5b/T--Heidelberg--2017_DeeProtein_Architecture.png |
 
       Figure 1: The Architecture of DeeProtein ResNet30.|

       The one-hot encoded input sequence is embedded by the input processing layer, a 2D convolutional layer with kernel size [20, 1], resulting in a 1D output per sequence. Following the input processing layer, all subsequent convolutions are one-dimensional. Throughout the network we apply 13 pooling layers, reducing the length dimension from 1000 positions to 1. At the same time the channel size increases with depth, from 64 in the input processing layer to 512 in the last residual block.

       }}

       {{Heidelberg/templateus/Imagesection|

       Figure 2: Schematic view on the graph of a single ResNet-Block|
 
A residual block is composed of two 1-d convolutional layers with kernel size 3 and a 1-d 1x1 convolutional layer. Every convolutional layer has a batch normalisation layer attached. The output of the 1x1 layer is then pooled with a stride of 2 and kernel size 2. Subsequently, the input of the residual block is added to the output of the pooling layer in an element-wise addition (shortcut).
 
       }}
 
}}}}
       <h2>Label selection</h2>
{{Heidelberg/templateus/Contentsection|
      {{#tag:html|

       For the classification task, function labels were defined by the gene ontology (GO) annotation <x-ref>gene2004gene</x-ref>. The gene ontology annotation is hierarchical and best described as a directed acyclic graph (DAG). It contains labels providing information on the cellular location, pathway and molecular function of a particular protein. As we were interested solely in protein function classification, we considered only GO-labels in the molecular function sub-DAG. The molecular function sub-DAG has up to 12 levels and 11135 GO-terms, [/] of which are leaf nodes. The population of the terms varies greatly and strongly depends on a term's level in the DAG, with terms towards the root being more densely populated than leaf terms. We therefore thresholded the considered labels by their population, ending up with a set of 1509 GO terms with a minimum population of 50 samples in the manually annotated SwissProt database <x-ref>apweiler2004uniprot</x-ref>. As the hierarchy from a leaf node towards the root is fully inferable, we further excluded all non-leaf nodes from the 1509-node sub-DAG, leaving 886 GO-terms.<br>
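       The label selection boils down to two filters over the molecular-function sub-DAG. The snippet below is an illustrative sketch on a toy parent map; the GO identifiers, sample counts and variable names are placeholders, not the real annotation data.<br>
<pre>
# Illustrative label selection: keep GO terms with at least MIN_SAMPLES
# annotated SwissProt sequences, then drop every term that is not a leaf.
# The parent map and the counts below are toy values.
MIN_SAMPLES = 50

go_parents = dict()   # term -> parents in the molecular function DAG
go_parents["GO:0003824"] = []                # catalytic activity
go_parents["GO:0016787"] = ["GO:0003824"]    # hydrolase activity
go_parents["GO:0008800"] = ["GO:0016787"]    # beta-lactamase activity (leaf)

term_counts = dict()  # term -> number of annotated SwissProt sequences
term_counts["GO:0003824"] = 120000
term_counts["GO:0016787"] = 30000
term_counts["GO:0008800"] = 900

# 1) population threshold
populated = set(t for t, n in term_counts.items() if n >= MIN_SAMPLES)

# 2) keep only leaves: a leaf never appears as a parent of another kept term
parents_of_kept = set(p for t in populated for p in go_parents.get(t, []))
labels = sorted(populated - parents_of_kept)
print(labels)   # ['GO:0008800'] in this toy example
</pre>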

      }}}}

       <h2>Data preprocessing</h2>
{{Heidelberg/templateus/Contentsection|
      {{#tag:html|
 
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17) <x-ref>apweiler2004uniprot</x-ref>. For the classification task on 886 GO-labels we generated a dataset containing 180774 sequences for SwissProt and ~7 million sequences for UniProt, respectively. The dataset was then split into a training and a validation set, where the validation set was composed of at least 5 distinct sequences per GO-term and the training set comprised the remaining sequences.<br>
       Subsequently, all sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded; the average sequence had 1.3 labels assigned.<br>
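       A minimal numpy sketch of this encoding step; the alphabet ordering and the handling of non-canonical residues are assumptions, and the label encoding is omitted.<br>
<pre>
# Illustrative one-hot encoding of a protein sequence, clipped or zero-padded
# to a fixed window of 1000 residues.
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"      # 20 canonical amino acids
AA_INDEX = dict((aa, i) for i, aa in enumerate(ALPHABET))
WINDOW = 1000

def one_hot(sequence, window=WINDOW):
    encoded = np.zeros((window, len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:window]):   # clip long sequences
        if aa in AA_INDEX:                         # skip non-canonical residues
            encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded                                 # short sequences stay zero-padded

x = one_hot("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(x.shape)   # (1000, 20)
</pre>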
      }}}}

       <h2>Results</h2>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|

The performance of the network was assessed on an exclusive validation set of 4425 sequences. For each GO-label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8%, with an average F1 score of 78% (Figure 3).<br>
 
       {{Heidelberg/templateus/Imagesection|
 
 
       https://static.igem.org/mediawiki/2017/8/82/T--Heidelberg--2017_DeeProtein_metrics.svg  |
 
       Figure 3: Performance on the validation set after the completed training process.|
 
       A) The receiver operating characteristic (ROC) curve for DeeProtein ResNet30. The area under the ROC curve is 99%. B) The precision-recall curve for DeeProtein ResNet30.
 
       }}
 
}}}}

       <h2>Wet Lab Validation</h2>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|
 
To assess the value of DeeProtein in a sequence-activity evaluation context, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First we selected a set of 25 single and double mutant beta-lactamase variants predicted to score both higher and lower than the wildtype, and subsequently assessed their activity in the wet lab.<br>
       In order to derive a measure for enzyme activity, we investigated the minimum inhibitory concentration (MIC) of carbenicillin for all predicted mutants. The MIC was determined by OD600 measurements in carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements the MIC-score was calculated as the first carbenicillin concentration at which the OD fell below a threshold of 0.08. Next, the classification scores were averaged for each MIC-score and then plotted against the carbenicillin concentration (Figure 4).<br>
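       The MIC-score extraction from the OD600 readings reduces to a scan over the tested concentrations. The snippet below is a sketch with made-up readings; the real analysis was done on the plate-reader data.<br>
<pre>
# Illustrative MIC-score calculation: the first tested carbenicillin
# concentration at which the OD600 falls below the 0.08 threshold.
# The concentrations and OD values are made-up example numbers.
OD_THRESHOLD = 0.08

concentrations = [0, 25, 50, 100, 200, 400]            # ascending
od600_values   = [0.61, 0.58, 0.42, 0.11, 0.05, 0.04]  # one mutant

def mic_score(concs, ods, threshold=OD_THRESHOLD):
    for conc, od in zip(concs, ods):
        if od < threshold:
            return conc
    return None   # growth at all tested concentrations

print(mic_score(concentrations, od600_values))   # 200
</pre>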
 
       {{Heidelberg/templateus/Imagesection|
 
 
       https://static.igem.org/mediawiki/2017/a/a9/T--Heidelberg--2017_DeeProtein_ClassifierVSmic.svg |
 
       Figure 4: The DeeProtein classification score for beta-lactamases correlates with the minimum inhibitory concentration (MIC) of carbenicillin. |
 
       The average DeeProtein classification scores assigned to samples in the MIC-score bins are depicted as black dots. The red line is the fitted linear model. Samples assigned a high classification score tend to sustain higher carbenicillin concentrations, whereas a low classification score is assigned to variants with a low MIC.
 
       }}
 
}}}}

      <br>
       <h1>Protein Sequence Embedding</h1>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|
 
       A protein representation first described by Asgari et al. is prot2vec <x-ref>asgari2015continuous</x-ref>. The technique originates in natural language processing and is based on the word2vec model <x-ref>mikolov2013efficient</x-ref>, originally derived for vectorized word representations. Applied to proteins, a word is defined as a k-mer of 3 amino acid residues. A protein sequence can thus be represented as the sum over all internal k-mers. Interesting properties have been described in the resulting vector space, for example clustering of hydrophobic and hydrophilic k-mers and sequences <x-ref>asgari2015continuous</x-ref>. However, there are limitations to the prot2vec model, the most important being the loss of information on sequence order. This has been addressed by application of the continuous bag of words model with a paragraph embedding <x-ref>kimothi2016distributed</x-ref>. However, training is extremely slow here, as a protein sequence itself is embedded in the paragraph context, where a paragraph is a greater set of protein sequences (e.g. the SwissProt database). Furthermore, new protein sequences cannot be added to the embedding, as the paragraph context may not change.<br>
       Thus we aimed for an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. We therefore applied a word2vec model <x-ref>mikolov2013efficient</x-ref> on k-mers of length 3 with an embedding dimension of 100. As the quality of the representation estimate scales with the number of training samples, we trained our model on the whole UniProt database (Release 8/2017, <x-ref>apweiler2004uniprot</x-ref>), composed of over 87 million sequences, exceeding the training base of 324,018 sequences used by <x-ref>asgari2015continuous</x-ref>.<br>
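       Training such an embedding is straightforward with an off-the-shelf word2vec implementation. The gensim sketch below is illustrative: only the 3-mer length and the 100 dimensions follow the text, while the remaining hyperparameters and the two example sequences are placeholders.<br>
<pre>
# Illustrative gensim word2vec training on 3-mers.
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MSIQHFRVALIPFFAAFCLPVFA"]       # in practice: the whole UniProt
corpus = [to_kmers(seq) for seq in sequences]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, workers=4)
print(model.wv["KQR"])                        # 100-dimensional vector for one 3-mer
</pre>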
      }}}}

       <h2>Results</h2>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|

We visualized our 100-dimensional embedding through PCA dimensionality reduction, as shown in Figure 5. Highlighted in sequence are all k-mers containing a certain amino acid. Clear clusters can be observed for the amino acids cysteine (top right corner), lysine (top left corner), tryptophan (center right), glutamate (center left), proline (bottom) and arginine (center), even after dimensionality reduction. In contrast, amino acids like glycine, serine and valine are distributed over the whole k-mer space.<br>
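       The projection shown in Figure 5 can be reproduced in a few lines; the sketch below assumes the trained gensim model from the embedding section and highlights, as an example, all 3-mers containing a cysteine.<br>
<pre>
# Illustrative 2D PCA projection of the learned 3-mer vectors,
# assuming a trained gensim word2vec model named `model`.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

kmers = list(model.wv.index_to_key)        # all 3-mers in the vocabulary
vectors = model.wv[kmers]                  # shape: (n_kmers, 100)
coords = PCA(n_components=2).fit_transform(vectors)

# highlight every 3-mer that contains a cysteine
mask = np.array(["C" in kmer for kmer in kmers])
plt.scatter(coords[:, 0], coords[:, 1], s=2, c="lightgrey")
plt.scatter(coords[mask, 0], coords[mask, 1], s=2, c="red")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
</pre>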
 
       {{Heidelberg/templateus/Imagesection|
 
 
       https://static.igem.org/mediawiki/2017/7/71/T--Heidelberg--2017_170827mod-aas.gif |
 
       Figure 5: The 3-mer sequence space reduced to two dimensions by PCA. |
 
       Highlighted are all k-mers containing a certain amino acid, in sequential order. For the amino acids cysteine (top right corner), lysine (top left corner), tryptophan (center right), glutamate (center left), proline (bottom) and arginine (center), clear clusters are observable. Others, like glycine and serine, are distributed over the whole k-mer space.
 
       }}
 
      }}}}

       <h2>Application</h2>
{{Heidelberg/templateus/Contentsection|
{{#tag:html|

      Our k-mer embedding provides a great basis for exploring the protein space in future research. The embedding may be applied in classification, as demonstrated by <x-ref>asgari2015continuous</x-ref>, but also in the alteration of existing sequences. By exploiting the intrinsic properties of the vector space, a sequence, defined as a path through that vector space (namely by its start and end point), may be altered by exchanging vector components along that path for similar ones. For instance, if the k-mers 'AAG' and 'GAG' cluster closely in Figure 5 and the distance between their embedding vectors is small, our hypothesis is that they are exchangeable without large perturbations to the sequence, because the two k-mers likely occur in the same sequence context.<br>
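      This exchange hypothesis can be probed directly in the embedding. The sketch below again assumes the trained gensim model from above; the cosine-similarity calls are standard gensim functionality.<br>
<pre>
# Illustrative check of the exchange hypothesis: how close are two 3-mers
# in the learned embedding space? Assumes the trained gensim model `model`.
print(model.wv.similarity("AAG", "GAG"))     # cosine similarity in [-1, 1]

# candidate replacements for a 3-mer, ranked by cosine similarity
for kmer, score in model.wv.most_similar("AAG", topn=5):
    print(kmer, round(score, 3))
</pre>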

<h3>Neural Networks 101</h3>
<h4>1. Neural Network Basics</h4>
The idea of neural networks originates in the late 1950s, when Rosenblatt first described the perceptron as the functional unit of neural networks <x-ref>rosenblatt1958perceptron</x-ref> and subsequently its application in multilayered perceptrons (MLPs) <x-ref>rosenblatt1961principles</x-ref>. Most neural networks are feedforward neural networks, where the information flows from the input layer through the network until the last layer generates an output.<br>
In general, a neural network can be seen as a function approximator mapping an input to an output; in the case of a classifier, for instance:<br>

$$y = f(x)$$

In contrast to a "hand-designed" classification algorithm, a neural network learns the parameters \(\theta\) required for a successful classification:

$$y = f(x, \theta)$$

Considering the chain-like structure of neural networks:

$$y = f_3(f_2(f_1(x, \theta_1), \theta_2), \theta_3)$$

A single neuron thereby consists of a linear mapping followed by a non-linear activation function:

$$z_j = w_{ij} x_i + b_j$$
$$y_j = a(z_j)$$

In the context of a layer with multiple neurons:

$$ z_j = \sum_i w_{ij} x_i + b_j$$
$$ y_j = a(z_j)$$

The weights \(w\) and biases \(b\) thereby constitute the trainable parameters of the layer.
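A plain numpy version of these layer equations, with the sigmoid as one possible (assumed) choice for the activation \(a\):<br>
<pre>
# Plain numpy version of the layer equations above:
# z_j = sum_i w_ij * x_i + b_j, followed by an element-wise activation a(z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, W, b, activation=sigmoid):
    z = W @ x + b
    return activation(z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # 4 inputs
W = rng.normal(size=(3, 4))     # 3 neurons with 4 weights each
b = np.zeros(3)
print(dense_forward(x, W, b))   # 3 activations
</pre>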

<h4>2. Training - Backpropagation</h4>
Like other machine learning models, neural networks are trained with gradient-based methods <x-ref>Goodfellow-et-al-2016</x-ref>. The most common technique applied to artificial neural networks is error backpropagation with stochastic gradient descent (SGD). Here the network is shown small batches of the training data, and the trainable parameters are updated after each step with the goal of minimizing the error of a cost or loss function. A training cycle consists of a forward and a backward pass. In the forward pass the information flows from the input through the network, an output is generated and the error is estimated by computing the loss function. Subsequently, during the backward pass, the error is backpropagated through the network and the trainable parameters are updated to minimize the value of the loss function. Technically, the gradient of the loss function with respect to the weights and biases is computed and distributed through the network (see Figure [/]).<br>

      {{Heidelberg/templateus/Imagesection|
      https://static.igem.org/mediawiki/2017/a/a9/T--Heidelberg--2017_DeeProtein_ClassifierVSmic.svg |
Figure [/]: A simple forward and backward pass through a neural network. |

}}

  The most common cost or loss functions for classification tasks include the mean squared error (MSE) and the cross entropy (CE):
$$ MSE = \frac{1}{2}(y - \hat{y})^2 $$
$$ CE = -\sum_y y \log(\hat{y})$$
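
Written out for the smallest possible example, a single dense layer trained with the MSE loss, one SGD step looks as follows. This is a didactic sketch, not the DeeProtein training code:<br>
<pre>
# One forward and backward pass for a single linear layer with MSE loss,
# with the gradients written out by hand (didactic sketch).
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(1, 4))     # trainable weights
b = np.zeros(1)                 # trainable bias
lr = 0.1                        # learning rate

x = rng.normal(size=4)          # one training sample
y = np.array([1.0])             # its target

# forward pass: prediction and loss
y_hat = W @ x + b
loss = 0.5 * np.sum((y - y_hat) ** 2)

# backward pass: gradients of the loss w.r.t. the parameters
grad_y_hat = y_hat - y          # dL/dy_hat
grad_W = np.outer(grad_y_hat, x)
grad_b = grad_y_hat

# SGD update
W -= lr * grad_W
b -= lr * grad_b
print(loss)
</pre>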

<h4>Normal vs. Convolutional Nets</h4>
A convolutional neural network (CNN) is a special neural network architecture first described by Yann LeCun <x-ref>LeCun1990Handwritten</x-ref>. CNNs are specialized in analyzing spatial or time-series data with a grid-like topology. Instead of matrix multiplication, CNNs apply a mathematical operation called convolution in at least one of their layers <x-ref>Goodfellow-et-al-2016</x-ref>. Inspired by the structure of neurons in the visual cortex of mammals, CNNs have achieved tremendous results on various tasks like image, video, sound and language processing.<br>
A convolution is defined as:
    $$ s(t)=\int x(a)\,w(t-a)\,da $$
    where the input function \(x\) is smoothed by the weighting function (or kernel) \(w\), leading to the output \(s\). Typically a convolution is denoted with an asterisk:
    $$ s(t)=(x \ast w)(t) $$
    Applied to a two-dimensional discrete input, a convolution is described as:
    $$ S(i,j) = \sum_m \sum_n I(i-m,j-n)\,K(m,n) $$
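    The discrete formula translates directly into two nested sums. The following naive numpy sketch computes the 'valid' region only and ignores the efficient implementations used in practice:<br>
<pre>
# Naive implementation of the discrete 2D convolution S(i,j) defined above
# (valid region only, i.e. the kernel stays fully inside the image).
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    flipped = kernel[::-1, ::-1]     # convolution flips the kernel
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * flipped)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0.0, 1.0], [1.0, 0.0]])
print(conv2d(image, kernel))
</pre>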
<br>
A ConvNet in abstract representation could look like this:<br>
<br>
    $$\text{Input} \rightarrow \text{Conv1} \rightarrow \text{Conv2} \rightarrow \text{Conv3} \rightarrow \text{Fully Connected} \rightarrow \text{Fully Connected} \rightarrow \text{Output}$$<br>
<br>
    The information is propagated from the input through the convolutional and fully connected layers to generate an output. By repeated application of convolutions with different kernel sizes, the size of the image is reduced as the information proceeds through the network. The kernels are thereby modular and optimized during the training process. The result is a trainable feature extractor (Conv1-Conv3) whose output is processed by a small fully connected neural network.
    Another advantage is that CNNs rely on parameter sharing. As a kernel is often much smaller than the image it is applied to, and the same kernel is applied across the whole image, the number of parameters that needs to be stored is greatly reduced compared to conventional neural networks that rely on general matrix multiplication <x-ref>Goodfellow-et-al-2016</x-ref>.

<h3>More Resources</h3>
We provide a brief overview of what a neural network does in our <a href="">neuralnetworks101</a>. For a comprehensive explanation please rely on these resources:<br>
- <a href="http://colah.github.io/">colah's Blog</a> provides decent explanations of virtually all neural network architectures<br>
- <a href="https://adeshpande3.github.io/">Adit Deshpande's Blog</a> is a very good resource for introductory explanations and overviews of applications<br>
- <a href="http://www.deeplearningbook.org/">The Deep Learning Book</a> is a comprehensive work by three of the field's most prominent figures<br>
Practical tips are given in these resources:<br>
- <a href="http://cs229.stanford.edu/materials/ML-advice.pdf">Andrew Ng's tips</a> on practical implementation<br>
- <a href="https://github.com/Hvass-Labs/TensorFlow-Tutorials">Comprehensive TensorFlow tutorials</a> by the Hvass labs<br>
 
}}}}
 
 
}}
 
 
{{Heidelberg/references2}}
 
 
{{Heidelberg/footer}}
 
