(9 intermediate revisions by the same user not shown) | |||
Line 29: | Line 29: | ||
We define a residual block as a set of two convolutional layers, a convolutional layer with kernel size 1 to squeeze the channels <x-ref>Iandola2016SqueezeNet</x-ref> and a pooling layer. Further a batch normalisation layer is attached to every convolutional layer. The input of a residual block is added to its output after being resized by a pooling layer and padded on the channels. A schematic view on a residual block is depicted in figure 1, together with an overview of the overall architecture. The model consists of an input processing layer, 30 residual blocks and two independent convolutional outlayers.<br> | We define a residual block as a set of two convolutional layers, a convolutional layer with kernel size 1 to squeeze the channels <x-ref>Iandola2016SqueezeNet</x-ref> and a pooling layer. Further a batch normalisation layer is attached to every convolutional layer. The input of a residual block is added to its output after being resized by a pooling layer and padded on the channels. A schematic view on a residual block is depicted in figure 1, together with an overview of the overall architecture. The model consists of an input processing layer, 30 residual blocks and two independent convolutional outlayers.<br> | ||
{{Heidelberg/templateus/Imagesection| | {{Heidelberg/templateus/Imagesection| | ||
− | https://static.igem.org/mediawiki/2017/ | + | https://static.igem.org/mediawiki/2017/9/98/T--Heidelberg--2017_DeeProtein_ARch.png | |
Figure 1: The Architecture of DeeProtein ResNet30. | | Figure 1: The Architecture of DeeProtein ResNet30. | | ||
− | The one-hot encoded input sequence is embedded by the input processig layer, a 2D convolutional layer with kernel size [20, 1], resulting in a 1D output per sequence. Following the embedding processing layer all subsequent convolutions are 1 dimensional. Through the network we apply 13 pooling layers sizing the length dimension from 1000 positions to 1. At the same time the channel size increases with depth from 64 in the input processing layer to 512 in the last residual block. | + | The one-hot encoded input sequence is embedded by the input processig layer, a 2D convolutional layer with kernel size [20, 1], resulting in a 1D output per sequence. Following the embedding processing layer all subsequent convolutions are 1 dimensional. Through the network we apply 13 pooling layers sizing the length dimension from 1000 positions to 1. At the same time the channel size increases with depth from 64 in the input processing layer to 512 in the last residual block. A Residual block is composed of two 1-d convolutional layers with kernel size 3, and a 1-d 1x1 convolutional layer. Every convolutional layer has a batchnorm layer attached. The output of the 1x1 layer is then pooled with a stride of 2 and kernel size 2. Subsequently the input of the residual block is added to the output of the pooling layer in an element-wise addition (shortcut). |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | A Residual block is composed of two 1-d convolutional layers with kernel size 3, and a 1-d 1x1 convolutional layer. Every convolutional layer has a batchnorm layer attached. The output of the 1x1 layer is then pooled with a stride of 2 and kernel size 2. Subsequently the input of the residual block is added to the output of the pooling layer in an element-wise addition (shortcut). | + | |
}} | }} | ||
<h2>Label selection</h2> | <h2>Label selection</h2> | ||
Line 49: | Line 44: | ||
<h3>Training</h3> | <h3>Training</h3> | ||
− | The | + | The model was trained using the Adagrad optimizer <x-ref>duchi2011adaptive</x-ref> on an NVIDIA Tesla K80 GPU for a total of 4.8 epochs (600,000 steps) with a mini-batch size of 64 and an initial learning rate of 0.01. We minimized the cross entropy between sigmoid logits and labels, while for metric calculations a label was considered predicted when its output value was higher than 0.5. During training the independent outlayers were updated alternately depending on their argmax. The signal of both outlayers was averaged for inference and the variance between the outlayers was considered as a metric for model uncertainty. Upon loss convergence the training was stopped and the model was reinitialized with the learning rate reduced by a factor of 10, down to a final learning rate of 0.001. |
<h2>Results</h2> | <h2>Results</h2> | ||
Line 81: | Line 76: | ||
Our kmer embedding provides a great base to explore the protein space for future research. The embedding may be applied in classification as demonstrated by <x-ref>asgari2015continuous</x-ref> but also in alternation of existing sequences. By exploiting the intrinsic properties of the vector space, a sequence defined as the path through that vector space (namely by its start and ending point) may be altered by exchanging vector components along that path to similar ones. For instance if the kmer 'AAG' and 'GAG' cluster closely in figure 3 and the distance between their embedding vectors is close, out hypothesis is they're exchangeable, without huge perturbations to the sequence because the two kmers are likely in the same sequence context. | Our kmer embedding provides a great base to explore the protein space for future research. The embedding may be applied in classification as demonstrated by <x-ref>asgari2015continuous</x-ref> but also in alternation of existing sequences. By exploiting the intrinsic properties of the vector space, a sequence defined as the path through that vector space (namely by its start and ending point) may be altered by exchanging vector components along that path to similar ones. For instance if the kmer 'AAG' and 'GAG' cluster closely in figure 3 and the distance between their embedding vectors is close, out hypothesis is they're exchangeable, without huge perturbations to the sequence because the two kmers are likely in the same sequence context. | ||
− | < | + | <h1 id="nn101">Neural Networks 101</h1> |
− | < | + | <h3>1. Neural Network Basics</h3> |
The idea of neural networks origins in the late 1960 where Rosenblatt et al. first descriped the perceptron as a the functional unit of neural networks <x-ref>rosenblatt1958perceptron</x-ref> and subsequently their application in mulitlayered perceptrons (MLPs) <x-ref>rosenblatt1961principles</x-ref>. Most neural networks are feedforward neural networks where the information flows from the input layer through the networks until the last layer generates an output.<br> | The idea of neural networks origins in the late 1960 where Rosenblatt et al. first descriped the perceptron as a the functional unit of neural networks <x-ref>rosenblatt1958perceptron</x-ref> and subsequently their application in mulitlayered perceptrons (MLPs) <x-ref>rosenblatt1961principles</x-ref>. Most neural networks are feedforward neural networks where the information flows from the input layer through the networks until the last layer generates an output.<br> | ||
In general a neural network can be seen as a function approximator mapping an input to an output function, in terms of a classifier for instance:<br> | In general a neural network can be seen as a function approximator mapping an input to an output function, in terms of a classifier for instance:<br> | ||
Line 108: | Line 103: | ||
The weights \(w\) and biases \(b\) thereby describe the trainable parameters of the layer. | The weights \(w\) and biases \(b\) thereby describe the trainable parameters of the layer. | ||
− | < | + | <h3>2. Training - Backpropagation</h3> |
Like other machine learning models neural networks are trained with gradient based methods <x-ref>Goodfellow-et-al-2016</x-ref>. The most common technique that is applied on artificial neural networks is error backpropagation with stochastic gradient descent (SGD). Here the network is shown small batches of the training data and the trainable parameters are updated after each step with the goal to minimize the error of a cost or loss function. A training cycle consists of a forward and backward pass. In the forward pass the information flows from input through the network, an output is generated and the error is estimated by computing the loss function. Subsequently, during the backward pass, the error is backpropagated through the network and the trainable parameters of the network are updated to minimize the value of the loss function. Technically the gradient of the loss function with respect to the wights and biases is computed and distributed throught the network (see Fig. 6).<br> | Like other machine learning models neural networks are trained with gradient based methods <x-ref>Goodfellow-et-al-2016</x-ref>. The most common technique that is applied on artificial neural networks is error backpropagation with stochastic gradient descent (SGD). Here the network is shown small batches of the training data and the trainable parameters are updated after each step with the goal to minimize the error of a cost or loss function. A training cycle consists of a forward and backward pass. In the forward pass the information flows from input through the network, an output is generated and the error is estimated by computing the loss function. Subsequently, during the backward pass, the error is backpropagated through the network and the trainable parameters of the network are updated to minimize the value of the loss function. Technically the gradient of the loss function with respect to the wights and biases is computed and distributed throught the network (see Fig. 
6).<br> | ||
{{Heidelberg/templateus/Imagesection| | {{Heidelberg/templateus/Imagesection| | ||
https://static.igem.org/mediawiki/2017/7/7f/T--Heidelberg--2017_DeeProtein_NERUALNET.svg | | https://static.igem.org/mediawiki/2017/7/7f/T--Heidelberg--2017_DeeProtein_NERUALNET.svg | | ||
− | Figure 6: A | + | Figure 6: A forward and backward pass through a neural network. | |
− | + | A forward pass through a neural network with one hidden layer (left). First the input for each neuron is computed as a weighted sum of the outputs of the previous layer \(y\) or in case of the first hidden layer, the input \(x_{i}\) with an added trainable bias term. Next the \(z\) is squished through a non-linear function (activation) and the output of a layer \(y\) is obtained. The last layer (outlayer) is then compared to the target in order to estimate the error of the model in the loss-function \(L\). This initializes the backward pass (right). Here the gradients of the loss-function are backpropagated through the network by application of the chain rule. In order to do so, the error derivatives in each unit are calculated with respect to each units output, denoted as a weighted sum of the derivatives with respect to the previous layers inputs \(z_{l}\). By multiplication with the gradient of the activation function \(\frac{\partial y_{z} }{\partial z}\) the gradient with respect to a layers output is converted in a gradient with respect to a layers input \(z_{l-1}\). Once the loss is known, the error derivatives for each weight can be computed as \( y_{l} \frac{\partial L}{\partial z_{l+1} }\). Subsequently all weights are updated by their gradient value multiplied with a learning rate. Then the next forward pass is performed.}} | |
− | }} | + | <br> |
− | + | The most common cost or loss functions for classification tasks include the mean squared error (MSE) and the cross entropy (CE): | |
$$ MSE = \frac{1}{2}(y - \hat{y})^2 $$ | $$ MSE = \frac{1}{2}(y - \hat{y})^2 $$ | ||
$$ CE = -\sum_y y \log(\hat{y})$$ | $$ CE = -\sum_y y \log(\hat{y})$$ | ||
− | < | + | <h3>3. Fully Connected vs. Convolutional Nets</h3> |
− | A convolutional neural network (CNN) is a special neural network architecture first described by Yann LeCun <x-ref>LeCun1990Handwritten</x-ref>. CNNs are specialized on analyzing spatial or time-series data, with a grid like topology. Instead of matrix multiplication, CNNs apply a mathematical operation called | + | A convolutional neural network (CNN) is a special neural network architecture first described by Yann LeCun <x-ref>LeCun1990Handwritten</x-ref>. CNNs are specialized on analyzing spatial or time-series data, with a grid like topology. Instead of matrix multiplication, CNNs apply a mathematical operation called convolution in at least one of their layers <x-ref>Goodfellow-et-al-2016</x-ref>. Inspired by the structure of neurons on the visual cortex of mammals CNNs, have achieved tremendous results on various tasks like image, video, sound and language processing.<br> |
A convolution is defined as: | A convolution is defined as: | ||
$$ s(t)=\int x(a)w(t-a) da $$ | $$ s(t)=\int x(a)w(t-a) da $$ |
Latest revision as of 22:14, 14 December 2017
DeeProtein
Deep learning for protein sequences
Introduction
While the idea of applying a stack of layers composed of computational nodes to estimate complex functions originated in the 1960s
Background
Artificial neural networks are powerful function approximators, able to untangle complex relations in the input space
Protein representation learning
The protein space is extremely complex. The amino acid alphabet has 20 basic letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs
Also, handwritten feature extractors exist for protein sequences
Network Architecture
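To harness the strengths of convolutional networks in representation learning and feature extraction we implemented a fully convolutional architecture to classify protein sequences to functions based on the ResNet architecture.
We define a residual block as a set of two convolutional layers, a convolutional layer with kernel size 1 to squeeze the channels and a pooling layer. Furthermore, a batch normalisation layer is attached to every convolutional layer. The input of a residual block is added to its output after being resized by a pooling layer and padded on the channels. A schematic view of a residual block is shown in Figure 1, together with an overview of the overall architecture. The model consists of an input processing layer, 30 residual blocks and two independent convolutional outlayers.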
Label selection
To harness the strengths of convolutional networks in representation learning and feature extraction we implemented a fully convolutional architecture to classify protein sequences to functions. Function labels were thereby defined by the gene ontology (GO) annotation
Thus, we thresholded the considered labels based on their minimum population, ending with a set of 1509 GO terms with a minimum population of 50 samples when considering the manually annotated SwissProt database
Data preprocessing
In order to convert the protein sequences into a machine-readable format we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17)
During data preprocessing the full GO annotation was inferred through the DAG from the annotated GO terms to facilitate the subsequent preprocessing steps. Sequences were then filtered for a minimum length of 175 amino acids and sequences containing non-canonical amino acids were excluded.
To ensure a random distribution of sequences over the validation and training sets for each label, and at the same time account for the extreme class imbalance among the GO terms, the validation set was created on the fly during generation of the training set. This was done by randomly sampling sequences from the preprocessing streams, ensuring that the validation set contained at least 5 sequences per GO term. Training and validation sets were mutually exclusive.
Prior to training, the generated datasets were converted to a binary file format to speed up the feed streams for the GPUs. Sequences were one-hot encoded and clipped or zero-padded to a window of 1000 residues. The labels were also one-hot encoded. In the UniProt dataset the average sequence had 1.3 labels assigned.
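The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `keep_sequence` and the constant names are hypothetical, and the canonical alphabet is assumed to be the 20 standard one-letter amino acid codes.

```python
# Assumption: "canonical" means the 20 standard one-letter amino acid codes.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 175  # minimum sequence length stated in the text


def keep_sequence(seq, min_length=MIN_LENGTH):
    """Return True if seq is long enough and contains only canonical residues."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= CANONICAL_AMINO_ACIDS


if __name__ == "__main__":
    print(keep_sequence("M" * 200))        # long enough, canonical residues only
    print(keep_sequence("M" * 100))        # too short
    print(keep_sequence("M" * 200 + "U"))  # contains selenocysteine (U)
```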
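As an illustration of the encoding step, the following sketch one-hot encodes a sequence and clips or zero-pads it to a fixed window. The ordering of the amino acids along the first axis and the name `one_hot_encode` are assumptions made for this example; the binary file format itself is not reproduced here.

```python
# Assumption: alphabetical ordering of the 20 canonical one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 1000  # window length stated in the text


def one_hot_encode(seq, window=WINDOW):
    """Return a 20 x window matrix (list of rows) of zeros and ones.

    Sequences longer than the window are clipped; shorter sequences
    leave the remaining columns at zero (zero padding).
    """
    matrix = [[0] * window for _ in AMINO_ACIDS]
    for pos, aa in enumerate(seq[:window]):
        matrix[AA_INDEX[aa]][pos] = 1
    return matrix


if __name__ == "__main__":
    encoded = one_hot_encode("ACD", window=5)
    print(encoded[0])  # row for alanine
    print(encoded[1])  # row for cysteine
```

Each column of the resulting matrix contains a single one for a real residue and only zeros for a padded position.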
Training
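The model was trained using the Adagrad optimizer on an NVIDIA Tesla K80 GPU for a total of 4.8 epochs (600,000 steps) with a mini-batch size of 64 and an initial learning rate of 0.01. We minimized the cross entropy between sigmoid logits and labels, while for metric calculations a label was considered predicted when its output value was higher than 0.5. During training the independent outlayers were updated alternately depending on their argmax. The signal of both outlayers was averaged for inference and the variance between the outlayers was considered as a metric for model uncertainty. Upon loss convergence the training was stopped and the model was reinitialized with the learning rate reduced by a factor of 10, down to a final learning rate of 0.001.
Results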
The performance of the network was assessed on an exclusive validation set of 4425 sequences. For each GO label the validation set contained at least 5 distinct samples. Our model achieved an area under the curve (AUC) for the receiver operating characteristic (ROC) of 99.8% with an average F1 score of 78% (Figure 3).
Wet Lab Validation
To assess the value of DeeProtein in the context of sequence activity evaluation, we validated the correlation between the DeeProtein classification score and enzyme activity in the wet lab. First we predicted a set of 25 single and double mutant beta-lactamase variants with both higher and lower scores than the wild type and subsequently assessed their activity in the wet lab.
In order to derive a measure for enzyme activity we investigated the minimum inhibitory concentration (MIC) of carbenicillin for all predicted mutants. The MIC was determined by OD600 measurement in carbenicillin-containing media. As the OD was measured in a 96-well plate, the values are not absolute. From the measurements the MIC score was calculated as the first carbenicillin concentration at which the OD fell below a threshold of 0.08. Next the classification scores were averaged for each MIC score and then plotted against the carbenicillin concentration (Figure 2).
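The MIC scoring rule and the subsequent averaging can be sketched as follows. The function names and the example numbers are hypothetical and do not reproduce measured data; "first concentration" is assumed to mean the lowest concentration in ascending order.

```python
from collections import defaultdict

OD_THRESHOLD = 0.08  # threshold stated in the text


def mic_score(concentrations, od_values, threshold=OD_THRESHOLD):
    """Return the lowest concentration whose OD600 is below the threshold.

    Returns None if the OD never falls below the threshold.
    """
    for conc, od in sorted(zip(concentrations, od_values)):
        if od < threshold:
            return conc
    return None


def mean_score_per_mic(records):
    """Average classification scores for each MIC score.

    records is an iterable of (mic_score, classification_score) pairs.
    """
    grouped = defaultdict(list)
    for mic, score in records:
        grouped[mic].append(score)
    return {mic: sum(scores) / len(scores) for mic, scores in grouped.items()}


if __name__ == "__main__":
    # Made-up illustrative values, not measurements from the project.
    print(mic_score([50, 100, 200, 400], [0.50, 0.30, 0.07, 0.05]))
    print(mean_score_per_mic([(100, 0.2), (100, 0.4), (200, 0.9)]))
```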
Protein Sequence Embedding
A protein representation first described by Asgari et al. is prot2vec
Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. Therefore we applied a word2vec model
Results
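We visualized our 100-dimensional embedding through PCA dimensionality reduction as shown in Fig. 5. Highlighted in sequence are all kmers containing a certain amino acid. Clear clusters can be observed for the amino acids cysteine (top right corner), lysine (top left corner), tryptophan (center right), glutamate (center left), proline (bottom) and arginine (center), even after dimensionality reduction. In contrast, kmers containing amino acids like glycine, serine and valine are distributed over the whole kmer space.
Application
Our kmer embedding provides a great base to explore the protein space in future research. The embedding may be applied in classification, as demonstrated by Asgari et al., but also in the alteration of existing sequences. By exploiting the intrinsic properties of the vector space, a sequence defined as a path through that vector space (namely by its start and end point) may be altered by exchanging vector components along that path for similar ones. For instance, if the kmers 'AAG' and 'GAG' cluster closely in Figure 3 and the distance between their embedding vectors is small, our hypothesis is that they are exchangeable without large perturbations to the sequence, because the two kmers likely occur in the same sequence context.
Neural Networks 101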
1. Neural Network Basics
The idea of neural networks originated in the late 1950s, when Rosenblatt first described the perceptron as the functional unit of neural networks and subsequently its application in multilayer perceptrons (MLPs). Most neural networks are feedforward neural networks where the information flows from the input layer through the network until the last layer generates an output.
In general a neural network can be seen as a function approximator mapping an input to an output, in terms of a classifier for instance:
$$y = f(x)$$
In contrast to a "hand-designed" classification algorithm, a neural network learns the parameters \(\theta\) required for a successful classification:
$$y = f(x, \theta)$$
And when we consider the chain-like structure of neural networks:
$$y = f_3(f_2(f_1(x, \theta_1), \theta_2), \theta_3)$$
A single neuron thereby consists of a linear mapping and a non-linear activation function \(a\) applied after the linear mapping. For a single input \(x_i\):
$$z_j = w_{ij} x_i + b_j$$
$$y_j = a(z_j)$$
In the context of a layer with multiple inputs and neurons:
$$ z_j = \sum_i w_{ij} x_i + b_j$$
$$ y_j = a(z_j)$$
The weights \(w\) and biases \(b\) thereby describe the trainable parameters of the layer.
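The layer equations can be made concrete with a small sketch in plain Python. The sigmoid is chosen here only as one example of a non-linear activation; the function names and the weight layout (`weights[i][j]` connects input `i` to neuron `j`) are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation, used here as an example non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(x, weights, biases, activation=sigmoid):
    """Forward pass of one fully connected layer.

    For every neuron j the weighted sum z_j of all inputs plus the bias
    b_j is computed, and the activation is applied to obtain y_j.
    """
    outputs = []
    for j, bias in enumerate(biases):
        z = sum(weights[i][j] * x[i] for i in range(len(x))) + bias
        outputs.append(activation(z))
    return outputs


if __name__ == "__main__":
    x = [1.0, 2.0]
    weights = [[0.5, -1.0],
               [-0.25, 0.5]]  # 2 inputs, 2 neurons
    biases = [0.0, 0.0]
    print(dense_forward(x, weights, biases))
```

Stacking several such calls, each with its own weights and biases, gives the chain-like structure \(f_3(f_2(f_1(x)))\) described above.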
2. Training - Backpropagation
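Like other machine learning models, neural networks are trained with gradient-based methods. The most common technique applied to artificial neural networks is error backpropagation with stochastic gradient descent (SGD). Here the network is shown small batches of the training data and the trainable parameters are updated after each step with the goal of minimizing a cost or loss function. A training cycle consists of a forward and a backward pass. In the forward pass the information flows from the input through the network, an output is generated and the error is estimated by computing the loss function. Subsequently, during the backward pass, the error is backpropagated through the network and the trainable parameters are updated to minimize the value of the loss function. Technically, the gradient of the loss function with respect to the weights and biases is computed and distributed through the network (see Fig. 6).
The most common cost or loss functions for classification tasks include the mean squared error (MSE) and the cross entropy (CE):
$$ MSE = \frac{1}{2}(y - \hat{y})^2 $$
$$ CE = -\sum_y y \log(\hat{y})$$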
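Both loss functions are short enough to write out directly. In this sketch the cross entropy is written with the conventional leading minus sign, so that the loss is non-negative and decreases as the prediction improves; the small constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mse(y, y_hat):
    """Squared error for a single target y and prediction y_hat."""
    return 0.5 * (y - y_hat) ** 2


def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy between a target distribution y and predictions y_hat.

    y and y_hat are sequences of equal length; eps guards against log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))


if __name__ == "__main__":
    print(mse(1.0, 0.0))
    print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

For a one-hot target only the term of the true class contributes, so the cross entropy reduces to the negative logarithm of the probability assigned to that class.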
3. Fully Connected vs. Convolutional Nets
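A convolutional neural network (CNN) is a special neural network architecture first described by Yann LeCun. CNNs are specialized in analyzing spatial or time-series data with a grid-like topology. Instead of matrix multiplication, CNNs apply a mathematical operation called convolution in at least one of their layers. Inspired by the structure of neurons in the visual cortex of mammals, CNNs have achieved tremendous results on various tasks like image, video, sound and language processing.
A convolution is defined as:
$$ s(t)=\int x(a)w(t-a) da $$
where the input function \(x\) is smoothed by the weighting function (or kernel) \(w\), leading to the output \(s\). Typically a convolution is denoted with an asterisk:
$$ s(t)=(x \ast w)(t) $$
Applied to a two-dimensional discrete input, a convolution is described as:
$$ S(i,j) = \sum_m \sum_n I(i-m,j-n)K(m,n) $$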
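The discrete two-dimensional formula can be sketched in plain Python. This version is restricted to the "valid" region, that is, it only evaluates positions where the kernel lies completely inside the input, and it shifts the output indices so that they start at zero; both choices, like the function name, are assumptions for this example. The kernel is flipped, as the formula requires, which distinguishes a true convolution from the cross-correlation that many deep learning libraries compute under the same name.

```python
def conv2d_valid(image, kernel):
    """True 2D convolution (flipped kernel) over the valid region.

    image and kernel are lists of rows. The output has shape
    (H - kh + 1) x (W - kw + 1).
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            total = 0
            for m in range(kh):
                for n in range(kw):
                    # Index offset so that I(i - m, j - n) stays inside the image.
                    total += image[i + kh - 1 - m][j + kw - 1 - n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    box = [[1, 1],
           [1, 1]]
    print(conv2d_valid(image, box))  # sums over every 2 x 2 patch
```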
A ConvNet in abstract representation could look like this:
$$Input \rightarrow Conv \rightarrow Conv2 \rightarrow Conv3 \rightarrow Fully Connected \rightarrow Fully Connected \rightarrow Output$$
The information is propagated from the input through the convolutional layers and fully connected layers to generate an output. By repeated application of convolutions with different kernel sizes, the size of the image is reduced as the information proceeds through the network. The kernels are thereby modular and are optimized during the training process. The result is a trainable feature extractor (Conv1-Conv3) whose output can be examined by a small fully connected neural network. Another advantage is that CNNs rely on parameter sharing. As a kernel is often much smaller than the image it is applied to, and the same kernel is applied across the whole image, the number of parameters that need to be stored is greatly reduced compared to conventional neural networks that rely on general matrix multiplication
More Resources
We provide a brief overview of what a neural network does in our neuralnetworks101. For a comprehensive explanation please refer to these resources:
- colah's Blog provides decent explanations of virtually all neural network architectures
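The effect of parameter sharing can be illustrated by counting trainable parameters. The layer sizes below (1000 positions, 64 channels, kernel size 3) are borrowed from the architecture description purely as an example; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 1D convolutional layer.

    The count does not depend on the sequence length, because the same
    kernel is reused at every position.
    """
    return kernel_size * in_channels * out_channels + out_channels


if __name__ == "__main__":
    length, channels = 1000, 64
    # Fully connected mapping between two flattened 1000 x 64 feature maps.
    print(dense_params(length * channels, length * channels))
    # Convolution with kernel size 3 between the same two feature maps.
    print(conv1d_params(3, channels, channels))
```

Under these assumptions the fully connected layer needs roughly four billion parameters, while the convolutional layer needs about twelve thousand.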
- Adit Deshpande's Blog is a very good resource for introductory explanations and overviews on applications
- The Deep Learning Book is a comprehensive work by three of the field's most prominent figures
Practical tips are given in these resources:
- Andrew Ng's tips on practical implementation
- Comprehensive TensorFlow tutorials by Hvass Labs