{{Heidelberg/header
 
    }}
 
{{Heidelberg/navbar
 
    }}
 
{{Heidelberg/templateus/Mainbody|
 
    DeeProtein | Learning Proteins |
 
https://static.igem.org/mediawiki/2017/c/c1/T--Heidelberg--2017_Background_Tunnel.jpg|
 
  
    {{Heidelberg/templateus/AbstractboxV2|
 
DeeProtein - Deep Learning for proteins |
 
Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is mostly determined by structure, sequence-based classification is difficult, and manual feature extraction combined with conventional machine learning models did not yield satisfying results. With the advent of deep learning, and representation learning in particular, the obstacle of linking sequences to a function without additional structural information can be overcome.
 
Here we present DeeProtein, a deep convolutional neural network for multi-label classification of protein sequences into functional gene ontology terms. We trained our model on a subset of the UniProt database and achieved an area under the ROC curve (AUC) of 99% on our validation set.
 
 
https://static.igem.org/mediawiki/2017/8/88/T--Heidelberg--2017_modelling-graphical-abstract.svg
 
    }}
 
    {{Heidelberg/templateus/Contentsection|
 
        {{#tag:html|
 
            {{Heidelberg/templateus/Heading|
 
                Introduction
 
}}
 
 
<h1>Introduction</h1>
 
<h2>What deep learning is about</h2>
 
While the idea of applying a stack of layers composed of computational nodes to estimate complex functions originated in the late 1950s <x-ref>rosenblatt1958perceptron</x-ref>, it was not until the 1990s, when the first convolutional neural networks were introduced <x-ref>LeCun1990Handwritten</x-ref>, that artificial neural networks were successfully applied to real-world classification tasks. With the beginning of this decade and the massive increase in broadly available computing power, the advent of deep learning began. Groundbreaking work by Krizhevsky et al. on image classification <x-ref>Krizhevsky2012ImageNet</x-ref> paved the way for many applications in image, video, sound and natural language processing. There has also been successful work on biological and medical data <x-ref>alipanahi2015predicting</x-ref>, <x-ref>kadurin2017cornucopia</x-ref>.
 
 
            <h2>Powerful function approximator to untangle the complex relation between sequence and function</h2>
 
Artificial neural networks are powerful function approximators, able to untangle complex relations in the input data space. However, it was the introduction of convolutional neural networks <x-ref>LeCun1990Handwritten</x-ref> that made deep learning such a powerful method. Convolutional neural networks rely on trainable filters or kernels to extract the valuable information from the input space. The application of trainable kernels for feature extraction has been demonstrated to be extremely powerful in representation learning <x-ref>oquab2014learning</x-ref>, detection <x-ref>lee2009unsupervised</x-ref> and classification <x-ref>Krizhevsky2012ImageNet</x-ref> tasks. A convolutional neural network can thus extract the information present in the input space and encode the input in a compressed representation, rendering handcrafted feature extraction obsolete.
 
        }}
 
        {{#tag:html|
 
            <h1>Applied models and Architecture</h1>
 
            <h2>Protein representation learning</h2>
 
The protein space is extremely complex. The amino acid alphabet comprises 20 basic letters and an average protein has a length of about 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the protein distribution of a certain functionality would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in <i>de novo</i> protein sequence generation. Protein sequence classification has been attempted with CNNs <x-ref>szalkai2017near</x-ref> as well as with recurrent neural networks <x-ref>liu2017deep</x-ref> with good success, however without the possibility of generative modelling.
 
 
            To find the optimal feature representation of proteins we apply and test various representation techniques.
 
        }}
 
        {{#tag:html|
 
            <h2>Protein sequence embedding</h2>
 
A protein representation first described by Asgari et al. is prot2vec <x-ref>asgari2015continuous</x-ref>. The technique originates in natural language processing and is based on the word2vec model <x-ref>mikolov2013efficient</x-ref>, which was originally developed to derive vectorized word representations. Applied to proteins, a word is defined as a k-mer of 3 amino acid residues; a protein sequence can thus be represented as the sum over all its internal k-mers. Interesting properties have been described in the resulting vector space, for example the clustering of hydrophobic and hydrophilic k-mers and sequences <x-ref>asgari2015continuous</x-ref>. However, there are limitations to the prot2vec model, the most important being the loss of information on sequence order. This has been addressed by applying the continuous bag-of-words model with a paragraph embedding <x-ref>kimothi2016distributed</x-ref>. However, training is extremely slow in this setting, as a protein sequence is itself embedded in a paragraph context, where a paragraph is a larger set of protein sequences (e.g. the SwissProt database). Furthermore, new protein sequences cannot be added to the embedding, as the paragraph context must not change.
 
We therefore aimed to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. To this end we applied a word2vec model <x-ref>mikolov2013efficient</x-ref> to k-mers of length 3 with an embedding dimension of 100. As the quality of the representation estimate scales with the number of training samples, we trained our model on the whole UniProt database (release 08/2017, <x-ref>apweiler2004uniprot</x-ref>), comprising over 87 million sequences.
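
A minimal sketch of this embedding step is shown below. It assumes gensim as the word2vec implementation (parameter names follow gensim 3.x), treats overlapping 3-mers as "words" and uses two toy sequences in place of the full UniProt corpus; it illustrates the approach rather than reproducing our production pipeline.

<pre>
# Sketch: word2vec embedding of protein sequences via overlapping 3-mers.
# Assumes gensim 3.x (use vector_size instead of size in gensim 4.x).
import numpy as np
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy corpus; in practice this iterates over all UniProt sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGAR"]
corpus = [to_kmers(seq) for seq in sequences]

# Train a 100-dimensional skip-gram embedding on the k-mer corpus.
model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=1)

# A protein is then represented as the sum over its k-mer vectors.
protein_vector = np.sum([model.wv[kmer] for kmer in corpus[0]], axis=0)
</pre>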
 
            }}
 
            {{#tag:html|
 
            <h3>Results</h3>
 
 
 
 
            }}
 
 
      {{#tag:html|
 
            <h2>Convolutional Approach</h2>
 
            The following subsections describe our fully convolutional classification approach: the selection of GO labels, the preprocessing of the sequence data, the network architecture and the training procedure.
 
            }}
 
      {{#tag:html|
 
            <h3>Label selection</h3>
 
To harness the strengths of convolutional networks in representation learning and feature extraction, we implemented a fully convolutional architecture to classify protein sequences by function. Function labels were defined by the gene ontology (GO) annotation <x-ref>gene2004gene</x-ref>. The gene ontology is hierarchical and can be described by a directed acyclic graph (DAG); it contains labels providing information on the cellular location, pathway and molecular function of a particular protein. As we were interested solely in protein function classification, we considered only GO terms in the molecular function sub-DAG. The molecular function sub-DAG has up to 12 levels and XXXXXX GO terms, YYY of which are leaf nodes. The population of terms varies greatly and strongly depends on a term's level in the DAG, with terms towards the root being more densely populated than leaf terms. We therefore thresholded the considered labels by a minimum population, ending up with a set of 1509 GO terms with a minimum population of 50 samples in the manually annotated SwissProt database <x-ref>apweiler2004uniprot</x-ref>. As the hierarchy from a leaf node towards the root is fully inferable, we further excluded all non-leaf nodes from the 1509-node sub-DAG, ending up with 886 GO terms.
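
The label selection can be sketched as follows. The data structures used here (a mapping from GO term to annotated SwissProt accessions and a child table of the molecular function sub-DAG) are illustrative assumptions rather than the actual DeeProtein preprocessing code.

<pre>
# Sketch of the label selection: keep GO terms with at least 50 annotated
# sequences, then drop every term that still has a populated child, since
# non-leaf annotations are inferable from the leaves below them.
MIN_POPULATION = 50

def select_labels(term_to_proteins, children):
    populated = {t for t, prots in term_to_proteins.items()
                 if len(prots) >= MIN_POPULATION}
    leaf_terms = {t for t in populated
                  if not any(c in populated for c in children.get(t, ()))}
    return leaf_terms

# Toy example: GO:B is a child of GO:A, so only the leaf term GO:B survives.
terms = {"GO:A": set(range(120)), "GO:B": set(range(80)), "GO:C": set(range(10))}
dag_children = {"GO:A": ["GO:B"]}
print(select_labels(terms, dag_children))  # {'GO:B'}
</pre>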
 
            }}
 
      {{#tag:html|
 
            <h3>Data preprocessing</h3>
 
In order to convert the protein sequences into a machine-readable format, we preprocessed the whole UniProt database (release 08/17) as well as the SwissProt database (release 08/17) <x-ref>apweiler2004uniprot</x-ref>. For the classification task on 886 GO labels we generated a dataset containing xxxx sequences for SwissProt and xxxx sequences for UniProt, respectively. The dataset was then split into a training and a validation set, where the validation set received 5 sequences per GO term and the training set the remaining sequences.
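
As an illustration of this preprocessing, the sketch below one-hot encodes a sequence into the 20x1000 input format used by the network described in the next section and holds out 5 sequences per GO term for validation; the variable names and the padding strategy are assumptions made for the example.

<pre>
# Sketch: one-hot encode a protein sequence (20 amino acids x 1000 positions)
# and split the dataset into training and validation sets.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 1000

def one_hot_encode(sequence):
    encoded = np.zeros((len(AMINO_ACIDS), MAX_LEN), dtype=np.float32)
    for pos, aa in enumerate(sequence[:MAX_LEN]):
        if aa in AA_INDEX:  # non-canonical residues stay as all-zero columns
            encoded[AA_INDEX[aa], pos] = 1.0
    return encoded

def split_per_term(term_to_sequences, n_valid=5):
    """Hold out n_valid sequences per GO term for validation."""
    train, valid = {}, {}
    for term, seqs in term_to_sequences.items():
        valid[term] = seqs[:n_valid]
        train[term] = seqs[n_valid:]
    return train, valid
</pre>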
 
    }}
 
      {{#tag:html|
 
            <h3>Network Architecture</h3>
 
We implemented a residual learning network <x-ref>he2016deep</x-ref> with 1x1 convolutional layers in the residual blocks to reduce the overall parameter count while retaining depth <x-ref>Iandola2016SqueezeNet</x-ref>. Our fully convolutional architecture is depicted in Fig [/], along with the composition of a residual block. First, the 20x1000x1 one-hot encoded input is embedded into one-dimensional vectors on 64 channels. Subsequently, the deep residual network consisting of 20 blocks extracts features, which are then passed to two independent outlayers. On its way through the network, the signal is pooled down to a length dimension of 1 with a width of 512 channels prior to the outlayers.
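
The sketch below illustrates one such residual block with 1x1 bottleneck convolutions in Keras. The kernel size, the bottleneck width and the plain stacking of 20 identical blocks are simplifying assumptions for the example; the figure, not this snippet, defines the exact architecture.

<pre>
# Sketch of a 1D residual block with 1x1 bottleneck convolutions (Keras).
from tensorflow.keras import layers

def residual_block(x, channels, kernel_size=3):
    shortcut = x
    # 1x1 convolution squeezes the channel dimension to keep parameters low
    y = layers.Conv1D(channels // 2, 1, padding="same", activation="relu")(x)
    # kxk convolution extracts local sequence features on the reduced channels
    y = layers.Conv1D(channels // 2, kernel_size, padding="same", activation="relu")(y)
    # 1x1 convolution expands back to the block's channel width
    y = layers.Conv1D(channels, 1, padding="same")(y)
    # residual connection: add the block input back onto the output
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

# Embed the one-hot input (1000 positions x 20 amino acids) to 64 channels,
# then stack residual blocks; pooling down to length 1 is omitted here.
inputs = layers.Input(shape=(1000, 20))
x = layers.Conv1D(64, 1, padding="same", activation="relu")(inputs)
for _ in range(20):
    x = residual_block(x, 64)
</pre>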
 
    }}
 
      {{#tag:html|
 
            <h3>Training</h3>
 
            The model was trained using the Adagrad optimizer <x-ref>duchi2011adaptive</x-ref> with an initial learning rate of 0.01 and a mini-batch size of 64 samples, minimizing the cross entropy between sigmoid logits and labels. A label was considered predicted when its output value exceeded 0.5. During training, the two independent outlayers were updated alternately depending on their argmax. For inference, the signal of both outlayers was averaged, and the variance between the outlayers was considered a metric for model uncertainty.
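
            A hedged sketch of this training setup in Keras is given below; for simplicity the two outlayers are trained jointly on the same loss rather than alternately, and the disagreement between the heads stands in for the uncertainty metric.

<pre>
# Sketch: two-headed multi-label classifier trained with Adagrad (lr 0.01)
# and per-term sigmoid (binary) cross entropy.
import numpy as np
from tensorflow.keras import layers, models, optimizers

# 'pooled' stands for the 512-channel, length-1 output of the residual stack.
pooled = layers.Input(shape=(512,))
out_a = layers.Dense(886, activation="sigmoid", name="outlayer_a")(pooled)
out_b = layers.Dense(886, activation="sigmoid", name="outlayer_b")(pooled)
model = models.Model(pooled, [out_a, out_b])

model.compile(optimizer=optimizers.Adagrad(learning_rate=0.01),
              loss="binary_crossentropy")

def predict(model, x):
    """Average both heads, threshold at 0.5, report head disagreement."""
    probs_a, probs_b = model.predict(x)
    mean = (probs_a + probs_b) / 2.0
    disagreement = np.abs(probs_a - probs_b)
    return (mean > 0.5).astype(np.float32), disagreement
</pre>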
 
    }}
 
      {{#tag:html|
 
            <h2>Results</h2>
 
 
    }}
 
}}
 
{{Heidelberg/references2
 
    }}
 
{{Heidelberg/footer
 
    }}
 
