Difference between revisions of "Team:Heidelberg/SandboxTHORE1"

Line 4: Line 4:
 
     }}
 
     }}
 
{{Heidelberg/templateus/Mainbody|
 
{{Heidelberg/templateus/Mainbody|
     DeeProtein.|
+
     DeeProtein | Learning Proteins |
    Disentangling protein sequence space with artificial intelligence.|
+
https://static.igem.org/mediawiki/2017/c/c1/T--Heidelberg--2017_Background_Tunnel.jpg|
    https://static.igem.org/mediawiki/2017/c/c1/T--Heidelberg--2017_Background_Tunnel.jpg|
+
  
 
     {{Heidelberg/templateus/AbstractboxV2|
 
     {{Heidelberg/templateus/AbstractboxV2|
        Deeprotein|
+
DeeProtein - Deep Learning for proteins |
        With Interactive Modelling iGEM Heidelberg provides a comprehensive set
+
Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is mostly determined by structure, sequence-based classification is difficult, and manual feature extraction combined with conventional machine learning models has not yielded satisfying results. However, with the advent of deep learning, and especially representation learning, the obstacle of linking sequences to a function without further structural information can be overcome.
        of tools that not only help to facilitate the implementation of PACE but
+
Here we present DeeProtein, a deep convolutional neural network for multi-label protein sequence classification on functional gene ontology terms. We trained our model on a subset of the UniProt database and achieved an area under the ROC curve (AUC) of 99% on our validation set.
        also give an intuitive understanding of underlying mechanisms. To control
+
        highly complex processes such as PACE or PALE in a near-ideal way enables
+
        to exploit as much of its potential as possible. The most important
+
        parameters were determined and examined with ODE systems, solved
+
        analytically or numerically, [stochastic and
+
        distributional] models. As far as possible the models are available
+
        online to make them accessible to anyone interested. When useful, a [tool
+
        for comparison of experimental data and the model] is available.
+
        In addition the Interactive modelling helps to monitor parameters that
+
        cannot easily be interpreted from raw data, such as [] and combines
+
        different parameters to make useful statements about an experiment.|
+
  
        https://static.igem.org/mediawiki/2017/8/88/T--Heidelberg--2017_modelling-graphical-abstract.svg
+
https://static.igem.org/mediawiki/2017/8/88/T--Heidelberg--2017_modelling-graphical-abstract.svg
 
     }}
 
     }}
 
     {{Heidelberg/templateus/Contentsection|
 
     {{Heidelberg/templateus/Contentsection|
Line 30: Line 18:
 
             {{Heidelberg/templateus/Heading|
 
             {{Heidelberg/templateus/Heading|
 
                 Introduction
 
                 Introduction
            }}
+
}}
        iGEM Heidelberg provides a comprehensive set of models that allows for
+
        both control and evaluation of continuous and discontinuous directed
+
        evolution. The interactive models facilitate regular use of the models
+
        in everyday lab work and are easier to understand as they provide an
+
        intuitive understanding by enabling the user to observe how the model
+
        behaves when parameters are changed.
+
        Predictions from the models helped to design the novel method Predcel to
+
        be both reliable and time efficient.
+
        To get accurate modelling results for the used setup, a selection of
+
        parameters was determined experimentally and included in the models.
+
        As models for different levels of abstraction were needed, a variety of
+
        approaches from ordinary differential equations, delayed
+
        differential equations over stochastic simulations to molecular dynamics
+
        was applied to obtain valuable information on the different aspects of
+
        directed evolution.|
+
        color=blue
+
  
 +
            <h2>Deep Learning in general</h2>
 +
While the idea of applying a stack of layers composed of computational nodes to estimate complex functions has its origins in the 1960s <x-ref>rosenblatt1958perceptron</x-ref>, it was not until the 1990s, when the first convolutional neural networks were introduced <x-ref>LeCun1990Handwritten</x-ref>, that artificial neural networks were successfully applied to real-world classification tasks. With the beginning of this decade and the massive increase in broadly available computing power, the era of deep learning began. Groundbreaking work by Krizhevsky et al. in image classification <x-ref>Krizhevsky2012ImageNet</x-ref> paved the way for many applications in image, video, sound and natural language processing. There has also been successful work on biological and medical data <x-ref>alipanahi2015predicting</x-ref>, <x-ref>kadurin2017cornucopia</x-ref>.
  
             <h2>Modelling concentrations in one Lagoon</h2>
+
             <h2>Powerful function approximator to untangle the complex relation between sequence and function</h2>
            Here the concentrations \(c\) of uninfected <i>E. coli</i>, infected <i>E. coli</i> and phage producing <i>E. coli</i> as well as the <i>M13</i> phage are modelled. They are denoted with the subscripts \(_{u}\), \(_{i}\), \(_{p}\) and \(_{P}\). If the whole <i>E. coli</i> population is referred to, \(c_{E}\) is used. If an arbitrary <i> E. coli</i> population is meant, the subscript \(_{e}\) is used. The phage concentration \(c_{P}\) refers to the free phage only, phage that are contained in an <i>E. coli</i> they infected are not included.
+
Artificial neural networks are powerful function approximators, able to untangle complex relations in the input data space. However, it was the introduction of convolutional neural networks <x-ref>LeCun1990Handwritten</x-ref> that made deep learning such a powerful method. Convolutional neural networks rely on trainable filters, or kernels, to extract the valuable information from the input space. The application of trainable kernels for feature extraction has been demonstrated to be extremely powerful in representation learning <x-ref>oquab2014learning</x-ref>, detection <x-ref>lee2009unsupervised</x-ref> and classification <x-ref>Krizhevsky2012ImageNet</x-ref> tasks. A convolutional neural network can thus extract the information present in the input space and encode the input in a compressed representation. Hand-crafted feature extraction thus becomes obsolete.
            The used parameters include the time \(t\), the affinity of phage for <i>E. coli</i> \(k\), the duration between infection of an <i>E. coli</i> and the first phage leaving the <i>E. coli</i> \(t_{P}\). The three different <i>E. coli</i> populations each have a division time \(t\) that is denoted with their subscript. The fitness of a phage population is \(f\).
+
        }}
+
   
+
        {{Heidelberg/templateus/Tablebox|
+
            Table 1: Variables and Parameters used in this model |
+
            {{#tag:html|
+
                <table class="table table-bordered mdl-shadow--4dp" XSSCleaned="overflow-x: scroll !important">
+
                    <thead>
+
                        <tr>
+
                            <th>Symbol</th>
+
                            <th>Name in source code</th>
+
                            <th>Value and Unit</th>
+
                            <th>Explanation</th>
+
 
+
                        </tr>
+
                    </thead>
+
                    <tbody>
+
                        <tr>
+
                            <td>\(c \)</td>
+
                            <td>-</td>
+
                            <td>[cfu] or [pfu] </td>
+
                            <td>colony forming units for <i> E. coli</i> [cfu] or plaque forming units [pfu] for M13 phage</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _u\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for uninfected <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _i\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for infected <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _p\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for phage-producing <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _e\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for any one of the <i>E. coli</i> populations on its own</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _E\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for all populations of <i>E. coli</i> together</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( _P\)</td>
+
                            <td>-</td>
+
                            <td> - </td>
+
                            <td>Subscript for M13 phage</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(c_{c} \)</td>
+
                            <td><pre>capacity</pre></td>
+
                            <td>[cfu/ml]</td>
+
                            <td>Maximum concentration of <i>E. coli</i> possible under given conditions, important for logistic growth</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(t\)</td>
+
                            <td><pre>t</pre></td>
+
                            <td>[min]</td>
+
                            <td>Duration since the experiment modeled was started</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(t_{u} \)</td>
+
                            <td><pre>tu</pre></td>
+
                            <td>\(20\) min</td>
+
                            <td>Duration of one division of uninfected <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(t_{i} \)</td>
+
                            <td><pre>ti</pre></td>
+
                            <td>\(30\) min</td>
+
                            <td>Duration of one division of infected <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(t_{p} \)</td>
+
                            <td><pre>tp</pre></td>
+
                            <td>\(40\) min</td>
+
                            <td>Duration of one division of phage-producing <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( t_{P}\)</td>
+
                            <td><pre>tpp</pre></td>
+
                            <td>[min]</td>
+
                            <td>Duration between an <i>E. coli</i> being infected by an M13 phage and releasing the first new phage</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(g_{e} \)</td>
+
                            <td><pre>e_growth_rate</pre></td>
+
                            <td>[cfu/min]</td>
+
                            <td>Growth rate of <i>E. coli</i>, depending on the type of growth (either logistic or exponential), the current concentration \(c_{e}\), the maximum concentration \(c_{c}\), and the division time \(t_{e}\)</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( k\)</td>
+
                            <td><pre>k</pre></td>
+
                            <td>\(3 \cdot 10^{-11}\frac{1}{cfu \cdot pfu \cdot ml \cdot min}\)</td>
+
                            <td>Affinity of M13 phage for <i>E. coli</i></td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( \mu_{max}\)</td>
+
                            <td><pre>mumax</pre></td>
+
                            <td>\(16.67 \frac{cfu}{min \cdot ml \cdot cfu}\)</td>
+
                            <td>Wildtype M13 phage production rate</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\( f\)</td>
+
                            <td><pre>f</pre></td>
+
                            <td>?</td>
+
                            <td>Fitness value, ratio of the actual \(\mu\) to \(\mu_{max}\)</td>
+
                        </tr>     
+
                    </tbody>
+
                </table>
+
            }}|
+
            List of all parameters and variables used in this model. Where possible, values are given.
+
 
         }}
 
         }}
 
         {{#tag:html|
 
         {{#tag:html|
             Each term describing the change of an <i>E. coli</i> concentration contains its growth, \(g_{e}\). The growth rate of an <i>E. coli</i> population can be modelled by exponential growth or by logistic growth. Especially when long durations per lagoon are modelled, the logistic growth model is more exact [source].
+
             <h2>Applied models and Architecture</h2>
            In the exponential case the growth rate \(g_{e}\) is modelled as
+
            <h2>Protein representation learning</h2>
            $$
+
The protein sequence space is extremely complex. The amino acid alphabet comprises 20 standard letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the distribution of proteins with a certain function would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in <i>de novo</i> protein sequence generation. Attempts at protein sequence classification have been made with CNNs <x-ref>szalkai2017near</x-ref> as well as with recurrent neural networks <x-ref>liu2017deep</x-ref> with good success, however without the possibility of generative modelling.
            g_{e} (t_{e}) = c_{e} \cdot \frac{log(2)}{t_{e} }
+
            $$
+
            Note that the growth rate in the model increases over time, while in the modelled culture, the nutrient concentration decreases.  
+
           
+
            That makes the logistic model more plausible, it models \(g_{e}\) as
+
            $$
+
            g_{e} (t_{e}, \: c_{e}(t), \: c_{c}) = c_{e}(t) \cdot \frac{c_{c} - c_{e} (t)}{c_{c} } \cdot \frac{log(2)}{t_{e} }
+
            $$
+
            In this case the growth rate decreases as the current concentration \(c_{e}\) approaches the maximum capacity for <i>E. coli</i> in the given setup, \(c_{c}\). With this model \(c_{e} \leq c_{c}\) is true for any point in time.
+
 
+
  
             <b>Change of concentration of uninfected <i>E. coli</i>, \(\frac{\partial c_{u} }{\partial t} \: [cfu/min]\)</b>
+
             To find the optimal feature representation of proteins we apply and test various representation techniques.
            $$
+
            \frac{\partial c_{u} }{\partial t}(t) = g_{u} (t_{u}, \: c_{u}(t), \: c_{c})
+
            - k \cdot c_{u}(t) \cdot c_{P}(t)
+
            $$
+
            In addition to the growth term, the concentration of uninfected <i>E. coli</i> is described by an infection term that depends on the concentration of uninfected <i>E. coli</i> and the concentration of free phage and reduces the concentration of uninfected <i>E. coli</i>.
+
           
+
            <b>Change of concentration of infected <i>E. coli</i>, \(\frac{\partial c_{i} }{\partial t} \: [cfu/min]\)</b>
+
            $$
+
            \frac{\partial c_{i} }{\partial t}(t) = \begin{cases}
+
            g_{i} (t_{i},  \: c_{i}(t),  \:c_{c})
+
            + k \cdot c_{u}(t) \cdot c_{P}(t)
+
            - c_{i}(t - t_{P}),
+
            \quad \text{for} \: t > t_{P} \\
+
            g_{i} (t_{i}, \: c_{i}(t), \: c_{c})
+
            + k \cdot c_{u}(t) \cdot c_{P}(t),
+
            \quad \text{otherwise}
+
            \end{cases}
+
            $$
+
            Until \(t > t_{P}\), the concentration of infected <i>E. coli</i> increases by growth and by infection of previously uninfected <i>E. coli</i>. When \(t > t_{P}\), a third term is subtracted, describing infected <i>E. coli</i> turning into phage-producing <i>E. coli</i>.
+
           
+
            <b>Change of concentration of phage producing <i>E. coli</i>, \(\frac{\partial c_{p} }{\partial t} \: [cfu/min]\)</b>
+
           
+
            $$
+
            \frac{\partial c_{p} }{\partial t}(t) = \begin{cases}
+
            g_{p} (t_{p}, \: c_{p}(t), \: c_{c}) +
+
            c_{i}(t - t_{P}),
+
            \quad \text{for} \: t > t_{P} \\
+
            g_{p} (t_{p}, \: c_{p}(t), \: c_{c}),
+
            \quad \text{otherwise}
+
            \end{cases}
+
            $$
+
            The population of phage-producing <i>E. coli</i> only increases by growth until \(t > t_{P}\). When infected <i>E. coli</i> release their first phage, they turn into phage-producing <i>E. coli</i>, as described by the second term.
+
           
+
            <b>Change of concentration of <i>M13</i> phage, \(\frac{\partial c_{P} }{\partial t} \: [pfu/min]\)</b>
+
           
+
            $$
+
            \frac{\partial c_{P} }{\partial t}(t) = c_{p}(t) \cdot \mu_{max} \cdot f - k \cdot c_{u}(t)\cdot c_{P}(t)
+
            $$
+
            The phage concentration is only increased by phage that leave phage-producing <i>E. coli</i>, which happens at a rate of \(f \cdot \mu_{max}\) per time unit, with \(f\) being the fitness, a value between 0 and 1 that gives the phage's fitness as a share of the wildtype <i>M13</i> phage's fitness, and \(\mu_{max}\) being the wildtype phage's production rate. We assume that the only negative influence on the free phage titer is phage infecting <i>E. coli</i>, which depends on both the phage titer \(c_{P}\) and the titer of uninfected <i>E. coli</i>, \(c_{u}\).
+
           
+
            The fitness \(f\) is assumed to be constant during the time spent in one lagoon, and all phages are assumed to have the same fitness.
+
           
+
 
         }}
 
         }}
 
         {{#tag:html|
 
         {{#tag:html|
             <h2>Modelling concentrations over multiple Lagoons</h2>
+
             <h2>Protein sequence embedding</h2>
            When a transfer from one volume to the next is performed, the new lagoon can be modelled with starting values calculated from the last lagoon's end values.
+
A protein representation first described by Asgari et al. is prot2vec <x-ref>asgari2015continuous</x-ref>. The technique originates in natural language processing and is based on the word2vec model <x-ref>mikolov2013efficient</x-ref>, originally used to derive vectorized word representations. Applied to proteins, a word is defined as a k-mer of 3 amino acid residues. A protein sequence can thus be represented as the sum over the vectors of all its internal k-mers. Interesting properties have been described in the resulting vector space, for example clustering of hydrophobic and hydrophilic k-mers and sequences <x-ref>asgari2015continuous</x-ref>. However, there are limitations to the prot2vec model, the most important being the loss of information on the sequence order. This has been addressed by applying the continuous bag-of-words model with a paragraph embedding <x-ref>kimothi2016distributed</x-ref>. However, training is extremely slow here, as a protein sequence itself is embedded in the paragraph context, where a paragraph is a larger set of protein sequences (e.g. SwissProt-DB). Furthermore, new protein sequences cannot be added to the embedding, as the paragraph context may not change.
            For each concentration from the previous lagoon \(c_{t}\), the concentration in the next lagoon \(c_{t+1}\) is calculated as
+
Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. We therefore applied a word2vec model <x-ref>mikolov2013efficient</x-ref> to k-mers of length 3 with an embedding dimension of 100. As the quality of the representation estimate scales with the number of training samples, we trained our model on the whole UniProt database (Release 8/2017, <x-ref>apweiler2004uniprot</x-ref>), composed of over 87 million sequences.
            $$
+
             }}
            c_{t+1} = \frac{v_{t} }{v_{l} } \cdot c_{t}
+
            {{#tag:html|
            $$
+
             <h2>Results</h2>
            with \(v_{l}\), the volume of a lagoon and \(v_{t}\), the volume that is transferred.
+
            If the transferred volume is spun down before it is added to the new lagoon, the initial value for \(c_{P}\) is calculated this way. The initial concentration of uninfected <i>E. coli</i> is set to the initial cell density. Initial concentrations of infected and phage-producing <i>E. coli</i> are set to zero, because before the transfer, no phages are present in the new lagoon.
+
            If the transfer volume is not spun down, the concentrations of infected and phage-producing <i>E. coli</i> are calculated using the above formula. The initial concentration of uninfected <i>E. coli</i> is then calculated the same way, but the initial cell density is added.
+
           
+
            In directed evolution the fitness should increase over time. A linear increase in fitness between two given values was implemented to show this. The problem with this approach is its basic assumption that all phage-producing <i>E. coli</i> are infected by phages with the same fitness.
+
            To make the model more plausible, a distribution of fitness was introduced. For a set of discrete fitness values, each fitness value's share of the phage-producing <i>E. coli</i> population is calculated.
+
            That changes the equation for the change in the concentration of <i>M13</i> phage to
+
            $$
+
            \frac{\partial c_{P} (t)}{\partial t} = -k \cdot c_{u}(t) \cdot c_{P} (t)
+
            + \sum_{i = 1}^N f_{i} \cdot s_{i} \cdot \mu_{max} \cdot c_{p} (t)
+
            $$
+
            The calculation is for \(N\) different fitness values \(f_{i}\) and their share of the total phage-producing <i>E. coli</i> population \(s_{i}\).
+
              
+
        }}
+
        {{#tag:html|
+
             <h2>Numeric solutions</h2>
+
  
             The problem described above is a system of four differential equations, of which two ( \(\frac{\partial c_{i} }{\partial t} \:, \: \frac{\partial c_{p} }{\partial t}\) ) are so-called delay differential equations. They contain a term that needs to be evaluated at a timepoint in the past, \(t - t_{P}\). A custom script was used to solve the problem numerically, using the explicit Euler method.[Source!]
+
             KEINE AMK
            The basic idea is that from a point in time with all values and all derivatives values given, the next point in time can be calculated by assuming a linear progress between the two points.
+
            $$
+
            f(t_{n+1}) = f(t_{n}) + (t_{n+1} - t_{n}) \cdot f'(t_{n})
+
            $$
+
            This is performed for \(c_{u}(t)\), \(c_{i}(t)\), \(c_{p}(t)\) and \(c_{P}(t)\) in turn, so that the needed values from \(t_{n}\) are always ready for \(t_{n+1}\).
+
           
+
            To explore how imprecise parameters and noise influence the outcome of the model, a mode was implemented that adds gaussian noise to all parameters. It uses the function \(n\) that makes a value \(v\) noisy with a random parameter \(r\).
+
            $$
+
                n(v) = \big(1 - 2r\big) \cdot \sigma_{G} \cdot \sigma_{v} \cdot v, \quad r \in (0, 1)
+
            $$
+
            Here, \(\sigma_{G}\) is a factor that is the same for all \(v\), \(\sigma_{v}\) is specific for \(v\). This way, it is possible to have one parameter being noisier than another, while being able to tune the noise globally.
+
            [Results]
+
        }}
+
        {{Heidelberg/templateus/Tablebox|
+
            Table 2: Additional Variables and Parameters used in the numeric solution of the model |
+
            {{#tag:html|
+
                <table class="table table-bordered mdl-shadow--4dp" XSSCleaned="overflow-x: scroll !important">
+
                    <thead>
+
                        <tr>
+
                            <th>Symbol</th>
+
                            <th>Name in Source code</th>
+
                            <th>Value and Unit</th>
+
                            <th>Explanation</th>
+
  
                        </tr>
+
             }}
                    </thead>
+
                    <tbody>
+
                        <tr>
+
                            <td>\(v_{l}\)</td>
+
                            <td><pre>vl</pre></td>
+
                            <td>[ml]</td>
+
                            <td>Volume of lagoon</td>
+
                        </tr>   
+
                        <tr>
+
                            <td>\(t_{l} \)</td>
+
                            <td><pre>tl</pre></td>
+
                            <td>[min]</td>
+
                            <td>Duration until transfer to the next lagoon</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(c_{u}(t_{0})\)</td>
+
                            <td><pre>ceu0</pre></td>
+
                            <td>[cfu]</td>
+
                            <td>Concentration of <i>E. coli</i> in a lagoon when M13 phages are transferred to it</td>
+
                        </tr> 
+
                        <tr>
+
                            <td>\(c_{P}(t_{0})\)</td>
+
                            <td><pre>cp0</pre></td>
+
                            <td>[pfu]</td>
+
                            <td>Initial concentration of M13 phage in the first lagoon</td>
+
                        </tr>
+
                        <tr>
+
                            <td>\(n\)</td>
+
                            <td><pre>epochs</pre></td>
+
                            <td>-</td>
+
                            <td>Number of epochs that are modelled, one epoch being everything that happens in one particular lagoon</td>
+
                        </tr> 
+
                        <tr>
+
                            <td>\(s\)</td>
+
                            <td><pre>tsteps</pre></td>
+
                            <td>-</td>
+
                            <td>Number of time steps for which numeric solutions are calculated, counted per epoch</td>
+
                        </tr> 
+
                        <tr>
+
                            <td>\(c_{P}^{min}\)</td>
+
                            <td><pre>min_cp</pre></td>
+
                            <td>[pfu]</td>
+
                            <td>Lower threshold for valid phage titers</td>
+
                        </tr> 
+
                        <tr>
+
                            <td>\(c_{P}^{max}\)</td>
+
                            <td><pre>max_cp</pre></td>
+
                            <td>[pfu]</td>
+
                            <td>Upper threshold for valid phage titers</td>
+
                        </tr> 
+
                    </tbody>
+
                </table>
+
             }}|
+
            List of all additional parameters and variables used in the numeric solution of this model. Where possible, values are given.
+
        }}
+
 
         {{Heidelberg/templateus/Imagebox|
 
         {{Heidelberg/templateus/Imagebox|
 
             https://static.igem.org/mediawiki/2015/thumb/4/49/Heidelberg_CLT_Fig.7_Splinted_Ligation.png/800px-Heidelberg_CLT_Fig.7_Splinted_Ligation.png|
 
             https://static.igem.org/mediawiki/2015/thumb/4/49/Heidelberg_CLT_Fig.7_Splinted_Ligation.png/800px-Heidelberg_CLT_Fig.7_Splinted_Ligation.png|
Line 346: Line 51:
 
             }}|
 
             }}|
 
             pos = left
 
             pos = left
        }}
 
        {{Heidelberg/templateus/Imagebox|
 
            https://static.igem.org/mediawiki/2015/thumb/4/49/Heidelberg_CLT_Fig.7_Splinted_Ligation.png/800px-Heidelberg_CLT_Fig.7_Splinted_Ligation.png|
 
            Fig: 1b Numeric solution calculated with explicit Euler approach|
 
            {{#tag:html|
 
                Non-logarithmic plot of the derivatives of concentrations of all <i>E. coli</i> populations cE, uninfected <i>E. coli</i> ceu, infected <i>E. coli</i> cei, phage-producing <i>E. coli</i> cep and M13 phage cP
 
            }}|
 
            pos = left
 
        }}
 
        {{Heidelberg/templateus/Imagebox|
 
            https://static.igem.org/mediawiki/2015/thumb/4/49/Heidelberg_CLT_Fig.7_Splinted_Ligation.png/800px-Heidelberg_CLT_Fig.7_Splinted_Ligation.png|
 
            Fig: 1c First derivative of concentrations calculated with explicit Euler approach|
 
            {{#tag:html|
 
                Logarithmic plot of the concentrations of all <i>E. coli</i> populations cE, uninfected <i>E. coli</i> ceu, infected <i>E. coli</i> cei, phage-producing <i>E. coli</i> cep and M13 phage cP
 
            }}|
 
            pos = left
 
        }}
 
        {{Heidelberg/templateus/Imagesection|
 
            https://static.igem.org/mediawiki/2017/a/ae/T--Heidelberg--2017_Background_Tiger.jpg|
 
            Fig: 2 Numeric solution for a range of values for \(t_{l}\) and for \(v_{t}\)|
 
            {{#tag:html|
 
                All combinations of setups for the two ranges were calculated. The number of epochs plotted is counted until either the phage titer is less than a minimal threshold (orange) or larger than a maximum threshold (blue)
 
            }}
 
 
         }}
 
         }}
 
     }}
 
     }}
 
+
}}
 
+
    }}
+
 
+
 
{{Heidelberg/references2
 
{{Heidelberg/references2
 
     }}
 
     }}
 
{{Heidelberg/footer
 
{{Heidelberg/footer
 
     }}
 
     }}

Revision as of 19:42, 23 October 2017

DeeProtein

Learning Proteins

DeeProtein - Deep Learning for proteins

Sequence-based functional protein classification is a multi-label, hierarchical classification problem that remains largely unsolved. As protein function is mostly determined by structure, sequence-based classification is difficult, and manual feature extraction combined with conventional machine learning models has not yielded satisfying results. However, with the advent of deep learning, and especially representation learning, the obstacle of linking sequences to a function without further structural information can be overcome. Here we present DeeProtein, a deep convolutional neural network for multi-label protein sequence classification on functional gene ontology terms. We trained our model on a subset of the UniProt database and achieved an area under the ROC curve (AUC) of 99% on our validation set. https://static.igem.org/mediawiki/2017/8/88/T--Heidelberg--2017_modelling-graphical-abstract.svg

Introduction

Deep Learning in general

While the idea of applying a stack of layers composed of computational nodes to estimate complex functions has its origins in the 1960s rosenblatt1958perceptron, it was not until the 1990s, when the first convolutional neural networks were introduced LeCun1990Handwritten, that artificial neural networks were successfully applied to real-world classification tasks. With the beginning of this decade and the massive increase in broadly available computing power, the era of deep learning began. Groundbreaking work by Krizhevsky et al. in image classification Krizhevsky2012ImageNet paved the way for many applications in image, video, sound and natural language processing. There has also been successful work on biological and medical data alipanahi2015predicting, kadurin2017cornucopia.

Powerful function approximator to untangle the complex relation between sequence and function

Artificial neural networks are powerful function approximators, able to untangle complex relations in the input data space. However, it was the introduction of convolutional neural networks LeCun1990Handwritten that made deep learning such a powerful method. Convolutional neural networks rely on trainable filters, or kernels, to extract the valuable information from the input space. The application of trainable kernels for feature extraction has been demonstrated to be extremely powerful in representation learning oquab2014learning, detection lee2009unsupervised and classification Krizhevsky2012ImageNet tasks. A convolutional neural network can thus extract the information present in the input space and encode the input in a compressed representation. Hand-crafted feature extraction thus becomes obsolete.
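
To make the role of a kernel concrete, the following minimal sketch slides a single filter over a one-hot encoded protein sequence. It is an illustration only and not the DeeProtein architecture: the function names, the example sequence and the hand-set motif kernel are assumptions, and in a trained network the kernel weights would be learned rather than written by hand.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[residue]] = 1.0
        rows.append(row)
    return rows


def conv1d(encoded, kernel):
    """Slide a kernel (width x 20 weights) along the sequence without padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        activation = sum(
            weight * value
            for kernel_row, input_row in zip(kernel, window)
            for weight, value in zip(kernel_row, input_row)
        )
        outputs.append(activation)
    return outputs


def motif_kernel(motif):
    """Hypothetical hand-set kernel that responds most strongly to one motif."""
    return one_hot(motif)


# Example sequence and motif are made up for illustration.
scores = conv1d(one_hot("MKTAYIAKQR"), motif_kernel("AYI"))
print(scores)
```

The activation peaks at the window that matches the motif exactly and is lower where only some positions match, which is the sense in which a kernel acts as a feature detector along the sequence.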

Applied models and Architecture

Protein representation learning

The protein sequence space is extremely complex. The amino acid alphabet comprises 20 standard letters and an average protein has a length of 500 residues, making the combinatorial complexity of the space tremendous. Comparable to images, however, functional protein sequences reside on a thin manifold within the total sequence space. Learning the properties of the distribution of proteins with a certain function would enable not only a decent classification of sequences into functions but also unlimited sampling from this distribution, resulting in de novo protein sequence generation. Attempts at protein sequence classification have been made with CNNs szalkai2017near as well as with recurrent neural networks liu2017deep with good success, however without the possibility of generative modelling. To find the optimal feature representation of proteins, we apply and test various representation techniques.
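
The size of this space can be made tangible with a short calculation. The sketch below takes the two figures quoted in the paragraph (20 letters, 500 residues) and computes the number of possible sequences of that length; the function names are illustrative assumptions.

```python
import math

ALPHABET_SIZE = 20  # standard amino acids
LENGTH = 500        # average protein length quoted in the text


def sequence_space_size(alphabet_size, length):
    """Exact number of distinct sequences of the given length."""
    return alphabet_size ** length


def order_of_magnitude(alphabet_size, length):
    """Power of ten of the sequence count, computed via logarithms."""
    return math.floor(length * math.log10(alphabet_size))


print(order_of_magnitude(ALPHABET_SIZE, LENGTH))
print(len(str(sequence_space_size(ALPHABET_SIZE, LENGTH))))
```

With these inputs the count is on the order of 10^650, so exhaustive enumeration is out of the question and only a learned model of the functional manifold is practical.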

Protein sequence embedding

A protein representation first described by Asgari et al. is prot2vec asgari2015continuous. The technique originates in natural language processing and is based on the word2vec model mikolov2013efficient, originally used to derive vectorized word representations. Applied to proteins, a word is defined as a k-mer of 3 amino acid residues. A protein sequence can thus be represented as the sum over the vectors of all its internal k-mers. Interesting properties have been described in the resulting vector space, for example clustering of hydrophobic and hydrophilic k-mers and sequences asgari2015continuous. However, there are limitations to the prot2vec model, the most important being the loss of information on the sequence order. This has been addressed by applying the continuous bag-of-words model with a paragraph embedding kimothi2016distributed. However, training is extremely slow here, as a protein sequence itself is embedded in the paragraph context, where a paragraph is a larger set of protein sequences (e.g. SwissProt-DB). Furthermore, new protein sequences cannot be added to the embedding, as the paragraph context may not change. Thus we intended to find an optimized word2vec approach for fast, reproducible and simple protein sequence embedding. We therefore applied a word2vec model mikolov2013efficient to k-mers of length 3 with an embedding dimension of 100. As the quality of the representation estimate scales with the number of training samples, we trained our model on the whole UniProt database (Release 8/2017, apweiler2004uniprot), composed of over 87 million sequences.
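
The k-mer summation described above can be sketched as follows. This is a toy illustration under stated assumptions: the k-mers are taken as overlapping windows, the lookup table is a hypothetical 2-dimensional stand-in for trained word2vec vectors (the model in the text uses 100 dimensions), and all names and values are made up.

```python
def kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence, kmer_vectors, k=3):
    """Sum the vectors of all k-mers; k-mers missing from the table are skipped."""
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    for kmer in kmers(sequence, k):
        vector = kmer_vectors.get(kmer)
        if vector is None:
            continue
        total = [t + v for t, v in zip(total, vector)]
    return total


# Hypothetical toy table; real vectors would come from a trained word2vec model.
toy_vectors = {
    "MKT": [1.0, 0.0],
    "KTA": [0.0, 1.0],
    "TAY": [0.5, 0.5],
}

print(kmers("MKTAY"))
print(embed_sequence("MKTAY", toy_vectors))
```

Because the sequence vector is a plain sum, any two sequences with the same multiset of k-mers receive the same embedding, which is the loss of order information mentioned in the text.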

Results

Fig: 1a Numeric solution calculated with explicit Euler approach
Logarithmic plot of the concentrations of all E. coli populations cE, uninfected E. coli ceu, infected E. coli cei, phage-producing E. coli cep and M13 phage cP

References