{{Heidelberg/navbar}}
{{Heidelberg/header}}
{{Heidelberg/boxopen|
Week 34
{{#tag:html|
<h2>Optopace</h2>

Weekly summary 14.-20.08.2017 CG
====

Opto PACE
===
The primers for Opto PACE arrived and cloning was started.

Phage propagation of the **unevolved** Dickinson phage
===
Phage supernatant of the unevolved Dickinson phage target_133_N-term_T7-C was received from the Dickinson group. To propagate the phages, a 4 ml *E. coli* culture (Stock ID: 47) was grown to an OD600 of 0.6 (in LB medium + 25 mM glucose + Amp) and infected with 4 µl of the phage supernatant. The culture was shaken at 37 °C overnight. The next morning, the culture was centrifuged at 6,000 g for 5 min and the supernatant, which contains the phages, was stored at 4 °C.

A Blue Plaque Assay was performed to determine the phage titer of the supernatant. 143 plaques were counted at the 10<sup>-10</sup> dilution, which corresponds to a phage titer of 1.43*10<sup>15</sup> PFU/ml.
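
As a sanity check on the titer arithmetic, a minimal sketch; the plated volume of 1 µl is our assumption here (it is the value consistent with the reported titer), not a recorded parameter:

```python
def phage_titer(plaques, dilution, plated_volume_ml):
    """Titer in PFU/ml = plaque count / (dilution factor * plated volume in ml)."""
    return plaques / (dilution * plated_volume_ml)

# 143 plaques at the 10^-10 dilution, 1 ul (0.001 ml) plated (assumed)
print(phage_titer(143, 1e-10, 1e-3))  # 1.43e+15 PFU/ml
```
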
A plaque from this plate was picked to infect a 4 ml *E. coli* culture (Stock ID: 47) at an OD600 of 0.4. This culture was shaken for two hours at 37 °C and subsequently transferred to 100 ml of fresh 2xYT medium. After 1 hour, carbenicillin (1000x) was added.

On the next day, the culture was centrifuged at 3,640 g for 20 min. The supernatant was stored at 4 °C.

A Blue Plaque Assay was performed to determine the phage titer of the supernatant. The monoclonal Dickinson phage target_133_N-term_T7-C exhibited a phage titer of 1.85*10<sup>9</sup> PFU/ml.
<h2>Software</h2>

Week 34
=====
Word2Vec Embeddings on Protein Sequences
---------------------
We rewrote a word2vec implementation from TensorFlow's tutorials that implements "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., ICLR 2013). The model is a skip-gram model with negative sampling that uses custom ops written in C. The code was adapted to our needs, mainly by changing data types in the C kernels and by writing a different evaluation function that predicts the nearest words to the most frequent words instead of using analogies. Two new datasets were generated, based on SwissProt and UniProt respectively. Training of 4-mer embeddings in 50, 100 and 200 dimensions was started but has not finished yet.
Visualisation of the first checkpoints is possible via TensorBoard: [Visualisation of an example embedding via TensorBoard](170820ai-vistestemb).
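
For illustration, a minimal sketch of the preprocessing step such an embedding needs: splitting a protein sequence into overlapping k-mers that act as "words". The function name is ours for this example; the real pipeline lives in our dataset-generation code:

```python
def kmerize(sequence, k=4):
    """Split a protein sequence into overlapping k-mers ('words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: a short peptide becomes a 'sentence' of 4-mers
print(kmerize("MKTAYIAKQR", k=4))
# ['MKTA', 'KTAY', 'TAYI', 'AYIA', 'YIAK', 'IAKQ', 'AKQR']
```
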
Implementation of the SqueezeNet Architecture
---------------------------------
By implementing a new architecture based on SqueezeNet (Iandola et al., 2016), which relies on 1x1 convolutions, we were able to fit both the 299-class and the 637-class dataset. The new model architecture looks as follows:

- InputLayer model_valid/input_layer_valid: (64, 20, 1000, 1)
- PadLayer model_valid/block1/pad_layer_valid: paddings:[[0, 0], [0, 0], [3, 3], [0, 0]] mode:CONSTANT
- Conv2dLayer model_valid/block1/cnn_layer_valid: shape:[20, 7, 1, 128] strides:[1, 5, 1, 1] pad:VALID act:prelu
- Conv1dLayer model_valid/block2/cnn_layer_valid: shape:[6, 128, 128] stride:1 pad:SAME act:prelu
- Conv1dLayer model_valid/1x1_I/1x1_valid: shape:[1, 128, 64] stride:1 pad:SAME act:prelu
- BatchNormLayer model_valid/1x1_I/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block3/cnn_layer_valid: shape:[5, 64, 256] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block3/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block3/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block4/cnn_layer_valid: shape:[5, 256, 256] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block4/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block4/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/1x1_II/1x1_valid: shape:[1, 256, 128] stride:1 pad:SAME act:prelu
- BatchNormLayer model_valid/1x1_II/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block5/cnn_layer_valid: shape:[5, 128, 256] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block5/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block5/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block6/cnn_layer_valid: shape:[5, 256, 512] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block6/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block6/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/1x1_III/1x1_valid: shape:[1, 512, 256] stride:1 pad:SAME act:prelu
- BatchNormLayer model_valid/1x1_III/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block7/cnn_layer_valid: shape:[5, 256, 516] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block7/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block7/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block8/cnn_layer_valid: shape:[5, 516, 1024] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block8/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block8/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/1x1_IV/cnn_layer_valid: shape:[1, 1024, 512] stride:1 pad:SAME act:prelu
- BatchNormLayer model_valid/1x1_IV/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/block9/cnn_layer_valid: shape:[5, 512, 1024] stride:1 pad:SAME act:prelu
- PoolLayer model_valid/block9/pool_layer_valid: ksize:[2] strides:[2] padding:VALID pool:pool
- BatchNormLayer model_valid/block9/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- Conv1dLayer model_valid/outlayer/cnn_layer_valid: shape:[1, 1024, 637] stride:1 pad:SAME act:prelu
- BatchNormLayer model_valid/outlayer/batchnorm_layer_valid: decay:0.900000 epsilon:0.000010 act:identity is_train:False
- MeanPool1d global_avg_pool: filter_size:[7] strides:1 padding:valid

The architecture is fully convolutional, ending in an average pooling layer as output layer, with the channel dimension corresponding to the number of classes. All inputs were one-hot encoded and zero-padded to a box size of 1000 positions.
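
To make the recurring block pattern explicit, here is a condensed sketch (convolution block, 1x1 "squeeze" convolution, batch normalization, pooling, global average pooling at the end). It is written in tf.keras purely for readability; our actual implementation uses the TensorLayer API, the exact hyperparameters are in the layer list above, and the two example blocks shown are not a line-by-line reproduction:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size, name):
    """Conv1D + PReLU + max pooling + batch norm, as in blocks 3-9 above."""
    x = layers.Conv1D(filters, kernel_size, padding='same', name=name)(x)
    x = layers.PReLU(shared_axes=[1])(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return layers.BatchNormalization(momentum=0.9, epsilon=1e-5)(x)

def squeeze(x, filters, name):
    """1x1 'squeeze' convolution reducing the channel dimension."""
    x = layers.Conv1D(filters, 1, padding='same', name=name)(x)
    x = layers.PReLU(shared_axes=[1])(x)
    return layers.BatchNormalization(momentum=0.9, epsilon=1e-5)(x)

inputs = layers.Input(shape=(1000, 20))        # one-hot input, boxsize 1000
x = conv_block(inputs, 256, 5, 'block3')
x = conv_block(x, 256, 5, 'block4')
x = squeeze(x, 128, '1x1_II')
# ... further blocks follow the same pattern ...
x = layers.Conv1D(637, 1, padding='same', name='outlayer')(x)  # classes as channels
outputs = layers.GlobalAveragePooling1D()(x)   # fully convolutional output
model = tf.keras.Model(inputs, outputs)
```
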

| Model | lr | classes | Comment | restored | maxstep | boxsize | ACC |
|-------|------|---------|---------|----------|---------|---------|--------------|
| | 0.01 | 299 | | NO | 220000 | 1000 | 0.8 (valid) |
| | 0.01 | 637 | | NO | 180000 | 1000 | 0.55 (valid) |
| | 0.01 | 637 | | YES | 35000 | 1000 | 0.75 (valid) |

References:
-----
1. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

<h2>Modeling</h2>

A first numeric model of PredCel works, without oscillations. The graphs look reasonable so far. Modeling was performed on three levels. On the lowest level, one step of PredCel was simulated, monitoring the phage concentration as well as the concentrations of uninfected, infected and phage-producing *E. coli* [Graph of level 1 PredCel model](170820mod-lvl1.png). One level above, all concentrations were tracked over 100 iterations of PredCel [Graph of level 2 PredCel model](170820mod-lvl2.png). On the third level, different sets of values for starting fitness, starting phage concentration and starting *E. coli* concentration were tested; here we only monitored how long the phage titer at the end of each iteration stayed above 1 pfu/ml and below 1e8 pfu/ml [Graph of level 3 PredCel model](170820mod-lvl3.png).
At least the two higher levels probably only work in Python, but an interactive version of what happens during one iteration may be possible. A final, more convenient version of the script was started.
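
A minimal sketch of what such a level-1 step looks like as an ODE system, integrated with scipy. All rate constants and starting values here are illustrative placeholders, not the parameters of our actual model:

```python
import numpy as np
from scipy.integrate import odeint

def predcel_step(y, t, r=1.5, k_inf=1e-10, k_mat=2.0, burst=30.0):
    u, i, p, v = y  # uninfected, infected, producing E. coli; free phage
    du = r * u - k_inf * u * v              # growth minus infection
    di = k_inf * u * v - k_mat * i          # infected cells maturing to producers
    dp = k_mat * i                          # phage-producing cells
    dv = burst * k_mat * i - k_inf * u * v  # phage released minus adsorbed
    return [du, di, dp, dv]

t = np.linspace(0, 2, 200)                  # two hours of one lagoon step
y0 = [1e8, 0.0, 0.0, 1e6]                   # illustrative starting concentrations
u, i, p, v = odeint(predcel_step, y0, t).T  # concentration time courses
```
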

}}
}}
{{Heidelberg/boxopen|
Week 35
{{#tag:html|
<h2>Optopace</h2>

Weekly summary 21.-27.08.2017 CG
====
Opto PACE
===

SP Opto
--
PCR amplification of SP fragments:
---
The blue-light mediated transcription activator EL222 was amplified from a gBlock (EL222_expression_cassette) using CG_012_fwd and CG_013_rev as primers in a 25 µl Phusion Flash reaction mix.
The cycler conditions for the 678 bp fragment were as follows:

| Phase | Temperature [°C] | Time [min] | Cycles |
|---|---|---|---|
| Initial Denaturation | 98 | 0:10 | 1 |
| Denaturation | 98 | 0:01 | 35 |
| Annealing | 72 | 0:05 | 35 |
| Extension | 72 | 0:11 | 35 |
| Final Extension | 72 | 1:00 | 1 |
| Hold | 10 | ∞ | 1 |


Oligo annealing and phosphorylation of the AP promoters:
---
pBLind: CG_010_fwd, CG_009_rev
pBLrep: CG_007_fwd, CG_006_rev

5 µl of each primer (100 µM; CG_010_fwd and CG_009_rev for pBLind, CG_007_fwd and CG_006_rev for pBLrep) was added to 1.1 µl T4 ligase buffer in a PCR tube. The reaction mix was heated to 98 °C and slowly cooled down (0.1 °C/s).

The annealed oligos were phosphorylated using T4 PNK according to the NEB protocol "Non-radioactive Phosphorylation with T4 PNK or T4 PNK (3´ phosphatase minus)".

Golden Gate assembly of SP and AP:
---
GG of Opto SP (20 µl reaction):

| Reagents | Volume [µl] |
|---|---|
| SP_BB (Purification ID: 515) | 15 |
| EL_222 (Purification ID: 520) | 2 (of a 1:10 dilution) |
| BsaI | 0.5 |
| T4 Ligase | 0.5 |
| T4 Ligase Buffer | 2 |

GG of Opto APs (20 µl reaction):

AP_light

| Reagents | Volume [µl] |
|---|---|
| gIII_luxAB (Purification ID: 514) | 2 |
| AP_light_BB (Purification ID: 516) | 0.5 |
| pBLind | 1 (of a 1:100 dilution) |
| BsaI | 0.5 |
| T4 Ligase | 0.5 |
| T4 Ligase Buffer | 2 |

AP_dark

| Reagents | Volume [µl] |
|---|---|
| gIII_luxAB (Purification ID: 514) | 2 |
| AP_dark_BB (Purification ID: 517) | 0.5 |
| pBLrep | 1 (of a 1:100 dilution) |
| BsaI | 0.5 |
| T4 Ligase | 0.5 |
| T4 Ligase Buffer | 2 |

Cycler conditions were as follows:

| Temperature [°C] | Time [min] | Cycles |
|---|---|---|
| 37 | 3:00 | 15 |
| 16 | 4:00 | 15 |
| 37 | 30:00 | 1 |
| 65 | 0:05 | 1 |

Transformation of APs and SP:
---
The APs were transformed into Top10 cells (chemically competent) and the SP was transformed into S1059 (electrocompetent).

The AP transformations were plated on LB plates containing Amp. The SP transformations were cultivated in SOC for 4 h and a Blue Plaque Assay was subsequently performed with the phage supernatant. The phage supernatant was also used for a PCR of the phage insert: a 25 µl Q5 PCR reaction was prepared, and the 1.9 kb fragment containing EL222 was amplified with the primers JM_068_rev and JM_069_fwd, using 2 µl of the phage supernatant as template.

The cycler conditions were as follows:

| Phase | Temperature [°C] | Time [min] | Cycles |
|---|---|---|---|
| Initial Denaturation | 98 | 3:00 | 1 |
| Denaturation | 98 | 0:10 | 35 |
| Annealing | 64 | 0:25 | 35 |
| Extension | 72 | 0:50 | 35 |
| Final Extension | 72 | 2:00 | 1 |
| Hold | 10 | ∞ | 1 |

Analysis by gel electrophoresis revealed two bands, at 1 kb and at 1.9 kb.
<h2>Software</h2>

Week 35
======

Performance of the SqueezeNet Architecture - singlelabel 599
-------
The model was run successfully on the old 599-class dataset.
Parameters: lr = 1e-2, batchsize = 64, epsilon = 0.1
[ROC](DeeProtein_TFRECORDS_PURECONV_1x1tuned_750k_restored750kfull_sce_adam_1dconv637_1000_one_hot_padded_64_0.001_0.1.roc_16.svg)
[Precision](DeeProtein_TFRECORDS_PURECONV_1x1tuned_750k_restored750kfull_sce_adam_1dconv637_1000_one_hot_padded_64_0.001_0.1.precision_16.svg)

Performance of the SqueezeNet Architecture - singlelabel 679
-------
The model was run successfully on the 679-class dataset.
Parameters: lr = 1e-5, batchsize = 64, epsilon = 0.1
[ROC](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored679_sce_adam_1dconv_EC_679_1000_one_hot_padded_64_0.001_0.1.roc_9.svg)
[Precision](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored679_sce_adam_1dconv_EC_679_1000_one_hot_padded_64_0.001_0.1.precision_9.svg)

Performance of the SqueezeNet Architecture - multilabel 1084
-------
The model was run successfully on the 1084 GO-classes dataset.
Parameters: lr = 1e-3, batchsize = 64, epsilon = 0.1
[ROC](DeeProtein_TFRECORDS_PURECONV_1x1LARGE_MULTI_restored1084_sce_adam_1dconv_EC_1084_1000_one_hot_padded_64_0.0001_0.1.roc_39.svg)
[Precision](DeeProtein_TFRECORDS_PURECONV_1x1LARGE_MULTI_restored1084_sce_adam_1dconv_EC_1084_1000_one_hot_padded_64_0.0001_0.1.precision_39.svg)

Corrected datasets for missing classes, rewrote ```eval()``` to include the whole validation set
------------------------
- Dataset 637 was missing 138 classes due to the minimum length requirement in the ```DatasetGenerator``` class. The requirement was lowered to 175 AA. Further, the ```DatasetGenerator``` class was rewritten to ensure the validation set contains 5 samples from every class (a sketch of the split follows below).
- The ```eval()``` function of ```DeeProtein``` was rewritten to perform the validation on the _whole_ validation set at given steps.
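
An illustrative sketch of the per-class validation split described above. The names (`MIN_LEN`, `VAL_PER_CLASS`) and the plain-tuple data layout are our own for this example; the real ```DatasetGenerator``` operates on the full SwissProt/UniProt dumps:

```python
import random
from collections import defaultdict

MIN_LEN = 175        # lowered minimum sequence length in amino acids
VAL_PER_CLASS = 5    # guaranteed validation samples per class

def split(records):
    """records: iterable of (sequence, label). Returns (train, valid)."""
    by_class = defaultdict(list)
    for seq, label in records:
        if len(seq) >= MIN_LEN:                 # length filter
            by_class[label].append((seq, label))
    train, valid = [], []
    for label, samples in by_class.items():
        random.shuffle(samples)
        valid.extend(samples[:VAL_PER_CLASS])   # 5 samples go to validation
        train.extend(samples[VAL_PER_CLASS:])
    return train, valid
```
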

Performance on 679 classes with minlength 175:
lr = 0.01, e = 0.1, batchsize = 64
[ROC](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored637750kfull_sce_adam_1dconv679_1000_one_hot_padded_64_0.01_0.1.roc_32.svg)
[Precision](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored637750kfull_sce_adam_1dconv679_1000_one_hot_padded_64_0.01_0.1.precision_32.svg)

lr = 0.001, e = 0.1, batchsize = 64
[ROC](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored637750kfull_sce_adam_1dconv679_1000_one_hot_padded_64_0.01_0.1.roc_32.svg)
[Precision](DeeProtein_TFRECORDS_PURECONV_1x1tuned_restored637750kfull_sce_adam_1dconv679_1000_one_hot_padded_64_0.01_0.1.precision_32.svg)

Reinitialization with pretrained parameters and a lower learning rate allowed fine-tuning of the classifier. Especially since the validation set is uniformly distributed (in contrast to the training set), the classifier can be considered trained.

ROC/ACC/AUC metrics
--------
ROC and AUC calculation was added; both are computed on the fly, after validation on the whole validation set.
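
A minimal sketch of computing ROC and AUC from collected validation predictions with scikit-learn; the toy labels and scores here are illustrative, not our data:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 1, 0])                 # binary labels for one class
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # ROC curve points
print("AUC:", auc(fpr, tpr))                          # area under the curve
```
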
− |
| |
− | Training models on the embedded sequences
| |
− | ------------------
| |
− | We generated batches from the word embeddings (dim=100, kmer-length=3) for the 679(EC) and the 1084 mulilabel network. However training proceeds much more slowly as the parametersize is 5 times the size of the one-hot network.
| |
− |
| |
Multilabel classification
--------
To be able to perform multilabel classification, we rewrote the input pipeline (```DatasetGenerator, BatchGenerator, TFrecordsgenerator```) and generated two datasets with 339 and 1084 classes, respectively. The considered labels were chosen solely based on their population. As the GO-term hierarchy follows a directed acyclic graph (DAG), we looked up all parent nodes for each leaf node and included the total set of annotations for each sequence (see the sketch below).
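
A sketch of propagating GO annotations up the DAG as described above. The `parents` map would in practice be parsed from the GO ontology; the toy terms here are placeholders:

```python
def ancestors(term, parents):
    """Return the term plus all of its (transitive) parent terms."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}    # toy DAG: C -> B -> A
leaf_annotations = ["GO:C"]                        # annotations of one sequence
full = set().union(*(ancestors(t, parents) for t in leaf_annotations))
print(full)  # {'GO:A', 'GO:B', 'GO:C'} -- total set of annotations
```
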
− |
| |
− | First models were run after extending the network for 2 convolutional and 2 1x1 layers on the 1084 classes dataset. Results were disenchanting.
| |
− |
| |
Comparison of datasets
----------------
Total seqs after filtering (EC): 220488
Total seqs after filtering (GO): 235767

| Datatype | Dataset | Samples | % of filtered sequences considered | % of total sequences considered |
|----------|---------|---------|------------------------------------|---------------------------------|
| EC | 679 | 165658 | 75.13 | 63.18 |
| GO | 1084 | 233386 | 98.99 | 89.00 |

Word2Vec
------
The word2vec adaptation we implemented last week was optimized through a few minor changes. Metadata of the embeddings can now be used to analyse them in TensorBoard: searching for a single k-mer, finding its nearest neighbours, annotating frequencies or other properties to the points, and searching for groups of k-mers defined by regular expressions all work. Embeddings of 3-mers and 4-mers in 50, 100 and 200 dimensions were calculated on all of SwissProt and all of UniProt. Principal Component Analysis (PCA) was performed on a 100-dimensional embedding of 3-mers from SwissProt and showed interesting properties (see the figures below); however, the reduction from 100 dimensions to two or three may account for at least some of them.
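
A minimal sketch of the PCA step underlying the following figures; the random matrix stands in for our trained embedding checkpoint:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(8000, 100)   # 20^3 = 8000 3-mers, 100 dimensions
pca = PCA(n_components=3)
coords = pca.fit_transform(embeddings)    # 3D coordinates for plotting
print(pca.explained_variance_ratio_)      # variance retained by each axis
```
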
![Frequencies of 3-mers](170827mod-freqs.png)
Here, darker marks denote more frequent 3-mers; these seem to cluster together.
![All 3-mers containing a specific amino acid](170827mod-aas.gif)
For each amino acid, all 3-mers containing it are marked in red. Some of these selections cluster together, others do not. This may also be due to the dimension reduction.
![All 3-mers containing cysteine](170827mod-c.png)
All 3-mers containing cysteine are marked in red.
![All 3-mers containing lysine](170827mod-k.png)
All 3-mers containing lysine are marked in red.
![All 3-mers containing proline](170827mod-p.png)
All 3-mers containing proline are marked in red.
![All 3-mers containing valine](170827mod-v.png)
All 3-mers containing valine are marked in red.
Especially the cysteine-, proline- and lysine-containing 3-mers cluster together, while the valine-containing ones appear to be more evenly distributed.
Clustering of k-mers based on amino acid content is a first hint that embeddings can be useful as input to neural networks. However, their performance can probably only be measured indirectly, by comparing the performance of a single architecture on different inputs.
<h2>Modeling</h2>

Calculations of medium consumption for iGEM goes green were performed and made interactive. A heatmap visualizes medium consumption; calculation of ideal turbidostat and lagoon sizes is possible. Users can annotate their own experiments and compare them to others.
[Heatmap with default values](170827mod-heatmap.png)
Experimentation with elements of a PredCel model based on distributions instead of scalars was started. The idea is that a population of phages does not have one fitness between 0 and 1, but rather consists of individuals with different fitness values. In this more complex model, the concentrations have to be calculated for each phage fitness value, depending on the number of phages carrying that fitness value. The fitness distribution is changed by mutation and by selection. A first naive approach to mutation was programmed: it simply subtracts a given percentage of the difference between the amount at a fitness value and the mean amount from the amount at that fitness value (a minimal sketch follows below). Obviously this is oversimplified and will therefore be replaced by a model based on the idea that every sequence that mutates gets better or worse with normally distributed changes.
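
The naive mutation step on a discretized fitness distribution, as described above; the rate and bin count are illustrative choices, not our model's parameters:

```python
import numpy as np

def mutate(counts, rate=0.05):
    """counts[i] = number of phages in fitness bin i.
    Each bin moves a fixed fraction toward the mean occupancy;
    the total phage count is conserved (deviations sum to zero)."""
    return counts - rate * (counts - counts.mean())

counts = np.zeros(11)      # fitness bins 0.0, 0.1, ..., 1.0
counts[8] = 1e6            # start: all phages at fitness 0.8
counts = mutate(counts)    # distribution flattens slightly toward the mean
```
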

}}
}}

{{Heidelberg/footer}}