Zhiling Zhou (Talk | contribs)
<h2 id="vocclassification" class="H2Head">VOC Classification</h2>
<h3 id="overview" class="H3Head">Overview</h3>
<p class="PP">The VOC device is designed to tell whether tobacco is healthy or infected. Since this is an inquiry experiment, data-analysis algorithms play a central role in our modeling. We performed data preprocessing, data analysis, and algorithm optimization on the data collected by the VOC device. Finally, we used logistic regression and detected infected tobacco with 91% confidence.</p>
<h3 id="datapreprocessing" class="H3Head">Data preprocessing</h3>
<p class="PP">First we defragmented the raw input data and reorganized them into a matrix. The 10 VOC factors served as features, and the status (healthy or infected) served as the label to be predicted.</p>
<div class="imgdiv"><img class="textimg" src='https://static.igem.org/mediawiki/2017/4/49/ZJU_China_VOC_1.png' alt=''/></div>
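<p class="PP">As an illustrative sketch (not the team's original code), the reorganization into a feature matrix could look like this in Python; the comma-separated record format, the column names, and the two sample records below are hypothetical:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical raw records: 10 VOC sensor readings plus a status string.
# The actual output format of the VOC device may differ.
raw = [
    "0.12,0.30,0.05,0.44,0.21,0.09,0.33,0.18,0.27,0.15,healthy",
    "0.52,0.70,0.45,0.84,0.61,0.49,0.73,0.58,0.67,0.55,infected",
]

rows = [line.split(",") for line in raw]
# 10 VOC factors become the feature columns of the matrix.
X = pd.DataFrame([r[:10] for r in rows], dtype=float,
                 columns=[f"VOC_{i}" for i in range(1, 11)])
# The status becomes the label to be predicted (1 = infected, 0 = healthy).
y = np.array([1 if r[10] == "infected" else 0 for r in rows])
```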
<p class="PP">Then we analyzed the data with box plots and found that while most records were normal, some contained singular (outlying) values. Their box plot is shown below:</p>
<div class="imgdiv"><img class="textimg" src='https://static.igem.org/mediawiki/2017/9/97/ZJU_China_VOC_2.png' alt=''/></div>
<p class="PP">After we removed the records with singular values, the remaining data turned out to follow a normal distribution:</p>
<div class="imgdiv col-md-6 col-sm-6"><img class="textimg" style="height: 230px !important; width:auto !important;" src='https://static.igem.org/mediawiki/2017/3/32/ZJU_China_VOC_3.png' alt=''/></div>
<div class="imgdiv col-md-6 col-sm-6"><img class="textimg" style="height: 230px !important; width:auto !important;" src='https://static.igem.org/mediawiki/2017/e/e0/ZJU_China_VOC_4.png' alt=''/></div>
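<p class="PP">The box-plot screening step can be sketched as follows. This is a minimal example using the common 1.5&nbsp;&times;&nbsp;IQR whisker rule; the exact cutoff the team applied is not stated, so the rule here is an assumption:</p>

```python
import numpy as np

def remove_outliers(X, k=1.5):
    """Drop rows containing values outside the box-plot whiskers
    (beyond k * IQR from the quartiles) in any feature column."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    mask = np.all((X >= lo) & (X <= hi), axis=1)  # True = keep the record
    return X[mask], mask

# Tiny demonstration: the value 9.0 is a clear singular record.
X = np.array([[1.0], [1.1], [0.9], [1.2], [9.0]])
clean, kept = remove_outliers(X)
```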
<h3 id="dataanalysis" class="H3Head">Data analysis</h3>
<p class="PP">Our target was to build a model that predicts the tobacco's status from the 10 input features. This is a classic binary classification problem, for which several algorithms were available to us. We used cross validation for sampling, and the scoring policy we applied is the ridit test.</p>
<p class="PP"><strong>Decision Tree</strong></p>
<p class="PP">First we used decision trees, which are based on information theory. The ID3 decision tree splits on the attribute with the largest information gain, while the CART tree splits to minimize the Gini index. The performance of the two algorithms is almost the same. <strong>R = 0.83</strong></p>
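<p class="PP">A minimal sketch of the two tree variants with scikit-learn, evaluated by cross validation as described above. The data here are a synthetic stand-in for the VOC matrix, and note that scikit-learn's tree is CART-based, so <code>criterion="entropy"</code> only mimics ID3's information-gain splitting:</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 features, binary status label.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# ID3-style tree: split on maximum information gain (entropy criterion).
id3 = DecisionTreeClassifier(criterion="entropy", random_state=0)
# CART tree: split to minimise the Gini index.
cart = DecisionTreeClassifier(criterion="gini", random_state=0)

id3_score = cross_val_score(id3, X, y, cv=5).mean()
cart_score = cross_val_score(cart, X, y, cv=5).mean()
```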
<p class="PP"><strong>MLP</strong></p>
<p class="PP">The second algorithm we applied is the Multi-Layer Perceptron (MLP), also called a neural network. In this model, we used more than 100 neurons in each layer, with ReLU as the activation function.</p>
<p class="PP">The result of the MLP is much better than that of the decision tree. <strong>R = 0.89</strong></p>
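<p class="PP">The MLP setup described above ("more than 100 neurons in each layer", ReLU activation) can be sketched with scikit-learn. The two hidden layers of 128 neurons and the synthetic data are assumptions for illustration:</p>

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the VOC matrix: 10 features, binary label.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# "More than 100 neurons in each layer" with ReLU activation.
mlp = MLPClassifier(hidden_layer_sizes=(128, 128), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)
train_acc = mlp.score(X, y)  # accuracy on the training data
```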
<p class="PP"><strong>Linear Model</strong></p>
<p class="PP">Although the performance of the MLP was good enough, it is difficult to extract the knowledge learnt by the algorithm; in other words, its interpretability is weak. Why not try a simple model with high interpretability? First we used the LDA algorithm to compress the 10-dimensional data into 2 dimensions.</p>
<p class="PP">We define <script type="math/tex">S_w=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T+\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T</script> as the within-class scatter matrix.</p>
<p class="PP">We define <script type="math/tex">S_b=(\mu_0-\mu_1)(\mu_0-\mu_1)^T</script> as the between-class scatter matrix.</p>
<p class="PP">LDA finds the projection <script type="math/tex">w</script> that maximizes the objective</p>
<p class="PP" style="text-align: center !important;"><script type="math/tex; mode=display">J(w)=\frac{w^T S_b w}{w^T S_w w}</script>
</p>
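<p class="PP">For two classes, the scatter matrices defined above give a closed-form solution: the Fisher criterion is maximised by <script type="math/tex">w \propto S_w^{-1}(\mu_1-\mu_0)</script>. A sketch on synthetic data (note that plain LDA on a binary problem yields a one-dimensional projection):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(100, 10))   # stand-in for the healthy class
X1 = rng.normal(loc=1.0, size=(100, 10))   # stand-in for the infected class
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter S_w (sum of the per-class scatters) and
# between-class scatter S_b for the two-class case.
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
Sb = np.outer(mu1 - mu0, mu1 - mu0)

# The Fisher criterion J(w) = (w^T Sb w) / (w^T Sw w) is maximised by:
w = np.linalg.solve(Sw, mu1 - mu0)

# Project both classes onto w; the class means separate along this axis.
z0, z1 = X0 @ w, X1 @ w
```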
<p class="PP">The result of the LDA algorithm is as follows:</p>
<div class="imgdiv"><img class="textimg" src='https://static.igem.org/mediawiki/2017/6/61/ZJU_China_VOC_8.png' alt=''/></div>
<p class="PP">This result proved that the data are linearly separable, which enabled us to choose the logistic regression algorithm.</p>
<p class="PP">We define the logistic model</p>
<p class="PP" style="text-align: center !important;"><script type="math/tex; mode=display">P(y=1\mid x)=\frac{1}{1+e^{-(w^T x+b)}}</script>
</p>
<p class="PP">Then we can apply the maximum likelihood method to estimate the parameters.</p>
<p class="PP">The result is as follows:</p>
<figure class="codes"><pre>
Weight:
</pre></figure>
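<p class="PP">Fitting the logistic model by (regularised) maximum likelihood and reading off one weight per VOC factor can be sketched as follows; the synthetic data and the sparse "ground-truth" weights are assumptions for illustration:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Hypothetical sparse ground truth: only a few factors matter.
true_w = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0, 0, 0])
y = (X @ true_w + 0.3 * rng.normal(size=300) > 0).astype(int)

# scikit-learn fits the logistic model by regularised maximum likelihood.
clf = LogisticRegression(max_iter=1000).fit(X, y)
weights = clf.coef_.ravel()   # one weight per VOC factor
bias = clf.intercept_[0]
```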
<h2 id="algorithmoptimization" class="H2Head">Algorithm optimization</h2>
<p class="PP">From the result of logistic regression, factors such as C and I carry comparatively unimportant weights; such factors may disturb the classification. We therefore tried to remove insignificant factors to simplify the model.</p>
<p class="PP">Finally, we retained 4 factors, with which we can predict the tobacco's status with 91% confidence, and simplified the VOC device accordingly.</p>
<figure class="codes"><pre>
Weight:
</pre></figure>
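<p class="PP">The factor-reduction step — keeping only the most heavily weighted factors and refitting — can be sketched like this. Which 4 factors survive depends on the real data, so the example uses synthetic data with 4 informative features by construction:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Hypothetical ground truth: only 4 of the 10 factors are informative.
true_w = np.array([2.0, -1.5, 1.2, 0, 0.8, 0, 0, 0, 0, 0])
y = (X @ true_w > 0).astype(int)

full = LogisticRegression(max_iter=1000).fit(X, y)
# Keep the 4 factors with the largest absolute weights, drop the rest.
top4 = np.argsort(np.abs(full.coef_.ravel()))[-4:]
# Refit on the reduced feature set and check cross-validated accuracy.
reduced_score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, top4], y, cv=5).mean()
```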
<h2 id="summary" class="H2Head">Summary</h2>
<p class="PP">In this model, we tried different algorithms to obtain a robust, interpretable, and accurate solution that predicts whether tobacco is infected from only 4 features with 91% confidence. Since 6 VOC sensors are left unused in this model, the device can be simplified in the future by removing them. Alternatively, we could add more functions to the device by making use of the remaining sensors.</p>
<br><br><br>
<div style="text-align: center">
Latest revision as of 15:58, 3 December 2017