Revision as of 09:45, 20 October 2017

Modeling

VOC Classification

Overview

The VOC device is designed to judge whether a tobacco plant is healthy or infected. Since this is an exploratory experiment, data-analysis algorithms are widely used in our modeling. We perform data preprocessing, data analysis, and algorithm optimization on the data collected by the VOC device. Finally, we use logistic regression and detect infected tobacco with 91% confidence.

Data preprocessing

First we defragment the raw input data and reorganize it into a matrix. The 10 VOC factors serve as features, and the status (healthy or infected) serves as the label to be predicted.

Then we analyze the data using box plots and find that most of the data are normal, but some records contain singular values (outliers). Their box plots are shown as follows:

We remove the records with singular values, and the remaining data follow a normal distribution.
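The exact rule used to flag singular records is not stated here. A minimal sketch, assuming the standard box-plot convention (a value is an outlier when it lies more than 1.5 interquartile ranges outside the quartiles), could look like this. The function names and the sample readings are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) whisker limits of a standard box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outlier_rows(rows, k=1.5):
    """Keep only rows whose every feature lies inside that feature's whiskers."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds))
    ]


# Hypothetical readings: 2 features per record; the last record is singular.
records = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.5), (1.0, 10.2), (9.0, 10.1)]
cleaned = drop_outlier_rows(records)
```

A record is dropped as soon as any one of its features falls outside the whiskers, which matches the idea of removing whole records rather than single values.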

Data analysis

Our goal is to build a model that predicts the tobacco's status from the 10 input features. This is a classic binary classification problem, and several algorithms can solve it. We sample the data with cross-validation and score each model with the ridit test.
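The number of folds and whether the records were shuffled are not given in the text. As a rough illustration of how cross-validation partitions the data, the following sketch splits sample indices into k contiguous folds; `k_fold_indices` is a hypothetical helper, and the values 10 and 3 are arbitrary.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
```

Each model is trained on `train_idx` and scored on `test_idx`, and the scores of all folds are averaged.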

Decision Tree

First we use decision trees based on information theory. The ID3 decision tree chooses splits that maximize the information gain, and the CART tree chooses splits that minimize the Gini index. The performance of the two algorithms is almost the same: R = 0.83
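To make the two split criteria concrete, here is a small sketch that computes the information gain and the weighted Gini index of one candidate split. The labels are made up for illustration and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(parent, left, right):
    """Return (information gain, weighted Gini) for one candidate split."""
    n = len(parent)
    w_left, w_right = len(left) / n, len(right) / n
    gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
    weighted_gini = w_left * gini(left) + w_right * gini(right)
    return gain, weighted_gini


# A split that separates the two classes perfectly.
parent = ["healthy"] * 4 + ["infected"] * 4
gain, weighted_gini = split_scores(parent, parent[:4], parent[4:])
```

For this perfect split the gain reaches its maximum of 1 bit and the weighted Gini index drops to 0, so both criteria prefer it.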

MLP

The second algorithm we apply is the Multi-Layer Perceptron (MLP), a type of neural network. In this model, we use more than 100 neurons in each layer, and the activation function is ReLU.

The result of the MLP is much better than that of the decision tree: R = 0.89
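For readers unfamiliar with the model, the sketch below shows the forward pass of a ReLU network on a deliberately tiny example. The layer sizes, the weights, and the sigmoid output unit are assumptions chosen for illustration; the trained network described above has more than 100 neurons per layer and its parameters are not listed here.

```python
from math import exp


def relu(v):
    """Apply ReLU element-wise."""
    return [max(0.0, x) for x in v]


def dense(weights, bias, v):
    """One fully connected layer; weights holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def mlp_forward(x, hidden_layers, out_weights, out_bias):
    """Return the output probability of a ReLU MLP with one sigmoid output."""
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(weights, bias, h))
    z = sum(w * v for w, v in zip(out_weights, h)) + out_bias
    return 1.0 / (1.0 + exp(-z))


# Hypothetical tiny network: 2 inputs, one hidden layer of 2 neurons.
hidden = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])]
p = mlp_forward([2.0, 0.5], hidden, out_weights=[1.0, -1.0], out_bias=0.0)
```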

Linear Model

Although the performance of the MLP is good enough, it is difficult to extract the knowledge learned by the algorithm, so its interpretability is weak. Why don't we try a simple model with high interpretability? First we try the LDA algorithm to compress the 10-dimensional data into 2 dimensions.

$J=\frac{||w^T\mu_0-w^T\mu_1||^2}{w^T\Sigma_0w+w^T\Sigma_1w}=\frac{w^T(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw}{w^T\Sigma_0w+w^T\Sigma_1w}$

We define $S_w$ as the within-class scatter matrix:

$S_w=\Sigma_0+\Sigma_1=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T+\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T$

We define $S_b$ as the between-class scatter matrix:

$S_b=(\mu_0-\mu_1)(\mu_0-\mu_1)^T$

So $J=\frac{w^TS_bw}{w^TS_ww}$, and the goal of LDA is to maximize $J$.
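A brief sketch of why this criterion has a closed-form maximizer (a standard textbook result, not stated on the original page, and assuming $S_w$ is invertible): $J$ is unchanged when $w$ is rescaled, so we can fix $w^TS_ww=1$ and maximize $w^TS_bw$ with a Lagrange multiplier $\lambda$, which gives

$S_bw=\lambda S_ww$

Because $S_bw=(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw$ always points along $\mu_0-\mu_1$, the optimal direction is, up to scale,

$w\propto S_w^{-1}(\mu_0-\mu_1)$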

The result of the LDA algorithm is as follows, with $R=0.89$:

This result suggests that the data are largely linearly separable, so we choose the logistic regression algorithm.

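We define $\mathrm{logit}\,P=\ln\frac{y}{1-y}\in (-\infty,+\infty)$, where $y$ is read as the probability of the positive class. Setting this equal to $w^Tx+b$ gives: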

$p(y=1|x)=\frac{e^{w^Tx+b}}{1+e^{w^Tx+b}}$

$p(y=0|x)=\frac{1}{1+e^{w^Tx+b}}$

$l(w,b)=\sum_{i=1}^{m}\ln p(y_i|x_i;w,b)$

Then we can apply the maximum likelihood method to estimate the parameters.
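The optimizer used for the fit is not named in the text. As one possible illustration, the sketch below maximizes $l(w,b)$ by plain gradient ascent, using the fact that the gradient with respect to $w$ is $\sum_i (y_i-p_i)x_i$ with $p_i=p(y=1|x_i)$. The learning rate, step count, and toy data are assumptions.

```python
from math import exp, log


def sigmoid(z):
    """p(y=1|x) for z = w^T x + b, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def log_likelihood(w, b, xs, ys):
    """l(w, b) = sum_i ln p(y_i | x_i; w, b) for labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total += log(p) if y == 1 else log(1.0 - p)
    return total


def fit(xs, ys, lr=0.1, steps=500):
    """Gradient ascent on l(w, b); the gradient is sum_i (y_i - p_i) * x_i."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj + lr * g for wj, g in zip(w, grad_w)]
        b += lr * grad_b
    return w, b


# Hypothetical toy data: one feature, label 1 when the reading is high.
xs = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)
predictions = [1 if sigmoid(w[0] * x[0] + b) >= 0.5 else 0 for x in xs]
```

The learned `w` and `b` play the same role as the weight vector and intercept of a fitted logistic regression model.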

The result is as follows:


    Weight:
    [[ 0.1819504   0.38788225  0.01350023  0.39594948  0.17799418  0.42087034
      -0.57733395 -0.23876003 -0.00532918 -0.46174515]]
    Intercept:
    [ 0.00937812]
    Effect:
    D    35.300735
    B    22.596339
    F    18.289277
    E    10.265025
    C     0.393225
    I    -1.575564
    A   -10.679026
    H   -14.398440
    G   -26.211964
    J   -39.130542
    dtype: float64
    Score:
    0.894333333333

Algorithm optimization

From the result of logistic regression, some factors, such as C and I, carry little weight, and these factors may disturb the classification. We try to remove the unimportant factors and simplify the model.

Finally, we keep 4 factors (D, F, G, and J), with which we can predict the tobacco's status with 91% confidence, and this also lets us simplify the VOC device.


    Weight:
    [[ 0.53196697  0.3404023  -0.53555988 -0.45588715]]
    Intercept:
    [-0.01204088]
    Effect:
    D    33.217011
    F    15.492680
    G   -17.319760
    J   -33.967849
    dtype: float64
    Score:
    0.912444444444
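As a hedged usage sketch, the reduced model can be evaluated by plugging a reading into the logistic function. This assumes that the four weights are listed in the order D, F, G, J (which matches the signs of the effects), that inputs are scaled the same way as the training data, and it makes no claim about which status is encoded as class 1, since none of this is stated in the text. The readings are hypothetical.

```python
from math import exp

# Values copied from the fitted 4-factor model above.
WEIGHTS = {"D": 0.53196697, "F": 0.3404023, "G": -0.53555988, "J": -0.45588715}
INTERCEPT = -0.01204088


def predict_proba(reading):
    """Return the model's probability of class 1 for one scaled reading."""
    z = INTERCEPT + sum(WEIGHTS[name] * reading[name] for name in WEIGHTS)
    return 1.0 / (1.0 + exp(-z))


# Hypothetical scaled readings, not measured data.
neutral = predict_proba({"D": 0.0, "F": 0.0, "G": 0.0, "J": 0.0})
high_d = predict_proba({"D": 2.0, "F": 0.0, "G": 0.0, "J": 0.0})
```

Raising factor D alone pushes the probability of class 1 up, while raising G or J would push it down, which is the kind of direct reading that makes this model interpretable.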

Summary

In this model, we try different algorithms to obtain a robust, interpretable, and accurate solution that predicts whether the tobacco is infected from only 4 features with 91% confidence. Since 6 of the VOC sensors are not needed in this model, the device can also be simplified by removing them.

Difference between revisions of "Team:ZJU-China/Model"