Difference between revisions of "Team:NCTU Formosa/Model"

 
(20 intermediate revisions by 5 users not shown)
Line 5: Line 5:
 
<head>
 
<head>
 
     <meta charset="UTF-8">
 
     <meta charset="UTF-8">
     <title>Untitled Document</title>
+
     <title>NCTU_Formosa: Peptide Prediction Modeling</title>
     <script src="//ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
+
     <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
     <link href="Home.css" rel="stylesheet" type="text/css">
+
    <script src="jQueryAssets/jquery-1.11.1.min.js"></script>
     <script src="Home.js" type="text/javascript"></script>
+
    <script src="jQueryAssets/jquery.ui-1.10.4.dialog.min.js"></script>
 +
     <link href="modeling_peptide_prediction.css" rel="stylesheet" type="text/css">
 +
     <script src="modeling_peptide_prediction.js" type="text/javascript"></script>
 
     <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no, minimum-scale=1.0, maximum-scale=1.0" />
 
     <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no, minimum-scale=1.0, maximum-scale=1.0" />
  
Line 15: Line 17:
 
     margin: 0;
 
     margin: 0;
 
     padding: 0;
 
     padding: 0;
    height: 100vh;
 
 
     overflow-x: hidden;
 
     overflow-x: hidden;
 +
    height: 100vh;
 
}
 
}
  
Line 24: Line 26:
 
}
 
}
  
.subtitle{
+
#fut{
     height: 35px;
+
background: white;
     max-width: 90%;
+
width: 100vw;
 +
height: 50px;
 +
}
 +
 
 +
 
 +
p {
 +
    text-align: justify;
 +
    font-size: 1.3em !important;
 +
    font-family: Arial, sans-serif;
 +
    margin: 10px 0 0;
 +
}
 +
 
 +
h1{
 +
    border: 0;
 +
}
 +
 
 +
h4{
 +
    text-align: center;
 +
    margin-top: 0px;
 +
    margin-bottom: 50px !important;
 +
    font-size: 15px;
 +
}
 +
 
 +
h3{
 +
    text-align: center;
 +
    margin-top: 50px;
 +
    margin-bottom: 0px !important;
 +
    font-size: 15px;
 +
}
 +
 
 +
 
 +
 
 +
 
 +
@font-face {
 +
  font-family: neo_latina;
 +
  src: url(https://static.igem.org/mediawiki/2017/4/46/Neo-latina-demo-FFP.ttf);
 +
}
 +
 
 +
.subtitle>h6{
 +
     font-family: neo_latina;
 +
     font-size: 3.9em;
 
     position: relative;
 
     position: relative;
     left: 10vw;
+
     left: 25px;
 
     margin-top: 50px;
 
     margin-top: 50px;
 
     margin-bottom: 10px;
 
     margin-bottom: 10px;
 +
    line-height: 1em;
 
}
 
}
  
 +
.subtitle>h5{
 +
    font-family: neo_latina;
 +
    font-size: 1.95em;
 +
    position: relative;
 +
    left: 50px;
 +
    margin-top: -10px;
 +
    margin-bottom: 10px;
 +
    line-height: 1em;
 +
}
  
#fut{
+
.latex{
background: white;
+
    font-size: 20px;
width: 100vw;
+
height: 200px;
+
 
}
 
}
  
 +
.sub_text{
 +
    width:65vw;
 +
    position: relative;
 +
    left: 5vw;
 +
}
 +
 +
.sub2_text{
 +
    width:60vw;
 +
    position: relative;
 +
    left: 10vw;
 +
    line-height: 15px;
 +
    margin-bottom: 10px;
 +
}
 
.line{
 
.line{
 
position: relative;
 
position: relative;
Line 46: Line 109:
 
}
 
}
  
p {
+
 
    text-align: justify;
+
    font-size: 1.3em !important;
+
    font-family: Arial, sans-serif;
+
}
+
  
  
Line 148: Line 207:
 
     margin-top: 0 !important;
 
     margin-top: 0 !important;
 
}
 
}
 +
 +
#ptpm_reference{
 +
    margin-bottom: 0 !important;
 +
}
 +
/*----------------------------------------------------------------------------*/
 +
/*----------------------------------------------------------------------------*/
 +
span>a:link {
 +
    color: #85CFEA;
 +
}
 +
 +
span>a:visited {
 +
    color: #85CFEA;
 +
}
 +
 +
span>a:hover {
 +
    color: blue;
 +
}
 +
 +
span>a:active {
 +
    color: blue;
 +
}
 +
 +
span>a{
 +
    text-decoration: none;
 +
}
 +
/*----------------------------------------------------------------------------*/
 +
/*----------------------------------------------------------------------------*/
 +
/*----------------------------------------------------------------------------*/
 +
/*----------------------------------------------------------------------------*/
 +
  
 
</style>
 
</style>
Line 154: Line 243:
  
 
</script>
 
</script>
 +
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
 +
        MathJax.Hub.Config({
 +
            extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
 +
            jax: ["input/TeX", "output/HTML-CSS"],
 +
            tex2jax: {
 +
                inlineMath: [
 +
                    ['$', '$'],
 +
                    ["\\(", "\\)"]
 +
                ],
 +
                displayMath: [
 +
                    ['$$', '$$'],
 +
                    ["\\[", "\\]"]
 +
                ],
 +
            },
 +
            "HTML-CSS": {
 +
                availableFonts: ["TeX"]
 +
            }
 +
        });
 +
    </script>
 
<!----------------------------------------------------------------------------->
 
<!----------------------------------------------------------------------------->
  
Line 202: Line 310:
  
 
             <ul>
 
             <ul>
                 <li>Collection: The collection of positive(antifungal) data was from opening online databases such as CAMP, APD , PhytAMP and new peptides that we recently collect in our database, while negative data from Uniprot which did not have antifungal
+
                 <li>Collection: The collection of positive(antifungal) data was from opening online databases such as CAMP, APD, PhytAMP and new peptides that we recently collect in our database, while negative data from UniProt which did not have antifungal
 
                     or antimicrobial keywords.</li>
 
                     or antimicrobial keywords.</li>
                 <li>Pre-processing: First was to delete some peptides which contained non-standard amino acids. Then we limited the length of peptides for 10 AA’s to 100 AA’s because antifungal peptide were typically between 10-100 amino acids long. Furthermore,
+
                 <li>Pre-processing: First was to delete some peptides which contained non-standard amino acids. Then we limited the length of peptides for 10 AA’s to 100 AA’s because antifungal peptides were typically between 10-100 amino acids long. Furthermore,
 
                     filtered the peptides with identity
 
                     filtered the peptides with identity
                     <=2 5% . And then choose negative data that as much as the positive data.</li>
+
                     <=25% . And then choose negative data that as much as the positive data.</li>
 
             </ul>
 
             </ul>
  
Line 213: Line 321:
 
             </p>
 
             </p>
  
             <p>這裡有一張表</p>
+
             <h3>Table 1: This is the table of datasets with 375 positive and 375 negative data, we randomly selected two-thirds of data to training data and one-thirds of data to testing data.</h3>
 +
            <img src="https://static.igem.org/mediawiki/2017/6/6a/Dataset.PNG" width="40%" style="display: block; margin: auto;">
  
  
Line 221: Line 330:
 
             <h1>Algorithm - SCM</h1>
 
             <h1>Algorithm - SCM</h1>
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We used scoring card method for machine learning. Scoring card method (SCM) is developed from from Shinn-Ying Ho, NCTU. It is a simple, accurate and interpretable machine learning method. It can not only predict the peptide
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We used scoring card method for machine learning. Scoring card method (SCM) is developed from Shinn-Ying Ho, NCTU. It is a simple, accurate and interpretable machine learning method. It can not only predict the peptide function , but also can predict the important domain of the peptide.
                function , but also can predict the important domain of the peptide.
+
 
             </p>
 
             </p>
  
Line 230: Line 338:
  
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the first part, the dipeptide score is a simple and effective way to predict peptides’ function by scoring peptides. For every peptide, we could calculate its dipeptide frequency. Then, we gave a initial weight for each
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the first part, the dipeptide score is a simple and effective way to predict peptides’ function by scoring peptides. For every peptide, we could calculate its dipeptide frequency. Then, we gave an initial weight for each specific dipeptide through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix tallied out the peptide score. For a peptide evaluated, the higher score it is, the greater possibility antifungal has to be.
                specific dipeptide through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix tallied out the peptide score. For a peptide evaluated, the higher score it is, the greater possibility antifungal has to be.
+
 
             </p>
 
             </p>
  
Line 237: Line 344:
 
             <ul>
 
             <ul>
 
                 <li>Each peptide will form a 400*1 matrix of every dipeptide frequency because there are 20 types of amino acids so results in 400 types of dipeptide frequency.</li>
 
                 <li>Each peptide will form a 400*1 matrix of every dipeptide frequency because there are 20 types of amino acids so results in 400 types of dipeptide frequency.</li>
                 <p>一個方程式</p>
+
                 <div class="latex"> $$ 20_{AA} \times 20_{AA}=400_{dipeptide} $$</div>
 
                 <li>Each peptide will then get a score according to the peptide sequences by the multiplication.</li>
 
                 <li>Each peptide will then get a score according to the peptide sequences by the multiplication.</li>
                 <p>一張圖</p>
+
                 <img src="https://static.igem.org/mediawiki/2017/3/39/Ptpm_photo2.png" width="60%" style="display: block; margin: auto;">
 +
                <h4>Figure 1: This picture illustrated how to calculate peptide score by the scorecard. First reshape the 20 x 20 to 400 matrix, then multiplied with the scorecard matrix. Finally we can get the final score.</h4>
 +
                <div class="latex">$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$</div>
 
                 <li>If the score of the peptide is higher than the threshold tallied out, then it is predicted as an antifungal peptide, otherwise it isn’t. The higher score it is, the higher probability of the antifungal function it possesses.</li>
 
                 <li>If the score of the peptide is higher than the threshold tallied out, then it is predicted as an antifungal peptide, otherwise it isn’t. The higher score it is, the higher probability of the antifungal function it possesses.</li>
                <p>一個方程式</p>
 
 
             </ul>
 
             </ul>
 +
 +
            <div class="latex">$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$
 +
            </div>
  
 
             <h2>2.The initial weight</h2>
 
             <h2>2.The initial weight</h2>
Line 248: Line 359:
 
             <p>P<small>(ij)</small> is the dipeptide frequency of positive dataset </p>
 
             <p>P<small>(ij)</small> is the dipeptide frequency of positive dataset </p>
 
             <p>N<small>(ij)</small> is the dipeptide frequency of negative dataset</p>
 
             <p>N<small>(ij)</small> is the dipeptide frequency of negative dataset</p>
             <p>一個連立方程式</p>
+
             <div class="latex">$$ P(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=1\right ),1\leq i,j\leq 20 $$</div>
             <p>一個方程式</p>
+
             <div class="latex">$$ N(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=0\right ),1\leq i,j\leq 20 $$</div>
 +
            <div class="latex">$$ S_{ij} =P_{ij}-N_{ij} $$</div>
 +
 
 
             <p>
 
             <p>
 
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Then,each weight is the frequency of positive data (P (ij)) minus the frequency of negative data (N (ij)), normalizes them to [0,1] and then times 1000.
 
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Then,each weight is the frequency of positive data (P (ij)) minus the frequency of negative data (N (ij)), normalizes them to [0,1] and then times 1000.
 
             </p>
 
             </p>
  
             <p>一個方程式</p>
+
             <div class="latex">$$ S'_{(ij)}=\left ( \frac{S_{ij}-S_{min}}{S_{max}-S_{min}} \right ) \times 1000 $$</div>
  
 
             <p>
 
             <p>
Line 261: Line 374:
  
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the second part , we used IGA to optimize our initial scoring card. It is based on the nature life evolution.
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the second part, we used IGA to optimize our initial scoring card. It is based on the natural life evolution.
 
             </p>
 
             </p>
  
Line 267: Line 380:
  
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Below is the the flow chart of IGA.
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This is the the flow chart of IGA. First, we combined an initial scoring card with another randomly initialized scoring card to make the first population. Second, we calculated the fitness of every scoring card. Then, if the scoring cards reach the ending condition, it will switch to the select section to select many pairs of scoring cards into the crossover section to make new offspring(scoring card).The new offspring then would be passed to the mutation section. After the mutation, the new offspring would be added into the population and the population would be ranked by their fitness and the scoring card which ranking out of the max_population would be removed.
 
             </p>
 
             </p>
  
             <p>一張flow chart</p>
+
             <img src="https://static.igem.org/mediawiki/2017/b/bd/Ptpm_photo3.png" width="60%" style="display: block; margin: auto;">
 +
            <h4>Figure 2: This is the the flow chart of IGA</h4>
  
 
             <p>
 
             <p>
Line 277: Line 391:
  
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To optimize the initial card, we would do the advanced crossover to have the variation to reach machine learning. Every round we should select two weights among all by the pick up method. After the advanced crossover optimized
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To optimize the initial card, we would do the advanced crossover to have the variation to reach machine learning. Every round we should select two weights among all by the pickup method. After the advanced crossover optimized from the normal crossover, the mutation had done and new weights would be put into the population.
                from the normal crossover, the mutation had done and new weights would be put into the population.
+
 
             </p>
 
             </p>
  
Line 284: Line 397:
 
                 <li>Confusion Matrix</li>
 
                 <li>Confusion Matrix</li>
 
                 <p>
 
                 <p>
                     To deal with a set of weight, first we calculated the confusion matrix. By the confusion matrix we separated prediction section and label section into four classes, which called TP,FP,FN, and TN.
+
                     To deal with a set of weight, first we calculated the confusion matrix. By the confusion matrix we separated prediction section and label section into four classes, which called TP(True Positive),FP(False Positive),FN(False Negative), and TN(True Negative).
 
                 </p>
 
                 </p>
                 <p>一張表格</p>
+
                 <h3>Table 2: Confusion Matrix</h3>
 +
                <img src="https://static.igem.org/mediawiki/2017/f/ff/Confusion_2.png" width="40%" style="display: block; margin: auto;">
 
                 <p>Then we calculated TPR and FPR.</p>
 
                 <p>Then we calculated TPR and FPR.</p>
                 <p>一個公式</p>
+
                 <div class="latex">$$ TPR=\frac{TP}{\left(TP+FN\right)} $$</div>
                 <p>一個公式</p>
+
                 <div class="latex">$$ FPR=\frac{FP}{\left(FP+TN\right)} $$</div>
 
                 <p>We put TPR as y-axis and FPR as x-axis to draw the ROC curve.</p>
 
                 <p>We put TPR as y-axis and FPR as x-axis to draw the ROC curve.</p>
 
                 <li>AUC of ROC curves</li>
 
                 <li>AUC of ROC curves</li>
                 <p>一個圖</p>
+
                 <img src="https://static.igem.org/mediawiki/2017/2/23/ROC_ex_2.png" width="60%" style="display: block; margin: auto;">
 +
                <h4>Figure 3: AUC of ROC curves</h4>
 
                 <p>
 
                 <p>
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;With different thresholds to distinguish positive ones from negative ones, the TP,FP,FN, and TN would be different. As a result, we will have different TPR and FPR of every threshold and following AUC of ROC curve. To evaluate
+
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;With different thresholds to distinguish positive ones from negative ones, the TP,FP, FN, and TN would be different. As a result, we will have different TPR and FPR of every threshold and following AUC of ROC curve. To evaluate the fitness of the weight, we calculated the area under the curve (AUC), which is a common way to evaluate models and predictions. The advantage of AUC of ROC curve is that it possesses much resistance to unbalancing datasets, while in fact non-antifungal peptides are far more than antifungal peptides.
                    the fitness of the weight, we calculated the area under curve (AUC), which is a common way to evaluate models and predictions. The advantage of AUC of ROC curve is that it possesses much resistance to unbalancing datasets, while in
+
                    fact non-antifungal peptides are far more than antifungal peptides.
+
 
                 </p>
 
                 </p>
 +
                <li>Pearson coefficient</li>
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We not only considered the AUC value but also considered the Pearson coefficient of the amino acids between the initial scoring card and the current scoring card. We gave different weights for each value, with 0.9 of the
 +
                        AUC value and 0.1 of the Pearson coefficient for the best training performance. With the addition of the Pearson coefficient, the model can avoid overtraining.
  
                <li>the Pick up method</li>
+
                    </p>
                <p>
+
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We picked two weights among all. One had the most fitness(the highest AUC), which is probably to be the best weight. The other parent was selected using the roulette method.
+
  
                </p>
 
  
                <p>
 
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We separated different areas for each weight of the score card proportionally to their fitness. The higher fitness of the weight would get the larger area.
 
                </p>
 
  
                <p>
+
                    <li>The Select method</li>
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Then we randomly chose a number and took the score card which the random number was in its area.
+
                     <p>
                </p>
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We picked two weights among all. One had the most fitness(the highest AUC), which is probably to be the best weight. The other parent was selected using the roulette method.
  
                <p>一張圖</p>
+
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Using Roulette method is to make sure the randomness of the selection. The one who had higher fitness score card probably would be chosen but not absolutely be chosen.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We separated different areas for each weight of the score card proportionally to their fitness. The higher fitness of the weight would get the larger area.
                </p>
+
                    </p>
  
                <li>Crossover of IGA</li>
+
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Then we randomly chose a number and took the score card which the random number was in its area.
 +
                    </p>
  
                <p>
+
                     <img src="https://static.igem.org/mediawiki/2017/4/42/Ptp_probability.png" width="60%" style="display: block; margin: auto;">
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After choosing the parents, we used IGA to optimize the crossover. IGA is based on the normal GA.
+
                </p>
+
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the Genetic Algorithms (normal GA), the crossover selection is the most important selection. After choosing two parents is to randomly choose a pair of parameters to exchange. And then return the exchanged score card
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Using Roulette method is to make sure the randomness of the selection. The one who had higher fitness score card probably would be chosen but not absolutely be chosen.
                     into the new population.
+
                     </p>
                </p>
+
  
                <p>一張圖</p>
+
                    <li>Crossover of IGA</li>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After that, delete lower fitness score card and keep the population in a range.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After choosing the parents, we used IGA to optimize the crossover. IGA is based on the normal GA.
                </p>
+
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;However, how do we choose the best set of parameters? IGA cross over is the method which is for large parameters developed by SY Ho. (ref : http://ieeexplore.ieee.org/document/1369245/).
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the Genetic Algorithms (normal GA), the crossover selection is the most important selection. After choosing two parents is to randomly choose a pair of parameters to exchange. And then return the exchanged score card
 +
                        into the new population.
 +
                    </p>
  
                </p>
+
                    <img src="https://static.igem.org/mediawiki/2017/b/b7/Ptpm_photo6.png" width="60%" style="display: block; margin: auto;">
 +
                    <h4>Figure 4: Cross over</h4>
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After that, delete lower fitness score card and keep the population in a range.
 +
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If there is a target function,
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;However, how do we choose the best set of parameters? IGA crossover is the method which is for large parameters developed by SY Ho. (ref : http://ieeexplore.ieee.org/document/1369245/).
                </p>
+
  
                <p>一個方程式</p>
+
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and for each x1, x2, x3, we have 2 candidates to choose, just like the two parents in the crossover step.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If there is a target function,
                </p>
+
                    </p>
  
                <p>一綱方程式</p>
+
                    <div class="latex">$$ f(x_1,x_2,x_3)=100x_1-10x_2-x_3 $$</div>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1)To maximize the function of IGA, we will first create an OA-array, just like the array shown above.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and for each x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, we have 2 candidates to choose, just like the two parents in the crossover step.
                </p>
+
                    </p>
  
                <p>
+
                    <div class="latex">$$ maximize\;y(x_{1},x_{2},x_{3})=100x_{1}-10x_{2}-x_{3},\\x_{1}\in\left \{ 1,2 \right \},x_{2}\in\left \{ 3,4 \right \},and\;x_{3}\in\left \{ 5,6 \right \} $$</div>
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2)Take x1 for example.
+
                     <p>
                </p>
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1)To maximize the function of IGA, we will first create an OA-array, just like the array shown above.
 +
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For evaluating x1, the key is to eliminate the effect by x2 and x3.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2)Take x1 for example.
                </p>
+
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;From the table below, we can see the way to obtain the evaluation is to pair column 1 and 2 together, while 3 and 4 together.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For evaluating x1, the key is to eliminate the effect by x2 and x3.
                </p>
+
                    </p>
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So we have:
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;From the table below, we can see the way to obtain the evaluation is to pair column 1 and 2 together, while 3 and 4 together.
                </p>
+
                    </p>
  
                <p>一個連立方程式</p>
+
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So we have:
 +
                    </p>
  
                <p>
+
                    <div class="latex"></div>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Because the value ofSJ2is larger than ofSJ1, the better parameter for x1 will be 2 instead of 1.
+
                </p>
+
  
                <p>
+
                    <p>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(3)Other parameters are chosen by the same method.
+
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Because the value of SJ2 is larger than of SJ1, the better parameter for x1 will be 2 instead of 1.
                </p>
+
                    </p>
  
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(3)Other parameters are chosen by the same method.
 +
                    </p>
  
                <p>
 
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The main idea of this method is related to statistics. If the number of parameters is big enough, the effect of other parameters will be limited.
 
                </p>
 
  
                <li>Termination</li>
+
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The main idea of this method is related to statistics. If the number of parameters is big enough, the effect of other parameters will be limited.
 +
                    </p>
  
 +
                    <h3>Table 3: An example of IGA optimization</h3>
 +
                    <img src="https://static.igem.org/mediawiki/2017/b/b6/Ptpm_photo7.png" width="40%" style="display: block; margin: auto">
  
                <p>
+
 
                     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To prevent the model from over-training, the program will be terminated after 30 generations. When it reached its end condition it would return the final score card with the best fitness in training data.
+
                    <li>Mutation</li>
                </p>
+
 
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After the crossover section, a new offspring will start a mutation section. In a mutation section, the program will choose a random number to determine whether to mutate or not. If the result is yes, it will randomly choose a allele of offspring and set a random number. The mutation section is important to increase the randomness of the model.
 +
                     </p>
 +
 
 +
                    <li>Sorting and Filter</li>
 +
 
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After the mutation section, the new offspring would join the population, then the program sorted all scoring cards in the population by their fitness values. After the sorting population, the last process was to filter out the scoring card that ranked outside the max_population number.
 +
                    </p>
 +
 
 +
                    <li>Termination</li>
 +
 
 +
                    <p>
 +
                        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To get the best accuracy and prevent the model from over-training, the program will be terminated after 30 generations. When it reached its end condition it would return the final score card with the best fitness in training
 +
                        data.
 +
                    </p>
  
 
             </ul>
 
             </ul>
Line 402: Line 538:
 
             <h1>Results</h1>
 
             <h1>Results</h1>
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;However, the training data can not always reflect the real situation, so the model will be evaluated by the independent testing data (sequence identity=25%). Here is the ROC graph of the test data for each dataset.
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;However, the training data cannot always reflect the real situation, so the model will be evaluated by the independent testing data (sequence identity=25%). Here is the ROC graph of the test data for each dataset.
 
             </p>
 
             </p>
  
Line 409: Line 545:
 
             </p>
 
             </p>
  
             <p>一張圖</p>
+
             <img src="https://static.igem.org/mediawiki/2017/d/da/Ptp_result_photo1.png" width="60%" style="display: block; margin: auto;">
 
+
            <h4>Figure 5: The AUC curve and our result of our model</h4>
 
             <p>
 
             <p>
 
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System Operation:
 
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System Operation:
Line 416: Line 552:
  
 
             <p>
 
             <p>
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For an unknown peptide inputting in the scoring system,we will first transfer it into dipeptide frequency and then multiply them with the dipeptide score card, and then get the final score.
+
                 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For an unknown peptide inputting in the scoring system,we will first transfer it into dipeptide frequency and then multiply them with the dipeptide score card, and then get the final score. The threshold value with the best accuracy is determined due to the test result.  
 
             </p>
 
             </p>
  
             <p>兩個方程式</p>
+
             <div class="latex">$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$</div>
 +
            <div class="latex">$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$
 +
            </div>
  
            <p>- Moreover , the scoring card predicting system is now available in (http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html)</p>
 
  
        </div>
+
            <p>- Moreover , the scoring card predicting system is now available in <span><a href="http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html" target="_blank">Parabase predict tool</a></span></p>
  
 +
        </div>
 +
       
 
         <div id="ptpm_reference">
 
         <div id="ptpm_reference">
             <p><small>[1]Vasylenko T, Liou TF, Chiou PC, Chu HW, Lai YS, Chuo YL,Shinn-Ying Ho, et al. SCMBYK: prediction and characterization of bacterial tyrosine-kinases based on propensity scores of dipeptides. BMC Bioinformatics. 2016. doi:10.1186/s12859-016-1371-4.</small></p>
+
        <div class="subtitle">
             <p><small>[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176</small></p>
+
            <h6>Reference</h6>
 +
        </div>
 +
             <p><small>[1] Huang, H.-L., Charoenkwan, P., Kao, T.-F., Lee, H.-C., Chang, F.-L., Huang, W.-L., … Ho, S.-Y. (2012). Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics, 13(Suppl 17), S3. http://doi.org/10.1186/1471-2105-13-S17-S3</small></p>
 +
             <p><small>[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176 </small></p>
 
         </div>
 
         </div>
  

Latest revision as of 02:56, 2 November 2017

navigation

NCTU_Formosa: Peptide Prediction Modeling

Purpose

     1.To build a model for introducing a complete method of to analyzing peptides to provide iGEM new knowledge and a simple and applicable tool

     2.To use a new way to solve the crisis of fungal diseases achieving the combination of information technology and agriculture

     3.To establish a biggest antifungal database so far, possessing the data of hosts, pathogens, antifungal peptides, and scoring system finding the potential antifungal peptides

Datasets

     The performance of machine learning depends on the quality and quantity of the training datasets.

  • Collection: The collection of positive(antifungal) data was from opening online databases such as CAMP, APD, PhytAMP and new peptides that we recently collect in our database, while negative data from UniProt which did not have antifungal or antimicrobial keywords.
  • Pre-processing: First was to delete some peptides which contained non-standard amino acids. Then we limited the length of peptides for 10 AA’s to 100 AA’s because antifungal peptides were typically between 10-100 amino acids long. Furthermore, filtered the peptides with identity <=25% . And then choose negative data that as much as the positive data.

     Finally, we combined our positive and negative dataset and randomly distributed ⅓ datas for independent testing set.

Table 1: This is the table of datasets with 375 positive and 375 negative data, we randomly selected two-thirds of data to training data and one-thirds of data to testing data.

Algorithm - SCM

     We used scoring card method for machine learning. Scoring card method (SCM) is developed from Shinn-Ying Ho, NCTU. It is a simple, accurate and interpretable machine learning method. It can not only predict the peptide function , but also can predict the important domain of the peptide.

     It consists of two important parts - the dipeptide score and the intelligent genetic algorithms(IGA) ,which is based on genetic algorithms.

      For the first part, the dipeptide score is a simple and effective way to predict peptides’ function by scoring peptides. For every peptide, we could calculate its dipeptide frequency. Then, we gave an initial weight for each specific dipeptide through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix tallied out the peptide score. For a peptide evaluated, the higher score it is, the greater possibility antifungal has to be.

1.Dipeptide

  • Each peptide will form a 400*1 matrix of every dipeptide frequency because there are 20 types of amino acids so results in 400 types of dipeptide frequency.
  • $$ 20_{AA} \times 20_{AA}=400_{dipeptide} $$
  • Each peptide will then get a score according to the peptide sequences by the multiplication.
  • Figure 1: This picture illustrated how to calculate peptide score by the scorecard. First reshape the 20 x 20 to 400 matrix, then multiplied with the scorecard matrix. Finally we can get the final score.

    $$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$
  • If the score of the peptide is higher than the threshold tallied out, then it is predicted as an antifungal peptide, otherwise it isn’t. The higher score it is, the higher probability of the antifungal function it possesses.
$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$

2.The initial weight

P(ij) is the dipeptide frequency of positive dataset

N(ij) is the dipeptide frequency of negative dataset

$$ P(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=1\right ),1\leq i,j\leq 20 $$
$$ N(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=0\right ),1\leq i,j\leq 20 $$
$$ S_{ij} =P_{ij}-N_{ij} $$

     Then,each weight is the frequency of positive data (P (ij)) minus the frequency of negative data (N (ij)), normalizes them to [0,1] and then times 1000.

$$ S'_{(ij)}=\left ( \frac{S_{ij}-S_{min}}{S_{max}-S_{min}} \right ) \times 1000 $$

     After doing so,we have the initail scoring card(a set of dipeptide weight).

     For the second part, we used IGA to optimize our initial scoring card. It is based on the natural life evolution.

3.IGA (intelligent genetic algorithm)

     This is the the flow chart of IGA. First, we combined an initial scoring card with another randomly initialized scoring card to make the first population. Second, we calculated the fitness of every scoring card. Then, if the scoring cards reach the ending condition, it will switch to the select section to select many pairs of scoring cards into the crossover section to make new offspring(scoring card).The new offspring then would be passed to the mutation section. After the mutation, the new offspring would be added into the population and the population would be ranked by their fitness and the scoring card which ranking out of the max_population would be removed.

Figure 2: This is the the flow chart of IGA

     We initialized our score card and added other sets of weights with random numbers.

     To optimize the initial card, we would do the advanced crossover to have the variation to reach machine learning. Every round we should select two weights among all by the pickup method. After the advanced crossover optimized from the normal crossover, the mutation had done and new weights would be put into the population.

  • Confusion Matrix
  • To deal with a set of weight, first we calculated the confusion matrix. By the confusion matrix we separated prediction section and label section into four classes, which called TP(True Positive),FP(False Positive),FN(False Negative), and TN(True Negative).

    Table 2: Confusion Matrix

    Then we calculated TPR and FPR.

    $$ TPR=\frac{TP}{\left(TP+FN\right)} $$
    $$ FPR=\frac{FP}{\left(FP+TN\right)} $$

    We put TPR as y-axis and FPR as x-axis to draw the ROC curve.

  • AUC of ROC curves
  • Figure 3: AUC of ROC curves

         With different thresholds to distinguish positive ones from negative ones, the TP,FP, FN, and TN would be different. As a result, we will have different TPR and FPR of every threshold and following AUC of ROC curve. To evaluate the fitness of the weight, we calculated the area under the curve (AUC), which is a common way to evaluate models and predictions. The advantage of AUC of ROC curve is that it possesses much resistance to unbalancing datasets, while in fact non-antifungal peptides are far more than antifungal peptides.

  • Pearson coefficient
  •      We not only considered the AUC value but also considered the Pearson coefficient of the amino acids between the initial scoring card and the current scoring card. We gave different weights for each value, with 0.9 of the AUC value and 0.1 of the Pearson coefficient for the best training performance. With the addition of the Pearson coefficient, the model can avoid overtraining.

  • The Select method
  •      We picked two weights among all. One had the most fitness(the highest AUC), which is probably to be the best weight. The other parent was selected using the roulette method.

         We separated different areas for each weight of the score card proportionally to their fitness. The higher fitness of the weight would get the larger area.

         Then we randomly chose a number and took the score card which the random number was in its area.

         Using Roulette method is to make sure the randomness of the selection. The one who had higher fitness score card probably would be chosen but not absolutely be chosen.

  • Crossover of IGA
  •      After choosing the parents, we used IGA to optimize the crossover. IGA is based on the normal GA.

         For the Genetic Algorithms (normal GA), the crossover selection is the most important selection. After choosing two parents is to randomly choose a pair of parameters to exchange. And then return the exchanged score card into the new population.

    Figure 4: Cross over

         After that, delete lower fitness score card and keep the population in a range.

         However, how do we choose the best set of parameters? IGA crossover is the method which is for large parameters developed by SY Ho. (ref : http://ieeexplore.ieee.org/document/1369245/).

         If there is a target function,

    $$ f(x_1,x_2,x_3)=100x_1-10x_2-x_3 $$

         and for each x1, x2, x3, we have 2 candidates to choose, just like the two parents in the crossover step.

    $$ maximize\;y(x_{1},x_{2},x_{3})=100x_{1}-10x_{2}-x_{3},\\x_{1}\in\left \{ 1,2 \right \},x_{2}\in\left \{ 3,4 \right \},and\;x_{3}\in\left \{ 5,6 \right \} $$

         (1)To maximize the function of IGA, we will first create an OA-array, just like the array shown above.

         (2)Take x1 for example.

         For evaluating x1, the key is to eliminate the effect by x2 and x3.

         From the table below, we can see the way to obtain the evaluation is to pair column 1 and 2 together, while 3 and 4 together.

         So we have:

         Because the value of SJ2 is larger than of SJ1, the better parameter for x1 will be 2 instead of 1.

         (3)Other parameters are chosen by the same method.

         The main idea of this method is related to statistics. If the number of parameters is big enough, the effect of other parameters will be limited.

    Table 3: An example of IGA optimization

  • Mutation
  •      After the crossover section, a new offspring will start a mutation section. In a mutation section, the program will choose a random number to determine whether to mutate or not. If the result is yes, it will randomly choose a allele of offspring and set a random number. The mutation section is important to increase the randomness of the model.

  • Sorting and Filter
  •      After the mutation section, the new offspring would join the population, then the program sorted all scoring cards in the population by their fitness values. After the sorting population, the last process was to filter out the scoring card that ranked outside the max_population number.

  • Termination
  •      To get the best accuracy and prevent the model from over-training, the program will be terminated after 30 generations. When it reached its end condition it would return the final score card with the best fitness in training data.

Results

     However, the training data cannot always reflect the real situation, so the model will be evaluated by the independent testing data (sequence identity=25%). Here is the ROC graph of the test data for each dataset.

     (AFP25: Antifungal peptide with sequence identity=25%)

Figure 5: The AUC curve and our result of our model

     System Operation:

     For an unknown peptide inputting in the scoring system,we will first transfer it into dipeptide frequency and then multiply them with the dipeptide score card, and then get the final score. The threshold value with the best accuracy is determined due to the test result.

$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$
$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$

- Moreover , the scoring card predicting system is now available in Parabase predict tool

Reference

[1] Huang, H.-L., Charoenkwan, P., Kao, T.-F., Lee, H.-C., Chang, F.-L., Huang, W.-L., … Ho, S.-Y. (2012). Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics, 13(Suppl 17), S3. http://doi.org/10.1186/1471-2105-13-S17-S3

[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176

Untitled Document