Revision as of 18:51, 29 October 2017

3. To establish the largest antifungal database to date, containing data on hosts, pathogens, and antifungal peptides, together with a scoring system for finding potential antifungal peptides

Datasets

The performance of machine learning depends on the quality and quantity of the training datasets.

Collection: The positive (antifungal) data came from open online databases such as CAMP, APD, and PhytAMP, plus new peptides that we recently collected in our database. The negative data came from UniProt entries that did not carry antifungal or antimicrobial keywords.
Pre-processing: First, we deleted peptides that contained non-standard amino acids. Then we limited the peptide length to 10 to 100 amino acids, because antifungal peptides are typically 10-100 amino acids long. Next, we filtered the peptides to a sequence identity of <= 25%. Finally, we chose as many negative peptides as there were positive peptides.

Finally, we combined our positive and negative datasets and randomly assigned ⅓ of the data to an independent testing set.
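The filtering and splitting steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `is_valid_peptide` and `build_datasets` are hypothetical, the fixed random seed is our choice, and the sequence-identity filter is left out because it needs a separate sequence-clustering step.

```python
import random

# The 20 standard amino acids, one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_peptide(seq, min_len=10, max_len=100):
    """Keep peptides of 10-100 residues made only of standard amino acids."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


def build_datasets(positives, negatives, test_fraction=1 / 3, seed=0):
    """Return (training, testing) lists of (sequence, label) pairs.

    Hypothetical helper: the identity <= 25% filter is NOT applied here.
    """
    rng = random.Random(seed)
    pos = [s for s in positives if is_valid_peptide(s)]
    neg = [s for s in negatives if is_valid_peptide(s)]
    # Balance the classes: draw as many negatives as there are positives.
    neg = rng.sample(neg, min(len(pos), len(neg)))
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)
    n_test = round(len(data) * test_fraction)
    return data[n_test:], data[:n_test]
```

For example, `train, test = build_datasets(positive_sequences, negative_sequences)` would hold out roughly one third of the balanced data for independent testing.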

Algorithm - SCM

We used the scoring card method (SCM) for machine learning. SCM was developed by Shinn-Ying Ho, NCTU. It is a simple, accurate, and interpretable machine learning method. It can not only predict the function of a peptide but also identify the important domains of the peptide.

It consists of two important parts: the dipeptide score and the intelligent genetic algorithm (IGA), which is based on genetic algorithms.

For the first part, the dipeptide score is a simple and effective way to predict a peptide's function by scoring it. For every peptide, we calculate its dipeptide frequencies. Then we give an initial weight to each dipeptide through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix yields the peptide score. The higher the score of a peptide, the more likely it is to be antifungal.

1. Dipeptide

Each peptide forms a 400*1 matrix of dipeptide frequencies, because the 20 types of amino acids give 400 types of dipeptide.

$$ 20_{AA} \times 20_{AA}=400_{dipeptide} $$

Each peptide then gets a score, computed from its sequence by multiplying each dipeptide frequency by the corresponding weight.

$$ \sum_{i=1}^{400} x_{i}\cdot w_{i}=score $$

If the score of the peptide is higher than the computed threshold, it is predicted to be an antifungal peptide; otherwise it is not. The higher the score, the higher the probability that the peptide has antifungal function.

$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$
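The scoring and thresholding rule above can be sketched in Python as follows. The names `dipeptide_frequencies`, `score`, and `predict` are hypothetical, and we assume each frequency is the dipeptide count divided by the number of dipeptides in the peptide (its length minus one).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_frequencies(seq):
    """Return the 400-element frequency vector x for one peptide."""
    x = [0.0] * len(DIPEPTIDES)
    n_pairs = len(seq) - 1  # number of overlapping dipeptides
    for a, b in zip(seq, seq[1:]):
        x[INDEX[a + b]] += 1.0 / n_pairs
    return x


def score(seq, weights):
    """Sum of x_i * w_i over all 400 dipeptides."""
    return sum(xi * wi for xi, wi in zip(dipeptide_frequencies(seq), weights))


def predict(seq, weights, threshold):
    """Apply the rule: positive if score > threshold, otherwise negative."""
    return "positive" if score(seq, weights) > threshold else "negative"
```

Here `weights` is a list of 400 numbers in the same order as `DIPEPTIDES`, standing in for the score card.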

2. The initial weight

P(ij) is the dipeptide frequency of the positive dataset

N(ij) is the dipeptide frequency of the negative dataset

$$ P(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=1\right ),1\leq i,j\leq 20 $$

$$ N(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=0\right ),1\leq i,j\leq 20 $$

$$ S_{ij} =P_{ij}-N_{ij} $$

Then, each weight is the frequency in the positive data (P(ij)) minus the frequency in the negative data (N(ij)), normalized to [0,1] and then multiplied by 1000.

$$ S'_{(ij)}=\left ( \frac{S_{ij}-S_{min}}{S_{max}-S_{min}} \right ) \times 1000 $$

After doing so, we have the initial scoring card (a set of dipeptide weights).
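A minimal Python sketch of building this initial scoring card is shown below. It rests on two assumptions that the text does not spell out: the denominator in P(ij) and N(ij) is read as the number of dipeptides in a peptide (its length minus one), and the per-peptide frequencies are averaged over each dataset. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def mean_frequencies(seqs):
    """Average the per-peptide dipeptide frequencies over one dataset."""
    totals = dict.fromkeys(DIPEPTIDES, 0.0)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            totals[a + b] += 1.0 / (len(seq) - 1)
    return {dp: total / len(seqs) for dp, total in totals.items()}


def initial_score_card(positives, negatives):
    """Return S'_ij: (P_ij - N_ij), min-max normalized and scaled to [0, 1000]."""
    p = mean_frequencies(positives)
    n = mean_frequencies(negatives)
    s = {dp: p[dp] - n[dp] for dp in DIPEPTIDES}
    s_min, s_max = min(s.values()), max(s.values())
    # Assumes s_max > s_min, i.e. the two datasets are not identical.
    return {dp: (s[dp] - s_min) / (s_max - s_min) * 1000 for dp in DIPEPTIDES}
```

Dipeptides that are enriched in the positive set end up near 1000, and those enriched in the negative set end up near 0.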

For the second part, we used IGA to optimize our initial scoring card. IGA is modeled on natural evolution.

3. IGA (intelligent genetic algorithm)

Below is the flow chart of IGA.

We initialized the population with our score card and added other sets of weights filled with random numbers.

To optimize the initial card, we apply the advanced crossover, which introduces the variation that drives the learning. In every round we select two sets of weights from the population using the pick-up method. After the advanced crossover (an optimized version of the normal crossover), mutation is applied and the new weights are put into the population.

Confusion Matrix

To evaluate a set of weights, we first calculated the confusion matrix. The confusion matrix separates the samples by prediction and by label into four classes, called TP, FP, FN, and TN.

[Table: confusion matrix]

Then we calculated TPR and FPR.

$$ TPR=\frac{TP}{\left(TP+FN\right)} $$

$$ FPR=\frac{FP}{\left(FP+TN\right)} $$

We put TPR on the y-axis and FPR on the x-axis to draw the ROC curve.
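As a small illustration, the confusion-matrix counts and the two rates can be computed as in the Python sketch below. The function names are hypothetical, and labels are assumed to be 1 for antifungal and 0 otherwise.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when 'score > threshold' means predicted positive."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s > threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def tpr_fpr(tp, fp, fn, tn):
    """TPR = TP / (TP + FN), FPR = FP / (FP + TN)."""
    return tp / (tp + fn), fp / (fp + tn)
```

Each threshold gives one (FPR, TPR) pair, which is one point on the ROC curve.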

AUC of ROC curves

With different thresholds for separating positives from negatives, the values of TP, FP, FN, and TN change. As a result, every threshold gives a different TPR and FPR, and together these points form the ROC curve. To evaluate the fitness of a set of weights, we calculated the area under the curve (AUC), which is a common way to evaluate models and predictions. An advantage of the AUC of the ROC curve is that it is robust to unbalanced datasets, which matters because non-antifungal peptides far outnumber antifungal peptides.
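The threshold sweep and the area calculation can be sketched in Python as follows. This is one common way to compute the AUC (trapezoidal rule over the ROC points) and is not necessarily the exact routine used in our program. Using "score >= t" at each observed score t is equivalent to placing the threshold just below that score.

```python
def roc_points(scores, labels):
    """Sweep the threshold over every observed score; return (FPR, TPR) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A score card that ranks every positive peptide above every negative one gives an AUC of 1, while random scoring gives about 0.5.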

The pick-up method

We picked two sets of weights from the population. One had the best fitness (the highest AUC) and is therefore likely to be the best set of weights. The other parent was selected using the roulette method.

We assigned each score card an area proportional to its fitness. The higher the fitness, the larger the area.

Then we randomly chose a number and took the score card whose area contained that number.

The roulette method keeps the selection random. A score card with higher fitness is more likely to be chosen, but it is not guaranteed to be chosen.
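The pick-up method can be sketched in Python as below. The function name `pick_parents` is hypothetical, and the sketch assumes that the roulette draw is allowed to land on the best score card again, which the description above does not specify.

```python
import random


def pick_parents(population, fitness, rng=random):
    """Return (best score card, roulette-selected score card)."""
    # Parent 1: the score card with the highest fitness (AUC).
    best = max(range(len(population)), key=lambda i: fitness[i])

    # Parent 2: roulette wheel, where each card owns a slice of [0, total]
    # proportional to its fitness.
    total = sum(fitness)
    spin = rng.uniform(0, total)
    cumulative = 0.0
    chosen = len(population) - 1  # fallback for floating-point rounding
    for i, f in enumerate(fitness):
        cumulative += f
        if spin <= cumulative:
            chosen = i
            break
    return population[best], population[chosen]
```

Passing a seeded generator, for example `pick_parents(cards, aucs, random.Random(0))`, makes the selection reproducible.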

Crossover of IGA

After choosing the parents, we used IGA to optimize the crossover. IGA is based on the normal GA.

In a normal genetic algorithm (GA), crossover is the most important step. After two parents are chosen, a pair of parameters is chosen at random and exchanged between them. The exchanged score cards are then returned to the new population.

After that, the score cards with lower fitness are deleted to keep the population size within a range.
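For comparison with the IGA crossover, the normal GA steps described above can be sketched in Python as follows. The function names are hypothetical, a score card is represented as a plain list of weights, and swapping a single position per crossover is our reading of "a pair of parameters".

```python
import random


def simple_crossover(parent_a, parent_b, rng=random):
    """Swap the weight at one randomly chosen position between two score cards."""
    child_a, child_b = list(parent_a), list(parent_b)
    i = rng.randrange(len(child_a))
    child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def trim_population(population, fitness_fn, max_size):
    """Keep only the max_size score cards with the highest fitness."""
    return sorted(population, key=fitness_fn, reverse=True)[:max_size]
```

Because the swapped position is random, this crossover does not know which parent holds the better value for a given parameter.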

However, how do we choose the best set of parameters? The IGA crossover, developed by S.-Y. Ho, is a method designed for problems with a large number of parameters (ref: http://ieeexplore.ieee.org/document/1369245/).

If there is a target function,

$$ f(x_1,x_2,x_3)=100x_1-10x_2-x_3 $$

and for each of x1, x2, and x3 we have 2 candidates to choose from, just like the two parents in the crossover step.

[Placeholder: a set of equations]

(1) To maximize the function with IGA, we first create an orthogonal array (OA), like the array shown above.

(2) Take x1 for example.

To evaluate x1, the key is to eliminate the effect of x2 and x3.

From the table below, we can see that the evaluation is obtained by grouping columns 1 and 2 together and columns 3 and 4 together.

So we have:

[Placeholder: a system of simultaneous equations]

Because the value of SJ2 is larger than that of SJ1, the better choice for x1 is candidate 2 rather than candidate 1.

(3) The other parameters are chosen by the same method.

The main idea of this method is statistical. If the number of parameters is large enough, the effect of the other parameters on each evaluation will be limited.

(http://ieeexplore.ieee.org/mediastore/IEEE/content/media/4235/29964/1369245/1369245-table-2-small.gif )[2]
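The evaluation in steps (1) to (3) can be sketched in Python as follows. The array `L4` is the standard two-level orthogonal array for three factors, and the candidate values used in the usage note are hypothetical, since the original values are not reproduced here. The function name `oa_crossover` is also hypothetical.

```python
# L4(2^3) orthogonal array: 4 experiments, 3 factors, 2 levels each.
# In every column each level appears twice, and every pair of columns
# contains each combination of levels exactly once.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def target(x1, x2, x3):
    """The example target function from the text."""
    return 100 * x1 - 10 * x2 - x3


def oa_crossover(candidates, f=target, oa=L4):
    """For each parameter, pick the candidate with the larger summed result.

    candidates[j] holds the two candidate values for parameter j
    (one from each parent).
    """
    results = [f(*(candidates[j][level] for j, level in enumerate(row)))
               for row in oa]
    best = []
    for j in range(len(candidates)):
        effect = [0.0, 0.0]  # summed results for level 0 and level 1
        for row, y in zip(oa, results):
            effect[row[j]] += y
        best.append(candidates[j][0] if effect[0] >= effect[1]
                    else candidates[j][1])
    return tuple(best)
```

With the hypothetical candidates `[(1, 2), (1, 2), (1, 2)]`, the four experiments give 89, 78, 188, and 179. For x1 the two sums are 167 and 367, so the second candidate wins, and the function returns `(2, 1, 1)` after only four evaluations instead of all eight combinations.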

Termination

To prevent the model from over-training, the program terminates after 30 generations. When it reaches this end condition, it returns the final score card with the best fitness on the training data.

Results

However, the training data cannot always reflect the real situation, so the model is evaluated on the independent testing data (sequence identity=25%). Here is the ROC graph of the test data for each dataset.

(AFP25: Antifungal peptide with sequence identity=25%)

System Operation:

For an unknown peptide entered into the scoring system, we first convert it into dipeptide frequencies and then multiply them by the dipeptide score card to get the final score.

$$ \sum_{i=1}^{400} x_{i}\cdot w_{i}=score $$

$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$

- Moreover, the scoring card prediction system is now available at http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html

[1] Huang, H.-L., Charoenkwan, P., Kao, T.-F., Lee, H.-C., Chang, F.-L., Huang, W.-L., … Ho, S.-Y. (2012). Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics, 13(Suppl 17), S3. http://doi.org/10.1186/1471-2105-13-S17-S3

[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176


@@ Line 5: / Line 5: @@
 <head>
      <meta charset="UTF-8">
-     <title>Untitled Document</title>
+     <title>NCTU_Formosa: Peptide Prediction Modeling</title>
-     <script src="//ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
+     <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
-     <link href="Home.css" rel="stylesheet" type="text/css">
+    <script src="jQueryAssets/jquery-1.11.1.min.js"></script>
-     <script src="Home.js" type="text/javascript"></script>
+    <script src="jQueryAssets/jquery.ui-1.10.4.dialog.min.js"></script>
+     <link href="modeling_peptide_prediction.css" rel="stylesheet" type="text/css">
+     <script src="modeling_peptide_prediction.js" type="text/javascript"></script>
      <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no, minimum-scale=1.0, maximum-scale=1.0" />
 	<style>
-  body {
+body {
      margin: 0;
      padding: 0;
@@ Line 152: / Line 154: @@
      margin-bottom: 0 !important;
 }
+/*----------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
+span>a:link {
+    color: #85CFEA;
+}
+span>a:visited {
+    color: #85CFEA;
+}
+span>a:hover {
+    color: blue;
+}
+span>a:active {
+    color: blue;
+}
+span>a{
+    text-decoration: none;
+}
+/*----------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
 	</style>
@@ Line 158: / Line 186: @@
 	</script>
+<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
+        MathJax.Hub.Config({
+            extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+            jax: ["input/TeX", "output/HTML-CSS"],
+            tex2jax: {
+                inlineMath: [
+                    ['$', '$'],
+                    ["\\(", "\\)"]
+                ],
+                displayMath: [
+                    ['$$', '$$'],
+                    ["\\[", "\\]"]
+                ],
+            },
+            "HTML-CSS": {
+                availableFonts: ["TeX"]
+            }
+        });
+    </script>
 <!----------------------------------------------------------------------------->
@@ Line 217: / Line 264: @@
              </p>
-             <p>這裡有一張表</p>
+             <img src="https://static.igem.org/mediawiki/2017/c/cb/Ptpm_photo1.png" width="40%" style="display: block; margin: auto;">
@@ Line 241: / Line 288: @@
              <ul>
                  <li>Each peptide will form a 400*1 matrix of every dipeptide frequency because there are 20 types of amino acids so results in 400 types of dipeptide frequency.</li>
-                 <p>一個方程式</p>
+                 <div class="latex"> $$ 20_{AA} \times 20_{AA}=400_{dipeptide} $$</div>
                  <li>Each peptide will then get a score according to the peptide sequences by the multiplication.</li>
-                 <p>一張圖</p>
+                 <img src="https://static.igem.org/mediawiki/2017/3/39/Ptpm_photo2.png" width="60%" style="display: block; margin: auto;">
+                <div class="latex">$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$</div>
                  <li>If the score of the peptide is higher than the threshold tallied out, then it is predicted as an antifungal peptide, otherwise it isn’t. The higher score it is, the higher probability of the antifungal function it possesses.</li>
-                <p>一個方程式</p>
              </ul>
+            <div class="latex">$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$
+            </div>
              <h2>2.The initial weight</h2>
@@ Line 252: / Line 302: @@
              <p>P<small>(ij)</small> is the dipeptide frequency of positive dataset </p>
              <p>N<small>(ij)</small> is the dipeptide frequency of negative dataset</p>
-             <p>一個連立方程式</p>
+             <div class="latex">$$ P(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=1\right ),1\leq i,j\leq 20 $$</div>
-             <p>一個方程式</p>
+             <div class="latex">$$ N(ij)=\left ( \frac{n_{ij}}{L_{p-1}}\mid C=0\right ),1\leq i,j\leq 20 $$</div>
+            <div class="latex">$$ S_{ij} =P_{ij}-N_{ij} $$</div>
              <p>
                  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Then,each weight is the frequency of positive data (P (ij)) minus the frequency of negative data (N (ij)), normalizes them to [0,1] and then times 1000.
              </p>
-             <p>一個方程式</p>
+             <div class="latex">$$ S'_{(ij)}=\left ( \frac{S_{ij}-S_{min}}{S_{max}-S_{min}} \right ) \times 1000 $$</div>
              <p>
@@ Line 274: / Line 326: @@
              </p>
-             <p>一張flow chart</p>
+             <img src="https://static.igem.org/mediawiki/2017/b/bd/Ptpm_photo3.png" width="60%" style="display: block; margin: auto;">
              <p>
@@ Line 292: / Line 344: @@
                  <p>一張表格</p>
                  <p>Then we calculated TPR and FPR.</p>
-                 <p>一個公式</p>
+                 <div class="latex">$$ TPR=\frac{TP}{\left(TP+FN\right)} $$</div>
-                 <p>一個公式</p>
+                 <div class="latex">$$ FPR=\frac{FP}{\left(FP+TN\right)} $$</div>
                  <p>We put TPR as y-axis and FPR as x-axis to draw the ROC curve.</p>
                  <li>AUC of ROC curves</li>
-                 <p>一個圖</p>
+                 <img src="https://static.igem.org/mediawiki/2017/7/7f/Ptpm_photo4.png" width="60%" style="display: block; margin: auto;">
                  <p>
                      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;With different thresholds to distinguish positive ones from negative ones, the TP,FP,FN, and TN would be different. As a result, we will have different TPR and FPR of every threshold and following AUC of ROC curve. To evaluate
@@ Line 317: / Line 369: @@
                  </p>
-                 <p>一張圖</p>
+                 <img src="https://static.igem.org/mediawiki/2017/6/6e/Ptpm_photo5.png" width="30%" style="display: block; margin: auto;">
                  <p>
@@ Line 334: / Line 386: @@
                  </p>
-                 <p>一張圖</p>
+                 <img src="https://static.igem.org/mediawiki/2017/b/b7/Ptpm_photo6.png" width="60%" style="display: block; margin: auto;">
                  <p>
@@ Line 349: / Line 401: @@
                  </p>
-                 <p>一個方程式</p>
+                 <div class="latex">$$ f(x_1,x_2,x_3)=100x_1-10x_2-x_3 $$</div>
                  <p>
@@ Line 391: / Line 443: @@
                      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The main idea of this method is related to statistics. If the number of parameters is big enough, the effect of other parameters will be limited.
                  </p>
+                <img src="https://static.igem.org/mediawiki/2017/b/b6/Ptpm_photo7.png" width="40%" style="display: block; margin: auto">
+                <p>(http://ieeexplore.ieee.org/mediastore/IEEE/content/media/4235/29964/1369245/1369245-table-2-small.gif )[2]</p>
                  <li>Termination</li>
@@ Line 413: / Line 468: @@
              </p>
-             <p>一張圖</p>
+             <img src="https://static.igem.org/mediawiki/2017/2/26/Ptpm_photo8.png" width="60%" style="display: block; margin: auto;">
              <p>
@@ Line 423: / Line 478: @@
              </p>
-             <p>兩個方程式</p>
+             <div class="latex">$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$</div>
+            <div class="latex">$$ f_{(x)}=\left\{ \begin{array}{l} if\ x>threshold : f_{(x)}=positive\\ if\ x\leq threshold : f_{(x)}=negative \\ \end{array} \right . $$
+            </div>
-             <p>- Moreover , the scoring card predicting system is now available in (http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html)</p>
+             <p>- Moreover , the scoring card predicting system is now available in <span><a href="http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html" target="_blank"></a></span></p>
          </div>
          <div id="ptpm_reference">
-             <p><small>[1]Vasylenko T, Liou TF, Chiou PC, Chu HW, Lai YS, Chuo YL,Shinn-Ying Ho, et al. SCMBYK: prediction and characterization of bacterial tyrosine-kinases based on propensity scores of dipeptides. BMC Bioinformatics. 2016. doi:10.1186/s12859-016-1371-4.</small></p>
+             <p><small>[1] Huang, H.-L., Charoenkwan, P., Kao, T.-F., Lee, H.-C., Chang, F.-L., Huang, W.-L., … Ho, S.-Y. (2012). Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics, 13(Suppl 17), S3. http://doi.org/10.1186/1471-2105-13-S17-S3</small></p>
-             <p><small>[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176</small></p>
+             <p><small>[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176 </small></p>
          </div>

Difference between revisions of "Team:NCTU Formosa/Model"