Difference between revisions of "Team:NCTU Formosa/Peptide Prediction"

Revision as of 00:41, 2 November 2017

NCTU_Formosa: Peptide Prediction

Overview

Our database pairs large datasets with an antifungal peptide prediction system built on the Scoring Card Method (SCM) and further optimization. A peptide's antifungal character can be evaluated and interpreted from sequence analysis alone.

Furthermore, we integrated the related data into a single antifungal database that can be queried by host, pathogen, and corresponding peptide. Combining the two gives Parabase, a database that supports both the discovery of new antifungal peptides and the repurposing of known ones.

Antifungal Prediction System

To evaluate peptide function more quickly, we built our antifungal peptide prediction system on SCM. With this practical and interpretable tool, we can find potential target peptides among a large number of uncharacterized peptides, making good use of the available data.

Content:

Datasets
The concept of the dipeptide and the weight
IGA

To predict our peptides, we adapted the Scoring Card Method (SCM) into our antifungal peptide prediction system. The major advantages of the method are its simplicity, interpretability, and acceptable accuracy.

SCM, based on Support Vector Machine (SVM), is a method that originated with our instructor Shinn-Ying Ho. To measure antifungal activity, we introduced SCM into our model to evaluate peptides’ antifungal function from the perspective of biological information.

- Datasets:

We obtained our positive data from antifungal databases such as cAMP and PhytAMP, and from papers we found in PubMed. We collected our negative data from peptides that are not annotated as antifungal in UniProt.

We created the training and test datasets by reducing the sequence identity of the positive and negative data and dividing them into two portions, so that each dataset has an equal amount of positive and negative data.
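The balanced split described above can be sketched as follows. This is only an illustration: it assumes redundancy reduction has already been applied to both lists, and the function name `balanced_split`, the `train_fraction` default of 0.5, and the fixed seed are hypothetical choices that the page does not specify.

```python
import random


def balanced_split(positives, negatives, train_fraction=0.5, seed=0):
    """Split into train and test sets with equal numbers of positives and negatives.

    The 50/50 train/test ratio is an assumption, not a value taken from the page.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))  # keep both classes the same size
    pos = rng.sample(list(positives), n)     # shuffled, truncated copies
    neg = rng.sample(list(negatives), n)
    cut = int(n * train_fraction)
    train = [(s, 1) for s in pos[:cut]] + [(s, 0) for s in neg[:cut]]
    test = [(s, 1) for s in pos[cut:]] + [(s, 0) for s in neg[cut:]]
    return train, test
```

Each returned item is a `(sequence, label)` pair, with label 1 for antifungal and 0 for non-antifungal.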

- Dipeptide:

The premise of this method is the hypothesis that a peptide's function corresponds to its sequence. We treated each pair of adjacent amino acids as the smallest functional unit, which we call a dipeptide.

A peptide that contains more potentially antifungal dipeptides is more likely to be an antifungal peptide, and vice versa. The 400 individual dipeptide propensities are obtained by statistical discrimination between the dipeptide compositions of the antifungal and non-antifungal peptides.

- Dipeptide Frequency & Score:

For each peptide, the frequency of each of the 400 dipeptide types is multiplied by the corresponding weight. The score of the peptide is the sum of these 400 products.
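As a minimal sketch of this scoring step, the code below counts overlapping dipeptides, converts the counts to frequencies, and sums frequency times weight. It assumes that "frequency" means the share of each dipeptide among all dipeptides in the peptide, and the names `dipeptide_frequencies` and `scm_score` are illustrative rather than taken from the team's software.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 types


def dipeptide_frequencies(sequence):
    """Share of each of the 400 dipeptides among all overlapping pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    valid = [p for p in pairs if p in counts]  # skip non-standard residues
    for p in valid:
        counts[p] += 1
    total = len(valid)
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}


def scm_score(sequence, weights):
    """Sum of frequency times weight over all 400 dipeptides."""
    freqs = dipeptide_frequencies(sequence)
    return sum(freqs[d] * weights[d] for d in DIPEPTIDES)
```

Here `weights` is a dictionary with one entry per dipeptide, for example the scoring card after optimization.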

- Weight:

The weights change in every round of the computing loop. The initial weight for each dipeptide is the ratio at which the dipeptide appears in the positive dataset minus the ratio at which it appears in the negative dataset. The other candidate weight sets for the IGA round are picked randomly.

$$ Dipeptide\quad Propensity\quad Score(ij) = P(ij) - N(ij) $$
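The initial propensity can be computed directly from the two training sets. The sketch below pools all overlapping dipeptides per set, which is one plausible reading of "ratio"; averaging per-peptide frequencies would be another. The function names are hypothetical, and no rescaling of the differences is applied.

```python
from collections import Counter


def dipeptide_ratios(sequences):
    """Pooled share of each dipeptide over all overlapping pairs in a set."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}


def initial_propensity(positives, negatives):
    """P(ij) - N(ij) for every dipeptide seen in either set."""
    p = dipeptide_ratios(positives)
    n = dipeptide_ratios(negatives)
    return {d: p.get(d, 0.0) - n.get(d, 0.0) for d in set(p) | set(n)}
```

Dipeptides that appear in neither set are simply absent from the result and can be treated as having a difference of zero.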

- Selection of Weight: Pick up Method:

We picked two weight sets among all candidates: the one with the highest fitness value or the one selected by the Roulette method.

$$ Fitness\quad Value = 0.9 AUC + 0.1R $$

R is the correlation coefficient (R-value) between the initial and the optimized propensity scores.

- AUC:

The area under the ROC curve (AUC), used as a way to evaluate the model. The closer the value is to 1, the more accurate the prediction model is.
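To make the fitness value concrete, here is a small pure-Python sketch that combines AUC and R as in the formula above. It assumes R is Pearson's correlation coefficient and computes AUC as the probability that a positive peptide outscores a negative one; both choices, and the function names, are assumptions for illustration.

```python
from math import sqrt


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fitness(pos_scores, neg_scores, initial_weights, optimized_weights):
    """Fitness Value = 0.9 * AUC + 0.1 * R."""
    return (0.9 * auc(pos_scores, neg_scores)
            + 0.1 * pearson_r(initial_weights, optimized_weights))
```

The R term rewards optimized scoring cards that stay close to the initial propensities, while the AUC term dominates the fitness.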

- Roulette:

A selection method that preserves randomness while making candidates with higher fitness more likely to be selected.
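A common way to implement this is fitness-proportionate (roulette wheel) selection, sketched below. It assumes non-negative fitness values with a positive total; the function name `roulette_select` is illustrative.

```python
import random


def roulette_select(candidates, fitnesses, rng=random):
    """Pick one candidate with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    running = 0.0
    for candidate, fit in zip(candidates, fitnesses):
        running += fit
        if pick < running:
            return candidate
    return candidates[-1]  # guard against floating-point round-off
```

A candidate with three times the fitness of another is chosen about three times as often, but weaker candidates still have a chance.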

- IGA (intelligent genetic algorithm):

Crossover Selection: A pair of parameters from the two weight sets is randomly chosen and exchanged.

Optimization (developed by Shinn-Ying Ho): A method for large-parameter optimization in which the selection function is designed to reduce the number of different parameter sets.

(For the algorithm in detail, please check out Peptide Prediction Model.)
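The exchange step described above can be illustrated with a plain swap crossover. This sketch only shows parameters being exchanged at randomly chosen positions; it does not reproduce the IGA's own selection design, which this page does not detail, and the name `swap_crossover` and the `n_swaps` parameter are hypothetical.

```python
import random


def swap_crossover(parent_a, parent_b, n_swaps=1, rng=random):
    """Return two children made by exchanging values at randomly chosen positions."""
    child_a, child_b = list(parent_a), list(parent_b)  # leave the parents untouched
    for i in rng.sample(range(len(child_a)), n_swaps):
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

In the setting of this page, each parent would be a list of 400 dipeptide weights.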

Antifungal Database

To bring existing antifungal data together at both high quantity and high quality, we aggregated the related online databases and organized them into a complete, useful antifungal database, the largest one online.

Content:

Connection of data: Hosts - Pathogens - Peptides
Cross-match: Drug repurposing by the integration of databases

After finishing our prediction system, our next step was to integrate antifungal databases. Several databases related to fungal infection exist on the internet, but they lack organization and completeness. Such scattered data makes it inconvenient to search for full information and leads to a narrow view of the available knowledge.

We therefore planned to aggregate and organize the related data from different websites and databases into a complete antifungal database, enabling drug repurposing through cross-reference.

1. Connection of data

For the problem we were addressing, fungal diseases in agriculture, three factors are relevant: hosts, pathogens, and antifungal peptides. The amounts of data we collected are:

(1) hosts - pathogens : 514 (Phytopath / PHIbase)

(2) pathogens - peptides : 1334 (cAMP / PhytAMP)

(3) pathogens - peptides : 57 (paper searching)

Through our processing, we updated almost 300 peptides and found almost 70 new antifungal peptides.

2. Cross-match

Once we had ordered and assembled the data, the number of relationships exceeded the amount in the original sources, because cross-referencing links records that were previously separate. We call this the cross-match of data.

Finally, we set up our Parabase website, which presents the antifungal prediction system and the relationships among the validated antifungal peptide data. Please see the final presentation in Demonstration.
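The cross-match can be pictured as a join on the shared pathogen column. The sketch below links hosts to peptides through their common pathogens; the function name and the record layout as `(host, pathogen)` and `(pathogen, peptide)` tuples are assumptions, not the schema of the actual database.

```python
from collections import defaultdict


def cross_match(host_pathogen, pathogen_peptide):
    """Link each host to the peptides reported against any of its pathogens."""
    peptides_by_pathogen = defaultdict(set)
    for pathogen, peptide in pathogen_peptide:
        peptides_by_pathogen[pathogen].add(peptide)
    host_peptide = defaultdict(set)
    for host, pathogen in host_pathogen:
        # hosts whose pathogens have no known peptide get an empty set
        host_peptide[host] |= peptides_by_pathogen.get(pathogen, set())
    return dict(host_peptide)
```

A peptide recorded against a pathogen in one source thereby becomes a candidate for every host that the pathogen infects according to another source, which is how the combined data can exceed the sum of its parts.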

Results

- You can click here to view the demonstration - Parabase Website.

Here we show the results of the peptide prediction.

For the antifungal database: the data amount we have collected

For the antifungal scoring system :

The ROC curve and the results of test data
Visualized antifungal scoring card
Discussion of the relationships of dipeptides and active sites

For the achievement: a summary of what we have contributed to society

1.Antifungal Database (related antifungal data)

(1)514 interactions between hosts and pathogens

(2)1334 experimentally validated antifungal peptides and their introductions

2.Antifungal Peptide Prediction System:

(1)The final ROC curve and the result of test datasets

Figure 1:
The test accuracy, the overall performance of classifying positive data as positive and negative data as negative, is 76%. The sensitivity, the performance of classifying positive data as positive, is 77%. The specificity, the performance of classifying negative data as negative, is 76%. The suitable threshold value is 354; a peptide that scores higher than this value is considered an antifungal peptide.
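The three measures in the caption follow directly from the scores and the threshold. A minimal sketch, with a hypothetical function name and made-up scores in the usage note rather than the team's test data:

```python
def classification_metrics(pos_scores, neg_scores, threshold):
    """Accuracy, sensitivity and specificity when score > threshold means antifungal."""
    tp = sum(s > threshold for s in pos_scores)   # positives classified as positive
    tn = sum(s <= threshold for s in neg_scores)  # negatives classified as negative
    sensitivity = tp / len(pos_scores)
    specificity = tn / len(neg_scores)
    accuracy = (tp + tn) / (len(pos_scores) + len(neg_scores))
    return accuracy, sensitivity, specificity
```

For example, `classification_metrics([400, 360, 300, 380], [200, 340, 360, 100], 354)` classifies three of four positives and three of four negatives correctly.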

(2)The score distribution between positive datasets and negative datasets

Figure 2:

(3)Final antifungal scoring card (dipeptide score)

$$ \sum_{i=1}^{400} x_{i}\cdot w_{i}=score $$

Figure 3:

3.Discussion

Figure 4: The bar graph above shows the single amino acid scores calculated from the dipeptide scores.
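The page does not state how a single amino acid score is derived from the dipeptide scores. One plausible approach, shown here purely as an assumption, is to average the scores of all dipeptides that contain the amino acid in either position.

```python
def amino_acid_scores(dipeptide_scores):
    """Average the scores of all dipeptides that contain each amino acid."""
    totals, counts = {}, {}
    for dipeptide, score in dipeptide_scores.items():
        for residue in set(dipeptide):  # count a homodipeptide such as "CC" once
            totals[residue] = totals.get(residue, 0.0) + score
            counts[residue] = counts.get(residue, 0) + 1
    return {r: totals[r] / counts[r] for r in totals}
```

With the full 400-entry scoring card as input, this returns 20 values that can be ranked or drawn as a bar graph.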

- Single Amino Acid Score Analysis:

According to the scores, the top three amino acids are cysteine (C), glycine (G), and lysine (K), and the five amino acids with the lowest scores are aspartic acid (D), glutamic acid (E), serine (S), threonine (T), and valine (V).

We interpret these results as follows:

Many antifungal peptides from plants and mammals are rich in cysteine, such as thionins and plant defensins. Likewise, many insect antifungal peptides are glycine-rich.

Of the five amino acids with the lowest scores (D, E, S, T, V), four are hydrophilic, while most of the hydrophilic amino acids have a higher score (average score: 362.73 > threshold: 350).

Additionally, among the five highest-scoring amino acids, cysteine contains a thiol group that can form disulfide bonds, and lysine (K) and arginine (R) readily form hydrogen bonds.

- 3D structure and active site:

To show the result of the scoring card, we visualized peptides by mapping the dipeptide scores onto the peptide's 3D structure. A region of the peptide is drawn redder where the dipeptide score is higher and bluer where it is lower.

By doing so, we can find the important regions of an antifungal peptide.
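One way such a colouring could be produced is sketched below. It assumes that each residue takes the average score of the dipeptides it belongs to and that colours follow a linear blue-to-red gradient; the page does not describe how the figures were actually drawn, so both functions are hypothetical.

```python
def residue_scores(sequence, weights):
    """Average the scores of the (up to two) dipeptides each residue belongs to."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues to form a dipeptide")
    pair_scores = [weights[sequence[i:i + 2]] for i in range(len(sequence) - 1)]
    scores = []
    for i in range(len(sequence)):
        neighbours = pair_scores[max(i - 1, 0):i + 1]
        scores.append(sum(neighbours) / len(neighbours))
    return scores


def score_to_rgb(score, low, high):
    """Linear blue-to-red gradient: low -> (0, 0, 255), high -> (255, 0, 0)."""
    t = 0.5 if high == low else (score - low) / (high - low)
    t = min(max(t, 0.0), 1.0)  # clamp scores outside the range
    return (round(255 * t), 0, round(255 * (1 - t)))
```

The resulting per-residue colours could then be passed to a molecular viewer to paint the 3D structure.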

We took Rs-AFP2 as an example. Rs-AFP2 is an antifungal peptide from the plant defensin family.

Figure 5:

The N-terminus of the peptide and the β2 sheet appear to be the reddest. For our SCM-based antifungal peptide prediction system, this indicates that these two regions are the most important in determining whether the full sequence is classified as an antifungal peptide.

Figure 6:

For comparison, the literature [1] reports that the active site is the β2−β3 loop, from Ala³¹ to Phe⁴⁹, and that some activity was also found in the N-terminal part of the protein.

Comparing the scoring card visualization with the reported active site, we find that the β3 sheet and the N-terminus are also highlighted in the scoring card picture.

In conclusion, SCM may be able to indicate antifungal active sites.

Achievement

We created a database that helps iGEMers who aim to solve agricultural problems caused by fungi, and whose framework could be extended to other diseases. Our database has a convenient search tool that quickly finds effective antifungal peptides by host species or fungal pathogen. It also enables users to find potential new antifungal peptides by applying the antifungal prediction system.

Reference

[1] W. M. M. Schaaper, Synthetic peptides derived from the β2−β3 loop of Raphanus sativus antifungal protein 2 that mimic the active site, http://onlinelibrary.wiley.com/doi/10.1034/j.1399-3011.2001.00842.x/full, 2001


@@ Line 481: / Line 481: @@
                      <p style="margin-top: 50px;">
-                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the prediction of our peptides, we integrated Scoring Card Method and modified to our antifungal peptide prediction system. The major advantage of the method is its simplicity, interpretability, and accuracy.
+                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the prediction of our peptides, we integrated Scoring Card Method and modified to our antifungal peptide prediction system. The major advantage of the method is its simplicity, interpretability, and acceptable accuracy.
                      </p>
                      <p>
-                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SCM, also called scoring card method, is a method originating from our instructor Shinn-Ying Ho. To measure the property of antifungal, we introduced SCM into our model to evaluate peptides’ antifungal functions with
+                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SCM, based on Support Vector Machine (SVM), is a method originating from our instructor Shinn-Ying Ho. To measure the property of anti-fungus, we introduced SCM into our model to evaluate peptides’ antifungal functions with
                          the perspective of biological information.
                      </p>
@@ Line 493: / Line 493: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We obtain our positive data from antifungal databases, such as cAMP, PhytAMP, and papers we found in PubMed. We collected our negative data from peptides that are not annotated to be antifungal in Uniprot.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We obtain our positive data from antifungal databases, such as cAMP, PhytAMP and papers we found in PubMed. We collected our negative data from peptides that are not annotated to be antifungal in Uniprot.
                          </p>
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We created the train dataset and test dataset by reducing the sequence identity of positive data and negative data
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We created the train dataset and test dataset by reducing the sequence identity of positive data and negative data and divide them into two portion that each dataset has equal amount of positive and negative data.
- to lower than 25% and divided them into two portions that each data set has an equal amount of positive and negative data.
                          </p>
@@ Line 504: / Line 503: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The premise of this method is to hypothesize the function of peptides corresponding to their sequences. We viewed two amino acids as a group to form the smallest functional unit, defined dipeptides. Each dipeptide might have different propensity to make a peptide to be antifungal.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The premise of this method is to hypothesize the function of peptides correspond to their sequences. We viewed two amino acids as a group to form the smallest functional unit, defined dipeptides.
                          </p>
@@ Line 510: / Line 509: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A peptide that has a higher ratio of potentially antifungal dipeptides will more likely to be an antifungal peptide and vice versa. The propensity of each dipeptide will be converted to a weight score (dipeptide propensity score).
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A peptide that has more potentially antifungal dipeptides will more likely to be an antifungal peptide, vise versa. The total 400 individual dipeptide propensities are obtained by statistical discrimination between dipeptide
+                            composition of the antifungal peptides and non-antifungal peptides.
                          </p>
@@ Line 516: / Line 516: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Each dipeptide frequency (400 types) of a  peptide refers to the ratio of the number of each dipeptide to the total dipeptides.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Each dipeptide frequency (400 types) of each peptide multiplies the weight to get a score.
                          </p>
@@ Line 522: / Line 522: @@
                          <p style="margin-top: 20px;">
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The score is obtained by summing each dipeptide frequency of the peptide multiplies their corresponding weight.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The score is obtained by summing each dipeptide frequency (400 types) of each peptide multiplies the weight to get a score.
                          </p>
@@ Line 528: / Line 528: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The value of weight floats every round in the computing loop. The initial weight value for each dipeptide is the ratio of the dipeptide appearing in the positive data in the train dataset minus the ratio appearing in the negative data in the train dataset.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The value of weight floats every round in the computing loop. The initial weight value for each dipeptide is the ratio of the dipeptide appearing in the positive datasets minus the ratio appearing in the negative datasets.
                              Others to be the candidates in the IGA round are picked randomly.
                          </p>
@@ Line 538: / Line 538: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We picked up two weights among all: the one that had the highest fitness value and the one selected by the Roulette method.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We picked up two weights among all: the one that had the highest fitness value or the one selected by the Roulette method
                          </p>
@@ Line 546: / Line 546: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<b>R</b> is the value of Pearson's correlation coefficient (R-value) between the initial and the optimized propensity scores.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<b>R</b> is the value of cor relation coefficient (R-value) between the initial and the optimized propensity scores.
                          </p>
@@ Line 552: / Line 552: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The Area Under ROC curves which are viewed as a way to evaluate the model built. The closer to 1 of the value is, the higher accuracy of the prediction model has.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The Area Under ROC curves which is viewed as a way to evaluate the model built. The closer to 1 of the value is, the higher accuracy of the prediction model has.
                          </p>
@@ Line 607: / Line 607: @@
                      <p style="margin-top: 50px;">
-                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After we finished our prediction system, the next would be the integration of antifungal databases. There're several databases related to fungal infection on the internet yet lack of arrangement and integrity. The disorder of data would lead to the inconvenience for searching full information and end up to have the narrow- sighted absorbance of knowledge.
+                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After we finished our prediction system, the next would be the integration of antifungal databases. There're several databases related to fungal infection in the internet yet lack of arrangement and integrity. The disorder
+                        of data would lead to the inconvenience for searching full information and end up to have the narrow- sighted absorbance of knowledge.
                      </p>
@@ Line 628: / Line 629: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2) pathogens - peptides : 1277 (cAMP / PhytAMP)
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2) pathogens - peptides : 1334 (cAMP / PhytAMP)
                          </p>
@@ Line 636: / Line 637: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;By our processing, we have updated about 200 peptides and found more new antifungal peptides.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;By our processing, we have updated almost 300 peptides and found almost 70 new antifungal peptides.
                          </p>
@@ Line 671: / Line 672: @@
                      <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the antifungal database: the data amount we have collected </p>
                      <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the antifungal scoring system : </p>
-                     <ol style="margin: 0 10vw; font-size: 1.3em;">
+                     <div class="sublist">
-                        <li>The ROC curve and the results of test data</li>
+                        <ol style="font-size: 1.3em">
-                        <li>Visualized antifungal scoring card</li>
+                            <li>The ROC curve and the results of test data</li>
-                        <li>Discussion of the relationships of dipeptides and active sites</li>
+                            <li>Visualized antifungal scoring card</li>
-                    </ol>
+                            <li>Discussion of the relationships of dipeptides and active sites</li>
+                        </ol>
+                    </div>
                      <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the achievement: the conclusion of what we’ve dedicated to humans</p>
@@ Line 699: / Line 702: @@
                      <p>
-                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1)The final ROC curve and the result of test dataset
+                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1)The final ROC curve and the result of test datasets
                      </p>
                      <img src="https://static.igem.org/mediawiki/2017/d/da/Ptp_result_photo1.png" width="60%" style="display: block; margin: auto;">
-                     <h4>Figure 1: the AUC curve and the result of our model<br> The test accuracy, the overall performance of classifying positive data as positive and negative data as negative, is 76%. The sensitivity, the performance of classifying positive data as positive, is 77%. The specitivity,
+                     <h4>Figure 1:<br> The test accuracy, the overall performance of classifying positive data as positive and negative data as negative, is 76%. The sensitivity, the performance of classifying positive data as positive, is 77%. The specitivity,
-                     the performance of classifying negative data as negative, is 76%. The suitable threshold value is about 353, peptides score higher than this value is considered as antifungal peptide.</h4>
+                     the performance of classifying negative data as negative, is 76%. The suitable threshold value is 354, peptides score higher than this value is considered as antifungal peptide.</h4>
@@ Line 713: / Line 716: @@
                      <img src="https://static.igem.org/mediawiki/2017/b/b6/Ptp_result_photo2.png" width="60%" style="display: block; margin: auto;">
-                     <h4 style="margin-top: -20px;">Figure 2: The score distribution</h4>
+                     <h4 style="margin-top: -20px;">Figure 2: </h4>
                      <p>
@@ Line 722: / Line 725: @@
                      <img src="https://static.igem.org/mediawiki/2017/f/fb/Ptp_result_photo3.jpeg" width="60%" style="display: block; margin: auto;">
-                     <h4>Figure 3: The heat map of the scoring card</h4>
+                     <h4>Figure 3: </h4>
                      <h2>
@@ Line 748: / Line 751: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;There are many antifungal peptides for plants and mammals that contain lots of Cysteine, such as Thionins, plant defensins, and more. For Glycine, there are also many Glycine-rich peptides from Insect's antifungal peptides.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;There are many antifungal peptides for plants and mammals that contain lots of Cysteine , such as Thionins, plant defensins, and more. For Glycine, there are also many Glycine-rich peptides from Insect's antifungal peptides.
                          </p>
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the 5 peptides(D, E, S, T, V ) of the lowest scores, four of them are hydrophilic, while most of the hydrophilic amino acids have a higher score (average score : 362.73 > threshold : 353).
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the 5 peptides(D, E, S, T, V. ) of the lowest scores, four of them are hydrophilic, while most of the hydrophilic amino acids have a higher score (average score : 362.73 > threshold : 350).
                          </p>
@@ Line 764: / Line 767: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To show the result of the scoring card, we visualized the peptides by drawing the dipeptide score on the peptide 3D structure. The region of a peptide become redder when the dipeptide score there is higher. Otherwise, the region becomes bluer when the dipeptide score there is lower.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To show the result of the scoring card, we visualized the peptides by drawing the dipeptide score on the peptide 3D structure. The region of a peptide become redder when the dipeptide score there is higher. Otherwise, the
+                            region become bluer when the dipeptide score there is lower.
                          </p>
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;By doing so, we can find the important region of an antifungal peptide.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;By doing so , we can find the important region of an antifungal peptide.
                          </p>
@@ Line 776: / Line 780: @@
                          <img src="https://static.igem.org/mediawiki/2017/1/1c/Design_photo1.gif" width="60%" style="display: block; margin: auto;">
-                         <h4>Figure 5: The Rs-AFP2 with scoring card visualized the higher score is red</h4>
+                         <h4>Figure 5: </h4>
                          <p style="margin-top: 50px; margin-bottom: 50px">
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;It seemed that the N terminal of the peptide and the β2 sheet were the reddest. To our antifungal peptide prediction system based on the SCM, it indicated that these two regions were important regions that determined the full peptide sequence as an antifungal peptide or not.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;It seemed that the N term of the peptide and the 2sheet were the reddest. To our antifungal peptide prediction system based on the SCM, it indicated that these two regions were important regions that determined the full peptide
+                            sequence as an antifungal peptide or not.
                          </p>
                          <img src="https://static.igem.org/mediawiki/2017/f/f0/Design_photo2.gif" width="60%" style="display: block; margin: auto;">
-                         <h4>Figure 6: The Rs-AFP2 with labeled red color on the active site</h4>
+                         <h4>Figure 6: </h4>
                          <p style="margin-top: 50px;">
@@ Line 791: / Line 796: @@
                          <p>
-                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Comparing with the scoring card visualized picture and the real active site, we can find in the picture of score card the β3 sheet and the N-terminal were also labeled.
+                             &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Comparing with the scoring card visualized picture and the real active site, we can find in the picture of score card the 3sheet and the N-termina were also labeled.
                          </p>
@@ Line 803: / Line 808: @@
                      <p>
-                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We created a powerful database that helps iGEMers who aims to solve agricultural problems caused by fungus or even other disease cases by the framework. Our database has a convenient searching tool that can quickly find out effective antifungal peptides by searching host species or fungal pathogens. Our database also enables users to find out potential new antifungal peptides by applying the antifungal prediction system.
+                         &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We created a powerful database that helps iGEMers who aims to solve agricultural problems caused by fungus or even other disease cases by the framework. Our database has a convenient searching tool that can quickly find out
+                        effective antifungal peptides by searching host species or fungal pathogens. Our database also enables users to find out potential new antifungal peptides by applying the antifungal prediction system.
                      </p>