Team:NCTU Formosa/Peptide Prediction


NCTU_Formosa: Peptide Prediction

     Our database, a new genesis for Artificial Intelligence, strengthens the power of large datasets by the antifungal peptide prediction system on the basis of SCM with other optimization. The antifungal characteristic can be evaluated and interpreted only by sequence analysis.

     Furthermore, we integrated all the relative data to form a complete antifungal database to achieve the query function of hosts, pathogens, and corresponding peptides. Combining two together, a novel Parabase database achieving both new drug discovery and old drug repurposing for antifungal peptides is born.

Antifungal Prediction System

     In order to evaluate peptide functions in a quicker and smarter way, we introduced SCM into making our antifungal peptide prediction system. With this applicable and interpretable tool, we are able to find potential target peptides in a large number of unknown peptides, making the best use of vast data.


  1. Datasets
  2. The concept of the dipeptide and the weight
  3. Intelligent Genetic Algorithm

     For the prediction of our peptides, we integrated Scoring Card Method and modified to our antifungal peptide prediction system. The major advantage of the method is its simplicity, interpretability, and acceptable accuracy.

     SCM, based on Support Vector Machine (SVM), is a method originating from our instructor Shinn-Ying Ho. To measure the property of anti-fungus, we introduced SCM into our model to evaluate peptides’ antifungal functions with the perspective of biological information.

- Datasets:

     We obtain our positive data from antifungal databases, such as cAMP, PhytAMP and papers we found in PubMed. We collected our negative data from peptides that are not annotated to be antifungal in UniProt.

     We created the train dataset and test dataset by reducing the sequence identity of positive data and negative data and divide them into two portions that each dataset has an equal amount of positive and negative data.

- Dipeptide:

     The premise of this method is to hypothesize the function of peptides correspond to their sequences. We viewed two amino acids as a group to form the smallest functional unit, defined dipeptides.

     A peptide that has more potentially antifungal dipeptides will more likely to be an antifungal peptide, vise versa. The total 400 individual dipeptide propensities are obtained by statistical discrimination between dipeptide composition of the antifungal peptides and non-antifungal peptides.

- Dipeptide Frequency & Score:

     Each dipeptide frequency (400 types) of each peptide multiplies the weight to get a score.

     The score is obtained by summing each dipeptide frequency of each peptide multiplies the weight to get a score.

- Weight:

     The initial weight value for each dipeptide is the ratio of the dipeptide appearing in the positive datasets minus the ratio appearing in the negative datasets. The weight value is then further optimized by IGA.

$$ Dipeptid\quad Propensity\quad Scores: $$
$$ P(ij) - N(ij) $$

- Selection of Weight: The Select Method:

     We picked up two weights among all: the one that had the highest fitness value or the one selected by the Roulette method. These two scoring cards were used for crossover selection.

$$ Fitness\quad Value = 0.9 AUC + 0.1R $$

     R is the value of cor relation coefficient (R-value) between the initial and the optimized propensity scores.

- AUC:

     The Area Under ROC curves which are viewed as a way to evaluate the model built. The closer to 1 of the value is, the higher accuracy of the prediction model has.

- Roulette:

     A choosing method to ensure the randomness even the higher fitness probably will be selected.

- IGA (intelligent genetic algorithm):

     Cross Over Selection: A pair of parameters of the two weights are radomly choosed to exchange.

     Optimization (developed by Shinn-Ying Ho): A creative method for large parameters optimization which the selection function has been designed to simplify the numbers of different parameter sets.

(For the algorithm in detail, please check out Peptide Prediction Model.)

Antifungal Database

     In order to organize present antifungal data to a level of both high quantity and quality, we aggregated relative databases online and organized them to become a complete, useful and the largest antifungal database online.


  1. Connection of data: Hosts - Pathogens - Peptides
  2. Cross-match: Drug repurposing by the integration of databases

     After we finished our prediction system, the next would be the integration of antifungal databases. There're several databases related to fungal infection on the internet yet lack of arrangement and integrity. The disorder of data would lead to the inconvenience for searching full information and end up to have the narrow- sighted absorbance of knowledge.

     As a result, we planned to aggregate and organize all the relative data in different websites or databases to set up a complete antifungal database, reaching drug repurposing by cross-reference.

1. Connection of data

     To focus on the problem we were dealing with, the fungal diseases in agriculture, there’re some factors related to the issue: hosts, pathogens, and antifungal peptides. Here's the data quantity we collected:

     (1) hosts - pathogens : 514 (Phytopath / PHIbase)

     (2) pathogens - peptides : 1334 (cAMP / PhytAMP)

     (3) pathogens - peptides : 57 (paper searching)

     By our processing, we have updated almost 300 peptides and found almost 70 new antifungal peptides.

2. Cross-match

     After the data has been ordered and assembled by us, the quantity of data is even bigger than the original amounts of data before they gathered because of cross-reference. We call it the cross-match of data.

     In the end, we set up our Parabase website, presenting the antifungal prediction system and validated antifungal peptides relative data relationships. Please check out the final presentation in Demonstration.

- You can click here to view the demonstration - Parabase Website.

     Here show the results of the peptide prediction.

     For the antifungal database: the data amount we have collected

     For the antifungal scoring system :

  1. The ROC curve and the results of test data
  2. Visualized antifungal scoring card
  3. Discussion of the relationships of dipeptides and active sites

     For the achievement: the conclusion of what we’ve dedicated to humans

1.Antifungal Database (relative antifungal data)

     (1)514 interactions between hosts and pathogens

     (2)1334 experimentally validated antifungal peptides and their introductions

2.Antifungal Peptide Prediction System:

     (1)The final ROC curve and the result of test datasets

Figure 1:
The test accuracy, the overall performance of classifying positive data as positive and negative data as negative, is 76%. The sensitivity, the performance of classifying positive data as positive, is 77%. The specitivity, the performance of classifying negative data as negative, is 76%. The suitable threshold value is 354, peptides score higher than this value is considered as antifungal peptide.

     (2)The score distribution between positive datasets and negative datasets

Figure 2: The score distribution between positive datasets and negative datasets

     (3)Final antifungal scoring card (dipeptide score)

$$ \sum_{i=0}^{400} x_{i}\cdot w_{i}=score $$

Figure 3: Final antifungal scoring card


Figure 4: The bar graph above showed the single amino acid score calculated from each dipeptide score.

- Single Peptide Score Analysis:

     By the score results, the top three amino acids are Cysteine(C), Glycine(G), and Lysine(K), and the five amino acids to have lowest scores are Aspartic acid(D), Glutamic acid(E), Serine(S), Threonine(T), Valine(V).

     We interpreted the results as the following reasons:

     There are many antifungal peptides for plants and mammals that contain lots of Cysteine, such as Thionins, plant defensins, and more. For Glycine, there are also many Glycine-rich peptides from Insect's antifungal peptides.

     For the 5 peptides(D, E, S, T, V) of the lowest scores, four of them are hydrophilic, while most of the hydrophilic amino acids have a higher score (average score : 362.73 > threshold : 350).

     Additionally, for the top 5 highest amino acids,Cysteine contains a sulfide functional group that can form disulfide bond, and Lysine(K) and Arginine(R) are easy to form hydrogen bond.

- 3D structure and active site:

     To show the result of the scoring card, we visualized the peptides by drawing the dipeptide score on the peptide 3D structure. The region of a peptide become redder when the dipeptide score there is higher. Otherwise, the region becomes bluer when the dipeptide score there is lower.

     By doing so , we can find the important region of an antifungal peptide.

     We took Rs-AFP2 as an example. Rs-AFP2 was an antifungal peptide from the plant defensin family .

Figure 5:
This picture is the 3D rotating gif of the Rs-AFP2 with scoring card visualize score on the peptide.As you can see, the N terminal of the peptide(on the top) and the 3sheet are the reddest part of the peptide. To our scoring system based on the SCM, it indicated that these two regions are important regions that determined the whole peptide sequence as an antifungal peptide or not.

     It seemed that the N term of the peptide and the 2sheet were the reddest. To our antifungal peptide prediction system based on the SCM, it indicated that these two regions were important regions that determined the full peptide sequence as an antifungal peptide or not.

Figure 6:
This is a 3D rotating gif picture of Rs-AFP2 peptide with red color labeled on it’s active region which found in the paper[1].According to this paper, showing that the major active site are between the β2 and β3 loop, from Ala32 and Phe49 and some activity was found in the N-terminal part of the protein.

     To compare with papers, the paper showed that the active site are the β2−β3 loop, from Ala31 to Phe49, and some activities were found in the N-terminal part of the protein.

     Comparing with the scoring card visualized picture and the real active site, we can find in the picture of score card the 3sheet and the N-termina were also labeled.

     In conclusion, we can say that SCM might possess the ability to show antifungal active sites.


     We created a powerful database that helps iGEMers who aims to solve agricultural problems caused by fungus or even other disease cases by the framework. Our database has a convenient searching tool that can quickly find out effective antifungal peptides by searching host species or fungal pathogens. Our database also enables users to find out potential new antifungal peptides by applying the antifungal prediction system.


[1]W. M. M. Schaaper,Synthetic peptides derived from the β2−β3 loop of Raphanus sativus antifungal protein 2 that mimic the active site,, 2001

Untitled Document