3. To establish the largest antifungal database to date, containing data on hosts, pathogens, and antifungal peptides, together with a scoring system for finding potential antifungal peptides.

Datasets

The performance of machine learning depends on the quality and quantity of the training datasets.

Collection: Positive (antifungal) data were collected from open online databases such as CAMP, APD, and PhytAMP, together with new peptides that we recently collected in our own database. Negative data were taken from UniProt entries that did not carry antifungal or antimicrobial keywords.
Pre-processing: First, we deleted peptides that contained non-standard amino acids. Next, we limited the peptide length to between 10 and 100 amino acids, because antifungal peptides are typically 10-100 amino acids long. We then filtered the peptides so that the remaining sequences share at most 25% sequence identity. Finally, we chose as many negative peptides as there were positive peptides.

Finally, we combined our positive and negative datasets and randomly assigned ⅓ of the data to an independent testing set.
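
The pre-processing and splitting steps above can be sketched in Python as follows. This is a minimal illustration, not our actual pipeline: the function names, the fixed random seed, and the input lists are assumptions, and the 25% identity filter is left out because it requires a separate sequence-clustering step.

```python
import random

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def is_valid_peptide(seq, min_len=10, max_len=100):
    """True if seq is 10-100 residues long and uses only standard amino acids."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


def build_datasets(positives, negatives, test_fraction=1 / 3, seed=0):
    """Filter, balance, and split peptides into (training, testing) lists.

    Each returned item is a (sequence, label) pair, label 1 = antifungal.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility
    pos = [s for s in positives if is_valid_peptide(s)]
    neg = [s for s in negatives if is_valid_peptide(s)]
    # Balance the classes: keep as many negatives as there are positives.
    neg = rng.sample(neg, min(len(pos), len(neg)))
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)
    n_test = round(len(data) * test_fraction)
    return data[n_test:], data[:n_test]
```

Calling build_datasets(positive_list, negative_list) returns the training set first and the independent testing set second.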

[Table]

Algorithm - SCM

We used the scoring card method (SCM) for machine learning. SCM was developed by Shinn-Ying Ho at NCTU. It is a simple, accurate, and interpretable machine learning method. It can predict not only the function of a peptide but also the important domains of the peptide.

It consists of two important parts: the dipeptide score and the intelligent genetic algorithm (IGA), which is based on genetic algorithms.

For the first part, the dipeptide score is a simple and effective way to predict a peptide's function by scoring it. For every peptide, we calculate its dipeptide frequencies. Then we give each specific dipeptide an initial weight obtained through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix gives the peptide score. The higher the score of a peptide, the more likely it is to be antifungal.

1. Dipeptide

Each peptide is represented as a 400*1 matrix of dipeptide frequencies: because there are 20 types of amino acids, there are 20 * 20 = 400 possible dipeptides.
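
As a rough sketch of this step, the 400 dipeptide frequencies of one peptide could be computed as below. We assume here that overlapping dipeptides are counted and that each count is divided by the number of dipeptides in the peptide (length minus 1); the names AMINO_ACIDS, DIPEPTIDES, INDEX, and dipeptide_frequencies are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 * 20 = 400 ordered pairs: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_frequencies(seq):
    """Return a length-400 list of dipeptide frequencies for one peptide.

    Assumes seq contains only the 20 standard amino acids
    (a non-standard residue raises KeyError).
    """
    freqs = [0.0] * len(DIPEPTIDES)
    total = len(seq) - 1  # number of overlapping dipeptides
    if total <= 0:
        return freqs
    for i in range(total):
        freqs[INDEX[seq[i:i + 2]]] += 1.0
    return [count / total for count in freqs]
```

For example, the peptide "AAC" contains the dipeptides "AA" and "AC" once each, so both get a frequency of 0.5 and every other entry is 0.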

[Equation]

Each peptide then receives a score, obtained by multiplying its dipeptide frequencies by the corresponding weights.

[Figure]

If the score of a peptide is higher than the calculated threshold, it is predicted to be an antifungal peptide; otherwise it is not. The higher the score, the higher the probability that the peptide has antifungal function.
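
The scoring and thresholding rule can be illustrated with a short sketch. It assumes the scoring card is stored as a dictionary from dipeptide to weight and that frequencies are counts divided by (length minus 1); averaging the weights of the dipeptides found in the peptide is then the same as multiplying the frequency vector by the weight vector. The function names and the threshold values are hypothetical.

```python
def peptide_score(seq, weights):
    """Score a peptide as the frequency-weighted sum of dipeptide weights.

    weights: dict mapping a dipeptide such as "AK" to its weight.
    Dipeptides missing from the card are treated as weight 0 (an assumption).
    """
    n = len(seq) - 1  # number of overlapping dipeptides
    if n <= 0:
        return 0.0
    return sum(weights.get(seq[i:i + 2], 0.0) for i in range(n)) / n


def predict_antifungal(seq, weights, threshold):
    """Predict antifungal (True) when the score is above the threshold."""
    return peptide_score(seq, weights) > threshold
```

With weights {"AA": 1000, "AC": 0}, the peptide "AAC" scores 500, so it is predicted as antifungal for a threshold of 400 but not for a threshold of 500.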

[Equation]

2. The initial weight

P(ij) is the dipeptide frequency in the positive dataset.

N(ij) is the dipeptide frequency in the negative dataset.

[System of equations]

[Equation]

Then each weight is calculated as the frequency in the positive data (P(ij)) minus the frequency in the negative data (N(ij)); the differences are normalized to [0,1] and then multiplied by 1000.

[Equation]

After doing so, we have the initial scoring card (a set of dipeptide weights).
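
A minimal sketch of the initial scoring card is given below. It assumes that "normalized to [0,1]" means min-max normalization over all dipeptides, and that the positive and negative frequencies are supplied as dictionaries; the function name initial_scoring_card is illustrative.

```python
def initial_scoring_card(pos_freq, neg_freq):
    """Build initial weights from dataset-level dipeptide frequencies.

    pos_freq, neg_freq: dicts mapping dipeptide -> frequency in the
    positive / negative dataset (P(ij) and N(ij) in the text).
    Returns a dict mapping dipeptide -> weight in [0, 1000].
    """
    keys = sorted(set(pos_freq) | set(neg_freq))
    if not keys:
        return {}
    # Difference between positive and negative frequency for each dipeptide.
    diff = {k: pos_freq.get(k, 0.0) - neg_freq.get(k, 0.0) for k in keys}
    lo, hi = min(diff.values()), max(diff.values())
    span = hi - lo
    if span == 0:
        return {k: 0.0 for k in keys}  # all differences equal
    # Min-max normalize to [0, 1] (assumed), then multiply by 1000.
    return {k: 1000.0 * (d - lo) / span for k, d in diff.items()}
```

Under this assumption, the dipeptide most enriched in the positive dataset receives a weight of 1000 and the one most enriched in the negative dataset receives 0.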

For the second part, we used IGA to optimize our initial scoring card. IGA is inspired by natural evolution.

3. IGA (intelligent genetic algorithm)

Below is the flow chart of IGA.

[Flow chart]

We initialized the population with our initial score card and added further sets of weights filled with random numbers.

To optimize the initial card, we apply an advanced crossover to introduce variation, which is how the model learns. In every round we select two sets of weights from the population using the pick-up method. After the advanced crossover (an optimized version of the normal crossover), mutation is applied and the new weights are put into the population.

Confusion Matrix

To evaluate a set of weights, we first calculated the confusion matrix. The confusion matrix divides the samples by predicted class and true label into four classes: TP, FP, FN, and TN.

[Table]

Then we calculated TPR and FPR.

[Equation]

We plotted TPR on the y-axis and FPR on the x-axis to draw the ROC curve.

AUC of ROC curves

[Figure]

With different thresholds for separating positives from negatives, TP, FP, FN, and TN will differ. As a result, every threshold gives a different TPR and FPR, and together these points trace the ROC curve. To evaluate the fitness of a set of weights, we calculated the area under the curve (AUC), which is a common way to evaluate models and predictions. An advantage of the AUC of the ROC curve is that it is robust to unbalanced datasets, which matters because in reality non-antifungal peptides far outnumber antifungal peptides.
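
The fitness evaluation described above can be sketched as follows, using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN) and the trapezoidal rule for the area. The function names are illustrative, and the sketch assumes that a peptide is predicted positive when its score is strictly above the threshold.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when 'score > threshold' means predicted positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_auc(scores, labels):
    """Area under the ROC curve, computed with the trapezoidal rule.

    Assumes labels contains at least one 1 and at least one 0.
    """
    # Sweep the threshold from "nothing predicted positive" to "everything".
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp, fp, fn, tn = confusion_counts(scores, labels, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A set of weights that ranks every antifungal peptide above every non-antifungal peptide gives an AUC of 1.0, while a random ranking gives an AUC of about 0.5.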

The pick-up method

We picked two sets of weights from the population. One was the set with the highest fitness (the highest AUC), which is probably the best set of weights. The other parent was selected using the roulette method.

We assigned each score card an area proportional to its fitness: the higher the fitness, the larger the area.

Then we drew a random number and took the score card whose area contained that number.

[Figure]

The roulette method keeps the selection random: a score card with higher fitness is more likely to be chosen, but it is not guaranteed to be chosen.
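
The pick-up step can be sketched as below. This is an illustration under the assumption that all fitness values are non-negative (as AUC values are) and that at least one is positive; the function name pick_parents and the rng argument are hypothetical.

```python
import random


def pick_parents(population, fitness, rng=random):
    """Return (best score card, roulette-selected score card).

    population: list of score cards; fitness: list of matching fitness values.
    rng: the random module or a random.Random instance.
    """
    # First parent: the card with the highest fitness.
    best = max(range(len(population)), key=lambda i: fitness[i])
    # Second parent: roulette wheel, each card owns a slice of [0, total)
    # whose width is proportional to its fitness.
    r = rng.random() * sum(fitness)
    cumulative = 0.0
    chosen = len(population) - 1  # fallback guards against rounding at the edge
    for i, f in enumerate(fitness):
        cumulative += f
        if r < cumulative:
            chosen = i
            break
    return population[best], population[chosen]
```

Because the second parent is drawn at random, the same card may be returned for both parents; how that case is handled is left open in this sketch.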

Crossover of IGA

After choosing the parents, we used IGA to optimize the crossover. IGA is based on the normal genetic algorithm (GA).

In a normal GA, the crossover is the most important operation. After two parents are chosen, a pair of parameters is randomly chosen and exchanged between them. The exchanged score cards are then returned to the new population.

[Figure]

After that, the score cards with lower fitness are deleted to keep the population size within a fixed range.
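
For comparison with the IGA crossover, the normal GA steps above can be sketched like this. The sketch assumes that a score card is a list of weights and that one randomly chosen position is swapped between the two parents; the function names and the max_size parameter are illustrative.

```python
import random


def simple_crossover(parent_a, parent_b, rng=random):
    """Swap the weights at one random position between copies of two parents."""
    child_a, child_b = list(parent_a), list(parent_b)
    i = rng.randrange(len(child_a))
    child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def truncate_population(population, fitness_fn, max_size):
    """Keep only the max_size score cards with the highest fitness."""
    return sorted(population, key=fitness_fn, reverse=True)[:max_size]
```

Here fitness_fn would be a function that returns the AUC of a score card on the training data.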

However, how do we choose the best set of parameters? The IGA crossover is a method for problems with a large number of parameters, developed by S.-Y. Ho (ref: http://ieeexplore.ieee.org/document/1369245/).

If there is a target function,

[Equation]

and for each of x1, x2, x3 we have 2 candidates to choose from, just like the two parents in the crossover step.

[Set of equations]

(1) To maximize the function, IGA first creates an orthogonal array (OA), like the array shown above.

(2) Take x1 as an example.

To evaluate x1, the key is to eliminate the effects of x2 and x3.

From the table below, we can see that the evaluation is obtained by pairing columns 1 and 2 together, and columns 3 and 4 together.

So we have:

[System of equations]

Because the value of SJ2 is larger than that of SJ1, the better parameter for x1 is 2 rather than 1.

(3) The other parameters are chosen by the same method.

The main idea of this method comes from statistics. If the number of parameters is large enough, the effect of the other parameters will be limited.
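
The three-parameter example can be sketched with a standard L4 orthogonal array (4 experiments, 3 factors, 2 levels). This is a toy illustration of the idea, not the full IGA: the array layout, the function name oa_crossover, and the tie-breaking rule are assumptions.

```python
# L4 orthogonal array: each row is one experiment, each column one factor.
# Level 0 takes the value from parent A, level 1 from parent B.
# In every column, each level appears twice, paired evenly with the
# levels of the other columns, so their effects cancel out.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def oa_crossover(parent_a, parent_b, fitness_fn):
    """Pick, for each of 3 parameters, the parent value with the better main effect."""
    parents = (parent_a, parent_b)
    # Evaluate the 4 combinations prescribed by the orthogonal array.
    results = []
    for row in L4:
        candidate = [parents[level][j] for j, level in enumerate(row)]
        results.append(fitness_fn(candidate))
    child = []
    for j in range(3):
        # Sum the results of the experiments that used each level of factor j.
        level_sum = [0.0, 0.0]
        for row, y in zip(L4, results):
            level_sum[row[j]] += y
        # Keep the level with the larger sum (ties go to parent A).
        child.append(parents[1 if level_sum[1] > level_sum[0] else 0][j])
    return child
```

For the first factor, experiments 1 and 2 use the value from parent A and experiments 3 and 4 use the value from parent B, which matches the pairing described above. Only 4 evaluations are needed instead of the 8 possible combinations.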

Termination

To prevent the model from over-training, the program terminates after 30 generations. When it reaches this end condition, it returns the final score card with the best fitness on the training data.

Results

However, the training data cannot always reflect the real situation, so the model is evaluated on the independent testing data (sequence identity = 25%). Here is the ROC graph of the test data for each dataset.

(AFP25: antifungal peptides with sequence identity = 25%)

[Figure]

System Operation:

When an unknown peptide is input into the scoring system, we first convert it into dipeptide frequencies, then multiply them by the dipeptide score card to obtain the final score.

[Two equations]

Moreover, the scoring card prediction system is now available at (http://web.it.nctu.edu.tw/~nctu_formosa/Parabase/tool.html).

[1] Vasylenko T, Liou TF, Chiou PC, Chu HW, Lai YS, Chuo YL, Shinn-Ying Ho, et al. SCMBYK: prediction and characterization of bacterial tyrosine-kinases based on propensity scores of dipeptides. BMC Bioinformatics. 2016. doi: 10.1186/s12859-016-1371-4

[2] Shinn-Ying Ho, Li-Sun Shu and Jian-Hung Chen, "Intelligent evolutionary algorithms for large parameter optimization problems," in IEEE Transactions on Evolutionary Computation, vol. 8, no. 6, pp. 522-541, Dec. 2004. doi: 10.1109/TEVC.2004.835176


Team:NCTU Formosa/Model
