By modifying the Translationally Controlled Tumor Protein (TCTP) homologue in P. falciparum, we created two novel synthetic proteins. In the first, the areas of interest, including the two binding sites, are removed and replaced with partly randomized residues. In the second, those areas are replaced with a single, self-contained binding site. These will serve as the basis for our negative control and our artemisinin binding device, respectively. For iGEM 2018, the Owlgems will test the new synthetic protein proposed by the machine learning model in the wet lab and develop the colorimetric assay for detecting counterfeit artemisinin drugs.
The most notable results from our machine learning model are shown and briefly described below.
The model was trained to determine whether a novel sequence would or would not bind the target drug, artemisinin. The model predicts binding with decent accuracy (Figure A). We showed the model three novel amino acid sequences it had never seen before, drawn from the literature and from our proposed protein for the genetic engineering component (Figure B). The results were as predicted (Figure C).
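Before a sequence model can score a novel protein, the amino acid string has to be turned into numbers. The sketch below shows one common way to do this, one-hot encoding over the 20 standard amino acids. It is an assumption for illustration: the encoding our model actually uses is not described here, and the peptide `MKVLA` is made up.

```python
# Minimal sketch: one-hot encode an amino acid sequence.
# ASSUMPTION: one-hot encoding is used for illustration only; the
# preprocessing of the actual model is not specified in this text.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one row of 20 zeros/ones per residue in the sequence."""
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        # Non-standard residues (e.g. X) are left as an all-zero row.
        if residue in AA_INDEX:
            row[AA_INDEX[residue]] = 1
        encoded.append(row)
    return encoded


example = "MKVLA"  # hypothetical peptide, not one of our test sequences
matrix = one_hot_encode(example)
print(len(matrix), "residues x", len(matrix[0]), "channels")
```

Each row of the resulting matrix would correspond to one time step of a recurrent model's input.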
A sequence length of 150, among others, performed rather well, with a validation accuracy of around 90 percent (Figure D). The loss/validation graph (Figure E) shows the value of the error function on the validation set (a lower number means a more accurate prediction). The model was also able to determine the sequence composition most likely to be associated with binding (Figure F), with the lowest sequence length parameter being only 40 amino acids. Because binding to artemisinin is controlled by multiple unknown mechanisms, we introduced a second dataset as a known control experiment (the homeobox consensus sequence) to show that the model can recover a consensus sequence when one is present.
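To make the relationship between validation accuracy and validation loss concrete, here is a minimal sketch that computes both for a bind / not-bind classifier. It assumes the error function is binary cross-entropy and that a probability of 0.5 or higher counts as "binds"; neither is confirmed by the text above, and the labels and probabilities are made-up numbers, not our results.

```python
import math

# Minimal sketch of accuracy and loss for a binary classifier.
# ASSUMPTIONS: binary cross-entropy as the error function and a 0.5
# decision threshold; the numbers below are illustrative only.


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; lower means better predictions."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of sequences whose thresholded prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == bool(y))
    return correct / len(labels)


labels = [1, 0, 1, 0]                # 1 = binds, 0 = does not bind
confident = [0.9, 0.1, 0.8, 0.2]     # hypothetical confident predictions
unsure = [0.6, 0.4, 0.55, 0.45]      # hypothetical hesitant predictions

for name, probs in (("confident", confident), ("unsure", unsure)):
    print(name, "accuracy:", accuracy(labels, probs),
          "loss:", round(binary_cross_entropy(labels, probs), 3))
```

Both sets of predictions classify every example correctly, but the confident set has the lower loss, which is why the two graphs are read together.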
A sequence length of 100 showed the best and most consistent results on our model. This result was expected because the homeo-domain is around 60 amino acids in length. A sequence length of 100 lets the machine cut sub-sequences (the parts of the sequence it views) that more often contain the entire 60 amino acid sequence, instead of only its first or latter half, as can happen when the viewing length is itself only 60 amino acids. The model also predicted a sequence composition very close to the accepted theoretical homeo-domain consensus sequence (Figure H). The accuracy graph (Figure I) showed a highest score of around 80%, which is what you would expect for a variably conserved sequence (100% would mean the sequence is exactly the same). The loss/validation graph (Figure G) shows the value of the error function on the validation set (a lower number means a more accurate prediction). The accuracy is an average over all the proteins in the data set the model was tested on, where the task was to predict whether the sequence it was looking at contained the theoretical homeo-domain sequence. This model was used as the control for comparison with the previous artemisinin binding set. Once the model was trained, we ran the theoretical homeo-domain consensus sequence through it, and the model detected that the sequence was present with a probability of 94% (see software page).
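The effect of sequence length on how often the whole domain is in view can be illustrated with a small counting exercise. The sketch below assumes sub-sequences are cut as sliding windows with a stride of one residue, which may differ from how our model actually samples them; the 200-residue protein and the domain position are hypothetical.

```python
# Minimal sketch: count the sliding windows that contain a whole domain.
# ASSUMPTIONS: windows slide one residue at a time; the protein length
# (200) and domain start (70) are made-up values for illustration.


def windows_covering_domain(protein_len, domain_start, domain_len,
                            window_len, stride=1):
    """Count windows of window_len that fully contain the domain."""
    domain_end = domain_start + domain_len  # exclusive end index
    count = 0
    for start in range(0, protein_len - window_len + 1, stride):
        end = start + window_len
        if start <= domain_start and end >= domain_end:
            count += 1
    return count


for window_len in (60, 100):
    n = windows_covering_domain(protein_len=200, domain_start=70,
                                domain_len=60, window_len=window_len)
    print(f"window length {window_len}: {n} window(s) contain the full domain")
```

Under these assumptions, a 60-residue window captures the full domain only when it is aligned exactly with it, whereas a 100-residue window captures it from 41 different starting positions.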
Using the LSTM learning protocol, prospective artemisinin binding proteins can be preliminarily validated without needing to generate them in vitro. This allows us to rapidly determine which sequences are likely to bind, saving time and money. However, only a limited number of proteins have known artemisinin binding functionality, and manually generating and testing random mutations is tedious; the next logical step is to have the LSTM protocol generate likely binding proteins autonomously. This approach could also be applied to functions other than binding to artemisinin, allowing researchers to create novel synthetic proteins for many different applications.
Using the SnapGene visualization software, we created a device that can constitutively express a protein or set of proteins and then inducibly lyse E. coli cells. This system can be used in other applications to extract proteins from cells more effectively, as larger proteins can be difficult for bacteria to transport. It can also be used as a “kill switch” to lyse bacteria that have served their purpose, allowing for rapid disposal of cell cultures.