Team:Lethbridge/Software



One of these sequences encodes a toxin.

Do you know which?

ACAGTTACACGGACAACAAGGTGTTCCAGGCTTCTTTCCTCCCTTCGACGATGTATTTCCAATAGGTGTAATCGTCGGCGAAATCGTGTTCGGTTCTCCCGACGACTGCGAAGGAAACGGATTGTTCGGTGTATTCGGCGTGTACAGCAGAATTTCACCCGACAGCGGTCCTTCTTCACCAAATCCCAGCGGCGGCGG

CTGCGCGATGCTCGACGAGTCAATTCCGCTGTAGTACAGGAAGTCGTACGAGTCATTGCTATTGTATCAGCTAGAGATAATCGCGTACGCGCTCGAGCTCGAGCTATTTCGTCCTGAGCTGATGTCTCCGTCGATAATGAAAAATCCTCCGCTGATGTCCAGGTACAGACCCAGTCCGTCGTATCCTCCAATCGCTGA

GGTGCGAAGATTGACGACCCGCTCTATATTCATCATGTGTGGCCGCATGACCCGACAATTACACATTTCATTTTAAAGCTCGCGCATGCGATTGACATTGACATTACACTATATGAAATTAAGCCGTATCCGAAGCTCTGGCGTTATTATATTAAGCCGGTGCATGTGTCTGTGTATCCGCATTATTATCTCGCGGAA

AGGCACTTCCTACTTCTTAAGAAACGGCTAAGCAGCAGAGTTAAGAGCCTTAAGTCACTATCAAGCCCGCTAGTATTCAAACACAGCCACCTACTTCTACTTCTATCATGGCGGATGCTATTCAAGCGGAAGTTCAAAGTTTGCCGGCGGCTATTCAAGAGAAGCAGACCAAGACGGAAGAGCCGGCGGAAACACATG

The Next vivo Connection

Rapid Cell-Free Systems

In essence, our project is a rapidly purifiable cell-free system that brings the benefits of synthetic biology to as many people as possible. To do so, we provide methods to easily purify all of the necessary transcriptional and translational components. These include proteins and RNAs, including functional tRNAs. Furthermore, the Next vivo system lacks genomic DNA and is instead a minimal system with a simple DNA input and protein output. Because of these characteristics, Next vivo is highly amenable to genetic recoding.

For a more comprehensive look at the system, check out our design page.

Genetic Recoding

Genetic recoding is a process by which the conventional relationships between codon-anticodon and tRNA-amino acid are altered. For instance, the amber stop codon (UAG) can be reassigned to instead incorporate a natural or unnatural amino acid into a growing peptide. [1]

Modifying the relationship between codon and amino acid incorporation is equivalent to the creation of a novel genetic code.
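As a minimal sketch of the amber-codon example above, the following Python snippet translates the same DNA under the standard genetic code and under a code in which the amber codon (TAG in DNA, UAG in mRNA) is reassigned. The function name, the toy sequence, and the use of "X" as a placeholder for the newly assigned amino acid are illustrative assumptions and are not taken from the CODONxCHANGE suite.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, code):
    """Translate a DNA coding sequence with the given codon table.

    Translation stops at the first codon that the table maps to "*".
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = code[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical recoded table: amber (TAG) now inserts a placeholder residue "X".
AMBER_RECODED = dict(STANDARD_CODE, TAG="X")

toy_gene = "ATGGCCTAGAAATAA"  # made-up sequence for illustration only
print(translate(toy_gene, STANDARD_CODE))  # MA   (TAG read as stop)
print(translate(toy_gene, AMBER_RECODED))  # MAXK (TAG read as an amino acid)
```

The same DNA yields two different proteins depending on the codon table, which is the sense in which a changed codon assignment amounts to a new genetic code.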

This has numerous benefits including the incorporation of unnatural amino acids, biocontainment, and protein engineering.

Genetic Recoding vs. Codon Reassignment

There is some discussion surrounding the use of the terms “Genetic Recoding” and “Codon Reassignment.” Because our system falls between two proposed definitions, we have chosen to refer to the practice as “Genetic Recoding” in the context of our project.

Recoding can be accomplished via:

Introducing orthogonal tRNA-aaRS pairs [2]

Mutating tRNA-aaRS pairs [3]

tRNA misacylation by promiscuous RNA enzymes (Flexizymes) [4]


Other iGEM Teams are also working on codon reassignment for alternative purposes. Check out the awesome project at Bielefeld where they focus on expanding the genetic code!

Encrypted Sequences

Novel Genetic Codes

Though this is a developing field, genetic recoding will continue to advance as scientific understanding and computational design improve. It is not hard to imagine the construction of a library of tRNAs that can be charged with non-canonical amino acids. Whether this is achieved via flexizymes or mutant pairs, selecting internally consistent sets of tRNAs and charging machinery will make it trivially easy to design a novel genetic code, and the Next vivo system would make it readily obtainable.

The apparent risk of this technology is that genetic recoding may allow harmful sequences to be “encrypted”, thus masking the information contained within while retaining the ability to faithfully produce the encoded protein.

Analyzing the sample space provided by the genetic code shows that recoding has the potential to generate numerous genetic codes, according to the following formula:

$$N = \frac{(n^{l})!}{(n^{l} - a)!}$$

where n is the number of nucleic acid bases, l is the length of the codon, and a is the number of amino acids that need to be assigned a unique codon.

When every amino acid is reassigned, a simplistic estimate with n = 4, l = 3, and a = 20 (64!/44!) suggests that there are 4.77 x 10^34 possible genetic codes.

That's 47 decillion, or 47 million billion billion billion genetic codes. This is an extremely large sample space to search combinatorially. Despite the size of the sample space, it remains to be seen whether or not this relationship is cryptographically strong. The program used to perform this calculation can be found on our GitHub page.
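The estimate can be reproduced in a few lines of Python. This is a minimal sketch and not the program hosted on GitHub; the function name `count_genetic_codes` is an assumption made for illustration. It counts the ordered ways to give each of a amino acids its own codon out of n^l possible codons.

```python
from math import factorial


def count_genetic_codes(n=4, l=3, a=20):
    """Count the ways to assign a unique codon to each of `a` amino acids.

    n: number of nucleic acid bases
    l: codon length
    a: number of amino acids that each need a unique codon
    """
    codons = n ** l
    if a > codons:
        raise ValueError("more amino acids than available codons")
    # Ordered selection of `a` codons out of `codons`: codons! / (codons - a)!
    return factorial(codons) // factorial(codons - a)


total = count_genetic_codes()
print(f"{total:.2e}")  # 4.77e+34
```

Python integers have arbitrary precision, so the factorials are computed exactly and only the printed value is rounded.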

Preliminary Testing

The potential for harm as a result of this technology is not to be underestimated. If recoded systems become as prevalent and easy to obtain as we expect them to be, control over where toxin sequences are sent greatly diminishes. Accordingly, we reached out to gene synthesis companies to determine whether or not current bioinformatic technologies can detect radically re-coded toxin sequences.

Emails were sent to all current members of the IGSC asking them to screen twelve sequences for us. Of the five companies that were willing to help us, all of them correctly identified the un-encrypted toxic proteins. However, no organization could correctly identify the encrypted toxins.

Detecting Encrypted Sequences

Of the 11 IGSC member companies contacted, five (n=5) agreed to test our sequences. Two control sequences were sent along with the encrypted sequences: unencrypted GFP and unencrypted conotoxin. The remaining 10 sequences consisted of equal numbers of encrypted GFP and encrypted conotoxin. Because BLAST relies on the Universal Genetic Code, no company was able to detect the encrypted sequences.

| Sequence Identity | Unencrypted Green Fluorescent Protein (n=1) | Unencrypted Conotoxin (n=1) | Encrypted Green Fluorescent Protein (n=5) | Encrypted Conotoxin (n=5) |
|---|---|---|---|---|
| Identification Rate | 100% (±0%) | 100% (±0%) | 0% (±0%) | 0% (±0%) |

While this result is unsettling, it is not unexpected. All of genomic science relies heavily on the assumption that there is a known relationship between DNA and protein. Though the technology for large-scale recoding does not presently exist, it is prudent to be prepared instead of ignoring a potentially dangerous problem.

This experiment was repeated using each variation of the BLAST software hosted on the NCBI website.[5] Again, the software could not identify any of the completely recoded sequences. If you are curious about the results, the sequences that we sent are available for you to try as well.

Following the initial testing, we have maintained correspondence with individuals at these companies and are looking forward to working closely with them to ensure that DNA synthesis remains a safe and secure practice. We would also like to thank them for their tremendous assistance in identifying and dealing with this problem before it becomes a pressing security issue. Synthetic biologists need DNA, and DNA synthesis needs new bioinformatic screening tools.

Beating BLAST

Basic Local Alignment Search Tool

Currently, the only tool maintaining the safety and security of DNA synthesis is BLAST. We have shown above that complete recoding nullifies the ability of BLAST to detect a sequence, but it remains to be seen how much recoding BLAST can tolerate before a sequence becomes totally unmatchable to a reference. BLAST works by breaking a query sequence into small ‘words’ of a specified length. Words that exactly match a sequence within the database seed local alignments, which are then extended into ‘high-scoring segment pairs’ that contribute to a positive alignment score. In essence, the more exact word matches between a query sequence and a database sequence, the better the alignment score will be.
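To make the word-matching idea concrete, here is a minimal Python sketch that measures what fraction of a query's words occur exactly in a reference sequence. It illustrates only the seeding step; real BLAST goes on to extend and score the seeds. The function names and the default word length of 11 are assumptions chosen for illustration.

```python
def words(seq, k):
    """Return the set of all length-k substrings ('words') in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_fraction(query, reference, k=11):
    """Fraction of the query's distinct words that appear exactly in reference."""
    query_words = words(query, k)
    if not query_words:
        return 0.0  # query shorter than one word
    return len(query_words & words(reference, k)) / len(query_words)


# Identical sequences share every word.
print(shared_word_fraction("ACGTACGTAC", "ACGTACGTAC", k=4))  # 1.0

# Changing the third base of every codon leaves no exact 4-base word intact.
print(shared_word_fraction("GCTGCTGCT", "GCCGCCGCC", k=4))  # 0.0
```

With no shared words there is nothing to seed an alignment, which is why sufficiently recoded sequences return no hits at all.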

However, it is not intuitively obvious what degree of genetic recoding is required to evade detection via BLAST. To test this, we developed a software tool within the CODONxCHANGE suite, written in Python 2.7, to test the integrity of the BLAST platform against sequences that have been partially encrypted with a set number of recoding events. This tool can also be used to prepare genes for implementation in an orthogonal cell-free system for biocontainment purposes.

SeCReT (Sequential Codon Reassignment Tool)

The tool is designed to take a nucleic acid coding sequence or protein sequence as an input, and return an ‘encrypted’ version of the sequence. It achieves this by translating a DNA sequence into a protein sequence, and then sequentially assigning a random codon to each unique amino acid required within the protein. The resultant sequence is returned in a newly encrypted state.

How Robust is BLAST?

The sequence of Cholera Toxin A was randomly encrypted N times for each N possible recoding events. The sequences were then analyzed via BLAST and the percent identity of each sequence was plotted as a function of the number of recoding events.


The results of this analysis suggest that the most effective number of switches to beat BLAST is X. More results can be found on the GitHub page. Again, the raw data generated from the experiments are available for public analysis.

Building Solutions

Changes on the Horizon

Though the power of the BLAST program to detect genetically recoded sequences has been shown to be incredibly limited, there are initiatives to develop new biosecurity tools. The Intelligence Advanced Research Projects Activity (IARPA), the cousin of DARPA, has a program called Functional Genomic and Computational Assessment of Threats (Fun GCAT), which aims to catalyze the development of tools to improve DNA screening capabilities. Several of the synthesis companies that we spoke to are involved in this program.

DeToxIT (Decryption and Toxin Identification Tool)

Here we throw our own hat into the ring with a simple tool to decrypt DNA sequences that have been radically recoded and compare them to a database of known select agents.
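This page does not describe how DeToxIT works internally, so the following Python sketch shows only one possible screening idea and should not be read as the tool's actual method. It assumes that a fully recoded sequence uses exactly one codon per amino acid, so the pattern of repeated codons in the query mirrors the pattern of repeated residues in the original protein even when the codon assignments are unknown. All names and the toy reference entries are hypothetical.

```python
def repeat_pattern(symbols):
    """Map each symbol to the order in which it first appeared.

    For example, "MKKLM" becomes (0, 1, 1, 2, 0).
    """
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def split_codons(dna):
    """Split a DNA string into complete codons, dropping any trailing bases."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def match_recoded(dna, reference_proteins):
    """Return names of reference proteins whose repeat pattern matches the query.

    reference_proteins: dict mapping a name to an amino acid sequence.
    """
    query_pattern = repeat_pattern(split_codons(dna))
    return [
        name
        for name, protein in reference_proteins.items()
        if repeat_pattern(protein) == query_pattern
    ]


# Toy reference database (made-up peptides, not real select agents).
references = {"toy_peptide_1": "MKKLM", "toy_peptide_2": "MKLLM"}

# A query whose codon assignments are unknown to the screener.
print(match_recoded("GGGCCCCCCTTTGGG", references))  # ['toy_peptide_1']
```

This sketch compares whole sequences only and would miss partial matches or codes that use several codons per amino acid; a practical screen would need windowed or approximate matching.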

Test it out with a trial data set found on our GitHub page!

Summary

Biosecurity Analysis

We determined that current biosecurity protocols are ineffective at detecting recoded sequences.

We also determined that the minimum degree of recoding required to evade BLAST is X recoding events.

Lastly, we have been in contact with members of the IGSC and are excited to continue working to keep DNA synthesis safe.

CODONxCHANGE Software Suite

GRecoS (Genetic Recoding Space)

SeCReT (Sequential Codon Reassignment Tool)

DeToxIT (Decryption and Toxin Identification Tool)

Looking for the Source Code?

While most of our software is freely available on our GitHub repository, you will notice that some of the tools do not have their source distributed. For the interim, we have been advised not to publish the source code for the fully functional encryption software.

Until we know the ethical and legal standing surrounding the distribution of the software, it will be available via direct contact only.

Instead, a feature-reduced version of the software is available as a pre-compiled binary to get a taste of how it works, and a tangible idea for just how effective recoding is at hiding the identity of a sequence. If you would like to get access to the source code, you can get in touch with us by following the link and verifying that you are affiliated with an academic institution or other trusted party. We love how safe and accessible Synthetic Biology is, and we are excited to continue developing tools to keep it that way. Thank you for your understanding.

References

[1] Young, T. S. and P. G. Schultz, Beyond the Canonical 20 Amino Acids: Expanding the Genetic Lexicon. Journal of Biological Chemistry, 2010. 285: 11039-11044.

[2] Javahishvili, T., A. Manibusan, S. Srinagesh, D. Lee, S. Ensari, M. Shimazu, and P. G. Schultz, Role of tRNA Orthogonality in an Expanded Genetic Code. ACS Chemical Biology, 2014. 9(4): 874-879.

[3] Chatterjee, A., H. Xiao, and P. G. Schultz, Evolution of multiple, mutually orthogonal prolyl-tRNA synthetase/tRNA pairs for unnatural amino acid mutagenesis in Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America, 2012. 109(37): 14841-14846.

[4] Ohuchi, M., H. Murakami, and H. Suga, The flexizyme system: a highly flexible tRNA aminoacylation tool for the translation apparatus. Current Opinion in Chemical Biology, 2007. 11(5): 537-542.