Difference between revisions of "Team:Heidelberg/Software/SafetyNet"

Line 53: Line 53:
 
</style>
 
</style>
 
<script>
 
<script>
{{Heidelberg/title|Safetynet}}
+
{{Heidelberg/title|SafetyNet}}
 
</script>
 
</script>
 
}}
 
}}
Line 65: Line 65:
 
         https://static.igem.org/mediawiki/2017/e/eb/T--Heidelberg--2017_SafetyNet_GA.jpg|
 
         https://static.igem.org/mediawiki/2017/e/eb/T--Heidelberg--2017_SafetyNet_GA.jpg|
 
         When performing large scale, automated directed evolution experiments a manual assertion of every sequence in the library is impossible. However profound background and quality checks on sequences are crucial in the automated context as the experimentator has no direct control of the processes. This especially holds true for <i>in silico</i> evolution, where the immediate effect of a mutation is not assessable.
 
         When performing large scale, automated directed evolution experiments a manual assertion of every sequence in the library is impossible. However profound background and quality checks on sequences are crucial in the automated context as the experimentator has no direct control of the processes. This especially holds true for <i>in silico</i> evolution, where the immediate effect of a mutation is not assessable.
In order to safeguard our <i>in vivo</i> and <i>in silico</i> directed evolution experiments we developed Safetynet.
+
In order to safeguard our <i>in vivo</i> and <i>in silico</i> directed evolution experiments we developed SafetyNet.
Safetynet is a web available, neural network based sequence check. It does not only infer the function and species of origin, but does also assert the safety level assigned to the origin species and the potential harm of an input sequence. We applied SafetyNet throughout our directed evolution experiments to ensure safe and flawless sequence improvement all the while preventing the unintended emergence of harmful traits.
+
SafetyNet is a web available, neural network based sequence check. It does not only infer the function and species of origin, but does also assert the safety level assigned to the origin species and the potential harm of an input sequence. We applied SafetyNet throughout our directed evolution experiments to ensure safe and flawless sequence improvement all the while preventing the unintended emergence of harmful traits.
 
     }}
 
     }}
 
     {{Heidelberg/templateus/Contentsection|
 
     {{Heidelberg/templateus/Contentsection|
 
             {{#tag:html|
 
             {{#tag:html|
 
<h2>Method</h2>
 
<h2>Method</h2>
Safetynet is based on two algorithmic pillars. The first one is a BLAST search of the input sequence against the swissprot database, performed through the NCBI web API. The request is POSTed to the NCBI server and the result is catched with GET request. Subsequently the result is parsed for the protein IDs of all non redundant matches. Next, the retrieved protein IDs are used to send a GET request to the UniProt database, requesting the entry of the protein in question. The entry is again parsed for key information, this time returning the assigned GO-Terms, the species of origin and the gene of origin. Subsequently the collected information on each entry is combined and a lookup on the safetynet internal databases is performed. These comprehensive databases list GO-Terms associated with cytotoxic, viral or pathogenic functions or pathways. Further we included the functional terms for proteases and nucleases, to account for destructive intracellular potential. The biological safety level of the retrieved species of origin is investigated by a database lookup on the biosafety-database of the <a href"https://www.bvl.bund.de/DE/06_Gentechnik/gentechnik_node.html">german ministry of consumer and food safety</a> (the german FDA).<br>
+
SafetyNet is based on two algorithmic pillars. The first one is a BLAST search of the input sequence against the swissprot database, performed through the NCBI web API. The request is POSTed to the NCBI server and the result is catched with GET request. Subsequently the result is parsed for the protein IDs of all non redundant matches. Next, the retrieved protein IDs are used to send a GET request to the UniProt database, requesting the entry of the protein in question. The entry is again parsed for key information, this time returning the assigned GO-Terms, the species of origin and the gene of origin. Subsequently the collected information on each entry is combined and a lookup on the SafetyNet internal databases is performed. These comprehensive databases list GO-Terms associated with cytotoxic, viral or pathogenic functions or pathways. Further we included the functional terms for proteases and nucleases, to account for destructive intracellular potential. The biological safety level of the retrieved species of origin is investigated by a database lookup on the biosafety-database of the <a href"https://www.bvl.bund.de/DE/06_Gentechnik/gentechnik_node.html">german ministry of consumer and food safety</a> (the german FDA).<br>
 
The second algorithmic column applies a DeeProtein implementation in the browser. Upon user request the neural network inference can additionally be enabled to support the BLAST search in function classification. This is especially useful as the neural network is able to detect latent or "hidden" potential as it learned the sequence to function relation accross the whole respective functional domain, whereas the BLAST search is limited to direct sequence identity.<br>
 
The second algorithmic column applies a DeeProtein implementation in the browser. Upon user request the neural network inference can additionally be enabled to support the BLAST search in function classification. This is especially useful as the neural network is able to detect latent or "hidden" potential as it learned the sequence to function relation accross the whole respective functional domain, whereas the BLAST search is limited to direct sequence identity.<br>
 
The browser integrated neural network is implemented in DeeplearnJS and features GPU support. It is a ResNet30, similar to the Architecture of DeeProtein, asserting the class probability for 886 classes. As the size of the ResNet-weigths is ~100MB we offer a selection mode to guarantee the use of the BLAST-part on mobile connections.<br>
 
The browser integrated neural network is implemented in DeeplearnJS and features GPU support. It is a ResNet30, similar to the Architecture of DeeProtein, asserting the class probability for 886 classes. As the size of the ResNet-weigths is ~100MB we offer a selection mode to guarantee the use of the BLAST-part on mobile connections.<br>
 
Finally the collected information is concatenated and presented in a easily understandable color coded scheme.
 
Finally the collected information is concatenated and presented in a easily understandable color coded scheme.
 +
}}
 +
    }}
 +
{{Heidelberg/templateus/Contentsection|
 +
            {{#tag:html|
 +
<h2>SafetyBLAST: a smart inter-database search</h2>
 +
Using the dialogue below, you can test arbitrary protein sequences for potential safety risks, by relying on a homology based database search. A sequence to be tested can be provided either as plain text, by entering it into the text box, or supplied in a single-sequence standard FASTA file, by providing a file handle via the choose file button. You will then be prompted to choose a file on your computer or smart phone, which will then be read by the browser. Sequences you enter will be forwarded to the NCBI's BLAST server via a reverse proxy and resulting BLAST hits will be compared against Uniprot and multiple internal risk databases. No third parties besides NCBI's BLAST and Uniprot will see any of your data, making SafetyBLAST as secure as a series of manual queries.
 +
The results will be displayed in a table under "BLAST results", with both color-coded and numerical indicators of protein safety and hit correctness. If a protein potentially matches counterparts from different bacterial strains with different safety levels, a worst-case prognosis is displayed. For example, if a protein matches both K12 <i>E. coli</i> (S1 organism) and enterohemorrhagic <i>E. coli</i> (S3 organism), the most pessimistic outcome of the S3 safety level is displayed and relevant cells in the table are marked red.
 +
Keep in mind, that using SafetyNet <b>does not</b> replace common sense and careful deliberation of security risks, merely serving as an aid to screening a large amount of sequences for safety in a short amount of time.
 
}}
 
}}
 
     }}
 
     }}

Revision as of 22:17, 1 November 2017


SafetyNet
Evolution does no harm.
When performing large-scale, automated directed evolution experiments, manually checking every sequence in the library is impossible. However, thorough background and quality checks on sequences are crucial in the automated context, as the experimenter has no direct control over the processes. This holds especially true for in silico evolution, where the immediate effect of a mutation cannot be assessed. To safeguard our in vivo and in silico directed evolution experiments, we developed SafetyNet. SafetyNet is a web-available, neural-network-based sequence check. It not only infers the function and species of origin of an input sequence, but also reports the safety level assigned to the species of origin and the potential harm of the sequence. We applied SafetyNet throughout our directed evolution experiments to keep sequence improvement safe and to prevent the unintended emergence of harmful traits.

Method

SafetyNet is based on two algorithmic pillars. The first is a BLAST search of the input sequence against the Swiss-Prot database, performed through the NCBI web API. The request is POSTed to the NCBI server and the result is retrieved with a GET request. The result is then parsed for the protein IDs of all non-redundant matches. Next, each retrieved protein ID is used in a GET request to the UniProt database for the corresponding entry. The entry is parsed for key information: the assigned GO terms, the species of origin and the gene of origin. The information collected for each entry is then combined and looked up in SafetyNet's internal databases. These databases list GO terms associated with cytotoxic, viral or pathogenic functions or pathways. We also included the functional terms for proteases and nucleases, to account for destructive intracellular potential. The biological safety level of the species of origin is determined by a lookup in the biosafety database of the German Federal Office of Consumer Protection and Food Safety (BVL, roughly the German counterpart of the FDA).
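The request flow described above can be sketched roughly as follows. This is a minimal illustration, not SafetyNet's actual code: it assumes the public NCBI BLAST URL API (`CMD=Put` to submit, `CMD=Get` with the returned request ID to fetch results), only builds the requests without sending them, and uses a small hypothetical `RISK_TERMS` table in place of the internal risk databases. All function names are made up for this example.

```python
import re
from urllib.parse import urlencode

# Public endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_body(sequence: str) -> bytes:
    """Form-encoded body for the POST request that submits a blastp search."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": "swissprot",
        "QUERY": sequence,
    }
    return urlencode(params).encode("ascii")


def parse_request_id(response_text: str) -> str:
    """Extract the request ID (RID) that NCBI embeds in the submission response."""
    match = re.search(r"RID = (\S+)", response_text)
    if match is None:
        raise ValueError("no RID found in BLAST response")
    return match.group(1)


def build_result_url(rid: str) -> str:
    """URL for the GET request that fetches the finished search as XML."""
    return BLAST_URL + "?" + urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": "XML"})


# Hypothetical excerpt of an internal risk database: GO term -> flagged function.
RISK_TERMS = {
    "GO:0090729": "toxin activity",
    "GO:0008233": "peptidase activity",
    "GO:0004518": "nuclease activity",
}


def flag_risky_terms(go_terms):
    """Return the subset of a hit's GO terms that appear in the risk database."""
    return {term: RISK_TERMS[term] for term in go_terms if term in RISK_TERMS}
```

In a full pipeline, the protein IDs parsed from the XML result would be used to fetch the UniProt entries, and the GO terms of each entry would be passed to `flag_risky_terms`.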
The second algorithmic pillar runs an implementation of DeeProtein in the browser. On user request, neural network inference can be enabled in addition to the BLAST search to support function classification. This is especially useful because the neural network can detect latent or "hidden" potential: it has learned the sequence-to-function relation across the whole respective functional domain, whereas the BLAST search is limited to direct sequence similarity.
The browser-integrated neural network is implemented in DeeplearnJS and features GPU support. It is a ResNet30, similar to the architecture of DeeProtein, and outputs class probabilities for 886 classes. As the ResNet weights are ~100 MB in size, we offer a selection mode so that the BLAST part remains usable on mobile connections.
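To illustrate how a vector of 886 class probabilities might be turned into a short list of predicted functions, here is a minimal sketch. The threshold, the number of reported classes, the class names and the probabilities are all assumptions made for this example; the text does not state how SafetyNet post-processes the network output.

```python
def top_classes(probabilities, labels, k=3, threshold=0.5):
    """Return up to k (label, probability) pairs at or above threshold, most probable first."""
    if len(probabilities) != len(labels):
        raise ValueError("need exactly one label per class probability")
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return [(label, p) for label, p in ranked[:k] if p >= threshold]


# Example with 886 hypothetical class names and made-up probabilities.
labels = [f"class_{i}" for i in range(886)]
probabilities = [0.01] * 886
probabilities[42] = 0.93
probabilities[7] = 0.61
print(top_classes(probabilities, labels))  # [('class_42', 0.93), ('class_7', 0.61)]
```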
Finally, the collected information is combined and presented in an easily understandable color-coded scheme.

SafetyBLAST: a smart inter-database search

Using the dialogue below, you can test arbitrary protein sequences for potential safety risks with a homology-based database search. A sequence can be provided either as plain text, by entering it into the text box, or as a standard single-sequence FASTA file, selected via the "choose file" button. In the latter case you will be prompted to choose a file on your computer or smartphone, which is then read by the browser. Sequences you enter are forwarded to NCBI's BLAST server via a reverse proxy, and the resulting BLAST hits are compared against UniProt and multiple internal risk databases. No third parties besides NCBI's BLAST and UniProt will see any of your data, making SafetyBLAST as secure as a series of manual queries.
The results are displayed in a table under "BLAST results", with both color-coded and numerical indicators of protein safety and hit correctness. If a protein potentially matches counterparts from different bacterial strains with different safety levels, a worst-case prognosis is displayed. For example, if a protein matches both K12 E. coli (S1 organism) and enterohemorrhagic E. coli (S3 organism), the most pessimistic outcome, the S3 safety level, is displayed and the relevant cells in the table are marked red.
Keep in mind that using SafetyNet does not replace common sense and careful deliberation of safety risks; it merely serves as an aid for screening a large number of sequences for safety in a short amount of time.
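The input handling and the worst-case rule described above can be sketched as follows. This is an illustrative example under stated assumptions, not the code behind the dialogue: the function names are hypothetical, and the color table is invented apart from "red for S3", which is the only color the text mentions.

```python
def read_single_fasta(text: str) -> str:
    """Return the sequence from a single-record FASTA string (or plain sequence text)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        raise ValueError("expected a single-sequence FASTA file")
    return "".join(line for line in lines if not line.startswith(">")).upper()


# Hypothetical color scheme; only "red for S3" is taken from the text above.
LEVEL_COLORS = {1: "green", 2: "yellow", 3: "red", 4: "red"}


def worst_case(levels):
    """Return the highest (most pessimistic) safety level among the hits and its color."""
    if not levels:
        raise ValueError("no hits to evaluate")
    level = max(levels)
    return level, LEVEL_COLORS[level]
```

For the E. coli example, `worst_case([1, 3])` yields level 3 with the color red, regardless of how many S1 hits accompany the S3 hit.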
Check protein safety using BLAST

Results of SafetyBLAST search.

IMPORTANT

To use our deep neural network classifier, DeeProtein, the network's parameters have to be loaded into memory on your computer. As our classifier is very deep, that is, many-layered, the total size of the weights exceeds 100 MiB. To let you use DeeProtein without placing too much of a burden on your internet connection and data plan, we provide the weights as a file that you download only once and can then load into memory on your machine whenever you plan to use DeeProtein.

If this is your first time using DeeProtein, download the weights using the button on the left. If you have already downloaded the weights, click the button on the right and point DeeProtein to where they are on your computer. Happy testing!

Check safety using DeeProtein

Results of DeeProtein inference.
