Key Achievements
- Developed a model to inform CRISPR/Cas9 guide design for any genome input.
- Developed a script to parse an input gene for PAM sites and generate a list of all protospacers.
- Ranked sgRNA for all NGG sites in a gene based on the relative strength of their target site compared to all off-target hits on the genome.
- Made our script publically available for use by other teams.
- Collaborated with the University of Toronto iGEM team to design the sgRNA for their project.
Background
Breakthroughs with CRISPR gene editing have been flooding the scientific scene in recent years. This system works by repurposing a bacterial nuclease that precisely targets a position of the host genome with the help of a small stretch of RNA called a single-guide RNA, or sgRNA. Properly designing sgRNAs is crucial to the success of any CRISPR experiment yet most online sgRNA design tools are limited to a selection of curated eukaryotic genomes such as human, mouse or fly genomes. While these tools are helpful to researchers working on these model organisms, the lack of flexibility in genome choice creates a significant barrier to projects involving non-model organisms.
UBC iGEM has decided to tackle this issue by creating a flexible sgRNA design program that allows the user full reign over the choice of their target genome. This model builds off of a biophysical model of CRISPR activity published by the Salis Lab at Pennsylvania State University (Farasat & Salis 2016). Its usage is simple – provide the program with the appropriate genome and target gene sequence, and it will return a list of potential guide sequences ranked from best to worst.
Model Theory
The 20 nucleotide sequence that the CRISPR/Cas9 protein targets with the sgRNA construct on a DNA strand is called a protospacer. CRISPR-Cas9 uses a specific nucleotide combination adjacent to the protospacer, the PAM, to bind to the DNA and facilitate cutting. In Streptococcus pyogenes the nucleotide combination of the PAM site is 5’ - NGG - 3’. In any DNA sequence, S. pyogenes CRISPR/Cas9 can only bind and cut locations with a PAM site. For example, CRISPR/Cas9 with a guide sequence of ATGCTTACGTAGCTCCATGT could could bind to a DNA fragment with the sequence ATGCTTACGTAGCTCCATGTNGG (where N is any nucleotide), but could not bind to a sequence ATGCTTACGTAGCTCCATGTNCG due to the absence of a PAM site; the PAM is essential to DNA cleavage with CRISPR/Cas9.
But what about a CRISPR/Cas9 complex with a guide that didn’t match the 20 nucleotide sequence, but has a PAM site? According to Farasat and Salis (2016), the binding efficiency for CRISPR/Cas9 is determined by the Gibbs free energy released by complementary nucleotide binding and the energy spent uncoiling the DNA complex. The likelihood that the CRISPR/Cas9 complex binds to a DNA fragment is reduced due to each mismatched sequence between the sgRNA and the guide strand. Mismatches closer to the PAM site reduce the likelihood of binding more than matches further away from the PAM site. This effect is likely caused by the mechanism of sgRNA binding, theorized to bind to the strand like a zipper starting from the PAM site.
To calculate the total Gibbs free energy release (∆G) for a given protospacer, the following equation is used:
∆GPAM
This is the free energy released from binding of Cas9 to the PAM site. It is calculated by using preset experimental values for the free energy release when Cas9 binds to various iterations of NGG, such as AGG or CGG.
∆Gprotospacer
This is the free energy released from binding of the protospacer to the complementary DNA sequence. In principle, mismatched bases between the sgRNA and the DNA strand will reduce the free energy release, making binding between the guide and the target less favorable. In addition, mismatches impact adjacent nucleotide pairings, so energy loss data for mismatched nucleotides must be considered in pairs (Figure 1). The energy lost due to a mismatch must be multiplied by a distance factor; mismatches closer to the PAM cause relatively greater changes in free energy, while mismatches further from the PAM site had less effect overall on the energy released (Figure 2b). The values for nucleotide matching and distance factor were obtained experimentally via deep sequencing (Farasat & Salis, 2016) and allow biologically relevant ∆G values to be assigned to each nucleotide pair. A simplified version of the Farasat & Salis equation is:
Where k is the number of bases away from the PAM site, dk is the distance factor from Figure 2 and ∆GRNA:DNA is the value obtained from Figure 1.
Figure 1. Experimentally derived ∆G values (kcal/mol) for sgRNA:DNA binding from Farasat & Salis 2016. For example, the ∆G value for AC:AA binding is 0.75 kcal/mol, shown in green and blue.
Figure 2. Relative importance of each nucleotide in the sgRNA based on distance from PAM site. This information is experimentally determined and shows a general trend of decreasing pairing importance moving further away from the PAM site.
∆Gsupercoil
This is the energy required to ‘unzip’ the supercoiled DNA to allow for binding to the exposed nucleotides. ∆Gsupercoil is a constant in this model, based on the average superhelical density of the E. coli genome and the amount of energy it takes to unwind it to 1/10 of its original density.
The British_Columbia team consulted the research group responsible for calculating the Gibbs free energies for guide binding and obtained preliminary code, provided by Dr. Howard Salis. The initial model would compare a short input nucleotide sequence (or sgRNA) against an input DNA sequence and calculate the Gibbs energy for each protospacer match. The protospacers would then be ranked based on their energy released, and a partition function of each protospacer would be calculated. The partition function is the relative likelihood of binding to a protospacer, when all available protospacers are considered.
For our model, we wanted to functionalize the code provided by Dr. Howard Salis to work with genomic DNA (the initial system could not allow inputs larger than a few hundred thousand nucleotides), allow the ability to input multiple DNA inputs (so plasmids can be included in the search), and we wanted the ability to search multiple guides at once and compare them directly. Another design requirement was a matching nucleotide location of the guide to the genome, so the site could be found easily after modelling. Finally, we wanted it to be user friendly, working directly file inputs instead of re-writing code every time a guide needed to be run.
Model Development
Rather than having model users choose their guide sequences based on their nucleotide sequence, we wanted our model to provide a list of all potential protospacer sites in an input gene, ranked by their partition functions (how well the sgRNA binds to the target protospacer compared to the binding efficiency of the guide to all other protospacers). To accomplish this, we first developed a code to search a gene for all potential protospacers.
Initially the code had been developed to search a single sgRNA sequence against a short DNA input (that did not necessarily contain a protospacer matching the sgRNA provided) and rank the PAM sites with a partition function used as a measure of the amount of off target cuts that occur outside the target gene. While this is useful, it lacked the functionality we wanted.
An initial adjustment to the model allowed us to search larger DNA inputs (to include genomes). This was done by expanding the file types to include fasta and allowing the files to be appended to the processing queue. Next, we wanted the ability to input a number of DNA sequences to search against. This function would be useful when plasmids have been added to the system that we want the program to consider - either to avoid cutting a useful plasmid, or to target a plasmid region of interest. Our own project is proof of the utility of this feature, as our CRISPR/Cas9 design is targeted to a plasmid in Agrobacterium. Next we added increased functionality by automating the sgRNA candidate selection process. Our script will now generate a list of all potential protospacer cut sites based on a gene or sequence supplied to the system. This list also attaches a location tag of the top protospacer match for each sgRNA generated. Finally, we wanted to combine all this information in a simple and easy to interpret table, so that guides can be easily selected based on their partition function, target location, and specific sequence.
The first step was to make a user input command so the gene to generate guides from, the DNA sequences to search, and model processing conditions could be set. Code was written with this in mind and values are taking as command line parameters and passed to the corresponding functions.
The model starts by searching the gene input for all PAM sites, creating a list of all protospacer sequences (and therefore sgRNA sequences) available in the gene. The model then runs each sgRNA sequence through the biophysical model, searching for sites on the rest of the genome where the sgRNA could also bind. For each protospacer, including the the target protospacer and all off-target sites, the model calculates the binding affinity of the sgRNA sequence. The model does this by checking each nucleotide pair in the sgRNA against the protospacer. Each sgRNA:DNA nucleotide pair is used to reference a matrix of energy data and weighted based on distance from the PAM site (Fig. 2, 3), calculating the free energy for each pair. This procedure is performed on all protospacers for each guide sequence obtained from the gene, generating the Gibbs free energy and the partition function for each sgRNA.
The information for each guide is compiled and exported into an excel file (Figure 4). With this file, the guides can be ranked on their partition function, allowing the sgRNA with the highest affinity to the target site to be chosen. The automation we built into our software allows the process of choosing guides for any genome and targeting any gene to be fast and without human error.
There is much optimization could be done to reduce the runtime of the code that currently takes an inconvenient length of time. Many of the function calls repeat and perform work that could be done once at the start of the thread. This kind of data processing is very open to Multi-threading which could reduce the run time by an order of magnitude for most consumer laptops and significantly more on a server.
Model Usage
A terminal command can be used with our interface to run the model with a set of genomic data. The code can be found on our github page. Genbank files are inputted as a standardized source file format containing annotations for genomic sequence information.
Example:
python Cas9.py -q genome.gbk [genome_n.gbk] -m MODEL.mat -t target_seq.fasta -p 0
This will start a full run of the script examining all candidate sgRNAs that could be used to cut the target gene, along with their respective ∆G, and partition functions. The output is written to a CSV file with the fields: guideRNA sequence, target sequence, position, dG_Target, % partition function and information about the source files used.
Parameters:
-m : The model file with the matrix of experimental data for ∆G lookup table.
-q : Sequences that need to be analyzed for off-target guideRNA binding (ie. Host genome and extraneous plasmids).
-t : The target sequence (FASTA-formatted) for Cas9-mediated cleavage.
-p : The minimum partition value required to report a potential guide RNA.
Figure 3. An example of terminal window showing the input command and the summary of progress as the program runs.
Figure 4. The head of our output file, showing summary information about the run as well as a table with information of all potential guide sequences.
Experimental Validation
Using this model, we have designed sgRNAs to target essential virulence regions in Agrobacterium. The guides obtained from running the model were shown to be effective in vitro, as demonstrated on the CRISPR/Cas9 page!
Collaboration with University of Toronto
The University of Toronto desired to choose guides to target the AraC gene with S. pyogenes CRISPR/dCas9 (catalytically inactive Cas9) in E.coli K12 MG1655. To regulate gene expression with CRISPR/dCas9, it is desirable to have guides near the beginning of a gene to block transcription activity. Determining guides for this problem was the perfect problem for our model to solve. A complete list of potential guide RNA in the AraC gene was produced, with their respect partition functions and location on the genome.
Model execution determined their original guides sequences, AAACGAGTATCCCGGCAGCA and AACGAGTATCCCGGCAGCAG had a relatively low partition functions of 47.276 and 35.326 respectively, indicating that there would be many off target CRISPR/dCas9 matches on the genome. We provided their team two guides from the complete list that would be appropriate choices for their project. We ranked the guides by genome location, searching for guides with a partition function above seventy within the first 100 base pairs of the gene. The guides our model suggested were: TGGGCGTTAAACGAGTATCC, appearing on the bottom strand (with a PAM site at the 33rd nucleotide from the start), which has a partition function of 75.957; and TTAACGCCGATTGAGGCCAA appearing on the top strand (with a PAM site at the 86th nucleotide from the start), and has a partition function of 72.671. These guides should provide a more precise target for CRISPR/dCas9 while retaining the desired function of dCas9. The University of Toronto expressed their gratitude for the guide sequences and will provide us any data produced with them during the competition.
Concluding Remarks
We believe that our sgRNA design program will prove to be incredibly useful for any team looking to use CRISPR in their projects. This model brings unmatched specificity while still remaining easy to use. Given that many iGEM teams work with non-model organisms or modified model organisms (think E. coli with a plasmid), we predict that our model will be a fantastic asset to future iGEM teams working with CRISPR.
References
Farasat I, Salis HM (2016) A Biophysical Model of CRISPR/Cas9 Activity for Rational Design of Genome Editing and Gene Regulation. PLOS Computational Biology 12(1): e1004724. https://doi.org/10.1371/journal.pcbi.1004724