Team:TUDelft/Off targeting

Off-targeting Full Model

Introduction to the off-target model

Our project involves employing Cas13a for detecting antibiotic resistance genes. Cas13a acts like a detective that can signal the presence of specific RNA sequences. Its target description is provided by a specific single guide RNA sequence, denoted as the crRNA, that is positioned (loaded) within the protein. Cas13a is activated when the loaded crRNA pairs with a complementary (target) sequence. The crRNA is therefore essential in directing Cas13a towards a predefined target. Besides specific target recognition, proteins from the CRISPR-Cas family may also exhibit off-target activity. While offering possibilities to target related variants of antibiotic resistance genes, this off-target activity could also compromise the reliability of a Cas13a-based detection system. We address this problem by employing a recently proposed kinetic model (Klein et al. 2017) to describe the off-target effects of our detection system for a given crRNA. As such, we are able to rank potential crRNAs - derived from our motif finding algorithm - by their reliability. In addition, these modelling efforts revealed a design flaw in one of the crRNAs, guiding the labteam towards the right crRNA sequences to utilize in our experiments.

The kinetic model we used was designed for proteins similar to Cas9, and required adjustment, as Cas13a has interesting properties that the model was not developed for. Cas13a has a seed region central to the spacer (Abudayyeh et al. 2016), while other Cas protein variants, including Cas9, have a 3’-end seed Semenova et al. 2011). In addition, for Cas13a, a protospacer flanking site (PFS) – a nucleotide 3’ to the target that should not be guanine - is important, rather than a protospacer adjacent motif (PAM) - a nucleotide sequence next to the non-target strand of f.e. Cas9 (Shmakov et al. 2015). This can attributed to the fact that Cas13a detects single strand RNA sequences, rather than double stranded DNA sequences. As a result, we set the model’s energy associated to PAM-binding to 0. The process of target binding with a central seed region is illustrated in Figure 1 below.

Figure 1: Schematic of Cas13a target binding.

In the kinetic model, loaded Cas13a can be in a number of states. It can be unbound, have formed n (correct or incorrect) basepairs with a sequence that is being probed, or have cleaved its target substrate, at which point it is also activated. The ratio between the forward and backward rates determines the probability of moving between states and can be determined by the energy difference in the free-energy barrier between going forward and backwards between the state (for each position n):

$$\frac{k_{forward}(n)}{k_{backward}(n)} = e^{\Delta n} \tag{1}$$

As mentioned above, functional and structural studies of Cas13a indicate a central seed region (Abudayyeh et al. 2016; Gootenberg et al. 2017; Lui et al. 2017). Therefore, we can schematically represent binding process as depicted in Figure 2). We take the first bound nucleotide to be at position 12 (as seen from the 5’ to the 3’ end of the spacer), in the middle of the central seed region that ranges from nucleotide 9 to 15 (Abudayyeh et al. 2016). We impose that subsequent binding events occur sequentially in both directions.

Figure 2: Model representation of Cas13a target binding. A) adapted Figure 1 from Klein et al. 2017 birth-death process

With these adjustments, we can follow the recipe given in (Klein et al. 2017), we can now calculate the probability to cleave ($P_{clv}$) a target ($s$) with a given guide ($g$) once it is bound:

$$ P_{cls}(s|g) = \frac{1}{1+\Sigma ^n_{n=1} e^{-\Delta T(n)}} \tag{2}$$

With $\Delta T(n)$ the parametrization of the energy landscape: the transition state energy between the first bound state and the n-nucleotides bound state. Lastly, we make the simplifying assumption that the transition to the next state is favourable for correct Watson-Crick base pairing (with an energy gain of $\delta C$) and unfavourable for mismatches (with an energy penalty of $\Delta I$). The energy landscape is then only dependent on the number of correct ($n_c$) and number of incorrect ($n_i=n-n_c$) base pairs.

$$ \Delta T(n) = -(n-n_c) \Delta I + n_c \Delta C, \qquad n_c\leq \textrm{ and } n \in [1,N] \tag{3}$$

For convenience, we normalize the $P_{clv}$ through division by the cleavage probability for a matching (on-target) sequence and find R, the cleavage probability relative to the on-target.

$$ \mathcal{R} = \frac{P_{clv} (s|g)}{P_{clv}(g|g)} \tag{4}$$

From a numerical implementation of the model (in Python), we can now determine Cas13a’s tolerance for mismatches between the target and guiding sequence for several model parameters (the energy gain $\Delta C$ and energy penalty $\Delta I$).

Off-target simulations

Although scarce, there are some experimental data illustrating the mismatch tolerance of Cas13a (Abudayyeh et al. 2017, Figure S19 and S21). We observe there is a central seed region between nucleotide 9 and 15. Two consecutive mismatches are only tolerated at the two ends of the spacer, while three consecutive mismatches are only allowed to occur between the nucleotides at the 3’ end of the crRNA spacer and thus the 5’ end of the target. We consider these data, adjust the model parameters and compare the data to the computed tolerance for mismatches using our model. We find a decent parameter fit by eye to be:

$$ \Delta C = 0.25 \tag{5}$$ $$ \Delta I = 2.5 \tag{6}$$

Figure 3 displays the computed mismatch tolerance for single mismatches, two consecutive mismatches and three consecutive mismatches and shows that single mismatches are not tolerated in the central seed region, two consecutive mismatches are tolerated at the ends of the spacer and three consecutive mismatches are only tolerated at the 3’ end of the spacer.

barplot

Figure 3: Cas13a tolerance for mismatches. Cas13a tolerance for mismatches ($\Delta C=0.25$,$\Delta I=2.5$). The cleavage probability relative to the on-target is plotted for different types of mismatches. Single mismatches (blue line) are not tolerated well in the central seed region. Two consecutive mismatches (orange line) can be tolerated in the beginning and end of the target. Three consecutive mismatches (green line) are only allowed at the target’s 5’ end.

As can be seen from Figure 3, a maximum of approximately 3 mismatches is tolerated by Cas13a. We can use a binomial distribution to compute the probability that a random target and spacer have at most 3 mismatches (thus having 25 or more matches) is low:

$$ P_{\textrm{at least 25 matches}} = \sum ^{28}_{m=25} \frac{28!}{(28-m)!m!} \left(\frac{1}{4}\right)^m \left(\frac{3}{2}\right)^{28-m} = 1.276 \cdot 10^{-12} \tag{7}$$

Having found suitable model parameters for $\Delta C$ and $\Delta I$, we can now employ the model to compute the cleavage probability for different guides and plasmids (or even entire genomes). We started by investigating the relative cleavage probability of our crRNA 1-5 on the tetracycline-resistance conferring (TcR) plasmid that the crRNAs were designed target (Figure 4). We were able to confirm that crRNAs 1-4 target the plasmid at the desired position. However, we also found that crRNA 5 did not meet the design requirements of the PFS: the position 3’ to the target sequence was a G. This finding prevented us from wasting time on testing crRNA 5 in the lab. This is also the reason why we will not consider crRNA 5 in the remainder of the simulations.

Figure 4: Cas13a cleavage probability (relative to on-target) along the sequence of a tetracycline resistance conferring (TcR) plasmid. Cas13a activation on TcR plasmid for crRNA 1 to 5 ($\Delta C=0.25$,$\Delta I=2.5$). Figures a-e display the predicted relative probability of cleavage of Cas13a on every position along the TcR plasmid. crRNAs 1-4 clearly showed the designed on-target hit approximately at position 1750. crRNA 5 only shows off-target activity, indicating a design flaw that prevented us from wasting time on testing crRNA 5 in the lab. A) crRNA 1. B) crRNA 2. C) crRNA 3. D) crRNA 4. E) crRNA 5.

By summing the $\mathcal{R}$, the cleavage probability relative to the on-target, over all off-target positions on the plasmid, we compute the $P_{FP}$, the probability of a false positive output, provided that there is a positive output:

$$ P_{FP} = \frac{ \Sigma _{\textrm{off target}}\mathcal{R} }{\Sigma _{\textrm{off target}}\mathcal{R} + \Sigma _{\textrm{on target}}\mathcal{R}} = \frac{1}{1+ \frac{\Sigma_{on}\mathcal{R}}{\Sigma_{off}\mathcal{R}}} \tag{8}$$

Computing $P_{FP}$ for crRNA 1-4 (Figure 5) shows that crRNA 3 has the lowest probability of yielding a false-positive result on the TcR plasmid, i.e. being activated by a non-target site on the plasmid: 1 in 100 billion. Interestingly, experiments in the lab show that Cas13a also performs highest collateral cleavage activity when loaded when crRNA 3.

Figure 5: Off-target cleavage probability (relative to on-target) for crRNA 1 to 4 on TcR plasmid. Off target cleavage probability (relative to on-target) for crRNA 1-5 on the TcR plasmid ($\Delta C=0.25$,$\Delta I=2.5$). crRNA 3 (orange) is the crRNA with one on-target site that further displays the lowest probability of false-positive activation on the TcR plasmid itself (i.e. on the non-target sites). Interestingly, crRNA 3 also results in best Cas13a performance in the lab.

Rewriting the implementation of the model into a command line utility allowed us to use the TU Delft computer clusters for simulating the off-target effects of our crRNAs on the genomes of different relevant organisms: E. coli BL21 (4.5 mln basepairs (bp)), Staphylococcus aureus (2.8 mln bp), Staphylococcus epidermis (2.5 mln bp) and chromosome 27 of the domestic cow (Bos taurus, 45 mln bp). E. coli BL21 is a commonly used lab strain, which was the strain containing the tetracyclin resistance-conferring (TcR) plasmid that we designed the crRNA for. S. aureus and S. epidermis are common pathogens in mastitis, while genetic material from the cow itself may also be present when detecting sequences in cow samples, for example milk. Scanning the entire genome allows to investigate if tiny local off-target effects together influence the reliability of the output signal.

In Figure 6, we present $P_{FP}$ for our crRNA sequences for the different genomes/chromosome. In general, we find off-target cleavage probabilities below 1 in 100.000 for the bacterial genomes. As expected, the $P_{FP}$ is generally larger for the Bos taurus chromosome 27, which is approximately tenfold larger than the bacterial genomes, but reaches a maximum of 1 in 1000. As such, we conclude that the probability of a false output is low and likely does not compromise the usability of the Cas13a-mediated detection system.

In Figure 6: Off-target cleavage probability (relative to on-target) for crRNA 1 to 4 on the genomes of four organisms. $P_{FP}$ for different crRNAs and genomes of four organisms ($\Delta C=0.25$,$\Delta I=2.5$). There seems to be no relation between genome size and per base false positive output probability, as the Bos taurus chr. 27 scores for the different crRNAs are not systematically highest. Interestingly, the off-target activity per base pair tends to be highest for E. coli, the organism containing the plasmid for which the crRNAs were designed. Despite the size of the genomes, ranging from 2.5 to 45 million basepairs, even the cumulative off-target effects do not result in high false positives output.

The simulations so far indicate the detection system to be reliable, i.e. have a low probability of yielding a false-positive output. However, insufficient data are available to accurately determine the model fit parameters. In addition, experimental validation is lacking. As such, it is possible that experimental work proves Cas13a to have increased off-target activity. We investigated what the effect of this would be on the reliability of the system, and subsequently propose methods to experimentally reduce the off-target probability.

We repeated the simulation for a higher energy gain for matches ($\Delta C=0.30$) and lower energy penalty for mismatches ($\Delta I=0.80$). In this situation, we find the off-target probability to be much higher than in our original simulations: $P_{FP}$ is 10 to 800 (Figure 7). It is now more intuitive to interpret $P_{FP}$ as the total number of false positives per true positive. Given a positive output, the probability of it being false is higher than 90% in all simulations. This situation would greatly compromise the reliability of the detection system, so we set out to investigate possible solutions.

Figure 7: Number of false positive outputs (per true positive) for crRNA 1 to 4 on the genomes of four organisms. $P_{FP}$, now more intuitively interpreted as the total number of false positives per true positive, or relative signal strength, for crRNA 1-4 for the E. coli, S. aureus and S. epidermis genome and the Bos Taurus chromosome 27 ($\Delta C=0.30$,$\Delta I=0.85$). The large values indicate a false positive probability higher than 90% in all simulations (given a positive output).

In our simulations, we assume equal representation of all genomic sequences. In reality, only part of the sequences are transcribed and available in RNA, and of course in different quantities. We can correct for these effects by applying a mask over the genome which takes into account the amount of RNA transcripts for each position. Moreover, our simple sample preparation method for gene amplification (RPA) and in vitro transcription are able to specifically amplify a region of interest at least 452 times (notebook of sample prep 15-08-2017). This amplification changes the ratio between true and false positive output probabilities (by the same factor of 452) in favour of on-targeting (Figure 8). We can incorporate this amplification factor in a generalized version of $P_{FP}$: $P_{FP}'$ and we call the amplification factor α:

$$ P_{FP}\textrm{'} = \frac{1}{1+\frac{\alpha \Sigma_{on}\mathcal{R}}{\Sigma_{off}\mathcal{R}}} \tag{9}$$

From our experiments, we estimate α$=452$. In the case of no amplification, we can set α$=1$. The amplification step improves the reliability of the system from 10% up to 40-95%, even in the case of higher mismatch tolerance ($\Delta C=0.30$,$\Delta I=0.85$)

Figure 8: Reliability of a positive output for crRNA 1 to 4 on the genomes of four organisms. Reliability of positive output as represented by the ratio between the probability of a true positive output (blue) as compared to a false positive output (orange). Simulation with $\Delta C=0.30$,$\Delta I=0.85$. We take into account a 452-fold amplification of the on-target signal as a result of the gene amplification and in vitro transcription. For each crRNA, the different bars represent the E. coli genome, S. aureus genome, S. epidermis genome and Bos taurus chr. 27 from left to right respectively. The reliability of the system is improved from less than 10% to 40 to 95%.

As such, gene amplification and sensitivity not only increase the sensitivity of the detection method, these two steps in sample preparation also reduce the probability of false positive results and thereby increase the reliability of the detection system. We can now simulate $P_{FP}^\textrm{'}$ for different ratios between on- and off-target cleavage probabilities, and determine the amplification factor required to achieve reliable detection above a desired threshold $\beta$ (Figure 9). Analytically, this amplification factor can be calculated as follows by setting $P_{FP threshold}'=\frac{1}{\beta}$:

$$ \alpha_{required} = \frac{\beta}{1-\beta} \frac{\Sigma_{off}\mathcal{R}}{\Sigma_{on}\mathcal{R}} \tag{10}$$

We can see the reliability of the system increase as a function of α for different rations ($ \frac{\Sigma_{on}\mathcal{R}}{\Sigma_{off}\mathcal{R}} $) and can read-off the amplification factor required to achieve an input that is 95% reliable (i.e. the probability of a true positive given a positive output).

Figure 9: Reliability of the detection system for different on-target amplification factors. Reliability of the output as a function of α for different ratios ($ \frac{\Sigma_{on}\mathcal{R}}{\Sigma_{off}\mathcal{R}} $). The intersection between the different graphs and $\beta$, which we set to 95% as per example, allows reading off the required amplification factor required to achieve the desired reliability.

Collaboration iGEM LMU-TUM Munich

Both the iGEM TU Delft and iGEM LMU-TUM Munich teams realized that optimal crRNA is a prerequisite in utilizing Cas13a for their respective detection purposes. In addition to sensitively detecting a sequence of interest, modeling the off-target effects of a chosen crRNA is important in assessing the reliability of the system, as it provides information about the probability of a false positive output. Both teams took a different approach: we employed a kinetic model (Klein et al. 2017) while iGEM LMU-TUM Munich applied an alignment algorithm.

Biophysical modeling has the advantage of providing a unified targeting framework that is able to take into account tiny (local) off-target effects that together may add to affect the reliability of the system. However, it is computationally intensive as compared to homology search and requires the development of the software, which is already available from NCBI for homology search.

We ran 4 off-target simulations for 2 different crRNAs spacer sequences (crRNA1-3 and crRNA4) designed by iGEM LMU-TUM Munich for targeting the 16s ribosomal RNA of two different bacterial species (DH5α and B. subtilis). The on and off-target activities are modeled for sequences on all possible frames on the genome (Figure 10). From the simulation results, we deduce that the crRNAs specifically target the different bacterial species.

Figure 10: On and off-target effects of Munich crRNAs on different bacterial targets. On and off-target effects of designed crRNAs on different bacterial targets. A) crRNA 1-3 yield 7 on-target hits (blue) on the DH5α genome, while off-target probabilities (orange) remain below 1 in 10 million per site. B) crRNA 1-3 yield off-target probabilities below 1 in 10 million for all possible sequences on the B. subtilis genome. C) crRNA 2 yields off-target probabilities below 1 in a million for all possible sequences on the DH5α genome. D) crRNA 4 yields 10 on-target hits (blue) on the B. subtilis genome, while off-target probabilities (orange) remain below 1 in a million for all possible sequences on the B. subtilis genome.

We establish that the two designed crRNA sequences target their designated bacterial species specifically. crRNA 1-3 contains 7 on-target hits on the DH5a genome, while crRNA 4 contains 10 on-target hits on the B. subtilis genome. From this it can be concluded conclude that the designed crRNA sequences allow differentiation between the (Gram-negative) E. Coli and (Gram-positive) B. subtilis species. Arguably, designing one crRNA that targets both strains could be beneficial in distinguishing between for example a viral and bacterial infection by requiring less experiments, however a Gram-positive or negative distinction could is already very helpful.

We also compute $P_{FP}$ for the four simulations (Figure 11). For all simulations, the probability of acquiring a false-positive result was found to be low: in the order of one in a million times. Moreover, having multiple on-target sites on the genome of the bacterial species to be detected decreases $P_{FP}$ on the targeted genome. The $P_{FP}$ on the non-targeted genome, however is unaffected. This implies that having more on-target sites intuitively also increases the sensitivity of the assay. However, the off-target probability on non-target species remains unaffected by this.

Figure 11: False-positive probability of crRNAs on different bacterial targets.E. Coli DH5α and B. Subtilis. The probability of false-positive output for crRNA1-3 on E. Coli DH5α and crRNA 4 on B. Subtilis. (off-target genomes) is significantly lower than for the targeted genome, because there are multiple on-target sites.

We conclude that the design of the crRNAs is reliable as only minute off-target effects in the order of one in a million are observed. In addition, we gain a deeper insight into the effect of having multiple on-targets in the genome: this increases the sensitivity of the detection method, but does not affect the reliability. As such, our tool assesses the design of the crRNAs by iGEM LMU-TUM Munich to be clever and to allow reliable detection. We thank iGEM LMU-TUM Munich for the fun collaboration and look forward to meeting them again at the Giant Jamboree.

Conclusions

We employed a recently proposed kinetic model to assess the reliability of our crRNAs in the Cas13a-mediated detection system. In general, we find the predicted false positive probability (given a positive output) to be low (as low as 1 in billion). Moreover, the simulation unveiled a design flaw in crRNA 5 that prevented us from wasting time on testing it in the lab. Subsequently, we simulated the off-target activity of all our crRNAs on the genomes of relevant organisms: E. coli BL21, Staphylococcus aureus, Staphylococcus epidermis and chromosome 27 of the domestic cow (Bos taurus). As expected, the false positive probability increases for the presence of larger genome sequences being present. We repeated the simulations for parameter that correspond to a higher mismatch tolerance and observed that this resulted in false positive probabilities exceeding 90%. This situation would greatly compromise the reliability of the detection system, so we set out to investigate possible solutions. In order to do so, we re-evaluate the model’s assumptions and find that if we incorporate our sample preparation protocol, amplifying and in vitro transcribing the gene, the probability of false positives is reduced accordingly. In addition, we propose a method that allows calculation of the required amplification factor α for achieving a desired reliability in the output.