MODEL
iGEM Modeling - Protospacer
iGEM Groningen 2017
Bacteria with a functioning CRISPR system defend themselves against infection by targeting specific DNA sequences from invading bacteriophages. To do this, the bacteria must previously have added those sequences to their library of targetable sequences. All targetable sequences are stored in a region of the bacterial genome called the spacer array. Each targetable sequence, commonly known as a spacer, is separated from its neighbors by copies of the same repeat sequence. The process of acquiring a new spacer into the array is called CRISPR adaptation and is known to be mediated by the Cas1 and Cas2 proteins (1). Importantly, some spacers are more likely than others to be “adapted” into the spacer array. The causes of this bias are thought to lie in the sequence context of the spacer in the original invading bacteriophage and in the type of CRISPR system.
In our project, we want to detect the presence of a specific bacteriophage by using a bacteriophage-specific spacer that guides the activity of the CRISPR system. It would be useful to predict which sequences in the bacteriophage are most likely to be effective guides and are also most specific to the bacteriophage we wish to detect. This would allow us to pre-encode these spacers into our construct and detect the presence of the bacteriophage more effectively.
We wanted to code a pipeline that takes the genome of a bacteriophage of interest and outputs the sequences that are most likely to work for our specific CRISPR system. After reviewing the current literature, we found that the effectiveness of a spacer depends on:
- Adaptability: The likelihood that the spacer is adapted into the array.
- Activity: The effectiveness of the guide encoded by that spacer in cleaving the target bacteriophage sequence.
- Off-target potential: The risk that the spacer detrimentally directs cleavage of a similar sequence in the genome of the bacteria instead of the bacteriophage.
Adaptability
Adaptability is one of the most influential properties of a spacer sequence. Each CRISPR system preferentially adapts spacers that lie next to a particular “motif”. This protospacer adjacent motif (PAM) is usually located at the boundary of the spacer and, for our S. pyogenes based system, corresponds to 5’-spacer-NGG-3’ (2). This means that spacer-sized sequences in the bacteriophage that end in NGG are much more likely to be taken up as spacers by the Cas1-Cas2 complex and then integrated into the bacterial CRISPR array. One report in E. coli also showed that the main source of spacer material for adaptation is the excision products of a nuclease (RecBCD) that participates in DNA repair (3). When a region of the DNA is being transcribed or replicated, it is more likely to be damaged because these processes require the temporary unwinding of the double helix into two single strands. Sequences that are more often transcribed or replicated are therefore more exposed to DNA damage and more likely to be taken up as spacers. The same report also demonstrated that CHI sequences stop the activity of the nuclease. This means that sequences upstream of a CHI sequence are slightly less likely to be taken up as spacers because they are protected from the activity of the nuclease. Because the CRISPR systems of E. coli and S. pyogenes are different, we are not sure whether these findings are relevant for our predictions. Intriguingly, S. pyogenes has an analogous nuclease and an analogous CHI sequence that inhibits it, which suggests that this CHI sequence may also influence spacer acquisition in S. pyogenes (4). If this is the case, then encoding the corresponding sequence at different locations in our constructs would protect them from incorrectly being used as spacer material.
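The PAM rule can be illustrated with a short Python sketch. This is not the code of our pipeline: the function names and the `spacer_len` parameter are hypothetical, and the default length of 20 is only a placeholder. The sketch scans one strand for every window that is immediately followed by NGG; the opposite strand can be scanned by passing the reverse complement.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_ngg_candidates(sequence, spacer_len=20):
    """Return (start, spacer, pam) for each spacer_len window followed by NGG.

    spacer_len is an assumed parameter, not a value taken from the text.
    A lookahead is used so that overlapping candidates are all reported.
    """
    sequence = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    return [(m.start(), m.group(1), m.group(2))
            for m in pattern.finditer(sequence)]


toy = "ACGTAGGTT"
print(find_ngg_candidates(toy, spacer_len=4))  # [(0, 'ACGT', 'AGG')]
print(find_ngg_candidates(reverse_complement(toy), spacer_len=4))  # []
```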
Activity
By design, the sequence of the guide is critical in directing the activity of the CRISPR system, but the factors that influence this activity are still not fully characterized. In theory, spacers that attack more critical parts of the bacteriophage genome would be even more effective (and therefore more likely to make the bacteria survive). However, large studies on adaptation do not support this assumption (5). Their findings do indicate that the spacer sequence influences the activity of the guide. For example, some spacers with a correct PAM show almost no activity. This shows that there is still much to learn about the mechanisms that influence activity. We decided to account for this bias by collecting information on the activity levels of different spacer sequences from the literature. Below we explain how we integrated this information into our pipeline.
Off-target potential
In theory, spacers that erroneously cleave the bacterial genome are less likely to be effective spacers because they reduce the fitness of the bacteria. We want to prevent these off-target effects as much as possible. From the literature, we learned that in our CRISPR system the last 13 nucleotides of the spacer are the most important for off-target activity, with the exception of the position that corresponds to the N of the NGG PAM. This means, for example, that a spacer ending in AGG will have the same activity on sequences ending in GGG, CGG, or TGG, all else being equal.
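A minimal sketch of this comparison rule is given below. It assumes that each sequence is written with its NGG PAM at the 3’ end, so that the N is the third nucleotide from the end; the function name and its parameters are hypothetical and only serve the illustration.

```python
def seed_mismatches(a, b, seed_len=13, n_from_end=3):
    """Count mismatches in the last seed_len nucleotides, ignoring the PAM N.

    Assumes both sequences end in the NGG PAM, so the N position sits
    n_from_end nucleotides from the 3' end and is skipped.
    """
    seed_a, seed_b = a[-seed_len:], b[-seed_len:]
    if len(seed_a) != seed_len or len(seed_b) != seed_len:
        raise ValueError("both sequences must be at least seed_len long")
    skip = seed_len - n_from_end
    return sum(
        1
        for i, (x, y) in enumerate(zip(seed_a, seed_b))
        if i != skip and x != y
    )


# Two sequences that differ only at the N of the PAM count as identical.
print(seed_mismatches("TTTTACGTACGTACAGG", "TTTTACGTACGTACGGG"))  # 0
# A difference elsewhere in the last 13 nucleotides is counted.
print(seed_mismatches("TTTTACGTACGTACAGG", "TTTTACGTACGTTCAGG"))  # 1
```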
Modeling
With this information, we coded a pipeline that, taking a bacteriophage genome as input, selects the spacers that are more likely to be effective for that bacteriophage. The pipeline follows these steps:
- Extract all possible NGG-ending spacers from the target bacteriophage genome.
- Penalize spacers whose last 13 nucleotides align to a region in the bacterial genome with fewer than 5 mismatches.
- Using the collected data on spacer activity in our CRISPR system, prioritize spacers that share a similar* 13-nucleotide ending with a database spacer whose activity measure is in the top 25%.
- Using the same data, penalize spacers that share a similar* 13-nucleotide ending with a database spacer whose activity measure is in the bottom 25%.
*Similar: fewer than 3 mismatches.
The only large-scale study of spacer adaptation performed on our CRISPR system was used to improve the prediction of which spacers may be more active (6). It is important to mention that more data are needed to fully understand the relationship between a spacer sequence and its activity.
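The scoring steps listed above can be sketched as follows. This is an illustration under stated assumptions, not our actual implementation: the function names, the +1/-1 scoring, the dictionary format of the activity database, and the rank-based definition of the top and bottom quarter are all hypothetical. The N position of the PAM is not masked here for brevity, and the naive sliding-window search would be slow on a full bacterial genome.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_off_target(seed, host_genome, max_mismatches=4):
    """True if the seed aligns anywhere in the host with fewer than 5 mismatches."""
    k = len(seed)
    return any(
        mismatches(seed, host_genome[i:i + k]) <= max_mismatches
        for i in range(len(host_genome) - k + 1)
    )


def score_candidates(candidates, host_genome, activity_db, seed_len=13):
    """Score each candidate: -1 for an off-target hit, +1/-1 per similar
    database spacer in the top/bottom quarter of measured activity."""
    ranked = sorted(activity_db, key=activity_db.get)
    quarter = max(1, len(ranked) // 4)
    low, high = set(ranked[:quarter]), set(ranked[-quarter:])
    scores = {}
    for cand in candidates:
        seed = cand[-seed_len:]
        score = -1 if has_off_target(seed, host_genome) else 0
        for db_spacer in activity_db:
            if mismatches(seed, db_spacer[-seed_len:]) < 3:  # "similar"
                if db_spacer in high:
                    score += 1
                elif db_spacer in low:
                    score -= 1
        scores[cand] = score
    return scores
```

Keeping the off-target evidence and the activity evidence as separate columns, rather than one summed score as in this sketch, would make it easier for a user to see why a spacer was ranked where it is.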
Model results
For our project, we used the protospacer predictor algorithm to select which spacers to encode into our array. As input we used the genome of L. lactis (AM406671) and the genomic sequence of virus SK1 (NC_001835.1). We obtained a list of spacers, which we ranked using a metric that combines the activity score and the off-target effects of each spacer using Z-scores (see code). However, this metric turned out not to be useful for ranking our spacers, because some spacers had unexpectedly high activity scores. This gave some spacers a high unified score despite a high probability of off-target effects. Instead, we selected the following spacers:
AGTAGACAACGCAGGANGG
GGCGGAAGCAATACTCNGG
These spacers are 21st and 22nd on the ranked list and are located in SK1 at positions 15067 and 27694, respectively. They have moderate activity scores and a low chance of off-target effects. We expect that with more activity measurements similar to those kindly provided by Heler et al. (6) (PMC4385744), the activity score will become more informative and the behavior of the unified score less erratic. Another solution is to preprocess the activity score and off-target effect data to make them more comparable. Nevertheless, the current output table informs the final user about the activity evidence for each spacer and the possibility of off-target effects, which can assist a decision.
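To make the idea of a Z-score based unified metric concrete, here is a minimal sketch. The exact formula used in our code is not reproduced here; subtracting the off-target Z-score from the activity Z-score is an assumption made for illustration, as are the function names and the toy numbers.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def unified_scores(activity, off_target):
    """Assumed combination: high activity is rewarded, off-target risk is penalized."""
    return [a - o for a, o in zip(z_scores(activity), z_scores(off_target))]


# Toy values only: one spacer with an extreme activity measurement.
activity = [1.0, 1.2, 0.9, 9.0]
off_target = [0, 1, 0, 3]
for i, score in enumerate(unified_scores(activity, off_target)):
    print(i, round(score, 3))
```

Because a Z-score is sensitive to outliers, a single extreme activity measurement stretches the activity axis and can offset a large off-target penalty, which is one plausible reason for the erratic behavior described above.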
References
1. Amitai, G. & Sorek, R. CRISPR-Cas adaptation: insights into the mechanism of action. Nat. Rev. Microbiol. 14, 67–76 (2016).
2. Leenay, R. T. et al. Identifying and Visualizing Functional PAM Diversity across CRISPR-Cas Systems. Mol. Cell 62, 137–147 (2016).
3. Levy, A. et al. CRISPR adaptation biases explain preference for acquisition of foreign DNA. Nature 520, 505–510 (2015).
4. El Karoui, M. et al. Orientation specificity of the Lactococcus lactis Chi site. Genes Cells 5, 453–461 (2000).
5. Savitskaya, E., Semenova, E., Dedkov, V., Metlitskaya, A. & Severinov, K. High-throughput analysis of type I-E CRISPR/Cas spacer acquisition in E. coli. RNA Biol. 10, 716–725 (2013).
6. Heler, R. et al. Cas9 specifies functional viral targets during CRISPR-Cas adaptation. Nature 519, 199–202 (2015).
7. Wegmann, U. et al. Complete Genome Sequence of the Prototype Lactic Acid Bacterium Lactococcus lactis subsp. cremoris MG1363. J. Bacteriol. 189, 3256–3270 (2007).
8. Chandry, P. S. et al. Analysis of the DNA sequence, gene expression, origin of replication and modular structure of the Lactococcus lactis lytic bacteriophage sk1. Mol. Microbiol. 26, 49–64 (1997).
Genetic Algorithm
How do genetic algorithms work
In principle, genetic algorithms imitate the process of evolution. They are most commonly used to solve search and optimization problems. This is done by invoking operations that also take place during evolution in nature: selection, crossover and mutation.
- Selection: The selection process is tightly coupled with the fitness calculation. During the selection step we "discard" the agents with the worst fitness. This elitist form of selection allows only the strongest agents of each generation to pass their genes on to the next generation.
- Crossover: Crossover is analogous to reproduction. Genes from two parents that passed the selection step are passed on to a new agent. This process is repeated at random until the population has reached its original size again.
- Mutation: Mutation helps the search avoid getting stuck in local optima. In other words, its purpose is to preserve and introduce diversity in the population.
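The three operators can be sketched generically in Python. This is a textbook-style illustration rather than our project code: the function names, single-point crossover, the per-gene mutation rate and the representation of an agent as a list of genes are all assumptions.

```python
import random


def select(population, fitness, keep_fraction=0.5):
    """Keep the fittest fraction of the population (at least two agents)."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_keep = max(2, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(genes, alphabet, rate, rng):
    """Replace each gene with a random symbol with probability rate."""
    return [rng.choice(alphabet) if rng.random() < rate else g for g in genes]


def next_generation(population, fitness, alphabet, rate, rng):
    """One generation: select, then refill the population with mutated children."""
    survivors = select(population, fitness)
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        children.append(mutate(crossover(a, b, rng), alphabet, rate, rng))
    return survivors + children


# Toy problem: maximize the number of 1s in a bit string.
rng = random.Random(42)
population = [[rng.randint(0, 1) for _ in range(10)] for _ in range(20)]
for _ in range(30):
    population = next_generation(population, sum, [0, 1], 0.05, rng)
print(max(sum(agent) for agent in population))
```

Because the survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.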
How did we use it
- Step 1 - Initialization: We used the genetic algorithm to find the best combinations of spacers that would allow our agents to detect viruses. After creating the population of agents, we assigned each agent two pairs of spacers from two real viruses and one pair from a randomly generated (and thus fake) virus, for a total of 6 spacers. The fake virus is introduced for validation purposes.
- Step 2 - Fitness: For every spacer of the agent, we calculate the similarity between that spacer and all spacers of all viruses under consideration, using the Hamming distance as the metric, and keep the smallest value. The total fitness of the agent is the sum of these values over all spacers it currently holds.
- Step 3 - Selection: The half of the agents with the lowest fitness is discarded.
- Step 4 - Crossover: The remaining agents proceed to the crossover step. The agents reproduce, and each new agent carries a different combination of the spacers from the previous generation.
- Step 5 - Repeat: Once the new agents have been created, they are added to the population and the procedure starts again from step 2, for the number of generations defined in the initialization step.
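The steps above can be sketched as follows. This is one possible reading, not our actual code. The sketch assumes that the smallest Hamming distance identifies the closest viral spacer, and that this distance is converted into a similarity (spacer length minus distance) so that a higher fitness is better. All names, and the choice to build a child by picking each spacer slot from one of two parents, are hypothetical.

```python
import random


def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def fitness(agent, viral_spacers):
    """Sum, over the agent's spacers, of the similarity to the closest viral spacer."""
    total = 0
    for spacer in agent:
        closest = min(hamming(spacer, v) for v in viral_spacers)
        total += len(spacer) - closest  # assumed distance-to-similarity conversion
    return total


def evolve(population, viral_spacers, generations, rng):
    """Steps 2 to 5: score, discard the weaker half, refill by crossover, repeat."""
    size = len(population)  # needs at least 4 agents so that two parents survive
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda a: fitness(a, viral_spacers), reverse=True)
        survivors = ranked[:size // 2]
        children = []
        while len(survivors) + len(children) < size:
            p1, p2 = rng.sample(survivors, 2)
            children.append([rng.choice(pair) for pair in zip(p1, p2)])
        population = survivors + children
    return max(population, key=lambda a: fitness(a, viral_spacers))


# Toy data: 4-nucleotide "spacers" keep the example readable.
rng = random.Random(0)
viral = ["ACGT", "TTTT"]
pool = ["ACGT", "TTTT", "GGGG", "CCCC", "AAAA", "ACGA"]
population = [[rng.choice(pool) for _ in range(6)] for _ in range(8)]
best = evolve(population, viral, generations=10, rng=rng)
print(best, fitness(best, viral))
```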