# Team:TUDelft/Modeling

Case13a
##### Modeling:

The aim of our project was to develop a safe and reliable detection method for antibiotic resistance (ABR) genes. We modeled the various components to predict and complement the work in the lab.

To achieve high specificity in our detection, we use a protein (Cas13a) loaded with a guide RNA (crRNA) for target finding. To find the most suitable guides, we developed a motif finder to identify conserved regions in a set of antibiotic resistant genes. We used sequences we found, to design the crRNAs for the lab. Furthermore, we checked the accuracy of Cas13a of our crRNAs, using an off-target simulation. We found a high accuracy with respect to off-targets.

In our on-site ABR gene detection tool we use coacervation, a phase transition which is visible as a transition from clear to opaque, as a readout method. We optimized the technique and explained the biophysical principles behind the phase transition with analytical modelling.

For our tool to be storable for longer periods of time, we used Tardigrade intrinsically Disordered Proteins (TDPs) to protect the Cas13a proteins. For this part of our project we employed a lattice model to predict the behavior during drying of TDPs together with Cas13a. We found that the two TDPs best suited to preserve Cas13a activity were CAHS94205 and SAHS33020, at concentrations of 1 g/L or higher.

In our envisioned final product, bacteria are used as cellular-factories by letting them produce and excrete all necessary proteins in small vesicles. To understand the dynamics of the transport of proteins (TorA-GFP) from the cytoplasm into vesicles, we used kinetic modelling. We found that the optimal time to harvest vesicles is 25 minutes after induction in order to obtain a maximum concentration of GFP in the vesicles.

## Cas13a - Accurate detection of specific genes

We used a combined bioinformatics and biophysics approach to determine the optimal target sequence of our antibiotic resistance genes. To this end, we built our own antibiotic resistance genes database, employed a hidden Markov model to find conserved regions and used a kinetic model to describe off-target effects. Based on our modeling work, we detected an antibiotic resistance gene in the lab without knowing which variant of the gene was present in our sample. In addition, we ranked the most advantageous target sequences by the reliability of the output signal they would generate. In order to understand how we can use Cas13a to detect antibiotic resistance genes, we need to get some background on Cas13a and antibiotic resistance genes first. We will give a short introduction below of antibiotic resistance genes and Cas13a.

Antibiotic resistance occurs when bacteria develop mechanisms to reduce the effectiveness of antibiotics. Certain genes can be related to specific types of antibiotic resistance, while others can also cause resistance against multiple drugs. At the time of writing, the curated antibiotic resistance database we used contains 6472 gene entries divided among 28 categories (Jia et al. 2017). Simultaneous identification of the presence of different families of antibiotic resistance genes in a sample requires the system to be specific for each gene family, while also being versatile, i.e. easily adaptable to different targets.

The recently characterized protein Cas13a, an "RNA-guided, RNA-targeting CRISPR effector" (Gootenberg et al. 2017), meets both these requirements. Cas13a acts like a detective that can signal the presence of specific RNA sequences. Its target description is provided by a specific single guide RNA sequence, called crRNA, that is positioned (loaded) within the protein. The ease of loading a desired crRNA into the Cas13a allows versatile detection between different experiments by incorporating different crRNA. Cas13a is activated when the loaded crRNA pairs with a complementary (target) sequence. The crRNA is therefore essential in directing Cas13a towards a predefined target. Upon target binding, Cas13a undergoes a conformational change and engages in 'collateral cleavage': it starts to aspecifically cleave any RNA it encounters, as represented schematically in Figure 1. We exploit this property to obtain an amplified signal from a small target recognition event. In their original biological role in the bacterial adaptive immune system, proteins from the CRISPR-Cas family are under selective pressure to also attack slightly mutated target sequences from evolved invading viruses (Datsenko et al. 2012; Fineran et al. 2014; Kuenne et al. 2016). While offering possibilities to target related variants of antibiotic resistance genes, this off-target activity could also compromise the reliability of a Cas13a-based detection system.

Figure 1: Schematical representation of collateral cleavage by Cas13a.

Below you can find several models we made to determine which crRNA sequences are optimal to guide Cas13a towards the desired antibiotic resistance gene target. The first aim of this modeling work was to get a better idea of the structural features of Cas13a and its domains (Q1) and its structural difference with and without crRNA bound (Q2). Subsequently, we focused to find which DNA sequences are conserved between variants of the same antibiotic resistance gene (Q3). From these conserved regions, we determined interesting targets for Cas13a and designed crRNA sequences. Following this, we computed the tolerance of Cas13a for mismatches (Q4). Based on this information, we extended the model to score potential crRNA sequences based upon their likelihood to yield false-positive results (Q5). Finally, we propose how to experimentally reduce the off-target probability (Q6).

• Q1 What does the structure of Cas13a and its domains look like?

There are several variants of Cas13a, derived from different organisms. In our project, we utilize Leptotrichia wadei Cas13a (LwCas13a). The structure of this protein has been determined with and without crRNA bound (Liu et al. 2017a) and with crRNA and target bound (Liu et al. 2017b). The Cas13a structure contains two main lobes: a crRNA recognition (REC) lobe and a nuclease (NUC) lobe. The two lobes are positively charged towards the protein core to accommodate the negatively charged RNA. Each of these two lobes contains an RNase domain. At the REC lobe this domain is responsible for the processing of pre-crRNA, while at the NUC lobe the domain is responsible for target cleavage. In contrast to other Cas proteins, two RNase domains of the NUC lobe are located at the outside of the protein, which is likely the reason why Cas13a can engage in (aspecific) collateral cleavage upon activation by binding to a matching target. These two domains have been labeled as red spots in Figure 2 and can be found at the interface between the green and pink domain.

Figure 2: Rotating surface image of Cas13a with crRNA bound. We annotated the structure by colour according to the proposed domain organization (Liu et al. 2017a). We observe the N-terminal domain (NTD) in cyan, making up the REC lobe together with the light-pink Helical-1 domain. The remaining domains together make up the NUC lobe. In green, we see the HEPN domain. In yellow, we see the Helical-2 domain. In orange we see the linker domain and in purple we observe the HEPN2 domain. For a more thorough explanation of the function of the domains, please refer to (Liu et al. 2017a; Liu et al. 2017b).

Because getting a good understanding of a 3D protein structure from 2D images is difficult, we provide an annotated structure file to visualize the 3D structure using the PyMOL software (free for educational purposes) .

• Q2 What is Cas13a's structural difference with and without crRNA bound?

The binding of the crRNA induces significant conformational changes in Cas13a. Figure 3 shows a superposition of both Cas13a without the crRNA (apo Cas13a) in green and crRNA-bound Cas13a (Liu et al. 2017a). From the alignment between the two conformations, we see that binding crRNA results in folding of the domains in the NUC lobe, which likely plays an important role in the stabilization of the crRNA-Cas13a complex.

Figure 3: Superposition of Cas13a with and without crRNA bound. Superposition of apo Cas13a in green and crRNA-bound Cas13a in blue (with the crRNA represented in red). We observe that the binding of the crRNA influences the folding of the domains in the NUC lobe, likely playing an important role in the stabilization of the crRNA-Cas13a complex.

Because getting a good understanding of a 3D protein structure from 2D images is difficult, we provide an annotated structure file to visualize the 3D structure using the PyMOL software (free for educational purposes) .

• Q3 What are the conserved regions in antibiotic resistance genes?

The Cas13a in our detection tool needs a guide crRNA to identify its target. To design this crRNA we searched for conserved regions in antibiotic resistant genes (ABR genes). We started by building our own database of ABR genes, so that we could directly work with it without the interface of the internet. In Figure 4, we show some statistics of the gene data. The largest group of antibiotics in the database are the beta-lactam antibiotics, which for example includes penicillin.

Figure 4: Stastitics of our database. The number of genes with known or unknown sequence or subgroup present in our database per antibiotic.

The next step was to find small (24-30 basepairs) conserved regions (motifs) in a given set of genes. To do so we used the MEME (Multiple EM for Motif Elicitation) program (Bailey et al. 2009), which we iterated multiple times to improve the finds. In addition to being part of conserved regions, we have three requirements for our motifs to be used as a crRNA as found by the work we did on CRISPR Cas13a. We automated the search for conserved motifs and the checks on whether they met all three requirements. Details on the scripts we wrote for this search process can be found on the software page.

There are two different ways in which we can search for conserved regions. We could consider a group of genes that confer resistance to a certain antibiotic, but we could also search between variants of a single gene. The conserved regions you find in a group are usually also the conserved regions of the variants of a single gene. However, when you focus on the variants of a single gene you will often find much larger conserved regions. We started with comparing groups of resistant genes of certain antibiotics. We found that multiple motifs were required. Even with multiple motifs, we still were unable to cover all the genes. Coverage of half of the genes was the maximum we were able to reach. In Figure 5, an example of fluoroquinolone antibiotic is shown.

Figure 5: Fluoroquinolone. This graph shows the coveraged of motifs that confer antibiotic resistance against Fluoroquinolone. (Percentage 100x)

Furthermore, we looked at variances between certain resistance genes. For the case of Mastitis, it is known that mecA and blaZ. are important genes to detect. Therefore, we searched for motifs in these genes. For mecA, we found that we need more than one motif to cover all the variances in the gene. However, for blaZ we identified one motif which is sufficient on its own. We tested this motif in the lab on an unknown variant of the blaZ gene and proved that it indeed works.

• Q4 What is the tolerance of Cas13a for mismatches?

We employed a kinetic model (Klein et al. 2017) to determine the cleavage probability of Cas13a for a target sequence that is not a full match to the guide. The kinetic model was designed for proteins similar to Cas9, and required adjustment as Cas13a has interesting properties that the model was not developed for. Cas13a has a seed region central to the spacer and a protospacer flanking site (PFS) - a nucleotide 3' to the target that should not be a guanine (Abudayyeh et al. 2016). The model assumes binding of the target in the centre of the seed region that ranges from nucleotide 9 to 15 (Abudayyeh et al. 2016). We impose that subsequent binding events occur sequentially in both directions. With these adjustments, we can follow the recipe given in (Klein et al. 2017) and calculate the probability to cleave ($P_{clv}$) a target with a given guide once it is bound. In order to compute this probability, we assume that the transition from one state to the next (where another nucleotide of the guide pairs with the target) is favourable for correct Watson-Crick base pairing (with an energy gain of $\Delta C$) and unfavourable for mismatches (with an energy penalty of $\Delta I$).

Although data to fit the model are scarce, we utilized available mismatch tolerance data (Abudayyeh et al. 2016, Figures S19-S21) to qualitatively find suitable model parameters $\Delta C$ and $\Delta I$. In general agreement with the available data, our simulated Cas13a has a seed region from nucleotides 9 to 12, where a single mismatch is sufficient to prevent cleavage (and thus activation) (Figure 6, red graph). Furthermore, two consecutive mismatches are only tolerated at the beginning and end of the target sequences (Figure 6, blue graph) while three consecutive mismatches are only tolerated at the target's 5' end (Figure 6, orange graph).

Figure 6: Cas13a tolerance for mismatches. Cas13a tolerance for mismatches ($\Delta C=0.25$, $\Delta I=2.5$). The cleavage probability relative to the on-target is plotted for different types of mismatches. Single mismatches (blue line) are not tolerated well in the central seed region. Two consecutive mismatches (orange line) can be tolerated in the beginning and end of the target. Three consecutive mismatches (green line) are only allowed at the target's 5' end.

• Q5 What is the off-target activity of our crRNAs?

Rewriting the implementation of the model into a command line utility allowed us to use the TU Delft computer clusters for simulating the off-target effects of our crRNAs on the genomes of different relevant organisms: E. coli BL21 (4.5 mln basepairs (bp)), Staphylococcus aureus (2.8 mln bp), Staphylococcus epidermis (2.5 mln bp) and chromosome 27 of the domestic cow (Bos taurus, 45 mln bp). E. coli BL21 is a commonly used lab strain, which was the strain containing the tetracyclin resistance-conferring (TcR) plasmid that we designed our crRNAs for. S. aureus and S. epidermis are common pathogens in mastitis, while genetic material from the cow itself may also be present when detecting sequences in cow samples, for example milk. Scanning the entire genome allows investigating the question if tiny local off-target effects together influence the reliability of the output signal.

In Figure 7, we present the off-target cleavage probability (relative to the on-target) for our crRNA sequences for the different genomes/chromosome. In general, we find off-target cleavage probabilities below 1 in 100.000 for the bacterial genomes. As expected, the $P_{FP}$ is generally larger for the Bos taurus chromosome 27, which is approximately tenfold larger than the bacterial genomes, but reaches a maximum of 1 in 1000. As such, we conclude that the probability of a false output is low and likely does not compromise the usability of the Cas13a-mediated detection system.

Figure 7: Off-target cleavage probability (relative to on-target) for crRNA 1 to 4 on the genomes of four organisms. $P_{FP}$ for different crRNAs and genomes of four organisms ($\Delta C=0.25$, $\Delta I=2.5$). There seems to be no relation between genome size and per base false positive output probability, as the Bos taurus chr. 27 scores for the different crRNAs are not systematically highest. Interestingly, the off-target activity per base pair tends to be highest for E. coli, the organism containing the plasmid for which the crRNAs were designed. Despite the size of the genomes, ranging from 2.5 to 45 million basepairs, even the cumulative off-target effects do not result in significant false positives output.

• Q6 What can we do to reduce off-target effects?

The simulations indicated the detection system to be reliable, i.e. have a low probability of yielding a false-positive output. However, insufficient data are available to accurately determine the model fit parameters. In addition, experimental validation is lacking. As such, it is possible that experimental work proves Cas13a to have increased off-target activity. Therefore, we propose methods to experimentally reduce the off-target probability if it appears to compromise the reliability of the detection system.

In our simulations, we assume equal representation of all genomic sequences. However, our simple sample preparation method for gene amplification (RPA) and in vitro transcription are able to specifically amplify a region of interest at least 452 times (notebook of sample prep 15-08-2017). This amplification changes the ratio between true and false positive output probabilities (by the same factor of 452) in favour of on-targeting. We can incorporate this amplification factor (we will call it $\alpha$) in a generalized quantity $P_{FP}'$: the probability of a false positive output, given that there is a positive output.

$$P_{FP}' = \frac{1}{1+\frac{\alpha \sum_{on}\mathcal{R}}{\sum_{off}\mathcal{R}}} \tag{1}$$

$\sum_{on}\mathcal{R}$ represents the amount of on-target sites, while $\sum_{off}\mathcal{R}$ is the sum of the off-target cleavage probability (relative the on-target) over all non-target sites. As can be seen from the formula, an increased amplification results in a lower false-positive probability

Using $P_{FP}'$, we can compute the increase of the reliability of the system as a function $\alpha$ for different rations of $\frac{\sum_{on}\mathcal{R}}{\sum_{off}\mathcal{R}}$. Subsequently we can read from the figure which amplification factor is required to achieve an input that is for example 95% reliable (i.e. the probability of a true positive given a positive output), by looking at the intersection between the graph and the $\beta =95%$ line in Figure 8.

Figure 8: Reliability of the detection system for different on-target amplification factors. Reliability of the output of the device as a function of $\alpha$ for different ratios $\left(\frac{\sum_{on}\mathcal{R}}{\sum_{off}\mathcal{R}} \right)$. The intersection between the different graphs and $\beta$, which we set to 95% as per example, allows reading off the required amplification factor required to achieve the desired reliability.

## CINDY Seq - Generating a visible readout

We modeled the process of coacervation to investigate whether it can be used to achieve a readout visible to the naked eye in combination with Cas13a. When two or more attracting polymer species are mixed in solution and the right conditions are provided, phase separation into a polymer-rich and polymer-poor region occurs, see schematic representation in Figure 9. The polymer-rich region in such a solution is coined the 'coacervate' phase. This phase-separation transforms the solution from transparent to turbid: a change visible to the naked eye. We applied the theory of polymer solutions using lattice modeling and the Voorn-Overbeek model to describe the interactions between polymers. We will first give an introduction to the coacervation process and how to employ it to achieve a visible readout following Cas13a detecting an antibiotic resistance gene. Afterwards, we dive into the modeling of coacervation whereby we guided our lab team towards successful application of coacervation and to maximize the change of turbidity in solution.

Figure 9: Schematical representation of coacervation. Coacervation is the phase separation of mutually attracting polymers. In this scheme and throughout our project, the attraction between the polymers is due to electrostatic interactions.

Although polymers can attract each other by various physical forces, we will focus on combinations of positively and negatively charged polymers (polycations and polyanions respectively), which attract each other by electrostatic interactions. It has been observed that only long polymers phase separate to form coacervates (Qin et al. 2014; Qin & De Pablo 2016). RNA is a negatively charged polymer, and multiple efforts have shown that longer RNA polymers can be combined with polycations to form coacervates (Aumiller et al. 2016). These two facts taken together form the basis of the coacervate based detection: if long strands of RNA are present and the right conditions are provided, the RNA forms a coacervate with polycations, whereas if the RNA is chopped up into smaller fragments, coacervation no longer occurs. As is described in more detail on the Cas13a page, Cas13a is a CRISPR-Cas protein that carries an RNA guide and undergoes a conformational change when it encounters its target (an RNA sequence that is complementary to this guide). After such a conformational change, the protein engages in collateral cleavage, meaning that it starts acting like an unspecific RNase by cleaving all RNA it encounters from then on. Target recognition of Cas13a and subsequent collateral cleavage of RNA polymers into smaller fragments can thus in principle be used to turn a solution containing coacervates into a solution that no longer contains coacervates.

In order to know if Cas13a collateral cleavage of RNA polymers can affect the formation of coacervates, we first wanted to understand why only polymers form coacervate (Q7). Furthermore, we investigated if using a higher concentration of coacervate forming polymers would increase the turbidity of the solution (Q8).

• Q7 Why do only long polymers coacervate?

The fact that only long polymers form coacervates can be intuitively explained by the fact that longer polyelectrolytes can be attached to parts of multiple other oppositely charged polyelectrolytes, and therefore aggregate more readily. We used a polymer model from (Qin et al. 2014), to model polymer solutions from a thermodynamic perspective. The model is an expression for the free energy of the solution, and thermodynamic analysis is used to determine the conditions under which it forms coacervates. This analysis yields a phase diagram as shown in Figure 10, which divides parameter space into distinct regions where coacervation does and does not occur. The line between the two regions in parameter space is called the phase boundary. Finding this curve is not trivial, as the defining equation cannot be solved analytically. However, close to the critical point (at the minimum of the curve) this curve is approximately equal to the spinodal, which can be found easily, and is the one we plot here Figure 10.

Figure 10: Degradation of polymers shifts the phase boundary. Plots of the phase diagram of polymers with effective chain lengths of (A) 400 units and (B) 100 units. The parameters $\phi$ and $\sigma$ denote the volume fraction of the solution occupied by polymers and their charge density, respectively. In both plots, the phase space is split into a region where coacervation does (upper region) and one in which it does not (lower region) happen. By comparing the two graphs at point P in parameter space, we find that we get coacervates if the polymers are of length 400, but not anymore if they are degraded to a length of only 100. In our complete model, we show that irrespective of the actual chain length, the spinodal curve always moves into the direction of positive $\sigma$ when the polymers decrease in length.

From analyzing these phase boundaries, we find that the coacervate region is larger when the polymers are longer. As polymers are degraded into smaller pieces, the coacervate region shifts and a solution initially within the coacervate region shifts to the mixed region.

• Q8 Can the turbidity of coacervate solutions be increased by having higher concentration of polymers?

The surprising and counterintuitive answer to this question is no. The thermodynamic model for coacervation in polymer solutions discussed in Q11 results in spinodal curves that divide parameter space into two regions. Depending on the parameters of the system, either coacervation or homogenous mixing will occur. However, when we again consider a coacervate solution at point P and start increasing the volume fraction of polymers (to point Q in Figure 11), the system moves out of the coacervate region and becomes homogeneously mixed. From this we can conclude that increasing the concentration (or more precisely the volume fraction) of the coacervating polymers does not by definition increase the turbidity caused by the coacervates, because after a certain concentration, addition of polymers moves the system out of the coacervate region. Besides the fact that the model predicts this phenomenon, we have also shown this experimentally in our project.

Figure 11: Increasing polymer concentration does not increase the turbidity caused by coacervates. In the above graph the spinodal curve is shown for a symmetric solution of positively and negatively charged polymers. To answer the question posed above, imagine the system to be at point P in parameter space, and now imagine that the polymer volume fraction is increased by adding polymers of the same charge density $\sigma$. This brings the system to point Q, which lies out of the coacervate region, meaning that the system will no longer form coacervates

## TDPs - Designing for on-site usage

To make our tool as easy to handle as possible, we wanted to be able to guarantee its function even when stored at room temperature for a longer period of time. To this end, we are investigating the use of tardigrade-specific intrinsically disordered proteins (TDPs) to preserve Cas13a activity under these storage conditions, a schematic representation of the drying of Cas13a with TDPs if shown in Figure 12. TDPs originate from tardigrades, micro-animals that are able to survive a number of extreme conditions, including desiccation. TDPs have been shown to improve desiccation tolerance of the enzyme lactate dehydrogenase (LDH, 140 kDa) (Boothby et al. 2017), when isolated from the organism. Although Cas13a is functionally and structurally different from LDH, their molecular weight is similar. To get an idea of the concentration and type of TDP Cas13a has to be dried with to retain its functionality and activity, we modeled interactions between Cas13a and the four different TDPs we produced during our project. Upon dehydration, TDPs are thought to form a glass-like matrix which acts as a scaffold to cellular structures and proteins (Boothby et al. 2017). During the vitrification, the hydrophilic TDPs form a matrix around the protein, effectively replacing water molecules, before the protein is damaged through water loss. A common way to model such phase separation like processes is by using a lattice model.

Figure 12: Schematical representation of a drying rehydration cycle. Cas13a drying and rehydration is shown without (top) and with (bottom) TDPs. When TDPs are present Cas13a will still be active after dehydration.

Since we wanted to observe the phase separation behaviour of TDPs with Cas13a, we were mainly interested in protein-protein interactions rather than interactions between individual amino acid residues. After assessing if our lattice model could simulate such interactions (Q9), we transformed the structure of both Cas13a and the different TDPs to fit into the lattice, by averaging over a number of residues (Q10). By observing how quickly phase separation occurs, and therefore how quickly the protective scaffold forms around Cas13a, we determined which TDP is best suited to preserve Cas13a (Q11). Furthermore, we determined the concentration necessary to confer desiccation tolerance to Cas13a, by again modeling the phase separation behavior over a range of different concentrations (Q12).

• Q9 Can the lattice model be used to model protein interactions?

In our lattice model, we use the hydrophobe and charge compatibility indices, HCI and CCI respectively (Biro 2006) and convert them into energies by scaling them to the usual energy contributions of electrostatic and hydrophobic interactions. By running the model without any favourable interactions and with exclusively favourable interactions between the chains, we assessed if the model could provide information on the behaviour of polymers in solution. While no significant changes of the polymer arrangement can be seen in Figure 13, Figure 14 clearly shows a clustering of chains, which would be expected if all chain interactions are favourable. This demonstrates that the lattice model can simulate the aggregation of chains in solution and that it can be used to model the interactions between Cas13a and TDPs.

Figure 13: Simulation with nointeractions. Two states of a simulation without favourable protein interactions. To the left, the initial state of the simulation is shown immediately after it is generated, to the right, the environment is shown after 50 000 iterations.

Figure 14: Simulation with interactions. Two states of a simulation with exclusively favourable protein interactions. To the left, the initial state of the simulation is shown immediately after it is generated, to the right, the environment is shown after 50 000 iterations.

• Q10 What is the structure of Cas13a to input into the lattice model?

In order to simulate the preservation of Cas13a upon drying with protective proteins (TDPs), we require the lattice model to use the structure of Cas13a as an input. This requires a simplified, LEGO-like structure of our Cas13a protein, as well as the distribution of charged and hydrophobic residues on that structure.

The distribution of hydrophobic and charged residues over the surface of Cas13a is heterogenous. We annotated the Cas13a structural files with the Eisenberg normalized consensus hydrophobicity scale using existing algorithms (SIB, 2017) (Figure 15a) and adjusted these algorithms to also annotate charge (Figure 15b), from which we conclude that both these properties are distributed relatively heterogeneously.

Figure 15: Distribution of hydrophobic and charged residues over the surface of Cas13a A) Annotation of Cas13a using the Eisenberg normalize consensus hydrophobicity scale using existing algorithms (SIB, 2017). Hydrophobicity is high for red patches and low for white patches. B) Annotation of Cas13a by adjusting the script used in A, using amino acid charge information (IMGT, 2017). Red patches are negatively charged, blue patches are positively charged and white patches are charge neutral. The PyMOL structure files and annotation scripts can be downloaded here.

As it was not straightforward to manually derive a protein model using these data, we wrote a program that converts the commonly used "protein database" file extension (.pdb) into a 'LEGO-block' model (in the form of a matrix ) to use as input for the lattice drying simulation (Figure 16). The hydrophobicity and charge (isoelectric point) scales applied (Biro, 2006) can be easily adapted.

Figure 16: LEGO-block model of Cas13a to input into the lattice model. 2) Model of Cas13a hydrophobicity distribution. For visualization purposes, a colour threshold was applied: red areas are hydrophobic while grey areas are not. b) Model for Cas13a charge (isoelectric point) distribution. For visualization purposes, a colour threshold was applied: blue patches are positively charged while gray patches are not charged or negatively charged.

We provide the Matlab script, enabling anyone with basic Matlab knowledge to import a structure, which can be converted to a coarse-grained 'LEGO-block' model with a resolution that can be set by the user. In addition, the code intuitively includes the desired hydrophobicity and charge scales to consider, which can be adapted as desired.

• Q11 Which TDP best preserves Cas13a activity upon drying?

To model interactions between TDPs and Cas13a during the drying process, we observed the interaction of one Cas13a protein with the number of TDPs present at a certain concentration. Since the volume is reduced during desiccation, we set the simulation volume to be a small fraction of the original volume of the Cas13a solution. This fraction was chosen to be 0.1, which is small enough to display phase separation behaviour on a reasonable timescale, but large enough to assume that no protein denaturation or degradation has occurred yet. Data by Boothby et al. 2017 showed that LDH activity was best preserved at a TDP concentration of ~1 g/L. We used this concentration to run our lattice model for all our TDPs. One of these simulations (SAHS33020, 1 g/L, volume fraction 0.1) is shown in Figure 17. Both the migration of proteins towards Cas13a and formation of protein clusters can be seen.

Figure 17: Direct Cas13a neighbours. Simulation of Cas13a with SAHS33020 at a concentration of 1 g/L and in a volume fraction 0.1

To determine which TDP would preserve Cas13a activity best, we analyzed the coverage of Cas13a during the lattice model simulation. If Cas13a is covered with TDPs more quickly, they can form the glass-like matrix supporting its structure earlier, improving its desiccation tolerance. In Figure 18, the number of immediate neighbours of Cas13a is plotted against the number of iterations per chain. While all TDPs display similar development of coverage, SAHS33020 and CAHS94205 show a slightly higher coverage at 800 iterations per chain and a generally steeper slope, indicating that they are better suited for drying Cas13a.

Figure 18: Direct Cas13a neighbours. Number of immediate neighbours to Cas13a against the number of iterations per chain for TDPs CAHS94205, CAHS106094, SAHS33020 and SAHS68234 at a concentration of 1 g/L and in a volume fraction of 0.1.

To determine which TDP would preserve Cas13a activity best, we analyzed the coverage of Cas13a during the lattice model simulation. If Cas13a is covered with TDPs more quickly, they can form the glass-like matrix supporting its structure earlier, improving its desiccation tolerance. In Figure 18, the number of immediate neighbours of Cas13a is plotted against the number of iterations per chain. While all TDPs display similar development of coverage, SAHS33020 and CAHS94205 show a slightly higher coverage at 800 iterations per chain and a generally steeper slope, indicating that they are better suited for drying Cas13a.

• Q12 What concentrations of TDPs are needed for sufficient preservation of activity after drying Cas13a?

After CAHS 94205 and SAHS 33020 were chosen to dry Cas13a, we simulated the behaviour of the two proteins at concentrations ranging from 0.1 to 1 g/L. As in the previous simulations, we chose a volume fraction of 0.1 and analyzed the coverage of Cas13a over time (see Figure 19).

Figure 19: Initial state. Initial state of lattice model simulation for SAHS33020 concentrations ranging from 0.1 to 1 g/L at a volume fraction of 0.1.

In Figure 20 the number of direct neighbours to Cas13a is plotted against the TDP concentration at the same time in each simulation. Since Boothby et al. 2017 had previously observed a logarithmic behaviour of a different enzyme's activity towards higher TDP concentrations, we employed a logarithmic fit over all points with function (log).

$$y=m \cdot \ln(x)+b \tag{1}$$

The resulting curve fits the data reasonably well and while the system has not reached its saturation point, the curve clearly evens out towards higher TDP concentrations. This implies that, up to a concentration of 1 g/L at least, a higher Cas13a coverage can be achieved by increasing the TDP concentration. While better preservation of Cas13a activity may be possible, our results indicate that the difference in the resulting desiccation tolerance will become less pronounced at higher concentrations. While we cannot give recommendations for higher concentrations, it appears that a concentration of at least 1 g/L for either TDP is advisable for drying Cas13a.

Figure 20: Direct neighbours to Cas13a. Number of direct neighbours to Cas13a over a) SAHS33020 and b) CAHS94205 concentrations ranging from 0.1 to 1 g/L at 800 moves per chain. ($y = 54.696\cdot\ln(x) + 127.03$, $R^2 = 0.7361$)

## Vesicles - Exploring the cell's potential as a (future) factory

In the spirit of synthetic biology, we aimed to let our bacteria produce all our necessary proteins at once and package them in outer membrane vesicles (OMVs), which presents one way of achieving a responsible, cell-free system. In order to transport proteins into the vesicles, two things need to happen, as explained in Design page. Firstly, the cell should be hypervesiculating, i.e., producing a lot of vesicles. We achieved this by using E. coli BW25133 strain from the Keio collection (Baba et al. 2006) (KEIO), see Results. Using this strain, the 2016 iGEM team of the University of New South Wales (UNSW) in Australia achieved production of vesicles 80 - 100 nm in diameter. They also found that overexpression of the protein TolR shifts the size to 115 - 300 nm. Secondly, the vesicles are produced in the periplasm, so the protein needs to be translocated there. There are two pathways in E. coli that are most commonly used for this purpose. 1) In the Tat pathway an export tag is fused to the protein, after which the protein gets folded in the cytoplasm and is then transported through a pore to the periplasm (Oates et al. 2004). Having reached the periplasm, the tag gets cleaved off. Another pathway we explored, was 2) the Sec pathway. In this pathway the protein is translocated to the periplasm unfolded, which means it needs to be folded when it has arrived in the periplasm (Mori et al. 2001). As a proof of concept, we fused an export tag to GFP to transport it to the periplasm. If vesicles are formed in this case, a fraction of the content of the periplasm is taken into the vesicles. This means that the concentration of GFP in the periplasm and vesicles will be the same. In Figure 20, a schematic representation of the transport of TDPs and Cas13a (left) and GFP (right) into vesicles is shown.

Figure 21: Schematical representation transport to vesicles. A schematical reprisentation of the transport of TDPs and Cas13a (left) and GFP (right) into vesicles.

However, we had to assess first which transport pathway to use. Therefore, we wanted to predict whether Cas13a would be able to go through the pore of the Tat pathway (Q13). Furthermore, the question arose how many Cas13a would fit into a vesicle (Q14). Lastly, we model the dynamics of the system to understand how the concentration of GFP changes over time. This tells us what the best time of harvesting the vesicles is (Q15).

• Q13 How many Cas13a proteins can fit inside the vesicles?

Cas13a size information is important for estimating the amount of protein that can be fit into the vesicles our engineered bacteria produce, provide insight into the compatibility of the protein with drying by the TDPs and indicate if the methods we developed to detect Cas13a collateral cleavage are feasible. To estimate relevant protein dimensions, we employ a PyMOL script (Calvo 2015) that first aligns the protein to the axis and then calculates a bounding box around it. From this bounding box, we find that Cas13a is approximately 104x99x75 Angstrom in size, or 10x10x8 nanometer (Figure 21).

Figure 22: Bounding box around Cas13a. Visualization of the determined bounding box around the structure of Cas13a in PyMOL.

Based on the size of a Cas13a protein we estimate the number of proteins that fit in a vesicle. The iGEM team of the University of New South Wales (UNSW) Australia of 2016, determined that the diameter of the vesicles for the KEIO strain would range from 80 to 115 nm. If you add the TolR overexpression to this strain, they saw a shift of the size to 130 to 300 nm. With this information we calculated the maximum number of Cas13a proteins per vesicle by dividing the vesicle volume by the volume of Cas13a. The number of proteins we found range from 18 to 49 in the KEIO strain and from 80 to 994 in the KEIO strain in combination with TolR. This result made us optimistic about our idea and therefore proceeded with the module. However, our experiments showed that the vesicles we produced with the KEIO strain with and without TolR are around 18 nm in diameter. This would mean that only one Cas13a protein would fit in the vesicle. If our observations are correct, it would be challenging to use this KEIO strain as a cell factory that produces and secretes our proteins in vesicles, especially if more contents need to be packed inside.

• Q14 Is the Tat pathway a plausible pathway for the transport of Cas13a?

Producing Cas13a-containing vesicles requires translocation of Cas13a to the periplasm. The labteam considered using the Tat pathway for this. In this pathway, Cas13a would be fused to an export tag, after which it would get transported through a pore that can accommodate proteins up to 500 kDa (Oates et al. 2004). Since it was not easy to retrieve the molecular weight of Cas13a from literature, we employed a built-in PyMOL function to compute it from the protein structure file (Lui et al. 2017a). We compute the molecular weight of Cas13a to be 139kDa, while the molecular weight of loaded Cas13a (Cas13a with crRNA bound) is 152 kDa. As this is within the range of protein weight that the Tat pathway can transport, we conclude that the Tat pathway is a plausible pathway for the translocation of Cas13a from the cytoplasm to the periplasm. This is one of the reasons why we choose to pursue using the Tat pathway for protein translocation in the lab

• Q15 At what time does the concentration of TorA-GFP in vesicles reach steady state?

As we want to know the best moment to harvest the vesicles, we developed a kinetic model describing how the concentrations of TorA-GFP in the periplasm and cytoplasm of the cell evolve over time (Figure 23). We used this model to determine the concentrations in both a hypervesiculating and a non-hypervesiculating strain. In the model we need the growth rate of our strains. Therefore, we fitted a growth equation to measured ODs. Furthermore, we calculated the promoter strength ($P$) and the transport rate into the vesicles ($k_v$) by looking at the ordinary differential equation in the stationary state. All other parameters of the model were available in the literature.

Figure 23: Flow diagram. Flow diagram of kinetic model of TorA-GFP.

We showed that in both WT and KEIO the concentration TorA-GFP in the cytoplasm stays at a low level. The difference between the strains is the concentration of GFP in the periplasm. In the WT strain the concentration of GFP keeps growing, because we assumed that GFP can only leave of the periplasm by going into vesicles. In the KEIO strain the concentration of GFP in the periplasm stays at a low level, because GFP goes into the vesicles which in turn are secreted by the cell. After 25 minutes the concentration of GFP does not grow anymore and is on its maximum. We find that after 25 minutes the system of the hypervesiculating strain reaches a steady state. To have the highest concentration of GFP in the vesicles, they need to be harvest after 25 minutes. Therefore, in experiments measuring the GFP fluorescence in vesicles, the vesicles were harvested after 25 minutes.