Team:Bielefeld-CeBiTec/Results/unnatural base pair/development of new methods

Development of New Methods

Background: Detection of Unnatural Bases in DNA

When working with unnatural bases, one of the major challenges is the detection of unnatural base pairs (UBPs) in DNA. A reliable method for UBP detection is essential for analyzing UBP retention in vivo as well as in vitro replication and PCR experiments. In most cases, scientists working with unnatural bases have to develop their own methods, tailored to the specific unnatural bases they are working with. Unfortunately, these methods often come with significant drawbacks.

One commonly applied method is the use of molecular beacons, as described by Johnson et al. (2004) for the detection of isoG and isoC^m in PCR experiments. This method is laborious and expensive, because every ssDNA sample requires an individual, fluorescently labeled, specific oligonucleotide containing the unnatural nucleotide complementary to the one under investigation. Additionally, the influence of the analyzed unnatural bases on the annealing temperature has to be determined beforehand to prevent unspecific hybridization from distorting the results. A different method, used primarily for UBP retention analysis, is the biotin shift assay (Zhang et al., 2017). A DNA sample analyzed by this approach is used as a template in a PCR with biotin-labeled nucleoside triphosphates of one of the unnatural bases. Through specific interaction with streptavidin after amplification, molecules containing these labeled nucleotides are shifted upwards in subsequent PAGE analysis compared to DNA molecules of the same length. The use of specially labeled variants of the analyzed unnatural bases is generally not recommended, as even small structural changes might have a great impact on the interaction with other biomolecules and could lead to distorted experimental results. In addition, the need for PCR amplification before UBP detection is a disadvantage, as the proportion of DNA molecules with retained and lost UBPs might be strongly affected, particularly, but not exclusively, by the influence of the unnatural nucleoside triphosphates on the PCR.

For our experiments, we developed two new methods which are potentially applicable for the analysis of experiments with most unnatural bases: the Mutational Analysis Xplorer (M.A.X) and Oxford Nanopore sequencing of unnatural bases with iCG. Both are comparatively cost-efficient methods that have the major advantage of enabling the direct analysis of mutational events in addition to the mere detection of unnatural bases. In our experiments, we showed that both methods are applicable for the analysis of experiments with the unnatural bases isoG and isoC^m.

Mutational Analysis Xplorer – Results

Primer annealing

To find the most efficient annealing procedure, we compared three different annealing methods.

We designed five ssDNA pairs with different bases at position 40. Each of the natural bases A, T, G, and C leads to a recognition sequence of one of the restriction enzymes EciI, BsaI, SapI and MnlI. We postulate that none of these restriction enzymes will recognize its recognition sequence if the base pair between isoG and isoC^m is present at this position. For good annealing efficiency, the two oligo strands have to be combined in equimolar amounts. The concentration of each oligo can be calculated from its OD₂₆₀ value, where an OD₂₆₀ of 1 corresponds to 33 µg ml^-1 (NEB calculator, September 2017), and its molecular mass.

c (µmol L^-1) = OD₂₆₀ ∙ 33 µg ml^-1 ∙ 1000 / M (g mol^-1)    (1)
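
As a rough illustration of this conversion, the following Python sketch computes the molar concentration of an oligo from its OD₂₆₀ value and the volume of a second oligo needed for an equimolar mixture. The function names and the example readings and molecular masses are assumptions made for illustration; only the factor of 33 µg ml^-1 per OD₂₆₀ unit is taken from the text.

```python
# Mass concentration of ssDNA per OD260 unit, as stated in the text (NEB calculator).
UG_PER_ML_PER_OD260 = 33.0


def oligo_concentration_umol_per_l(od260, molar_mass_g_per_mol):
    """Convert an OD260 reading into a molar concentration in µmol/L."""
    mass_conc_g_per_l = od260 * UG_PER_ML_PER_OD260 * 1e-3  # µg/mL equals mg/L
    return mass_conc_g_per_l / molar_mass_g_per_mol * 1e6


def equimolar_volume(volume_a_ul, conc_a, conc_b):
    """Volume of oligo B (µL) holding the same molar amount as volume_a_ul of oligo A."""
    return volume_a_ul * conc_a / conc_b


# Hypothetical example values for two complementary 80 nt oligos.
conc_fwd = oligo_concentration_umol_per_l(od260=1.0, molar_mass_g_per_mol=24700.0)
conc_rev = oligo_concentration_umol_per_l(od260=0.8, molar_mass_g_per_mol=24500.0)
volume_rev = equimolar_volume(10.0, conc_fwd, conc_rev)
```

With these made-up readings, the forward oligo comes out at roughly 1.3 µmol L^-1, and the second function gives the volume of the reverse oligo that matches 10 µL of the forward oligo in molar amount.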

Figure 1: Sequences of the M.A.X targets mutA, mutT, mutG and mutC with the relevant restriction sites, as well as the sequence of UBP_target.

All reactions showed nearly complete annealing. We continued further experiments with the aqua annealing to avoid affecting subsequent digestion reactions through altered buffer conditions.

Subsequently, we tested different concentrations of annealed DNA, ranging from 50 µmol L^-1 down to 0.25 µmol L^-1, at which bands were still visible on the gel.

Based on these first results, a final annealing concentration of 0.5 µmol L^-1 appears to be a good compromise between band visibility and a DNA quantity low enough for complete digestion. For the following annealing reactions, 1 µmol L^-1 of ssDNA was used to obtain 0.5 µmol L^-1 of annealed dsDNA.

Restriction digest

The DNA strands were designed such that the partial restriction sites of four different restriction enzymes are located at the same position. In case of a mutation, we can determine to which base the unnatural base mutated without sequencing. To test the practicability and quality of the restriction system, we performed several test restriction digests.

To ensure that the digestion is complete, we calculated the amount of DNA that is digested per unit of enzyme in 1 hour at 37 °C. One unit is defined as the amount of restriction enzyme needed to digest 1 µg of lambda DNA. Lambda DNA consists of 48,502 bp (NEB), so 1 µg corresponds to 1.99 ∙ 10¹⁰ molecules. Based on the number of recognition sites of each enzyme in lambda DNA, we calculated the cuts per hour for each enzyme.
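
The number of lambda DNA molecules per microgram can be estimated with a few lines of Python. This is only a sketch: the average mass of 650 g/mol per base pair is a common approximation and an assumption here, so the result (about 1.9 ∙ 10¹⁰) differs slightly from the value used in the text. The function names are chosen for illustration.

```python
AVOGADRO = 6.022e23          # molecules per mol
LAMBDA_LENGTH_BP = 48502     # length of lambda DNA stated in the text
AVG_BP_MASS = 650.0          # g/mol per base pair, a common approximation (assumption)


def molecules_per_microgram(length_bp, bp_mass=AVG_BP_MASS):
    """Number of dsDNA molecules contained in 1 µg of DNA of the given length."""
    molar_mass = length_bp * bp_mass          # g/mol
    return 1e-6 / molar_mass * AVOGADRO       # 1 µg = 1e-6 g


def cuts_per_unit_hour(sites_per_molecule, length_bp=LAMBDA_LENGTH_BP):
    """Cuts made by 1 unit in 1 h, if 1 unit fully digests 1 µg of lambda DNA."""
    return molecules_per_microgram(length_bp) * sites_per_molecule


lambda_molecules = molecules_per_microgram(LAMBDA_LENGTH_BP)
sapi_cuts = cuts_per_unit_hour(10)  # SapI has 10 sites in lambda DNA (Table 1)
```

Multiplying the molecule count by the number of recognition sites gives the number of cuts that one unit performs per hour under the unit definition above.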

Table 1: Calculation of restrictions per hour (1 unit) of the M.A.X enzymes.

M.A.X target	Enzyme	Units per µL	Restriction sites in lambda DNA	Restrictions per hour (1 unit)
mutG	SapI	10	10	1.99∙10¹²
mutC	MnlI	5	262	2.35∙10¹³
mutT	BsaI	10	2	3.6∙10¹¹
mutA	EciI	2	29	1.44∙10¹²

In an annealing reaction with 0.5 µmol L^-1 DNA in a total reaction volume of 50 µL, we have 3.011 ∙ 10⁸ molecules per µL, each containing one or two restriction sites. Theoretically, more than 1 mL of the annealed DNA should be digested by 1 unit of restriction enzyme per hour. After 1 hour of incubation and subsequent heat inactivation, all samples were resolved by electrophoresis using a 12 % bisacrylamide native DNA PAGE.

Figure 2: Native DNA PAGE of annealed mutA oligos. Samples: UBP_target ssDNA, UBP_target annealing, restricted UBP_target (EciI), restricted mutA (EciI), ssDNA oligo, annealed mutA.

Figure 3: Native DNA PAGE of annealed oligos. Samples: annealed mutT, ssDNA oligo, restricted mutT (BsaI), restricted UBP_target (SapI), UBP_target annealing, UBP_target ssDNA.

Figure 4: Native DNA PAGE of annealed mutG oligos. Samples: UBP_target ssDNA, UBP_target annealing, restricted UBP_target (SapI), restricted mutG (SapI), ssDNA oligo, annealed mutG.

Figure 5: Native DNA PAGE of annealed mutC oligos. Samples: UBP_target ssDNA, UBP_target annealing, restricted UBP_target (MnlI), restricted mutC (MnlI), ssDNA oligo, annealed mutC.

Figures 2 - 5 show the expected band patterns, although the digestion of the M.A.X targets is not complete. Later experiments revealed that a longer incubation time is necessary. The M.A.X targets are cut after 40 bp, while the UBP_target, which contains the unnatural bases, remains at full length because the recognition site of the restriction enzyme is interrupted by the UBP. In Figure 5, the UBP_target shows a restricted band pattern because of a second MnlI restriction site just 4 bases away from the first one. Although it is hard to see, the UBP_target fragment is slightly longer than the restriction fragment of mutC. As expected, the annealed UBP_target is not digested, indicating that the UBP prevents sequence recognition by the tested restriction enzymes. This shows that the M.A.X restriction system is a suitable detection system for UBP retention and mutation event analysis in selected DNA sequences.

PCR with UBPs

Based on Sismour et al. (2005) and Johnson et al. (2004), we designed a novel protocol for PCR with the unnatural base pair isoG and isoC^m. We first set out to reproduce the published positive results with Titanium Taq (TiTaq) polymerase. While Johnson et al. reported an efficiency of 96 % ± 3 %, Sismour et al. showed a reduced fidelity using the Klenow fragment of TiTaq polymerase. Without thymidine analogues, the fidelity decreases rapidly over successive rounds of PCR, to less than 60 % after 20 rounds.
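
To get a feeling for how per-round fidelity accumulates, the following Python sketch uses a simple geometric model in which the same fraction of UBPs is retained in every round. This constant-fidelity model and the example numbers are assumptions for illustration and are not taken from the cited papers.

```python
def retention_after_rounds(per_round_fidelity, rounds):
    """Fraction of molecules still carrying the UBP after the given number of rounds."""
    return per_round_fidelity ** rounds


def per_round_fidelity_for(retention, rounds):
    """Per-round fidelity that yields the given overall retention after `rounds` rounds."""
    return retention ** (1.0 / rounds)


# Illustrative values: a per-round fidelity of 96 % over 20 rounds,
# and the per-round fidelity that corresponds to 60 % retention after 20 rounds.
retained = retention_after_rounds(0.96, 20)
required = per_round_fidelity_for(0.60, 20)
```

Under this model, 96 % per round leaves about 44 % of the UBPs after 20 rounds, and keeping 60 % after 20 rounds requires roughly 97.5 % per round, which shows how quickly small per-round losses add up.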

For endpoint determination, we performed PCR reactions with 30 rounds to find out if there is any polymerase activity with template DNA containing the unnatural bases isoG and isoC^m.

The PCR templates were prepared by ligating each of the annealed 80 bp M.A.X targets mutA, mutT, mutG and mutC into pSB1C3_RuBisCo (BBa_K1465202). For this purpose, the plasmid backbone was linearized by digestion with BmtI and XbaI. For complementary sticky ends, the annealed oligos were digested with BmtI and SpeI. After ligation, subsequent digestion with XbaI, lambda exonuclease and exonuclease I was performed to reduce the amount of unintended DNA template.

Figure 6: Plasmid map of pSB1C3_RuBisCo (BBa_K1465202), which is used as backbone for the M.A.X targets and UBP_target during PCR. It carries a chloramphenicol resistance and has a length of 2517 bp. Each construct contains a different M.A.X target insert (mutA, mutT, mutG, mutC) or UBP_target and is 351 bp long.

To increase the probability of incorporation of the unnatural bases, we used 100 µM dNTPs, 200 µM isoG and 200 µM isoC^m in each reaction. After varying the template concentration from 1 ng µL^-1 to 50 ng µL^-1, the best concentrations for obtaining high-quality bands were 1 ng µL^-1 for the M.A.X targets and 25 ng µL^-1 for the UBP_target template.

Figure 7: PCR with Titanium Taq polymerase of pSB1C3_RuBisCo with the inserts mutA, mutT, mutG, mutC (5 ng µL^-1) and UBP_target (25 ng µL^-1). The expected fragment is 351 bp long.

To quantify the efficiency of the incorporation of isoG and isoC^m, all PCR products were digested with the M.A.X system. To achieve complete digestion, different incubation times from 1 h to 15 h were tested. The best results with BsaI and MnlI were achieved with an overnight incubation of 15 h. For the less stable enzymes EciI and SapI, a 2 h digestion with addition of further enzyme after 1 h turned out to be optimal. Nevertheless, EciI and SapI could not digest the complete sample even when the concentrations were lowered. Therefore, we expected undigested bands for the M.A.X targets mutA and mutG throughout the experiment.

After the first successful PCR, we tested whether the presence of isoG and isoC^m has any influence on the efficiency of the polymerase. We therefore added both unnatural bases to every PCR with the M.A.X targets as template to see whether the intensity of the bands decreases.

Figure 9: PCRs with Titanium Taq (A), GoTaq G2 (B), Allin HiFi DNA Polymerase (C), innuDRY polymerase (D), BioMaster-HS Taq PCR polymerase (E), FirePol DNA polymerase (F), Phusion DNA polymerase (G) and Q5 DNA polymerase (H). The template is pSB1C3_RuBisCo with the inserts mutA, mutT, mutG, mutC (5 ng µL^-1) and UBP_target (25 ng µL^-1) after the restriction digest with EciI (mutA) and SapI (mutG) for 2 h and BsaI (mutT) and MnlI (mutC) for 15 h.

The native PCR fragments are, as expected, 351 bp long, as can be seen in Figure 8. We always used the same template concentration and the same primers to ensure comparability between the DNA polymerases. As can be seen in Figure 8, all Taq-based polymerases are able to incorporate the unnatural base pair into the DNA.

The best results were apparently achieved with the GoTaq G2 DNA polymerase (Figure 8 B). All lanes with the UBPs show clear bands and no mutations to T or A. The M.A.X restriction digest showed that there are some mutations to C; alternatively, the antisense strand mutated to G and was paired with a C, which completed the recognition site of MnlI.

Every PCR product of the UBP_target fragment was digested because of the second MnlI restriction site, but differences between the M.A.X target mutC and the UBP_target PCR product can be seen because of the distance between both restriction sites.

The digested PCR products of the BioMaster-HS Taq PCR polymerase and the Allin HiFi DNA Polymerase show more mutations to A than those of the other polymerases. The TiTaq polymerase also seems to misincorporate A instead of isoG. Moreover, the Phusion DNA polymerase tends to misincorporate G, while the Q5 DNA polymerase does not show any bands containing the UBPs. Both polymerases have a proofreading function, in contrast to the other polymerases.

Table 2: Polymerases used, with their specific modifications, during the PCRs with isoG and isoC^m.

Position in Figure 9	DNA polymerase	Distributor	Modification	Incorporation of UBP?
A	Titanium Taq	Clontech	Lacks 5'-exonuclease activity	yes
B	GoTaq G2	Promega	Has 5'-3' exonuclease activity	yes
C	Allin HiFi	highQu	Derived from Pfu polymerase with several mutations and proofreading function	yes
D	innuDRY	Analytik Jena	Specific hot-start Taq DNA polymerase	yes
E	BioMaster-HS Taq	Biolabmix	Hot-start Taq DNA polymerase	yes
F	FirePol	Solis Biodyne	Has a 5'-3' polymerization-dependent exonuclease replacement, but lacks 3'-5' exonuclease activity	yes
G	Phusion	NEB	Derived from a Pyrococcus enzyme fused with a processivity-enhancing domain. It possesses 5'-3' polymerase activity and 3'-5' exonuclease activity	yes
H	Q5	NEB	Fused to the processivity-enhancing Sso7d DNA binding domain, with an error rate ~280-fold lower than that of Taq DNA polymerase	no

In summary, the faster and stronger the proofreading function of a polymerase, the worse the incorporation of the UBPs.

The M.A.X system seems to be a good method for a first assessment of the efficiency of the polymerases. It reveals whether there is any incorporation of UBPs, so that sequencing is worthwhile. Minor deviations are not detectable by gel electrophoresis. It is also difficult to make a clear statement about the proportion of correctly or incorrectly incorporated unnatural bases, because the digested fragments seem to be less intense than intact sequences.

References

Johnson, S.C., Sherrill, C.B., Marshall, D.J., Moser, M.J., and Prudent, J.R. (2004). A third base pair for the polymerase chain reaction: inserting isoC and isoG. Nucleic Acids Res. 32: 1937–41.

Sismour, A.M. and Benner, S.A. (2005). The use of thymidine analogs to improve the replication of an extra DNA base pair: a synthetic biological system. Nucleic Acids Res. 33: 5640–6.

Isoguanine & 5-methyl isocytosine in Nanopore Sequencing

Nanopore Sequencing

Oxford Nanopore Technologies' (ONT) sequencing technology offers great potential as a tool for the detection of unnatural bases in DNA. In ONT sequencing, protein nanopores are distributed in a synthetic membrane of high electrical resistance. When an electrical field is applied across this membrane, an ionic current passes through each nanopore and is measured and recorded. If a biomolecule such as a protein, RNA or DNA is located inside the nanopore, the ionic current is influenced. These characteristic changes can be used to identify which molecule is passing through the nanopore (Feng et al., 2015). This way, an algorithm called the "basecaller" is able to predict the nucleotide sequence of a single-stranded DNA or RNA molecule based on the raw data recorded while it is pulled through a nanopore. Since the portable sequencer MinION became commercially available in 2015, basecalling accuracy has improved considerably. Even though the error rate is still high compared to other sequencing techniques, the long reads of several kilobases are often advantageous for sequencing DNA containing repetitive sequences or mobile genetic elements such as transposable elements (Debladis et al., 2017). More recently, efforts have been made towards the analysis of epigenetic information based on the identification of modified bases in nucleic acids with nanopore sequencing. For example, methylated cytosine was shown to be distinguishable from unmodified cytosine by training a hidden Markov model (Simpson et al., 2017).

Compared to other sequencing technologies, nanopore sequencing offers several advantages for the detection of unnatural bases. Most importantly, no PCR amplification of the DNA sample is needed during library preparation. This way, no information is lost prior to sequencing as a result of a potentially lower PCR amplification fidelity of the unnatural base pair. Another major advantage is that no additional chemistry is needed for sequencing. Other sequencing technologies such as 454, Sanger, Illumina and PacBio are based on polymerases that synthesize a DNA strand complementary to the template being sequenced. When a specially labeled nucleotide is incorporated, a detectable signal is emitted. This is problematic for sequencing DNA containing unnatural bases, as additional labeled nucleotides would be needed for continuous strand synthesis and to produce a unique signal for the unnatural bases. Considering the development costs for this new chemistry, the necessary process adaptations and the increased complexity of data analysis, sequencing of orthogonal unnatural bases is unlikely to be feasible with these technologies. In contrast, nanopore sequencing requires no additional chemistry, and it is unlikely that sequencing will be interrupted by unnatural bases passing through the nanopore. On top of that, nanopore sequencing was shown to be applicable for direct sequencing of RNA, without prior reverse transcription into cDNA (Garalde et al., 2016). Therefore, it promises to be suitable for transcription studies involving unnatural bases as well.

We aimed to examine whether Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. To this end, we sequenced different DNA samples containing either the unnatural nucleotides isoguanosine and 5‑methyl isocytidine or one of the natural bases in the same sequence context, to see whether the output signal differs significantly between these groups. Data processing and evaluation were performed with our own software iCG, which we developed specifically for analyzing Nanopore sequencing data of DNA containing unnatural bases. Our aim is to create a linear discriminant analysis model that is able to discriminate between isoG/isoC^m and natural bases in the given neighboring sequence context of two bases upstream and two bases downstream of the position of interest. For a detailed description of how the software works, please refer to our Software page.

Reference Sample Preparation & Sequencing

To examine whether the unnatural bases isoG and isoC^m can be distinguished from the natural bases by nanopore sequencing, five different DNA samples were prepared that differed only at a single sequence position, containing either an unnatural base or one of the four natural bases at this position of interest. For our experiments, we started by sequencing isoC^m in the sequence context 5'‑AG\iC^m\CC‑3' and, on the reverse strand, isoG in the sequence context 5'‑GG\iG\CT‑3'. For this purpose, we constructed five reference DNA samples with the following sequences:

Fig. 1: Annealed oligonucleotides used for reference sample preparation. Sequences at the position of interest of DNA samples used as references for nanopore sequencing.
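
The relationship between the two sequence contexts named above can be illustrated with a short Python sketch that treats isoG and isoC^m as an additional base pair when computing the reverse complement. The token names `iG` and `iCm` and the function name are assumptions chosen for this illustration.

```python
# Pairing rules: the four natural bases plus the unnatural pair isoG / isoC^m.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G", "iG": "iCm", "iCm": "iG"}


def reverse_complement(tokens):
    """Reverse complement of a sequence given as a list of base tokens (5' to 3')."""
    return [PAIRS[token] for token in reversed(tokens)]


# Forward-strand context around isoC^m, written 5' to 3'.
forward_context = ["A", "G", "iCm", "C", "C"]
reverse_context = reverse_complement(forward_context)
```

Applied to the forward context AG, isoC^m, CC, this yields GG, isoG, CT, which is the context in which isoG is read on the reverse strand.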

Each reference DNA sample was prepared starting from a pair of complementary synthetic oligonucleotides. The oligos containing isoguanine or 5‑methyl isocytosine were synthesized by Biolegio, including subsequent purification by polyacrylamide gel electrophoresis (PAGE). Mass spectrometry and ultra performance liquid chromatography data (Fig. 2) provided by Biolegio indicate that the concentrations of unmodified side products are below detection limits. In general, the manufacturer specifies the purity of PAGE-purified oligonucleotides containing modified bases to be greater than 95 %. The oligonucleotides containing exclusively natural bases were ordered from metabion and were purified by desalting.

The presence of the unnatural bases in the oligos from Biolegio and the correct sequence of the oligos containing exclusively natural bases were further confirmed by analysis with our Mutational Analysis Xplorer (M.A.X.). This orthogonal analysis approach revealed that the purity and sequence identity of the oligos were very high.

Fig. 2: UPLC and MS data from oligos containing isoguanine or 5‑methyl isocytosine. Oligonucleotides containing isoguanine or 5‑methyl isocytosine were synthesized by Biolegio and analysed by ultra performance liquid chromatography (UPLC) and mass spectrometry (MS). Shown above are the results from the UPLC (above) and MS (below) analysis for each of the complementary oligos containing either isoguanine (A) or 5‑methyl isocytosine (B).

For each DNA sample, a complementary pair of oligonucleotides was annealed and ligated into a plasmid backbone (BBa_K1465202) previously linearized with XbaI and BmtI. For this purpose, a leading BmtI and a trailing SpeI recognition site were included in the oligonucleotide sequences. After ligation, re-ligated backbone was linearized by digestion with XbaI. After consecutive digestion of double- and single-stranded linear DNA fragments with lambda exonuclease and E. coli exonuclease I, the DNA samples were linearized by EcoRV digestion and purified for sequencing library preparation. An individual library was prepared for each DNA sample according to the 1D Library Protocol for SQK-LSK108, starting from the end repair step.

Fig. 3: Loading a library onto the flow cell for sequencing.

Fig. 4: Sequencing a DNA sample containing unnatural base pairs.

Sequencing was performed with a new FLO-MIN106 R9.4 flow cell on a MinION sequencer with the MinKNOW software version 1.7.14.1. Each DNA sample was sequenced individually, with intermediate washing steps using the Flow Cell Wash Kit EXP-WSH002. For every sample, a minimum of 260,000 reads was generated. Basecalling was performed locally with ONT albacore version 1.2.4.

Data Analysis with our iCG Software Package

After basecalling with albacore, the Nanopore data was analyzed with our own software iCG. The scripts created in this process were combined into a software tool that can potentially be used for the analysis of DNA containing arbitrary unnatural bases with Nanopore sequencing.

In the first step, the reads were filtered with iCG filter in order to identify reads that contain the region of interest and have a high basecalling quality. For the parameters minimum length, maximum length and minimum mean quality score (qscore), the default settings of iCG filter were used. Of the remaining reads, only those containing the neighboring sequence context of 15 bases upstream and downstream of the position of interest (POI) were selected, without considering the close sequence context (blur region) of 3 ± 1 bases around the POI, where influences of the unnatural bases may lead to unpredictable behavior of the basecaller. The matching reads were allowed to contain a maximum of 2 mismatches, including indels. The maximum deviation in length was set to 1 base, and reads containing the region of interest multiple times were rejected. Additionally, the selected reads were filtered for a minimum mean quality score of 14 in this restricted sequence context and sorted by their strand orientation. For further information about iCG, please refer to our Software page.
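
The selection logic described above can be sketched in Python as follows. This is a simplified illustration and not the implementation of iCG filter: it only counts substitutions (no indels), assumes a fixed gap of 7 bases for the POI plus blur region, and all function and parameter names are hypothetical.

```python
from statistics import mean


def hamming(a, b):
    """Number of positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def select_read(seq, quals, upstream, downstream,
                gap_len=7, max_mismatches=2, min_mean_q=14.0):
    """Return the start of the single matching region, or None if the read is rejected.

    The region is: upstream flank, a gap that is ignored (POI and blur region),
    downstream flank. Mismatches are counted over both flanks together.
    """
    window = len(upstream) + gap_len + len(downstream)
    hits = []
    for start in range(len(seq) - window + 1):
        up_part = seq[start:start + len(upstream)]
        down_start = start + len(upstream) + gap_len
        down_part = seq[down_start:down_start + len(downstream)]
        if hamming(up_part, upstream) + hamming(down_part, downstream) <= max_mismatches:
            hits.append(start)
    if len(hits) != 1:          # region missing or present multiple times
        return None
    start = hits[0]
    if mean(quals[start:start + window]) < min_mean_q:
        return None             # basecalling quality too low in the region
    return start


# Hypothetical flanks and read, for illustration only.
UP = "ACGTTGCAAGCTTAG"
DOWN = "TCCGATAGGCTAACG"
read = "GGG" + UP + "AAGTCCA" + DOWN + "TTT"
position = select_read(read, [20] * len(read), UP, DOWN)
```

Ignoring the gap between the flanks means that whatever the basecaller emits close to the unnatural base does not influence whether a read is selected.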

Fig. 5: Normalized signal traces of analyzed DNA samples. Overlaid, normalized signal traces of DNA samples containing either isoG/isoC^m or any natural base at the position of interest in the analyzed sequence context. The reads displayed in these plots were selected from their respective sequencing runs with iCG filter, using the same filter settings for all DNA samples. To remove contaminating reads from previous sequencing runs, a quantile of 0.7 of the most deviating reads was removed prior to plotting with the help of iCG model.

Afterwards, iCG model was used to create linear discriminant models based on the filtered groups of template reads gathered by iCG filter. Different settings for the amount of removed, deviating reads were tested. Figure 5 shows plots of the region of interest for both the forward and the reverse strand and all five template groups, with a quantile of 0.7 of the reads removed. For both strand orientations, there is a distinct difference in the mean normalized signal trace between the sequences containing an unnatural base and those containing a natural base at the position of interest.
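
Removing the most deviating reads of a group can be sketched as follows. This is a minimal illustration under assumptions: deviation is measured here as the summed absolute difference to the position-wise median trace, which may differ from the measure used by iCG model, and the function name is hypothetical.

```python
from statistics import median


def remove_deviating_reads(traces, removed_quantile=0.7):
    """Keep the reads closest to the group's median signal trace.

    traces: list of equally long lists of normalized signal values.
    removed_quantile: fraction of the most deviating reads to discard.
    """
    n_positions = len(traces[0])
    median_trace = [median(t[i] for t in traces) for i in range(n_positions)]

    def deviation(trace):
        # Summed absolute difference to the median trace (assumed measure).
        return sum(abs(x - m) for x, m in zip(trace, median_trace))

    ranked = sorted(traces, key=deviation)
    n_keep = max(1, round(len(traces) * (1 - removed_quantile)))
    return ranked[:n_keep]


# Toy example: seven similar traces and three clear outliers.
example = [
    [0.0, 1.0, 0.0], [0.1, 1.0, 0.0], [0.0, 0.9, 0.0], [0.2, 1.2, 0.1],
    [0.3, 0.7, 0.2], [0.0, 1.1, 0.3], [0.4, 1.0, 0.0],
    [2.0, -1.0, 2.0], [3.0, 3.0, 3.0], [-2.0, 0.0, 1.5],
]
kept = remove_deviating_reads(example, removed_quantile=0.7)
```

With a removed quantile of 0.7, three of the ten toy traces remain, and the outliers are discarded first.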

Looking at the plot of the sample containing isoC^m, the signal seems to be much noisier than in any other sample analyzed. Based on the quality control by UPLC and MS after synthesis and our orthogonal analysis with the M.A.X. system, it is highly unlikely that this result is caused by impurities of the oligos prior to sample preparation. Keeping in mind that the displayed data is the product of several steps of data processing, including event generation during sequencing, basecalling, alignment to the reference sequence and normalization, one possible explanation is that errors arise in this data processing pipeline as a consequence of strong deviations from the expected signals of natural bases. For example, if the basecaller misinterprets the signal in the sequence context of isoC^m, the error is likely to propagate and cause problems during sequence alignment and time-dependent normalization. Alternatively, but less likely, the signal noise could be caused by influences of the sample preparation on isoC^m. It has been shown that deoxy-isocytidine and deoxy-isocytidine triphosphate show a tendency for deamination, resulting in their respective uridine analogues (Switzer et al., 1993). Considering the alkaline reaction conditions of the exonuclease reactions performed during sample preparation, the isoC^m bases in the analyzed sample could be partly deaminated to methylated uracil, potentially influencing the signal in Nanopore sequencing.

Linear Discriminant Analysis

Based on the data presented in Figure 5, a cluster analysis based on linear discriminant analysis was conducted using iCG model. Figure 6 shows dot-plots for the forward and reverse models, presenting the linear discriminants of the reads each respective model was created with. A direct comparison of both models reveals that the model created from the data of the reverse strand seems to perform better in terms of classification of the sequencing reads. Except for the groups containing A and G at the position of interest, which slightly overlap with each other, all groups are well separated from each other. In contrast, the linear discriminant analysis of the forward-strand data was unable to properly separate the reads containing A, G and iC^m from each other, mainly due to widely scattered reads of the iC^m group. Both results coincide with the visual assessment of the signal traces in Figure 5.

Fig. 6: Dot-plots of the linear discriminant models of the forward and reverse strand. Dot-plots of the linear discriminants of the reads used for the creation of the statistical models for base prediction at the position of interest in the forward and reverse strand. The data used for the linear discriminant analysis was previously filtered by removing 70 % of the reads from each group, based on their deviation from the group's median signal in the neighboring sequence context of the position of interest.
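
The principle of linear discriminant analysis can be shown with a deliberately small Python sketch for two classes and two features (Fisher's linear discriminant). It is not the model built by iCG model, which separates five groups; the features, data points and function names are made up for illustration.

```python
def mean_vec(rows):
    """Mean of a list of 2D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, m):
    """2x2 scatter matrix of the points around their class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def project(w, x):
    """Linear discriminant value of point x for direction w."""
    return w[0] * x[0] + w[1] * x[1]


def fit_lda(class0, class1):
    """Return direction w = Sw^-1 (m1 - m0) and the midpoint threshold."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    threshold = 0.5 * (project(w, m0) + project(w, m1))
    return w, threshold


def predict(w, threshold, x):
    """Class 1 if the discriminant lies above the threshold, otherwise class 0."""
    return 1 if project(w, x) > threshold else 0


# Made-up signal features of reads from two groups.
group0 = [(1.0, 2.0), (1.5, 1.8), (0.8, 2.4), (1.2, 2.1)]
group1 = [(3.0, 0.5), (3.4, 0.9), (2.8, 0.2), (3.1, 0.6)]
w, threshold = fit_lda(group0, group1)
```

Each read is reduced to a single discriminant value, which corresponds to the kind of value shown in dot-plots of linear discriminants; with more than two groups, several discriminant axes are obtained instead of one.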

Since a statistical model should not be tested with the very data it was created with, we prepared a new set of DNA samples to properly evaluate the performance of both models in predicting the base at the position of interest in their respective sequence context. For this purpose, we modified the RuBisCo plasmid that was used for the first sample preparation by cloning five different sequences downstream of RuBisCo with standard BioBrick assembly. Each of these plasmids contains a unique 25 nt sequence, while the remaining plasmid sequence is the same. These unique sequences can be used to assign sequencing reads to their sample, comparable to the Nanopore barcoding approach. Starting with these five plasmids, we prepared new DNA samples according to the procedure explained above. After ligation, all five samples were pooled in approximately equimolar proportions and further prepared for sequencing. After sequencing and basecalling of this pooled sample, the reads were assigned to their respective group by using iCG filter with the "--barcode" argument and each plasmid's unique sequence. After filtering, 50 reads of every group were randomly selected for evaluating the performance of the linear discriminant models with iCG predict. The results of this evaluation are summarized in Figure 7.
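
The assignment of pooled reads to their groups and the random selection of test reads can be sketched in Python as follows. This is an illustration only and does not reproduce how iCG filter handles the "--barcode" argument: it uses exact substring matching, ignores the reverse strand, and the barcode sequences and function names are hypothetical.

```python
import random


def assign_group(read, barcodes):
    """Return the name of the only barcode found in the read, otherwise None."""
    hits = [name for name, barcode in barcodes.items() if barcode in read]
    return hits[0] if len(hits) == 1 else None


def sample_test_reads(reads, barcodes, per_group=50, seed=0):
    """Group reads by barcode and draw up to per_group random reads from each group."""
    groups = {name: [] for name in barcodes}
    for read in reads:
        name = assign_group(read, barcodes)
        if name is not None:
            groups[name].append(read)
    rng = random.Random(seed)  # fixed seed so that the selection is reproducible
    return {name: rng.sample(members, min(per_group, len(members)))
            for name, members in groups.items()}


# Two made-up 25 nt barcodes standing in for the unique plasmid sequences.
example_barcodes = {
    "sample_A": "ACGTACGTACGTACGTACGTACGTA",
    "sample_B": "TTGGCCAATTGGCCAATTGGCCAAT",
}
example_reads = [
    "GG" + example_barcodes["sample_A"] + "CC",
    "GG" + example_barcodes["sample_B"] + "CC",
    example_barcodes["sample_A"] + example_barcodes["sample_B"],  # ambiguous read
    "GGGGCCCC",                                                   # no barcode
]
selected = sample_test_reads(example_reads, example_barcodes, per_group=50)
```

Reads that carry no barcode or more than one barcode are left out, so that each selected read belongs to exactly one sample.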

Fig. 7: Evaluation of the linear discriminant analysis models. Evaluation results of the linear discriminant analysis models for the forward and reverse strand. (A) Linear discriminants of the test data colored in accordance with their respective base prediction. (B) Distribution of predicted bases. Based on the assumption that every read in the test data set was correctly assigned with the barcoding approach, equal portions of 20 % for each base would be ideal, corresponding to 50 reads per test data group. (C) Fidelity of base prediction, revealing which base predictions were made for the reads of each group individually.
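
A per-group prediction fidelity of the kind described for panel (C) can be computed with a few lines of Python. The function name and the toy labels are assumptions for illustration and do not reproduce the actual evaluation data.

```python
from collections import Counter


def prediction_fidelity(true_labels, predicted_labels):
    """For each true group, the fraction of its reads assigned to each predicted base."""
    counts = {}
    for true, predicted in zip(true_labels, predicted_labels):
        counts.setdefault(true, Counter())[predicted] += 1
    return {
        true: {pred: n / sum(counter.values()) for pred, n in counter.items()}
        for true, counter in counts.items()
    }


# Toy example with two groups of two reads each.
truth = ["A", "A", "iG", "iG"]
predictions = ["A", "G", "iG", "iG"]
fidelity = prediction_fidelity(truth, predictions)
```

In this toy example, half of the reads of group A are predicted correctly, while all reads of group iG are assigned to iG.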

The results in Figure 7 indicate that the linear discriminant model for the reverse strand orientation performs better than the model for the forward strand. The base prediction fidelity is especially high for reads containing an adenine, a cytosine or an isoguanine at the position of interest. Due to the hydrolysis of isoC^m to T and the tautomerisation of isoG, which leads to mispairing with T, the most common mutation leading to a loss of the unnatural base pair between isoG and isoC^m is the mutation from isoG to A (Bande et al., 2015). Considering the fidelity of base prediction for both A and isoG with the reverse strand model, we conclude that this linear discriminant analysis model is well suited for the discrimination between isoG and all natural bases in the sequence context 5'-ggNct-3'. Therefore, we could show that the software package iCG is applicable for the analysis of experiments

References

Bande, O., Abu El Asrar, R., Braddick, D., Dumbre, S., Pezo, V., Schepers, G., Pinheiro, V.B., Lescrinier, E., Holliger, P., Marlière, P., and Herdewijn, P. (2015). Isoguanine and 5-Methyl-Isocytosine Bases, In Vitro and In Vivo. Chem. - A Eur. J. 21: 5009–5022.

Debladis, E., Llauro, C., Carpentier, M.-C., Mirouze, M., and Panaud, O. (2017). Detection of active transposable elements in Arabidopsis thaliana using Oxford Nanopore Sequencing technology. BMC Genomics 18: 537.

Feng, Y., Zhang, Y., Ying, C., Wang, D., and Du, C. (2015). Nanopore-based Fourth-generation DNA Sequencing Technology. Genomics Proteomics Bioinformatics 13: 4–16.

Garalde, D.R. et al. (2016). Highly parallel direct RNA sequencing on an array of nanopores. bioRxiv: 68809.

Johnson, S.C., Sherrill, C.B., Marshall, D.J., Moser, M.J., and Prudent, J.R. (2004). A third base pair for the polymerase chain reaction: inserting isoC and isoG. Nucleic Acids Res. 32: 1937–41.

Simpson, J.T., Workman, R.E., Zuzarte, P.C., David, M., Dursi, L.J., and Timp, W. (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14: 407–410.

Sismour, A.M. and Benner, S.A. (2005). The use of thymidine analogs to improve the replication of an extra DNA base pair: a synthetic biological system. Nucleic Acids Res. 33: 5640–6.

Switzer, C.Y., Moroney, S.E., and Benner, S.A. (1993). Enzymic recognition of the base pair between isocytidine and isoguanosine. Biochemistry 32: 10489–10496.

Zhang, Y., Lamb, B.M., Feldman, A.W., Zhou, A.X., Lavergne, T., Li, L., and Romesberg, F.E. (2017). A semisynthetic organism engineered for the stable expansion of the genetic alphabet. Proc. Natl. Acad. Sci. 114: 1317–1322.