Isoguanine & 5-methyl isocytosine in Nanopore Sequencing
Oxford Nanopore Technologies' (ONT) sequencing technology offers great potential as a tool for the detection of unnatural bases in DNA. In ONT sequencing, protein nanopores are embedded in a synthetic membrane of high electrical resistance. When an electrical field is applied across this membrane, an ionic current passes through each nanopore and is measured and recorded. If a biomolecule such as a protein, RNA or DNA is located inside the nanopore, the ionic current changes. These characteristic changes can be measured and used to identify which molecule is passing through the nanopore (Feng et al., 2015). In this way, an algorithm called the "basecaller" can predict the nucleotide sequence of a single-stranded DNA or RNA molecule from the raw data recorded while it is pulled through the nanopore. Since the portable MinION sequencer became commercially available in 2015 (Check Hayden, 2015), basecalling accuracy has improved substantially. Even though the error rate is still high compared to other sequencing techniques, long reads of several kilobases are often preferable for sequencing DNA that contains repetitive sequences or mobile genetic elements such as transposable elements (Debladis et al., 2017). More recently, efforts have been made to analyze epigenetic information by identifying modified bases in nucleic acids with nanopore sequencing. For example, methylated cytosine was shown to be distinguishable from unmodified cytosine by training a hidden Markov model (Simpson et al.).
Compared to other sequencing technologies, nanopore sequencing offers several advantages for the detection of unnatural bases. Most importantly, library preparation requires no PCR amplification of the DNA sample. Therefore, no information is lost prior to sequencing through potentially lower PCR amplification fidelity of the unnatural base pair. Another major advantage is that the sequencing process itself requires no additional chemistry. Other sequencing technologies such as 454, Sanger, Illumina and PacBio rely on polymerases that synthesize a DNA strand complementary to the template being sequenced. When a specially labeled nucleotide is incorporated, a detectable signal is emitted. This is problematic for sequencing DNA containing unnatural bases, because additional labeled nucleotides would be needed both for continuous strand synthesis and to produce a unique signal for the unnatural bases. Considering the development costs for this new chemistry, the necessary process adaptations and the increased complexity of data analysis, sequencing of orthogonal unnatural bases is unlikely to be feasible with these technologies. In contrast, nanopore sequencing needs no additional chemistry, and it is unlikely that sequencing will be interrupted by unnatural bases passing through the nanopore. In addition, nanopore sequencing was shown to be applicable to direct sequencing of RNA, without prior reverse transcription into cDNA (Garalde et al., 2016). It therefore promises to be suitable for transcription studies involving unnatural bases as well.
We aimed to examine whether Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. To this end, we sequenced different DNA samples containing either the unnatural bases isoG and isoCm or a natural base in the same sequence context to see whether the output signal differs significantly between these groups. Data processing and evaluation were performed with our own software iCG, which we developed specifically for analyzing Nanopore sequencing data of DNA containing unnatural bases. Our aim is to create a linear discriminant analysis model that can discriminate between isoG/isoCm and natural bases in a given neighboring sequence context of two bases upstream and two bases downstream of the position of interest. For a detailed description of how the software works, please refer to our Software page.
Reference Sample Preparation & Sequencing
To examine whether the unnatural bases isoG and isoCm can be distinguished from the natural bases by nanopore sequencing, five different DNA samples were prepared that differed only at a single sequence position, containing either an unnatural base or one of the four natural bases at this position of interest. For our experiments, we started by sequencing isoCm in the sequence context 5'‑AG\iCm\CC‑3' and, on the reverse strand, isoG in the sequence context 5'‑GG\iG\CT‑3'. For this purpose, we constructed five reference DNA samples with the following sequences:
Figure 9: Annealed oligonucleotides used for reference sample preparation.
Sequences at the position of interest of DNA samples used as references for nanopore sequencing.
Each reference DNA sample was prepared starting from a pair of complementary synthetic oligonucleotides. The oligos containing isoguanine or 5‑methyl isocytosine were synthesized by Biolegio, including subsequent purification by polyacrylamide gel electrophoresis (PAGE). Mass spectrometry and ultra performance liquid chromatography data (Figure 10) provided by Biolegio indicate that the concentrations of unmodified side products are below detection limits. In general, the manufacturer specifies the purity of PAGE-purified oligonucleotides containing modified bases to be greater than 95 %. The oligonucleotides containing exclusively natural bases were ordered from metabion and were purified by desalting.
The presence of the unnatural bases in the oligos from Biolegio and the correct sequence of the oligos containing exclusively natural bases were further confirmed by analysis with our Mutational Analysis Xplorer (M.A.X.). This orthogonal analysis approach reveals that the purity and sequence identity of the oligos were very high.
Figure 10: UPLC and MS data from oligos containing isoguanine or 5‑methyl isocytosine.
Oligonucleotides containing isoguanine or 5‑methyl isocytosine were synthesized by Biolegio and analysed by ultra performance liquid chromatography (UPLC) and mass spectrometry (MS). Shown are the UPLC (top) and MS (bottom) results for each of the complementary oligos containing either isoguanine (A) or 5‑methyl isocytosine (B).
For each DNA sample, a complementary pair of oligonucleotides was annealed and ligated into a plasmid backbone (BBa_K1465202) previously linearized by XbaI and BmtI. For this purpose, a leading BmtI and a trailing SpeI recognition site were included in the oligonucleotide sequences. After ligation, re-ligated backbone was linearized by digestion with XbaI. After consecutive digestion of double- and single-stranded linear DNA fragments with lambda exonuclease and E. coli exonuclease I, the DNA samples were linearized by EcoRV digestion and purified for sequencing library preparation. An individual library was prepared for each DNA sample according to the 1D Library Protocol for SQK-LSK108, starting from the end repair step.
Figure 11: Library preparation for Oxford Nanopore sequencing. Purification of DNA containing the unnatural base pair, after the adapter ligation step of library preparation for Oxford Nanopore sequencing.
Figure 12: MinION sequencer with R9.4 flowcell. The MinION sequencer that we used in our experiments, together with an R9.4 flowcell.
Figure 13: Status of the pore grid during sequencing. While sequencing, the MinKNOW software gives live feedback about the pores in the flowcell.
Sequencing was performed with a new FLO-MIN106 R9.4 flowcell on a MinION sequencer with the MinKNOW software version 22.214.171.124. Each DNA sample was sequenced individually, with intermediate washing steps using the Flow Cell Wash Kit EXP-WSH002. For every sample, a minimum of 260,000 reads was generated. Basecalling was performed locally with ONT albacore version 1.2.4.
Data Analysis with our Software Package iCG
After basecalling with albacore, the Nanopore data was analyzed with our own software. The scripts created in this process were combined into a software tool named iCG that can potentially be used for analyzing DNA containing arbitrary unnatural bases with Nanopore sequencing.
In the first step, the reads were filtered with iCG filter to identify reads that contain the region of interest and have a high basecalling quality. For the parameters minimum length, maximum length and minimum mean Phred qscore, the default settings of iCG filter were used. Of the remaining reads, only those containing the neighboring sequence context of 15 bases upstream and downstream of the position of interest (POI) were selected. The close sequence context (blur region) of 3 ±1 bases around the POI was not considered in this comparison, because influences of the unnatural bases may lead to unpredictable behavior of the basecaller there. Matching reads were allowed to contain a maximum of 2 mismatches, including indels. The maximum deviation in length was set to 1 base, and reads containing the region of interest multiple times were rejected. Additionally, the selected reads were filtered for a minimum mean quality score of 14 in this restricted sequence context and sorted by their strand orientation. For further information, please read more about iCG on our Software page.
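The filtering criteria described above can be illustrated with a minimal Python sketch. This is not the iCG filter implementation: all function names are hypothetical, the quality score is taken as a plain arithmetic mean of the Phred values, and mismatches are counted by position only, so indels and the ±1 base length tolerance are deliberately left out.

```python
def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality string."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def mismatches(a, b):
    """Number of differing positions between two equally long sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_region(read, upstream, downstream, blur_len, max_mismatches=2):
    """Return all start indices where both flanks match the read.

    The blur region between the flanks is skipped, so bases close to the
    position of interest never count as mismatches (simplified: no indels).
    """
    total = len(upstream) + blur_len + len(downstream)
    hits = []
    for start in range(len(read) - total + 1):
        up = read[start:start + len(upstream)]
        down_start = start + len(upstream) + blur_len
        down = read[down_start:down_start + len(downstream)]
        n_mismatch = mismatches(up, upstream) + mismatches(down, downstream)
        if n_mismatch <= max_mismatches:
            hits.append(start)
    return hits


def keep_read(read, quality, upstream, downstream, blur_len=3, min_q=14):
    """Keep a read if the region occurs exactly once with sufficient quality."""
    hits = find_region(read, upstream, downstream, blur_len)
    if len(hits) != 1:  # region missing or present multiple times
        return False
    start = hits[0]
    end = start + len(upstream) + blur_len + len(downstream)
    return mean_phred(quality[start:end]) >= min_q
```

In this sketch the flank sequences are passed in explicitly; for the experiments described here they would be the 15 bases on either side of the POI.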
Figure 14: Normalized signal traces of analyzed DNA samples.
Overlaid, normalized signal traces of DNA samples containing either isoG/isoCm or any natural base at the position of interest in the analyzed sequence context. The reads displayed in these plots were selected from their respective sequencing runs with iCG filter, using the same filter settings for all DNA samples. To remove contaminating reads from previous sequencing runs, the 70 % most deviating reads (a quantile of 0.7) were removed with iCG model prior to plotting.
Afterwards, iCG model was used to create linear discriminant models based on the filtered groups of template reads gathered by iCG filter. Different settings for the proportion of removed, deviating reads were tested. Figure 14 shows plots of the region of interest for both the forward and the reverse strand and all five template groups, with a quantile of 0.7 of the reads removed. For both strand orientations, the mean normalized signal trace differs distinctly between the sequences containing an unnatural base and those containing a natural base at the position of interest.
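The removal of deviating reads can be sketched as follows. This is a minimal illustration and not the iCG model code: it assumes that every read is available as an equally long list of normalized signal values around the position of interest, and it uses the summed absolute difference to the per-position median as the deviation measure, which is a choice made here for simplicity.

```python
from statistics import median


def remove_deviating_reads(traces, quantile=0.7):
    """Drop the given fraction of traces that deviate most from the median trace.

    traces: list of equally long lists of normalized signal values
    (hypothetical input format).
    """
    n_positions = len(traces[0])
    median_trace = [median(t[i] for t in traces) for i in range(n_positions)]

    def deviation(trace):
        # Assumed deviation measure: summed absolute distance to the median.
        return sum(abs(v - m) for v, m in zip(trace, median_trace))

    ranked = sorted(traces, key=deviation)
    n_keep = max(1, round(len(traces) * (1 - quantile)))
    return ranked[:n_keep]
```

With a quantile of 0.7, only the 30 % of reads closest to the group median remain, which is the setting used for the plots and models described here.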
Looking at the plot of the sample containing isoCm, the signal seems to be much noisier than in any other sample analyzed. Based on the quality control after synthesis by UPLC and MS and our orthogonal analysis with the M.A.X. system, it is highly unlikely that this result is caused by impurities of the oligos prior to sample preparation. The displayed data is the product of several data processing steps, including event generation during sequencing, basecalling, alignment to the reference sequence and normalization. One possible explanation for this result is therefore errors in this data processing pipeline as a consequence of strong deviations from the expected signals of natural bases. For example, if the basecaller misinterprets the signal in the sequence context of isoCm, the error is likely to propagate and cause problems during sequence alignment and time-dependent normalization. Alternatively, but less likely, the signal noise could be caused by effects of the sample preparation on isoCm. It has been shown that deoxy-isocytidine and deoxy-isocytidine triphosphate show a tendency for deamination, resulting in their respective uridine analogues (Switzer et al., 1993). Considering the alkaline conditions of the exonuclease reactions performed during sample preparation, the isoCm bases in the analyzed sample could be partly deaminated to methylated uracil, potentially influencing the signal in Nanopore sequencing.
Linear Discriminant Analysis
Based on the data presented in Figure 14, a cluster analysis based on linear discriminant analysis was conducted using iCG model. Figure 15 shows dot-plots for the forward and reverse models, presenting the linear discriminants of the reads each respective model was created with. The direct comparison of both models reveals that the model created upon the data of the reverse strand seems to perform better in terms of classification of the sequencing reads. Except for the groups containing A and G at the position of interest, which slightly overlap with each other, all other groups are well separated from each other. On the other hand, the linear discriminant analysis of the data of the forward strand was unable to properly separate the reads containing A, G and iCm from each other, mainly due to widely scattered reads of the iCm group. Both results coincide with the visual assessment of the signal traces in Figure 14.
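To illustrate the principle behind such a model, the following sketch fits a two-class Fisher linear discriminant in plain Python. It is a strong simplification of what is described above: iCG model separates five groups, whereas this example handles only two classes and two features per read. The feature values (for instance mean normalized currents at two positions near the position of interest), the function names and the equal-prior threshold are all assumptions made for illustration.

```python
def mean_vector(rows):
    """Mean of a list of 2-dimensional feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 scatter matrix of the rows around their mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def project(w, x):
    """Linear discriminant value of a feature vector."""
    return w[0] * x[0] + w[1] * x[1]


def fit_fisher_lda(class_a, class_b):
    """Return (w, threshold) with w = Sw^-1 (mu_b - mu_a)."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_b[0] - mu_a[0], mu_b[1] - mu_a[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold halfway between the projected class means (equal priors assumed).
    threshold = 0.5 * (project(w, mu_a) + project(w, mu_b))
    return w, threshold


def predict(model, x, labels=("A", "B")):
    """Assign the label of the class on whose side of the threshold x falls."""
    w, threshold = model
    return labels[1] if project(w, x) > threshold else labels[0]
```

A model fitted on two groups of training reads, for example `fit_fisher_lda(reads_g, reads_isog)` with hypothetical feature lists, can then classify a new read with `predict(model, features, labels=("G", "isoG"))`.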
Figure 15: Dot-plots of the linear discriminant models of the forward and reverse strand. Dot-plots of the linear discriminants of the reads used to create the statistical models for base prediction at the position of interest in the forward and reverse strand. The data used for the linear discriminant analysis was previously filtered by removing 70 % of the reads from each group, based on their deviation from the group's median signal in the neighboring sequence context of the position of interest.
Since a statistical model should not be tested with the very data it was created with, we prepared a new set of DNA samples to properly evaluate how well both models predict the base at the position of interest in their respective sequence context. For this purpose, we modified the RuBisCO plasmid that was used for the first sample preparation by cloning five different sequences downstream of RuBisCO with Standard BioBrick assembly. Each of these plasmids contains a unique 25 nt sequence, while the remaining plasmid sequence is the same. These unique sequences can be used to assign sequencing reads to their sample, comparable to the Nanopore barcoding approach. Starting with these five plasmids, we prepared new DNA samples according to the procedure explained above. After ligation, all five samples were pooled in approximately equimolar proportions and further prepared for sequencing. After sequencing and basecalling of this pooled sample, the reads were assigned to their respective group by using iCG filter with the "--barcode" argument and each plasmid's unique sequence. After filtering, 50 reads of every group were randomly selected to evaluate the performance of the linear discriminant models with iCG predict. The results of this evaluation are summarized in Figure 16.
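The assignment and sampling steps can be pictured with the following sketch. It is not how iCG filter handles its "--barcode" argument internally: the function names are hypothetical, reads are plain sequence strings, and barcodes are matched exactly, whereas a real tool would likely tolerate basecalling errors.

```python
import random


def assign_by_barcode(reads, barcodes):
    """Group reads by the single barcode sequence they contain.

    barcodes: dict mapping a group name to its unique sequence
    (exact substring match assumed for simplicity).
    """
    groups = {name: [] for name in barcodes}
    for read in reads:
        matches = [name for name, seq in barcodes.items() if seq in read]
        if len(matches) == 1:  # skip unassigned and ambiguous reads
            groups[matches[0]].append(read)
    return groups


def sample_test_reads(groups, n_per_group=50, seed=0):
    """Randomly draw up to n_per_group reads from every group."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return {
        name: rng.sample(reads, min(n_per_group, len(reads)))
        for name, reads in groups.items()
    }
```

Drawing the same number of reads from every group keeps the test set balanced, so each base is expected to account for an equal share of the predictions.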
Figure 16: Evaluation of the linear discriminant analysis models.
Evaluation results of the linear discriminant analysis models for the forward and reverse strand. (A) Linear discriminants of the test data colored in accordance with their respective base prediction. (B) Distribution of predicted bases. Based on the assumption that every read in the test data set was correctly assigned with the barcoding approach, equal portions of 20 % for each base would be ideal, corresponding to 50 reads per test data group. (C) Fidelity of base prediction, revealing which base predictions were made for the reads of each group individually.
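The quantities shown in panels (B) and (C) can be computed from paired lists of true and predicted labels, as in this short sketch. The function names are hypothetical and the label lists in the usage note are toy values, not results from our sequencing runs.

```python
from collections import Counter


def predicted_distribution(predicted_labels):
    """Fraction of all test reads assigned to each base (panel B)."""
    total = len(predicted_labels)
    return {base: n / total for base, n in Counter(predicted_labels).items()}


def prediction_fidelity(true_labels, predicted_labels):
    """Per true group, the fraction of reads assigned to each base (panel C)."""
    counts = {}
    for truth, pred in zip(true_labels, predicted_labels):
        counts.setdefault(truth, Counter())[pred] += 1
    return {
        truth: {pred: n / sum(c.values()) for pred, n in c.items()}
        for truth, c in counts.items()
    }
```

For example, with the toy input `true = ["A", "A", "A", "A", "iG", "iG"]` and `pred = ["A", "A", "A", "G", "iG", "iG"]`, three of the four reads of group A are predicted correctly, so the fidelity for A is 0.75.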
The results in Figure 16 indicate that the linear discriminant model for the reverse strand orientation performs better than the model for the forward strand. The base prediction fidelity is especially high for reads containing an adenine, a cytosine or an isoguanine at the position of interest. Due to the hydrolysis of isoCm to T and the tautomerisation of isoG, which leads to mispairing with T, the most common mutation that leads to a loss of the unnatural base pair between isoG and isoCm is the mutation from isoG to A (Bande et al., 2015). Considering the fidelity of base prediction for both A and isoG with the reverse strand model, we conclude that this linear discriminant analysis model is well suited for discriminating between isoG and all natural bases in the sequence context 5'-ggNct-3'. We could therefore show that the software package iCG is applicable to the analysis of experiments with unnatural bases.