Isoguanine & 5-methyl isocytosine in Nanopore Sequencing
Nanopore Sequencing
Oxford Nanopore Technologies' (ONT) sequencing technology offers great potential as a tool for the detection of unnatural bases in DNA. In ONT sequencing, protein nanopores are embedded in a synthetic membrane of high electrical resistance. When a voltage is applied across this membrane, an ionic current passes through each nanopore and is measured and recorded. If a biomolecule such as a protein, RNA or DNA is located inside the nanopore, the ionic current changes. These characteristic changes can be used to identify which molecule is passing through the nanopore (Feng et al., 2015). In this way, an algorithm called the "basecaller" is able to predict the nucleotide sequence of a single-stranded DNA or RNA molecule from the raw data recorded while it is pulled through a nanopore. Since the portable MinION sequencer became commercially available in 2015, basecalling accuracy has improved considerably. Even though the error rate is still high compared to other sequencing techniques, long reads of several kilobases are often advantageous for sequencing DNA that contains repetitive sequences or mobile genetic elements such as transposable elements (Debladis et al., 2017). More recently, efforts have been made towards the analysis of epigenetic information based on the identification of modified bases in nucleic acids with nanopore sequencing. For example, methylated cytosine was shown to be distinguishable from unmodified cytosine by training a hidden Markov model (Simpson et al., 2017).
Compared to other sequencing technologies, nanopore sequencing offers several advantages regarding the detection of unnatural bases. Most importantly, no PCR amplification of the DNA sample is needed during library preparation. This way, no information is lost prior to sequencing as a result of a potentially lower PCR amplification fidelity of the unnatural base pair. Another major advantage is that no additional chemistry is needed during sequencing. Other sequencing technologies such as 454, Sanger, Illumina and PacBio are based on polymerases that synthesize a DNA strand complementary to the template being sequenced. When a specially labeled nucleotide is incorporated, a detectable signal is emitted. This is problematic for sequencing DNA containing unnatural bases, as additional labeled nucleotides would be needed for continuous strand synthesis and to produce a unique signal for the unnatural bases. Considering the development costs for this new chemistry, the necessary process adaptations and the increased complexity of data analysis, the sequencing of orthogonal unnatural bases is unlikely to be feasible with these technologies. In contrast, nanopore sequencing requires no additional chemistry, and it is unlikely that sequencing will be interrupted by unnatural bases passing through the nanopore. On top of that, nanopore sequencing was shown to be applicable for direct sequencing of RNA, without prior reverse transcription into cDNA (Garalde et al., 2016). Therefore, it promises to be suitable for transcription studies involving unnatural bases too.
We aim to examine whether Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. To this end, we sequenced DNA samples containing either the unnatural nucleotides isoguanosine and 5‑methyl isocytidine or one of the natural bases in the same sequence context, to see whether the output signal differs significantly between these groups. Data processing and evaluation were performed with our own software iCG, which we developed specifically for analyzing Nanopore sequencing data of DNA containing unnatural bases. Our aim is to create a linear discriminant analysis model that is able to discriminate between isoG/isoCm and natural bases in the given neighboring sequence context of two bases upstream and two bases downstream of the position of interest. For a detailed description of how the software works, please refer to our software page.
Reference Sample Preparation & Sequencing
In order to examine whether the unnatural bases isoG and isoCm can be distinguished from the natural bases through nanopore sequencing, five different DNA samples were prepared that differed only at a single sequence position, containing either an unnatural base or one of the four natural bases at this position of interest. For our experiments, we started by sequencing isoCm in the sequence context 5'‑AG\iCm\CC‑3' and, on the reverse strand, isoG in the sequence context 5'‑GG\iG\CT‑3'. For this purpose, we constructed five reference DNA samples with the following sequences:
Each reference DNA sample was prepared starting from a pair of complementary synthetic oligonucleotides. The oligos containing isoguanine or 5‑methyl isocytosine were synthesized by Biolegio, including subsequent purification through polyacrylamide gel electrophoresis (PAGE). Mass spectrometry and ultra performance liquid chromatography data (Figure 1) provided by Biolegio indicate that the concentrations of unmodified side products are below detection limits. In general, the manufacturer specifies the purity of PAGE-purified oligonucleotides containing modified bases to be greater than 95 %. The oligonucleotides containing exclusively natural bases were ordered from metabion and were purified by desalting.
The presence of the unnatural bases in the oligos from Biolegio and the correct sequence of the oligos containing exclusively natural bases were further confirmed by analysis with our Mutational Analysis Xplorer (M.A.X.). This orthogonal analysis approach reveals that the purity and sequence identity of the oligos were very high.
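As a small consistency check on the two sequence contexts named earlier in this section, the following minimal sketch computes the reverse complement of the isoCm context while treating isoG and isoCm as a base pair. The token representation and the names `COMPLEMENT` and `reverse_complement` are illustrative assumptions and are not part of iCG.

```python
# Bases are handled as tokens so that multi-character names such as "iCm"
# and "iG" can sit next to the natural bases.
COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "iG": "iCm", "iCm": "iG",  # unnatural base pair described in the text
}


def reverse_complement(bases):
    """Return the reverse complement of a list of base tokens."""
    return [COMPLEMENT[base] for base in reversed(bases)]


forward_context = ["A", "G", "iCm", "C", "C"]  # 5'-AG iCm CC-3'
reverse_context = reverse_complement(forward_context)
print(" ".join(reverse_context))  # prints "G G iG C T"
```

The result matches the isoG context 5'‑GG\iG\CT‑3' given for the reverse strand.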
For each DNA sample, a complementary pair of oligonucleotides was annealed and ligated into a plasmid backbone (BBa_K1465202) previously linearized by XbaI and BmtI. For this purpose, a leading BmtI and a trailing SpeI recognition site were included in the oligonucleotide sequences. After ligation, re-ligated backbone was linearized by digestion with XbaI. After consecutive digestion of double- and single-stranded linear DNA fragments with lambda exonuclease and E. coli exonuclease I, the DNA samples were linearized through EcoRV digestion and purified for sequencing library preparation. An individual library was prepared for each DNA sample according to the 1D Library Protocol for SQK-LSK108, starting from the end repair step.
Sequencing was performed with a new FLO-MIN106 R9.4 flow cell on a MinION sequencer with MinKNOW software version 1.7.14.1. Each DNA sample was sequenced individually, with intermediate washing steps using the Flow Cell Wash Kit EXP-WSH002. For every sample, a minimum of 260,000 reads was generated. Basecalling was performed locally with ONT albacore version 1.2.4.
Data Analysis with our iCG Software Package
After basecalling with albacore, the Nanopore data was analyzed with our own software. The scripts created in this process were combined into a software tool named iCG, which can potentially be used for the analysis of DNA containing arbitrary unnatural bases with Nanopore sequencing.
In the first step, the reads were filtered with iCG filter in order to identify reads that contain the region of interest and have a high basecalling quality. For the parameters minimum length, maximum length and minimum mean quality score (qscore), the default argument settings of iCG filter were used. Of the remaining reads, only those containing the neighboring sequence context of 15 bases upstream and downstream of the position of interest (POI) were selected, without considering the close sequence context (blur region) of 3 ±1 bases around the POI, where influences of the unnatural bases may lead to unpredictable behavior of the basecaller. The matching reads were allowed to contain a maximum of 2 mismatches, including indels. The maximum deviation in length was set to 1 base, and reads containing the region of interest multiple times were rejected. Additionally, the selected reads were filtered for a minimum mean quality score of 14 in this restricted sequence context and sorted by their strand orientation. For further information about iCG, please refer to our Software page.
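The filtering logic described above can be illustrated with a minimal sketch. It is not the implementation of iCG filter: the function names, the `gap_len` parameter and the short example flanks are hypothetical, only substitutions are counted as mismatches (the real filter also counts indels), and the mean quality score is assumed to be derived from the mean error probability.

```python
import math


def hamming(a, b):
    """Count positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_regions(read, up, down, gap_len, max_mismatches=2, max_len_dev=1):
    """Return (start, end) spans where both flanks match around a gap.

    The gap (POI plus blur region) is not compared at all; only its length
    is allowed to deviate from gap_len by max_len_dev bases.
    """
    hits = []
    for start in range(len(read) - len(up) + 1):
        mm_up = hamming(read[start:start + len(up)], up)
        if mm_up > max_mismatches:
            continue
        for gap in range(gap_len - max_len_dev, gap_len + max_len_dev + 1):
            d_start = start + len(up) + gap
            d_end = d_start + len(down)
            if d_end > len(read):
                continue
            if mm_up + hamming(read[d_start:d_end], down) <= max_mismatches:
                hits.append((start, d_end))
                break  # at most one hit per upstream start position
    return hits


def mean_qscore(quals):
    """Mean Phred score computed via the mean error probability (assumption)."""
    mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)


def keep_read(read, quals, up, down, gap_len, min_q=14):
    """Keep reads with exactly one region of interest of sufficient quality."""
    hits = find_regions(read, up, down, gap_len)
    if len(hits) != 1:  # region missing or present multiple times
        return False
    start, end = hits[0]
    return mean_qscore(quals[start:end]) >= min_q


# Toy example with 6-base flanks instead of the 15 bases used in the text.
read = "GGACGTACCATTTGCAAGG"
print(find_regions(read, "ACGTAC", "TTGCAA", gap_len=3))
print(keep_read(read, [20] * len(read), "ACGTAC", "TTGCAA", gap_len=3))
```

Ignoring the gap content mirrors the idea of the blur region: a read is selected on the basis of its well-behaved flanks, not on what the basecaller emitted near the unnatural base.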
Afterwards, iCG model was used to create linear discriminant models based on the filtered groups of template reads gathered by iCG filter. Different settings for the proportion of deviating reads to be removed were tested. Figure 5 shows plots of the region of interest for both the forward and the reverse strand and all five template groups, with a quantile of 0.7 removed reads. For both strand orientations, there is a distinct difference in the mean normalized signal trace between the sequences containing an unnatural base and those containing a natural base at the position of interest.
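The removal of deviating reads before averaging can be sketched as follows. This is only one possible reading of "a quantile of 0.7 removed reads": the sketch assumes that the 70 % of traces deviating most from the position-wise median trace are discarded. The function names and the squared-distance criterion are assumptions, not the documented behavior of iCG model.

```python
from statistics import mean, median


def trim_deviating(traces, removed_quantile=0.7):
    """Drop the fraction of traces that deviate most from the median trace."""
    n_pos = len(traces[0])
    med = [median(t[i] for t in traces) for i in range(n_pos)]

    def deviation(trace):
        # Squared Euclidean distance to the position-wise median trace.
        return sum((x - m) ** 2 for x, m in zip(trace, med))

    ranked = sorted(traces, key=deviation)
    n_keep = max(1, round(len(ranked) * (1 - removed_quantile)))
    return ranked[:n_keep]


def mean_trace(traces):
    """Position-wise mean over a list of equally long signal traces."""
    return [mean(column) for column in zip(*traces)]


# Toy data: seven identical traces plus three deviating ones.
traces = [[1.0, 2.0, 3.0]] * 7 + [[9.0, 9.0, 9.0], [5.0, 5.0, 5.0], [1.1, 2.0, 3.0]]
kept = trim_deviating(traces, removed_quantile=0.7)
print(len(kept), mean_trace(kept))
```

Using the median as the reference keeps a few strongly deviating reads from shifting the trace against which all reads are ranked.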
Looking at the plot of the sample containing isoCm, the signal seems to be much noisier than that of any other sample analyzed. Based on the quality control after synthesis by UPLC and MS and our orthogonal analysis with the M.A.X. system, it is highly unlikely that this result is caused by impurities of the oligos prior to sample preparation. Keeping in mind that the displayed data is the product of several steps of data processing, including event generation during sequencing, basecalling, alignment to the reference sequence and normalization, one possible explanation for this result is errors in this data processing pipeline as a consequence of strong signal deviations from the expected signals of natural bases. For example, if the basecaller misinterprets the signal in the sequence context of isoCm, the error is likely to propagate and cause problems during sequence alignment and time-dependent normalization. Alternatively, but less likely, the signal noise could be caused by effects of the sample preparation on isoCm. It has been shown that deoxy-isocytidine and deoxy-isocytidine triphosphate show a tendency for deamination, resulting in their respective uridine analogues (Switzer et al., 1993). Considering the alkaline reaction conditions of the exonuclease reactions performed during sample preparation, the isoCm bases in the analyzed sample could be partly deaminated to methylated uracil, potentially influencing the signal in Nanopore sequencing.
Linear Discriminant Analysis
Based on the data presented in Figure 5, a cluster analysis based on linear discriminant analysis was conducted using iCG model. Figure 6 shows dot plots for the forward and reverse models, presenting the linear discriminants of the reads each respective model was created with. The direct comparison of both models reveals that the model created from the data of the reverse strand seems to perform better in terms of classification of the sequencing reads. Except for the groups containing A and G at the position of interest, which slightly overlap with each other, all groups are well separated from each other. In contrast, the linear discriminant analysis of the forward strand data was unable to properly separate the reads containing A, G and iCm from each other, mainly due to widely scattered reads of the iCm group. Both results coincide with the visual assessment of the signal traces in Figure 5.
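To make the classification step more concrete, here is a minimal textbook sketch of a linear discriminant classifier with one covariance matrix pooled over all classes. It does not reproduce iCG model: the function names, the two-dimensional toy features and the class labels are assumptions chosen for illustration, whereas the real features would be normalized signal values from the region of interest.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_lda(X, y):
    """Fit class-wise linear discriminant functions (shared covariance)."""
    classes = sorted(set(y))
    d, n = len(X[0]), len(X)
    means, priors = {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        mu = [sum(col) / len(rows) for col in zip(*rows)]
        means[c], priors[c] = mu, len(rows) / n
        for x in rows:
            diff = [a - m for a, m in zip(x, mu)]
            for i in range(d):
                for j in range(d):
                    pooled[i][j] += diff[i] * diff[j]
    pooled = [[v / (n - len(classes)) for v in row] for row in pooled]
    model = {}
    for c in classes:
        w = solve(pooled, means[c])  # w = Sigma^-1 * mu_c
        bias = -0.5 * sum(wi * mi for wi, mi in zip(w, means[c])) + math.log(priors[c])
        model[c] = (w, bias)
    return model


def predict(model, x):
    """Return the class with the highest linear discriminant score."""
    def score(c):
        w, bias = model[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + bias
    return max(model, key=score)


# Toy training data: three well-separated groups in a 2D feature space.
X = [[0, 0], [1, 0], [0, 1], [1, 1],
     [5, 5], [6, 5], [5, 6], [6, 6],
     [10, 0], [11, 0], [10, 1], [11, 1]]
y = ["A"] * 4 + ["G"] * 4 + ["iG"] * 4
model = fit_lda(X, y)
print(predict(model, [0.2, 0.3]), predict(model, [5.4, 5.9]), predict(model, [10.2, 0.8]))
```

Groups that overlap in this feature space, as described for A and G, would show up as reads whose discriminant scores for two classes are nearly equal.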
Since a statistical model should not be tested with the very data it was created with, we prepared a new set of DNA samples to properly evaluate the performance of both models concerning the prediction of bases at the position of interest in their respective sequence context. For this purpose, we modified the RuBisCo plasmid that was used for the first sample preparation by cloning five different sequences downstream of RuBisCo with standard BioBrick assembly. Each of these plasmids contains a unique 25 nt sequence, while the remaining plasmid sequence is the same. These unique sequences can be used to assign sequencing reads to their sample, comparable to the Nanopore barcoding approach. Starting with these five plasmids, we prepared new DNA samples according to the procedure explained above. After ligation, all five samples were pooled in approximately equimolar proportions and further prepared for sequencing. After sequencing and basecalling of this pooled sample, the reads were assigned to their respective group by using iCG filter with the "--barcode" argument and each plasmid's unique sequence. After filtering, 50 reads of every group were randomly selected in order to evaluate the performance of the linear discriminant models with iCG predict. The results of this evaluation are summarized in Figure 7.
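The evaluation procedure described above (assigning reads by their unique sequence, drawing a fixed number of reads per group and tabulating predictions) can be sketched as follows. The function names, the exact-match rule and the short example sequences are hypothetical; the real unique sequences are 25 nt long, and the matching rules of iCG filter may be more tolerant.

```python
import random
from collections import Counter


def assign_group(read, barcodes):
    """Return the one group whose unique sequence occurs in the read, else None."""
    matches = [name for name, seq in barcodes.items() if seq in read]
    return matches[0] if len(matches) == 1 else None


def sample_per_group(reads_by_group, n, seed=0):
    """Randomly draw n reads from every group that has at least n reads."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return {group: rng.sample(reads, n)
            for group, reads in reads_by_group.items() if len(reads) >= n}


def confusion(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical, shortened unique sequences for two of the five groups.
barcodes = {"A": "ACGTTGCA", "iG": "GGATCCTA"}
print(assign_group("TTTACGTTGCATTT", barcodes))

subset = sample_per_group({"A": list(range(100))}, n=50)
print(len(subset["A"]))

table = confusion(["A", "A", "iG"], ["A", "G", "iG"])
print(table[("A", "A")], table[("A", "G")], table[("iG", "iG")])
```

Reads that match none or several of the unique sequences are left unassigned, so that an ambiguous read cannot count towards the wrong group.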
The results in Figure 7 indicate that the linear discriminant model for the reverse strand orientation performs better than the model for the forward strand. The base prediction fidelity is especially high for reads containing an adenine, a cytosine or an isoguanine at the position of interest. Due to the hydrolysis of isoCm to T and the tautomerisation of isoG, which leads to mispairing with T, the most common mutation leading to a loss of the unnatural base pair between isoG and isoCm is the mutation from isoG to A (Bande et al., 2015). Considering the fidelity of base prediction for both A and isoG with the reverse strand model, we conclude that this linear discriminant analysis model is well suited for the discrimination between isoG and all natural bases in the sequence context 5'-ggNct-3'. Therefore, we could show that the software package iCG is applicable for the analysis of experiments