iCG
Background
In ONT's sequencing technology, the ionic current that passes through a nanopore is not measured for every nucleotide individually. To cope with the influence of random fluctuations of the ionic current, the background noise, the signal is evaluated for a sequence of approximately 5 nucleotides. Every such k‑mer (in this case every possible combination of 5 successive nucleotides) produces a characteristic signal that is evaluated by the basecaller to predict the sequence of the analyzed DNA strand. Therefore, the signal trace of a certain nucleotide is not the same at every position where it occurs in the DNA strand, but is dependent on its sequence context. This has two interesting effects that have to be considered regarding ONT sequencing of unnatural bases: Firstly, the signal of the close sequence context has to be considered too when analyzing if an unnatural base has a significant effect on the ionic current flow. Secondly, if an unnatural base indeed significantly influences the signal, it will have an impact on the signal traces surrounding the position where it is occurring in the DNA. Therefore, the basecalling may fail for natural bases surrounding this positon.
As a consequence of these considerations, a basecaller that shall be able to differentiate between natural bases and unnatural bases at any sequence position would need to be trained with signals of every possible k‑mer that is a combination of natural and unnatural bases. As the number of possible k‑mers given n possibilities is nk, 6752 new k-mers would have to be analyzed, which is beyond our means. Nevertheless, since unnatural base pairs are mostly used at well-known locations, e.g. to expand the genetic code in in vitro or in vivo experiments, it is sufficient to analyze only certain k‑mers that are of interest. The comparison to k‑mers containing all natural bases provides a foundation for a software that differentiates between unnatural and natural base pairs at a given position of interest (POI). This is the approach for our software tool.
Overview
The software package iCG is a collection of three python scripts for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural base pairs. The first script, iCG filter, is used to filter basecalled Nanopore data by certain criteria like read length and quality. Its fundamental function is to select sequencing reads that contain the position of interest, where the unnatural base pair is subjected to be, irrespective of its presence. The second scipt, iCG model, creates a statistical model for the detection of unnatural base pairs in sequencing reads based on linear discriminant analysis of ONT data previously filtered by iCG filter. The third script, iCG predict, uses a model created by iCG model to predict which base pair is present at the position of interest for a set of filtered sequencing reads.
Dependencies & Installation
Our software works with basecalled Nanopore data. For Nanopore sequencing, we used the ONT software MinKNOW version 1.7.14.1 in combination with FLO-MIN106 R9.4 flowcells on a MinION sequencer. Other combinations of hardware and consumables, such as 1D2 flowcells, were not tested in combination with this software package and will probably require for small adjustments. Basecalling has to be performed with the ONT basecaller Albacore.
The iCG software package has some dependencies that need to be installed before it can be used. First of all,
HDF5 needs to be installed. HFD, standing for Hierarchical Data Format, is a format for storing and retrieving large data sets from/to files and is the format in which Nanopore data is stored. For building and installing from sources, please refer to the distributors website. For installation from depository:
> sudo apt-get install libhdf5-serial-dev
Additionally, the alignment and mapping software package BWA‑MEM (Li, 2013) needs to be installed. For additional help regarding the installation process, please refer to their
sourceforge or
github website.
> git clone https://github.com/lh3/bwa.git
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
Furthermore, the software environment R for statistical computing and graphics needs to be installed. For a system specific installation, please refer to the
r-project website. For statistical computations and plotting, some additional R packages need to be installed. After the R environment is installed, please print the following commands in an R console:
> install.packages("ggplot2")
> install.packages("cowplot")
> install.packages("MASS")
iCG itself is written in python2.7 code. Therefore, a python2.7 environment needs to be installed. Please follow the instructions on
python.org for a system specific installation. Afterwards, use the python package managment tool
pip for the installation of required python packages. For that, open a terminal and execute the following commands:
> pip2.7 install regex
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]
With all dependencies installed, download the latest version of iCG together with example data sets in the
download section of this page. To learn about how to use iCG, please continue reading or start right ahead by executing the following command in a terminal, having navigated to the folder where the downloaded code is located:
python2 iCG.py -h
In order to test if the installation was successful, please consider executing the following commands one after the other:
> python2 iCG.py filter iCG_test/UBP iCG_test/A iCG_test/G iCG_test/T iCG_test/C -r iCG_test/reference.fasta --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE
> python2 iCG.py model iCG_test/results -UBP iCG_test/UBP/filtered -AT iCG_test/A/filtered -GC iCG_test/G/filtered -TA iCG_test/T/filtered -CG iCG_test/C/filtered --UBPsense iCme --UBPantisense iG
> python2 iCG.py predict iCG_test/results/antisense_0.3.model iCG_test/results/antisense_test_reads --dst_dir results/prediction
iCG filter
Description
The command iCG filter is executed on basecalled Nanopore data before any further analysis can be performed with iCG. It uses the information generated by the ONT basecaller albacore to filter a set of sequencing reads for sufficient quality, length and the existence of the position of interest. After the reads have been filtered, they are divided into sense and antisense reads regarding their orientation relative to a given reference sequence. Afterwards, the reads are "re-squiggled" using nanoraw (Stoiber, 2016), meaning that every read is aligned to the given reference sequence in order to correct bascalling errors and to normalize the raw read signals in terms of duration and amplitude. This filtering procedure is mandatory for the subsequent analysis with both iCG model and iCG predict.
For subsequent analysis, it is particularly important that the filtering process is performed irrespective of the position of interest. The potential influence of the unnatural bases on the sequencing signal may lead to bascalling misbehavior which in turn could lead to falsified analysis results if the position of interest and the neighboring sequence context is considered during filtering. Therefore, the presence of the region of interest in a read sequence is checked by regular expression pattern matching, omitting the immediate sequence context of the position of interest, called "blur region". Figure 2 illustrates how the position of interest, the region of interest, the blur region and the regular expression are connected with each other.
Every aspect of this pattern matching approach is adoptable by optional arguments passed to iCG on execution, including the size of the region of interest, the size of the blur region and the amount of mismatches allowed during pattern matching. Additional parameters define whether insertions and deletions (indels) are allowed and how much the length of a match is allowed to deviate from the length of the region of interest as a consequence of indels.
In addition to pattern matching, the defined region of interest is also used for evaluating the read quality in the neighboring sequence context of the position of interest. Assuming that the quality in the blur region is high if the quality of its surrounding sequence context is high too, this criteria can be especially assessed by setting the --min_mean_context_qscore parameter accordingly.
Usage & Output
As a positional argument, the script needs to be handed a non-empty list of albacore base directories containing basecalled reads to be filtered. Additional required arguments that need to be specified are the path to the bwa mem executable and the fasta file containing the reference sequence the reads shall be mapped to. A basic, abstract program execution looks as follows:
python2 iCG.py filter PATH/TO/ALBACORE/DIR_1 PATH/TO/ALBACORE/DIR_2 --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE -r PATH/TO/FASTA
All specified directories are subsequently filtered using the same filter settings. These settings can be modified by specifying optional arguments. The results of the program execution are saved in a new directory inside the albacore base directory named "filtered" by default. In addition to the filtered reads, sorted into subdirectories named "sense" and "antisense", the destination folder will contain a file called settings.txt holding all specified filtering settings. Additionally, a directory plots is created on default containing plots of the normalized signal traces at the region of interest for both the sense and the antisense strand orientation.
During program execution, iCG filter prints useful information regarding current steps of the filtering process, revealing how the result maybe optimized. The following is an example for the console output created by iCG filter:
filtering for minimal mean quality of 10.0
67402 removed (40.81 % of remaining)
97739 remaining (59.19 % of total)
filtering for minimal length of 1887.75
removed 91117 (93.22 % of remaining)
6622 remaining (4.01 % of total)
filtering for maximal length of 2517
removed 1330 (20.08 % of remaining)
5292 remaining (3.20 % of total)
adding fastq information
filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}
deviations in sequence length of remaining sequences (* selected):
sense antisense
-5: 6 3
-4: 9 9
-3: 34 10
*-2: 71 32
*-1: 135 94
* 0: 875 705
* 1: 84 47
* 2: 44 24
3: 14 12
4: 1 1
5: 4 0
------ ------
* 1209 902
removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
2111 remaining (1.28 % of total)
filtering for context quality higher than 14.0 (* selected)
sense antisense
7-8: 0 1
8-9: 3 1
9-10: 12 5
10-11: 18 6
11-12: 38 24
12-13: 70 41
13-14: 108 82
*14-15: 142 101
*15-16: 193 128
*16-17: 155 109
*17-18: 167 116
*18-19: 127 95
*19-20: 84 78
*20-21: 56 54
*21-22: 25 29
*22-23: 7 19
*23-24: 3 10
*24-25: 1 1
*25-26: 0 2
------ ------
* 960 742
removed 409 (19.37 % of remaining)
1702 remaining (1.03 % of total)
Further help regarding the usage of iCG filter is given in the
help pages that are accessible by passing the argument
-h or
--help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE
[--dst_dir_name DST_DIR_NAME]
[--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
[--blur_deviation BLUR_DEVIATION]
[--context_deviation CONTEXT_DEVIATION] [--no_indels]
[--max_length_deviation MAX_LENGTH_DEVIATION]
[--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
[--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
[--normalization_type {median,pA,pA_raw,none}]
[--processes PROCESSES] [--plot_poi]
[--disp_bases DISP_BASES] [--barcode BARCODE]
[--barcode_deviation BARCODE_DEVIATION]
[basecalled_dirs [basecalled_dirs ...]]
This sub-script takes a set of directories containing basecalled Nanopore reads as
input and filters the contained reads by several criteria given as arguments
to this script. Afterwards, the remaining reads are resquigled by nanoraw and
copied to a new folder inside the basecalling directory.
positional arguments:
basecalled_dirs
Paths to directories containing basecalled reads that shall be filtered.
optional arguments:
-h, --help
show this help message and exit
--bwa_mem_exe BWA_MEM_EXE
Required argument. Path to the bwa_mem executable.
-r REFERENCE, --reference REFERENCE
Required argument. Path to a fasta file holding a single sequence as a reference for mapping reads for resquiggling.
--dst_dir_name DST_DIR_NAME
Name of the destination folder. Default: filtered
--min_mean_qscore MIN_MEAN_QSCORE
The minimal mean phred qscore of all basecalles of a
single read, representing the overall quality of the
read in assessment. Default:10.0
--min_length MIN_LENGTH
The minimal length of reads. The default value is 0.75
the length of the reference sequence.
--max_length MAX_LENGTH
The maximal length of reads. The default value is the
length of the reference sequence.
--radius RADIUS
The number of bases upstream and downstream of the
position of interest which form the region of interest
(ROI). In combination with the blur radius, the ROI
defines the sequence for the regular expression search
used for deciding weather a read is covering the
position of interest. Default:15
--blur BLUR
The number of bases upstream and downstream of the
position of interest which are excluded from the
regular expression search used for deciding weather a
read is covering the position of interest. This is
necessary to to the possibile influence that an
unnatural base might have on the signal trace of its
surrounding bases. Default: 3
--blur_deviation BLUR_DEVIATION
The amount of bases the blur region is allowed to vary
in length. Default: 1
--context_deviation CONTEXT_DEVIATION
The amount of deviations, both indels and mismatches,
the upstream and downstream sequence contexts of the
region of interest is allowed to have in the regular
expression search. Default:2
--no_indels
If specified, indels are not accepted in the regular
expression search.
--max_length_deviation MAX_LENGTH_DEVIATION
The maximum deviation in length between the region of
interest in the reference sequence and the read
sequence, taking both the blur deviation and the
context deviation into consideration.
--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE
The minimal mean phred qscore of basecalls in the
region of interest, excluding the blurred radius
around the position of interest. Default: 2
--greedy_regex_search
If specified, the regular expression search is greedy
meaning that overlapping hits are impossible. Faster,
but generally not recommended.
--cpts_limit CPTS_LIMIT
cpts parameter for nanoraw genome_resquiggle. Default:
100
--normalization_type {median,pA,pA_raw,none}
Type of normalization used in the nanoraw
genome_resquiggle script for calculating statistics
based on new segmentation of raw signal.
--processes PROCESSES
Number of processes for nanoraw genome_resquiggle.
Default: 4
--plot_poi
If specified, a plot of the resquiggled signal traces
of the filtered reads is created and saved as svg.
--disp_bases DISP_BASES
Size of displayed region in number of bases when
plotting the signal traces of the input data at the
region of interest. Default: 21
--barcode BARCODE
If specified, only reads containing the given barcode
string pass the filtering process.
--barcode_deviation BARCODE_DEVIATION
The amount of deviations, both indels and mismatches,
a potential regular expression match against the
defined barcode is allowed to have. Default: 3
iCG model
Description
The script iCG model creates a linear discriminant analysis model to discriminate between natural bases and unnatural bases in the sequence context that is present in the DNA samples used to compute the model. As an input, the script needs paths to directories containing sequencing data filtered by iCG filter: One directory containing data of a DNA sample with the UBP being analyzed, and at least one containing natural bases at the position of interest. For these datasets, the program identifies the exact location of the position of interest in the raw data of every read, taking into consideration the alignment data previously produced by nanoraw. Then, the mean raw signal in the normalized intervals from two bases upstream to two bases downstream of the position of interest is calculated for every read. Figure 4 gives an example of how this processing is performed.
With this step, a template dataset consisting of entries for every read in every directory passed to iCG model is created. Each entry holds the mean signal for the five base positions at the position of interest and a group identifier indicating to which input directory the read belongs. The script then applies linear discriminant function analysis to this template dataset in order to create an equation system that classifies the different groups of data based on the mean signals at the five positions. The linear discriminant function is provided by the MASS package (Venables and Ripley, 2002) for the software environment R (R Core Team, 2017). This analysis method provides the linear combinations of the five mean values which best discriminates between reads of different groups, mainly by maximizing variance between reads of different groups and minimizing variance of reads of the same group (see Fisher, 1936, for this approach). By applying the resulting equation system to the very dataset the model was created with, a graphical representation of the classification is attainable. The number of dimensions is depending on the number of linear discriminants needed to discriminate all groups from each other. In most cases, two linear discriminants are sufficient, resulting in a two-dimensional dot-plot with one axis for each linear discriminant.
This approach is best described by taking a closer look at an example. Let us take aside the unnatural bases for a moment and consider that we wanted to create a model that is able to discriminate between the natural bases A, G, T and C at the position of interest in the sequence context presented in
figure 2. First of all, we take sequencing data of four different DNA samples, each one having a different natural base at the position of interest. With the help of iCG filter, the intividual sets of sequencing data are filtered for reads containing the region of interest. Now, the paths of each folder containing one of the four groups are passed to iCG model.
Figure 4 shows the group means of the five positions (mean_0 to mean_5), the coefficients of three linear discriminants resulting from the linear discriminant analysis, and a two dimensional dot-plot showing the linear discriminants of the template dataset.
One great challenge regarding the creation of a good statistical model is the quality of the template data used for the statistical analysis. Ideally, the template dataset should only contain reads that can be assigned to their respective group with absolute certainty. Every read that is assigned wrong will prevent the algorithm performing the linear discriminant analysis from finding liner coefficients that effectively diverge the groups from each other. Unfortunately, Nanopore sequencing has the disadvantage that samples are sequenced one after the other. Even though a washing step is performed between consecutive sequencing runs, parts of the previous libraries remain inside the flow cell and will procude reads. According to Oxford Nanopore technologies, the portion of contaminating reads from previous runs is up to 10 % when washing with the Flow Cell Wash kit EXP-WSH002, depending on the number of previously loaded libraries (see
EXP-WSH002). In our experiments, we found the degree of contamination to be even higher for some samples.
To face this challenge and to improve the overall quality of the statistical models created with iCG, we implemented an additional filtering algorithm into iCG model that recursively removes reads that are deviating the most from the median signal in the neighboring sequence context of the position of interest. This filtering step is performed on every set of reference reads individually. For deciding which reads are removed, the positions from -2 to +2 bases in relation to the position of interest are evaluated. By setting the parameter --quantiles_to_remove accordingly, the influence of this filtering process can be observed by taking a look at the plots of the signal traces, which are automatically produced and saved by iCG. Due to the recursive nature of the function, which calculates new median values after every single read that has been removed, the algorithm is hardly influencing the reads that truly belong to the reference sample being sequenced, as long as the contamination is lower than 50 %. Figure 5 illustrates the effect of this filtering step on the signal traces of a sequencing run with relatively high contamination. Figure 6 gives an example of the effect on the linear discriminant analysis.
Usage & Output
As a positional argument, the script needs a path to a folder where the results shall be saved. Additionally, the paths to directories containing reads filtered by iCG filter need to be passed to iCG model. At least the directory containing filtered reads with the unnatural base pair and one arbitrary natural base pair is required for program execution. A basic, abstract program execution looks as follows:
python2 iCG.py model PATH/TO/RESULTS -UBP PATH/TO/FILTERED/UBP-READS -NN PATH/TO/OTHER/FILTERED/READS
In the destination folder, ".model" files containing the statistical model of the linear discriminant analysis are saved. These can be passed to iCG predict for base prediction of new Nanopore sequencing data. Dot-plots and statistics for the linear discriminant models are saved in subdirectories named plots and stats accordingly. Furthermore, plots of the signal traces at the position of interest are created for each group of reads used for model creation.
Further help regarding the usage of iCG model is given in the
help pages that are accessible by passing the argument
-h or
--help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]
[-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
[-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
[--print_group_characteristics] [--no_data_plots]
[--disp_bases DISP_BASES]
[-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
[--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
[--UBantisense UBANTISENSE]
dst_dir
This sub-script creates a linear discriminant analysis model to
discriminate between unnatural bases and natural bases at a given position of
interest in a reference sequence. Therefore, the script is provided with paths
to directories containing reads that were previously filtered and resquiggled
by the script "iCG filter".
positional arguments:
dst_dir
Path to destination folder where all created results shall be saved.
optional arguments:
-h, --help
show this help message and exit
-UBP UBP_FLTRD_FAST5_DIR, --UBP_fltrd_fast5_dir UBP_FLTRD_FAST5_DIR
Required argument. Relative or absolute path to the
directory containing filtered .fast5 files of reads
with the unnatural base pair at the position of
interest being analyzed.
-AT AT_FLTRD_FAST5_DIR, --AT_fltrd_fast5_dir AT_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing
filtered .fast5 files of reads with an A in the sense
strand and a T in the antisense strand at the position
of interest.
-GC GC_FLTRD_FAST5_DIR, --GC_fltrd_fast5_dir GC_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing
filtered .fast5 files of reads with an G in the sense
strand and a C in the antisense strand at the position
of interest.
-TA TA_FLTRD_FAST5_DIR, --TA_fltrd_fast5_dir TA_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing
filtered .fast5 files of reads with an T in the sense
strand and a A in the antisense strand at the position
of interest.
-CG CG_FLTRD_FAST5_DIR, --CG_fltrd_fast5_dir CG_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing
filtered .fast5 files of reads with an C in the sense
strand and a G in the antisense strand at the position
of interest.
-NN NN_FLTRD_FAST5_DIR, --NN_fltrd_fast5_dir NN_FLTRD_FAST5_DIR
Relative or absolute path to a directory containing
filtered .fast5 files of reads with any natural bases
in the sense strand and antisense strand at the
position of interest.
--print_group_characteristics
If specified, detailed statistical information about the position of interest of each given group of
reads is stored in their respective folders. This information includes called bases as well as mean and median normalized signals with their respective deviations for the position of interest as well as for the two positions upstream and downstream.
--no_data_plots
If specified, no plots are created showing the signal
traces of the input data at the region of interest.
--disp_bases DISP_BASES
Size of displayed region in number of bases when
plotting the signal traces of the input data at the
region of interest. Default: 21
-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]], --quantiles_to_remove [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]
Specifies the quantile of reads being filtered out for
deviating the most from the median signal in the
neighboring sequence context of the position of
interest. Helpful for removing reads that originate
from previously sequenced libraries. Default: 0.3
--no_model_plots
If specified, no plots are created showing the
distribution and classification of reads by the linear
discriminant analysis model.
--no_model_stats
If specified, no stats are saved giving information
about the quality of classification of the linear
discriminant model with respect to the input data.
--UBsense UBSENSE
short identifier for the unnatural base in the sense
strand. Default: UB(+)
--UBantisense UBANTISENSE
short identifier for the unnatural base in the
antisense strand. Default: UB(-)
iCG predict
Description
Based on a linear discriminant analysis model previously created by iCG model, iCG predict gives base predictions for a position of interest of reads filtered by iCG filter. In addition to a text output on the console, the result is saved as a .csv file and as a plot showing the linear discriminants of the predicted reads. An example prediction result is shown in figure 7.
Usage & Output
As a positional arguments, the script needs to be given the paths to the model file that shall be used as well as to the directory containing reads for base prediction, previously filtered with iCG filter. Optionally, a directory for the results can be specified by setting the argument --dst_dir.
python2 iCG.py predict PATH/TO/MODEL PATH/TO/FILTERED/READS
Further help regarding the usage of iCG predict is given in the
help pages that are accessible by passing the argument
-h or
--help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir
Based on a linear discriminant analysis model previously created by "iCG
model", this sub-script classifies the base at a given position of interest for a
set of filtered reads as input.
positional arguments:
model
Absolute or relative path to the .model file holding the
linear discriminant analysis model desired for base
prediction.
in_dir
Absolute or relative path to the directory containing
filtered and resquiggled reads of particular orientation
for base prediction.
optional arguments:
-h, --help
show this help message and exit
--dst_dir DST_DIR
Folder where the results will be saved. Default: ./