Difference between revisions of "Team:Bielefeld-CeBiTec/Software"

Line 239: Line 239:
 
                             <font face="courier new">
 
                             <font face="courier new">
 
                             <div class="article">
 
                             <div class="article">
                                <br>
 
 
                                 usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE<br>
 
                                 usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE<br>
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--dst_dir_name DST_DIR_NAME]<br>
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--dst_dir_name DST_DIR_NAME]<br>
Line 253: Line 252:
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--disp_bases DISP_BASES] [--barcode BARCODE]<br>
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--disp_bases DISP_BASES] [--barcode BARCODE]<br>
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--barcode_deviation BARCODE_DEVIATION]<br>
 
                                 &nbsp;&nbsp;&nbsp;&nbsp;[--barcode_deviation BARCODE_DEVIATION]<br>
                                 &nbsp;&nbsp;&nbsp;&nbsp;[basecalled_dirs [basecalled_dirs ...]]
+
                                 &nbsp;&nbsp;&nbsp;&nbsp;[basecalled_dirs [basecalled_dirs ...]]<br>
 
                                 <br>
 
                                 <br>
 
                                 This sub-script takes a set of directories containing basecalled Nanopore reads as
 
                                 This sub-script takes a set of directories containing basecalled Nanopore reads as
 
                                 input and filters the contained reads by several criteria given as arguments
 
                                 input and filters the contained reads by several criteria given as arguments
 
                                 to this script. Afterwards, the remaining reads are resquigled by nanoraw and
 
                                 to this script. Afterwards, the remaining reads are resquigled by nanoraw and
                                 copied to a new folder inside the basecalling directory.
+
                                 copied to a new folder inside the basecalling directory.<br>
 
+
 
+
                                <br>
+
 
                                 <br>
 
                                 <br>
 
                                 positional arguments:
 
                                 positional arguments:

Revision as of 18:55, 1 November 2017

Software

Nanopore Sequencing of DNA Containing Unnatural Bases

As described on the New Methods page of this wiki, Nanopore sequencing promises to be especially beneficial regarding the analysis of experiments involving unnatural bases in DNA and RNA. As a single-molecule sequencing technology without the need for special, expensive chemistry or PCR amplification of DNA samples prior to analysis, it could serve as an innovative, novel approach for the analysis of experiments involving unnatural bases.

For the analysis of our experiments with isoG and isoCm, examined if Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. Therefore, we developed a software called "iCG" for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural bases, including data processing and evaluation. Based on differences in the raw output signal of DNA reads containing an unnatural base compared to those containing only natural bases, the software uses linear discriminant analysis to create a statistical model that gives base predictions for a position of interest in a known sequence context, where either a natural or an unnatural base is present. This software is not limited to a certain unnatural base pair or to a certain sequence context, but rather serves as a pipeline tool for the creation of statistical models potentially allowing for the detection of any unnatural base pairs in an arbitrary sequence context based on reference sequencing data.

iCG

Background

In ONT's sequencing technology, the ionic current that passes through a nanopore is not measured for every nucleotide individually. To cope with the influence of random fluctuations of the ionic current, the background noise, the signal is evaluated for a sequence of approximately 5 nucleotides. Every such k‑mer (in this case every possible combination of 5 successive nucleotides) produces a characteristic signal that is evaluated by the basecaller to predict the sequence of the analyzed DNA strand. Therefore, the signal trace of a certain nucleotide is not the same at every position where it occurs in the DNA strand, but is dependent on its sequence context. This has two interesting effects that have to be considered regarding ONT sequencing of unnatural bases: Firstly, the signal of the close sequence context has to be considered too when analyzing if an unnatural base has a significant effect on the ionic current flow. Secondly, if an unnatural base indeed significantly influences the signal, it will have an impact on the signal traces surrounding the position where it is occurring in the DNA. Therefore, the basecalling may fail for natural bases surrounding this positon.

As a consequence of these considerations, a basecaller that shall be able to differentiate between natural bases and unnatural bases at any sequence position would need to be trained with signals of every possible k‑mer that is a combination of natural and unnatural bases. As the number of possible k‑mers given n possibilities is nk, 6752 new k-mers would have to be analyzed, which is beyond our means. Nevertheless, since unnatural base pairs are mostly used at well-known locations, e.g. to expand the genetic code in in vitro or in vivo experiments, it is sufficient to analyze only certain k‑mers that are of interest. The comparison to k‑mers containing all natural bases provides a foundation for a software that differentiates between unnatural and natural base pairs at a given position of interest (POI). This is the approach for our software tool.

Overview

The software package iCG is a collection of three python scripts for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural base pairs. The first script, iCG filter, is used to filter basecalled Nanopore data by certain criteria like read length and quality. Its fundamental function is to select sequencing reads that contain the position of interest, where the unnatural base pair is subjected to be, irrespective of its presence. The second scipt, iCG model, creates a statistical model for the detection of unnatural base pairs in sequencing reads based on linear discriminant analysis of ONT data previously filtered by iCG filter. The third script, iCG predict, uses a model created by iCG model to predict which base pair is present at the position of interest for a set of filtered sequencing reads.

Fig. 1: Overview of the iCG functionality.

Dependencies & Installation

Our software works with basecalled Nanopore data. For Nanopore sequencing, we used the ONT software MinKNOW version 1.7.14.1 in combination with FLO-MIN106 R9.4 flowcells on a MinION sequencer. Other combinations of hardware and consumables, such as 1D2 flowcells, were not tested in combination with this software package and will probably require for small adjustments. Basecalling has to be performed with the ONT basecaller Albacore.

The iCG software package has some dependencies that need to be installed before it can be used. First of all, HDF5 needs to be installed. HFD, standing for Hierarchical Data Format, is a format for storing and retrieving large data sets from/to files and is the format in which Nanopore data is stored. For building and installing from sources, please refer to the distributors website. For installation from depository:
> sudo apt-get install libhdf5-serial-dev
Additionally, the alignment and mapping software package BWA‑MEM (Li, 2013) needs to be installed. For additional help regarding the installation process, please refer to their sourceforge or github website.
> git clone https://github.com/lh3/bwa.git
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
Furthermore, the software environment R for statistical computing and graphics needs to be installed. For a system specific installation, please refer to the r-project website. For statistical computations and plotting, some additional R packages need to be installed. After the R environment is installed, please print the following commands in an R console:
> install.packages("ggplot2")
> install.packages("cowplot")
> install.packages("MASS")
iCG itself is written in python2.7 code. Therefore, a python2.7 environment needs to be installed. Please follow the instructions on python.org for a system specific installation. Afterwards, use the python package managment tool pip for the installation of required python packages. For that, open a terminal and execute the following commands:
> pip2.7 install regex
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]

With all dependencies installed, download the latest version of iCG together with example data sets in the Downloads section of this page. To learn about how to use iCG, please continue reading or start right ahead by executing the following command in a terminal, having navigated to the folder where the downloaded code is located:
python2 iCG.py -h
Your installation of iCG is now complete. Due to the size of sequencing data, basecalled Oxford Nanopore sequencing reads for testing iCG are available on request.

iCG filter

Description
The command iCG filter is executed on basecalled Nanopore data before any further analysis can be performed with iCG. It uses the information generated by the ONT basecaller albacore to filter a set of sequencing reads for sufficient quality, length and the existence of the position of interest. After the reads have been filtered, they are divided into sense and antisense reads regarding their orientation relative to a given reference sequence. Afterwards, the reads are "re-squiggled" using nanoraw (Stoiber, 2016), meaning that every read is aligned to the given reference sequence in order to correct bascalling errors and to normalize the raw read signals in terms of duration and amplitude. This filtering procedure is mandatory for the subsequent analysis with both iCG model and iCG predict.

For subsequent analysis, it is particularly important that the filtering process is performed irrespective of the position of interest. The potential influence of the unnatural bases on the sequencing signal may lead to bascalling misbehavior which in turn could lead to falsified analysis results if the position of interest and the neighboring sequence context is considered during filtering. Therefore, the presence of the region of interest in a read sequence is checked by regular expression pattern matching, omitting the immediate sequence context of the position of interest, called "blur region". Figure 2 illustrates how the position of interest, the region of interest, the blur region and the regular expression are connected with each other.

Fig. 2: Terminology and regular expression used for pattern matching.In order to identify reads containing the region of interest, a regular expression is used for pattern matching that is omitting the immediate sequence context of the position of interest, called "blur" region. The displayed regular expression is resulting from the default parameter settings for the exemplary reference sequence.

Every aspect of this pattern matching approach is adoptable by optional arguments passed to iCG on execution, including the size of the region of interest, the size of the blur region and the amount of mismatches allowed during pattern matching. Additional parameters define whether insertions and deletions (indels) are allowed and how much the length of a match is allowed to deviate from the length of the region of interest as a consequence of indels.
In addition to pattern matching, the defined region of interest is also used for evaluating the read quality in the neighboring sequence context of the position of interest. Assuming that the quality in the blur region is high if the quality of its surrounding sequence context is high too, this criteria can be especially assessed by setting the --min_mean_context_qscore parameter accordingly.
Usage & Output
As a positional argument, the script needs to be handed a non-empty list of albacore base directories containing basecalled reads to be filtered. Additional required arguments that need to be specified are the path to the bwa mem executable and the fasta file containing the reference sequence the reads shall be mapped to. A basic, abstract program execution looks as follows:
python2 iCG.py filter PATH/TO/ALBACORE/DIR_1 PATH/TO/ALBACORE/DIR_2 --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE -r PATH/TO/FASTA
All specified directories are subsequently filtered using the same filter settings. These settings can be modified by specifying optional arguments. The results of the program execution are saved in a new directory inside the albacore base directory named "filtered" by default. In addition to the filtered reads, sorted into subdirectories named "sense" and "antisense", the destination folder will contain a file called settings.txt holding all specified filtering settings. Additionally, a directory plots is created on default containing plots of the normalized signal traces at the region of interest for both the sense and the antisense strand orientation.

During program execution, iCG filter prints useful information regarding current steps of the filtering process, revealing how the result maybe optimized. The following is an example for the console output created by iCG filter:
filtering for minimal mean quality of 10.0
    67402 removed (40.81 % of remaining)
    97739 remaining (59.19 % of total)

filtering for minimal length of 1887.75
    removed 91117 (93.22 % of remaining)
    6622 remaining (4.01 % of total)

filtering for maximal length of 2517
    removed 1330 (20.08 % of remaining)
    5292 remaining (3.20 % of total)

adding fastq information

filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
    sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
    antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}

    deviations in sequence length of remaining sequences (* selected):
            sense  antisense
     -5:    6      3
     -4:    9      9
     -3:    34     10
    *-2:    71     32
    *-1:    135    94
    * 0:    875    705
    * 1:    84     47
    * 2:    44     24
      3:    14     12
      4:    1      1
      5:    4      0
            ------ ------
    *       1209   902
    removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
    2111 remaining (1.28 % of total)

filtering for context quality higher than 14.0 (* selected)
            sense  antisense
     7-8:   0      1
     8-9:   3      1
     9-10:  12     5
     10-11: 18     6
     11-12: 38     24
     12-13: 70     41
     13-14: 108    82
    *14-15: 142    101
    *15-16: 193    128
    *16-17: 155    109
    *17-18: 167    116
    *18-19: 127    95
    *19-20: 84     78
    *20-21: 56     54
    *21-22: 25     29
    *22-23: 7      19
    *23-24: 3      10
    *24-25: 1      1
    *25-26: 0      2
            ------ ------
    *       960    742
    removed 409 (19.37 % of remaining)
    1702 remaining  (1.03 % of total)
Further help regarding the usage of iCG filter is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE
    [--dst_dir_name DST_DIR_NAME]
    [--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
    [--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
    [--blur_deviation BLUR_DEVIATION]
    [--context_deviation CONTEXT_DEVIATION] [--no_indels]
    [--max_length_deviation MAX_LENGTH_DEVIATION]
    [--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
    [--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
    [--normalization_type {median,pA,pA_raw,none}]
    [--processes PROCESSES] [--plot_poi]
    [--disp_bases DISP_BASES] [--barcode BARCODE]
    [--barcode_deviation BARCODE_DEVIATION]
    [basecalled_dirs [basecalled_dirs ...]]

This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquigled by nanoraw and copied to a new folder inside the basecalling directory.

positional arguments:
basecalled_dirs
Paths to directories containing basecalled reads that shall be filtered.

optional arguments:
-h, --help
show this help message and exit
--bwa_mem_exe BWA_MEM_EXE
Required argument. Path to the bwa_mem executable.
-r REFERENCE, --reference REFERENCE
Required argument. Path to a fasta file holding a single sequence as a reference for mapping reads for resquiggling.
--dst_dir_name DST_DIR_NAME
Name of the destination folder. Default: filtered
--min_mean_qscore MIN_MEAN_QSCORE
The minimal mean phred qscore of all basecalles of a single read, representing the overall quality of the read in assessment. Default:10.0
--min_length MIN_LENGTH
The minimal length of reads. The default value is 0.75 the length of the reference sequence.
--max_length MAX_LENGTH
The maximal length of reads. The default value is the length of the reference sequence.
--radius RADIUS
The number of bases upstream and downstream of the position of interest which form the region of interest (ROI). In combination with the blur radius, the ROI defines the sequence for the regular expression search used for deciding weather a read is covering the position of interest. Default:15
--blur BLUR
The number of bases upstream and downstream of the position of interest which are excluded from the regular expression search used for deciding weather a read is covering the position of interest. This is necessary to to the possibile influence that an unnatural base might have on the signal trace of its surrounding bases. Default: 3
--blur_deviation BLUR_DEVIATION
The amount of bases the blur region is allowed to vary in length. Default: 1
--context_deviation CONTEXT_DEVIATION
The amount of deviations, both indels and mismatches, the upstream and downstream sequence contexts of the region of interest is allowed to have in the regular expression search. Default:2
--no_indels
If specified, indels are not accepted in the regular expression search.
--max_length_deviation MAX_LENGTH_DEVIATION
The maximum deviation in length between the region of interest in the reference sequence and the read sequence, taking both the blur deviation and the context deviation into consideration.
--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE
The minimal mean phred qscore of basecalls in the region of interest, excluding the blurred radius around the position of interest. Default: 2
--greedy_regex_search
If specified, the regular expression search is greedy meaning that overlapping hits are impossible. Faster, but generally not recommended.
--cpts_limit CPTS_LIMIT
cpts parameter for nanoraw genome_resquiggle. Default: 100
--normalization_type {median,pA,pA_raw,none}
Type of normalization used in the nanoraw genome_resquiggle script for calculating statistics based on new segmentation of raw signal.
--processes PROCESSES
Number of processes for nanoraw genome_resquiggle. Default: 4
--plot_poi
If specified, a plot of the resquiggled signal traces of the filtered reads is created and saved as svg.
--disp_bases DISP_BASES
Size of displayed region in number of bases when plotting the signal traces of the input data at the region of interest. Default: 21
--barcode BARCODE
If specified, only reads containing the given barcode string pass the filtering process.
--barcode_deviation BARCODE_DEVIATION
The amount of deviations, both indels and mismatches, a potential regular expression match against the defined barcode is allowed to have. Default: 3

iCG model

Description
The script iCG model creates a linear discriminant analysis model to discriminate between natural bases and unnatural bases in the sequence context that is present in the DNA samples used to compute the model. As an input, the script needs paths to directories containing sequencing data filtered by iCG filter: One directory containing data of a DNA sample with the UBP being analyzed, and at least one containing natural bases at the position of interest. For these datasets, the program identifies the exact location of the position of interest in the raw data of every read, taking into consideration the alignment data previously produced by nanoraw. Then, the mean raw signal in the normalized intervals from two bases upstream to two bases downstream of the position of interest is calculated for every read. Figure 4 gives an example of how this processing is performed.

Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2.The plot on the left hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right hand side lists the average characteristics of this group regarding these sequence positions.

With this step, a template dataset consisting of entries for every read in every directory passed to iCG model is created. Each entry holds the mean signal for the five base positions at the position of interest and a group identifier indicating to which input directory the read belongs. The script then applies linear discriminant function analysis to this template dataset in order to create an equation system that classifies the different groups of data based on the mean signals at the five positions. The linear discriminant function is provided by the MASS package (Venables and Ripley, 2002) for the software environment R (R Core Team, 2017). This analysis method provides the linear combinations of the five mean values which best discriminates between reads of different groups, mainly by maximizing variance between reads of different groups and minimizing variance of reads of the same group (see Fisher, 1936, for this approach). By applying the resulting equation system to the very dataset the model was created with, a graphical representation of the classification is attainable. The number of dimensions is depending on the number of linear discriminants needed to discriminate all groups from each other. In most cases, two linear discriminants are sufficient, resulting in a two-dimensional dot-plot with one axis for each linear discriminant.

This approach is best described by taking a closer look at an example. Let us take aside the unnatural bases for a moment and consider that we wanted to create a model that is able to discriminate between the natural bases A, G, T and C at the position of interest in the sequence context presented in Figure 2. First of all, we take sequencing data of four different DNA samples, each one having a different natural base at the position of interest. With the help of iCG filter, the intividual sets of sequencing data are filtered for reads containing the region of interest. Now, the paths of each folder containing one of the four groups are passed to iCG model. Figure 4 shows the group means of the five positions (mean_0 to mean_5), the coefficients of three linear discriminants resulting from the linear discriminant analysis, and a two dimensional dot-plot showing the linear discriminants of the template dataset.

Fig. 4: Example output of iCG model.Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The portion of traces indicates the effect of each linear discriminant regarding the discrimination of the different groups. In this particular example, first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.


One great challenge regarding the creation of a good statistical model is the quality of the template data used for the statistical analysis. Ideally, the template dataset should only contain reads that can be assigned to their respective group with absolute certainty. Every read that is assigned wrong will prevent the algorithm performing the linear discriminant analysis from finding liner coefficients that effectively diverge the groups from each other. Unfortunately, Nanopore sequencing has the disadvantage that samples are sequenced one after the other. Even though a washing step is performed between consecutive sequencing runs, parts of the previous libraries remain inside the flow cell and will procude reads. According to Oxford Nanopore technologies, the portion of contaminating reads from previous runs is up to 10 % when washing with the Flow Cell Wash kit EXP-WSH002, depending on the number of previously loaded libraries (see EXP-WSH002). In our experiments, we found the degree of contamination to be even higher for some samples.

Fig. 5: Effect of removing most deviating reads on the normalized signal traces.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signals traces become more and more focused to the mean signal of the group being analyzed. Contaminations of previous sequencing libraries are effectively removed.

To face this challenge and to improve the overall quality of the statistical models created with iCG, we implemented an additional filtering algorithm into iCG model that recursively removes reads that are deviating the most from the median signal in the neighboring sequence context of the position of interest. This filtering step is performed on every set of reference reads individually. For deciding which reads are removed, the positions from -2 to +2 bases in relation to the position of interest are evaluated. By setting the parameter --quantiles_to_remove accordingly, the influence of this filtering process can be observed by taking a look at the plots of the signal traces, which are automatically produced and saved by iCG. Due to the recursive nature of the function, which calculates new median values after every single read that has been removed, the algorithm is hardly influencing the reads that truly belong to the reference sample being sequenced, as long as the contamination is lower than 50 %. Figure 5 illustrates the effect of this filtering step on the signal traces of a sequencing run with relatively high contamination. Figure 6 gives an example of the effect on the linear discriminant analysis.

Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly

Usage & Output
As a positional argument, the script needs a path to a folder where the results shall be saved. Additionally, the paths to directories containing reads filtered by iCG filter need to be passed to iCG model. At least the directory containing filtered reads with the unnatural base pair and one arbitrary natural base pair is required for program execution. A basic, abstract program execution looks as follows:
python2 iCG.py model PATH/TO/RESULTS -UBP PATH/TO/FILTERED/UBP-READS -NN PATH/TO/OTHER/FILTERED/READS
In the destination folder, ".model" files containing the statistical model of the linear discriminant analysis are saved. These can be passed to iCG predict for base prediction of new Nanopore sequencing data. Dot-plots and statistics for the linear discriminant models are saved in subdirectories named plots and stats accordingly. Furthermore, plots of the signal traces at the position of interest are created for each group of reads used for model creation.
Further help regarding the usage of iCG model is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]
    [-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
    [-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
    [--print_group_characteristics] [--no_data_plots]
    [--disp_bases DISP_BASES]
    [-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
    [--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
    [--UBantisense UBANTISENSE]
    dst_dir

This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. Therefore, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".

positional arguments:
dst_dir
Path to destination folder where all created results shall be saved.

optional arguments:
-h, --help
show this help message and exit
-UBP UBP_FLTRD_FAST5_DIR, --UBP_fltrd_fast5_dir UBP_FLTRD_FAST5_DIR
Required argument. Relative or absolute path to the directory containing filtered .fast5 files of reads with the unnatural base pair at the position of interest being analyzed.
-AT AT_FLTRD_FAST5_DIR, --AT_fltrd_fast5_dir AT_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an A in the sense strand and a T in the antisense strand at the position of interest.
-GC GC_FLTRD_FAST5_DIR, --GC_fltrd_fast5_dir GC_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an G in the sense strand and a C in the antisense strand at the position of interest.
-TA TA_FLTRD_FAST5_DIR, --TA_fltrd_fast5_dir TA_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an T in the sense strand and a A in the antisense strand at the position of interest.
-CG CG_FLTRD_FAST5_DIR, --CG_fltrd_fast5_dir CG_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an C in the sense strand and a G in the antisense strand at the position of interest.
-NN NN_FLTRD_FAST5_DIR, --NN_fltrd_fast5_dir NN_FLTRD_FAST5_DIR
Relative or absolute path to a directory containing filtered .fast5 files of reads with any natural bases in the sense strand and antisense strand at the position of interest.
--print_group_characteristics
If specified, detailed statistical information about the position of interest of each given group of reads is stored in their respective folders. This information includes called bases as well as mean and median normalized signals with their respective deviations for the position of interest as well as for the two positions upstream and downstream.
--no_data_plots
If specified, no plots are created showing the signal traces of the input data at the region of interest.
--disp_bases DISP_BASES
Size of displayed region in number of bases when plotting the signal traces of the input data at the region of interest. Default: 21
-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]], --quantiles_to_remove [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]
Specifies the quantile of reads being filtered out for deviating the most from the median signal in the neighboring sequence context of the position of interest. Helpful for removing reads that originate from previously sequenced libraries. Default: 0.3
--no_model_plots
If specified, no plots are created showing the distribution and classification of reads by the linear discriminant analysis model.
--no_model_stats
If specified, no stats are saved giving information about the quality of classification of the linear discriminant model with respect to the input data.
--UBsense UBSENSE
short identifier for the unnatural base in the sense strand. Default: UB(+)
--UBantisense UBANTISENSE
short identifier for the unnatural base in the antisense strand. Default: UB(-)

iCG predict

Description
Based on a linear discriminant analysis model previously created by iCG model, iCG predict gives base predictions for a position of interest of reads filtered by iCG filter. In addition to a text output on the console, the result is saved as a .csv file and as a plot showing the linear discriminants of the predicted reads. An example prediction result is shown in figure 7.

Fig. 5: Example of a base prediction result created by iCG predict. Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction.

Usage & Output
As a positional arguments, the script needs to be given the paths to the model file that shall be used as well as to the directory containing reads for base prediction, previously filtered with iCG filter. Optionally, a directory for the results can be specified by setting the argument --dst_dir.
python2 iCG.py predict PATH/TO/MODEL PATH/TO/FILTERED/READS
Further help regarding the usage of iCG predict is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir

Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads as input.

positional arguments:
model
Absolute or relative path to the .model file holding the linear discriminant analysis model desired for base prediction.
in_dir
Absolute or relative path to the directory containing filtered and resquiggled reads of particular orientation for base prediction.

optional arguments:
-h, --help
show this help message and exit
--dst_dir DST_DIR
Folder where the results will be saved. Default: ./

Downloads

File MD5 Sum
T--Bielefeld-CeBiTec--software_suite.zip b7ea52444a6f27a4cf8354613668516a

Refereces

Debladis, E., Llauro, C., Carpentier, M.-C., Mirouze, M., and Panaud, O. (2017). Detection of active transposable elements in Arabidopsis thaliana using Oxford Nanopore Sequencing technology. BMC Genomics 18: 537.
Feng, Y., Zhang, Y., Ying, C., Wang, D., and Du, C. (2015). Nanopore-based Fourth-generation DNA Sequencing Technology. Genomics. Proteomics Bioinformatics 13: 4–16.
Fisher, R.A. et al. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 0: 1–3.
Simpson, J.T., Workman, R.E., Zuzarte, P.C., David, M., Dursi, L.J., and Timp, W. (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nat Meth 14: 407–410.
Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S Fourth. (Springer: New York).
R Core Team (2017). R: A Language and Environment for Statistical Computing.