Difference between revisions of "Team:Bielefeld-CeBiTec/Software"

Line 809: Line 809:
 
                 <div class="article">
 
                 <div class="article">
 
                     <a href="https://github.com/MarkusHaak/iCG" target="_blank">https://github.com/MarkusHaak/iCG</a>
 
                     <a href="https://github.com/MarkusHaak/iCG" target="_blank">https://github.com/MarkusHaak/iCG</a>
 +
<a href="https://github.com/MaximilianEdich/M.A.X.">https://github.com/MaximilianEdich/M.A.X.</a>
 
                 </div>
 
                 </div>
  

Revision as of 22:36, 15 December 2017

Software

Short Summary

Our sophisticated software suite is composed of two connected modules for the analysis of unnatural base pairs in a specified target sequence: M.A.X and iCG. Oxford Nanopore sequencing data is processed by iCG to identify unnatural base pairs in a given target sequence. As an orthogonal method, our Mutational Analysis Xplorer (M.A.X) utilizes a customized database to find a suitable set of restriction enzymes for our enzyme based detection system. In addition, M.A.X represents a low-cost alternative for the analysis of mutations at a specific position, allowing all iGEM teams to conduct research on unnatural base pairs. Both modules form a powerful software suite, which is extremely helpful for research on unnatural base pairs. Examples are the analysis of mutation frequencies and fidelity of semi-synthetic DNA replication. We postulate that our suite is also applicable for the study of DNA modifications and epigenetics.

Nanopore Sequencing of DNA Containing Unnatural Bases

As described on the New Methods page of this wiki, Nanopore sequencing promises to be especially beneficial regarding the analysis of experiments involving unnatural bases in DNA and RNA. As a single-molecule sequencing technology without the need for special, expensive chemistry or PCR amplification of DNA samples prior to analysis, it could serve as an innovative, novel approach for the analysis of experiments involving unnatural bases.

For the analysis of our experiments with isoG and isoCm, examined if Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. Therefore, we developed a software called "iCG" for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural bases, including data processing and evaluation. Based on differences in the raw output signal of DNA reads containing an unnatural base compared to those containing only natural bases, the software uses linear discriminant analysis to create a statistical model that gives base predictions for a position of interest in a known sequence context, where either a natural or an unnatural base is present. This software is not limited to a certain unnatural base pair or to a certain sequence context, but rather serves as a pipeline tool for the creation of statistical models potentially allowing for the detection of any unnatural base pairs in an arbitrary sequence context based on reference sequencing data.

iCG

Background

In ONT's sequencing technology, the ionic current that passes through a nanopore is not measured for every nucleotide individually. To cope with the influence of random fluctuations of the ionic current, the background noise, the signal is evaluated for a sequence of approximately 5 nucleotides. Every such k‑mer (in this case every possible combination of 5 successive nucleotides) produces a characteristic signal that is evaluated by the basecaller to predict the sequence of the analyzed DNA strand. Therefore, the signal trace of a certain nucleotide is not the same at every position where it occurs in the DNA strand, but is dependent on its sequence context. This has two interesting effects that have to be considered regarding ONT sequencing of unnatural bases: Firstly, the signal of the close sequence context has to be considered too when analyzing if an unnatural base has a significant effect on the ionic current flow. Secondly, if an unnatural base indeed significantly influences the signal, it will have an impact on the signal traces surrounding the position where it is occurring in the DNA. Therefore, the basecalling may fail for natural bases surrounding this positon.

As a consequence of these considerations, a basecaller that shall be able to differentiate between natural bases and unnatural bases at any sequence position would need to be trained with signals of every possible k‑mer that is a combination of natural and unnatural bases. As the number of possible k‑mers given n possibilities is nk, 6752 new k-mers would have to be analyzed, which is beyond our means. Nevertheless, since unnatural base pairs are mostly used at well-known locations, e.g. to expand the genetic code in in vitro or in vivo experiments, it is sufficient to analyze only certain k‑mers that are of interest. The comparison to k‑mers containing all natural bases provides a foundation for a software that differentiates between unnatural and natural base pairs at a given position of interest (POI). This is the approach for our software tool.

Overview

The software package iCG is a collection of three python scripts for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural base pairs. The first script, iCG filter, is used to filter basecalled Nanopore data by certain criteria like read length and quality. Its fundamental function is to select sequencing reads that contain the position of interest, where the unnatural base pair is subjected to be, irrespective of its presence. The second scipt, iCG model, creates a statistical model for the detection of unnatural base pairs in sequencing reads based on linear discriminant analysis of ONT data previously filtered by iCG filter. The third script, iCG predict, uses a model created by iCG model to predict which base pair is present at the position of interest for a set of filtered sequencing reads.

Fig. 1: Overview of the iCG functionality.

Dependencies & Installation

Our software works with basecalled Nanopore data. For Nanopore sequencing, we used the ONT software MinKNOW version 1.7.14.1 in combination with FLO-MIN106 R9.4 flowcells on a MinION sequencer. Other combinations of hardware and consumables, such as 1D2 flowcells, were not tested in combination with this software package and will probably require for small adjustments. Basecalling has to be performed with the ONT basecaller Albacore.

The iCG software package has some dependencies that need to be installed before it can be used. First of all, HDF5 needs to be installed. HFD, standing for Hierarchical Data Format, is a format for storing and retrieving large data sets from/to files and is the format in which Nanopore data is stored. For building and installing from sources, please refer to the distributors website. For installation from depository:
> sudo apt-get install libhdf5-serial-dev
Additionally, the alignment and mapping software package BWA‑MEM (Li, 2013) needs to be installed. For additional help regarding the installation process, please refer to their sourceforge or github website.
> git clone https://github.com/lh3/bwa.git
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
Furthermore, the software environment R for statistical computing and graphics needs to be installed. For a system specific installation, please refer to the r-project website. For statistical computations and plotting, some additional R packages need to be installed. After the R environment is installed, please print the following commands in an R console:
> install.packages("ggplot2")
> install.packages("cowplot")
> install.packages("MASS")
iCG itself is written in python2.7 code. Therefore, a python2.7 environment needs to be installed. Please follow the instructions on python.org for a system specific installation. Afterwards, use the python package managment tool pip for the installation of required python packages. For that, open a terminal and execute the following commands:
> pip2.7 install regex
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]

With all dependencies installed, download the latest version of iCG together with example data sets in the Downloads section of this page. To learn about how to use iCG, please continue reading or start right ahead by executing the following command in a terminal, having navigated to the folder where the downloaded code is located:
python2 iCG.py -h
Your installation of iCG is now complete. Due to the size of sequencing data, basecalled Oxford Nanopore sequencing reads for testing iCG are available on request.

iCG filter

Description
The command iCG filter is executed on basecalled Nanopore data before any further analysis can be performed with iCG. It uses the information generated by the ONT basecaller albacore to filter a set of sequencing reads for sufficient quality, length and the existence of the position of interest. After the reads have been filtered, they are divided into sense and antisense reads regarding their orientation relative to a given reference sequence. Afterwards, the reads are "re-squiggled" using nanoraw (Stoiber, 2016), meaning that every read is aligned to the given reference sequence in order to correct bascalling errors and to normalize the raw read signals in terms of duration and amplitude. This filtering procedure is mandatory for the subsequent analysis with both iCG model and iCG predict.

For subsequent analysis, it is particularly important that the filtering process is performed irrespective of the position of interest. The potential influence of the unnatural bases on the sequencing signal may lead to bascalling misbehavior which in turn could lead to falsified analysis results if the position of interest and the neighboring sequence context is considered during filtering. Therefore, the presence of the region of interest in a read sequence is checked by regular expression pattern matching, omitting the immediate sequence context of the position of interest, called "blur region". Figure 2 illustrates how the position of interest, the region of interest, the blur region and the regular expression are connected with each other.

Fig. 2: Terminology and regular expression used for pattern matching.In order to identify reads containing the region of interest, a regular expression is used for pattern matching that is omitting the immediate sequence context of the position of interest, called "blur" region. The displayed regular expression is resulting from the default parameter settings for the exemplary reference sequence.

Every aspect of this pattern matching approach is adoptable by optional arguments passed to iCG on execution, including the size of the region of interest, the size of the blur region and the amount of mismatches allowed during pattern matching. Additional parameters define whether insertions and deletions (indels) are allowed and how much the length of a match is allowed to deviate from the length of the region of interest as a consequence of indels.
In addition to pattern matching, the defined region of interest is also used for evaluating the read quality in the neighboring sequence context of the position of interest. Assuming that the quality in the blur region is high if the quality of its surrounding sequence context is high too, this criteria can be especially assessed by setting the --min_mean_context_qscore parameter accordingly.
Usage & Output
As a positional argument, the script needs to be handed a non-empty list of albacore base directories containing basecalled reads to be filtered. Additional required arguments that need to be specified are the path to the bwa mem executable and the fasta file containing the reference sequence the reads shall be mapped to. A basic, abstract program execution looks as follows:
python2 iCG.py filter PATH/TO/ALBACORE/DIR_1 PATH/TO/ALBACORE/DIR_2 --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE -r PATH/TO/FASTA
All specified directories are subsequently filtered using the same filter settings. These settings can be modified by specifying optional arguments. The results of the program execution are saved in a new directory inside the albacore base directory named "filtered" by default. In addition to the filtered reads, sorted into subdirectories named "sense" and "antisense", the destination folder will contain a file called settings.txt holding all specified filtering settings. Additionally, a directory plots is created on default containing plots of the normalized signal traces at the region of interest for both the sense and the antisense strand orientation.

During program execution, iCG filter prints useful information regarding current steps of the filtering process, revealing how the result maybe optimized. The following is an example for the console output created by iCG filter:
filtering for minimal mean quality of 10.0
    67402 removed (40.81 % of remaining)
    97739 remaining (59.19 % of total)

filtering for minimal length of 1887.75
    removed 91117 (93.22 % of remaining)
    6622 remaining (4.01 % of total)

filtering for maximal length of 2517
    removed 1330 (20.08 % of remaining)
    5292 remaining (3.20 % of total)

adding fastq information

filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
    sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
    antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}

    deviations in sequence length of remaining sequences (* selected):
            sense  antisense
     -5:    6      3
     -4:    9      9
     -3:    34     10
    *-2:    71     32
    *-1:    135    94
    * 0:    875    705
    * 1:    84     47
    * 2:    44     24
      3:    14     12
      4:    1      1
      5:    4      0
            ------ ------
    *       1209   902
    removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
    2111 remaining (1.28 % of total)

filtering for context quality higher than 14.0 (* selected)
            sense  antisense
     7-8:   0      1
     8-9:   3      1
     9-10:  12     5
     10-11: 18     6
     11-12: 38     24
     12-13: 70     41
     13-14: 108    82
    *14-15: 142    101
    *15-16: 193    128
    *16-17: 155    109
    *17-18: 167    116
    *18-19: 127    95
    *19-20: 84     78
    *20-21: 56     54
    *21-22: 25     29
    *22-23: 7      19
    *23-24: 3      10
    *24-25: 1      1
    *25-26: 0      2
            ------ ------
    *       960    742
    removed 409 (19.37 % of remaining)
    1702 remaining  (1.03 % of total)
Further help regarding the usage of iCG filter is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE
    [--dst_dir_name DST_DIR_NAME]
    [--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
    [--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
    [--blur_deviation BLUR_DEVIATION]
    [--context_deviation CONTEXT_DEVIATION] [--no_indels]
    [--max_length_deviation MAX_LENGTH_DEVIATION]
    [--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
    [--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
    [--normalization_type {median,pA,pA_raw,none}]
    [--processes PROCESSES] [--plot_poi]
    [--disp_bases DISP_BASES] [--barcode BARCODE]
    [--barcode_deviation BARCODE_DEVIATION]
    [basecalled_dirs [basecalled_dirs ...]]

This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquigled by nanoraw and copied to a new folder inside the basecalling directory.

positional arguments:
basecalled_dirs
Paths to directories containing basecalled reads that shall be filtered.

optional arguments:
-h, --help
show this help message and exit
--bwa_mem_exe BWA_MEM_EXE
Required argument. Path to the bwa_mem executable.
-r REFERENCE, --reference REFERENCE
Required argument. Path to a fasta file holding a single sequence as a reference for mapping reads for resquiggling.
--dst_dir_name DST_DIR_NAME
Name of the destination folder. Default: filtered
--min_mean_qscore MIN_MEAN_QSCORE
The minimal mean phred qscore of all basecalles of a single read, representing the overall quality of the read in assessment. Default:10.0
--min_length MIN_LENGTH
The minimal length of reads. The default value is 0.75 the length of the reference sequence.
--max_length MAX_LENGTH
The maximal length of reads. The default value is the length of the reference sequence.
--radius RADIUS
The number of bases upstream and downstream of the position of interest which form the region of interest (ROI). In combination with the blur radius, the ROI defines the sequence for the regular expression search used for deciding whether a read is covering the position of interest. Default:15
--blur BLUR
The number of bases upstream and downstream of the position of interest which are excluded from the regular expression search used for deciding whether a read is covering the position of interest. This is necessary to to the possibile influence that an unnatural base might have on the signal trace of its surrounding bases. Default: 3
--blur_deviation BLUR_DEVIATION
The amount of bases the blur region is allowed to vary in length. Default: 1
--context_deviation CONTEXT_DEVIATION
The amount of deviations, both indels and mismatches, the upstream and downstream sequence contexts of the region of interest is allowed to have in the regular expression search. Default:2
--no_indels
If specified, indels are not accepted in the regular expression search.
--max_length_deviation MAX_LENGTH_DEVIATION
The maximum deviation in length between the region of interest in the reference sequence and the read sequence, taking both the blur deviation and the context deviation into consideration.
--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE
The minimal mean phred qscore of basecalls in the region of interest, excluding the blurred radius around the position of interest. Default: 2
--greedy_regex_search
If specified, the regular expression search is greedy meaning that overlapping hits are impossible. Faster, but generally not recommended.
--cpts_limit CPTS_LIMIT
cpts parameter for nanoraw genome_resquiggle. Default: 100
--normalization_type {median,pA,pA_raw,none}
Type of normalization used in the nanoraw genome_resquiggle script for calculating statistics based on new segmentation of raw signal.
--processes PROCESSES
Number of processes for nanoraw genome_resquiggle. Default: 4
--plot_poi
If specified, a plot of the resquiggled signal traces of the filtered reads is created and saved as svg.
--disp_bases DISP_BASES
Size of displayed region in number of bases when plotting the signal traces of the input data at the region of interest. Default: 21
--barcode BARCODE
If specified, only reads containing the given barcode string pass the filtering process.
--barcode_deviation BARCODE_DEVIATION
The amount of deviations, both indels and mismatches, a potential regular expression match against the defined barcode is allowed to have. Default: 3

iCG model

Description
The script iCG model creates a linear discriminant analysis model to discriminate between natural bases and unnatural bases in the sequence context that is present in the DNA samples used to compute the model. As an input, the script needs paths to directories containing sequencing data filtered by iCG filter: One directory containing data of a DNA sample with the UBP being analyzed, and at least one containing natural bases at the position of interest. For these datasets, the program identifies the exact location of the position of interest in the raw data of every read, taking into consideration the alignment data previously produced by nanoraw. Then, the mean raw signal in the normalized intervals from two bases upstream to two bases downstream of the position of interest is calculated for every read. Figure 4 gives an example of how this processing is performed.

Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2.The plot on the left hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right hand side lists the average characteristics of this group regarding these sequence positions.

With this step, a template dataset consisting of entries for every read in every directory passed to iCG model is created. Each entry holds the mean signal for the five base positions at the position of interest and a group identifier indicating to which input directory the read belongs. The script then applies linear discriminant function analysis to this template dataset in order to create an equation system that classifies the different groups of data based on the mean signals at the five positions. The linear discriminant function is provided by the MASS package (Venables and Ripley, 2002) for the software environment R (R Core Team, 2017). This analysis method provides the linear combinations of the five mean values which best discriminates between reads of different groups, mainly by maximizing variance between reads of different groups and minimizing variance of reads of the same group (see Fisher, 1936, for this approach). By applying the resulting equation system to the very dataset the model was created with, a graphical representation of the classification is attainable. The number of dimensions is depending on the number of linear discriminants needed to discriminate all groups from each other. In most cases, two linear discriminants are sufficient, resulting in a two-dimensional dot-plot with one axis for each linear discriminant.

This approach is best described by taking a closer look at an example. Let us take aside the unnatural bases for a moment and consider that we wanted to create a model that is able to discriminate between the natural bases A, G, T and C at the position of interest in the sequence context presented in Figure 2. First of all, we take sequencing data of four different DNA samples, each one having a different natural base at the position of interest. With the help of iCG filter, the intividual sets of sequencing data are filtered for reads containing the region of interest. Now, the paths of each folder containing one of the four groups are passed to iCG model. Figure 4 shows the group means of the five positions (mean_0 to mean_5), the coefficients of three linear discriminants resulting from the linear discriminant analysis, and a two dimensional dot-plot showing the linear discriminants of the template dataset.

Fig. 4: Example output of iCG model.Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The portion of traces indicates the effect of each linear discriminant regarding the discrimination of the different groups. In this particular example, first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.


One great challenge regarding the creation of a good statistical model is the quality of the template data used for the statistical analysis. Ideally, the template dataset should only contain reads that can be assigned to their respective group with absolute certainty. Every read that is assigned wrong will prevent the algorithm performing the linear discriminant analysis from finding liner coefficients that effectively diverge the groups from each other. Unfortunately, Nanopore sequencing has the disadvantage that samples are sequenced one after the other. Even though a washing step is performed between consecutive sequencing runs, parts of the previous libraries remain inside the flow cell and will procude reads. According to Oxford Nanopore technologies, the portion of contaminating reads from previous runs is up to 10 % when washing with the Flow Cell Wash kit EXP-WSH002, depending on the number of previously loaded libraries (see EXP-WSH002). In our experiments, we found the degree of contamination to be even higher for some samples.

Fig. 5: Effect of removing most deviating reads on the normalized signal traces.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signals traces become more and more focused to the mean signal of the group being analyzed. Contaminations of previous sequencing libraries are effectively removed.

To face this challenge and to improve the overall quality of the statistical models created with iCG, we implemented an additional filtering algorithm into iCG model that recursively removes reads that are deviating the most from the median signal in the neighboring sequence context of the position of interest. This filtering step is performed on every set of reference reads individually. For deciding which reads are removed, the positions from -2 to +2 bases in relation to the position of interest are evaluated. By setting the parameter --quantiles_to_remove accordingly, the influence of this filtering process can be observed by taking a look at the plots of the signal traces, which are automatically produced and saved by iCG. Due to the recursive nature of the function, which calculates new median values after every single read that has been removed, the algorithm is hardly influencing the reads that truly belong to the reference sample being sequenced, as long as the contamination is lower than 50 %. Figure 5 illustrates the effect of this filtering step on the signal traces of a sequencing run with relatively high contamination. Figure 6 gives an example of the effect on the linear discriminant analysis.

Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly

Usage & Output
As a positional argument, the script needs a path to a folder where the results shall be saved. Additionally, the paths to directories containing reads filtered by iCG filter need to be passed to iCG model. At least the directory containing filtered reads with the unnatural base pair and one arbitrary natural base pair is required for program execution. A basic, abstract program execution looks as follows:
python2 iCG.py model PATH/TO/RESULTS -UBP PATH/TO/FILTERED/UBP-READS -NN PATH/TO/OTHER/FILTERED/READS
In the destination folder, ".model" files containing the statistical model of the linear discriminant analysis are saved. These can be passed to iCG predict for base prediction of new Nanopore sequencing data. Dot-plots and statistics for the linear discriminant models are saved in subdirectories named plots and stats accordingly. Furthermore, plots of the signal traces at the position of interest are created for each group of reads used for model creation.
Further help regarding the usage of iCG model is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]
    [-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
    [-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
    [--print_group_characteristics] [--no_data_plots]
    [--disp_bases DISP_BASES]
    [-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
    [--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
    [--UBantisense UBANTISENSE]
    dst_dir

This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. Therefore, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".

positional arguments:
dst_dir
Path to destination folder where all created results shall be saved.

optional arguments:
-h, --help
show this help message and exit
-UBP UBP_FLTRD_FAST5_DIR, --UBP_fltrd_fast5_dir UBP_FLTRD_FAST5_DIR
Required argument. Relative or absolute path to the directory containing filtered .fast5 files of reads with the unnatural base pair at the position of interest being analyzed.
-AT AT_FLTRD_FAST5_DIR, --AT_fltrd_fast5_dir AT_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an A in the sense strand and a T in the antisense strand at the position of interest.
-GC GC_FLTRD_FAST5_DIR, --GC_fltrd_fast5_dir GC_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an G in the sense strand and a C in the antisense strand at the position of interest.
-TA TA_FLTRD_FAST5_DIR, --TA_fltrd_fast5_dir TA_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an T in the sense strand and a A in the antisense strand at the position of interest.
-CG CG_FLTRD_FAST5_DIR, --CG_fltrd_fast5_dir CG_FLTRD_FAST5_DIR
Relative or absolute path to the directory containing filtered .fast5 files of reads with an C in the sense strand and a G in the antisense strand at the position of interest.
-NN NN_FLTRD_FAST5_DIR, --NN_fltrd_fast5_dir NN_FLTRD_FAST5_DIR
Relative or absolute path to a directory containing filtered .fast5 files of reads with any natural bases in the sense strand and antisense strand at the position of interest.
--print_group_characteristics
If specified, detailed statistical information about the position of interest of each given group of reads is stored in their respective folders. This information includes called bases as well as mean and median normalized signals with their respective deviations for the position of interest as well as for the two positions upstream and downstream.
--no_data_plots
If specified, no plots are created showing the signal traces of the input data at the region of interest.
--disp_bases DISP_BASES
Size of displayed region in number of bases when plotting the signal traces of the input data at the region of interest. Default: 21
-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]], --quantiles_to_remove [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]
Specifies the quantile of reads being filtered out for deviating the most from the median signal in the neighboring sequence context of the position of interest. Helpful for removing reads that originate from previously sequenced libraries. Default: 0.3
--no_model_plots
If specified, no plots are created showing the distribution and classification of reads by the linear discriminant analysis model.
--no_model_stats
If specified, no stats are saved giving information about the quality of classification of the linear discriminant model with respect to the input data.
--UBsense UBSENSE
short identifier for the unnatural base in the sense strand. Default: UB(+)
--UBantisense UBANTISENSE
short identifier for the unnatural base in the antisense strand. Default: UB(-)

iCG predict

Description
Based on a linear discriminant analysis model previously created by iCG model, iCG predict gives base predictions for a position of interest of reads filtered by iCG filter. In addition to a text output on the console, the result is saved as a .csv file and as a plot showing the linear discriminants of the predicted reads. An example prediction result is shown in figure 7.

Fig. 5: Example of a base prediction result created by iCG predict. Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction.

Usage & Output
As a positional arguments, the script needs to be given the paths to the model file that shall be used as well as to the directory containing reads for base prediction, previously filtered with iCG filter. Optionally, a directory for the results can be specified by setting the argument --dst_dir.
python2 iCG.py predict PATH/TO/MODEL PATH/TO/FILTERED/READS
Further help regarding the usage of iCG predict is given in the help pages that are accessible by passing the argument -h or --help to the program on execution. They explain every positional and optional argument in detail, including their default values.
Help pages
usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir

Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads as input.

positional arguments:
model
Absolute or relative path to the .model file holding the linear discriminant analysis model desired for base prediction.
in_dir
Absolute or relative path to the directory containing filtered and resquiggled reads of particular orientation for base prediction.

optional arguments:
-h, --help
show this help message and exit
--dst_dir DST_DIR
Folder where the results will be saved. Default: ./

Mutation Analysis Xplorer

The Mutation Analasis Xplorer, in short M.A.X., is a python tool and an easy accessable alternative to iCG. This tool helps us to find out whether the unnatural base pair is still existent in the plasmid or not. Due to several mechanisms, the unnatural base pair can be replaced with natural bases. However, the results of this possible point mutations are predictable. This leads to our software tool M.A.X. and a simple PCR experiment to prove the existence of our unnatural base pair. The tool finds similar recognition sites of restriction enzymes and uses these to create an extended sequence with one mutated base. For any base A, C, G, or T at this position another restriction enzyme can cut. But if a new base like isoG or isoCm no restriction enzyme can recognize the sequence and the plasmid stays uncut. Since this system does not require special hardware it is a cheap alternative to iCG.

Mutations

The unnatural base pairs are not stable within the DNA. Our conservation system handles a lot of mechanisms which remove or replace our new bases. The process of hydrolysis changes an isoCm to a thymidine, which will lead to a replacement of isoG with adenine. Another reaction with the cause of a mutation is the tautomerisation. This leads to a change from the isoG keto tautomer to an isoG phenolic tautomer. This form of isoG can no longer pair with isoCm, but with thymidine. This will lead, like the hydrolysis, in the next step to a replacement of the isoG-isoCm base pair to a thymidine-adenine base pair.

M.A.X. and RestrictionDB

The Mutation Analysis Xplorer is basically a search on a for this case create database of restriction enzymes. To improve the runtime, reduce the necessary computing capacity, and to simplify the software itself, we wrote the python script “RestrictionDB”. Actually, RestrictionDB takes a text file from REBASE containing information about restriction enzymes and their recognition sequence. This parsing system, which knows how to read the REBASE data and translate them into object it can work with, is written in an abstract way. So, adapting the parser to files from other providers is relatively effortless, which allows the script to become with a few updates a universal parser. After parsing the data, restriction enzymes with very similar recognition sequences are grouped together. From the barely more than 400 used restriction enzymes are created more than 230.000 enzyme groups. Each group has an extended sequence. These sequences are determined by looking at the recognition sequences of the restriction enzymes and combining these, so all recognition sequences in a group are subsequences of the extended sequence in this group. Further, all recognition sequences and the extended sequence itself have on the same position a ‘#’ instead of the original base. The replacement of the ‘#’ by an A, C, G, or T could lead to one of the recognition sequences of the enzymes in this group and would lead to a cuttable DNA double strand in the lab. An example of an enzyme group is seen in Figure 6.
RestrictionDB creates just the database and does not look on preferences of the user, since its aim is to create a universal database of restriction enzymes grouped by similarity of their recognition sequence. The next step is to load this database into M.A.X. Obviously the huge output of more than 230.000 groups of enzymes takes some long calculation times. However, this job must be done only once everytime you want to use new sources for your database. To skip this step, we already provide a database ready to use. When starting M.A.X. you must give the path of the desired database. Then you see the graphical user interface of M.A.X. (Figure 7). By entering text into the text field “Search on database” and pressing enter you can start the search on the preferred data of the database. The output is a list of enzyme groups, which can be used to plan a wet lab prove of unnatural bases within a plasmid.

Usage

RestrictionDB takes one file as source, which has to be at the moment an enzyme information text file from REBASE. However, the code is optimized, to adapt easy to accept multiple files at once, which can be from additional providers. The default call of restrictionDB is
python restrictionDB.py <input file>
M.A.X. takes only one source of data as an argument, where the data must be the output from restrictionDB. The default call of M.A.X. is
python MAX.py <restrionDB output>
The graphical user interface opens after loading and parsing the given database (figure 7). The top text field can take inputs of multiple types depending on which of the radio buttons on the right side of this text field are chosen. For a search in enzyme names, the input consists of short strings. The database will be checked for enzyme groups, where at least one enzyme name contains the given string as a substring. The input of multiple strings separated by spaces is possible, so only enzyme groups, where all given strings appear in some enzyme name as substring are returned in the output. The Output is then listed in the table on the bottom half of the graphical user interface (Figure 6). It consists of the enzyme names, recognitions sequences containing the substitution char “#” and aligned to the extended sequence, the original base on the position of the substitution char, and the original recognition sites with displayed cleavage sites. Individual enzyme groups are separated by lines of minuses. To start the search, just press the return key on your keyboard. If you just want to display single results, which trigger the hit and not the whole enzyme group, uncheck the checkbox “show whole group”. The checkbox “Search exact sequence” suppresses hits, which have no miss match score just because a high amount of Ns in the sequence, so only hits where exact bases match are returned.

Figure 6: M.A.X. Output
An example of an output from a search on the database. The output consists of enzymegroups.

Next to enzyme names you can search for sequences, where enzyme groups are only returned, if the given sequence or its reverse complement is a substring of at least one enzymes recognition sequence or a substring of the extended sequence. For a search on other data connected to enzymes check the radio button “Other data”. On all searches you can limit the maximum number of results to reduce the runtime and cancel further search at a given point. To filter for special original bases at the position of the substitution char, enter the desired letters into the text field “Preferred Substitutions”. Multiple letters are possible, what leads to the usual value of “ACGT”, if this software is used to analyze the existence of unnatural bases in a DNA molecule. The search with this setting returns only enzyme groups, where at least one enzyme for any of the natural base at the substitution base position is able to cut the double strand of DNA. One example is shown in Figure 8, which shows also the four enzymes used in our M.A.X. experiment . The last text field is made of bigger sequences. The whole database is checked for having recognition sequences as substrings in this sequence. This is useful, if you already have a sequence of interest and want to find enzyme groups, which are able to cut within or next to this sequence.

Figure 7: Graphical User Interface of M.A.X.

Figure 8: Enzyme group cutting for any base
Shows an entry of the extended sequence and the four restriction enzymes used in our wet lab prove using the M.A.X. system. If the unnatural base stays at the position of the char "#", then none of the four enzymes can cut. Otherwise, for any mutation which lead to a substituiton to A, C, G, or T, one of the four displayed enzymes cuts the DNA strand.

Downloads

The work on this part of the project is being continued. Please visit the following GitHub repositories for up-to-date source code and documentation for both software tools, iCG and M.A.X.:

https://github.com/MarkusHaak/iCG https://github.com/MaximilianEdich/M.A.X.

Refereces

Debladis, E., Llauro, C., Carpentier, M.-C., Mirouze, M., and Panaud, O. (2017). Detection of active transposable elements in Arabidopsis thaliana using Oxford Nanopore Sequencing technology. BMC Genomics 18: 537.
Feng, Y., Zhang, Y., Ying, C., Wang, D., and Du, C. (2015). Nanopore-based Fourth-generation DNA Sequencing Technology. Genomics. Proteomics Bioinformatics 13: 4–16.
Fisher, R.A. et al. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 0: 1–3.
Simpson, J.T., Workman, R.E., Zuzarte, P.C., David, M., Dursi, L.J., and Timp, W. (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nat Meth 14: 407–410.
Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S Fourth. (Springer: New York).
R Core Team (2017). R: A Language and Environment for Statistical Computing.