MarkusHaak (Talk | contribs) |
MarkusHaak (Talk | contribs) |
||
Line 12: | Line 12: | ||
</div> | </div> | ||
</div> | </div> | ||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="bevel tr"></div> | ||
+ | <div class="content"> | ||
+ | <h1>Software</h1> | ||
+ | </div> | ||
+ | <div class="bevel bl"></div> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="bevel tr"></div> | ||
+ | <div class="content"> | ||
+ | |||
+ | <h3>Nanopore Sequencing of DNA Containing Unnatural Bases</h3> | ||
+ | <div class="article"> | ||
+ | As described on the <a target="_blank" href="">New Methods</a> page of this wiki, Nanopore sequencing promises to be especially beneficial regarding the analysis of experiments involving unnatural bases in DNA and RNA. As a single-molecule sequencing technology without the need for special, expensive chemistry or PCR amplification of DNA samples prior to analysis, it could serve as an innovative, novel approach for the analysis of experiments involving unnatural bases. | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | For the analysis of our experiments with isoG and isoC<sup>m</sup>, examined if Oxford Nanopore sequencing is suitable for sequencing DNA containing unnatural bases. Therefore, we developed a software called "iCG" for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural bases, including data processing and evaluation. Based on differences in the raw output signal of DNA reads containing an unnatural base compared to those containing only natural bases, the software uses linear discriminant analysis to create a statistical model that gives base predictions for a position of interest in a known sequence context, where either a natural or an unnatural base is present. This software is not limited to a certain unnatural base pair or to a certain sequence context, but rather serves as a pipeline tool for the creation of statistical models potentially allowing for the detection of any unnatural base pairs in an arbitrary sequence context based on reference sequencing data. | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | </div> | ||
+ | <div class="bevel bl"></div> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="bevel tr"></div> | ||
+ | <div class="content"> | ||
+ | |||
+ | |||
+ | <h3>iCG</h3> | ||
+ | |||
+ | <h4>Background</h4> | ||
+ | |||
+ | <div class="article"> | ||
+ | In ONT's sequencing technology, the ionic current that passes through a nanopore is not measured for every nucleotide individually. To cope with the influence of random fluctuations of the ionic current, the background noise, the signal is evaluated for a sequence of approximately 5 nucleotides. Every such k‑mer (in this case every possible combination of 5 successive nucleotides) produces a characteristic signal that is evaluated by the basecaller to predict the sequence of the analyzed DNA strand. Therefore, the signal trace of a certain nucleotide is not the same at every position where it occurs in the DNA strand, but is dependent on its sequence context. This has two interesting effects that have to be considered regarding ONT sequencing of unnatural bases: Firstly, the signal of the close sequence context has to be considered too when analyzing if an unnatural base has a significant effect on the ionic current flow. Secondly, if an unnatural base indeed significantly influences the signal, it will have an impact on the signal traces surrounding the position where it is occurring in the DNA. Therefore, the basecalling may fail for natural bases surrounding this positon. | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | As a consequence of these considerations, a basecaller that shall be able to differentiate between natural bases and unnatural bases at any sequence position would need to be trained with signals of every possible k‑mer that is a combination of natural and unnatural bases. As the number of possible k‑mers given n possibilities is n<sup>k</sup>, 6752 new k-mers would have to be analyzed, which is beyond our means. Nevertheless, since unnatural base pairs are mostly used at well-known locations, e.g. to expand the genetic code in <i>in vitro</i> or <i>in vivo</i> experiments, it is sufficient to analyze only certain k‑mers that are of interest. The comparison to k‑mers containing all natural bases provides a foundation for a software that differentiates between unnatural and natural base pairs at a given position of interest (POI). This is the approach for our software tool. | ||
+ | </div> | ||
+ | |||
+ | <h4>Overview</h4> | ||
+ | |||
+ | <div class="article"> | ||
+ | The software package iCG is a collection of three python scripts for the analysis of Oxford Nanopore sequencing data of DNA containing unnatural base pairs. The first script, <b>iCG filter</b>, is used to filter basecalled Nanopore data by certain criteria like read length and quality. Its fundamental function is to select sequencing reads that contain the position of interest, where the unnatural base pair is subjected to be, irrespective of its presence. The second scipt, <b>iCG model</b>, creates a statistical model for the detection of unnatural base pairs in sequencing reads based on linear discriminant analysis of ONT data previously filtered by iCG filter. The third script, <b>iCG predict</b>, uses a model created by iCG model to predict which base pair is present at the position of interest for a set of filtered sequencing reads. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure seventy"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/7/7a/ICG_overview.png"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 1: Overview of the iCG functionality.</b> | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <h4>Dependencies & Installation</h4> | ||
+ | <div class="article"> | ||
+ | Our software works with basecalled Nanopore data. For Nanopore sequencing, we used the ONT software MinKNOW version 1.7.14.1 in combination with FLO-MIN106 R9.4 flowcells on a MinION sequencer. Other combinations of hardware and consumables, such as 1D<sup>2</sup> flowcells, were not tested in combination with this software package and will probably require for small adjustments. Basecalling has to be performed with the ONT basecaller Albacore. | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | The iCG software package has some dependencies that need to be installed before it can be used. First of all, <a target="_blank" href="https://support.hdfgroup.org/HDF5/">HDF5</a> needs to be installed. HFD, standing for Hierarchical Data Format, is a format for storing and retrieving large data sets from/to files and is the format in which Nanopore data is stored. For building and installing from sources, please refer to the distributors website. For installation from depository: | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | > sudo apt-get install libhdf5-serial-dev | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | Additionally, the alignment and mapping software package BWA‑MEM (Li, 2013) needs to be installed. For additional help regarding the installation process, please refer to their <a target="_blank" href="http://bio-bwa.sourceforge.net/">sourceforge</a> or <a target="_blank" href="https://github.com/lh3/bwa">github</a> website. | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | > git clone https://github.com/lh3/bwa.git<br> | ||
+ | > cd bwa; make<br> | ||
+ | > ./bwa index ref.fa<br> | ||
+ | > ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz<br> | ||
+ | > ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | Furthermore, the software environment R for statistical computing and graphics needs to be installed. For a system specific installation, please refer to the <a target="_blank" href="https://www.r-project.org/">r-project</a> website. For statistical computations and plotting, some additional R packages need to be installed. After the R environment is installed, please print the following commands in an R console: | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | > install.packages("ggplot2")<br> | ||
+ | > install.packages("cowplot")<br> | ||
+ | > install.packages("MASS") | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | iCG itself is written in python2.7 code. Therefore, a python2.7 environment needs to be installed. Please follow the instructions on <a target="_blank" href="https://www.python.org/">python.org</a> for a system specific installation. Afterwards, use the python package managment tool <a target="_blank" href="https://pypi.python.org/pypi/pip">pip</a> for the installation of required python packages. For that, open a terminal and execute the following commands: | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | > pip2.7 install regex<br> | ||
+ | > pip2.7 install biopython<br> | ||
+ | > pip2.7 install -Iv rpy2==2.8.0 <font color="grey"># on 10-26-2017, rpy2 v2.9 was incompatible with python2.7 </font><br> | ||
+ | > pip2.7 install nanoraw[plot] | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | With all dependencies installed, download the latest version of iCG together with example data sets in the <a target="_blank" href="">download section</a> of this page. To learn about how to use iCG, please continue reading or start right ahead by executing the following command in a terminal, having navigated to the folder where the downloaded code is located: | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | python2 iCG.py -h | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | In order to test if the installation was successful, please consider executing the following commands one after the other: | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | > python2 iCG.py filter iCG_test/UBP iCG_test/A iCG_test/G iCG_test/T iCG_test/C -r iCG_test/reference.fasta --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE<br> | ||
+ | > python2 iCG.py model iCG_test/results -UBP iCG_test/UBP/filtered -AT iCG_test/A/filtered -GC iCG_test/G/filtered -TA iCG_test/T/filtered -CG iCG_test/C/filtered --UBPsense iCme --UBPantisense iG<br> | ||
+ | > python2 iCG.py predict iCG_test/results/antisense_0.3.model iCG_test/results/antisense_test_reads --dst_dir results/prediction<br> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | <h4>iCG filter</h4> | ||
+ | |||
+ | <h5>Description</h5> | ||
+ | |||
+ | <div class="article"> | ||
+ | The command iCG filter is executed on basecalled Nanopore data before any further analysis can be performed with iCG. It uses the information generated by the ONT basecaller albacore to filter a set of sequencing reads for sufficient quality, length and the existence of the position of interest. After the reads have been filtered, they are divided into sense and antisense reads regarding their orientation relative to a given reference sequence. Afterwards, the reads are "re-squiggled" using nanoraw (Stoiber, 2016), meaning that every read is aligned to the given reference sequence in order to correct bascalling errors and to normalize the raw read signals in terms of duration and amplitude. This filtering procedure is mandatory for the subsequent analysis with both iCG model and iCG predict. | ||
+ | </div> | ||
+ | <p></p> | ||
+ | <div class="article"> | ||
+ | For subsequent analysis, it is particularly important that the filtering process is performed irrespective of the position of interest. The potential influence of the unnatural bases on the sequencing signal may lead to bascalling misbehavior which in turn could lead to falsified analysis results if the position of interest and the neighboring sequence context is considered during filtering. Therefore, the presence of the region of interest in a read sequence is checked by regular expression pattern matching, omitting the immediate sequence context of the position of interest, called "blur region". Figure 2 illustrates how the position of interest, the region of interest, the blur region and the regular expression are connected with each other. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure seventy"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/5/5d/ICG_terminology.png"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 2: Terminology and regular expression used for pattern matching.</b>In order to identify reads containing the region of interest, a regular expression is used for pattern matching that is omitting the immediate sequence context of the position of interest, called "blur" region. The displayed regular expression is resulting from the default parameter settings for the exemplary reference sequence. | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | Every aspect of this pattern matching approach is adoptable by optional arguments passed to iCG on execution, including the size of the region of interest, the size of the blur region and the amount of mismatches allowed during pattern matching. Additional parameters define whether insertions and deletions (indels) are allowed and how much the length of a match is allowed to deviate from the length of the region of interest as a consequence of indels. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | In addition to pattern matching, the defined region of interest is also used for evaluating the read quality in the neighboring sequence context of the position of interest. Assuming that the quality in the blur region is high if the quality of its surrounding sequence context is high too, this criteria can be especially assessed by setting the <font face="courier new">--min_mean_context_qscore</font> parameter accordingly. | ||
+ | </div> | ||
+ | |||
+ | <h5>Usage & Output</h5> | ||
+ | |||
+ | <div class="article"> | ||
+ | As a positional argument, the script needs to be handed a non-empty list of albacore base directories containing basecalled reads to be filtered. Additional required arguments that need to be specified are the path to the bwa mem executable and the fasta file containing the reference sequence the reads shall be mapped to. A basic, abstract program execution looks as follows: | ||
+ | </div> | ||
+ | |||
+ | <div class="codeblock"> | ||
+ | python2 iCG.py filter PATH/TO/ALBACORE/DIR_1 PATH/TO/ALBACORE/DIR_2 --bwa_mem_exe PATH/TO/BWA-MEM/EXECUTABLE -r PATH/TO/FASTA | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | All specified directories are subsequently filtered using the same filter settings. These settings can be modified by specifying optional arguments. The results of the program execution are saved in a new directory inside the albacore base directory named "filtered" by default. In addition to the filtered reads, sorted into subdirectories named "sense" and "antisense", the destination folder will contain a file called settings.txt holding all specified filtering settings. Additionally, a directory plots is created on default containing plots of the normalized signal traces at the region of interest for both the sense and the antisense strand orientation. | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | During program execution, iCG filter prints useful information regarding current steps of the filtering process, revealing how the result maybe optimized. The following is an example for the console output created by iCG filter: | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | <p style='padding-left: 3em; font-family:"courier new"; text-align: left'> | ||
+ | filtering for minimal mean quality of 10.0<br> | ||
+ | 67402 removed (40.81 % of remaining)<br> | ||
+ | 97739 remaining (59.19 % of total)<br> | ||
+ | <br> | ||
+ | filtering for minimal length of 1887.75<br> | ||
+ | removed 91117 (93.22 % of remaining)<br> | ||
+ | 6622 remaining (4.01 % of total)<br> | ||
+ | <br> | ||
+ | filtering for maximal length of 2517<br> | ||
+ | removed 1330 (20.08 % of remaining)<br> | ||
+ | 5292 remaining (3.20 % of total)<br> | ||
+ | <br> | ||
+ | adding fastq information<br> | ||
+ | <br> | ||
+ | filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation<br> | ||
+ | sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}<br> | ||
+ | antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}<br> | ||
+ | <br> | ||
+ | deviations in sequence length of remaining sequences (* selected):<br> | ||
+ | sense antisense<br> | ||
+ | -5: 6 3<br> | ||
+ | -4: 9 9<br> | ||
+ | -3: 34 10<br> | ||
+ | *-2: 71 32<br> | ||
+ | *-1: 135 94<br> | ||
+ | * 0: 875 705<br> | ||
+ | * 1: 84 47<br> | ||
+ | * 2: 44 24<br> | ||
+ | 3: 14 12<br> | ||
+ | 4: 1 1<br> | ||
+ | 5: 4 0<br> | ||
+ | ------ ------<br> | ||
+ | * 1209 902 <br> | ||
+ | removed 3181 (60.11 % of remaining), 19 due to multiple RE matches<br> | ||
+ | 2111 remaining (1.28 % of total)<br> | ||
+ | <br> | ||
+ | filtering for context quality higher than 14.0 (* selected)<br> | ||
+ | sense antisense<br> | ||
+ | 7-8: 0 1<br> | ||
+ | 8-9: 3 1<br> | ||
+ | 9-10: 12 5<br> | ||
+ | 10-11: 18 6<br> | ||
+ | 11-12: 38 24<br> | ||
+ | 12-13: 70 41<br> | ||
+ | 13-14: 108 82<br> | ||
+ | *14-15: 142 101<br> | ||
+ | *15-16: 193 128<br> | ||
+ | *16-17: 155 109<br> | ||
+ | *17-18: 167 116<br> | ||
+ | *18-19: 127 95<br> | ||
+ | *19-20: 84 78<br> | ||
+ | *20-21: 56 54<br> | ||
+ | *21-22: 25 29<br> | ||
+ | *22-23: 7 19<br> | ||
+ | *23-24: 3 10<br> | ||
+ | *24-25: 1 1<br> | ||
+ | *25-26: 0 2<br> | ||
+ | ------ ------<br> | ||
+ | * 960 742 <br> | ||
+ | removed 409 (19.37 % of remaining)<br> | ||
+ | 1702 remaining (1.03 % of total)<br> | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | Further help regarding the usage of iCG filter is given in the <a target="_blank" href="">help pages</a> that are accessible by passing the argument <font face="courier new">-h</font> or <font face="courier new">--help</font> to the program on execution. They explain every positional and optional argument in detail, including their default values. | ||
+ | </div> | ||
+ | |||
+ | <h5>Help pages</h5> | ||
+ | |||
+ | <font face="courier new"> | ||
+ | <div class="article"> | ||
+ | <p> | ||
+ | usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE<br> | ||
+ | [--dst_dir_name DST_DIR_NAME]<br> | ||
+ | [--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]<br> | ||
+ | [--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]<br> | ||
+ | [--blur_deviation BLUR_DEVIATION]<br> | ||
+ | [--context_deviation CONTEXT_DEVIATION] [--no_indels]<br> | ||
+ | [--max_length_deviation MAX_LENGTH_DEVIATION]<br> | ||
+ | [--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]<br> | ||
+ | [--greedy_regex_search] [--cpts_limit CPTS_LIMIT]<br> | ||
+ | [--normalization_type {median,pA,pA_raw,none}]<br> | ||
+ | [--processes PROCESSES] [--plot_poi]<br> | ||
+ | [--disp_bases DISP_BASES] [--barcode BARCODE]<br> | ||
+ | [--barcode_deviation BARCODE_DEVIATION]<br> | ||
+ | [basecalled_dirs [basecalled_dirs ...]] | ||
+ | </p> | ||
+ | <p> | ||
+ | This sub-script takes a set of directories containing basecalled Nanopore reads as | ||
+ | input and filters the contained reads by several criteria given as arguments | ||
+ | to this script. Afterwards, the remaining reads are resquigled by nanoraw and | ||
+ | copied to a new folder inside the basecalling directory. | ||
+ | </p> | ||
+ | |||
+ | |||
+ | <br> | ||
+ | <br> | ||
+ | positional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third">basecalled_dirs</div> | ||
+ | <div class="third double">Paths to directories containing basecalled reads that shall be filtered.</div> | ||
+ | </div> | ||
+ | |||
+ | <br> | ||
+ | optional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-h, --help</div> | ||
+ | <div class="third double">show this help message and exit</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--bwa_mem_exe BWA_MEM_EXE</div> | ||
+ | <div class="third double">Required argument. Path to the bwa_mem executable.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-r REFERENCE, --reference REFERENCE</div> | ||
+ | <div class="third double">Required argument. Path to a fasta file holding a single sequence as a reference for mapping reads for resquiggling.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--dst_dir_name DST_DIR_NAME</div> | ||
+ | <div class="third double">Name of the destination folder. Default: filtered</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--min_mean_qscore MIN_MEAN_QSCORE</div> | ||
+ | <div class="third double">The minimal mean phred qscore of all basecalles of a | ||
+ | single read, representing the overall quality of the | ||
+ | read in assessment. Default:10.0</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--min_length MIN_LENGTH</div> | ||
+ | <div class="third double">The minimal length of reads. The default value is 0.75 | ||
+ | the length of the reference sequence.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--max_length MAX_LENGTH</div> | ||
+ | <div class="third double">The maximal length of reads. The default value is the | ||
+ | length of the reference sequence.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--radius RADIUS</div> | ||
+ | <div class="third double">The number of bases upstream and downstream of the | ||
+ | position of interest which form the region of interest | ||
+ | (ROI). In combination with the blur radius, the ROI | ||
+ | defines the sequence for the regular expression search | ||
+ | used for deciding weather a read is covering the | ||
+ | position of interest. Default:15</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--blur BLUR</div> | ||
+ | <div class="third double">The number of bases upstream and downstream of the | ||
+ | position of interest which are excluded from the | ||
+ | regular expression search used for deciding weather a | ||
+ | read is covering the position of interest. This is | ||
+ | necessary to to the possibile influence that an | ||
+ | unnatural base might have on the signal trace of its | ||
+ | surrounding bases. Default: 3</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--blur_deviation BLUR_DEVIATION</div> | ||
+ | <div class="third double">The amount of bases the blur region is allowed to vary | ||
+ | in length. Default: 1</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--context_deviation CONTEXT_DEVIATION</div> | ||
+ | <div class="third double">The amount of deviations, both indels and mismatches, | ||
+ | the upstream and downstream sequence contexts of the | ||
+ | region of interest is allowed to have in the regular | ||
+ | expression search. Default:2</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--no_indels</div> | ||
+ | <div class="third double">If specified, indels are not accepted in the regular | ||
+ | expression search.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--max_length_deviation MAX_LENGTH_DEVIATION</div> | ||
+ | <div class="third double">The maximum deviation in length between the region of | ||
+ | interest in the reference sequence and the read | ||
+ | sequence, taking both the blur deviation and the | ||
+ | context deviation into consideration.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE</div> | ||
+ | <div class="third double">The minimal mean phred qscore of basecalls in the | ||
+ | region of interest, excluding the blurred radius | ||
+ | around the position of interest. Default: 2</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--greedy_regex_search</div> | ||
+ | <div class="third double">If specified, the regular expression search is greedy | ||
+ | meaning that overlapping hits are impossible. Faster, | ||
+ | but generally not recommended.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--cpts_limit CPTS_LIMIT</div> | ||
+ | <div class="third double">cpts parameter for nanoraw genome_resquiggle. Default: | ||
+ | 100</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--normalization_type {median,pA,pA_raw,none}</div> | ||
+ | <div class="third double">Type of normalization used in the nanoraw | ||
+ | genome_resquiggle script for calculating statistics | ||
+ | based on new segmentation of raw signal.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--processes PROCESSES</div> | ||
+ | <div class="third double">Number of processes for nanoraw genome_resquiggle. | ||
+ | Default: 4</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--plot_poi</div> | ||
+ | <div class="third double">If specified, a plot of the resquiggled signal traces | ||
+ | of the filtered reads is created and saved as svg.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--disp_bases DISP_BASES</div> | ||
+ | <div class="third double">Size of displayed region in number of bases when | ||
+ | plotting the signal traces of the input data at the | ||
+ | region of interest. Default: 21</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--barcode BARCODE</div> | ||
+ | <div class="third double">If specified, only reads containing the given barcode | ||
+ | string pass the filtering process.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--barcode_deviation BARCODE_DEVIATION</div> | ||
+ | <div class="third double">The amount of deviations, both indels and mismatches, | ||
+ | a potential regular expression match against the | ||
+ | defined barcode is allowed to have. Default: 3</div> | ||
+ | </div> | ||
+ | |||
+ | </div> | ||
+ | </font> | ||
+ | |||
+ | |||
+ | <h4>iCG model</h4> | ||
+ | |||
+ | <h5>Description</h5> | ||
+ | <div class="article"> | ||
+ | The script iCG model creates a linear discriminant analysis model to discriminate between natural bases and unnatural bases in the sequence context that is present in the DNA samples used to compute the model. As an input, the script needs paths to directories containing sequencing data filtered by iCG filter: One directory containing data of a DNA sample with the UBP being analyzed, and at least one containing natural bases at the position of interest. For these datasets, the program identifies the exact location of the position of interest in the raw data of every read, taking into consideration the alignment data previously produced by nanoraw. Then, the mean raw signal in the normalized intervals from two bases upstream to two bases downstream of the position of interest is calculated for every read. Figure 4 gives an example of how this processing is performed. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/f/f0/ICG_model_raw_data.png"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2.</b>The plot on the left hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right hand side lists the average characteristics of this group regarding these sequence positions. | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | With this step, a template dataset consisting of entries for every read in every directory passed to iCG model is created. Each entry holds the mean signal for the five base positions at the position of interest and a group identifier indicating to which input directory the read belongs. The script then applies linear discriminant function analysis to this template dataset in order to create an equation system that classifies the different groups of data based on the mean signals at the five positions. The linear discriminant function is provided by the MASS package (Venables and Ripley, 2002) for the software environment R (R Core Team, 2017). This analysis method provides the linear combinations of the five mean values which best discriminates between reads of different groups, mainly by maximizing variance between reads of different groups and minimizing variance of reads of the same group (see Fisher, 1936, for this approach). By applying the resulting equation system to the very dataset the model was created with, a graphical representation of the classification is attainable. The number of dimensions is depending on the number of linear discriminants needed to discriminate all groups from each other. In most cases, two linear discriminants are sufficient, resulting in a two-dimensional dot-plot with one axis for each linear discriminant. | ||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | This approach is best described by taking a closer look at an example. Let us take aside the unnatural bases for a moment and consider that we wanted to create a model that is able to discriminate between the natural bases A, G, T and C at the position of interest in the sequence context presented in <a target="_blank" href="">figure 2</a>. First of all, we take sequencing data of four different DNA samples, each one having a different natural base at the position of interest. With the help of iCG filter, the intividual sets of sequencing data are filtered for reads containing the region of interest. Now, the paths of each folder containing one of the four groups are passed to iCG model. <a target="_blank" href="">Figure 4</a> shows the group means of the five positions (mean_0 to mean_5), the coefficients of three linear discriminants resulting from the linear discriminant analysis, and a two dimensional dot-plot showing the linear discriminants of the template dataset. | ||
+ | </div> | ||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/6/63/ICG_model_example.png"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 4: Example output of iCG model.</b>Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The portion of traces indicates the effect of each linear discriminant regarding the discrimination of the different groups. In this particular example, first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible. | ||
+ | </p> | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | |||
+ | </div> | ||
+ | <br> | ||
+ | <div class="article"> | ||
+ | One great challenge regarding the creation of a good statistical model is the quality of the template data used for the statistical analysis. Ideally, the template dataset should only contain reads that can be assigned to their respective group with absolute certainty. Every read that is assigned wrong will prevent the algorithm performing the linear discriminant analysis from finding liner coefficients that effectively diverge the groups from each other. Unfortunately, Nanopore sequencing has the disadvantage that samples are sequenced one after the other. Even though a washing step is performed between consecutive sequencing runs, parts of the previous libraries remain inside the flow cell and will procude reads. According to Oxford Nanopore technologies, the portion of contaminating reads from previous runs is up to 10 % when washing with the Flow Cell Wash kit EXP-WSH002, depending on the number of previously loaded libraries (see <a target="_blank" href="https://store.nanoporetech.com/flow-cell-wash-kit-r9.html">EXP-WSH002</a>). In our experiments, we found the degree of contamination to be even higher for some samples. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/9/9d/ICG_model_quantil.gif"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 5: Effect of removing most deviating reads on the normalized signal traces.</b>With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signals traces become more and more focused to the mean signal of the group being analyzed. Contaminations of previous sequencing libraries are effectively removed. | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | To face this challenge and to improve the overall quality of the statistical models created with iCG, we implemented an additional filtering algorithm into iCG model that recursively removes reads that are deviating the most from the median signal in the neighboring sequence context of the position of interest. This filtering step is performed on every set of reference reads individually. For deciding which reads are removed, the positions from -2 to +2 bases in relation to the position of interest are evaluated. By setting the parameter <font face="courier new">--quantiles_to_remove</font> accordingly, the influence of this filtering process can be observed by taking a look at the plots of the signal traces, which are automatically produced and saved by iCG. Due to the recursive nature of the function, which calculates new median values after every single read that has been removed, the algorithm is hardly influencing the reads that truly belong to the reference sample being sequenced, as long as the contamination is lower than 50 %. Figure 5 illustrates the effect of this filtering step on the signal traces of a sequencing run with relatively high contamination. Figure 6 gives an example of the effect on the linear discriminant analysis. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure seventy"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/4/4b/ICG_model_quantil_2.gif"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads.</b>With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <h5>Usage & Output</h5> | ||
+ | |||
+ | <div class="article"> | ||
+ | As a positional argument, the script needs a path to a folder where the results shall be saved. Additionally, the paths to directories containing reads filtered by iCG filter need to be passed to iCG model. At least the directory containing filtered reads with the unnatural base pair and one arbitrary natural base pair is required for program execution. A basic, abstract program execution looks as follows: | ||
+ | </div> | ||
+ | |||
+ | <div class="codeblock"> | ||
+ | python2 iCG.py model PATH/TO/RESULTS -UBP PATH/TO/FILTERED/UBP-READS -NN PATH/TO/OTHER/FILTERED/READS | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | In the destination folder, ".model" files containing the statistical model of the linear discriminant analysis are saved. These can be passed to iCG predict for base prediction of new Nanopore sequencing data. Dot-plots and statistics for the linear discriminant models are saved in subdirectories named plots and stats accordingly. Furthermore, plots of the signal traces at the position of interest are created for each group of reads used for model creation. | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | Further help regarding the usage of iCG model is given in the <a target="_blank" href="">help pages</a> that are accessible by passing the argument <font face="courier new">-h</font> or <font face="courier new">--help</font> to the program on execution. They explain every positional and optional argument in detail, including their default values. | ||
+ | </div> | ||
+ | |||
+ | <h5>Help pages</h5> | ||
+ | |||
+ | <font face="courier new"> | ||
+ | <div class="article"> | ||
+ | <p> | ||
+ | usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]<br> | ||
+ | [-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]<br> | ||
+ | [-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]<br> | ||
+ | [--print_group_characteristics] [--no_data_plots]<br> | ||
+ | [--disp_bases DISP_BASES]<br> | ||
+ | [-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]<br> | ||
+ | [--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]<br> | ||
+ | [--UBantisense UBANTISENSE]<br> | ||
+ | dst_dir | ||
+ | </p> | ||
+ | <p> | ||
+ | This sub-script creates a linear discriminant analysis model to | ||
+ | discriminate between unnatural bases and natural bases at a given position of | ||
+ | interest in a reference sequence. Therefore, the script is provided with paths | ||
+ | to directories containing reads that were previously filtered and resquiggled | ||
+ | by the script "iCG filter". | ||
+ | </p> | ||
+ | |||
+ | |||
+ | <br> | ||
+ | <br> | ||
+ | positional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third">dst_dir</div> | ||
+ | <div class="third double">Path to destination folder where all created results shall be saved.</div> | ||
+ | </div> | ||
+ | |||
+ | <br> | ||
+ | optional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-h, --help</div> | ||
+ | <div class="third double">show this help message and exit</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-UBP UBP_FLTRD_FAST5_DIR, --UBP_fltrd_fast5_dir UBP_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Required argument. Relative or absolute path to the | ||
+ | directory containing filtered .fast5 files of reads | ||
+ | with the unnatural base pair at the position of | ||
+ | interest being analyzed.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-AT AT_FLTRD_FAST5_DIR, --AT_fltrd_fast5_dir AT_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Relative or absolute path to the directory containing | ||
+ | filtered .fast5 files of reads with an A in the sense | ||
+ | strand and a T in the antisense strand at the position | ||
+ | of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-GC GC_FLTRD_FAST5_DIR, --GC_fltrd_fast5_dir GC_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Relative or absolute path to the directory containing | ||
+ | filtered .fast5 files of reads with an G in the sense | ||
+ | strand and a C in the antisense strand at the position | ||
+ | of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-TA TA_FLTRD_FAST5_DIR, --TA_fltrd_fast5_dir TA_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Relative or absolute path to the directory containing | ||
+ | filtered .fast5 files of reads with an T in the sense | ||
+ | strand and a A in the antisense strand at the position | ||
+ | of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-CG CG_FLTRD_FAST5_DIR, --CG_fltrd_fast5_dir CG_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Relative or absolute path to the directory containing | ||
+ | filtered .fast5 files of reads with an C in the sense | ||
+ | strand and a G in the antisense strand at the position | ||
+ | of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-NN NN_FLTRD_FAST5_DIR, --NN_fltrd_fast5_dir NN_FLTRD_FAST5_DIR</div> | ||
+ | <div class="third double">Relative or absolute path to a directory containing | ||
+ | filtered .fast5 files of reads with any natural bases | ||
+ | in the sense strand and antisense strand at the | ||
+ | position of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--print_group_characteristics</div> | ||
+ | <div class="third double">If specified, detailed statistical information about the position of interest of each given group of | ||
+ | reads is stored in their respective folders. This information includes called bases as well as mean and median normalized signals with their respective deviations for the position of interest as well as for the two positions upstream and downstream.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--no_data_plots</div> | ||
+ | <div class="third double">If specified, no plots are created showing the signal | ||
+ | traces of the input data at the region of interest.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--disp_bases DISP_BASES</div> | ||
+ | <div class="third double">Size of displayed region in number of bases when | ||
+ | plotting the signal traces of the input data at the | ||
+ | region of interest. Default: 21</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]], --quantiles_to_remove [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]</div> | ||
+ | <div class="third double">Specifies the quantile of reads being filtered out for | ||
+ | deviating the most from the median signal in the | ||
+ | neighboring sequence context of the position of | ||
+ | interest. Helpful for removing reads that originate | ||
+ | from previously sequenced libraries. Default: 0.3</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--no_model_plots</div> | ||
+ | <div class="third double">If specified, no plots are created showing the | ||
+ | distribution and classification of reads by the linear | ||
+ | discriminant analysis model.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--no_model_stats</div> | ||
+ | <div class="third double">If specified, no stats are saved giving information | ||
+ | about the quality of classification of the linear | ||
+ | discriminant model with respect to the input data.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--UBsense UBSENSE</div> | ||
+ | <div class="third double">short identifier for the unnatural base in the sense | ||
+ | strand. Default: UB(+)</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--UBantisense UBANTISENSE</div> | ||
+ | <div class="third double">short identifier for the unnatural base in the | ||
+ | antisense strand. Default: UB(-)</div> | ||
+ | </div> | ||
+ | |||
+ | </div> | ||
+ | </font> | ||
+ | |||
+ | <h4>iCG predict</h4> | ||
+ | |||
+ | <h5>Description</h5> | ||
+ | |||
+ | <div class="article"> | ||
+ | Based on a linear discriminant analysis model previously created by iCG model, iCG predict gives base predictions for a position of interest of reads filtered by iCG filter. In addition to a text output on the console, the result is saved as a .csv file and as a plot showing the linear discriminants of the predicted reads. An example prediction result is shown in figure 7. | ||
+ | </div> | ||
+ | |||
+ | <div class="figure seventy"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/9/9d/ICG_predict_example.png"> | ||
+ | <p class="figure subtitle"> | ||
+ | <b>Fig. 5: Example of a base prediction result created by iCG predict.</b> Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction. | ||
+ | </p> | ||
+ | </div> | ||
+ | |||
+ | <h5>Usage & Output</h5> | ||
+ | |||
+ | <div class="article"> | ||
+ | As a positional arguments, the script needs to be given the paths to the model file that shall be used as well as to the directory containing reads for base prediction, previously filtered with iCG filter. Optionally, a directory for the results can be specified by setting the argument <font face="courier new">--dst_dir</font>. | ||
+ | </div> | ||
+ | |||
+ | <div class="codeblock"> | ||
+ | python2 iCG.py predict PATH/TO/MODEL PATH/TO/FILTERED/READS | ||
+ | </div> | ||
+ | |||
+ | <div class="article"> | ||
+ | Further help regarding the usage of iCG predict is given in the <a target="_blank" href="">help pages</a> that are accessible by passing the argument <font face="courier new">-h</font> or <font face="courier new">--help</font> to the program on execution. They explain every positional and optional argument in detail, including their default values. | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
+ | <h5>Help pages</h5> | ||
+ | |||
+ | <font face="courier new"> | ||
+ | <div class="article"> | ||
+ | <p> | ||
+ | usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir | ||
+ | </p> | ||
+ | <p> | ||
+ | Based on a linear discriminant analysis model previously created by "iCG | ||
+ | model", this sub-script classifies the base at a given position of interest for a | ||
+ | set of filtered reads as input. | ||
+ | </p> | ||
+ | |||
+ | |||
+ | <br> | ||
+ | <br> | ||
+ | positional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">model</div> | ||
+ | <div class="third double">Absolute or relative path to the .model file holding the | ||
+ | linear discriminant analysis model desired for base | ||
+ | prediction.</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">in_dir</div> | ||
+ | <div class="third double">Absolute or relative path to the directory containing | ||
+ | filtered and resquiggled reads of particular orientation | ||
+ | for base prediction.</div> | ||
+ | </div> | ||
+ | |||
+ | <br> | ||
+ | optional arguments: | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">-h, --help</div> | ||
+ | <div class="third double">show this help message and exit</div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentline"> | ||
+ | <div class="third" style="text-align:left">--dst_dir DST_DIR</div> | ||
+ | <div class="third double">Folder where the results will be saved. Default: ./</div> | ||
+ | </div> | ||
+ | |||
+ | </div> | ||
+ | </font> | ||
+ | |||
+ | |||
+ | </div> | ||
+ | <div class="bevel bl"></div> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="bevel tr"></div> | ||
+ | <div class="content"> | ||
+ | |||
+ | |||
+ | <h3>Refereces</h3> | ||
+ | <div class="article"> | ||
+ | <b>Debladis, E., Llauro, C., Carpentier, M.-C., Mirouze, M., and Panaud, O.</b> (2017). Detection of active transposable elements in Arabidopsis thaliana using Oxford Nanopore Sequencing technology. BMC Genomics <b>18</b>: 537. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>Feng, Y., Zhang, Y., Ying, C., Wang, D., and Du, C.</b> (2015). Nanopore-based Fourth-generation DNA Sequencing Technology. Genomics. Proteomics Bioinformatics <b>13</b>: 4–16. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>Fisher, R.A. et al.</b> (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics <b>7</b>: 179-188. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>Li, H.</b> (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. <b>0</b>: 1–3. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>Simpson, J.T., Workman, R.E., Zuzarte, P.C., David, M., Dursi, L.J., and Timp, W.</b> (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nat Meth <b>14</b>: 407–410. | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>Venables, W.N. and Ripley, B.D.</b> (2002). Modern Applied Statistics with S Fourth. (Springer: New York). | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <b>R Core Team</b> (2017). R: A Language and Environment for Statistical Computing. | ||
+ | </div> | ||
+ | |||
+ | |||
+ | </div> | ||
+ | <div class="bevel bl"></div> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
</div> | </div> |
Revision as of 00:26, 1 November 2017
Software
Nanopore Sequencing of DNA Containing Unnatural Bases
iCG
Background
Overview
Fig. 1: Overview of the iCG functionality.
Dependencies & Installation
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
> install.packages("cowplot")
> install.packages("MASS")
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]
> python2 iCG.py model iCG_test/results -UBP iCG_test/UBP/filtered -AT iCG_test/A/filtered -GC iCG_test/G/filtered -TA iCG_test/T/filtered -CG iCG_test/C/filtered --UBPsense iCme --UBPantisense iG
> python2 iCG.py predict iCG_test/results/antisense_0.3.model iCG_test/results/antisense_test_reads --dst_dir results/prediction
iCG filter
Description
Fig. 2: Terminology and regular expression used for pattern matching.In order to identify reads containing the region of interest, a regular expression is used for pattern matching that is omitting the immediate sequence context of the position of interest, called "blur" region. The displayed regular expression is resulting from the default parameter settings for the exemplary reference sequence.
Usage & Output
filtering for minimal mean quality of 10.0
67402 removed (40.81 % of remaining)
97739 remaining (59.19 % of total)
filtering for minimal length of 1887.75
removed 91117 (93.22 % of remaining)
6622 remaining (4.01 % of total)
filtering for maximal length of 2517
removed 1330 (20.08 % of remaining)
5292 remaining (3.20 % of total)
adding fastq information
filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}
deviations in sequence length of remaining sequences (* selected):
sense antisense
-5: 6 3
-4: 9 9
-3: 34 10
*-2: 71 32
*-1: 135 94
* 0: 875 705
* 1: 84 47
* 2: 44 24
3: 14 12
4: 1 1
5: 4 0
------ ------
* 1209 902
removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
2111 remaining (1.28 % of total)
filtering for context quality higher than 14.0 (* selected)
sense antisense
7-8: 0 1
8-9: 3 1
9-10: 12 5
10-11: 18 6
11-12: 38 24
12-13: 70 41
13-14: 108 82
*14-15: 142 101
*15-16: 193 128
*16-17: 155 109
*17-18: 167 116
*18-19: 127 95
*19-20: 84 78
*20-21: 56 54
*21-22: 25 29
*22-23: 7 19
*23-24: 3 10
*24-25: 1 1
*25-26: 0 2
------ ------
* 960 742
removed 409 (19.37 % of remaining)
1702 remaining (1.03 % of total)
Help pages
usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE
[--dst_dir_name DST_DIR_NAME]
[--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
[--blur_deviation BLUR_DEVIATION]
[--context_deviation CONTEXT_DEVIATION] [--no_indels]
[--max_length_deviation MAX_LENGTH_DEVIATION]
[--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
[--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
[--normalization_type {median,pA,pA_raw,none}]
[--processes PROCESSES] [--plot_poi]
[--disp_bases DISP_BASES] [--barcode BARCODE]
[--barcode_deviation BARCODE_DEVIATION]
[basecalled_dirs [basecalled_dirs ...]]
This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquigled by nanoraw and copied to a new folder inside the basecalling directory.
positional arguments:
optional arguments:
iCG model
Description
Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2.The plot on the left hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right hand side lists the average characteristics of this group regarding these sequence positions.
Fig. 4: Example output of iCG model.Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The portion of traces indicates the effect of each linear discriminant regarding the discrimination of the different groups. In this particular example, first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.
Fig. 5: Effect of removing most deviating reads on the normalized signal traces.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signals traces become more and more focused to the mean signal of the group being analyzed. Contaminations of previous sequencing libraries are effectively removed.
Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads.With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly
Usage & Output
Help pages
usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]
[-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
[-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
[--print_group_characteristics] [--no_data_plots]
[--disp_bases DISP_BASES]
[-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
[--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
[--UBantisense UBANTISENSE]
dst_dir
This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. Therefore, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".
positional arguments:
optional arguments:
iCG predict
Description
Fig. 5: Example of a base prediction result created by iCG predict. Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction.
Usage & Output
Help pages
usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir
Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads as input.
positional arguments:
optional arguments: