Revision as of 00:31, 1 November 2017
Software
Nanopore Sequencing of DNA Containing Unnatural Bases
iCG
Background
Overview
Fig. 1: Overview of the iCG functionality.
Dependencies & Installation
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
> install.packages("cowplot")
> install.packages("MASS")
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]
> python2 iCG.py model iCG_test/results -UBP iCG_test/UBP/filtered -AT iCG_test/A/filtered -GC iCG_test/G/filtered -TA iCG_test/T/filtered -CG iCG_test/C/filtered --UBPsense iCme --UBPantisense iG
> python2 iCG.py predict iCG_test/results/antisense_0.3.model iCG_test/results/antisense_test_reads --dst_dir results/prediction
iCG filter
Description
Fig. 2: Terminology and regular expression used for pattern matching. To identify reads containing the region of interest, a regular expression is used that omits the immediate sequence context of the position of interest, called the "blur" region. The displayed regular expression results from the default parameter settings for the exemplary reference sequence.
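The construction of such a pattern can be sketched in a few lines of Python. This is only an illustration, not the code used by iCG: the function and parameter names (`build_patterns`, `max_errors`) are made up here, the 7 bases of the blur region are unknown and replaced by `N` placeholders, and the formulas (flank length = radius minus blur, gap length = 2 × blur + 1 ± blur_deviation) are inferred from the example patterns printed in the output below. The `{e<=2}` suffix is the fuzzy-matching syntax of the third-party `regex` module; the sketch only builds the pattern strings and does not run them.

```python
# Sketch: build sense/antisense search patterns around a position of interest (POI).
# All names are hypothetical; the formulas are inferred from the example output.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_patterns(reference, poi, radius=15, blur=3, blur_deviation=1, max_errors=2):
    """Return (sense, antisense) fuzzy regex patterns for the 0-based position poi."""
    if poi - radius < 0 or poi + radius + 1 > len(reference):
        raise ValueError("reference too short for the requested radius")
    # Flanks: the context within `radius` bases, minus the blurred bases next to the POI.
    upstream = reference[poi - radius:poi - blur]
    downstream = reference[poi + blur + 1:poi + radius + 1]
    # Gap: blurred bases on both sides plus the POI itself, with some length tolerance.
    gap = 2 * blur + 1
    gap_expr = ".{%d,%d}?" % (gap - blur_deviation, gap + blur_deviation)
    fuzzy = "(%s){e<=%d}"
    sense = fuzzy % (upstream, max_errors) + gap_expr + fuzzy % (downstream, max_errors)
    antisense = (fuzzy % (reverse_complement(downstream), max_errors)
                 + gap_expr
                 + fuzzy % (reverse_complement(upstream), max_errors))
    return sense, antisense


# Placeholder reference: the two flanks from the example, blur region unknown (N).
reference = "TCTAGTACCGAA" + "NNNNNNN" + "CCTATCATCGCT"
sense, antisense = build_patterns(reference, poi=15)
print("sense pattern:    ", sense)
print("antisense pattern:", antisense)
```

With a radius of 15, a blur of 3 and a blur_deviation of 1, the sketch yields the same two patterns as the example output.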
Usage & Output
filtering for minimal mean quality of 10.0
67402 removed (40.81 % of remaining)
97739 remaining (59.19 % of total)
filtering for minimal length of 1887.75
removed 91117 (93.22 % of remaining)
6622 remaining (4.01 % of total)
filtering for maximal length of 2517
removed 1330 (20.08 % of remaining)
5292 remaining (3.20 % of total)
adding fastq information
filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}
deviations in sequence length of remaining sequences (* selected):
sense antisense
-5: 6 3
-4: 9 9
-3: 34 10
*-2: 71 32
*-1: 135 94
* 0: 875 705
* 1: 84 47
* 2: 44 24
3: 14 12
4: 1 1
5: 4 0
------ ------
* 1209 902
removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
2111 remaining (1.28 % of total)
filtering for context quality higher than 14.0 (* selected)
sense antisense
7-8: 0 1
8-9: 3 1
9-10: 12 5
10-11: 18 6
11-12: 38 24
12-13: 70 41
13-14: 108 82
*14-15: 142 101
*15-16: 193 128
*16-17: 155 109
*17-18: 167 116
*18-19: 127 95
*19-20: 84 78
*20-21: 56 54
*21-22: 25 29
*22-23: 7 19
*23-24: 3 10
*24-25: 1 1
*25-26: 0 2
------ ------
* 960 742
removed 409 (19.37 % of remaining)
1702 remaining (1.03 % of total)
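The percentages in this output can be checked with a short calculation. The sketch below is not part of iCG and its function name is hypothetical. It assumes that "% of remaining" refers to the reads left before the respective step and "% of total" to the initial number of reads (67402 + 97739 = 165141), which is consistent with all numbers shown.

```python
# Sketch: recompute the percentages of the filter cascade from the read counts.
# filter_report is a hypothetical helper, not part of iCG.


def filter_report(total, steps):
    """Return (name, pct_removed, remaining, pct_of_total) for each filter step."""
    remaining = total
    report = []
    for name, removed in steps:
        pct_removed = "%.2f" % (100.0 * removed / remaining)  # relative to reads before this step
        remaining -= removed
        pct_total = "%.2f" % (100.0 * remaining / total)      # relative to the initial read count
        report.append((name, pct_removed, remaining, pct_total))
    return report


TOTAL = 67402 + 97739  # reads before the first filter
STEPS = [
    ("minimal mean quality", 67402),
    ("minimal length", 91117),
    ("maximal length", 1330),
    ("sequence context", 3181),
    ("context quality", 409),
]

report = filter_report(TOTAL, STEPS)
for name, pct_removed, remaining, pct_total in report:
    print("%-22s removed %s %%, %d remaining (%s %% of total)"
          % (name, pct_removed, remaining, pct_total))
```

The counts marked with `*` add up in the same way: 1209 + 902 = 2111 reads pass the sequence context filter and 960 + 742 = 1702 reads pass the context quality filter.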
Help pages
usage: filter.py [-h] --bwa_mem_exe BWA_MEM_EXE -r REFERENCE
[--dst_dir_name DST_DIR_NAME]
[--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
[--blur_deviation BLUR_DEVIATION]
[--context_deviation CONTEXT_DEVIATION] [--no_indels]
[--max_length_deviation MAX_LENGTH_DEVIATION]
[--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
[--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
[--normalization_type {median,pA,pA_raw,none}]
[--processes PROCESSES] [--plot_poi]
[--disp_bases DISP_BASES] [--barcode BARCODE]
[--barcode_deviation BARCODE_DEVIATION]
[basecalled_dirs [basecalled_dirs ...]]
This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquiggled by nanoraw and copied to a new folder inside the basecalling directory.
positional arguments:
optional arguments:
iCG model
Description
Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2. The plot on the left-hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right-hand side lists the average characteristics of this group for these sequence positions.
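The per-read calculation can be illustrated as follows. This is a minimal sketch with made-up data, not the iCG implementation: it assumes each read is available as a mapping from corrected sequence position to its normalized signal values, and the function names are hypothetical.

```python
# Sketch: mean normalized signal per read at positions -2 to +2 around the POI.
# The data layout (dict: sequence position -> list of signal values) is an assumption.

OFFSETS = range(-2, 3)


def mean_signals(read, poi):
    """Return {offset: mean normalized signal} for one read."""
    return {off: sum(read[poi + off]) / len(read[poi + off]) for off in OFFSETS}


def group_means(reads, poi):
    """Average the per-read means over all reads of one group."""
    per_read = [mean_signals(read, poi) for read in reads]
    return {off: sum(m[off] for m in per_read) / len(per_read) for off in OFFSETS}


# Two toy reads covering sequence positions 8 to 12, with the POI at position 10.
reads = [
    {8: [0.5, 1.5], 9: [-1.0], 10: [2.0, 2.0, 2.0], 11: [0.0, 1.0], 12: [-0.5]},
    {8: [1.0], 9: [-2.0, 0.0], 10: [1.0, 3.0], 11: [0.5], 12: [-1.5, -0.5]},
]

first = mean_signals(reads[0], poi=10)
means = group_means(reads, poi=10)
for off in OFFSETS:
    print("%+d: %.2f" % (off, means[off]))
```

Each read contributes one value per position, so a group of reads yields a table with five columns, which is the kind of input a linear discriminant analysis can work with.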
Fig. 4: Example output of iCG model. Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The proportion of trace indicates how much each linear discriminant contributes to the discrimination of the different groups. In this particular example, the first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.
Fig. 5: Effect of removing the most deviating reads on the normalized signal traces. With an increasing portion of reads removed due to their deviation from the median signal at the position of interest, the normalized signal traces become more and more focused on the mean signal of the group being analyzed. Contamination from previous sequencing libraries is effectively removed.
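The idea of removing the most deviating reads can be sketched as follows. This is an illustration under assumptions, not the criterion implemented in iCG: it takes one mean normalized signal value per read at the position of interest and drops the given fraction of reads with the largest absolute deviation from the group median. The function name and the toy values are made up.

```python
# Sketch: drop the fraction of reads that deviate most from the group median.
# remove_deviating is a hypothetical helper; iCG's actual criterion may differ.
from statistics import median


def remove_deviating(signals, fraction):
    """Keep the reads closest to the median signal, preserving their order."""
    center = median(signals)
    n_remove = int(len(signals) * fraction)
    # Indices sorted by decreasing absolute deviation from the median.
    by_deviation = sorted(range(len(signals)),
                          key=lambda i: abs(signals[i] - center),
                          reverse=True)
    dropped = set(by_deviation[:n_remove])
    return [s for i, s in enumerate(signals) if i not in dropped]


# Toy group: eight consistent reads and two outliers (e.g. contaminating reads).
signals = [0.1, 0.0, -0.1, 0.2, 2.5, -0.05, 0.05, -3.0, 0.15, 0.0]
kept = remove_deviating(signals, fraction=0.2)
print(kept)
```

Using the median instead of the mean as the center keeps the reference point stable even when the outliers themselves are far from the bulk of the reads.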
Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads. With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly
Usage & Output
Help pages
usage: model.py [-h] -UBP UBP_FLTRD_FAST5_DIR [-AT AT_FLTRD_FAST5_DIR]
[-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
[-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
[--print_group_characteristics] [--no_data_plots]
[--disp_bases DISP_BASES]
[-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
[--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
[--UBantisense UBANTISENSE]
dst_dir
This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. To this end, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".
positional arguments:
optional arguments:
iCG predict
Description
Fig. 5: Example of a base prediction result created by iCG predict. Text-based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model that needs only one linear discriminant for base prediction.
Usage & Output
Help pages
usage: predict.py [-h] [--dst_dir DST_DIR] model in_dir
Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads given as input.
positional arguments:
optional arguments: