(added max) |
|||
(7 intermediate revisions by 2 users not shown) | |||
Line 14: | Line 14: | ||
− | + | <div class="contentbox"> | |
+ | <div class="content"> | ||
+ | <h2>Short Summary</h2> | ||
+ | <div class="article"> | ||
+ | Our software suite consists of two connected modules for the analysis of unnatural base pairs in a specified target sequence: M.A.X. and iCG. iCG processes Oxford Nanopore sequencing data to identify unnatural base pairs in a given target sequence. As an orthogonal method, our Mutation Analysis Xplorer (M.A.X.) uses a customized database to find a suitable set of restriction enzymes for our enzyme-based detection system. In addition, M.A.X. is a low-cost alternative for the analysis of mutations at a specific position, allowing all iGEM teams to conduct research on unnatural base pairs. Together, both modules form a software suite for research on unnatural base pairs, for example the analysis of mutation frequencies and of the fidelity of semi-synthetic DNA replication. We postulate that our suite is also applicable to the study of DNA modifications and epigenetics. | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
Line 314: | Line 321: | ||
(ROI). In combination with the blur radius, the ROI | (ROI). In combination with the blur radius, the ROI | ||
defines the sequence for the regular expression search | defines the sequence for the regular expression search | ||
− | used for deciding | + | used for deciding whether a read is covering the |
position of interest. Default: 15</div> | position of interest. Default: 15</div> | ||
</div> | </div> | ||
Line 322: | Line 329: | ||
<div class="third double">The number of bases upstream and downstream of the | <div class="third double">The number of bases upstream and downstream of the | ||
position of interest which are excluded from the | position of interest which are excluded from the | ||
− | regular expression search used for deciding | + | regular expression search used for deciding whether a |
read is covering the position of interest. This is | read is covering the position of interest. This is | ||
necessary due to the possible influence that an | necessary due to the possible influence that an | ||
Line 712: | Line 719: | ||
<h3>Mutation Analysis Xplorer</h3> | <h3>Mutation Analysis Xplorer</h3> | ||
<div class="article"> | <div class="article"> | ||
− | The Mutation Analysis Xplorer, in short M.A.X., is a Python tool | + | The Mutation Analysis Xplorer, M.A.X. for short, is a Python tool and an easily accessible alternative to iCG. This tool helps us determine whether the unnatural base pair is still present in the plasmid. Due to several mechanisms, the unnatural base pair can be replaced with natural bases. However, the results of these possible point mutations are predictable. This leads to our software tool M.A.X. and a simple PCR experiment to prove the existence of our unnatural base pair. The tool finds similar recognition sites of restriction enzymes and uses these to create an extended sequence with one mutated base. For each of the bases A, C, G, and T at this position, a different restriction enzyme can cut. But if an unnatural base like isoG or isoCm occupies this position, no restriction enzyme can recognize the sequence and the plasmid stays uncut. Since this system does not require special hardware, it is a cheap alternative to iCG. |
</div> | </div> | ||
</div> | </div> | ||
Line 757: | Line 764: | ||
</div> | </div> | ||
<div class="article"> | <div class="article"> | ||
− | + | The graphical user interface opens after the given database has been loaded and parsed (Figure 7). The top text field accepts different types of input, depending on which of the radio buttons to its right is selected. For a search in enzyme names, the input consists of short strings. The database is checked for enzyme groups in which at least one enzyme name contains the given string as a substring. Multiple strings can be entered separated by spaces; in that case, only enzyme groups in which every given string appears as a substring of some enzyme name are returned. The output is listed in the table in the bottom half of the graphical user interface (Figure 6). It consists of the enzyme names, the recognition sequences containing the substitution character “#” and aligned to the extended sequence, the original base at the position of the substitution character, and the original recognition sites with their cleavage sites displayed. Individual enzyme groups are separated by lines of minus signs. To start the search, press the return key. To display only the individual results that triggered the hit rather than the whole enzyme group, uncheck the checkbox “show whole group”. The checkbox “Search exact sequence” suppresses hits that have no mismatch score merely because of a high number of Ns in the sequence, so that only hits where exact bases match are returned. | |
− | + | </div> | |
− | + | <div class="figure large"> | |
− | + | <img class="figure image" src="https://static.igem.org/mediawiki/2017/0/0d/T--Bielefeld-CeBiTec--software_enzymegroup.png"> | |
− | + | <p class="figure subtitle"><b>Figure 6: M.A.X. Output</b><br>An example of an output from a search on the database. The output consists of enzyme groups.</p> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
</div> | </div> | ||
<div class="article"> | <div class="article"> | ||
Line 774: | Line 775: | ||
The last text field accepts longer sequences. The whole database is searched for recognition sequences that occur as substrings of this sequence. This is useful if you already have a sequence of interest and want to find enzyme groups that are able to cut within or next to it. | The last text field accepts longer sequences. The whole database is searched for recognition sequences that occur as substrings of this sequence. This is useful if you already have a sequence of interest and want to find enzyme groups that are able to cut within or next to it. | ||
</div> | </div> | ||
− | + | <div class="figure large"> | |
− | + | <img class="figure image" src="https://static.igem.org/mediawiki/2017/7/7b/T--Bielefeld-CeBiTec--software_gui.png"> | |
− | + | <p class="figure subtitle"><b>Figure 7: Graphical User Interface of M.A.X.</b></p> | |
− | + | </div> | |
− | + | <div class="figure large"> | |
− | + | <img class="figure image" src="https://static.igem.org/mediawiki/2017/2/29/T--Bielefeld-CeBiTec--software_searchACGT.png"> | |
− | + | <p class="figure subtitle"><b>Figure 8: Enzyme group cutting for any base</b><br> Shows an entry of the extended sequence and the | |
− | + | four restriction enzymes used in our wet lab proof of the M.A.X. system. If the unnatural base remains at the position of the character | |
− | + | "#", then none of the four enzymes can cut. Otherwise, for any mutation that leads to a substitution with A, C, G, or T, one of the | |
− | + | four displayed enzymes cuts the DNA strand.</p> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
</div> | </div> | ||
</div> | </div> | ||
Line 808: | Line 803: | ||
<h3 id="Downloads">Downloads</h3> | <h3 id="Downloads">Downloads</h3> | ||
− | + | <div class="article"> | |
− | < | + | The work on this part of the project is being continued. Please visit the following GitHub repositories for up-to-date source code and documentation for both software tools, iCG and M.A.X.: |
− | + | </div> | |
− | + | <br> | |
− | + | <div class="article"> | |
− | + | <a href="https://github.com/MarkusHaak/iCG" target="_blank">https://github.com/MarkusHaak/iCG</a><br> | |
− | + | <a href="https://github.com/MaximilianEdich/M.A.X.">https://github.com/MaximilianEdich/M.A.X.</a> | |
− | + | </div> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
</div> | </div> |
Latest revision as of 22:36, 15 December 2017
Short Summary
Nanopore Sequencing of DNA Containing Unnatural Bases
iCG
Background
Overview
Fig. 1: Overview of the iCG functionality.
Dependencies & Installation
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
> install.packages("cowplot")
> install.packages("MASS")
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]
iCG filter
Description
Fig. 2: Terminology and regular expression used for pattern matching. In order to identify reads containing the region of interest, a regular expression is used for pattern matching that omits the immediate sequence context of the position of interest, called the "blur" region. The displayed regular expression results from the default parameter settings for the exemplary reference sequence.
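As a rough illustration of how such a pattern could be assembled, the following Python sketch derives the two flanking sequences and the blurred gap from a radius, a blur and a blur deviation. The function names, the seven invented bases in the middle of the example reference, and the reading of `{e<=2}` as the allowed context deviation are assumptions made for this sketch, not the actual iCG code. The `{e<=N}` error syntax belongs to the third-party `regex` module, so the sketch only builds that pattern as a string and uses an exact-match variant with the standard `re` module for the demonstration.

```python
import re


def split_context(reference, poi, radius=15, blur=3):
    """Return the upstream and downstream flanks outside the blur region."""
    upstream = reference[poi - radius:poi - blur]
    downstream = reference[poi + blur + 1:poi + radius + 1]
    return upstream, downstream


def build_context_pattern(reference, poi, radius=15, blur=3,
                          blur_deviation=1, context_deviation=2):
    """Build a fuzzy pattern string (syntax of the third-party `regex` module)."""
    upstream, downstream = split_context(reference, poi, radius, blur)
    gap = 2 * blur + 1  # blurred bases, including the position of interest
    return "({up}){{e<={e}}}.{{{lo},{hi}}}?({down}){{e<={e}}}".format(
        up=upstream, down=downstream, e=context_deviation,
        lo=gap - blur_deviation, hi=gap + blur_deviation)


def build_exact_pattern(reference, poi, radius=15, blur=3, blur_deviation=1):
    """Simplified variant without mismatch tolerance, usable with `re`."""
    upstream, downstream = split_context(reference, poi, radius, blur)
    gap = 2 * blur + 1
    return re.compile("({0}).{{{1},{2}}}?({3})".format(
        upstream, gap - blur_deviation, gap + blur_deviation, downstream))


# Hypothetical reference: the 12-base flanks are those of the sense pattern
# in the usage example, the 7 bases in the middle are invented.
reference = "TCTAGTACCGAA" + "GATCAGT" + "CCTATCATCGCT"
pattern = build_context_pattern(reference, poi=15)
exact = build_exact_pattern(reference, poi=15)

# A made-up read with one inserted base inside the blur region.
read = "AAAA" + "TCTAGTACCGAA" + "GATCCAGT" + "CCTATCATCGCT" + "GG"
print(pattern)
print(exact.search(read) is not None)
```

With a radius of 15 and a blur of 3, each flank is 12 bases long and the blurred gap spans 7 bases, which a blur deviation of 1 widens to the range 6 to 8.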
Usage & Output
67402 removed (40.81 % of remaining)
97739 remaining (59.19 % of total)
filtering for minimal length of 1887.75
removed 91117 (93.22 % of remaining)
6622 remaining (4.01 % of total)
filtering for maximal length of 2517
removed 1330 (20.08 % of remaining)
5292 remaining (3.20 % of total)
adding fastq information
filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}
deviations in sequence length of remaining sequences (* selected):
sense antisense
-5: 6 3
-4: 9 9
-3: 34 10
*-2: 71 32
*-1: 135 94
* 0: 875 705
* 1: 84 47
* 2: 44 24
3: 14 12
4: 1 1
5: 4 0
------ ------
* 1209 902
removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
2111 remaining (1.28 % of total)
filtering for context quality higher than 14.0 (* selected)
sense antisense
7-8: 0 1
8-9: 3 1
9-10: 12 5
10-11: 18 6
11-12: 38 24
12-13: 70 41
13-14: 108 82
*14-15: 142 101
*15-16: 193 128
*16-17: 155 109
*17-18: 167 116
*18-19: 127 95
*19-20: 84 78
*20-21: 56 54
*21-22: 25 29
*22-23: 7 19
*23-24: 3 10
*24-25: 1 1
*25-26: 0 2
------ ------
* 960 742
removed 409 (19.37 % of remaining)
1702 remaining (1.03 % of total)
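The percentages in this log can be checked from the read counts alone. The sketch below is a hypothetical helper, not part of iCG, and it assumes that the first filter step operated on all reads, so that the total is 67402 + 97739 = 165141.

```python
def summarize_step(total, before, removed):
    """Format the two summary lines of one filter step.

    total   -- number of reads before any filtering
    before  -- number of reads entering this filter step
    removed -- number of reads discarded by this filter step
    """
    remaining = before - removed
    removed_line = "removed %d (%.2f %% of remaining)" % (
        removed, 100.0 * removed / before)
    remaining_line = "%d remaining (%.2f %% of total)" % (
        remaining, 100.0 * remaining / total)
    return removed_line, remaining_line


TOTAL_READS = 67402 + 97739  # assumption: the first step saw every read

# Minimal-length step from the log above.
length_step = summarize_step(TOTAL_READS, before=97739, removed=91117)
for line in length_step:
    print(line)
```

The "of remaining" percentage refers to the reads that entered the step, while the "of total" percentage always refers to the full input.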
Help pages
[--dst_dir_name DST_DIR_NAME]
[--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
[--blur_deviation BLUR_DEVIATION]
[--context_deviation CONTEXT_DEVIATION] [--no_indels]
[--max_length_deviation MAX_LENGTH_DEVIATION]
[--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
[--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
[--normalization_type {median,pA,pA_raw,none}]
[--processes PROCESSES] [--plot_poi]
[--disp_bases DISP_BASES] [--barcode BARCODE]
[--barcode_deviation BARCODE_DEVIATION]
[basecalled_dirs [basecalled_dirs ...]]
This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquiggled by nanoraw and copied to a new folder inside the basecalling directory.
positional arguments:
optional arguments:
iCG model
Description
Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2. The plot on the left-hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right-hand side lists the average characteristics of this group regarding these sequence positions.
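The per-read calculation described in this caption can be sketched as follows. The sketch assumes that resquiggling has assigned each base of a read a contiguous stretch of the normalized signal, given here as start and end indices; the function name, the data layout and the toy values are invented for illustration and do not reproduce the iCG implementation.

```python
def position_means(signal, segments, poi, flank=2):
    """Mean normalized signal for positions poi-flank .. poi+flank of one read.

    signal   -- normalized signal values of the read (list of floats)
    segments -- (start, end) index pairs into `signal`, one pair per base
    poi      -- index of the position of interest within `segments`
    """
    means = []
    for position in range(poi - flank, poi + flank + 1):
        start, end = segments[position]
        values = signal[start:end]
        means.append(sum(values) / float(len(values)))
    return means


# Made-up toy read: 5 bases, each covered by a short stretch of signal.
signal = [0.0, 0.2, 0.4, 1.0, 1.2, -0.5, -0.7, 0.3, 0.1]
segments = [(0, 2), (2, 3), (3, 5), (5, 7), (7, 9)]
features = position_means(signal, segments, poi=2)
print(features)
```

Each read then contributes one vector of five values, which is the kind of input a linear discriminant analysis can work with.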
Fig. 4: Example output of iCG model. Group means, coefficients of linear discriminants, proportion of trace and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The proportion of trace indicates how much each linear discriminant contributes to the discrimination of the different groups. In this particular example, the first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.
Fig. 5: Effect of removing most deviating reads on the normalized signal traces. With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signal traces become more and more focused on the mean signal of the group being analyzed. Contamination from previous sequencing libraries is effectively removed.
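A minimal sketch of this kind of outlier removal is shown below. It assumes one signal value per read at the position of interest and a fraction of reads to discard; the function name and the toy values are hypothetical, and the actual selection logic of iCG model may differ.

```python
def remove_deviating_reads(poi_signals, fraction):
    """Drop the given fraction of reads that deviate most from the median.

    poi_signals -- one mean normalized signal per read at the position of interest
    fraction    -- portion of reads to remove, between 0 and 1
    """
    ordered = sorted(poi_signals)
    count = len(ordered)
    middle = count // 2
    if count % 2:
        median = ordered[middle]
    else:
        median = (ordered[middle - 1] + ordered[middle]) / 2.0
    n_remove = int(count * fraction)
    # Rank read indices by absolute deviation from the median, smallest first.
    ranked = sorted(range(count), key=lambda i: abs(poi_signals[i] - median))
    keep = set(ranked[:count - n_remove])
    # Preserve the original read order in the result.
    return [poi_signals[i] for i in range(count) if i in keep]


# Made-up signals: eight consistent reads and two obvious outliers.
signals = [0.1, 0.2, 0.15, 2.5, 0.12, -1.8, 0.18, 0.22, 0.09, 0.16]
kept = remove_deviating_reads(signals, fraction=0.2)
print(kept)
```

Using the median rather than the mean as the reference keeps the two outliers from shifting the center they are measured against.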
Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads. With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly
Usage & Output
Help pages
[-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
[-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
[--print_group_characteristics] [--no_data_plots]
[--disp_bases DISP_BASES]
[-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
[--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
[--UBantisense UBANTISENSE]
dst_dir
This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. Therefore, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".
positional arguments:
optional arguments:
iCG predict
Description
Fig. 5: Example of a base prediction result created by iCG predict. Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction.
Usage & Output
Help pages
Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads as input.
positional arguments:
optional arguments:
Mutation Analysis Xplorer
Mutations
M.A.X. and RestrictionDB
Usage
Figure 6: M.A.X. Output
An example of an output from a search on the database. The output consists of enzyme groups.
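The enzyme-name search that produces such an output can be sketched in Python as follows. The groupings below are invented placeholders and not entries of RestrictionDB, the function is not taken from the M.A.X. source code, and the case-sensitive comparison is an assumption.

```python
def search_enzyme_groups(groups, query):
    """Return all groups in which every search term occurs in some enzyme name.

    groups -- list of enzyme groups, each a list of enzyme names
    query  -- search terms separated by spaces
    """
    terms = query.split()
    return [group for group in groups
            if all(any(term in name for name in group) for term in terms)]


# Hypothetical enzyme groups for illustration only.
GROUPS = [
    ["EcoRI", "EcoRV"],
    ["BamHI", "BglII"],
    ["HindIII", "EcoRII"],
]

print(search_enzyme_groups(GROUPS, "Eco"))
print(search_enzyme_groups(GROUPS, "Eco Hind"))
```

A single term returns every group with at least one matching name, while several terms narrow the result to groups that satisfy all of them.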
Figure 7: Graphical User Interface of M.A.X.
Figure 8: Enzyme group cutting for any base
Shows an entry of the extended sequence and the
four restriction enzymes used in our wet lab proof of the M.A.X. system. If the unnatural base remains at the position of the character
"#", then none of the four enzymes can cut. Otherwise, for any mutation that leads to a substitution with A, C, G, or T, one of the
four displayed enzymes cuts the DNA strand.
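The detection principle described in this caption can be sketched in a few lines of Python. The extended sequence, the enzyme names and the recognition sites below are invented placeholders rather than the enzymes used in the wet lab, and the function is a hypothetical illustration, not part of M.A.X. itself. The letter "X" stands for any unnatural base.

```python
# Hypothetical extended sequence with the substitution character "#".
EXTENDED_SEQUENCE = "TTGC#GATCA"

# Hypothetical enzyme group: each site only appears for one natural base.
ENZYME_GROUP = {
    "EnzA": "GCAGAT",
    "EnzC": "GCCGAT",
    "EnzG": "CGGATC",
    "EnzT": "TGCTGA",
}


def cutting_enzymes(base, extended=EXTENDED_SEQUENCE, group=ENZYME_GROUP):
    """Return the enzymes whose recognition site occurs after substitution."""
    sequence = extended.replace("#", base)
    return sorted(name for name, site in group.items() if site in sequence)


for base in "ACGTX":
    print(base, cutting_enzymes(base))
```

In this toy setup every natural base at the "#" position is recognized by exactly one enzyme, whereas the unnatural base leaves the sequence uncut by all four.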