MarkusHaak (Talk | contribs) |
(added max) |
||
Line 703: | Line 703: | ||
</div> | </div> | ||
</font> | </font> | ||
− | |||
− | |||
</div> | </div> | ||
<div class="bevel bl"></div> | <div class="bevel bl"></div> | ||
Line 710: | Line 708: | ||
+ | <div class="contentbox"> | ||
+ | <div class="content"> | ||
+ | <h3>Mutation Analysis Xplorer</h3> | ||
+ | <div class="article"> | ||
+ | The Mutation Analysis Xplorer, M.A.X. for short, is a Python tool written by our team member Max. It helps us determine whether the unnatural base pair is still present in the plasmid. Several mechanisms can replace the unnatural base pair with natural bases, but the results of these possible point mutations are predictable. This is the basis for our software tool M.A.X. and a simple PCR experiment to prove the existence of our unnatural base pair. The tool finds similar recognition sites of restriction enzymes and uses these to create an extended sequence with one mutated base. For any base A, C, G, or T at this position, one of the restriction enzymes can cut. If an unnatural base such as isoG or isoCm occupies this position, however, no restriction enzyme can recognize the sequence and the plasmid stays uncut. | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="content"> | ||
+ | <h4>Mutations</h4> | ||
+ | <div class="article"> | ||
+ | The unnatural base pairs are not stable within the DNA. Our conservation system has to cope with several mechanisms that remove or replace our new bases. Hydrolysis converts isoCm to a thymidine, which leads to a replacement of isoG with adenine. Another cause of mutation is tautomerization, which converts the isoG keto tautomer to the isoG phenolic tautomer. This form of isoG can no longer pair with isoCm and pairs with thymidine instead. Like hydrolysis, this leads in the next step to a replacement of the isoG-isoCm base pair with a thymidine-adenine base pair. | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="content"> | ||
+ | <span class="anchor-jump" id="MAX"></span> | ||
+ | <div class="section"></div> | ||
+ | <h4>M.A.X. and RestrictionDB</h4> | ||
+ | <div class="article"> | ||
+ | The Mutation Analysis Xplorer is essentially a search on a database of restriction enzymes created for this purpose. To improve the runtime, reduce the required computing capacity, and simplify the software itself, we wrote the Python script “RestrictionDB”. RestrictionDB takes a text file from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383893/">REBASE</a> containing information about restriction enzymes and their recognition sequences. The parser, which reads the REBASE data and translates them into objects it can work with, is written in an abstract way. Adapting it to files from other providers therefore takes little effort, and with a few updates the script could become a universal parser. After parsing the data, restriction enzymes with very similar recognition sequences are grouped together. From the slightly more than 400 restriction enzymes used, more than 230,000 enzyme groups are created. Each group has an extended sequence. These sequences are determined by combining the recognition sequences of the restriction enzymes so that all recognition sequences in a group are subsequences of the group's extended sequence. Furthermore, all recognition sequences and the extended sequence itself carry a ‘#’ at the same position instead of the original base. Replacing the ‘#’ with an A, C, G, or T can yield one of the recognition sequences of the enzymes in this group, which makes the DNA double strand cuttable in the lab. An example of an enzyme group is shown in Figure 6. | ||
+ | </div> | ||
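The grouping idea described above can be illustrated with a minimal Python sketch. It is not the RestrictionDB implementation: it only groups sites of equal length that become identical once a single position is replaced by ‘#’, and it ignores the alignment of shorter sites to a longer extended sequence. The function name `group_by_masked_site` and the enzyme names and sites in `toy_sites` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def group_by_masked_site(sites):
    """Group enzymes whose sites are identical after masking one position.

    sites: dict mapping enzyme name -> recognition sequence (A/C/G/T only).
    Returns a dict: masked site -> {original base: [enzyme names]}.
    Simplification: only sites of equal length can end up in one group.
    """
    groups = defaultdict(dict)
    for name, site in sites.items():
        for i, base in enumerate(site):
            # Replace position i with the substitution character '#'.
            masked = site[:i] + "#" + site[i + 1:]
            groups[masked].setdefault(base, []).append(name)
    # Keep only masks that at least two different original bases map to.
    return {m: bases for m, bases in groups.items() if len(bases) > 1}


# Hypothetical enzymes and sites, not taken from REBASE.
toy_sites = {
    "EnzA": "GAATCC",
    "EnzC": "GACTCC",
    "EnzG": "GAGTCC",
    "EnzT": "GATTCC",
    "EnzX": "CTGCAG",
}
groups = group_by_masked_site(toy_sites)
for masked, bases in groups.items():
    print(masked, {base: names for base, names in sorted(bases.items())})
```

In this toy input, the four sites that differ only at the third position fall into one group, while `EnzX` shares no masked site with any other enzyme and is dropped.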
+ | |||
+ | <div class="article"> | ||
+ | RestrictionDB only creates the database and does not take user preferences into account, since its aim is to create a universal database of restriction enzymes grouped by the similarity of their recognition sequences. The next step is to load this database into M.A.X. Because the output comprises more than 230,000 enzyme groups, the calculation takes a long time. However, this job has to be done only once each time you want to use new sources for your database. To skip this step, we provide a ready-to-use database. When starting M.A.X., you must give the path of the desired database. You then see the graphical user interface of M.A.X. (Figure 7). By entering text into the text field “Search on database” and pressing Enter, you can start a search on the selected data of the database. The output is a list of enzyme groups, which can be used to plan a wet lab proof of unnatural bases within a plasmid. | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class="contentbox"> | ||
+ | <div class="content"> | ||
+ | <h4>Usage</h4> | ||
+ | <div class="article"> | ||
+ | RestrictionDB takes one file as its source, which currently has to be an enzyme information text file from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383893/">REBASE</a>. However, the code is designed so that it can easily be adapted to accept multiple files at once, including files from additional providers. The default call of RestrictionDB is | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | python restrictionDB.py <input file> | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | M.A.X. takes a single data source as its argument, which must be the output of RestrictionDB. The default call of M.A.X. is | ||
+ | </div> | ||
+ | <div class="codeblock"> | ||
+ | python MAX.py <restrictionDB output> | ||
+ | </div> | ||
+ | <div class="article"> | ||
+ | <div class="contentline"> | ||
+ | <div class="half left"> | ||
+ | The graphical user interface opens after the given database has been loaded and parsed (Figure 7). The top text field accepts different kinds of input depending on which of the radio buttons to its right is selected. For a search in enzyme names, the input consists of short strings. The database is checked for enzyme groups in which at least one enzyme name contains the given string as a substring. Multiple strings separated by spaces can be entered; in that case only enzyme groups in which every given string appears as a substring of some enzyme name are returned. The output is then listed in the table in the bottom half of the graphical user interface (Figure 6). It consists of the enzyme names, the recognition sequences containing the substitution character “#” and aligned to the extended sequence, the original base at the position of the substitution character, and the original recognition sites with their cleavage sites displayed. Individual enzyme groups are separated by lines of minus signs. To start the search, press the Return key. If you only want to display the individual results that triggered the hit rather than the whole enzyme group, uncheck the checkbox “show whole group”. The checkbox “Search exact sequence” suppresses hits that have no mismatch score only because of a high number of Ns in the sequence, so only hits in which exact bases match are returned. | ||
+ | </div> | ||
+ | <div class="half right"> | ||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/0/0d/T--Bielefeld-CeBiTec--software_enzymegroup.png"> | ||
+ | <p class="figure subtitle"><b>Figure 6: M.A.X. Output</b><br>An example of an output from a search on the database. The output consists of enzyme groups.</p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
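The enzyme name search described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (every search string must occur as a substring of some enzyme name in the group), not the M.A.X. code. The function `search_by_name`, the dictionary layout of a group, and all enzyme names are hypothetical, and the sketch assumes a case-sensitive comparison, which the text does not specify.

```python
def search_by_name(groups, query):
    """Return the groups in which every search term occurs in some enzyme name.

    groups: list of dicts with an "enzymes" key holding a list of names.
    query:  search strings separated by spaces, as typed into the text field.
    Assumption: matching is case-sensitive.
    """
    terms = query.split()
    return [
        group
        for group in groups
        if all(any(term in name for name in group["enzymes"]) for term in terms)
    ]


# Hypothetical enzyme groups for illustration.
toy_groups = [
    {"extended": "GA#TCC", "enzymes": ["EnzA1", "EnzC2", "EnzG3"]},
    {"extended": "CT#CAG", "enzymes": ["EnzA1", "FooI"]},
]

print([g["extended"] for g in search_by_name(toy_groups, "A1")])
print([g["extended"] for g in search_by_name(toy_groups, "A1 Foo")])
```

A single term selects every group containing a matching name, while each additional term narrows the result, because all terms have to be found within the same group.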
+ | <div class="article"> | ||
+ | Besides enzyme names, you can search for sequences. Enzyme groups are then only returned if the given sequence or its reverse complement is a substring of at least one enzyme's recognition sequence or of the extended sequence. For a search on other data connected to enzymes, select the radio button “Other data”. | ||
+ | For all searches, you can limit the maximum number of results to reduce the runtime and stop the search at a given point. To filter for specific original bases at the position of the substitution character, enter the desired letters into the text field “Preferred Substitutions”. Multiple letters are possible; the usual value is “ACGT” when this software is used to analyze the presence of unnatural bases in a DNA molecule. A search with this setting returns only enzyme groups in which, for each natural base at the substitution position, at least one enzyme is able to cut the DNA double strand. One example is shown in Figure 8, which also shows the four enzymes used in <a href="https://2017.igem.org/Team:Bielefeld-CeBiTec/Results/unnatural_base_pair/development_of_new_methods#MAX">our M.A.X. experiment</a>. | ||
+ | The last text field is intended for longer sequences. The whole database is checked for recognition sequences that occur as substrings of this sequence. This is useful if you already have a sequence of interest and want to find enzyme groups that are able to cut within or next to it. | ||
+ | </div> | ||
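The three sequence-based filters described above (sequence search with reverse complement, “Preferred Substitutions”, and the search within a longer sequence) can be sketched as follows. This is a simplified illustration under stated assumptions, not the M.A.X. implementation: a group is modelled as a dictionary that maps each original base to one recognition sequence, sequences contain only A, C, G, T, or N, and all function names and the toy group are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a sequence made of A, C, G, T, or N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def matches_sequence(group, query):
    """True if the query or its reverse complement is a substring of a
    recognition sequence or of the extended sequence of the group."""
    query = query.upper()
    candidates = (query, reverse_complement(query))
    targets = list(group["sites"].values()) + [group["extended"]]
    return any(c in target for c in candidates for target in targets)


def covers_substitutions(group, preferred="ACGT"):
    """True if the group has an enzyme for every preferred original base."""
    return all(base in group["sites"] for base in preferred)


def sites_in_sequence(group, sequence):
    """Original bases whose recognition sequence occurs in the given
    longer sequence on either strand."""
    sequence = sequence.upper()
    return sorted(
        base
        for base, site in group["sites"].items()
        if site in sequence or reverse_complement(site) in sequence
    )


# Hypothetical enzyme group: one site per original base at the '#' position.
toy_group = {
    "extended": "GA#TCC",
    "sites": {"A": "GAATCC", "C": "GACTCC", "G": "GAGTCC", "T": "GATTCC"},
}

print(matches_sequence(toy_group, "GGATTC"))
print(covers_substitutions(toy_group, "ACGT"))
print(sites_in_sequence(toy_group, "TTGACTCCAA"))
```

With the setting “ACGT”, `covers_substitutions` keeps only groups that can report any natural base at the substitution position, which is the property needed to conclude from an uncut plasmid that the unnatural base is still present.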
+ | <div class="contentline"> | ||
+ | <div class="half left"> | ||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/7/7b/T--Bielefeld-CeBiTec--software_gui.png"> | ||
+ | <p class="figure subtitle"><b>Figure 7: Graphical User Interface of M.A.X.</b></p> | ||
+ | </div> | ||
+ | </div> | ||
+ | <div class="half right"> | ||
+ | <div class="figure large"> | ||
+ | <img class="figure image" src="https://static.igem.org/mediawiki/2017/2/29/T--Bielefeld-CeBiTec--software_searchACGT.png"> | ||
+ | <p class="figure subtitle"><b>Figure 8: Enzyme group cutting for any base</b><br> Shows an entry of the extended sequence and the | ||
+ | four restriction enzymes used in our wet lab proof using the M.A.X. system. If the unnatural base stays at the position of the character | ||
+ | "#", then none of the four enzymes can cut. Otherwise, for any mutation which leads to a substitution with A, C, G, or T, one of the | ||
+ | four displayed enzymes cuts the DNA strand.</p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
<div class="contentbox"> | <div class="contentbox"> |
Revision as of 03:30, 2 November 2017
Nanopore Sequencing of DNA Containing Unnatural Bases
iCG
Background
Overview
Fig. 1: Overview of the iCG functionality.
Dependencies & Installation
> cd bwa; make
> ./bwa index ref.fa
> ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
> ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
> install.packages("cowplot")
> install.packages("MASS")
> pip2.7 install biopython
> pip2.7 install -Iv rpy2==2.8.0 # on 10-26-2017, rpy2 v2.9 was incompatible with python2.7
> pip2.7 install nanoraw[plot]
iCG filter
Description
Fig. 2: Terminology and regular expression used for pattern matching. To identify reads containing the region of interest, a regular expression is used for pattern matching that omits the immediate sequence context of the position of interest, called the "blur" region. The displayed regular expression results from the default parameter settings for the exemplary reference sequence.
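As a hedged illustration of how such a pattern might be assembled, the following Python sketch derives a sense and an antisense pattern from a reference sequence, a position of interest, and the parameters radius, blur, and blur deviation. It is inferred from the parameter names and the example output on this page, not taken from the iCG source code. The function `build_patterns`, the toy reference, and the mapping of the `{e<=2}` error limit to a `context_deviation` parameter are assumptions. The `{e<=N}` syntax appears to be the fuzzy-matching notation of the third-party `regex` package and is not understood by Python's built-in `re` module, so the sketch only builds the pattern strings.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_patterns(reference, poi, radius=15, blur=3,
                   blur_deviation=1, context_deviation=2):
    """Build sense/antisense patterns around position `poi` (0-based).

    The flanks cover `radius` bases on each side of the position of
    interest, minus the `blur` bases directly next to it. The blur region
    (2 * blur + 1 bases) is matched by a wildcard whose length may deviate
    by `blur_deviation`. All defaults are assumptions based on the log.
    """
    upstream = reference[poi - radius: poi - blur]
    downstream = reference[poi + blur + 1: poi + radius + 1]
    gap = 2 * blur + 1
    low, high = gap - blur_deviation, gap + blur_deviation

    def fmt(left, right):
        return "(%s){e<=%d}.{%d,%d}?(%s){e<=%d}" % (
            left, context_deviation, low, high, right, context_deviation)

    sense = fmt(upstream, downstream)
    antisense = fmt(reverse_complement(downstream),
                    reverse_complement(upstream))
    return sense, antisense


# Toy reference: 3 padding bases, 12-base flank, 7-base blur region around
# the position of interest (index 18), 12-base flank, 3 padding bases.
toy_reference = "GGG" + "TCTAGTACCGAA" + "ACGCTGA" + "CCTATCATCGCT" + "GGG"
sense, antisense = build_patterns(toy_reference, poi=18)
print(sense)
print(antisense)
```

The flanks of the toy reference were chosen so that the resulting strings can be compared with the sense and antisense patterns printed in the example output of iCG filter.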
Usage & Output
67402 removed (40.81 % of remaining)
97739 remaining (59.19 % of total)
filtering for minimal length of 1887.75
removed 91117 (93.22 % of remaining)
6622 remaining (4.01 % of total)
filtering for maximal length of 2517
removed 1330 (20.08 % of remaining)
5292 remaining (3.20 % of total)
adding fastq information
filtering for sequence context: 15 bases upstream and downstream, with 3 bases of blur and 1 bases blur_deviation
sense pattern: (TCTAGTACCGAA){e<=2}.{6,8}?(CCTATCATCGCT){e<=2}
antisense pattern: (AGCGATGATAGG){e<=2}.{6,8}?(TTCGGTACTAGA){e<=2}
deviations in sequence length of remaining sequences (* selected):
sense antisense
-5: 6 3
-4: 9 9
-3: 34 10
*-2: 71 32
*-1: 135 94
* 0: 875 705
* 1: 84 47
* 2: 44 24
3: 14 12
4: 1 1
5: 4 0
------ ------
* 1209 902
removed 3181 (60.11 % of remaining), 19 due to multiple RE matches
2111 remaining (1.28 % of total)
filtering for context quality higher than 14.0 (* selected)
sense antisense
7-8: 0 1
8-9: 3 1
9-10: 12 5
10-11: 18 6
11-12: 38 24
12-13: 70 41
13-14: 108 82
*14-15: 142 101
*15-16: 193 128
*16-17: 155 109
*17-18: 167 116
*18-19: 127 95
*19-20: 84 78
*20-21: 56 54
*21-22: 25 29
*22-23: 7 19
*23-24: 3 10
*24-25: 1 1
*25-26: 0 2
------ ------
* 960 742
removed 409 (19.37 % of remaining)
1702 remaining (1.03 % of total)
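The percentages in the output above refer to two different bases: "removed" is given relative to the reads that were still present before the respective filter step, whereas "remaining" is given relative to the total number of reads. The following short Python sketch reproduces the first three steps under that reading. The function `report_step` is hypothetical, and the total of 165141 reads is not printed in the log but inferred as 67402 + 97739.

```python
def report_step(total, before, removed):
    """Return the remaining count and both percentages as formatted strings.

    total:   number of reads before any filtering (inferred, see above)
    before:  number of reads entering this filter step
    removed: number of reads removed by this filter step
    """
    remaining = before - removed
    pct_removed = "%.2f" % (100.0 * removed / before)      # % of remaining
    pct_remaining = "%.2f" % (100.0 * remaining / total)   # % of total
    return remaining, pct_removed, pct_remaining


total = 67402 + 97739  # inferred from the first step of the log
steps = [67402, 91117, 1330]  # reads removed in the first three steps

before = total
for removed in steps:
    remaining, pct_removed, pct_remaining = report_step(total, before, removed)
    print("removed %d (%s %% of remaining)" % (removed, pct_removed))
    print("%d remaining (%s %% of total)" % (remaining, pct_remaining))
    before = remaining
```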
Help pages
[--dst_dir_name DST_DIR_NAME]
[--min_mean_qscore MIN_MEAN_QSCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--radius RADIUS] [--blur BLUR]
[--blur_deviation BLUR_DEVIATION]
[--context_deviation CONTEXT_DEVIATION] [--no_indels]
[--max_length_deviation MAX_LENGTH_DEVIATION]
[--min_mean_context_qscore MIN_MEAN_CONTEXT_QSCORE]
[--greedy_regex_search] [--cpts_limit CPTS_LIMIT]
[--normalization_type {median,pA,pA_raw,none}]
[--processes PROCESSES] [--plot_poi]
[--disp_bases DISP_BASES] [--barcode BARCODE]
[--barcode_deviation BARCODE_DEVIATION]
[basecalled_dirs [basecalled_dirs ...]]
This sub-script takes a set of directories containing basecalled Nanopore reads as input and filters the contained reads by several criteria given as arguments to this script. Afterwards, the remaining reads are resquiggled by nanoraw and copied to a new folder inside the basecalling directory.
positional arguments:
optional arguments:
iCG model
Description
Fig. 3: Determination of Mean Normalized Signals at positions -2 to +2. The plot on the left-hand side shows the normalized signal traces of reads originating from a DNA sample containing C at the position of interest, plotted against the normalized and corrected sequence position using nanoraw. For positions -2 to +2 relative to the position of interest, the mean normalized signal is calculated for every read individually. The table on the right-hand side lists the average characteristics of this group regarding these sequence positions.
Fig. 4: Example output of iCG model. Group means, coefficients of linear discriminants, proportion of trace, and dot-plot of linear discriminants as an exemplary result of a linear discriminant analysis of a template dataset containing sequencing reads with natural bases at the position of interest. The proportion of trace indicates the contribution of each linear discriminant to the discrimination of the different groups. In this particular example, the first two linear discriminants are sufficient for classification, as the effect of the third linear discriminant is negligible.
Fig. 5: Effect of removing most deviating reads on the normalized signal traces. With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the normalized signal traces become more and more focused on the mean signal of the group being analyzed. Contaminations from previous sequencing libraries are effectively removed.
Fig. 6: Changes in the dot-plot representation of the linear discriminant analysis model with increasing portion of removed, deviating reads. With an increasing portion of reads that are removed due to their deviation from the median signal at the position of interest, the linear discriminant analysis generates linear discriminant coefficients that are increasingly
Usage & Output
Help pages
[-GC GC_FLTRD_FAST5_DIR] [-TA TA_FLTRD_FAST5_DIR]
[-CG CG_FLTRD_FAST5_DIR] [-NN NN_FLTRD_FAST5_DIR]
[--print_group_characteristics] [--no_data_plots]
[--disp_bases DISP_BASES]
[-q [QUANTILS_TO_REMOVE [QUANTILS_TO_REMOVE ...]]]
[--no_model_plots] [--no_model_stats] [--UBsense UBSENSE]
[--UBantisense UBANTISENSE]
dst_dir
This sub-script creates a linear discriminant analysis model to discriminate between unnatural bases and natural bases at a given position of interest in a reference sequence. To this end, the script is provided with paths to directories containing reads that were previously filtered and resquiggled by the script "iCG filter".
positional arguments:
optional arguments:
iCG predict
Description
Fig. 5: Example of a base prediction result created by iCG predict. Example for the text based output (left) and dot-plot (right) of a prediction result based on a linear discriminant model which needs only one linear discriminant for base prediction.
Usage & Output
Help pages
Based on a linear discriminant analysis model previously created by "iCG model", this sub-script classifies the base at a given position of interest for a set of filtered reads given as input.
positional arguments:
optional arguments:
Mutation Analysis Xplorer
Mutations
M.A.X. and RestrictionDB
Usage
Figure 6: M.A.X. Output
An example of an output from a search on the database. The output consists of enzyme groups.
Figure 7: Graphical User Interface of M.A.X.
Figure 8: Enzyme group cutting for any base
Shows an entry of the extended sequence and the
four restriction enzymes used in our wet lab proof using the M.A.X. system. If the unnatural base stays at the position of the character
"#", then none of the four enzymes can cut. Otherwise, for any mutation which leads to a substitution with A, C, G, or T, one of the
four displayed enzymes cuts the DNA strand.
Downloads
File | MD5 Sum |
---|---|
T--Bielefeld-CeBiTec--software_suite.zip | b7ea52444a6f27a4cf8354613668516a |