
Software tool - Motif Finder

Documentation

These scripts retrieve antibiotic resistance gene sequences from online databases, find conserved regions in those sequences, post-process them to comply with the Cas13a crRNA requirements, and finally present the results and the relevant statistics visually.

This is the online documentation for the scripts we used during the iGEM project. Note that most of the documentation can also be found in the code, and that a lot of what you will find here is actually auto-generated from the code. If you are an advanced programmer intending to build upon this software, you can therefore skip this page and go straight into the code itself (completely safe).

All scripts are written in Python, but some of them call other programs. Whether and how to install those programs is described in the relevant sections.

The easiest way to install Python is to download the Anaconda distribution. You need Python 3.6; other 3.x versions may work, but this is untested, and it will not work with Python 2.x or lower. Anaconda also includes the conda package manager, which makes installing Python packages much easier, but you are free to stick with a regular Python installation and pip.

You can call our scripts from the command line. If you are on Windows, it is best to use the Anaconda Prompt, which is installed automatically with the Anaconda distribution. If you are on a *NIX system (macOS, for example), you can use the built-in terminal. In the documentation we mark command line examples by prefixing them with a dollar sign ($).

Calling a Python script from the command line can be done in different ways. The easiest way is to prefix it with python (or python3 if you also have Python 2 installed). The scripts include a shebang, so on a *NIX system you can also make them executable and run them without the prefix.
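
For example, taking card_reader.py (described below), both of these invocations are equivalent on a *NIX system:

$ python3 card_reader.py -h

$ chmod +x card_reader.py
$ ./card_reader.py -h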

The command line utilities are set up with the built-in argparse module and work like many other command line tools. To pass a variable you set a flag and give the value after it, like this: (-b antibiotic_molecule). We also make use of positional arguments. These do not have flags and must be included at the very end of the line. This is important to remember and can be confusing, because the help lists the positional arguments first and the optional arguments after them; when calling the script you have to turn this around.
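
As a rough illustration of this pattern (a minimal argparse sketch, not the actual source of any of the scripts):

import argparse

# Minimal sketch: optional flags (with a dash) plus positional arguments
# that are collected at the end of the command line.
parser = argparse.ArgumentParser(description="Sketch of the CLI layout")
parser.add_argument('-b', '--base', default='https://card.mcmaster.ca',
                    help="An optional flag followed by its value")
parser.add_argument('relations', nargs='*',
                    default=['confers_resistance_to'],
                    help="Positional arguments, given at the end of the line")
args = parser.parse_args()
print(args.base, args.relations)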

All scripts have built-in help. You can call it by passing '-h'. For example: $ python3 meme_process.py -h.

Setting up the database

This section describes how to set up a local database from the CARD database. The database is simply stored in your filesystem, because that is the easiest way to do it.

Script: card_reader.py
Internal dependencies: card_functions.py, custom_functions.py
Requirements: Python3, BeautifulSoup, Internet
Estimated time: a few hours
(Remark: there is no parallel execution. If you know about parallel programming you can probably change this: 'processGenes(gene, BASE)' should be moved to a new thread every time it is called. See the sketch below.)
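
A minimal, untested sketch of that idea (the names 'genes' and 'BASE' are assumed to come from the surrounding card_reader.py code):

from concurrent.futures import ThreadPoolExecutor

# Hypothetical parallel version: submit each processGenes call to a thread
# pool instead of running the calls one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    for gene in genes:
        pool.submit(processGenes, gene, BASE)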

Set-Up

First install BeautifulSoup; this is very easy to do with conda.
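
For example (the conda package is called beautifulsoup4):

$ conda install beautifulsoup4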

You can now use the script. Take a look at the help to see how it works. For your convenience the built-in help is also shown here:

$ python3 card_reader.py -h

usage: card_reader.py [-h] [-b BASE] [-l LINK] [-n NAME] ...

Run the web crawler on the CARD database

positional arguments:
relations: The relations to follow for the recursive search.
DEFAULT=['confers_resistance_to', 'confers_resistance_to_drug']

optional arguments:
-h, --help: show this help message and exit
-b BASE, --base BASE: The location of the CARD database.
DEFAULT='https://card.mcmaster.ca'
-l LINK, --link LINK: The link where to start searching.
DEFAULT='/ontology/36008'
-n NAME, --name NAME: The name of the folder where everything will be saved.
DEFAULT='antibiotic molecule'

Additional information about the variables

BASE: The current domain name of the CARD database. Check whether it is still online and has not moved; otherwise you do not have to touch this.

LINK: The link where you want to start. 36008 is the entry for the antibiotic molecules, but there are other starting points as well, for example the mechanisms. Take a look at 36006 for the options.

NAME: The name of the folder where everything will be saved. For 36008 this is of course antibiotic molecule.

RELATIONS: The relations the recursive search follows. This is a list of positional arguments and therefore has to be passed to the script at the end of the line. By default the search always follows direct inheritance (is_a), but there are more relations as well. Take a look at 40324 for more options.

So for example:
$ python3 card_reader.py -b https://card.mcmaster.ca -l /ontology/36008 -n antibiotic_molecule confers_resistance_to confers_resistance_to_drug

This starts the default process: it looks for the genes that confer resistance to the drugs under antibiotic_molecule.

  • card_functions.py

    If you want, you can take a look at the auto-generated documentation for the functions below:

    Created on Mon Jul 10 16:32:22 2017
    @author: hielke
    This module contains functions for processing the CARD database with web crawling

    Functions

    getSoup(url)
    From a given url, returns the object containing the parsed html

    getSubTerms(entry, BASE)
    Collect the sub-Term(s) from the given CARD entry.
    Returns a dict with the relations as keys and lists of the entries as values.

    getSubTermsFromSoup(soup)
    Collect the sub-Term(s) from the given soup

    processGenes(entry, BASE)
    Processes a gene entry (or an entry that points to multiple genes) by producing the FASTA file of the DNA sequence

    saveFasta(fastaDNA, webID)
    From the FASTA data, creates a FASTA file
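
    As an illustration, a function like getSoup(url) could be written roughly as follows (a minimal sketch using urllib and BeautifulSoup; the actual implementation in card_functions.py may differ):

        from urllib.request import urlopen
        from bs4 import BeautifulSoup

        def getSoup(url):
            """From a given url, returns the object containing the parsed html."""
            with urlopen(url) as response:
                return BeautifulSoup(response.read(), 'html.parser')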

Search from the NCBI

Sometimes you only want to look into one specific gene and all its variants. The CARD database already contains the homologous sequences, but it does not tell you about the variants within one gene. For this you can search the NCBI for the variants. This might give you some trouble, however, because sometimes a gene is annotated but does not have an entry of its own. Therefore, we wrote a small script that automatically selects the annotated region and retrieves that instead.

Script: retrieve_from_ncbi.py
Internal dependencies: custom_functions.py
Requirements: Python3, BioPython, Internet

Estimated time: a minute or a bit more.

Set-Up

First install BioPython. Again, using conda will give you the highest success rate.
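
For example (depending on your channels you may need to add -c conda-forge):

$ conda install biopython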

Now you can use the script. We put the help here again for your convenience:
$ python3 retrieve_from_ncbi.py -h

usage: retrieve_from_ncbi.py [-h] [-m MAIL] [-g GENE] ...

A script to retrieve all variants of a gene from the NCBI database and save them. It is also able to retrieve genes that have no entry of their own but are only annotated.

positional arguments:
organisms: At the end you can optionally specify the organisms you want to filter on.

optional arguments:
-h, --help: show this help message and exit
-m MAIL, --mail MAIL: You have to give your email address in order to be allowed to interact with the NCBI API.
-g GENE, --gene GENE: The gene from which you want to retrieve all the variants

Additional information about the variables

MAIL: If you overuse the server, you will be contacted on this address. Please make sure you comply, because NCBI might take further steps to prevent you from accessing the database, or impose more restrictions on all users.

GENE: The gene name to search for. You can take a look at NCBI first to check that it is really listed under this name.

ORGANISMS: The organisms you want to filter on. These are optional positional parameters and have to be given at the end of the line. Selecting the organisms usually gives much better results.

For example, for the mecA search we used these arguments: $ python3 retrieve_from_ncbi.py -m example@mail.com -g mecA Streptococcus Staphylococcus
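
For reference, the script presumably builds on BioPython's Entrez module; a minimal sketch of that general pattern (not the actual retrieve_from_ncbi.py, which also handles genes that are only annotated) looks like this:

from Bio import Entrez

# Sketch of the standard BioPython/NCBI pattern: search for a gene in an
# organism, then fetch the matching nucleotide records as FASTA.
Entrez.email = "example@mail.com"  # NCBI requires a contact address
handle = Entrez.esearch(db="nucleotide", term="mecA AND Staphylococcus[Organism]")
ids = Entrez.read(handle)["IdList"]
handle = Entrez.efetch(db="nucleotide", id=ids, rettype="fasta", retmode="text")
print(handle.read())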

Motif search

After you have set up your database, or retrieved your sequences some other way, you will want to submit motif search jobs to it. This part of the documentation describes how to do that. Basically the process consists of two steps. First you gather all the sequences you want and put them in one file. Next, you run multiple motif searches on these files. Results may vary, so you will need to do some tweaking here.

Script: meme_process.py
Internal dependencies: meme_functions.py, custom_functions.py
Requirements: Python3, MEME, BASH

Estimated time: a few hours to a day or more.
(MEME is not a very efficient program. We do not recommend going over 200 unique sequences per batch. The program runs with parallel computing, but not optimally: it does not take the number of cores into account, and not all cores will finish simultaneously.)

Set-Up

First you have to satisfy the BASH requirement. BASH is in fact the default shell your terminal runs in on a *NIX system. We use it a few times out of laziness. However, you need to be on a *NIX system anyway, because otherwise you cannot install MEME. If you run Windows you may want to set up a virtual machine or install Cygwin. You can find installation instructions for MEME online: http://meme-suite.org/doc/install.html?man_type=web.
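
A typical install from source roughly follows the pattern on that page (check it for the current version number and the recommended configure options):

$ tar xzf meme-*.tar.gz
$ cd meme-*
$ ./configure --prefix=$HOME/meme
$ make
$ make test
$ make install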

The help is given here:

$ python3 meme_process.py -h

usage: meme_process.py [-h] [-b BEGIN] [-e END] [-s STEP] [-m MOTIFS]
[-x EXTENSION] [-r ROOT]
mode ...

What kind of process related to meme you want to run

positional arguments:
mode: The mode you want to run in. Choose from: ['gather', 'meme']
gather: Creates one fasta file from all fasta files.
Not all the options are used for this one.
meme: Starts the meme process with the following options:
dirs: Optionally, different folders in which to execute the same process

optional arguments:
-h, --help: show this help message and exit
-b BEGIN, --begin BEGIN: For the range of the number of sites to visit, this is the beginning (including). DEFAULT=2
-e END, --end END: For the range of the number of sites to visit, this is the end (excluding). DEFAULT=7
-s STEP, --step STEP: The step size for the range of the number of sites to visit. DEFAULT=1
-m MOTIFS, --motifs MOTIFS: The number of motifs you want to find; if not given, a somewhat clever algorithm tries to determine this.
-x EXTENSION, --extension EXTENSION: The extension of the file that contains all fasta sequences; 'temp' is not allowed in the filename.
DEFAULT=.multiple.fasta
-r ROOT, --root ROOT: The root folder in which the process works on all available subdirectories. If you want to run the same process on a given set of directories instead, provide them at the end of the command and leave this one out.

Additional information about the variables

As mentioned, there are two parts to this, represented as two modes in the command line utility: 'gather' and 'meme'. Specify the mode after the flags (the options with a dash), but before the optional list of directories to work in.

There are also two ways in which the command line utility can work. Either it works in all subdirectories of a given ROOT, or it takes the remainder of the command line input as the directories to work from.

For the gather process, specify only which directory or directories it has to work on. It will then automatically create a file with all the sequences it finds in the whole directory (recursively). The file will be called gathered.multiple.fasta. There is also a file called temp.gathered.multiple.fasta. This file has line break encoding that the meme software cannot work with, but it is left there because the correct line break encoding is not recognized by the Windows operating system and old Mac operating systems. (This is the “classic” Mac OS; in Mac OS X and newer this was changed in order to be POSIX compliant.) So if you are a Windows user you can check this file instead.

For the meme process, you have a lot more options.

EXTENSION: Only set this one if you created your .fasta file collection yourself. The default matches the file naming convention used by the 'gather' process: ending in .multiple.fasta and not containing temp.

BEGIN, END, STEP: These variables create the range of the number of sites in which the meme processes will look for motifs. MEME has to be given the maximum number of sites each motif covers: a higher number means more errors in the motifs, while a lower number means less coverage, so you have to find the sweet spot. The destination folder is suffixed with the number of sites, like meme_out_10. These settings also determine the number of processes that are spawned, so ideally you start as many processes as you have cores (or threads, if your cores support hyper-threading), or leave one core free so that you can keep working on other things without much lag.
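
For example, with the defaults -b 2 -e 7 -s 1 the script works on range(2, 7, 1):

>>> list(range(2, 7, 1))
[2, 3, 4, 5, 6]

so, as described above, five meme processes are spawned, writing their results to meme_out_2 up to meme_out_6.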

MOTIFS: This variable determines how many motifs will be found. If you collect more, you have a higher success rate, but there is of course a maximum to this. There is, however, also a mechanism that determines when to stop searching, so that your computer does not run for nothing.

So, for example, to gather all sequences into one file:

$ python3 meme_process.py -r antibiotic_molecule_36008 gather

and then specifically initiate the meme processes on two directories:

$ python3 meme_process.py -m 50 meme glycopeptide_antibiotic_36220 elfamycin_antibiotic_37618

Postprocessing

After you have collected your motifs, you must select the motifs that fulfill the requirements for Cas13a and that do not have too many incorrect spots. The MEME program gives its output in different formats: HTML, XML, and a simple TXT file. You can view the HTML for quick reference. The TXT file, however, is the one BioPython uses to parse the results into memory. The TXT file is also the only file that is created on the fly rather than at the end, which means that when a MEME process is stopped in the middle, it is still possible to retrieve information from the file. The BioPython parser, however, is not capable of doing that, and there are also problems with different versions of the MEME program. That is why we have partly rewritten this parser and included it in the folder.

Script: meme_postprocess.py
Internal dependencies: meme_functions.py, custom_functions.py
Requirements: Python3, BioPython, BASH (only needed for one option)

Estimated time: less than a few minutes

Set-Up

For visualizing the results we use matplotlib. We recommend installing it with conda. The only function that requires BASH is the statistics one. We presume that if you have come this far and installed MEME, you have access to a BASH shell, but you can also rewrite this part yourself if you want to: look in meme_postprocess.py for the find function.
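
For example:

$ conda install matplotlib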

Again, the help is shown here:

$ python3 meme_postprocess.py -h

usage: meme_postprocess.py [-h] [-s STATISTICS] [-f FILE] [-d DIR] [-v] ...

A program to calculate the motifs that pass our requirements and plot them in a meaningful way

positional arguments:
dirs: List of the meme_output* directories to work in

optional arguments:
-h, --help: show this help message and exit
-s STATISTICS, --statistics STATISTICS: Give statistics of the given directory
-f FILE, --file FILE: Give a file and calculate from this file the motifs that did pass
-d DIR, --dir DIR: A directory to work in all subdirectories
-v, --verbose: Execute the code in a more verbose manner

Additional information about the variables

VERBOSE: Here we also have a verbose flag. You do not have to give a value here; just pass the flag to receive more verbose output.

FILE: One way to use this script is to simply give the path to a file; you will then get back all the motifs that pass the criteria.

DIR: This works the same way as with meme_process.py. You either specify a directory to work in all its subdirectories, or you specify the directories to work from at the end of the line.

DIRS: These are the directories you work in, if you do not give a directory with '-d'.

  • meme_functions.py

    If you want, you can take a look at the auto-generated documentation for the functions below:

    @author: hielke
    This module contains functions for sending the correct jobs to the meme utility, and for post-processing the results from it.

    Functions

    memeCmd(file:str, sites:int, motifs:int, width:int=24) -> str
    Creates a command for the terminal running the meme process on 'file' with a maximum number of sites to check and a number of motifs to find. The number of motifs largely determines how long the program will run. The number of sites largely determines how good the results will be: if you attempt to cover more sites, the results will have more SNPs in them. Optionally, you can specify the width of the motif (default is 24).

    memeOnFile(f:str)
    This function is a factory: it returns a function. The returned function is the memeProcess function, but with the file already set.

    memeProcess(f:str, i:int, n:int) -> None
    This executes the meme program. It spawns as a new process, not as a new thread.

    only_one(motif, loc)
    Returns false if in the counts array there is only one base
    (so a 100 % score).
    (This might be a little strict, though: a motif could contain valuable data but still get thrown away, while it might be just one sequence that ruins the 100 % score.)

    passed(motif)
    Returns true if the motif passes the requirements for Cas13a (a stand-alone sketch of these criteria is shown below):
    * The GC content is between 40 and 60 percent.
    * No more than 2 SNPs.
    * If there are 2 SNPs, none may lie between positions 9 and 16, which is the place of the seed region.

    wrong(motif)
    Returns the number of SNPs and the positions they are in.
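
    To make the passed(motif) criteria above concrete, here is a hypothetical stand-alone version (the real implementation works on BioPython motif objects; this sketch takes a plain consensus sequence and a list of 1-based SNP positions):

        def passes_cas13a(consensus, snp_positions):
            # GC content between 40 and 60 percent
            gc = sum(base in 'GC' for base in consensus.upper()) / len(consensus)
            if not 0.4 <= gc <= 0.6:
                return False
            # no more than 2 SNPs
            if len(snp_positions) > 2:
                return False
            # with 2 SNPs, none may fall in the seed region (positions 9-16)
            if len(snp_positions) == 2 and any(9 <= p <= 16 for p in snp_positions):
                return False
            return True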

  • custom_functions.py
    And lastly, some general functions we used here and there.

    custom_functions.py
    @author: hielke
    Just some useful general functions.
    The first half is available on all operating systems, but the second half requires a *NIX operating system (like macOS or Linux; Windows with Cygwin installed might also work). The functions of the second half are only used in the scripts related to the meme program, which itself can only be installed on *NIX operating systems.

    cd(newdir, verbose=True)
    Performs a change of directory, but with 'with' context management

    fileCat(extension, verbose=True, Windows=False)
    In the current folder, collects all the files with the extension 'extension' recursively and concatenates them. It will not take into account files that already have the word multiple prefixed to their extension; this last condition makes sure the function can be repeated without collecting duplicates. As a bonus it converts the newline characters to the standard form: it creates the file temp.gathered.multiple.extension with unaltered newline characters and the file gathered.multiple.extension with converted newline characters. NB: Windows might experience problems with this, so if you want you can pass Windows=True to the function. NB2: Requires sed version 4.2.2 because of the -z option; you will usually find this on any modern *NIX based OS.

    grouper(iterable, n, fillvalue=None)
    Collect data into fixed-length chunks or blocks
    >>> grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx

    logit(f)
    A decorator for logging all the important information

    mkdir(entry)
    Creates a directory from an entry, and also checks whether it already exists

    nonEmptyDir() -> bool
    Lists all nonempty directories recursively
    NB. Returns a filter object (an iterator)
    Requires *NIX based OS.

    saveCat(*args)
    Concatenates all the iterables in the input, leaving out the None values; if all are None, returns None

    sh(cmd, verbose=True)
    Returns the stdout of the shell command as an iterator over the lines.
    The lines are strip()-ed.