Team:TUDelft/Main-Motif Finder

Finding conserved regions in antibiotic resistance genes

There are many different antibiotics, and bacteria can become resistant to each of them in many ways. Nevertheless, our detector should detect as many variants as possible for every type of antibiotic resistance: not only variants of the same gene, but also variants of homologous genes. We therefore looked both for different genes encoding the same type of antibiotic resistance and for different variants of the same gene. Once we had a way to collect all those genes, we searched them for conserved regions. From these regions we can design crRNAs that fit most or even all bacterial strains carrying these antibiotic resistance genes.

Genes that confer a similar function usually also share similar regions. The same can be expected of antibiotic resistance genes (Singh et al. 2009). We therefore developed a software tool to find these regions.

Building a database

To analyse the groups of genes conferring resistance to one type of antibiotic, we used the Comprehensive Antibiotic Resistance Database (CARD) (Jia et al. 2017). We chose this database because it is still curated and should therefore contain the most recently discovered genes. However, it is not easy to access programmatically.

We therefore built a web crawler to extract the gene sequences. Our script collects all the sequences and preserves the tree structure of the database. You can download the resulting collection of sequences (here) (this also includes the rough results), or create it yourself with our tool from the (software page).

The CARD database is organised in a tree-like structure in which larger groups of genes are subdivided into smaller groups. Statistics of the groups are shown in Figure 1. The largest group consists of genes conferring resistance to beta-lactam antibiotics, which include penicillin.


Figure 1: Statistics of our database. The number of genes with known or unknown sequence or subgroup present in our database per antibiotic.

De novo motif finding - Showing where the genes are alike

Motifs are short sequences (2-50 nucleotides) that occur repeatedly in a set of genes. Since our probe (the crRNA) is also short (24-30 nucleotides), we focused on finding motifs in all homologs of genes conferring resistance to a given antibiotic. To find such motifs we used the MEME program. We ran the program on multiple groups of genes and then post-processed the output to meet the requirements of CRISPR Cas13a. A motif needs to meet three additional requirements. First, its GC content must be between 40 and 60 percent. Second, no G is allowed at the 3’-end of the motif. Third, the number of mutations in the conserved region is limited to a maximum of 2, and to a maximum of 1 in the seed region (see the off-targeting model page). The seed region is the part of the crRNA that is first bound to the CRISPR Cas13a system, which is estimated to lie between base numbers 8 and 13. All this post-processing is automated, and the scripts for it are documented on the Software Tool Page.
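The three requirements above can be sketched in a few lines of Python. This is only an illustration and not the team's actual post-processing script: the function names are hypothetical, and it assumes that bases are numbered from 1 along the motif, that the seed bounds 8 and 13 are inclusive, and that a site has the same length as the motif.

```python
SEED_START, SEED_END = 8, 13  # assumed: 1-based, inclusive, counted along the motif


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def motif_is_usable(motif):
    """Requirements 1 and 2: 40-60 percent GC and no G at the 3'-end."""
    return 0.40 <= gc_fraction(motif) <= 0.60 and not motif.upper().endswith("G")


def mismatch_positions(motif, site):
    """1-based positions at which a gene site differs from the motif."""
    if len(motif) != len(site):
        raise ValueError("motif and site must have the same length")
    pairs = zip(motif.upper(), site.upper())
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


def site_is_covered(motif, site, max_total=2, max_seed=1):
    """Requirement 3: at most 2 mismatches overall, at most 1 in the seed."""
    positions = mismatch_positions(motif, site)
    in_seed = [p for p in positions if SEED_START <= p <= SEED_END]
    return len(positions) <= max_total and len(in_seed) <= max_seed
```

A motif would be kept only if `motif_is_usable` holds, and a gene would count as covered by that motif only if `site_is_covered` holds for the matching site in that gene.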

To visualise the results we plotted the coverage of the motifs against the number of motifs. The motifs are ordered by coverage, with the highest coverage first. See Figure 2.
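As a rough illustration of how such a ranking could be computed, the sketch below takes a mapping from each motif to the set of genes it covers. The data layout and function names are assumptions made for this example, and since the text does not say whether the plotted coverage is per motif or cumulative, both are shown.

```python
def rank_by_coverage(hits, n_genes):
    """Sort motifs by the fraction of genes they cover, highest first.

    hits maps a motif to the set of gene identifiers it covers
    (hypothetical layout); n_genes is the size of the gene group.
    """
    ranked = sorted(hits.items(), key=lambda item: len(item[1]), reverse=True)
    return [(motif, len(genes) / n_genes) for motif, genes in ranked]


def cumulative_coverage(hits, n_genes):
    """Fraction of genes covered by the first 1, 2, 3, ... ranked motifs."""
    covered = set()
    curve = []
    for motif, _ in rank_by_coverage(hits, n_genes):
        covered |= hits[motif]
        curve.append(len(covered) / n_genes)
    return curve
```

The cumulative curve indicates how many probes would be needed to reach a given share of the genes in a group.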

Here is a selection of graphs created from the results:

Figure 2: Coverage of genes. The coverage of motifs in genes that confer resistance against beta-lactam (a), streptogramin (b), fluoroquinolone (c) and aminoglycoside (d) antibiotics. (Percentages 100x)

Discussion

As you can see in the figures, the maximum coverage does not go much beyond 50 percent. Genes differ significantly from each other, and Cas13a is so sensitive that even small differences are usually not tolerated. These small differences may often be caused by synonymous mutations, i.e. mutations in the DNA that do not change the protein. For example, the codons GCT and GCC both encode alanine. So even if a region is fully conserved at the protein level, it might not be fully conserved in the DNA. We did, however, greatly reduce the number of probes needed compared with a one-probe-per-gene strategy.

Looking at the variants - a case study

For the analysis of variants of the genes, we focused on mastitis, a cow disease that can be caused by S. aureus. Antibiotic resistance in S. aureus can be caused by either mecA or blaZ. mecA is a multi-drug resistance indicator, and blaZ is a gene that confers resistance to a much smaller number of antibiotics, but these are the most used ones, like penicillin.

Because we focused on one gene alone, we wanted to make absolutely certain that the region we had in mind for it was the correct one. A broad-spectrum analysis finds conserved regions that are likely to be conserved in the variants of a single gene as well. Still, we also ran a conserved-region search on the variants of one gene.

To retrieve variants of one gene we used the NCBI gene database. This database contains multiple variants of one gene. We searched by the gene name and retrieved the sequences of the returned accession numbers. Sometimes the accession number did not yield a DNA sequence directly and some additional parsing was required. The script that handles this is called 'retrieve_from_ncbi.py' and can be found on our Software Tool page.

Results

You can find the variants of the blaZ gene here.

We ran a motif search on the 43 variants and extracted the conserved regions. This process can be automated with the function "only_one" from the file "meme_functions.py" (see also our software page):

—AXAXTGXTXAXXXAXXAAXXXAXAAAATXXTAAXGXXA
TXGGTGXXXXXAAAAAAGTTAAACAACXTXTAAAAXXAX
TXGGAGATAAAGTAACAAATCCAGTTAGATATGAXATAG
AATAAATTAXTATTCXCCAAAGAGCAAAAAAGATACTTC
AACXCCTGCTGCTTTCGGXAAGACTTTAAATAAACTTAT
CGCAAATGGAAAATTAAGCAAAXAAAAXAAAAAXTTCTT
ACTTGATTTAATGTTXAATAATAAAAXCGGAGAXACTTT
AATTAAAGAXGGTXTTXCAAAAGACTXTAAXGTXGCTGA
TAAAAGTGGTCAAGCAATAACATATGCTTCTAGAAATGA
TGTTGTTTTGTTTATCCTAAGGGCCAATCTGAACCTATT
GTTTTAGTCATTTTTACGAATAAAGACAATAAAAGTGAT
AAGCCAAATGATAAGTTGATAAGTGAAACCGCCAAGAGT
GTAATGAAGGAATTTTA

— indicates the beginning of the sequence, which was not conserved in all sequences.

X indicates a base that is not conserved enough (it is not the same in all 42 sequences checked).

NB. One of the 43 sequences did not resemble the rest at all. It is likely a faulty entry in the data and was therefore excluded from the results.
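The X notation can be illustrated with a minimal sketch. It is not the "only_one" function itself: the function name below is hypothetical, and it assumes the variants are already aligned to equal length and that a position counts as conserved only when every sequence has the same base there.

```python
def mask_consensus(aligned, placeholder="X"):
    """Keep a base where all aligned sequences agree, otherwise write X."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*...) walks the alignment column by column.
    columns = zip(*(seq.upper() for seq in aligned))
    return "".join(
        col[0] if len(set(col)) == 1 else placeholder for col in columns
    )
```

For instance, `mask_consensus(["ATGCA", "ATGGA", "ATGCA"])` gives `ATGXA`, because only the fourth position differs between the sequences.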

Link to the lab

From the conserved regions we also designed primers (IG0088 and IG0089). These primers were tested on a real-world sample with an unknown sequence. The test was successful, which shows that this part of the software works as intended. See the sample prep results page for the results.