DNA signatures for detecting genetic engineering in bacteria
© Allen et al.; licensee BioMed Central Ltd. 2008
Received: 23 August 2007
Accepted: 18 March 2008
Published: 18 March 2008
Using newly designed computational tools we show that, despite substantial shared sequences between natural plasmids and artificial vector sequences, a robust set of DNA oligomers can be identified that can differentiate artificial vector sequences from all available background viral and bacterial genomes and natural plasmids. We predict that these tools can achieve very high sensitivity and specificity rates for detecting new unsequenced vectors in microarray-based bioassays. Such DNA signatures could be important in detecting genetically engineered bacteria in environmental samples.
Synthetic vector sequences are of fundamental importance in molecular biology. Cloning and expression vectors are among a multitude of synthetic sequence types commonly used as part of a basic tool set for DNA amplification and protein production. As synthetic biology research fast approaches maturity, it is reasonable to imagine in the not too distant future the broad-scale manufacture of sophisticated synthetic plasmids to modify existing bacteria and possibly the construction of new functioning synthetic genomes. The potential exists to address challenges in many areas, from food production to drug discovery. However, along with the potential benefit comes the increased risk of engineered pathogens [6, 7]. Thus, with improvements in genetic manipulation comes the need for tools to detect genetically modified bacteria in the environment.
Large-scale computational pipelines have advanced bio-defense by efficiently finding polymerase chain reaction (PCR) assay-based primers that are able to accurately identify dangerous bacterial and viral pathogens [8–10]. The development of random DNA amplification methods has highlighted microarrays carrying DNA signatures as a potentially practical multiplexing complement to PCR. Recent progress has made DNA signature design tools widely available to pathogen research through the development of a publicly available computational pipeline for designing PCR-based signatures. These advances demonstrate the utility of DNA signature pipelines, but the question remains whether such an approach could be used to detect genetically engineered bacteria.
A computational analysis was performed on the available synthetic vector sequences, which form an important basis for current tools in genetic engineering. One result of this work is the identification of DNA signatures that differentiate the vector sequences from the sequenced naturally occurring plasmid and chromosomal DNA. Candidate DNA signatures were found to cover nearly all artificial vector sequences using a wide range of signature lengths. The presence of these candidate DNA signatures opens the potential to develop assays in the future for detecting simple but widely available forms of genetic engineering. The vector sequence data were further leveraged to predict natural plasmids, which may form the basis for future vectors based on conserved functional sequences.
Results and discussion
Vector DNA signatures
A total of 3,799 partial and complete artificial vector sequences totaling 21,132,057 nucleotides were collected from various sequence databases (details given in Materials and methods) and analyzed for conserved sequence elements. Sequences were compared using exact k-mer matching (a k-mer is a nucleic acid sequence of length k). This alignment-free comparative sequence approach [15, 16] contrasts with methods that use conserved order among compared sequences. The alignment-free comparison is motivated by the abundance of similar artificial vector sequences, which can differ in the relative order of functional elements owing to differing sources of sequence construction. Conserved order comparison is further confounded by transposable elements and the need to efficiently compare several thousand sequences simultaneously.
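To illustrate why exact k-mer matching is insensitive to the order of functional elements, the following minimal Python sketch compares two toy sequences that carry the same two elements in opposite order. The function names, the toy sequences and the choice k = 8 are illustrative assumptions rather than part of the published pipeline, and strand handling (reverse complements) is ignored here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seq_a, seq_b, k):
    """Exact k-mer matching: k-mers present in both sequences."""
    return kmers(seq_a, k) & kmers(seq_b, k)


# Hypothetical toy "vectors": the same two elements, in opposite order,
# separated by unrelated spacer sequence.
ELEMENT_A = "ACGTTGCA"
ELEMENT_B = "GGATCCTA"
vector_1 = ELEMENT_A + "TTTT" + ELEMENT_B
vector_2 = ELEMENT_B + "CCCC" + ELEMENT_A

common = shared_kmers(vector_1, vector_2, k=8)
print(sorted(common))
```

Both elements are recovered as shared 8-mers even though their relative order differs, which is the situation an order-dependent alignment would handle less directly.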
k-mer sets collapse the redundant candidate signatures. A k-mer set X for sequences from a set of input sequenced vectors Y is the set of k-mers shared by all n sequences where n is maximal. (There can be no additional input vector sequence in Y with the same set of shared k-mers not included in X.) For example, with three sequences S1, S2 and S3, if S1 and S2 share 20 k-mers not found in S3, these 20 k-mers would form a single k-mer set with a pointer to the two source sequences S1 and S2. If additional k-mers are shared with all three sequences S1, S2 and S3, these k-mers would form a separate k-mer set with a pointer to all three sequences.
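The three-sequence example above can be sketched in Python as follows. This is a simplified illustration, assuming k-mers are grouped by the exact set of sequences that contain them; the function name, the toy sequences and k = 4 are hypothetical and chosen only to reproduce the S1, S2, S3 scenario.

```python
from collections import defaultdict


def build_kmer_sets(sequences, k):
    """Group k-mers by the exact set of source sequences that contain them.

    sequences: dict mapping a sequence id to its nucleotide string.
    Returns a dict mapping frozenset(ids) -> set of k-mers shared by
    exactly those sequences.
    """
    sources = defaultdict(set)  # k-mer -> ids of sequences containing it
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            sources[seq[i:i + k]].add(seq_id)

    kmer_sets = defaultdict(set)  # frozenset of ids -> k-mers
    for kmer, ids in sources.items():
        kmer_sets[frozenset(ids)].add(kmer)
    return dict(kmer_sets)


# Toy data: "ACGTA" occurs in all three sequences, "GGCCT" only in S1 and S2.
toy = {
    "S1": "ACGTAGGCCTTTTT",
    "S2": "GGCCTACGTACCCC",
    "S3": "AAAAACGTA",
}
sets_by_source = build_kmer_sets(toy, k=4)
print(sorted(sets_by_source[frozenset({"S1", "S2"})]))
print(sorted(sets_by_source[frozenset({"S1", "S2", "S3"})]))
```

The k-mers unique to S1 and S2 and the k-mers common to all three sequences end up in two separate k-mer sets, each keyed by its source sequences.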
The completely sequenced vectors were divided into five partitions to check how closely vectors excluded from the signature creation pipeline match the candidate signatures. The hope is that a high percentage of the signatures are found in unseen vectors while remaining distinct from the background genomic sequence. The background genomic sequence is defined here as all sequenced natural plasmids and all sequenced bacterial and viral chromosomes along with the assembled draft sequence. Each partition was searched against a signature set generated from the remaining 80% of the vector data using NCBI BLAST. The background genomic sequence was similarly searched against each of the five signature sets. Each vector sequence and background genomic sequence was assigned its average bit score from the BLAST matches, plus the standard deviation. Support for differentiating between the artificial vector sequence and a background sample via differential cross-hybridization is enhanced when every artificial vector sequence's similarity to the signature set is higher than the background genomic sequence. It should be noted that the bit scores provide a rough estimate of hybridization potential and additional parameters may be used to optimize signature sets for a specific detection experiment and assay medium.
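The partitioning scheme can be sketched as below. This is only an illustration under stated assumptions: the way vectors were assigned to partitions is not described in the text, so a seeded shuffle is used here, and the separation check is a simplified reading of the criterion that every vector should score higher than the background. All names and numbers are hypothetical.

```python
import random


def five_fold_splits(vector_ids, n_folds=5, seed=0):
    """Yield (training_ids, held_out_ids) pairs, one per partition.

    Each held-out partition holds roughly 1/n_folds of the vectors; the
    signature set would be built from the remaining ~80% when n_folds is 5.
    """
    ids = sorted(vector_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        training = [v for v in ids if v not in held]
        yield training, held_out


def cleanly_separated(vector_scores, background_scores):
    """True if every vector score exceeds every background score."""
    return min(vector_scores) > max(background_scores)


# Made-up identifiers and scores, for illustration only.
for training, held_out in five_fold_splits([f"v{i}" for i in range(10)]):
    print(len(training), "training vectors;", len(held_out), "held out")
print(cleanly_separated([310.0, 295.5], [120.0, 88.0]))
```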
Two k-mer values, 30 and 60, were used with two signature set sizes, a smaller and larger set averaging 28,414 and 77,184 k-mers, respectively. Values for k (30 and 60) were chosen to examine signature types with different microarray hybridization patterns using lengths that we know from experience have different characteristics on our synthesized microarray platform. An alternative BLAST approach called MCS-only was included for comparison. MCS-only uses the multiple cloning sites of vectors exclusively as the source for creating signatures. The multiple cloning sites were first searched against the background sequence using BLAST, and regions without contiguous exact matches exceeding k were retained as input for constructing candidate signatures.
The MCS-only approach has the advantage of being easier to implement and requires fewer computational resources. Since the multiple cloning sites are expected to be good identifiers of vector sequence, it is possible that using all of the vector sequence as input provides limited information for creating signature data beyond what is already found at the multiple cloning sites. There are, however, potential disadvantages to this approach. Accessing the annotation specifying the multiple cloning site in every vector sequence is not easy. Despite our best efforts, we were unable to obtain multiple cloning site annotations for 18% of the completely sequenced vectors, although given the redundancy among vectors, extracting a good signature set remains possible.
The results indicate that the limited annotation of multiple cloning sites for vector sequences is not the only cause for the drop in MCS-only performance. The signature-based approach yields additional signatures outside the MCS region that boost confidence in the prediction of a vector, particularly in cases where the MCS region does not match well with the signature set. An additional advantage of using signatures outside the MCS region is to recover more information about the detected vector. Since signatures can come from other functional regions such as origin of replication sites and selection marker genes, matches to these signatures could provide additional information that would be useful in learning more about a vector and host type embedded in a complex sample.
It is important to note that longer probe lengths reduce microarray hybridization specificity. Using shorter k-mer sizes for microarray probe design may lead to more specific detection rates compared with longer k-mers, since single nucleotide differences are used to determine candidate signatures for all values of k. The results in Figure 5 suggest that longer probes can be filtered using BLAST to remove additional near matches to the background, which could improve hybridization specificity while maintaining good coverage across the complete set of artificial vectors.
Plasmid/vector conserved functional sequence
Bacteria with plasmids matched to artificial vectors.
Photobacterium damselae subsp. piscicida
Environmental samples uncultured bacterium
Salmonella enterica subsp. enterica serovar Typhi str. CT18
Yersinia pestis biovar Orientalis str. IP275
GenBank identifiers for vector sequence matching Y. pestis plasmid.
Vector GenBank accession
Recombination site, CDS, promoter, transcription terminator
CDS, promoter, repeat region
Eight matching vectors
Origin of Replication
Summary description from the GenBank annotation of vectors matched to the Y. pestis plasmid.
Gene cloning vectors for Rhodobacter sphaeroides
Gene cloning vectors for Rhodobacter sphaeroides
The complete sequence of the BAC vector pECSBAC4
Improved antibiotic-resistance gene cassettes and omega elements
Analysis of transformation in Acinetobacter baylyi
Candidate DNA signatures were found for nearly all artificial vector sequences. In a small number of cases, overlap between natural plasmids and artificial vectors precludes detection with DNA signatures. With two exceptions, where the signatures were found at k = 23 and 47, the lack of signature coverage for a vector sequence was explained by the occurrence of an equivalent natural analog, which makes clear the limits of many vector/plasmid distinctions. Natural analogs must be included in vector-based signature detection systems along with other natural plasmid derivatives, which could be used to evade detection from the existing core signature set. With the potential for plasmids to be converted into artificial vector sequence [29, 30], developing predictive DNA signatures is an important challenge. At a minimum, signatures from the 21 plasmids sharing multiple functional elements with existing artificial vector sequence should be included to track potentially modified natural plasmids. Finding that 364 signatures cover nearly the complete set of vector sequences means that there is high sequence redundancy, making it feasible to maintain an expanding database of DNA signatures to track all sequenced vectors.
Future work should be directed towards bioassay design using DNA signatures on microarrays to test the efficacy of detecting genetically modified bacteria from a sample, which includes both modified and naturally occurring bacteria. We plan to collaborate more closely with scientists in the genetic engineering field to refine our bioinformatics tools to anticipate future natural plasmid-derived vector construction. As with any attempt to counter malicious use of technology, detecting genetic engineering in microbes will be an immense challenge that requires many different tools and continual effort. Cooperating with the scientific community to sequence and track available vector sequence will provide an opportunity for DNA signatures to support detection and deterrence against malicious genetic engineering applications.
Materials and methods
Natural plasmid sequence was extracted from an Entrez query of taxonomic classification 'other sequence; plasmids', GenBank plasmids and the Plasmid Database. Sequences were checked for redundancy, yielding the final natural plasmid sequence total of 65,341,821 bases in 1,567 contigs. In the pre-processed form there is overlap between the artificial vector set and the natural plasmid set. While some plasmids are naturally occurring, they are also used in genetic engineering. In cases where an engineered application is found, the sequence was treated as an 'artificial vector sequence'. The remaining artificial vector sequence was downloaded from the GenBank vector set available via anonymous ftp, ATCC, Virmatics and an Entrez-based query of sequences classified taxonomically as artificial vector sequence. Vector sequences with fasta headers specifying eukaryote cell targets were removed, along with duplicate sequences. The background chromosomal sequence comes from the KPATH database, which contains all available draft and completely sequenced microbial genomes (45,749 sequences totaling 4,057,440,823 bases).
Once the initial hash table is built, the sequence pointers of each k-mer entry become the keys for a second hash table, which records every combination of vector sequence with shared k-mers. A schematic of the hash table is labeled 'Hash table 2' in Figure 8. As an example, the second key from the top in Hash table 2 in Figure 8 forms a k-mer set called k-mer set-2, which shows that three sequences, 5, 30 and 110, share three k-mers, k-mer-2, k-mer-3 and k-mer-5. This comparative sequence approach presents a linear runtime with respect to the number of input nucleotides but has a theoretically high memory cost (owing to an exponential number of possible cluster combinations). In practice the entire study required less than 3 GB in online random access memory (RAM). Google sparse hash tables were used to limit RAM consumption. DNA signatures are found by checking each nucleotide in the background dataset (natural plasmids and chromosomal sequence) and storing the k-mers shared with the initial vector derived hash table.
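The background screening step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: a plain Python dict stands in for the sparse hash table, the function name and toy sequences are hypothetical, and strand handling is omitted.

```python
def candidate_signatures(vector_seqs, background_seqs, k):
    """Return vector k-mers that never occur in the background.

    vector_seqs: dict mapping vector id -> nucleotide string.
    background_seqs: iterable of background nucleotide strings.
    Returns a dict mapping each surviving k-mer to the set of vector ids
    that contain it.
    """
    # Hash table 1: k-mer -> ids of the vectors it occurs in.
    table = {}
    for vec_id, seq in vector_seqs.items():
        for i in range(len(seq) - k + 1):
            table.setdefault(seq[i:i + k], set()).add(vec_id)

    # Scan every position of the background once and record the
    # k-mers it shares with the vector-derived table.
    seen_in_background = set()
    for seq in background_seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in table:
                seen_in_background.add(kmer)

    return {kmer: ids for kmer, ids in table.items()
            if kmer not in seen_in_background}


# Toy data, for illustration only.
vectors = {"v1": "ACGTAC", "v2": "CGTACG"}
background = ["TTACGTTT"]
signatures = candidate_signatures(vectors, background, k=4)
print(sorted(signatures))
```

In this toy case the k-mers ACGT and TACG also occur in the background and are discarded, leaving CGTA and GTAC as candidate signatures shared by both vectors. Each pass touches every input position once, in line with the linear runtime described above.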
The background and vector sequences were searched against the signature set so that comparably sized query databases were used in the comparison. The background genomic sequence was searched against all five signature sets and the average result was taken. Default parameter values were used for BLAST. The second plot (Figure 5) shows signatures removed from the detection set using a bit score threshold of 100 and 50 for k = 60 and 30, respectively. A signature was removed if it had at least one match with bit score above the threshold. The two k-mer based signature set sizes were chosen from two different criteria. The larger set was taken by selecting the first 10 signatures from each k-mer set (chosen at random). The smaller set was chosen by taking a maximum of the first 10 signatures per vector sequence, selecting signatures shared by the largest number of vectors.
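The two selection criteria and the bit score filter can be sketched as below. This is a hedged illustration only: the input layout (k-mer sets keyed by their source vectors, and a dictionary of the highest background bit score per signature) is assumed, the function names are hypothetical, and sorted order replaces the random choice described in the text so that the example is reproducible.

```python
# Bit score thresholds stated in the text, keyed by k.
THRESHOLDS = {60: 100.0, 30: 50.0}


def larger_set(kmer_sets, per_set=10):
    """Take up to per_set signatures from every k-mer set."""
    chosen = set()
    for kmers in kmer_sets.values():
        chosen.update(sorted(kmers)[:per_set])
    return chosen


def smaller_set(kmer_sets, per_vector=10):
    """Take up to per_vector signatures per vector, preferring k-mer
    sets shared by the largest number of vectors."""
    picks = {}  # vector id -> signatures chosen for that vector
    ranked = sorted(kmer_sets.items(),
                    key=lambda item: (-len(item[0]), sorted(item[0])))
    for vector_ids, kmers in ranked:
        for kmer in sorted(kmers):
            for vec_id in vector_ids:
                chosen = picks.setdefault(vec_id, [])
                if len(chosen) < per_vector:
                    chosen.append(kmer)
    return set().union(*picks.values()) if picks else set()


def drop_background_hits(signatures, max_bit_score, k):
    """Remove signatures whose best background match exceeds the threshold."""
    limit = THRESHOLDS[k]
    return [s for s in signatures if max_bit_score.get(s, 0.0) <= limit]


# Toy data with 3-letter stand-ins for real 30- or 60-mers.
toy_sets = {
    frozenset({"v1", "v2"}): {"AAA", "CCC", "GGG"},
    frozenset({"v1"}): {"TTT"},
}
print(sorted(larger_set(toy_sets, per_set=2)))
print(sorted(smaller_set(toy_sets, per_vector=2)))
print(drop_background_hits(["AAA", "CCC"], {"AAA": 120.0}, k=60))
```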
Matching vectors with plasmids
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is the list of artificial vector identifiers. Additional data file 2 is the list of natural plasmid identifiers. Additional data file 3 is the complete set of 30-mer signatures used in the cross-validation set. Additional data file 4 is the complete set of 60-mer signatures used in the cross-validation set.
List of abbreviations
MCS, multiple cloning site; PCR, polymerase chain reaction; RAM, random access memory.
This work was performed under the auspices of the United States Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48. JEA is supported in part by an IC Postdoctoral Fellowship. Thanks to Marisa Lam and Jason Smith for assistance compiling genomic sequence data.
1. Verma R, Boleti E, George AJT: Antibody engineering: comparison of bacterial, yeast, insect and mammalian expression systems. J Immunol Methods. 1998, 216: 165-181. 10.1016/S0022-1759(98)00077-5.
2. Benner SA, Sismour AM: Synthetic biology. Nat Rev Genet. 2005, 6: 533-543. 10.1038/nrg1637.
3. Smith HO, Hutchinson CA, Pfannkoch C, Venter JC: Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc Natl Acad Sci USA. 2003, 100: 15440-15445. 10.1073/pnas.2237126100.
4. Sturino JM, Klaenhammer TR: Engineered bacteriophage-defence systems in bioprocessing. Nat Rev Microbiol. 2006, 4: 395-404. 10.1038/nrmicro1393.
5. Khosla C, Keasling JD: Metabolic engineering for drug discovery and development. Nat Rev Drug Discov. 2003, 2: 1019-1025. 10.1038/nrd1256.
6. Bugl H, Danner JP, Molinari RJ, Mulligan JT, Park HO, Reichert B, Roth DA, Wagner R, Budowle B, Scripp RM, Smith JAL, Steele SJ, Church G, Endy D: DNA synthesis and biological security. Nat Biotechnol. 2007, 25: 627-629. 10.1038/nbt0607-627.
7. Budowle B, Schutzer SE, Ascher MS, Atlas RM, Burans JP, Chakraborty R, Dunn JJ, Fraser CM, Franz DR, Leighton TJ, Morse SA, Murch RS, Ravel J, Rock DL, Slezak TR, Velsko SP, Walsh AC, Walters RA: Toward a system of microbial forensics: from sample collection to interpretation of evidence. Appl Environ Microbiol. 2005, 71: 2209-2213. 10.1128/AEM.71.5.2209-2213.2005.
8. Gardner SN, Kuczmarski TA, Vitalis EA, Slezak TR: Limitations of TaqMan PCR for detecting divergent viral pathogens illustrated by hepatitis A, B, C, and E viruses and human immunodeficiency virus. J Clin Microbiol. 2003, 41: 2417-2427. 10.1128/JCM.41.6.2417-2427.2003.
9. Slezak T, Kuczmarski T, Ott L, Torres C, Medeiros D, Smith J, Truitt B, Mulakken N, Lam M, Vitalis B, Zemla A, Zhou C, Gardner S: Comparative genomics tools applied to bioterrorism defence. Brief Bioinform. 2003, 4: 133-149. 10.1093/bib/4.2.133.
10. Fitch JP, Gardner SN, Kuczmarski TA, Kurtz S, Myers R, Ott LL, Slezak TR, Vitalis EA, Zemla AT, Mccready PM: Rapid development of nucleic acid diagnostics. Proc IEEE Inst Electr Electron Eng. 2002, 90: 1708-1720.
11. Vora GJ, Meador CE, Stenger DA, Andreadis JD: Nucleic acid amplification strategies for DNA microarray-based pathogen detection. Appl Environ Microbiol. 2004, 70: 3047-3054. 10.1128/AEM.70.5.3047-3054.2004.
12. Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J, Wasieloski L, Benson G, Reifman J: Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics. 2007, 23: 5-13. 10.1093/bioinformatics/btl549.
13. Phillippy AM, Mason JA, Ayanbule K, Sommer DD, Taviani E, Huq A, Colwell RR, Knight IT, Salzberg SL: Comprehensive DNA signature discovery and validation. PLoS Comput Biol. 2007, 3: e98. 10.1371/journal.pcbi.0030098.
14. Primrose SB, Twyman RM: Principles of Gene Manipulation and Genomics. 2006, Oxford: Blackwell, 7th edition.
15. Vinga S, Almeida J: Alignment-free sequence comparison - a review. Bioinformatics. 2003, 19: 513-523. 10.1093/bioinformatics/btg005.
16. Haubold B, Pierstorff N, Möller F, Wiehe T: Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics. 2005, 6: 123. 10.1186/1471-2105-6-123.
17. Batzoglou S: The many faces of sequence alignment. Brief Bioinform. 2005, 6: 6-22. 10.1093/bib/6.1.6.
18. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12. 10.1186/gb-2004-5-2-r12.
19. Nordberg EK: YODA: selecting signature oligonucleotides. Bioinformatics. 2005, 21: 1365-1370. 10.1093/bioinformatics/bti182.
20. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
21. Muyrers JP, Zhang Y, Steward AF: Techniques: recombinogenic engineering - new options for cloning and manipulating DNA. Trends Biochem Sci. 2001, 26: 325-331. 10.1016/S0968-0004(00)01757-6.
22. Wessner DR: Techview software. Planning plasmids. Science. 1999, 286: 1495-1496. 10.1126/science.286.5444.1495.
23. Walton DK, Gendel SM, Atherly AG: DNA sequence and shuttle vector construction of plasmid pGL3 from Plectonema boryanum PCC 6306. Nucleic Acids Res. 1993, 21: 746. 10.1093/nar/21.3.746.
24. Nieto C, Fernandes de Palencia P, Lopez P, Espinosa M: Construction of a tightly regulated plasmid vector for Streptococcus pneumoniae: controlled expression of the green fluorescent protein. Plasmid. 2000, 43: 205-213. 10.1006/plas.2000.1465.
25. Okamura Y, Takeyama H, Sekine T, Sakaguchi T, Wahyudi AT, Sato R, Kamiya S, Matsunaga T: Design and application of a new cryptic-plasmid-based shuttle vector for Magnetospirillum magneticum. Appl Environ Microbiol. 2003, 69: 4274-4277. 10.1128/AEM.69.7.4274-4277.2003.
26. Lee JH, Halgerson JS, Kim JH, O'Sullivan DJ: Comparative sequence analysis of plasmids from Lactobacillus delbrueckii and construction of a shuttle cloning vector. Appl Environ Microbiol. 2007, 73: 4417-4424. 10.1128/AEM.00099-07.
27. Welch TJ, Fricke WF, McDermott PF, White DG, Rosso ML, Rasko DA, Mammel MK, Eppinger M, Rosovitz M, Wagner D, Rahalison L, LeClerc JE, Hinshaw JM, Lindler LE, Cebula TA, Carniel E, Ravel J: Multiple antimicrobial resistance in plague: an emerging public health risk. PLoS ONE. 2007, 2: e309. 10.1371/journal.pone.0000309.
28. Schweizer HP, Hoang TT, Propst KL, Ornelas HR, Karhoff-Schweizer RR: Vector design and development of host systems for Pseudomonas. 2001, Dordrecht: Kluwer, 23: 69-82.
29. Inui M, Nakata K, Roh JH, Vertes AA, Yukawa H: Isolation and molecular characterization of pMG160, a mobilizable cryptic plasmid from Rhodobacter blasticus. Appl Environ Microbiol. 2003, 69: 725-733. 10.1128/AEM.69.2.725-733.2003.
30. Lee JH, O'Sullivan DJ: Sequence analysis of two cryptic plasmids from Bifidobacterium longum DJO10A and construction of a shuttle cloning vector. Appl Environ Microbiol. 2006, 72: 527-535. 10.1128/AEM.72.1.527-535.2006.
31. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2007, 35 (Database issue): D21-D25. 10.1093/nar/gkl986.
32. Molbak L, Tett A, Ussery DW, Wall K, Turner S, Bailey M, Field D: The plasmid genome database. Microbiology. 2003, 149: 3043-3045. 10.1099/mic.0.C0123-0.
33. Vector subset of GenBank. [ftp://ncbi.nlm.nih.gov/blast/db]
34. ATCC Vectors. [ftp://ftp.atcc.org/pub/vector_seqs]
35. Virmatics vector database. [http://www.virmatics.com/vcs/index.php]
36. Choi JH, Cho HG: Analysis of common k-mers for whole genome sequences using SSB-Tree. Genome Inform. 2002, 13: 30-41.
37. Höhl M, Kurtz S, Ohlebusch E: Efficient multiple genome alignment. Bioinformatics. 2002, 18: S312-S320.
38. Google Sparse Hash Table. [http://code.google.com/p/google-sparsehash/]
39. Bomze I, Budinich M, Pardalos P, Pelillo M: The maximum clique problem. Handbook of Combinatorial Optimization. 1999, Dordrecht: Kluwer, 4:
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.