An accessible, efficient and global approach for the large-scale sequencing of bacterial genomes

We have developed an efficient and inexpensive pipeline for streamlining large-scale collection and genome sequencing of bacterial isolates. Evaluation of this method involved a worldwide research collaboration focused on the model organism Salmonella enterica, the 10KSG consortium. By optimising a logistics pipeline that collected isolates as thermolysates, permitting shipment in ambient conditions, the project assembled a diverse collection of 10,419 clinical and environmental isolates from low- and middle-income countries in less than one year. The bacteria were obtained from 51 countries/territories dating from 1949 to 2017, with a focus on Africa and Latin-America. All isolates were collected in barcoded tubes and genome sequenced using an optimised DNA extraction method and the LITE pipeline for library construction. After Illumina sequencing, the total reagent cost was less than USD$10 per genome. Our method can be applied to genome-sequence other large bacterial collections at a relatively low cost, within a limited timeframe, to support global collaborations.


Introduction 33
Whole genome sequencing (WGS) is an important tool that has revolutionised our 34 understanding of bacterial disease over the past decade 1-4 . Recognising the immense 35 advantages that WGS data provides for surveillance, functional genomics and population 36 dynamics, both public health and research communities have adopted genome-based 37 approaches. 38 Until recently, large-scale bacterial genome projects could only be performed in a handful of 39 sequencing centres around the world. Here, we aimed to make this technology accessible to 40 bacterial laboratories worldwide. The high demand for sequencing human genomes has 41 driven down the costs of sequencing reagents to below USD$1,000 per sample 5-7 . However, 42 the genome sequencing of thousands of microorganisms has remained expensive due to 43 costs associated with sample transportation and library construction. of human cases of enterocolitis and concerns related to food safety, more genome sequences 66 have been generated for Salmonella than for any other genus. The number of publicaly 67 available sequenced Salmonella genomes will soon reach 300,000 19  We saw a need to simplify and expand genome-based surveillance of salmonellae from Africa 83 and other parts of the world, involving isolates associated with invasive disease and 84 gastroenteritis in humans, and extended to bacteria derived from animals and the 85 environment. We optimised a pipeline for streamlining the large-scale collection and 86 sequencing of samples from LMI countries with the aim of facilitating access to WGS and 87 worldwide collaboration. Our pipeline represents a relatively inexpensive and robust tool for 88 the generation of bacterial genomic data from LMI countries, allowing investigation of the 89 epidemiology, drug resistance and virulence factors of isolates. 90

Development of an optimised logistics pipeline 92
The "10,000 Salmonella genomes" (10KSG; https://10k-salmonella-genomes.com/) is a global 93 consortium that includes collaborators from 25 institutions and a variety of settings, including 94 research and reference laboratories across 16 countries. Limited funding resources prompted us to design an approach that ensured accurate sample tracking and captured comprehensive 96 metadata for individual bacterial isolates whilst minimising costs for the consortium. A key 97 driver was to assemble a set of genomic data that would be as informative and robust as 98 possible. 99 The 10,419 isolates were collected from 53 countries/territories spanning 5 continents (America, Africa, Asia, Europe, & Oceania), with most isolates originating from Africa (56%) and America (26%). The samples were mostly of human origin (86%), of which 52% were blood isolates, 41% were stool isolates, and 7% from other body compartments. About 5% samples originated from environmental sources, 6% were of animal origin, and 3% unknown.
The bacterial pathogens were isolated over a 68-year time period, from 1949 to 2017. The majority of samples were isolated after 1990.
Members of the 10KSG provided access to 10,419 bacterial isolates from collections that 100 spanned 51 LMI countries and regions (such as Reunion Island, an overseas department and tested the effective inactivation of other non-Salmonella Gram-positive (Staphylococcus 128 aureus) and Gram-negative (Escherichia coli) organisms (Supplementary Table 2). 129 Temperature is a key factor in the transportation of samples, especially in some LMI countries 130 where dry ice is expensive and difficult to source, and access to international courier 131 companies is limited or very costly. To allow transport without refrigeration, we tested the 132 stability of the resulting thermolysates at room temperature for more than seven days by 133 controlling the quality of extracted DNA (Supplementary Table 2). Minimising the steps 134 required for sample collection allowed us to reach collaborators with limited access to facilities 135 and personnel. 136 We collected samples using screwed-cap barcoded tubes (FluidX tri-coded jacket 0.7 mL, 137 Brooks Life Sciences, 68-0702-11) costing USD$0.23 each, which we distributed from the UK 138 to collaborators worldwide. Individually barcoded tubes were organised in FluidX plates in a 139 96-well format, each with their own barcode. Both QR codes and human-readable barcodes 140 were included on each tube to ensure that the correct samples were always sequenced, and 141 to permit the replacement of individual tubes when required. 142 All isolates were obtained in compliance with the Nagoya protocol 26 . The combination of 143 method optimisation, development and distribution of easy-to-follow protocols in English and 144 Spanish (French was used only for communication), generating thermolysates and using 145 barcoded tubes, the process of collecting the bacterial isolates was completed within one year. 146 Barcoded tubes were distributed to collaborators, including an extra ~20% to permit 147 replacements as required. In total, 11,823 tubes were used in the study, of which 10,419 were 148 returned to the sequencing centre containing bacterial thermolysates for DNA extraction and 149 genome sequencing. A comprehensive list of isolates is available in Supplementary Table 3. To validate this approach for bacteria other than Salmonella, ~25% (2,573, 24.7%) of the 151 samples were isolates from a variety of genera, including Gram-negatives such as Shigella 152 and Klebsiella, and Gram-positives such as Staphylococcus. 153

DNA extraction, library construction, quality control and genome sequencing 154
Our high-throughput DNA extraction and library construction pipeline was designed to be 155 versatile, scalable and robust, capable of processing thousands of samples in a time and cost-156 efficient manner. The procedure included DNA extraction, quality control (QC), normalisation, 157 sequencing library construction, pooling, size selection and sequencing. The time taken for 158 each step, and the associated consumable cost, is shown in Table 1. All the parts of the 159 pipeline are scalable and can be run simultaneously with robots, allowing hundreds of samples 160 to be processed each day, in a 96-well format. With dedicated pre-and post-PCR robots, up 161 to 768 bacterial samples were processed each day. The total consumable cost for extraction 162 of DNA and genome sequence generation was less than USD$10 per sample (excluding staff 163 time). Given the high-throughput nature of this project, and the difficulty in optimising the 164 processes to account for every possible variation in DNA/library quality and quantity, this cost 165 includes a 20% contingency. 166 In designing the DNA extraction pipeline, we anticipated that samples would contain a wide 167 range of DNA concentrations due to the different approaches by collaborators, some of whome 168 sent thermolysates and others extracted DNA. The DNA was isolated in a volume of 20 L, 169 and the total yield ranged from 0 to 2,170 ng (average of 272 ng). Less than 6% samples 170 contained less than 2.5 ng ( Supplementary Fig. S1). 171 To facilitate large-scale low-cost whole-genome sequencing, we developed the LITE (Low 172 Input, Transposase Enabled; Fig. 2) pipeline, a low-cost high-throughput library construction 173 protocol based on the Nextera kits (Illumina). Prior to LITE library construction, all DNA 174 samples were normalised to 0.25 ng/L unless the concentration was below that limit, in which 175 case samples remained undiluted. We calculated that given a bacterial genome size of  The DNA was extracted using a protocol based on the MagAttract HMW DNA isolation kit (Qiagen). Library construction was performed by tagmentation using Nextera tagmentation kit, size selected on a BluePippin, and quantified using a High Sensitivity BioAnalyzer kit (Agilent) and Qubit dsDNA HS Assay (ThermoFisher). Genome sequencing of "super pools" was performed in a HiSeq TM 4000 (Illumina) system, and re-sequencing in NovaSeq TM 6000 (Illumina) when needed, both with a 2 x 150 bp paired ends read metric.
To maximise the multiplexing capability for the LITE pipeline, we designed 438 bespoke 9-bp 184 barcodes (Supplementary Table 4 HiSeq 4000 system lane, with a 2 x 150 bp paired-end read metric. After the initial screen was 201 completed, samples that failed to produce 30x genome coverage were re-sequenced on a 202 NovaSeq 6000 system, also with a 2 x 150 bp read metric. In total 1,525 (15.2%) of the 9,976 203 samples processed required re-sequenced, a proportion that was within the 20% contingency 204 added to our unit cost. 205

Bioinformatic analysis and data provision 206
To complete our WGS approach, we developed and implemented a bespoke sequence 207 analysis bioinformatic pipeline for the Salmonella samples included in the study. The full 208 pipeline is available from https://github.com/apredeus/10k_genomes. Because the estimation 209 of sequence identity and assembly quality is relatively species-independent, and annotation 210 is strongly species-specific, the pipeline can be easily adapted to other bacterial species by 211 changing quality control criteria and specifying relevant databases of known proteins. 212 Following DNA extraction, sequencing and re-sequencing, we generated sequence reads for 213 9,976 (96.0%) samples, of which 7,236 were bioinformatically classified as Salmonella 214 enterica using Kraken2 and Bracken 27,28 . A small proportion of the samples (209 out of 9,976; 2.1%) had been wrongly identified as Salmonella prior to sequencing. The remaining samples 216 corresponded to 1,157 Gram-positive and Gram-negative bacterial isolates that were included 217 to validate the study. The 443 (4.3%, out of the 10,419 samples received) samples that did 218 not generate sequence reads reflected poor quality DNA extraction, due to either low biomass 219 input or partial cell lysis. Overall, the generation of sequence data from the vast majority of 220 samples demonstrated the robustness of the use of thermolysates coupled with the high-221 throughput LITE pipeline for processing thousands of samples from a variety of different 222 collaborating organisations. 223 To assess the quality of sequence data, we focused on the 7,236 (69.5%) genomes identified 224 as Salmonella enterica (Fig. 3). To allow the bioinformatic analysis to be customisable for 225 other datasets, we developed a robust quality control (QC) pipeline to do simple uniform 226 processing of all samples, and to yield the maximum amount of reliable genomic information. 227 Well-established software tools were used to assess species-level identity from raw reads, 228 trim the reads, assess coverage and duplication rate, assemble genomes, and to make 229 preliminary evaluation of antibiotic resistance and virulence potential. 230 Trimming abundant adapters from the reads produced by the LITE pipeline was critical for 231 optimal genome assembly. Using Quast 29 and simple assembly metrics, we evaluated the 232 performance of Trimmomatic 30 in palindrome mode with and without retention of singleton 233 reads, compared with BBDuk (https://jgi.doe.gov/data-and-tools/bbtools) in paired-end mode. 234 BBDuk was selected for our analysis because this tool generated genomes with a higher N50, 235 and a comparable number of mis-assemblies.

Fig. 3
The sequential quality control process used to select whole-genome sequences for detailed analysis.
Of the 10,419 isolates, 443 failed the DNA extraction or quality control prior to genome sequencing. We produced sequencing libraries of 9,975 samples, of which 1,366 were not bioinformatically-identified as Salmonella enterica. These 1,366 corresponded to 1,157 which were part of the 25% non-Salmonella component of the project, plus 209 isolates that had been mis-identified as Salmonella before sequencing. Of the 7,236 Salmonella genomes, 6,248 had sequence coverage over 10x, of which 5,833 passed the "stringent criteria". Of the 415 samples that failed the "stringent criteria", 284 samples were rescued based on a "clean up" (55) or a "relaxed criteria" (229). Overall, we generated 6,117 high-quality Salmonella genomes.
Genome assembly was performed using SPAdes 31 via Unicycler 32 in short-read mode. 237 SPAdes is an established and widely-used tool for bacterial genome assembly, whilst 238 Unicycler optimises SPAdes parameters and performs assembly polishing by mapping reads 239 back to the assembled genomes. Genome assembly QC was done using the criteria 240 established by the genome database EnteroBase 33 . Specifically, these "stringent criteria" 241 required: 1) total assembly length between 4 and 5.8 Mb, 2) N50 of 20 kb or more, 3) fewer 242 than 600 contigs, and 4) more than 70% sequence reads assigned to the correct species.
Using this approach for S. enterica, 5,833 of the Salmonella genomes (80.6%) passed QC 244 (Fig. 3). 245 To "rescue" all possible S. enterica in the remaining assemblies with coverage greater than 246 10x that failed the stringent QC, two approaches were used: "relaxed criteria" and "clean up". 247 The "relaxed criteria" accepted assemblies of 4 Mb to 5.8 Mb overall length, species-purity of 248 90% or more, N50 > 10kb, and fewer than 2,000 contigs. In contrast, the "clean up" approach 249 was used for assemblies that had < 70% Salmonella sequence reads using the "stringent 250 criteria". The raw reads of these samples were "cleaned" using Kraken2 & Bracken, with the 251 reads assigned to Salmonella being retained, and subjected to the "stringent criteria" for QC 252 detailed above. The assemblies rescued by these two approaches accounted for a further 253 3.9% (284) assemblies from our initial Salmonella collection. In total, we generated 6,117 high 254 quality S. enterica genomes, corresponding to 84.5% of the total Salmonella isolates 255 successfully sequenced through the LITE pipeline ( Fig. 3 and 4). 256 Genome sequence data were shared with collaborators via downloadable packages hosted 257 by the Centre of Genomic Research, University of Liverpool (UK). These packages included 258 sequencing statistics, raw (untrimmed) fastq files of sequence reads, and the individual 259 genome assemblies. We included the genome-derived Salmonella serovar and sequence type 260 of each isolate (Fig. 4). However, our approach did pose manual and logistical challenges. We propose that for future 284 implementations of a similar approach for sequencing thousands of bacterial isolates, it is 285 important to make an early investment in the development of a shared, protected and version 286 controlled database to store epidemiological information, coupled with automated scripts to 287 handle sequencing data, and a streamlined system for the sending and receiving of samples. 288 Our method is suitable for other large collections of Gram-negative or Gram-positive bacteria, 289 and is designed to complete an academic genome sequencing project within a limited time-290 frame (one year). However, the LITE pipeline represents a compromise in terms of data quality 291 to maximise economic value. It is important that all QC steps and the rigorous bioinformatic 292 approach that we specify are followed to produce a reliable dataset, which in this case 293 generated 84.5% high-quality genomes of the 7,236 successfully-sequenced Salmonella 294 isolates ( Fig. 3 and 4). 295 A key aspect of our methodology was the involvement of researchers fluent in multiple 296 languages, to maximise clear communication and ensure access to countries across the world. The approach will be particularly relevant when rapid, low-cost, and collaborative 298 genome sequencing of bacterial pathogens is required. Our concerted approach 299 demonstrates the value of true global collaboration, offering potential for tackling international 300 epidemics or pandemics in the future. 301   Table 2). 434 To test the effect of transport, the samples were subjected to genomic DNA extraction using 435 a DNeasy Blood & Tissue Kit (Qiagen) after incubation at room temperature for more than 7 436 days. The quality of extracted DNA was assessed by 1% agarose gel electrophoresis, and 437 fluorometric DNA quantification using Qubit™ dsDNA HS Assay Kit (Invitrogen™) 438 (Supplementary Table 2). 439 Detailed protocols were sent to collaborators, along with a metadata template and barcoded 440 tubes. The design of the metadata template and protocol booklet was tested several times for 441 clarity and to obtain unified information avoiding different interpretations by the user. The 442 metadata template (Supplementary Table 1 The amplified library was then subjected to a magnetic bead-based purification step on a 497 Biomek NX P instrument. 25 µL of Kapa Pure beads (Roche, UK) were added to 25 µL of 498 amplified library, and mixed. This library was incubated at room temperature for 5 min, briefly 499 spun in an Eppendorf 5810R centrifuge and placed on a 96-well magnetic particle 500 concentrator. Once the beads had pelleted, the supernatant was removed and discarded, and 501 the beads washed twice with 40 µL of freshly prepared 70% ethanol. After the second ethanol 502 wash, the beads were left to air dry for 5 min. The 96-well plate was removed from the MPC 503 and the beads were re-suspended in 25 µL of 10 mM TRIS-HCl, pH 8 (Elution Buffer). The DNA was eluted by incubating the beads for 5 min at room temperature. The plate was 505 replaced on the MPC, the beads allowed to pellet, and the supernatant containing the DNA 506 was transferred to a new 96-well plate. 507 To assess the concentrations of individual libraries, 20 µL of elution buffer was added to 2 µL 508 of purified library, and run on a LabChip GX (Perking Elmer) using the High throughput, High 509 Sense reagent kit and HT DNA Extended Range Chip according to manufacturers' 510 instructions. To determine the amount of material present in each library between 400 and 511 600 bp, a smear analysis was performed using the GX analysis software. The resulting value 512 was used to calculate the amount of each library to pool. Pooling of each 96-libraries was 513 performed using a Biomek Nx instrument. 100 µL of the pooled libraries were added to 100 µL 514 of Kapa Pure beads in a 1.5 mL LoBind tube. The sample was vortexed and incubated at room 515 temperature for 5 min to precipitate the DNA onto the beads. The tube was then placed on an 516 MPC to pellet the beads, the supernatant discarded, and the beads were washed twice with 517 200 µL of freshly prepared 70% ethanol. The beads were left to air dry for 5 min and then re-518 suspended in 30 µL Elution Buffer. The samples were incubated at room temperature for 5 min 519 to elute the DNA. The plate was placed back on the MPC and the DNA was transferred to a 520 new 1.5 mL tube. 521 The concentrated sample containing a pool of 96 libraries was subjected to size selection on 522 a BluePippin (Sage Science, Beverly, USA). The 40 µL in each collection well of a 1.5% 523 BluePippin cassette were replaced with fresh running buffer, and the separation and elution 524 current checked prior to loading the sample. 10 µL of R2 marker solution were added to 30 µL 525 of the pooled library, and then the combined mixture was loaded into the appropriate well. 526 Using the smear analysis feature of Perkin Elmer GX software, we calculated the amount of 527 material between 400 and 600 bp for each library. We targeted this region based on the 528 electropherograms in Supplementary Fig. S2, to minimise the overlap between 150 bp paired end reads and maximise the number of libraries that would generate data. We determined the 530 detection limit for the molarity within this size range to be 0.007 nM, meaning that libraries with 531 lower concentrations were reported as 0.007 nM. The amount of library material between 400 532 to 600 bp ranged from 0.0 to 2.4 nM (average of 0.3 nM), with less than 6% having less than 533 0.007 nM (Supplementary Fig. S1). 534 Post size selection, the 40 µL from the collection well were recovered, and the library size was 535 determined using a High Sensitivity BioAnalyzer kit (Agilent) and DNA concentration 536 calculated using a Qubit dsDNA HS Assay (ThermoFisher). "Super pools" were created by 537 equimolar pooling of up to 12 size-selected 96-sample pools, each with a different P5 barcode. 538 Using these molarity figures, 96 libraries were equimolarly-pooled, concentrated and then 539 size-selected using a 1.5% cassette on the Sage Science Blue Pippin. 540 To determine the number of viable library molecules, the super pools were quantified using 541 the Kapa qPCR Illumina quantification kit (Kapa Biosystems) prior to sequencing. For the initial 542 screen, sequencing was performed on the HiSeq TM 4000 (Illumina). For re-sequencing of 543 samples, the sequencing was carried out in a lane of an S1 flowcell on the NovaSeq TM 6000 544 (Illumina), both with a 2x150 bp read metric. 545