The advantages of SMRT sequencing

Of the current next-generation sequencing technologies, SMRT sequencing is sometimes overlooked. However, attributes such as long reads, modified base detection and high accuracy make SMRT a useful technology and an ideal approach to the complete sequencing of small genomes.

Now a new technology, SMRT sequencing from Pacific Biosciences [1], has been developed that not only produces considerably longer and highly accurate DNA sequences from individual unamplified molecules, but can also show where methylated bases occur [2] (and thereby provide functional information about the DNA methyltransferases encoded by the genome).
SMRT sequencing is a sequencing-by-synthesis technology based on real-time imaging of fluorescently tagged nucleotides as they are incorporated along individual DNA template molecules. Because the technology uses a DNA polymerase to drive the reaction, and because it images single molecules, there is no degradation of signal over time. Instead, the sequencing reaction ends when the template and polymerase dissociate. As a result, instead of the uniform read length seen with other technologies, the read lengths have an approximately lognormal distribution with a long tail. The average read length from the current PacBio RS instrument is about 3,000 bp, but some reads may be 20,000 bp or longer. This is roughly 30 to 200 times longer than the read length from a next-generation sequencing instrument, and more than a four-fold improvement since the original release of the instrument two years ago. It is notable that the recently announced PacBio RS II platform claims to have a further four-fold improvement, with twice the mean read length and twice the throughput of the current machine.
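
The shape of such a read-length distribution can be illustrated with a short simulation. This is a minimal sketch rather than a model of the instrument: it assumes a lognormal distribution with a mean of 3,000 bp, as quoted above, while the shape parameter `sigma = 0.8`, the function name `simulate_read_lengths` and the number of reads are hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def simulate_read_lengths(n_reads, mean_bp=3000.0, sigma=0.8, seed=0):
    """Draw read lengths (bp) from a lognormal distribution with a given mean.

    sigma is the standard deviation of the underlying normal distribution;
    its value here is an illustrative assumption, not a measured one.
    """
    # For a lognormal, mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_bp) - sigma ** 2 / 2.0
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_reads)]


lengths = simulate_read_lengths(100_000)
mean_len = statistics.fmean(lengths)
median_len = statistics.median(lengths)
# Fraction of reads in the long tail (10 kb or longer).
frac_long = sum(length >= 10_000 for length in lengths) / len(lengths)

print(f"mean   = {mean_len:.0f} bp")
print(f"median = {median_len:.0f} bp")
print(f"reads >= 10 kb: {frac_long:.1%}; longest read: {max(lengths):.0f} bp")
```

Under these assumptions the median falls below the mean and a small fraction of reads is several times longer than the average, which is the long tail described in the text.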

Applications of SMRT sequencing
The SMRT approach to sequencing has several advantages. First, consider the impact of the longer reads, especially for de novo assemblies of novel genomes. While typical next-generation sequencing can provide abundant coverage of a genome, the short read lengths and amplification biases of those technologies can lead to fragmented assemblies whenever a complex repeat or poorly amplified region is encountered. As a result, GC-rich and GC-poor regions, which tend to be poorly amplified, are particularly susceptible to poor quality sequencing. Resolving fragmented assemblies requires additional costly bench work and further sequencing. By also including the longer reads of SMRT sequencing runs, the read set will span many more repeats and missing bases, thereby closing many of the gaps automatically and simplifying, or even eliminating, the finishing time (Figure 1). It is becoming routine for bacterial genomes to be completely assembled using this approach [3,4], and we expect this practice will translate to larger genomes in the near future. A complete genome is far more useful than the poor quality draft sequences that litter GenBank because it provides a complete blueprint for the organism; the genes encoded therein represent the full biological potential of that organism. With only draft assemblies available, one is always left with the nagging feeling that some crucial gene is missing, perhaps the one in which you are most interested! The long read lengths also have more power to reveal complex structural variations present in DNA samples, such as pinpointing precisely where copy number variations have occurred relative to the reference sequence [5]. They are also extremely powerful for resolving complex RNA splicing patterns from cDNA libraries, since a single long read may contain the entire transcript end-to-end, thus eliminating the need to infer the isoforms [6].
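
The argument that longer reads span more repeats can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: read start positions are taken to be uniform over the genome, a read "spans" a repeat when it covers the repeat plus a short unique anchor on each side, and the genome size, repeat length, anchor length and read sets are all hypothetical values rather than data from the studies cited.

```python
def expected_spanning_reads(read_lengths, repeat_len, genome_len, anchor=50):
    """Expected number of reads that fully span one repeat.

    A read spans the repeat if it covers the repeat plus `anchor` bp of
    unique sequence on both sides. Read starts are assumed to be uniform
    over a genome of `genome_len` bp.
    """
    need = repeat_len + 2 * anchor
    # A read of length L has (L - need + 1) start positions that span the
    # repeat, or none at all if it is shorter than `need`.
    valid_starts = sum(max(0, length - need + 1) for length in read_lengths)
    return valid_starts / genome_len


GENOME_LEN = 5_000_000   # hypothetical bacterial genome, 5 Mb
REPEAT_LEN = 5_000       # hypothetical 5 kb repeat

# About 10x coverage from uniform 100 bp short reads.
short_reads = [100] * 500_000
# About 9x coverage from a hypothetical mix of long reads.
long_reads = [1_500, 3_000, 6_000, 12_000] * 2_000

short_spanning = expected_spanning_reads(short_reads, REPEAT_LEN, GENOME_LEN)
long_spanning = expected_spanning_reads(long_reads, REPEAT_LEN, GENOME_LEN)
print(f"short reads spanning the repeat: {short_spanning:.2f}")
print(f"long reads spanning the repeat:  {long_spanning:.2f}")
```

In this toy setting no amount of additional short-read coverage produces a spanning read, because every short read is shorter than the repeat itself, whereas the tail of the long-read set is expected to bridge it.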
Second, consider DNA methyltransferases. These can exist as solitary entities or as parts of restriction-modification systems. In both cases, they methylate relatively short sequence motifs that can easily be recognized from SMRT sequencing data because epigenetic modifications change the kinetics of the DNA polymerase as it moves along the template molecule. The altered kinetics cause a change in the timing of when the fluorescent colors are observed, thus enabling direct detection of epigenetic modifications, which can ordinarily only be inferred, and bypassing the usual necessity of enrichment or chemical conversion. Often, thanks to bioinformatics, the gene responsible for any given modification can be matched to the sequence motif in which the modification lies [7,8]. When it cannot, then simply cloning the gene into a plasmid, which is subsequently grown in a non-modifying host and re-sequenced, can provide the match [9]. Moreover, SMRT sequencing has also been able to identify RNA base modifications through the same approach as DNA base modifications, but using a reverse transcriptase in place of the DNA polymerase [10]. In fact, SMRT sequencing represents an important step toward uncovering the biology that happens between DNA and proteins, including not only the study of mRNA sequences but also the regulation of translation [11,12]. Thus, functional information emerges directly from the SMRT sequencing approach.
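
The kinetic signal described here can be sketched as a comparison of pulse timing between a native sample and an unmodified control. The example below is a deliberately simplified illustration, not the method used by any particular analysis software: it assumes that the timing is summarized as the interpulse duration (IPD) at each template position, that a position is flagged when the ratio of mean IPDs exceeds a fixed cutoff, and that the toy values, the function names and the cutoff of 2.0 are hypothetical. Real analyses would use many more observations and a statistical test.

```python
import statistics


def ipd_ratios(native_ipds, control_ipds):
    """Per-position ratio of mean IPD in the native sample to the control.

    Each argument is a list with one entry per template position; each entry
    is a list of IPD observations (in seconds) from individual reads.
    """
    return [
        statistics.fmean(native) / statistics.fmean(control)
        for native, control in zip(native_ipds, control_ipds)
    ]


def call_modified(ratios, threshold=2.0):
    """Return 0-based positions whose IPD ratio meets the (assumed) cutoff."""
    return [pos for pos, ratio in enumerate(ratios) if ratio >= threshold]


# Toy data for five template positions; position 2 is slowed in the native DNA.
native = [[0.4, 0.5, 0.6], [0.5, 0.5, 0.5], [2.0, 2.5, 3.0], [0.3, 0.4, 0.5], [0.6, 0.6, 0.6]]
control = [[0.5, 0.5, 0.5], [0.5, 0.4, 0.6], [0.5, 0.5, 0.5], [0.4, 0.4, 0.4], [0.6, 0.5, 0.7]]

ratios = ipd_ratios(native, control)
modified_positions = call_modified(ratios)
print("IPD ratios:", [round(r, 2) for r in ratios])
print("candidate modified positions:", modified_positions)
```

Positions flagged in this way could then be examined for a shared sequence motif, which is the step that links a modification to a candidate methyltransferase gene.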
Third, we must consider the persistent rumor that SMRT sequencing is much less accurate than other next-generation sequencing platforms, which has now been demonstrated to be untrue in several ways. First, a direct comparison of several approaches to determining genetic polymorphisms has shown that SMRT sequencing has comparable performance to other sequencing technologies [13]. Second, the accuracy of assembling a complete genome using SMRT sequencing in combination with other technologies has proved to be as reliable and accurate as more traditional approaches [3,6,14]. Moreover, Chin et al. [15] showed that an assembly using only long SMRT sequencing reads achieves comparable or even higher performance than other platforms (99.999% accuracy in three organisms with known reference sequences), including 11 corrections to the Sanger reference of these genomes. Koren et al. [6] showed that most microbial genomes could be assembled into a single contig per chromosome with this approach; it is by far the least expensive option for doing so.

Debunking the error myth
The power of SMRT sequencing data lies both in its long read lengths and in the random nature of the error process (Figure 2). It is true that individual reads contain a higher number of errors: approximately 11% to 14% or Q12 to Q15, compared with Q30 to Q35 from Illumina and other technologies. However, given sufficient depth (8x or more, say), SMRT sequencing provides a highly accurate statistically averaged consensus perspective of the genome, as it is highly unlikely that the same error will be randomly observed multiple times. Notoriously, other platforms have been found to suffer from systematic errors that need to be resolved by complementary methods before the final sequence is produced [16]. Another approach that benefits from the stochastic nature of the SMRT error profile is the use of circular consensus reads, where a sequencing read produces multiple observations of the same base in order to generate high-accuracy consensus sequence from single molecules [17]. This strategy trades read length for accuracy, which can be effective in some cases (targeted re-sequencing, small genomes) but is not necessary if one can achieve some redundancy in the sequencing data (8x is recommended). With this redundancy, it is preferable to benefit from the improved mapping of longer inserts than to opt for circular consensus reads, because the longer reads will be able to span more repeats and high accuracy will still be achieved from their consensus.
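
The effect of depth on random errors can be illustrated with a simple binomial calculation. The sketch below is an assumption-laden illustration and not the consensus algorithm used in practice: it treats errors as independent between reads, uses a hypothetical per-read error rate of 13% (within the range quoted above), and computes the probability that at least half of the reads covering a position are wrong. Because it counts every erroneous read as agreeing on the same wrong call, it is a pessimistic upper bound on the error of a majority-vote consensus. The helper functions use the standard Phred definition, Q = -10 log10(p).

```python
import math


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def error_rate_from_phred(q):
    """Convert a Phred-scaled quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)


def consensus_error_bound(per_read_error, depth):
    """P(at least half of `depth` independent reads are wrong at a position).

    This bounds the majority-vote consensus error from above, because it
    assumes that all erroneous reads agree on the same wrong call.
    """
    k_min = math.ceil(depth / 2)
    return sum(
        math.comb(depth, k)
        * per_read_error ** k
        * (1.0 - per_read_error) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


PER_READ_ERROR = 0.13  # hypothetical value within the range quoted in the text

for depth in (1, 8, 15, 30):
    bound = consensus_error_bound(PER_READ_ERROR, depth)
    print(f"depth {depth:>2}x: error bound {bound:.2e} (about Q{phred(bound):.0f})")
```

The bound falls rapidly as depth increases, which is the behavior expected when errors are random. A systematic, context-specific error would not behave this way, because reads covering the same context would tend to repeat the same mistake regardless of depth.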

Figure 2. A sequencing context breakdown of the empirical insertion error rate of the two platforms on NA12878 whole genome data.
In this figure we show all contexts of size 8 that start with AAAAA. The empirical insertion quality score (y-axis) is PHRED scaled. Despite the higher error rate (approximately Q12) of the PacBio RS instrument, the error is independent of the sequencing context. Other platforms are known to have different error rates for different sequencing contexts. Illumina's HiSeq platform, shown here, has a lower error rate (approximately Q45 across eight independent runs), but contexts such as AAAAAAAA and AAAAACAG have extremely different error rates (Q30 versus Q55). This context-specific error rate creates bias that is not easily clarified by greater sequencing depth. Empirical insertion error rates were measured using the Genome Analysis Toolkit (GATK) Base Quality Score Recalibration tool.

Conclusions
The considerations above make a strong case for combining the more traditional, sequence-dense data from other technologies with at least moderate coverage of SMRT data so that genomes can be improved, their methylation patterns obtained, and the functional activity of their methyltransferase genes deduced. We would especially urge all groups currently sequencing bacterial genomes to adopt this policy. That said, SMRT sequencing has also substantially improved eukaryotic genome assemblies, and we expect it to become more widely applied in this context over time, in light of the greater read lengths and throughput of the PacBio RS II instrument.
Perhaps it would even be worth redoing many genomes so that existing shotgun dataset-based assemblies could be closed and their complete methylomes obtained. The resultant assembled (epi)genomes would be inherently more valuable: the usefulness of a closed genome with associated functional annotation of its methyltransferase genes is far greater than the uncertainties left with a shotgun data set. Whereas we currently know much about the importance of epigenetic phenomena for higher eukaryotes, very little is known about the epigenetics of bacteria and the lower eukaryotes. SMRT sequencing opens a new window that may have a dramatic effect on our understanding of this biology.

Competing interests
None of the authors have competing financial interests, but RJR and MCS have collaborated extensively with scientists from Pacific Biosciences leading to several publications cited in the text.

Authors' contributions
All three authors contributed to the writing of this article.