Alignment and coverage
Both DNA and RNA were extracted from peripheral blood mononuclear cells (PBMCs) from the same individual. Both the cDNA and gDNA were sequenced using the Illumina Genome Analyzer II. Sequencing of the gDNA produced 1,450 million reads, each 75 bp long. Ninety percent of these reads were aligned to the human reference genome by BWA [4], and after removing potential PCR duplicates, the remaining 980 million reads produced a coverage of at least 5× for 94% of the bases in the genome (gaps in reference genome excluded), and the mean coverage for these bases was 24×.
Sequencing of the cDNA produced 280 million reads, half 75 bp long and half 68 bp long. TopHat [5] was used to align these reads to the reference genome, with exons and splice junctions restricted to the 45,455 protein-coding transcripts annotated in Ensembl version 50. Sixty-nine percent of these reads gave unique alignments, aligning to exactly one location in the specified transcriptome. Reads aligning to more than one location were discarded. After removing potential PCR duplicates, the remaining 81 million reads produced a mean coverage of above 5× for 51% of exons: these exons had a median coverage of 51×.
Single nucleotide variants and overlap between datasets
We used SAMtools to call SNVs in our aligned gDNA and cDNA sequences. Indels and large structural variants were not analyzed. SAMtools called 51,055 SNVs in protein-coding exons in gDNA and 64,128 in cDNA. Of these, 48,740 in gDNA and 40,605 in cDNA passed quality control filters, and 19,054 of these overlapped between the two datasets in terms of position. When considering overlap between cDNA and genomic SNV calls, two measures were examined: sensitivity and specificity. Sensitivity was defined as the number of true positives (SNVs overlapping between the two datasets) divided by the number of true positives plus the number of false negatives (SNVs existing in the gDNA but not the cDNA). Specificity was defined as the number of true positives divided by the number of true positives plus the number of false positives (SNVs existing in the cDNA but not the gDNA). Quality control filters were optimized to maximize both the sensitivity and specificity in this study. In this dataset the sensitivity was 0.39 and the specificity was 0.47. If an exact match of the genotype was required as well as location, then the sensitivity fell to 0.35 and specificity to 0.42.
SNVs called in the gDNA and cDNA were also compared with entries in dbSNP. It was found that 90% of the gDNA exonic SNVs corresponded to a dbSNP entry, while this was true of only 56% of the cDNA SNVs. However, a further breakdown revealed that 94% of the true positive cDNA SNVs corresponded to a dbSNP entry, while only 23% of the false positives did the same. The false negatives corresponded to dbSNP entries 89% of the time.
SNV identification at different levels of expression and coverage
Many of the exons in Ensembl's transcript library are hypothetical and not confirmed to be expressed. A list of core exons as defined by the Affymetrix (Santa Clara, California, USA) Human Exon 1.0 ST Array was utilized to focus on exons with better curation. This list was further screened to only include exons present in Ensembl's list of canonical transcripts, resulting in 172,739 core exons. When focusing on just core exons, sensitivity and specificity rose to 0.44 and 0.55, respectively (Figure 1), which coincided with the percentage of exons having at least 5× coverage rising to 61% (median coverage for these exons was 57×).
We then evaluated how the number of reads and the level of expression affected the specificity and sensitivity of cDNA sequencing. Using data from previous studies on the level to which most of the transcripts containing these core exons are expressed in PBMCs [6], we defined expression level for each transcript as a percentage of the most highly expressed transcript in that tissue. For exons in our dataset the sensitivity and specificity both rise as known PBMC expression increases, until an expression level of 4% of the most highly expressed transcript, at which point both measures asymptote, with variants called about equally well for all expression levels above this (Figure 2). Ninety-four percent of exons from genes above this expression level, or 'PBMC-expressed genes', had at least 5× coverage, and the median coverage for exons with at least 5× coverage was 126×. The sensitivity also rose to 0.81 and the specificity to 0.67.
We also evaluated how the absolute number of true positive SNVs called depends on the amount of sequence data, in lanes, for all exons and for exons from PBMC-expressed genes (Figure 3). Seventy-nine percent of the 6,434 true positive variants identified in PBMC-expressed genes were identified with even one lane of sequence data, which is approximately 35 million reads in this instance. The total number of variants identified in all genes, however, increased substantially as more lanes were added. For the approximately 4,500 PBMC-expressed genes (Figure 4), even a single lane can be expected to capture most of the coding variants present.
We also found that the percent overlap with dbSNP changed as expression level and coverage changed. Although the percentage of SNV calls with a corresponding dbSNP entry remained relatively stable at all expression and coverage levels for the true positives and false negatives in our dataset, this was not true of the false positives. The percentage of false positives that overlapped with dbSNP decreased as coverage increased (Supplemental figure S3 in Additional file 1) and increased as expression level increased (Supplemental figure S4 in Additional file 1).
SNV identification in genes with and without paralogs
An inspection of false positive SNVs identified in the cDNA revealed that some arose from alignment of a read to the wrong gene. In these cases the correct gene and the gene chosen for alignment always had very similar sequences. To determine if specificity would increase in a set of unrelated genes, SNVs were called separately for two groups of genes: those with paralogs, as annotated by Ensembl, and those without. When SNVs were restricted to exons in 12,124 transcripts from genes without annotated paralogs, the overall sensitivity rose from 0.39 to 0.42 and the specificity rose from 0.47 to 0.54. If restricted to PBMC-expressed genes without paralogs, the sensitivity actually dropped slightly from 0.81 to 0.80, but the specificity again rose from 0.67 to 0.72. In contrast, for SNVs in exons of 33,331 transcripts from genes with paralogs, the sensitivity was 0.38 and the specificity was 0.45 (sensitivity 0.81 and specificity 0.65 in PBMC-expressed genes with paralogs).
Single nucleotide variant identification at different read depths
We studied the effect of read depth on specificity by examining the read depth at individual SNV calls in our complete RNA-Seq dataset of eight lanes. We found that at a read depth of 3 (the minimum required for a variant to be called), the specificity for SNVs found in the complete set of exons was only 0.28, but that as read depth increased, so did specificity, until it reached a plateau of between 0.6 and 0.75 for read depths between 50 and 1,200 (Supplemental figure S1 in Additional file 1). The 13,892 SNVs called between these read depth levels had a specificity of 0.67 (sensitivity of 0.19). However, above a depth of 1,200, the specificity fell again, becoming as low as 0.05 for the 94 SNVs with read depths greater than 2,000. Similar trends were found for core exonic SNVs (sensitivity 0.22 and specificity 0.77), SNVs in PBMC-expressed genes (sensitivity 0.59 and specificity 0.84), and SNVs in PBMC-expressed genes without paralogs (sensitivity 0.58 and specificity 0.90) when restricting to SNVs with a read depth between 50 and 1,200 (Supplemental figure S1 in Additional file 1).
We also studied the effect of read depth on specificity as the dataset increased from one to eight lanes. We found that at only one lane of RNA-Seq data, the specificity was 0.5 for SNVs with a read depth of 3, which was far better than the value of 0.28 found when using all eight lanes of data. Again, the specificity increased as the read depth increased, but in this lower coverage dataset the specificity reached a plateau of between 0.6 and 0.75 at greater than 10 reads, far earlier than the 50 reads needed for specificity stability in the eight lane dataset. The specificity also decayed at a much lower read depth in this smaller dataset, becoming less than 0.6 when the read depth was greater than 150. As the dataset increased from one lane to all eight lanes, the overall specificity continually decreased (Figure 1), and the minimum and maximum read depth value required for specificity to remain stable continually increased (Supplemental figure S2 in Additional file 1).