Modeling nonuniformity in shortread rates in RNASeq data
 Jun Li^{1},
 Hui Jiang^{1, 2} and
 Wing Hung Wong^{1, 3}Email author
DOI: 10.1186/gb2010115r50
© Li et al.; licensee BioMed Central Ltd. 2010
Received: 20 November 2009
Accepted: 11 May 2010
Published: 11 May 2010
Abstract
After mapping, RNASeq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.
Background
Microarrays are an efficient technology to measure the expression levels of many genes simultaneously, but there are some limitations to this method. The expression estimates are typically not reliable for lowly expressed genes because the true signals are masked by crosshybridization effects [1, 2]. Furthermore, the design of the array depends on annotation of gene structures and thus the method is not ideal for the discovery of novel splicing events. A recently developed alternative approach, called RNASeq, has the potential to overcome these difficulties [3]. RNASeq uses ultrahighthroughput sequencing [4] to determine the sequence of a large number of cDNA fragments. The resulting sequences (reads) can be long (>100 nucleotides) or short, depending on the platform [4]. Two currently popular shortread platforms are Illumina's Solexa [5–11] and Applied Biosystems' (ABI's) SOLiD [12]. Each can produce tens of millions of short reads in a single run [5–12]. In this paper, we only consider the shortread RNASeq.
The reads produced by RNASeq are first mapped to the genome and/or to the reference transcripts using computer programs. Then, the output of RNASeq can be summarized by a sequence of 'counts'. That is, for each position in the genome or on a putative transcript, it gives a count standing for the number of reads whose mapping starts at that position. As an example (we have shortened the gene and reads for simplification), if a gene with a single isoform has sequence ACGTCCCC, and we have 12 ACGTC reads, 8 CGTCC reads, 9 GTCCC reads, and 5 TCCCC reads, then this gene can be summarized by a sequence of counts 12, 8, 9, 5.
Quantitative inference of RNASeq data, such as calculating gene expression levels [7] and isoform expression levels [13], is based on these counts. To utilize the data efficiently, it is crucial to have an appropriate statistical model for these counts. Current analysis methods assume, explicitly or implicitly, a naive constantrate Poisson model, in which all counts from the same isoform are independently sampled from a Poisson distribution with a single rate proportional to the expression level of the isoform [7, 13, 14]. Unfortunately, we found that this model does not provide a good fit to real data (see Results), and a more elaborate model is needed.
To better model the counts, it is natural to consider a Poisson model with variable rates; that is, the counts from an isoform are still modeled as Poisson random variables, but each Poisson random variable has a different rate (mean value). By checking the similarities among counts of different tissues (see Results), one can see that the Poisson rate depends on not only the gene expression level, but also the position of the read. Hence, we model the rate as the product of the gene expression level and the 'sequencing preference' of reads starting at this position. This sequencing preference is a factor showing how likely it is for a read to be generated at this position.
Dohm et al. [15] found that GCrich regions tend to have more reads than ATrich regions, but we find that models based purely on GC content work poorly (Additional file 1). Some clues on how to model the sequencing preferences may be obtained by reviewing how related issues are handled in microarrays. There are a set of probes for each gene in microarrays, and each probe gives a continuous measurement of the gene expression level. The values of the measurements from the same set are modeled by a Gaussian distribution with different means, each of which is the product of the gene expression level and the affinity of that probe to the cDNA sequences. Naef and Magnasco [16] proposed a model for the probe affinities, which only depends on the probe sequences:
where ω_{ i }is the affinity of probe i, K is the length of the probe, I(b_{ ik }= h)) is 1 when the k^{ th }base pair is letter h, and 0 otherwise, α and β_{ kh }are the parameters we want to estimate, and ε is Gaussian noise so that the parameters can be estimated by regular linear least squares. The key feature of this model is that it considers the letter appearing at each location, rather than just the total number of occurrences of each letter. This simple linear model can explain 44% of the differences of the affinities in an Affymetrix oligonuleotide array dataset. Similar models have been developed for other arrays or datasets [17–20].
In RNASeq experiments, cDNA synthesis is typically initiated by random priming. Depending on its sequence, an mRNA fragment may form secondary structures that obstruct the binding of the primers. Furthermore, the primer is usually tagged by a nonrandom flanking sequence that may preferentially interact with the mRNA depending on the mRNA sequence. Due to these effects, the probability for binding depends on both the nucleotide sequence and the protocol. After synthesis, the cDNAs are ligated to linkers, amplified and then sequenced. In these steps, the secondary structure of the cDNA and the details of the protocol can again influence the efficiency. Therefore, the protocol and the local sequence context may have a large influence on how likely an mRNA segment will be read. Hence, under a specific protocol, we may be able to predict, at least partly, the sequencing preferences based on the local nucleotide sequences.
Results and discussion
Datasets and overdispersion
Three genomewide RNASeq datasets are used in this paper. The first two were generated by Illumina's Solexa platform, and the third one was generated by ABI's SOLiD platform. The first dataset [7] is composed of 79, 76, and 70 million reads from three mouse tissues: brain, liver and skeletal muscle. Each read is of length 25. The second dataset [11] is composed of 12 to 29 million reads from 10 diverse human tissues and 5 mammary epithelial or breast cancer cell lines. Each read is of length 32. We use data from nine of these tissues or cell lines, and merge them into three groups (adipose, brain, and breast in group one, colon, heart, and liver in group two, lymph node, skeletal muscle, and testes in group three.). Each group contains 61 to 77 million reads. The third dataset [12] is composed of 16 million highquality reads from each of the two cell lines: embryoid bodies (EB) and undifferentiated mouse embryonic stem cells (ES). Each original read is 35 nucleotides, but some are truncated into 30 or 25 nucleotides to ensure high quality. We refer to these three datasets as Wold data, Burge data, and Grimmond data, respectively, in accord with the research group that originally generated the data. As we just described, each of the three datasets contains several subdatasets standing for different tissues, groups, or cell lines, and in total we have eight subdatasets: three (tissues) for Wold data, three (groups) for Burge data, and two (cell lines) for Grimmond data. In all our processing and calculations, the above subdatasets are considered separately; that is, only one subdataset is analyzed at a time.
First, the count data are extracted from the original datasets. The detailed procedure is described in Materials and methods. Briefly speaking, we map reads to all isoforms of all RefSeq genes, and then in order to avoid ambiguity, we only count reads uniquely mapped to genes that have only one isoform annotated in RefSeq and do not overlap with other genes, which we call 'nonoverlapped singletranscript genes'. Further, we use only the counts from the top 100 genes with the highest expression levels to fit our model since they have the highest signaltonoise ratio (see Additional file 1 for details).
Variancetomean ratios in different datasets
Variancetomean ratios  

Dataset  Subdataset  Maximum  Median  Minimum 
Wold  Brain  248  36  21 
Liver  1,503  48  19  
Muscle  2,088  34  18  
Burge  Group 1  835  78  14 
Group 2  1,187  102  28  
Group 3  1,593  112  20  
Grimmond  EB  24,385  806  47 
ES  9,162  345  22 
The Poisson linear model and its performance
where ν_{ I }= log(μ_{ i }), α is a constant term, I(b_{ ijk }= h) equals to 1 if the k^{ th }nucleotide of the surrounding sequence is h, and 0 otherwise, and β_{ kh }is the coefficient of the effect of letter h occurring in the k^{ th }position. This model uses about 3K parameters to model the sequencing preference. To fit the above model, we iteratively optimize the gene expression levels and the Poisson regression coefficients (Materials and methods).
We applied our model to each of the eight subdatasets. As local sequence context, we use 40 nucleotides prior to the first nucleotide of the reads and 40 nucleotides after them (that is, the first 40 nucleotides of the reads; see Additional file 1 for the reason for choosing this region). Thus, our model uses 3 × 80 = 240 parameters to model the sequencing preference. This is a relatively small number compared to the sample size (about 100,000 counts) in each subdataset.
R^{2} in different datasets
R ^{2}  

Poisson linear  MART  
Dataset  Subdataset  80 nucleotides^{a}, noncrossvalidation  80 nucleotides^{a}, crossvalidation  40 nucleotides^{a}, crossvalidation  40 nucleotides^{a}, crossvalidation 
Wold  Brain  0.52  0.51  0.51  0.70 
Liver  0.51  0.50  0.50  0.70  
Muscle  0.48  0.46  0.46  0.59  
Burge  Group 1  0.43  0.42  0.42  0.52 
Group 2  0.37  0.35  0.35  0.46  
Group 3  0.45  0.42  0.42  0.54  
Grimmond  EB  0.47  0.40  0.40  0.58 
ES  0.45  0.39  0.37  0.54 
The coefficients are strikingly similar in each subdataset of the same dataset, although they significantly differ in different datasets. This is strong evidence that these coefficients are meaningful rather than just random.
Although it is difficult to explain biologically the magnitude of each coefficient, it is possible for us to explain the main differences of coefficients between datasets by the protocols they used. Both the Wold and Burge data were generated by using the Illumina platform, so their curves look similar, especially in the central part. However, the mRNAs were fragmented into approximately 200nucleotide pieces before cDNA synthesis in the Wold data but not in the Burge data. Shorter pieces of mRNA are less likely to form nonlocal secondary structure. Therefore, the coefficient curve of the Wold data should have lighter tails. Grimmond's experiment used ABI's platform for sequencing and added quite different linkers to the synthesized cDNA before sequencing, so the whole curve looks quite different from that of the Wold and Burge data.
Our Poisson linear model shows that at least 37 to 52% of the nonuniformity can be explained by the sequence difference. However, this percentage may be an underestimate of the fraction of deviance explainable by local sequence context as the simple linear model cannot capture many other effects. Adding more predictors to the linear model is possible, and in particular adding the dinucleotide composition can considerably improve the fitting (Additional file 1), but we prefer to consider nonlinear models to get a better understanding of how much of the nonuniformity of the counts is systematic bias rather than random noise.
The MART model and its performance
Having tried methods such as support vector machines and neural networks (Additional file 1), we settled on MART (multiple additive regression trees) as our final choice for a nonlinear model. MART is a gradient treeboosting algorithm proposed by Friedman [23, 24]. One version of MART is available in the 'gbm' package [25] of R [26]. Also, to avoid the overfitting that commonly occurs for nonlinear models, we use crossvalidation and R^{2} in the testing data.
The details on using MART and on estimating crossvalidation R^{2} are given in Materials and methods. In this analysis, we use shorter surrounding sequences. For the Wold and Burge data, we use 25 nucleotides prior to the first nucleotide of the reads and 15 nucleotides after it, and for the Grimmond data, we use 15 nucleotides prior and 25 nucleotides after. These are the regions that have large coefficients in the Poisson regression model (Additional file 1). Using shorter surrounding sequences lowers the dimensions of the input data, thus shortening the training time and reducing the chance of overfitting.
Our high R^{2} shows that at least 50 to 70% of the nonuniformity in the sequencing preference is predictable from local sequences.
Applications of our models
Our results may benefit quantitative inference from RNASeq data. To reduce biases in gene expression estimates due to nonuniformity of read rates, we propose to estimate the expression of a singleisoform gene by the total number of reads along the gene divided by the sum of sequencing preferences (SSP) under our MART model. In contrast, the standard estimate will divide the number of reads by the length of the gene, which is equivalent to dividing by the SSP under the uniform model where all sequencing preferences are set to be 1.
To test the new method, we first compared the gene expression levels estimated using the mouse liver subdataset of the Wold RNASeq data with those estimated using Affymetrix microarray data of the same tissue, as used by Kapur et al. [27]. For RNASeq data, we estimate gene expression level under the uniform model and our MART model, and for microarray data, we use the Robust Multichip Average [28]. All nonoverlapped singletranscript genes are included in the comparison, and the results are summarized by the Spearman's rank correlation coefficients. For all genes considered, using our MART model increased the rank correlation from 0.771 to 0.773 compared to the uniform model, which represents a very minor improvement.
The average fold change across genes in the Wold data is only 1.02; thus, it is not surprising that the performance of the new estimate is so close to the standard estimate. Consistently, when we examine the 100 genes with the largest fold changes (on average, the fold change is 1.10 in these 100 genes), the rank correlation shows a much larger improvement, from 0.095 to 0.198, that is, a 108% relative change.
Average fold changes of genes, exons, and junctions of chromosome 1
Average fold changes of mean sequencing preferences  

Dataset used to train the model  Genes  Exons  Junctions (read length = 35)  Junctions (read length = 100) 
Wold  1.02  1.12  1.13  1.07 
Burge  1.18  1.32  1.37  1.28 
Grimmond  1.25  2.17  2.34  1.73 
Spearman's rank correlation coefficients in mouse embryoid bodies
Fold change bin  SCC by uniform model  SCC by our MART model  Relative improvement 

(1.00, 1.09)  0.465  0.466  0.1% 
(1.09, 1.19)  0.437  0.444  1.4% 
(1.19, 1.33)  0.413  0.434  5.1% 
(1.33, 1.53)  0.481  0.520  8.2% 
(1.53, 4.82)  0.389  0.490  26.0% 
Pearson's correlation coefficients of %A Sex
Selection criterion  Tissue  Number of selected AS events  PCC by uniform model  PCC by our MART model  Relative improvement 

1  Liver  472  0.48  0.50  4.2% 
Muscle  451  0.40  0.45  12.5%  
Brain  699  0.36  0.40  11.1%  
2  Liver  228  0.60  0.60  0% 
Muscle  194  0.48  0.51  6.3%  
Brain  298  0.44  0.50  13.6% 
In the above, we find that the main factor determining how much improvement our model can bring is the magnitude of fold changes. Thus, we expect that our method can be applied to many other problems that involve short sequence elements. In new isoform discovery, a problem of great current interest, it is crucial to take into account the relative counts of reads along the region. For example, a region with more reads per base than its surrounding regions suggests a new exon. However, this might be misleading if this region has more reads merely because it has larger sequencing preferences than its surrounding regions. Further effort is needed to incorporate our method into current isoformdiscovery algorithms.
While the MART model gives better estimates of sequencing preferences and is thus used for statistical inference, the main purpose of the Poisson linear model is to select a proper K for the MART model. Nevertheless, it might still be possible for us to get more information from it, especially from the plot of the coefficients (like Figure 2). For example, if the coefficients in the central part of the curve have large absolute values, this may indicate that the difference in sequencing preferences is repeatedly enlarged in the experiment, most likely by multiround PCR, and we may need to use more mRNA samples instead of doing PCR for too many rounds. As another example, if the coefficient curve has heavy tails, this should indicate that the mRNA/cDNA tend to form complex nonlocal secondary structure, which is also unfavorable, and we may need to fragment the mRNAs into smaller pieces and/or choose better linkers with proper lengths. It might be possible for experienced technicians, who know all the details of the experiments, to provide more explanation of, or even pinpoint, the main causes of biases. This might help to improve the protocols of RNASeq.
Conclusions
Nonuniformity is dramatic in RNASeq data
In each of the eight subdatasets, the RNASeq count data are largely overdispersed. This is strong evidence that the nonuniformity of the counts is too great for Poisson distribution with constant rate tocapture. Also, among the subdatasets of each dataset, the trends that counts differ along the gene show a highly consistent pattern. This is not only evidence that the Poisson distribution fails, but also suggests that the changes of the counts depend on the position along the gene.
Poisson linear model
We proposed a Poisson linear model for the count data, and implement an iterative Poisson linear regression procedure to fit it. Using the surrounding 80 nucleotides, it is able to explain 37 to 52% of the difference in the counts for the most highly expressed genes. We find that the coefficients for nucleotides near the first nucleotide of a read have bigger abstract values, indicating that they play a more important role in determining the sequencing preferences.
MART model
To capture the nonlinear effects of the local sequences, we use MART to fit the log preferences, and a crossvalidation strategy is implemented to calculate R^{2}. MART gives a crossvalidation R^{2} of 0.52 to 0.70 in seven out of eight subdatasets, a 0.10 to 0.20 improvement. This result indicates that the major information about nonuniformity is in the local sequences.
Benefits of our models
Our models may help us to evaluate the protocol for RNASeq experiments. It can also give us better estimators for the quantitative inferences of RNASeq data. Since the average preferences can vary substantially in short pieces of sequences, the improvement can be significant. We believe that all quantitative analysis of RNASeq data should incorporate sequencing preference information. Particularly, we suggest training a model for sequencing preference using only the top 100 genes and MART, then using this trained model to predict the sequencing preference of all sites in the transcriptome, which are then used in further inferences.
Materials and methods
Extracting the count data from the original reads data
First, we downloaded from the UCSC genome browser website [30] the sequences of RefSeq genes [31, 32] (mouse July 2007 mm9 for the Wold and Grimmond data, and human Feb 2009 hg19 for the Burge data). Then, we mapped the reads to all isoforms of the RefSeq genes. For Illumina data, we directly mapped the 25 or 32 nucleotide reads using SeqMap [33], allowing two mismatches. For ABI data, we used the same strategy as described in Supplementary Figure 1 of [12], where a threeround mapping for 35, 30 and 25 nucleotide qualified reads was performed separately. In each round, we used SOCS [34] as the mapping tool. After mapping, we selected genes that have only one isoform annotated in RefSeq and do not overlap with other genes, and called them 'nonoverlapped singleisoform genes'. To avoid ambiguity, we only retained reads that map to a unique site and this site is within the unique genes. Then, we counted the number of reads whose mapping starts at each position of these unique genes, which gives the count data. Since some positions have the same local sequence (to the length of reads) as other positions because of the short length of reads, they are always assigned a zero count by our counting method. This might influence the results of our analysis. However, these positions comprise less than 2% of all positions even if the read length is only 25, so they should not change our analysis significantly.
Several more steps are performed afterwards. To avoid UTR ambiguity in the annotation and boundary bias in the sequencing [3], we truncated all UTRs and a further 100 nucleotides on both ends. We then discarded genes that are too short (less than 100 nucleotides) after the truncation. Finally, after calculating the gene expression levels measured by RPKM [7], we discarded all genes except the top 100 with the highest expression levels. The counts of these top genes were the only counts we used for fitting the models. Reads from these top genes make up a considerable proportion of all reads mapped unambiguously, and thus give sufficient information for the sequencing preference. In contrast, lowly expressed genes have no or only a few reads across them, and moderateexpressed genes often have zero counts for a considerable proportion of sites; thus, information on their sequencing preference is limited.
The count data for the top 100 genes in each subdataset are available in the R package 'mseq' [35], which is publicly available in CRAN (The Comprehensive R Archive Network).
Fitting the Poisson linear model
We use the following strategy to fit our Poisson regression model:
1. Initialize , where L _{ i }is the length of gene i.
2. Viewing as known offsets, fit the Poisson regression model to get and . This is a standard algorithm, and 'glm()' of R [26] implements it.
3. Update , where W _{ i }is the sum of sequencing preferences of all nucleotides of gene i, that is, .
4. Jump to step 2 unless the deviance decreases less than 1%.
In the above, step 2 gives the maximum likelihood estimate of α and β_{ kh }given , and it is easy to prove that step 3 gives the maximum likelihood estimate of ν_{ i }given α = and β_{ kh }= . So the above procedure maximizes the likelihood by iteratively optimizing the preference parameters and the gene expression levels.
The R codes implementing this procedure are available in the R package 'mseq' [35].
Strategy for using MART and estimating crossvalidation R^{2}
The strategy for using MART and estimating crossvalidation R^{2} includes the following steps: (1) Randomly divide the 100 genes into 5 groups. In each fold of crossvalidation, use one of them as the testing set, and the other four as the training set. (2) In each fold, for each gene in the training dataset, divide each count by the mean of counts in this gene. The resulting number is considered to be the sequencing preference of that position. To avoid zero preference, which is troublesome in step 3, we replace zero counts by a small number (0.5 in our calculation). (3) Get the logarithm of these preferences. (4) Train MART using the surrounding sequences as input and these log preferences as output. The parameters we used for MART are: interaction depth = 10, shrinkage = 0.06, and number of trees = 2000 (the method is robust to the choice of parameters; Additional file 1). Also, we put heavier weights on log preferences from more highly expressed genes since they have smaller variance. The weights for log preferences from gene i are set to be N_{ i }/L_{ i }, where N_{ i }is the total number of reads across this gene, and L_{ i }is the length of this gene. (5) Use the trained MART to predict the log preferences of the testing data. (6) Get the maximum likelihood estimate of the gene expression levels. That is, suppose for a gene the length is L, the log preferences are a1, ..., a_{ L }, and the counts are n_{1}, ..., n_{ L }, then the gene expression level is
(7) Calculate the deviance according to the log preferences in step 5 and the gene expression levels in step 6. Also calculate the null deviance. (8) Repeat steps 2 to 7 for all five folds. (9) Calculate the final crossvalidation R^{2}, which is the sum of deviances in the five folds over the sum of null deviances.
The R codes implementing this procedure are available in the R package 'mseq' [35].
Abbreviations
 ABI:

Applied Biosystems
 Apoe :

apolipoprotein E
 AS:

alternative splicing
 %ASex:

percent alternatively spliced exon exclusion
 bp:

base pair
 EB:

embryoid body
 ES:

embryonic stem cell
 MART:

multiple additive regression trees
 RPKM:

reads per kilobase of exon per million mapped sequence reads
 SSP:

sum of sequencing preferences
 UTR:

untranslated region.
Declarations
Acknowledgements
This research is supported by NIH grants R01 HG004634 and R01 HG003903. The computation in this project was performed on a system supported by NSF computing infrastructure grant DMS0821823. We thank Xi Chen for discussion about protocols of RNASeq experiments. We also thank three anonymous reviewers for their insightful and constructive comments and suggestions, which improved our paper substantially.
Authors’ Affiliations
References
 Okoniewski MJ, Miller CJ: Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics. 2006, 7: 27610.1186/147121057276.PubMedPubMed CentralView ArticleGoogle Scholar
 Royce TE, Rozowsky JS, Gerstein MB: Toward a universal microarray: prediction of gene expression through nearestneighbor probe sequence identification. Nucleic Acids Res. 2007, 35: e9910.1093/nar/gkm549.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.PubMedPubMed CentralView ArticleGoogle Scholar
 Holt RA, Jones SJ: The new paradigm of flow cell sequencing. Genome Res. 2008, 18: 839846. 10.1101/gr.073262.107.PubMedView ArticleGoogle Scholar
 Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 13441349. 10.1126/science.1158441.PubMedPubMed CentralView ArticleGoogle Scholar
 Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at singlenucleotide resolution. Nature. 2008, 453: 12391243. 10.1038/nature07002.PubMedView ArticleGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5: 621628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
 Lister R, O'Malley RC, TontiFilippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated singlebase resolution maps of the epigenome in Arabidopsis. Cell. 2008, 133: 523536. 10.1016/j.cell.2008.03.029.PubMedPubMed CentralView ArticleGoogle Scholar
 Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 15091517. 10.1101/gr.079558.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel shortread sequencing. Biotechniques. 2008, 45: 8194. 10.2144/000112900.PubMedView ArticleGoogle Scholar
 Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
 Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massivescale mRNA sequencing. Nat Methods. 2008, 5: 613619. 10.1038/nmeth.1223.PubMedView ArticleGoogle Scholar
 Jiang H, Wong WH: Statistical inferences for isoform expression in RNASeq. Bioinformatics. 2009, 25: 10261032. 10.1093/bioinformatics/btp113.PubMedPubMed CentralView ArticleGoogle Scholar
 Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing. Nat Genet. 2008, 40: 14131415. 10.1038/ng.259.PubMedView ArticleGoogle Scholar
 Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultrashort read data sets from highthroughput DNA sequencing. Nucleic Acids Res. 2008, 36: e10510.1093/nar/gkn425.PubMedPubMed CentralView ArticleGoogle Scholar
 Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 011906PubMedView ArticleGoogle Scholar
 Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS: Modelbased analysis of tilingarrays for ChIPchip. Proc Natl Acad Sci USA. 2006, 103: 1245712462. 10.1073/pnas.0601180103.PubMedPubMed CentralView ArticleGoogle Scholar
 Potter DP, Yan P, Huang TH, Lin S: Probe signal correction for differential methylation hybridization experiments. BMC Bioinformatics. 2008, 9: 45310.1186/147121059453.PubMedPubMed CentralView ArticleGoogle Scholar
 Wu ZJ, Irizarry RA, Gentleman R, MartinezMurillo F, Spencer F: A modelbased background adjustment for oligonucleotide expression arrays. J Am Stat Assoc. 2004, 99: 909917. 10.1198/016214504000000683.View ArticleGoogle Scholar
 Song JS, Johnson WE, Zhu X, Zhang X, Li W, Manrai AK, Liu JS, Chen R, Liu XS: Modelbased analysis of twocolor arrays (MA2C). Genome Biol. 2007, 8: R17810.1186/gb200788r178.PubMedPubMed CentralView ArticleGoogle Scholar
 Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010,Google Scholar
 Hardin JW, Hilbe JM: Generalized Linear Models and Extensions. 2007, College Station, TX: Stata Press, 2Google Scholar
 Friedman JH: Greedy function approximation: A gradient boosting machine. Ann Stat. 2001, 29: 11891232. 10.1214/aos/1013203451.View ArticleGoogle Scholar
 Friedman JH: Stochastic gradient boosting. Comput Stat Data Anal. 2002, 38: 367378. 10.1016/S01679473(01)000652.View ArticleGoogle Scholar
 Greg Ridgeway. gbm: Generalized Boosted Regression Models. R package version 1.63. 2007, [http://cran.rproject.org/web/packages/gbm/index.html]
 R Development Core Team: R: A Language and Environment for Statistical Computing. 2008, Vienna, Austria: R Foundation for Statistical ComputingGoogle Scholar
 Kapur K, Jiang H, Xing Y, Wong WH: Crosshybridization modeling on Affymetrix exon arrays. Bioinformatics. 2008, 24: 28872893. 10.1093/bioinformatics/btn571.PubMedPubMed CentralView ArticleGoogle Scholar
 Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e1510.1093/nar/gng015.PubMedPubMed CentralView ArticleGoogle Scholar
 Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, Babak T, Siu H, Hughes TR, Morris QD, Frey BJ, Blencowe BJ: Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol Cell. 2004, 16: 929941. 10.1016/j.molcel.2004.12.004.PubMedView ArticleGoogle Scholar
 UCSC Genome Browser. [http://genome.ucsc.edu/]
 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860921. 10.1038/35057062.PubMedView ArticleGoogle Scholar
 Waterston RH, LindbladToh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520562. 10.1038/nature01262.PubMedView ArticleGoogle Scholar
 Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008, 24: 23952396. 10.1093/bioinformatics/btn429.PubMedPubMed CentralView ArticleGoogle Scholar
 Ondov BD, Varadarajan A, Passalacqua KD, Bergman NH: Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications. Bioinformatics. 2008, 24: 27762777. 10.1093/bioinformatics/btn512.PubMedPubMed CentralView ArticleGoogle Scholar
 mseq on CRAN. [http://cran.rproject.org/web/packages/mseq/index.html]
 Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIPchip and ChIPseq data. Nat Biotechnol. 2008, 26: 12931300. 10.1038/nbt.1505.PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.