Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads
© Turro et al.; licensee BioMed Central Ltd. 2011
Received: 19 September 2010
Accepted: 10 February 2011
Published: 10 February 2011
We present a novel pipeline and methodology for simultaneously estimating isoform expression and allelic imbalance in diploid organisms using RNA-seq data. We achieve this by modeling the expression of haplotype-specific isoforms. If unknown, the two parental isoform sequences can be individually reconstructed. A new statistical method, MMSEQ, deconvolves the mapping of reads to multiple transcripts (isoforms or haplotype-specific isoforms). Our software can take into account non-uniform read generation and works with paired-end reads.
High-throughput sequencing of RNA, known as RNA-seq, is a promising new approach to transcriptome profiling. RNA-seq has a greater dynamic range than microarrays, which suffer from non-specific hybridization and saturation biases. Transcriptional subsequences spanning multiple exons can be directly observed, allowing more precise estimation of the expression levels of splice variants. Moreover, unlike traditional expression arrays, RNA-seq produces sequence information that can be used for genotyping and phasing of haplotypes, thus permitting inferences to be made about the expression of each of the two parental haplotypes of a transcript in a diploid organism.
The first step in RNA-seq experiments is the preparation of cDNA libraries, whereby RNA is isolated, fragmented and synthesized to cDNA. Sequencing of one or both ends of the fragments then takes place to produce millions of short reads and an associated base call uncertainty measure for each position in each read. The reads are then aligned, usually allowing for sequencing errors and polymorphisms, to a set of reference chromosomes or transcripts. The alignments of the reads are the fundamental data used to study biological phenomena such as isoform expression levels and allelic imbalance. Methods have recently been developed to estimate these two quantities separately but no approaches exist to make inferences about them simultaneously to estimate expression at the haplotype and isoform ('haplo-isoform') level. In diploid organisms, this level of analysis can contribute to our understanding of cis vs. trans regulation  and epigenetic effects such as genomic imprinting . We first set out the problems of isoform level expression, allelic mapping biases and allelic imbalance, and then propose a pipeline and statistical model to deal with them.
Isoform level expression
Multi-mapping reads. Approximate proportion of reads mapping to multiple Ensembl transcripts or genes in human using 37 bp single-end or paired-end data obtained from HapMap individuals.
37 bp single-end
37 bp paired-end
where X it are the (unobserved) counts of reads from region i of transcript t, b is a normalization constant used when comparing experiments, μ t is a parameter representing the expression of transcript t and s i is the effective length of region i (that is the number of possible start positions for reads in the region). This model can be fit using an expectation maximization (EM) algorithm, since the X it are unobserved but their sums across transcripts are observed.
This model has been used by  in their POEM software, with i representing exons. Their method does not use reads that span multiple exons or reads that map to multiple genes. The same model has been used in , with i representing exons or part exons, or regions spanning exon junctions, enabling good estimation of isoform expression within genes. They do not, however, include reads mapping to multiple genes. The RSEM method  employs a similar model, but models the probability of each read individually, rather than read counts. This method allows reads to come from multiple genes as well as multiple isoforms of the same gene. The modeling of individual reads allows RSEM to accommodate general position-specific biases in the generation of reads. However, two recent papers [7, 8] have shown that deviations from uniformity in the generation of reads are in great part sequence rather than position-dependent for a given experimental protocol and sequencing platform. Furthermore, the computational requirements of modeling individual reads increasing proportionately with read depth, which, in the case of RSEM, is exacerbated further by the use of computationally intensive bootstrapping procedures to estimate standard errors. None of the above methods are compatible with paired-end data. A recently published method, Cufflinks , focuses on transcript assembly as well as expression estimation using an extension of the  model that is compatible with paired-end data. However, this method does not model sequence-specific uniformity biases and uses a fixed down-weighting scheme to account for reads mapping to more than one transcription locus, meaning that the abundances of transcripts in different regions are estimated independently.
Studies of imbalances between the expression of two parental haplotypes have mostly been restricted to testing the null hypothesis of equal expression between two alleles at a single heterozygous base, typically with a binomial test [1, 2, 10]. However, as transcripts may contain multiple heterozygotes, a more powerful approach is to assess the presence of a consistent imbalance across all the heterozygotes in a gene together. This has been done on a case-by-case basis using read pairs that overlap two heterozygous SNPs  while  propose an extension to the binomial test for detecting allelic imbalance that takes into account all SNPs and their positions in a gene. However, this approach, which is a statistical test rather than a method of quantifying haplotype-specific expression, assumes imbalances to be homogeneous along genes and thus does not take into account the possibility of asymmetric imbalances between isoforms of the same gene.
Allelic mapping biases
Aligners usually have a maximum tolerance threshold for mismatches between reads and the reference. Reads containing non-reference alleles are less likely to align than reads matching the reference exactly, so genes with a high frequency of non-reference alleles may be underestimated. Ideally, aligners would accept ambiguity codes for alleles that segregate in the species (cf. Novoalign ), but no free software is currently able to do this. A possible workaround is to change the nucleotide at each SNP to an allele that does not segregate in the species, as has been proposed to remove biases when estimating allelic imbalance . However, in the context of gene expression analysis, this leads to even greater underestimation of genes with many non-reference alleles and an increase in incorrect alignments to homologous regions. Instead, we propose aligning to a sample-specific transcriptome reference, constructed from (potentially phased) genotype calls.
In this paper we present a new pipeline, including a novel statistical method called MMSEQ, for estimating haplotype, isoform and gene specific expression. The MMSEQ software is straightforward to use, fully documented and freely available online  and as part of ArrayExpressHTS . Our pipeline exploits all reads that can be mapped to at least one annotated transcript sequence and reduces the number of alignments missed due to the presence of non-reference alleles. It is compatible with paired-end data and makes use of inferred insert size information to choose the best alignments. Our method permits estimating the expression of the two versions of each heterozygote-containing isoform ('haplo-isoform') individually and thus it can detect asymmetric imbalances between isoforms of the same gene. Our software further takes into account sequence-specific deviations from uniform sampling of reads using the model described in  but can flexibly accommodate other models. We validate our method at the isoform level with a simulation study, comparing our results to RSEM's, and applying it to a published Illumina dataset consisting of lymphoblastoid cell lines from 61 HapMap individuals . We validate our method at the haplo-isoform level by showing we can deconvolve the expression estimates of haplo-isoforms on the non-pseudoautosomal (non-PAR) region of the X chromosome using a pooled dataset of two HapMap males. We further apply our method to a published dataset of F1 initial and reciprocal crosses of CAST/EiJ (CAST) and C57BL/6J (C57) inbred mice  and demonstrate that MMSEQ is able to detect parental imbalance between the two haplotypes of each isoform.
Overview of the pipeline
Expression estimation using alignments to a pre-defined transcriptome reference,
Expression estimation using alignments to a transcriptome reference that is obtained from the RNA-seq data.
In case (a), the level of estimation (haplo-isoform or isoform) depends on whether the reference includes two copies of heterozygous transcripts. In case (b), it depends on whether the genotypes are phased. The most exhaustive use of the pipeline proceeds as follows. First, the reads are aligned to the standard genome reference using TopHat . Then, genotypes are called with SAMtools pileup . Genotypes are then phased with polyHap  using population genotype data to produce a pair of haplotypes for all gene regions on the genome. The standard transcriptome reference is then edited for each individual to match the inferred haplotypes. The reads are realigned to the individualized haplotype specific transcriptome reference with Bowtie , finding alignments for reads that originally failed to align due to having too many mismatches with the standard reference (approximately 0:3% more reads recovered, with some transcripts receiving up to 13% more hits, in the HapMap dataset ). Finally, our new method, MMSEQ, is used to disaggregate the expression level of each haplo-isoform.
We use the model in Equation 1 as a starting point for modeling gene isoforms and extend it to apply to haplo-isoforms. First, we employ a more general definition of 'region': each read maps to one set of transcripts, which may belong to the same gene or to various different genes, and which can have two versions, one containing the paternal and the other the maternal haplotype. These sets are labeled by i. Many reads will map to the exact same set, hence we can model reads counts (k i ) for the set. The M it are defined very straightforwardly as the indicator functions for transcript t belonging to set i. The region length s i is the effective length of the sequence shared between the whole set. If the set of transcripts all belong to the same gene and haplotype, then s i may be the effective length of an exon or part exon. However, aligned reads often map to multiple genes equally well (Table 1) so the region need not correspond to an actual region on the genome. Using our definition of a region, the s i would be difficult to calculate given the sheer number of overlaps and regions, but in fact the s i are not needed in the calculation of the model (see Materials and Methods). Hence we have a model for read counts in which the data and fixed quantities (k i and M it ) are calculated in a straightforward way, and which allows for reads mapping to multiple isoforms of the same or different genes in exons or exon junctions and to paternal and maternal haplotypes separately.
where is an adjusted effective length, referred to as the sum of sequence preferences (SSP) in . We use their Poisson regression model to adjust the length of each transcript based on its sequence, but other adjustment procedures may be used instead. Briefly, the logarithm of the sequencing preference of each possible start position in a transcript is calculated as the sum of an intercept term plus a set of coefficients determined by the sequence immediately upstream and downstream of the start position. It would also be possible to integrate the method described in , which uses a weighting for reads based on the first seven nucleotides of their sequences, by applying this weighting in our calculation of k i . However, this approach does not incorporate the effects of the sequence composition on the reference upstream of the read start positions or further downstream than seven bases, and we thus prefer to use the  method instead. The normalization constant b is used to make lanes with different read depths comparable. We set b to the total number of reads (in millions) and measure transcript lengths in kilobases, which means the scale of the expression parameter μ t is equivalent to RPKM (reads per kilobase per million mapped reads) described in . In downstream analysis, a more robust measure can be used, such as the library size parameter suggested by .
The only unknown parameters in the model are the μ t . The observed data are the k i and the matrix M and effective transcript lengths l t are known. In principle the effective lengths of the transcript sets s i can be calculated, but in fact, they are not needed (see Materials and Methods).
The maximum likelihood (ML) estimate of μ t cannot be obtained analytically, so instead we use an expectation maximization (EM) algorithm to compute it, an approach also taken by [4, 6] for isoforms. After convergence of the algorithm, we output the estimates of μ t and refer to them as MMSEQ EM estimates.
Again, the only lengths needed are the l t . The conjugacy of the Poisson-Gamma model makes the sampling fast and straightforward as the full conditionals are in closed form (see Materials and Methods). We use the final EM estimate of the μ t as the initial values for the Gibbs sampling. We then produce samples from the whole posterior distributions of the μ t and calculate the sample means and their respective Monte Carlo standard errors (MCSE), which take into account the autocorrelations of the samples . We set the hyperprior parameters to α = 1.2 and β = 0.001, producing a vague prior on the μ t that captures the well-known broad and skewed distribution of gene expression values. We output the means of the Gibbs samples of μ t , which we refer to as MMSEQ GS estimates. As we shall show, the regularization afforded by the Bayesian algorithm produces estimates with a lower error than the MMSEQ EM estimates. Moreover, it can readily be shown that for transcript with low coverage, the ML estimate is often zero, even though this is likely to be an underestimate of the expression. For example, suppose there exist two equally-expressed haplo-isoforms differing by only one heterozygote. Under the assumption of uniform sampling of 0.01 reads per nucleotide for both haplo-isoforms, if the read length is 35, then the probability of observing a read containing one allele but no reads containing the other allele is fairly high (2(1-e-0.35)e-0.35 ≃ 0.42). The ML estimate of the haplo-isoform with the unsampled allele under this scenario is zero while the ML estimate of the haplo-isoform with the sampled allele is overestimated. With Gibbs sampling, on the other hand, this effect is tempered by the Gamma prior. The MMSEQ GS estimates are thus our preferred expression measures.
Best mismatch stratum filter
While a read may align to multiple transcripts, not all alignments may be equally reliable. We therefore filter out all alignments that do not have the minimal number of mismatches for a given read or read pair (similar to the --strata switch in Bowtie, but compatible with paired as well as single end data). In the case of paired-end data, the number of mismatches from both ends is added up to determine the 'mismatch stratum' of a read pair. This filter is crucial in order to correctly discriminate between the two versions of an isoform at a heterozygous position, since reads from one haplotype also match the alternative haplotype with an additional mismatch. The stratum filter thus ensures that reads are properly assigned to the correct haplotype.
Insert size filter for paired-end data
For paired-end data, both reads in a pair must align to a transcript for the mapping to be considered. If the fragments are sufficiently large, the alignments may span three exons and align to transcripts that both retain and skip the middle exon. However, the alignment with an inferred fragment size (also called insert size) that is nearer to the expected insert size from the fragmentation protocol, is more likely to be correct. We exploit this information by applying an insert size filter to alignments in the best mismatch stratum for each read. If an alignment's insert size is nearer than x bp (for example equivalent to one standard deviation) away from the expected insert size, then all other alignments for that read with an insert size greater than x bp away from the expected insert size are removed. This filter can be thought of as an extension of mismatch-based filtering for reporting only alignments with moderately high probability of being true. Although full probabilistic modeling is more principled, filtering is a commonplace approach to reducing alignment candidates for each read to a set that can be dealt with pragmatically. For the HapMap dataset, mistakes in the protocol resulted in two distributions of insert sizes within some samples, so we omitted this filter.
The mmseq program produces three files each containing EM and GS expression estimates with associated MCSEs. The first file provides estimates at the transcript/haplo-isoform level, the second file provides aggregate estimates for sets of transcripts that have been amalgamated due to having identical sequences (and therefore indistinguishable expression levels), and the third file aggregates transcript estimates into genes, thus providing gene-level estimates. Homozygous transcripts are aggregated together, whereas heterozygous transcripts are aggregated separately to produce 'haplo-gene' level estimates. With respect to transcripts that have identical sequences and hence indistinguishable and unidentifiable expression levels, the posterior samples exhibit high variance and strong anti-correlation but the sum of their expression can be precisely estimated (Additional file 1). We therefore recommend use of the amalgamated estimates.
Performance and scalability
mmseq performance. Performance of the mmseqprogram on subsets of different sizes of the HapMap paired-end dataset.
Read pairs (millions)
Dimension of M
63,924 × 68,666
84,417 × 75,649
97,576 × 79,035
107,344 × 81,289
134,489 × 86,528
166,100 × 91,023
Correction for non-uniform read sampling
Simulation study of isoform expression estimation
Comparison of isoform expression estimation between MMSEQ and RSEM
Like MMSEQ, the RSEM method  makes use of all classes of reads to estimate isoform expression. The authors have shown an improvement of their method for gene-level estimation over strategies that discard multiply aligned reads or allocate them to mapped transcripts according to the coverage by single-mapping reads (as in ). However, isoform-level results for their method have not been assessed. We obtained RSEM estimates for Ensembl transcripts using our simulated human sequence dataset for the purposes of comparison.
We scaled our simulated and estimated expression values to add up to one in order to make them comparable to RSEM's fractional expression estimates. We found that RSEM and MMSEQ EM are comparable but, unlike the MMSEQ EM algorithm, RSEM tended to overestimate some medium-expression transcripts. Both the RSEM and MMSEQ EM algorithms tended to underestimate some low-expression transcripts, pushing them very close to zero and thus producing very large errors on the log scale. This was avoided by the regularization of the Gibbs algorithm, which produced tighter estimates and only overestimated slightly some very lowly expressed transcripts (Figure 5 and Additional file 5), showing the benefits of using the whole posterior distribution of μ t to estimate expression rather than a maximization strategy.
Isoform-level application to the HapMap dataset
Although the ordering of transcripts and genes was broadly maintained even between lanes belonging to different individuals and runs, we found a striking contrast in the distribution of expression values between lanes of the same individual and lanes of different individuals (Additional file 6). The consistency of expression values for lanes of the same individual indicates that the technical replicability of the Illumina GAII sequencer is extremely high and therefore that the variation observed between lanes from different individuals is mostly a reflection of biological variability. This is in line with previous research showing that sequence count data follow a negative binomial distribution in biological replicates and a Poisson distribution in technical replicates . As such, we expect the variance of our estimates to be proportional and greater than proportional to the expression values for technical and biological replicates respectively. This is indeed borne out both at the gene and transcript level (Additional file 7) and corroborates the need to take into account extra variability for highly-expressed transcripts in differential expression analysis with biological replication (see Discussion).
Validation of haplo-isoform deconvolution
The non-pseudoautosomal region (non-PAR) of the X chromosome in human males is haploid, and thus the alleles in that region can be called directly without the need for phasing. We validated our method for deconvolving expression between two haplotypes of the same isoform as follows. We used the RNA-seq data of two males from the HapMap data (NA12045 and NA12872) to call their haplotypes. We identified 117 isoforms on the non-PAR of the X chromosome that differed between the two individuals. We created custom transcriptome references for each of the two males, containing their individual versions of the 117 isoforms. We then created a third hybrid reference containing two copies of the 117 isoforms, one matching the haplotype of one male and the second matching the haplotype of the other. This hybrid reference mimics the case of a female with two X chromosomes with unknown expression of the two parental copies of each isoform. We obtained individual expression estimates of the 117 isoforms using the separate transcriptome references in each male and compared them with estimates obtained by aligning a dataset pooled from the data of both males to the hybrid reference. Although the original correlation between the two males was 0.85, the correlation between the individual estimates and the deconvolved estimates was 0.96 and 0.98, showing MMSEQ is capable of disaggregating the expression from paternal and maternal isoforms (Additional file 8).
Demonstration of haplo-isoform expression estimation using an F1hybrid mouse brain dataset
We have applied MMSEQ to a published murine embryonic day 15 RNA-seq dataset of CAST/C57 initial (F1i) and reciprocal (F1r) crosses . Each RNA sample was a pool from four individuals. The C57 reference transcriptome used by the authors is available from the UCSC Genome Browser . The authors called SNPs by aligning reads from the CAST samples to the C57 reference. We created a CAST reference transcriptome by changing alleles in the C57 reference sequences according to those SNP calls. The two references were combined in a hybrid reference containing two entries for isoforms that differed in sequence between C57 and CAST. Thus there is a one-to-one mapping between SNPs called in the parents and heterozygotes in the hybrids. The data consist of 152 and 159 million 36 bp Illumina GAII reads for F1i and F1r respectively.
We also identified a clustering of transcripts that exhibited CAST overexpression in the F1i hybrids but approximately balanced expression in the F1r's. We identified the cluster as consisting wholly of transcripts on the X chromosome (Additional file 9), which suggests that the initial crosses were male and the reciprocal crosses female. The sexes of the hybrid mice are in fact unknown. There was a slight skew in favor of the CAST strand in the reciprocal crosses. We think it is unlikely that this was due to mapping biases, since the CAST reference was produced from SNP calls against the C57 reference and was thus of lower quality, so any mapping bias would be expected to be in favor of C57. Moreover,  found a similar skew in adult samples of the same crosses. It is possible that the skew is the result of a selective bias in favor of C57 X-inactivated cells , possibly caused by one or more of the three mutations on the X-inactivation transcript (UCSC ID uc009tzp.1) or mutations in its promoter region.
The third grouping in the plot is on the lower-left to upper-right diagonal. These transcripts demonstrate consistent CAST/C57 differential expression regardless of the sex-strain combination of the parents, and are thus indicative of cis regulation.
One advantage of MMSEQ is that imbalances are assessed at the transcript level rather than for individual SNPs. Thus it is not necessary to set arbitrary thresholds on the numbers of heterozygotes or the magnitude and significance of the imbalances to make claims about transcript-level imbalances. Indeed, some of the transcripts that contain one or more heterozygote with a significant P-value but were not classified as 'consensus imprinted' by  are clearly shown to be imprinted by our results. Note however that 27 transcripts had significant heterozygotes with imbalances in opposing directions, demonstrating that it is not always appropriate to generalize from a single locus to make claims of imbalance at the transcript level (Figure 8 and Additional file 10).
MMSEQ estimates for H13 and Mcts2 isoforms in F1 hybrid samples. MMSEQ estimates for each haplotype and isoform of H13 and Mcts2of the initial and reciprocal crosses are shown.
We have presented a pipeline and statistical method that can disaggregate expression between isoforms and even between the two haplotypes of each isoform within an individual. MMSEQ produces improved isoform estimates compared to RSEM for medium to low expression transcripts, is more scalable, and estimates standard errors more efficiently. Furthermore, our principled approach to haplo-isoform quantification obviates the need for ad-hoc interpretations of SNP-by-SNP imbalances in terms of transcripts. Two aspects of our method, however, deserve further discussion.
MMSEQ aims to quantify the abundance of known transcripts, and as such relies on the comprehensiveness of the transcriptome's annotation. It is usually possible to align a very large proportion of the reads to Ensembl transcripts (approximately 75% in the HapMap study using Ensembl version 56). However, samples may contain previously unobserved genes or isoforms. MMSEQ can in such cases work in tandem with transcript discovery methods by adding newly predicted isoform sequences to the reference transcript FASTA file and using it in the alignment and mapping steps of the MMSEQ workflow.
Modeling biological variability
The Poisson distribution captures technical variability arising in repeated sequencing experiments with the same biological sample. The true expression value is, in effect, fixed by the experiment, and the only source of variability arises from measurement error and mapping uncertainty. However, between biological replicates such as different individuals in the HapMap study, there is, additionally, variability of a biological origin. As has been previously reported, this results in expression values between replicates that show overdispersion, captured, for example, by a negative binomial distribution .
Here we have focused on the problem of estimating the posterior distribution of expression values independently per sample. Nevertheless, it would be possible to add a further level to our Bayesian model to capture overdispersion across samples flexibly. For example, if exchangeable Gamma priors are set on the μ t , a suitable negative binomial model can be induced.
Phasing with paired-end data
In this work, we have phased genotype calls obtained from SAMtools pileups - an approach that works well with both single and paired-end data. However, in the case of paired-end data, the haplotypes observed directly at multiple SNPs spanned by overlapping read pairs could be used to increase confidence in the phasing calls. Although incorporating this information would benefit phasing estimates only for some sets of SNPs, we believe it is a worthwhile area of future research. As phasing is a distinct step in our pipeline, improved methodologies can be integrated flexibly as they become available.
RNA-seq is a promising and rapidly developing technology that provides sequence and expression intensity information of a sample in a single experiment. We have presented a novel pipeline and fast, scalable methodology to estimate expression of diploid organisms at the haplotype, isoform and gene levels. This allows researchers to go beyond allele-specific expression analysis and assess imbalance between paternal and maternal copies of isoforms, which in turn may be compared to differential isoform expression between individuals. We have shown that our method is able to deconvolve the expression of transcripts on each of two X chromosomes from human males in a pooled dataset, and that it can be successfully applied to detect genomic imprinting and cis-regulated transcripts in mouse hybrids. Our method retains reads that emanate from junctions as well as wholly within exons, models alignments to multiple transcripts, potentially across genes, exploits insert size information in paired-end data to choose the best alignments and flexibly incorporates corrective models for non-uniform read sampling. The pipeline, the MMSEQ software and related documentation are freely available online .
Materials and methods
which converges to the ML estimate of μ t . To initialize the algorithm, we set μ(0) equal to , which is equivalent to distributing k i evenly between cells of X i where M it is one. For a given region i, the probability of reads being allocated to a given transcript depends only on the μ t and not on s i (as the region is the same length on all transcripts). Hence, the s i do not appear in the update steps.
Bayesian model and Gibbs sampling
Again, the s i are not needed as they are absent from the full conditionals.
Gapped alignment to the genome is performed with TopHat. We use a GFF file (specified with -G) based on the Ensembl annotation. We set --no-novel-juncs, --min-isoform-fraction 0.0 and --min-anchor-length 3. The expected inner distance between mate pairs is specified with the -r switch.
SAMtools pileup settings
Genotypes output by SAMtools pileup were filtered using samtools.pl varFilter with default options and setting a minimum Phred-scaled probability of the genotype being identical to the reference ('SNP quality') threshold of 20.
Alignment to the transcriptome with Bowtie is performed with the -a --best switches, which ensure all the best alignments in terms of mismatches are produced. Additionally, we recommend using --strata to output only alignments with the minimum number of mismatches, although it currently has no effect on paired-end data. The minimum and maximum insert sizes should be set appropriately with the -I and -X switches respectively, as should --norc/--nofw for stranded protocols.
Utah residents with ancestry from northern and western Europe
Genome Analyzer II
Monte Carlo standard errors
reads per kilobase per million mapped reads
single nucleotide polymorphism
University of California: Santa Cruz.
This research was supported by BBSRC grant BBG0003521. We thank William Astle for constructive comments on the statistical model.
- McManus CJ, Coolon JD, Duff MO, Eipper-Mains J, Graveley BR, Wittkopp PJ: Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Res. 2010, 20: 816-825. 10.1101/gr.102491.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Gregg C, Zhang J, Weissbourd B, Luo S, Schroth GP, Haig D, Dulac C: High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science. 2010, 329: 643-648. 10.1126/science.1190830.PubMedPubMed CentralView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Richard H, Schulz MH, Sultan M, Nürnberger A, Schrinner S, Balzereit D, Dagand E, Rasche A, Lehrach H, Vingron M, Haas SA, Yaspo ML: Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res. 2010Google Scholar
- Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009, 25: 1026-1032. 10.1093/bioinformatics/btp113.PubMedPubMed CentralView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26: 493-500. 10.1093/bioinformatics/btp692.PubMedPubMed CentralView ArticleGoogle Scholar
- Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010, 38: e131-10.1093/nar/gkq224.PubMedPubMed CentralView ArticleGoogle Scholar
- Li J, Jiang H, Wong WH: Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010, 11: R50-10.1186/gb-2010-11-5-r50.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.PubMedPubMed CentralView ArticleGoogle Scholar
- Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009, 25: 3207-3212. 10.1093/bioinformatics/btp579.PubMedPubMed CentralView ArticleGoogle Scholar
- Heap GA, Yang JHM, Downes K, Healy BC, Hunt KA, Bockett N, Franke L, Dubois PC, Mein CA, Dobson RJ, Albert TJ, Rodesch MJ, Clayton DG, Todd JA, van Heel DA, Plagnol V: Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum Mol Genet. 2010, 19: 122-134. 10.1093/hmg/ddp473.PubMedPubMed CentralView ArticleGoogle Scholar
- Fontanillas P, Landry CR, Wittkopp PJ, Russ C, Gruber JD, Nusbaum C, Hartl DL: Key considerations for measuring allelic expression on a genomic scale using high-throughput sequencing. Mol Ecol. 2010, 19 (Suppl 1): 212-227. 10.1111/j.1365-294X.2010.04472.x.PubMedPubMed CentralView ArticleGoogle Scholar
- Novocraft. [http://novocraft.com]
- Bayesian Gene eXpression. [http://bgx.org.uk]
- Goncalves A, Tikhonov A, Brazma A, Kapushesky M: A pipeline for RNA-seq data processing and quality assessment. Bioinformatics. 2011Google Scholar
- Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464: 773-777. 10.1038/nature08903.PubMedView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
- Su SY, Asher JE, Jarvelin MR, Froguel P, Blakemore AIF, Balding DJ, Coin LJM: Inferring combined CNV/SNP haplotypes from genotype data. Bioinformatics. 2010, 26: 1437-1445. 10.1093/bioinformatics/btq157.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106-10.1186/gb-2010-11-10-r106.PubMedPubMed CentralView ArticleGoogle Scholar
- Law AM: Confidence intervals in discrete event simulation: a comparison of replication and batch means. Naval Res Logist Q. 1977, 23: 667-678. 10.1002/nav.3800240414.View ArticleGoogle Scholar
- UCSC Genome Browser. [http://genome.ucsc.edu]
- Gregg C, Zhang J, Butler JE, Haig D, Dulac C: Sex-specific parent-of-origin allelic expression in the mouse brain. Science. 2010, 329: 682-685. 10.1126/science.1190831.PubMedPubMed CentralView ArticleGoogle Scholar
- Puck JM, Willard HF: X inactivation in females with X-linked disease. N Engl J Med. 1998, 338: 325-328.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.