One approach for capturing more of a species’ genetic variation is to generate multiple high-quality reference sequences (RefSeqs) from accessions that represent various subpopulations and then to use these RefSeqs to assemble a ‘pan-genome’. The pan-genome can then be probed with population re-sequencing data to pinpoint single nucleotide polymorphisms (SNPs) and haplotype and structural variation within the whole species.
A clever and cost-effective alternative, developed by a group from Huazhong Agricultural University in Wuhan, China, led by Weibo Xie [1], skips the expensive step of generating multiple high-quality RefSeqs and goes directly to the root of the problem. Briefly, Yao et al. [1] used low-coverage (1–2.5×) re-sequencing data from 1483 cultivated rice accessions, which they divided into two subpopulations: indica (containing the indica and aus subgroups) and japonica (containing the temperate and tropical subgroups of japonica). These data were mapped to the IRGSP RefSeq and to three additional Oryza assemblies (one Sanger and two Illumina) of varying quality. The reads that did not map to the RefSeq were subsequently assembled. This resulted in an indica assembly of ~52 k contigs (N50 = 2344 bp) and a japonica assembly of ~30 k contigs (N50 = 2219 bp), henceforth tagged the ‘dispensable genome’, or the non-essential genome, of cultivated rice. These assemblies contain sequences that are not common to all members of the species, or that are unique to one individual. Only ~7700 contigs overlapped between the indica and japonica assemblies, leading the authors to suggest that the majority of the contigs in each dispensable genome were subpopulation specific.
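For readers unfamiliar with the N50 statistic quoted for these assemblies, the following minimal sketch shows how it is conventionally computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length. The function name and the toy contig lengths are illustrative assumptions and are not taken from Yao et al. [1].

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths, in bp): the total is 54, and the
# three longest contigs (10 + 9 + 8 = 27) reach half of it.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)
```

Because N50 depends only on the distribution of contig lengths, it says nothing about correctness of the assembly, which is why independent validation of the contigs is still needed.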
To validate their contig assemblies, Yao et al. [1] used several approaches. One was PCR amplification with 43 primer pairs designed from randomly chosen contigs, which generated amplicons of the expected sizes in the corresponding accessions. For example, amplification using an indica-specific primer pair resulted in an amplicon from an indica accession and no amplification from the japonica reference accession.
Annotation of the dispensable genome contig assemblies revealed 6000 japonica and 8900 indica protein-coding genes, of which 1120 and 1913, respectively, were annotated as ‘high-confidence genes’ on the basis of expression and/or homology. Not surprisingly, ~30 % of each dispensable genome was composed of transposable elements.
Because the dispensable genome assemblies were derived from 1483 accessions, the sequence data were derived from many different haplotypes (that is, groups of SNPs that are linked or inherited together in single accessions). To resolve these haplotypes from the dispensable genome of indica, each contig was reassembled locally under more stringent conditions, and 70 % of the reassembled contigs produced between four and seven haplotypes. This result demonstrates that contig-level haplotype information can be harvested from a conglomeration of low-coverage population-level re-sequencing data.
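To make the notion of a contig-level haplotype concrete, here is a deliberately simplified sketch. It is not the local reassembly procedure used by Yao et al. [1]; instead it assumes, hypothetically, that each accession already has allele calls at the SNP positions of one contig, written as a string with '-' marking a position with no read coverage. Distinct complete strings are then counted as distinct haplotypes. The function name, the input layout, and the accession labels are all illustrative assumptions.

```python
from collections import Counter


def count_haplotypes(calls_by_accession, missing="-"):
    """Count distinct haplotypes on one contig.

    calls_by_accession maps an accession name to a string of allele
    calls, one character per SNP position on the contig. Accessions
    with any missing call are skipped, since with low-coverage data
    an incomplete string cannot be assigned to a single haplotype.
    """
    complete = [
        calls
        for calls in calls_by_accession.values()
        if missing not in calls
    ]
    return Counter(complete)


# Hypothetical calls at four SNP positions for four accessions.
example_calls = {
    "acc1": "ACGT",
    "acc2": "ACGT",
    "acc3": "ATGT",
    "acc4": "A-GT",  # incomplete, so excluded from the count
}
haplotype_counts = count_haplotypes(example_calls)
n_haplotypes = len(haplotype_counts)
```

In this toy input, two haplotypes are observed among the three accessions with complete calls. Real low-coverage data would leave most accessions with missing positions, which is one reason pooling reads across many accessions, as described above, is needed to recover haplotypes at all.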