Maize DNA-sequencing strategies and genome organization
© BioMed Central Ltd 2004
Published: 16 April 2004
A large amount of repetitive DNA complicates the assembly of the maize genome sequence. Genome-filtration techniques, such as methylation-filtration and high-CoT separation, enrich gene sequences in genomic libraries. These methods may provide a low-cost alternative to whole-genome sequencing for maize and other complex genomes.
The maize and human genomes have similar sizes (2,500 and 3,200 megabases, respectively) and contain large amounts of repetitive sequence [1, 2]. But differences between the two genomes create unique challenges. The available data suggest that most maize repetitive sequences accumulated in the past six million years. Having had less time to diverge, they should be more conserved than human repetitive sequences, most of which are over 25 million years old. Plant genes, including maize genes, tend to be small; Arabidopsis and rice genes average between 2.4 and 5 kilobases [4–6], whereas human genes average about 27 kilobases. Identifying genes may therefore be easier in maize, but whole-genome sequence assembly may prove more difficult because of the degree of conservation of its repetitive sequences.
Completion of a draft rice genome sequence [5, 7] stimulated discussion on how to proceed with similar efforts for other crops. This discussion is tempered by an awareness of the difficulties to be faced with most crops. Plant genomes are usually large, composed largely of repetitive sequences, and often polyploid. The costs of whole-genome sequencing will be substantial. In 2001, the National Science Foundation (NSF) sponsored a workshop to discuss sequencing the maize genome in light of these realities. Out of these discussions came a strategy for using genome filtration as a low-cost alternative to fully sequencing the maize genome: sequence clones from libraries enriched for genes, and then place these sequences on genetic or physical maps.
Two genome-filtration techniques were proposed for enriching gene sequences in genomic libraries. The first technique uses 'high-CoT' libraries; in this approach renaturation kinetics (represented by the product of DNA concentration (Co) and time (T), CoT, at which renaturation occurs) are used to separate repetitive sequences from low-copy sequences. The low-copy DNA renatures more slowly than repetitive sequences, and this fraction is enriched for genes. The second technique, methylation filtration, is based on the tendency for repetitive sequences to be hyper-methylated in higher plants. Genomic libraries are constructed in Escherichia coli strains that have a functional McrBC restriction-modification system, which does not permit the propagation of heavily methylated DNA, thus excluding most repetitive sequences and enriching the library for gene-rich sequences. Among major cereal crops, maize has an intermediate-sized genome, whereas the genomes of wheat, barley and oat are much larger. Decisions made with maize will thus help determine how to proceed with sequencing other crop genomes. Two recent papers by Palmer et al. and Whitelaw et al. describe the application of genome filtration to maize.
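The separation that high-CoT selection relies on can be illustrated with the textbook model of ideal second-order renaturation, in which the fraction of DNA remaining single-stranded is 1 / (1 + k × CoT). The following Python sketch is only an illustration under that assumption; the function names and the two rate constants are hypothetical, chosen to show the principle, and are not values measured for maize or taken from the papers discussed here.

```python
def fraction_single_stranded(cot: float, k: float) -> float:
    """Fraction of a DNA component still single-stranded at a given CoT.

    Assumes ideal second-order kinetics: C/C0 = 1 / (1 + k * CoT),
    where CoT is the product of initial concentration and time.
    """
    if cot < 0 or k <= 0:
        raise ValueError("cot must be >= 0 and k must be > 0")
    return 1.0 / (1.0 + k * cot)


def cot_half(k: float) -> float:
    """CoT at which half of the component has renatured (equals 1 / k)."""
    if k <= 0:
        raise ValueError("k must be > 0")
    return 1.0 / k


# Hypothetical rate constants, for illustration only (not maize data).
# A repetitive component renatures quickly (large k); a low-copy
# component renatures slowly (small k).
K_REPETITIVE = 10.0
K_LOW_COPY = 0.01


def single_stranded_by_class(cot: float) -> dict:
    """Single-stranded fraction of each hypothetical component at one CoT."""
    return {
        "repetitive": fraction_single_stranded(cot, K_REPETITIVE),
        "low_copy": fraction_single_stranded(cot, K_LOW_COPY),
    }


if __name__ == "__main__":
    for cot in (0.1, 1.0, 10.0, 100.0):
        fractions = single_stranded_by_class(cot)
        print(f"CoT={cot:>6}: repetitive={fractions['repetitive']:.3f}, "
              f"low-copy={fractions['low_copy']:.3f}")
```

With these illustrative constants, at an intermediate CoT almost all of the repetitive component has reannealed while most of the low-copy component is still single-stranded; that single-stranded fraction is the one a high-CoT library would be built from.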
The Whitelaw et al. paper compared genome filtration with random genomic shotgun sequencing. From the random library, 73% of 34,074 sequences were identified as repetitive. In contrast, 35% of the 95,233 methylation-filtered and 21% of 100,000 high-CoT sequences were repetitive. Over 900,000 sequence reads of the latter two libraries have now been completed and deposited in a public database. The high-CoT and methylation-filtered clone sequences were found to be enriched for sequences related to known plant genes. For example, 13% of methylation-filtered and 11% of high-CoT sequences were similar to known plant expressed sequence tags (ESTs), whereas only 4% of sequences from random libraries were similar. Palmer and coworkers developed an independent set of approximately 100,000 methylation-filtered sequences, and found that 8.6% of these exhibited sequence similarity to their gene database, while 24% of them matched a known repetitive sequence. They additionally showed that rates of new gene discovery per sequence read were similar for EST and methylation-filtration libraries.
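The percentages quoted above from the Whitelaw et al. comparison can be turned into rough fold-enrichment figures relative to the random library. The Python sketch below does only that arithmetic; the dictionary and function names are hypothetical, and the ratios are simple quotients of the reported percentages, not statistics reported by the authors.

```python
# Percent of sequences identified as repetitive in each library (from the text).
REPETITIVE_PERCENT = {"random": 73, "methylation_filtered": 35, "high_cot": 21}

# Percent of sequences similar to known plant ESTs (from the text).
EST_MATCH_PERCENT = {"random": 4, "methylation_filtered": 13, "high_cot": 11}


def fold_change(percent_by_library: dict, baseline: str = "random") -> dict:
    """Ratio of each library's percentage to the baseline library's percentage."""
    base = percent_by_library[baseline]
    return {
        name: value / base
        for name, value in percent_by_library.items()
        if name != baseline
    }


def non_repetitive(percent_repetitive: dict) -> dict:
    """Percent of sequences not identified as repetitive in each library."""
    return {name: 100 - value for name, value in percent_repetitive.items()}


if __name__ == "__main__":
    est_fold = fold_change(EST_MATCH_PERCENT)
    print(est_fold)  # {'methylation_filtered': 3.25, 'high_cot': 2.75}

    non_rep_fold = fold_change(non_repetitive(REPETITIVE_PERCENT))
    for name, ratio in non_rep_fold.items():
        print(f"{name}: {ratio:.2f}-fold more non-repetitive sequence than random")
```

On these figures, both filtered libraries show roughly a threefold increase in EST-matching sequences over the random library.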
An earlier study suggested that methylation-filtration can detect 95% of maize exons, and analyses in the two recent papers [9, 10] suggest that most maize genes may be captured by filtration. These predictions are, however, based on detecting typical polypeptide-encoding genes. Will enrichment techniques capture genes encoding very small proteins or small RNAs? Tandem duplications, which are common in plant genomes, are another concern [4, 6]. Will filtration be able to distinguish between copies, including those that have evolved distinct functions? It is possible that genome filtration could miss a number of genes.
There are, however, reasons for optimism. First, sequences for genes encoding small polypeptides or RNAs could be among the uncharacterized sequences found in the filtered libraries. After sequencing reads were assembled into contigs, 63% of high-CoT assemblies and 39% of methylation-filtration assemblies had no significant matches to a gene or repeat sequence in the database at The Institute for Genomic Research [10, 11]. Second, the methylation-filtration and high-CoT techniques sample from partially different fractions of the maize genome. It was estimated that of all the sequences sampled in the methylation-filtration and high-CoT libraries, approximately one-third were recovered by both approaches. Using both techniques thus samples a greater fraction of the genome, and it seems possible that genes encoding microRNAs and small polypeptides will be captured by one or the other technique.
The application of genome filtration for sequencing the maize genome would require the mapping of sequences onto physical or genetic maps, as noted at the NSF workshop. How this mapping step is carried out will be a critical decision. As positional cloning is likely to be a major use of the mapped sequences, high-resolution map data are desirable. Placing sequences onto maps derived from bacterial artificial chromosome (BAC) contigs, by hybridization or low-pass sequencing, would be appropriate. Genome filtration may prove to be most effective when a closely related species has already been sequenced, because synteny between species can then provide the positional information. Studies of cereal genomes suggest that rice is not sufficiently related to maize to adequately fill this gap in genome information [13, 14]. In this light, synteny to important crops, in addition to genome size, may be an important criterion for selecting model species to sequence in the future.
Enrichment may not be an appropriate approach for all species. Methylation filtration has worked well in maize because plant genes are largely unmethylated. Furthermore, there is little repetitive sequence within plant genes themselves that could interfere with high-CoT selection, the exception being MITEs (miniature inverted-repeat transposable elements), which are very small and usually poorly conserved. Plant transcription units tend to be small [4–6], and their regulatory regions are compact. A wealth of experience with transgene constructs in plants demonstrates that in general only a few kilobases of flanking sequence are required for tissue and developmental regulation, although exceptions do exist. For instance, the maize P1 gene promoter is unusually large, extending 5 kilobases upstream of the transcription start site. Gene and genome organization must be considered before applying genome-filtration techniques to other organisms.
If funding becomes available, there are strong reasons for sequencing the entire maize genome. Access to the hundreds of mutations isolated over the past 75 years is one compelling reason. The agronomic importance of maize, in the United States and other countries, is another. A complete sequence of the maize genome would provide researchers with gene sequences, regulatory sequences, precise positional information, and markers for high-resolution mapping. These are the obvious reasons for whole-genome sequencing, but others may in fact prove more rewarding. We now know that different maize lines do not have identical complements of genes. In one region sequenced from two lines, four of the ten genes present in one line were absent from the other. Tandem duplications provide an opportunity for gene number to increase or decrease within pedigrees [18, 19], and duplication allows epigenetic regulation of gene expression [19, 20]. Perhaps epigenetic interactions and variation in gene content underlie heterosis, whereby hybrids show increased vigor compared to their parents. This, together with the long breeding records and extraordinary genetic variation in maize, provides very special opportunities. Genome filtration coupled with mapping provides, relatively inexpensively, much of the same information that can be found in a complete genome sequence. But a full genome sequence provides a much broader foundation for exploring the complete genome.