High resolution discovery and confirmation of copy number variants in 90 Yoruba Nigerians
© Matsuzaki et al.; licensee BioMed Central Ltd. 2009
Received: 20 May 2009
Accepted: 9 November 2009
Published: 9 November 2009
Copy number variants (CNVs) account for a large proportion of genetic variation in the genome. The initial discoveries of long (> 100 kb) CNVs in normal healthy individuals were made on BAC arrays and low resolution oligonucleotide arrays. Subsequent studies that used higher resolution microarrays and SNP genotyping arrays detected the presence of large numbers of CNVs that are < 100 kb, with median lengths of approximately 10 kb. More recently, whole genome sequencing of individuals has revealed an abundance of shorter CNVs with lengths < 1 kb.
We used custom high density oligonucleotide arrays in whole-genome scans at approximately 200-bp resolution, and followed up with a localized CNV typing array at resolutions as close as 10 bp, to confirm regions from the initial genome scans, and to detect the occurrence of sample-level events at shorter CNV regions identified in recent whole-genome sequencing studies. We surveyed 90 Yoruba Nigerians from the HapMap Project, and uncovered approximately 2,700 potentially novel CNVs not previously reported in the literature having a median length of approximately 3 kb. We generated sample-level event calls in the 90 Yoruba at nearly 9,000 regions, including approximately 2,500 regions having a median length of just approximately 200 bp that represent the union of CNVs independently discovered through whole-genome sequencing of two individuals of Western European descent. Event frequencies were noticeably higher at shorter regions < 1 kb compared to longer CNVs (> 1 kb).
As new shorter CNVs are discovered through whole-genome sequencing, high resolution microarrays offer a cost-effective means to detect the occurrence of events at these regions in large numbers of individuals in order to gain biological insights beyond the initial discovery.
Genetic differences between individuals occur at many levels, starting with single nucleotide polymorphisms (SNPs), short insertions and deletions of several nucleotides (indels), and extending out to copy number variants (CNVs) that span several orders of magnitude in length. A thorough cataloging of genetic variations in the human genome is well underway, as evidenced by the HapMap Project and 1,000 Genomes Project, and data repositories such as dbSNP and the Database of Genomic Variants (DGV). The ability to genotype large numbers of individuals in various study cohorts at large numbers of known loci has in turn led to significant associations between specific genetic differences and phenotypic differences, which often manifest as complex disorders. Recent notable studies have associated SNP markers with bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, and type 2 diabetes, and CNVs with autism and schizophrenia [8–10].
Progressively higher resolution microarrays, starting with earlier low resolution bacterial artificial chromosome (BAC) arrays followed by commercially available array comparative genome hybridization (CGH) and SNP genotyping arrays, have steadily driven the discovery of new CNVs and have refined the boundaries of earlier reported CNVs. Specifically, the earliest CNVs described by Sebat et al. and Iafrate et al., using BAC arrays and lower resolution oligonucleotide arrays, had median lengths of approximately 222 kb and approximately 156 kb, respectively. Later, Redon et al. used both BAC arrays and SNP genotyping arrays from Affymetrix to report CNVs with median lengths of approximately 234 kb and approximately 63 kb, respectively. More recent examples are the Perry et al. study, which used Agilent high resolution CGH arrays, the McCarroll et al. study, which used the Affymetrix SNP 6.0 array, and the Wang et al. study, which used data from Illumina BeadChips. The Perry et al. study examined known regions in the DGV (November 2006) at approximately 1 kb resolution, and refined the lengths of over 1,000 CNVs to a revised median length of approximately 10.2 kb. The Wang et al. study analyzed genome-wide SNP genotype data having median inter-SNP distance of approximately 3 kb from over a hundred individuals to detect CNVs having median lengths of approximately 12 kb. The McCarroll et al. study examined the entire genome (as represented in the whole-genome sampling of NspI and StyI restriction fragments) at approximately 2-kb resolution, and reported > 1,300 CNVs having a median length of approximately 7.4 kb.
In this study, we set out to demonstrate the benefits, as well as limitations, of Affymetrix oligonucleotide arrays with higher resolution than previously available arrays, first in unbiased whole-genome scans to discover CNV regions, and subsequently in localized regions to determine sample-level CNV calls. Our custom arrays were manufactured using standard Affymetrix processes, but with phosphoramidite nucleosides bearing an improved protecting group to provide for more efficient photolysis and chain extension, which enabled the synthesis of longer probes. We first used our genome-scan arrays to examine the entire genome with uniform coverage at a resolution of approximately 200 bp. We designed a set of three custom oligonucleotide whole-genome scan arrays that span the entire non-repetitive portion of the human genome. Each of the genome-scan arrays consists of over 10 million 49-nucleotide long probes that are spaced at a median distance of approximately 200 bp apart along the chromosomes. The set of 90 Yoruba Nigerians from the HapMap Project was chosen for the scans because they represent an anthropologically early population likely to harbor a fair proportion of common and older CNVs, similar to the occurrence of common SNPs. A number of previous CNV studies also used some or all of the Yoruba individuals, making it possible to compare event calls reported in the literature with those observed in our work. Additionally, because the 90 Yoruba individuals are each members of 30 family trios, inheritance patterns of the observed and reported events can serve as measures of accuracy and event call completeness.
A fourth custom oligonucleotide array was designed to confirm putative CNV regions identified from the initial genome scans, as well as subsets of CNVs reported in the DGV (November 2008), including those reported by Perry et al., Wang et al., and McCarroll et al., and to determine sample-level event occurrence. Additionally, we were particularly interested in observing events in the 90 Yoruba at shorter CNVs discovered through the whole-genome sequencing of two individuals. The design of our CNV-typing array prioritized CNVs reported in the landmark Levy et al. and Wheeler et al. studies, which contributed the initial whole-genome sequences of two individuals of Western European descent. Since the Bentley et al. and Wang et al. studies were added to the DGV after the design of the CNV-typing array, the shorter regions discovered by whole-genome sequencing of one of the Yoruba and an Asian were not included. The CNV-typing array consists of approximately 2.4 million 60-nucleotide long probes concentrated at the known and putative CNVs, at variable spacing as close as 10 bp apart.
Our arrays are essentially tiling designs with probe sequences picked from the reference genome (build 36), and are more similar to early BAC and Agilent CGH arrays than to recent genotyping arrays, such as the Affymetrix SNP 6.0 or the Illumina BeadChips, which generate allele-specific signals (with the exception of subsets of non-genotyping copy number probes). To observe copy number events on our arrays, we processed our probe signals with circular binary segmentation (CBS), a CNV detection algorithm originally developed for BAC arrays but also suitable for our tiling arrays.
Table 1. Summary of putative and confirmed CNVs. Row and column labels: CBS all probes; Confirmed high confidence; Putative high confidence; Number of CNVs; % of parent set; Median length in DGV; Median length novel.
Of the 3,850 putative CNVs having events observed in at least two individuals (defined as high confidence), approximately 67% overlapped at least one record in the DGV (March 2009), while only approximately 44% of the remaining regions having an event in only one individual (singletons) overlapped a DGV record (Table 1). Overlap is defined as greater than 5% of a putative region coinciding with a DGV record, not including inversions and records with lengths less than 100 bp. The minimum requirement of 5% overlap with DGV records was set low to accommodate a wide range of differences in resolutions between previous studies and our genome-scan. Since the union of DGV records (March 2009) covers a fair proportion of the genome (approximately 30%), a > 5% overlap does not necessarily validate a region, but serves as a starting point for comparison with previous studies. The high resolution of the genome-scan arrays revealed several instances of multiple smaller CNVs lying within regions that were earlier reported as one longer CNV in studies using lower resolution methods. Two such examples are shown in Figure S2 in Additional data file 1; the first is a 200-kb region with at least four CNVs and the second is a 20-kb region with two CNVs. These example regions overlap multiple DGV records from earlier studies such as Redon et al., and more recent higher resolution studies such as Perry et al. The putative CNVs observed in the 90 Yoruba more closely match the shorter DGV records from the newer studies (Figure S2 in Additional data file 1).
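The overlap criterion stated above can be illustrated with a short sketch. This is not the authors' pipeline: the `Record` layout, the function names, and the choice to test each DGV record individually (rather than the union of records) are assumptions made for illustration only.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical DGV-style record on one chromosome (1-based, inclusive)."""
    start: int
    end: int
    is_inversion: bool = False


def overlap_fraction(region_start: int, region_end: int, rec: Record) -> float:
    """Fraction of the putative region's length that coincides with one record."""
    shared = min(region_end, rec.end) - max(region_start, rec.start) + 1
    region_len = region_end - region_start + 1
    return max(shared, 0) / region_len


def overlaps_dgv(region_start: int, region_end: int, records: Iterable[Record],
                 min_fraction: float = 0.05, min_record_len: int = 100) -> bool:
    """True if any eligible record covers more than min_fraction of the region.

    Inversions and records shorter than min_record_len are skipped,
    mirroring the exclusions described in the text.
    """
    for rec in records:
        if rec.is_inversion:
            continue
        if rec.end - rec.start + 1 < min_record_len:
            continue
        if overlap_fraction(region_start, region_end, rec) > min_fraction:
            return True
    return False
```

Because the threshold is a strict "greater than 5%", a 1,000-bp putative region that shares exactly 50 bp with a record is not counted as overlapping in this sketch, while one sharing 51 bp is.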
These PCR results provided some assurance that the genome scans had relatively low false discovery rates for CNV regions; however, because of the stringent requirements applied to call an event, a noticeable false-negative observation rate was also demonstrated. PCR tests were performed on Yoruba DNAs selected in pairs, whereby an event was observed in one DNA but not the other on the genome-scan arrays. However, the patterns of bands in the PCR gels showed cases of actual losses or gains in 'non-event' DNAs (Figure 2; Additional data file 3). At three regions where truncated PCR amplicons from 'non-event' DNAs were excised and sequenced (including the CNV shown in Figure 2), the deletions mapped to the exact same breakpoints as in the event DNAs (Table S3 in Additional data file 1). For qPCR, out of 16 selected gain events tested, 9 were confirmed and 3 were ambiguous, but 4 showed clear evidence of homozygous deletions in the 'non-event' DNA rather than gains in the 'event' DNA (Table S5 in Additional data file 1). Similar to the gel-based PCRs, the qPCR results confirmed a fair proportion of putative regions, but also demonstrated that event calls in many individuals were missed.
Because the primary objective of the genome scans was CNV region discovery, we set stringent requirements for event detection that prioritized low false discovery of regions at the expense of sensitivity to observe sample-level calls at those regions. Once CNV regions had been identified in the genome scans, we focused on designing a new array more suited to generating sensitive and reliable sample-level calls, so that array space occupied on the genome-scan arrays by probes residing outside of CNV regions could be put to better use. To optimize array design parameters that would increase sample-level call sensitivity, we designed a small test array with variable probe lengths from 39 to 69 nucleotides, variable probe feature sizes, and 5 replicates of each unique probe, at 150 arbitrarily chosen regions, of which 105 were putative CNVs from the genome scan and the remainder were records from the DGV. Filters were not applied to the choice of probe sequences for the test array, which included probes that overlapped any known repetitive regions, including Alu elements. Results from a subset of 12 Yoruba individuals on the small test array suggested the use of 60-nucleotide long probes at 5 micron pitch, with 3 replicates per probe, and the inclusion of probes in repetitive regions, with the exception of Alu elements (data not shown). Probes on the test array corresponding to nearly all Alu elements were not responsive to copy number differences, while probes at other repetitive regions had variable responses that ranged from no change (similar to Alus), through reduced response, to full response (similar to non-repetitive regions), with no clear correlation to the class of repeat elements (data not shown). Based on the test array findings, the CNV-typing array was designed to have higher sensitivity for event detection, and includes probes corresponding to repetitive regions (other than Alu elements).
Using data from the CNV-typing array, a thorough study of the possible relationships between repeat elements and CNVs is also possible, but is beyond the scope of the current work.
There were approximately 98,000 events observed at the putative CNVs across the 90 Yoruba on the CNV-typing array. Nearly 97% (6,368) of the putative CNV regions discovered in the genome scans were confirmed to have at least one observed event on the CNV-typing array (Table 1). The high confidence putative CNVs had a higher confirmation rate of approximately 99% compared to the singletons (approximately 94%), suggesting a degree of specificity in the region confirmations. Integer copy number event calls, where 0 is homozygous loss, 1 is one copy heterozygous loss, and 3 or more are gain events, were based on CBS at thresholds determined by comparison to reference calls. The reference calls were primarily from the McCarroll et al. study, which used the Affymetrix SNP 6.0 genotyping array to determine event calls at approximately 1,300 CNVs in 270 individuals from the HapMap Project, including the 90 Yoruba. The validation PCRs (discussed above) were a secondary reference set. Comparisons with the reference calls provided a measure of event sensitivity, and a subset of CNVs that had no events among the Yoruba in the McCarroll et al. study provided an estimate of event specificity (see Materials and methods). Sample-level event calls in the 90 Yoruba individuals at the confirmed CNVs, and at CNVs from the McCarroll et al. study, are listed in Additional data files 6 and 7, respectively. Often an individual had two or more event segments within a putative region; this was either because event segments were split by intervening repeat elements, where probes were not responsive to copy number differences, or because the region is complex, having two or more smaller CNVs within a narrow region. Split event segments within a region were treated as one event call if the direction of the multiple segments was consistently all loss or all gain in an individual.
On the other hand, complex regions were identified wherever a loss and gain event was observed within a region in the same individual. Complex regions are annotated in Additional data file 2. The positions of the confirmed CNVs listed in Additional data file 2 are based on the first and last positions of event segments detected among individuals.
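The rule for combining split event segments and flagging complex regions can be sketched as follows. This is a minimal illustration under stated assumptions: a diploid baseline of 2 copies, integer copy numbers already assigned to each segment, and hypothetical function and label names that do not come from the original analysis.

```python
from typing import List, Optional

NORMAL_COPY_NUMBER = 2  # assumed diploid baseline (autosomes)


def direction(copy_number: int) -> Optional[str]:
    """Map an integer copy number to 'loss', 'gain', or None (no event)."""
    if copy_number < NORMAL_COPY_NUMBER:
        return "loss"   # 0 = homozygous loss, 1 = one-copy loss
    if copy_number > NORMAL_COPY_NUMBER:
        return "gain"   # 3 or more copies
    return None


def summarize_region(segment_copy_numbers: List[int]) -> str:
    """Collapse one individual's segments within one region into a single label.

    Segments that all point the same way become one 'loss' or 'gain' call;
    a mix of loss and gain segments marks the region as 'complex'.
    """
    directions = {direction(cn) for cn in segment_copy_numbers} - {None}
    if not directions:
        return "no_event"
    if len(directions) == 1:
        return directions.pop()
    return "complex"
```

For example, segments with copy numbers 1 and 0 in the same region collapse to a single loss call, whereas segments with copy numbers 1 and 3 flag the region as complex for that individual.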
Table 2. Comparison of Yoruba event calls. Column labels: % call agreement; Events in study; % study compared; Events in our work; % our work compared. Studies compared: Bentley et al. 2008; Kidd et al. 2008; Korbel et al. 2007; McCarroll et al. 2008; Perry et al. 2008; Wang et al. 2007.
Since the 90 Yoruba are each members of 30 family trios, we examined the inheritance of events from parents to children. The majority of copy number polymorphisms are inherited, rather than rare de novo occurrences. The observations of events in children but not in either of the parents are due to false-positive observation in the child, or false-negative detection in either or both of the parents, with only a very small proportion likely to be true de novo events. The approximately 98,000 event calls at 6,368 confirmed CNVs across the 90 Yoruba were grouped by the 30 family trios. Of the total observed events, approximately 10,500 (10.8%) were observed in only the children of trios. The same 30 trios were also part of the McCarroll et al. study, in which there were approximately 7,800 reported events (along with approximately 1,600 no_calls) at 859 CNVs in the Yoruba, of which only 25 (0.3%) events were observed in only the children. The 36 Yoruba genotyped in the Wang et al. study are members of 12 of the trios, in which approximately 1,110 events were reported, of which 13 (1.2%) were observed only in children. The event calls in the McCarroll et al. study benefited from having two fully replicated data sets of 270 individuals run independently in separate laboratories, as well as manual curation of scatter plots that were used to cluster the samples into estimated copy number classes. The sensitivity and specificity of event calls in the Wang et al. study benefited from the direct use of the family trio information in the calling algorithm, which markedly reduced the observations of what Wang et al. referred to as CNVs inferred in offspring but not detected in parents (CNV-NDPs).
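The child-only tally used in this inheritance check can be sketched as below. The data layout (a mapping from sample ID to the set of CNV regions with an event) and the function name are assumptions for illustration; the sketch also ignores the direction of each event and any no_calls, which a fuller analysis would need to handle.

```python
from typing import Dict, Iterable, Set, Tuple

Trio = Tuple[str, str, str]  # (father, mother, child) sample IDs


def child_only_events(calls: Dict[str, Set[str]],
                      trios: Iterable[Trio]) -> Tuple[int, int]:
    """Return (events seen only in children, total events across trio members)."""
    child_only = 0
    total = 0
    for father, mother, child in trios:
        father_events = calls.get(father, set())
        mother_events = calls.get(mother, set())
        child_events = calls.get(child, set())
        total += len(father_events) + len(mother_events) + len(child_events)
        # An event in the child with no matching event in either parent
        child_only += len(child_events - (father_events | mother_events))
    return child_only, total


# Hypothetical toy data: one trio, CNV region IDs are made up
example_calls = {"F1": {"cnv1", "cnv2"}, "M1": {"cnv2"}, "C1": {"cnv1", "cnv3"}}
example_result = child_only_events(example_calls, [("F1", "M1", "C1")])
```

Applied to the counts quoted above, the same ratio gives 25 / 7,800 (about 0.3%) for the McCarroll et al. calls and 13 / 1,110 (about 1.2%) for the Wang et al. calls.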
Table 3. Three-way comparison of event calls in trios. Labels: McCarroll et al. (2008); Wang et al. (2007); Agree with reference.
Events at CNVs discovered by whole-genome sequencing
Table 4. Summary of events at CNV regions discovered by sequencing. Row and column labels: Confirmed < 1 kb; Levy + Wheeler; with YRI events; % with events; Events per region; Homozygous loss (0); One copy loss (1); One copy gain (3); Multiple gains (4+).
That our high resolution genome scans of the 90 Yoruba uncovered as many as 2,690 potentially new CNVs with a median length of approximately 3.0 kb suggests that there are many more CNVs yet to be discovered on the shorter end of the size range. Because of the high resolution of our genome-scan arrays, we were able to delineate neighboring multiple smaller CNVs at regions earlier reported as single larger CNVs, as illustrated in Figure S2 in Additional data file 1. Perry et al.  observed and validated other such instances of multiple CNVs in close proximity, and describe these cases as architecturally complex CNV regions. The tight correlation between observed event lengths and actual lengths determined by PCR and breakpoint sequencing (Figure 2b) reflects fairly accurate breakpoint mapping of events in the approximately 1 to 2 kb range, and suggests, by extrapolation, accuracy in longer ranges. Of the 16 CNVs confirmed by PCR and breakpoint sequencing, six were exact matches to DGV records reported by sequencing-based methods (Table S3 in Additional data file 1). Specifically, three of the six matched records from the Mills et al.  study, which mapped publicly available sequencing trace data, two matched records from the Wheeler et al.  study, which whole-genome sequenced an individual of Western European descent, and one matched a record from the Bentley et al.  study, which sequenced a different Yoruba. These are instances of the same exact events occurring in different individuals of varying ethnicities, and likely represent older CNVs that have taken root in the genome. The whole-genome sequencing data generated by sequencing more individuals, such as in the 1000 Genomes Project, will undoubtedly produce a more thorough catalog of shorter CNVs in the genome, including an assessment of the age of these variations.
Even at a resolution of approximately 200 bp, our genome scan detected only a fraction of the CNVs reported in whole-genome sequencing studies (Levy et al.  and Wheeler et al.  studies at 9% and 22%, respectively). Our inability to detect shorter (< 1 kb) CNVs shows one limitation of using microarrays, although continued advances in array manufacturing technology could further increase array probe density in the future. In the meantime, a viable approach is to rely on DNA sequencing for CNV region discovery in limited numbers of samples, and follow up with microarrays for localized sample-level event detection across larger sample sets as we have done here. Shorter (< 1 kb) regions that were identified in our genome-scan, such as the example shown in Figure 2, were often instances of homozygous deletions, which manifest stronger event segments. In contrast, one-copy-loss events give weaker segments that were often missed, but are likely to occur more frequently than homozygous deletions. These instances of false-negative CNV discovery, particularly in shorter regions with rare event frequencies, could be mitigated by using an improved genome-scan array design with longer probes and the inclusion of multiple replicates of each probe, just as we have demonstrated for the CNV-typing array. In contrast, the higher overlap (91%) between putative CNVs and the generally longer CNVs (median length approximately 7.4 kb) from the McCarroll et al.  study suggests that the genome scan captured a fair proportion of CNVs > 1 kb. We were able to observe events at approximately 97% of the putative CNVs from the genome scans on the CNV-typing array. The low false-positive rate of putative CNVs on the typing array, and the fairly successful PCR validation, are consistent with the stringent requirement of having had to observe events in at least two of three genome-scan array designs, which served as technical replicates. 
To reduce noise from probe to probe intensity variations, the CNV-typing array has each unique probe placed in triplicate at scattered locations on the array, and the signals from the triplicate probes were summarized by median polish. The example segmentation results shown in Figure 2 illustrate the reduction in noise on the typing array. In addition to the triplicate probes, the CNV-typing array has improved sensitivity for event detection by the use of 60-nucleotide long probes compared to non-replicated 49-mer probes on the genome scan arrays.
The disparity in agreement of sample-level event calls and matched regions between our work and previous studies (Table 2) may be due in part to sampling differences, which ranged from only one Yoruba individual in common up to 90 individuals; but it more likely reflects underlying differences in specificity and sensitivity, together with genome coverage biases inherent in the various methods, including our own. These differences are also apparent in pair-wise comparisons among the six previous studies (Table S6 in Additional data file 1), and point to the difficulty in determining absolute accuracy and event call completeness. An examination of the inheritance of events from parents to children among the Yoruba trios in our work, along with events reported among trios in the McCarroll et al. and Wang et al. studies, provided an assessment of false-positive and false-negative rates of event detection. Although slightly higher, the specificity of event detection on the CNV-typing array was comparable to the previous studies, and may be underestimated because of the higher resolution; on the other hand, the sensitivity to detect events was noticeably lower (Table 3). The majority of events observed only in the children of trios were due to missed events in the parents. The sensitivity could improve with the availability of additional and replicate data sets and manual curation of intermediate results, or the use of family trio information, as was likely the case in the McCarroll et al. and Wang et al. studies, respectively. That these two previous studies also showed varying degrees of false negatives and false positives, and the low proportion of CNVs in common between the Levy et al. and Wheeler et al. sequencing studies (Table 4), reinforces the benefit of building a consensus from multiple studies.
As more sample-level data become available, particularly from whole-genome sequencing and higher resolution microarray-based studies, many of the discrepancies in the inter-reference comparisons (Table 2; Table S6 in Additional data file 1) should be resolved through higher confidence consensus among methods and studies.
Events observed in the 90 Yoruba showed higher frequencies at shorter CNVs compared to longer CNVs (> 1 kb; Figure 8). The higher frequencies are consistent with expectations that events in shorter regions are under less selective pressure than at longer regions, which are more likely to be deleterious. The differences in the cumulative event frequencies, even under stringent specificity thresholds (Figures 6 and 7), are likely a reflection of differences in the di-deoxy and 454 polony sequencing methods used in the Levy et al. and Wheeler et al. studies, respectively, and suggest that the CNV-typing array is sensitive enough to detect some subtle characteristic differences inherent in the regions discovered in the two separate studies.
Recent studies using high resolution microarrays and whole-genome sequencing have made major inroads toward a complete catalog of CNVs in the human genome. Our work demonstrated the use of even higher resolution microarrays to uncover approximately 2,700 potentially new CNVs, and to observe events in 90 Yoruba at regions discovered by whole-genome sequencing of single individuals. The approximately 3,300 shorter regions (< 1 kb) examined in our current work are likely just a fraction of what will eventually be discovered through sequencing more individuals. In the near term, high resolution microarrays offer a cost-effective means to confirm these shorter CNVs, and type large numbers of individuals in order to gain biological insights beyond the initial discovery.
Materials and methods
Arrays were synthesized following standard Affymetrix GeneChip manufacturing methods utilizing contact lithography and phosphoramidite nucleoside monomers bearing photolabile 5'-protecting groups. Fused-silica wafer substrates were prepared by standard methods with trialkoxy aminosilane as previously described. An improved 5'-protecting group provided for more efficient photolysis and chain extension, and therefore fewer truncated probe sequences. The genome-scan arrays and CNV-typing array required 141 and 179 synthesis steps, respectively, resulting in 3'-immobilized DNA probes of 49 and 60 nucleotides in length. After the final lithographic exposure step, the wafer was de-protected in an ethanolic amine solution for a total of 8 hours prior to dicing and packaging.
Candidate 49-mer probe sequences for the three genome-scan array designs were chosen from the non-repetitive regions of the genome, and filtered for extraneous matches to the genome in the central 16 nucleotides, resulting in a total of 32 million unique probes. Rather than placing probes sequentially across the three arrays, probes were dispersed such that every second and third probe against the genome was placed on separate arrays (Figure S1A in Additional data file 1). Because probes are interdigitated across the three designs, the interval in any one design between the center positions of neighboring probes is generally 147 bp (the combined length of three probes). However, because probes were filtered out at repetitive regions throughout the genome, the overall median interval between neighboring probes on the genome-scan arrays is 196 bp.
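The 147-bp interval follows directly from the interdigitated layout, as the sketch below shows. It assumes an idealized stretch of genome in which candidate 49-mers tile end-to-end with no probes filtered out; the function name and the round-robin assignment are illustrative assumptions, not the published design code.

```python
from typing import List

PROBE_LEN = 49  # genome-scan probe length in nucleotides
N_ARRAYS = 3    # number of genome-scan array designs


def assign_probes(n_probes: int, probe_len: int = PROBE_LEN,
                  n_arrays: int = N_ARRAYS) -> List[List[int]]:
    """Deal end-to-end probes round-robin onto arrays; return center positions."""
    arrays: List[List[int]] = [[] for _ in range(n_arrays)]
    for i in range(n_probes):
        start = i * probe_len              # 0-based start of the i-th probe
        center = start + probe_len // 2    # center position of the probe
        arrays[i % n_arrays].append(center)
    return arrays


centers = assign_probes(9)
# Spacing between neighboring probe centers on the first array
intervals = [b - a for a, b in zip(centers[0], centers[0][1:])]
```

On any one array the neighboring centers are 3 x 49 = 147 bp apart. The reported median of 196 bp is larger because probes in repetitive regions were removed, which this idealized sketch does not model.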
The CNV-typing array design consists of approximately 800,000 unique probes, with each in triplicate for a total of approximately 2.4 million 60-mer probes. The replicate probes are placed in separated locations on the array to mitigate any regional variations in signals. The approximately 800,000 unique probes are organized into approximately 16,000 partitions, each containing up to 50 unique probes. The probe partitions correspond to putative or reported CNV boundaries. The probes within a partition are evenly spaced along chromosomes, with the exception of regions corresponding to Alu elements and occurrences of high allele frequency SNPs. In order to mitigate any potential compounding effects on signals, probes with a common SNP (minor allele frequency > 0.05) in the HapMap repository within the central 30 nucleotides were not allowed. In contrast to the genome-scan arrays where probes in repetitive elements were mostly filtered out, the CNV-typing design has probes in all repeat regions other than Alu elements. The closest spacing between the central positions in the 60-mer probes is 10 bp. For CNVs shorter than 500 bp, the partition will contain fewer than 50 unique probes; for example, a 300 bp region will have 30 overlapping probes with center positions spaced 10 bp apart. For CNVs longer than 500 bp, the 50 probes will be spaced further apart; for example, a 3,000 bp CNV will have 50 probes lined end-to-end, with no overlap between 60-mer probes. Partitions corresponding to shorter CNVs discovered by whole-genome sequencing of individual genomes [18, 19] were prioritized and assigned first, followed by putative CNVs from the genome scan, and then supplemented with regions that overlapped in at least two records in the DGV (November 2008) (Figure S1B in Additional data file 1). Because shorter CNVs were assigned first, the shorter CNVs tend to have the highest probe density.
A longer CNV that overlaps a shorter CNV will be represented by two partitions with different probe densities. In this way, a partition can map to one or more CNV regions; conversely, a CNV can be represented by one or more probe partitions (Figure S1B in Additional data file 1). The probe sequences and build 36 chromosome positions of all four array designs are available at ArrayExpress under accession number E-TABM-838.
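The relationship between CNV length, probe count, and probe spacing within a partition can be reproduced with a small calculation. The sketch assumes perfectly even spacing and ignores the Alu and common-SNP exclusions described above; the function name and the integer rounding are illustrative choices rather than details from the array design.

```python
from typing import Tuple

MAX_PROBES = 50   # maximum unique probes per partition
MIN_SPACING = 10  # closest allowed spacing between probe centers, in bp


def partition_layout(cnv_len: int, max_probes: int = MAX_PROBES,
                     min_spacing: int = MIN_SPACING) -> Tuple[int, int]:
    """Return (number of unique probes, center-to-center spacing in bp)."""
    # Short regions are limited by the minimum spacing, long ones by the probe cap
    n_probes = max(1, min(max_probes, cnv_len // min_spacing))
    spacing = max(min_spacing, cnv_len // max_probes)
    return n_probes, spacing
```

With these assumptions a 300-bp region gets 30 probes at 10-bp spacing, a 500-bp region is the crossover point at 50 probes and 10-bp spacing, and a 3,000-bp region gets 50 probes at 60-bp spacing, which equals the probe length and so corresponds to probes lined end-to-end.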
The 90 Yoruba individuals are from the HapMap Project; genomic DNA samples were obtained as immortalized cell line isolates from the Coriell Institute. During initial analysis of the genome scans, unusually high occurrences of gain events in chromosome 12 from NA19193, and chromosome 9 from NA19208 were observed (Figure S3 in Additional data file 1). These observations are consistent with lymphoblastoid cell line artifacts that have been previously reported in these two samples [12, 44]. Data from these two chromosomes were excluded from all subsequent segmentation analyses.
Whole-genome amplification of genomic DNA samples was performed using the REPLI-g Midi kit (Qiagen, Valencia, CA, USA) following manufacturer-supplied instructions, starting with 200 ng of input DNA in a 60 μl reaction. Amplified DNA was randomly fragmented by controlled partial digestion with DNase I. The optimal DNA target length for hybridization to the arrays was found to be in the range of 50 to 300 bp, with the majority of fragments at 100 to 200 bp. DNase I at 2.5 U/μl (Affymetrix, Santa Clara, CA, USA) was freshly diluted in 10 mM Tris pH 8 to a concentration of 0.3 U/μl; 3 μl of the diluted DNase I was added to 60 μl of amplified DNA and 7 μl Fragmentation buffer (Affymetrix) at 37°C. To achieve the optimal size range, test fragmentation time courses were first performed using a small amount of the amplified DNA samples, where the incubation varied from 4 to 26 minutes. Following fragmentation, the amplified DNA was ethanol precipitated and resuspended in 33.5 μl water; 1 μl was removed to measure concentration, which was typically approximately 1.5 μg/μl. The fragmented DNA was then end-labeled with biotin using 2.5 μl of 30 mM DNA labeling reagent (Affymetrix) and 5 μl of Terminal Transferase (Affymetrix) in a 50 μl reaction, which included 10 μl of 5× TdT buffer (Affymetrix). Labeling reactions were incubated for 2 hours at 37°C, followed by heat inactivation at 95°C for 10 minutes.
Hybridization to arrays
The labeled DNAs were hybridized to each array in 200-μl volumes. In addition to 15 μl of approximately 1 μg/μl labeled DNAs, the hybridization solution contained 100 μg denatured Herring sperm DNA (Promega, Madison, WI, USA), 100 μg Yeast RNA (Ambion, Austin, TX, USA), 20 μg freshly denatured COT-1 DNA (Invitrogen, Carlsbad, CA, USA), 12% formamide, 0.25 pM gridding oligo (Affymetrix), and 140 μl hybridization buffer, which consists of 4.8 M TMACl, 15 mM Tris pH 8, and 0.015% Triton X-100. Hybridizations were carried out in Affymetrix ovens for 40 hours at 50°C with rotation set at 30 rpm. Following hybridization, arrays were rinsed twice, and then incubated with 0.2× SSPE containing 0.005% Triton X-100 for 30 minutes at 42°C with rotation set at 15 rpm. The arrays were rinsed and filled with Wash buffer A (Affymetrix). Staining with streptavidin, R-phycoerythrin conjugate (Invitrogen) and scanning with the GCS3000 instrument (Affymetrix) were performed as described in the Affymetrix GeneChip SNP 6.0 manual.
PCR and sequencing
A sampling of putative CNVs in pairs of Yoruba samples was selected where an event was observed in one DNA but not the other (Additional data file 3 and Table S5 in Additional data file 1). For standard PCR, putative CNVs having an event segment within a sample in the range 400 bp to 2.5 kb were tested; for quantitative PCR (qPCR), CNVs having gain segments between 500 bp and 10 kb were tested. Primer sequences for standard PCR were designed from 300-bp candidate regions upstream or downstream of the longest event segments within a sample, and for qPCR, from within the shortest gain segment. Candidate regions with less than 50% repeat-masked sequence (RepeatMasker, UCSC) were processed in either Primer3 or PrimerExpress 3.0 (Applied Biosystems, Foster City, CA, USA) for standard or qPCR primer design, respectively. Primer sequences are listed in Additional data file 5. Standard PCRs using Advantage LA polymerase (Clontech, Mountain View, CA, USA) and 400 nM primers (synthesized by IDT Integrated DNA Technologies, Coralville, IA, USA) started with 100 ng sample DNA. Following denaturation at 94°C for 1 minute, reactions were cycled 30 times as follows: 94°C for 30 seconds, 58°C for 30 seconds, and 72°C for 3 minutes, with a 72°C final hold for 7 minutes. Amplicons corresponding to loss events were excised from agarose gels and sequenced using either of the PCR primers. CNV loss breakpoints were determined by mapping the amplicon sequences to genome build 36 with BLAT (UCSC). An Applied Biosystems 7300 machine was used for qPCRs, according to the manufacturer's instructions. Typically, the reactions included SYBR Advantage 2× qPCR mix (Clontech), 200 nM primers, 500 nM ROX reference (Roche, Indianapolis, IN, USA), and 90 ng sample DNA. Following denaturation at 95°C for 30 seconds, reactions were cycled 40 times as follows: 95°C for 5 seconds and 60°C for 34 seconds.
Signal intensities were quantile normalized in sets of > 90 samples for each of the three genome-scan chip designs, or in two separate sets of 45 samples for the CNV-typing design. The triplicate probes on the CNV-typing array were summarized by median polish. For probes on autosomal chromosomes, median signals were calculated using all samples, while for probes on chromosome X and chromosome Y only female or male samples were used, respectively. Two to five additional non-Yoruba samples were part of the normalization and calculation of medians in the genome scan, but were not included in subsequent analyses. The medians were the basis of log2 ratios for segmentation analysis. In the initial analysis of the genome scans, a small subset of samples had disproportionately high occurrences of apparent gain or loss events. These artifact events were no longer observed after filtering out probes with GC content > 0.6 and applying a sample-specific correction to the log2 ratios. The corrected log2 ratios were derived in the following manner: for each probe, calculate the GC content of its surrounding 50 kb region; sort the GC content values into 50 equal-size bins; within each sample, for each bin, calculate the median of the log2 ratios for all probes with GC content in that bin; correct the log2 ratios in that sample by subtracting the bin medians derived in the prior step. Figure S4A in Additional data file 1 shows an example of artificially high log2 ratio values corresponding to probes with high GC content in one of the samples with artifact gain events. The benefit of the filtering and correction to the segmentation analysis is illustrated in Figure S4B in Additional data file 1. Similar benefits of probe filtering and correction have been reported in copy number analysis using other arrays [48, 49].
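The sample-specific GC correction described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `gc_bin_correct` is hypothetical, and "equal-size bins" is interpreted here as bins holding equal numbers of probes (rank-based), which is an assumption; equal-width bins over the GC range would be another reading.

```python
from statistics import median

def gc_bin_correct(log2_ratios, gc_values, n_bins=50):
    """Subtract the per-GC-bin median log2 ratio within a single sample.

    log2_ratios: log2 ratios for the probes of one sample.
    gc_values:   GC content of the region surrounding each probe
                 (same order as log2_ratios).
    Assumption: bins contain (nearly) equal numbers of probes.
    """
    if len(log2_ratios) != len(gc_values):
        raise ValueError("inputs must have the same length")
    n = len(gc_values)
    # Rank probes by regional GC content and assign ranks to bins.
    order = sorted(range(n), key=lambda i: gc_values[i])
    bin_of = [0] * n
    for rank, i in enumerate(order):
        bin_of[i] = rank * n_bins // n
    # Median log2 ratio of the probes in each bin.
    by_bin = {}
    for i, b in enumerate(bin_of):
        by_bin.setdefault(b, []).append(log2_ratios[i])
    bin_median = {b: median(vals) for b, vals in by_bin.items()}
    # Correct each probe by its own bin's median.
    return [r - bin_median[b] for r, b in zip(log2_ratios, bin_of)]
```

In this sketch a sample whose high-GC probes are shifted upward as a group has that shift removed, while deviations of individual probes from their bin median are preserved for segmentation.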
Segmentation: genome scan
CBS was run using its implementation in the R package DNAcopy. For each sample and each genome-scan design, sets of 750 probes (approximately 150 to 200 kb windows) were analyzed using signal from all probes (without smoothing) to specifically look for CNVs shorter than 100 kb. To identify longer CNVs, segmentation analysis was performed with signal smoothing using Nexus Rank Segmentation (BioDiscovery, El Segundo, CA, USA), a proprietary algorithm based on CBS. The smoothed segmentation was run on entire chromosomes. Signals were smoothed by averaging eight consecutive probes. The level of smoothing was chosen based on chromosome X receiver operating characteristic (ROC) analyses that compared smoothing with 2 probes up to 256 probes (Figure S5B in Additional data file 1). Initially, consecutive inter-digitated probes from the three genome-scan arrays were combined to get the highest possible resolution, up to 49 bp in non-repetitive regions. However, the ROC analysis in Figure S5B in Additional data file 1 shows lower sensitivity and specificity when combining the three designs, compared to using probes from only one design (Figure S5A in Additional data file 1). Averaging the combined probes from the three designs (smooth 3 in Figure S5B in Additional data file 1) appears to have comparable performance to the unsmoothed curve using only one of the array designs (all probes in Figure S5A in Additional data file 1). However, actual segmentation analysis from averaging three probes combined from the three designs resulted in a highly disparate range of event tallies in individual samples, indicative of false positives. Segmentation was therefore computed separately for each design, although using probes from only one design at a time entailed a lower resolution (at best 147 bp) in the genome scan.
By using the three genome-scan designs as technical replicates instead of in combination, lower rates of false discovery (higher specificity) were prioritized over higher resolution and sensitivity to detect shorter CNVs.
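As a rough illustration of the probe smoothing mentioned above, the following Python sketch averages consecutive probes in blocks of eight. The text does not say whether the averaging used non-overlapping blocks or a sliding window; non-overlapping blocks are assumed here, and the function name `block_average` is hypothetical. Nexus Rank Segmentation itself is proprietary and is not reproduced.

```python
def block_average(signals, window=8):
    """Average consecutive probe signals in non-overlapping blocks.

    Assumption: blocks do not overlap; a shorter trailing block is
    averaged over the probes it contains.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for start in range(0, len(signals), window):
        block = signals[start:start + window]
        smoothed.append(sum(block) / len(block))
    return smoothed
```

Larger windows reduce probe-level noise at the cost of resolution, which is the trade-off the ROC comparison of 2 to 256 probes was used to assess.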
For both non-smoothed and smoothed segmentation analyses, gain and loss event thresholds were set to segment mean log2 ratios of > 0.25 and < -0.25, respectively. For each sample, overlapping segments from at least two of the three chip designs were required to meet the thresholds in order to call a gain or loss. The boundaries of an individual event were defined by the longest overlap between any two event segments meeting the threshold. A putative CNV was defined as a region having events observed in at least one individual, and the boundaries of a CNV were defined by the longest event among individuals. There were 401 regions where putative CNVs from the non-smoothed segmentation intersected putative regions from the smoothed segmentation. In regions where multiple putative CNVs from the non-smoothed segmentation corresponded to one putative region from the smoothed segmentation, the non-smoothed CNVs were chosen. In regions of one-to-one correspondence, the generally longer putative CNVs from the smoothed segmentation were chosen.
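The two-of-three replicate rule for one sample can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' pipeline: segments are represented as hypothetical `(start, end, mean_log2)` tuples per design, the function names are invented for the example, and overlapping pairwise overlaps of the same type are resolved by keeping the longest one.

```python
from itertools import combinations

GAIN_THRESHOLD = 0.25    # segment mean log2 ratio above this is a gain
LOSS_THRESHOLD = -0.25   # segment mean log2 ratio below this is a loss

def event_type(mean_log2):
    """Return 'gain', 'loss', or None for a segment mean log2 ratio."""
    if mean_log2 > GAIN_THRESHOLD:
        return "gain"
    if mean_log2 < LOSS_THRESHOLD:
        return "loss"
    return None

def pairwise_overlaps(segments_by_design):
    """Overlaps between same-type event segments from two different designs."""
    overlaps = []
    for a, b in combinations(range(len(segments_by_design)), 2):
        for s1, e1, m1 in segments_by_design[a]:
            kind = event_type(m1)
            if kind is None:
                continue
            for s2, e2, m2 in segments_by_design[b]:
                if event_type(m2) != kind:
                    continue
                start, end = max(s1, s2), min(e1, e2)
                if start < end:
                    overlaps.append((start, end, kind))
    return overlaps

def call_events(segments_by_design):
    """Keep the longest pairwise overlap within each cluster of overlaps."""
    events = []
    for kind in ("gain", "loss"):
        intervals = sorted((s, e) for s, e, k in
                           pairwise_overlaps(segments_by_design) if k == kind)
        longest, cluster_end = None, None
        for s, e in intervals:
            if longest is None or s >= cluster_end:
                if longest is not None:
                    events.append((longest[0], longest[1], kind))
                longest, cluster_end = (s, e), e
            else:
                if e - s > longest[1] - longest[0]:
                    longest = (s, e)
                cluster_end = max(cluster_end, e)
        if longest is not None:
            events.append((longest[0], longest[1], kind))
    return sorted(events)
```

A segment that passes the threshold in only one design produces no pairwise overlap and is therefore not called, which is how the replicate requirement trades sensitivity for specificity.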
Segmentation: CNV typing
The sample-level event calling thresholds used in the segmentation analysis of the CNV-typing array data were determined by comparing against reference event calls taken primarily from the McCarroll et al. study and, to a lesser extent, from the PCR validation. The McCarroll et al. study reported integer copy number calls at 1,301 CNVs in 270 individuals from the HapMap Project, including the 90 Yoruba. Of these 1,301 CNVs, 1,153 regions were represented on the CNV-typing array by at least one probe partition corresponding to regions overlapping at least two records in the DGV (November 2008). These 1,153 CNVs were grouped into a subset of 859 CNVs with at least one reported event in a Yoruba, and a second subset of 294 regions that did not have any Yoruba events reported in the McCarroll et al. study. These non-Yoruba event regions were further checked against five other papers cited in the DGV [13, 15, 20, 30, 31], where events were reported in at least one Yoruba. Of the subset of 294 CNVs without Yoruba events in the McCarroll et al. study, 234 regions had no reported Yoruba events in any of the five other papers. After excluding no-calls from the McCarroll et al. study, there were a total of 20,847 diploid calls at the 234 regions in 90 Yoruba (listed as REF-NonPoly6papers in Additional data file 8). These diploid calls were used as reference to assess call specificity, as reflected in false-positive event observations. Initial comparisons of CNV-typing data with the 859 CNVs having reported Yoruba events in the McCarroll et al. study showed that a subset of 127 regions had reported calls that agreed only when offset by one integer. Comparisons with calls reported in the five other papers with Yoruba events showed lower agreement in this subset of 127.
A cursory examination of HapMap genotypes suggested higher congruence with offset calls at many of the 127 regions, where, for example, one-copy-loss events should correspond to consecutive SNP loci with homozygous genotypes, but instead diploid copy number calls were reported. After omitting these 127 regions, the remaining 732 CNVs from the McCarroll et al. study had 7,752 reported events that were used as reference to assess event sensitivity (listed as REF-McCarroll-Sel in Additional data file 8). Events from PCRs shown in Tables S3 and S5 in Additional data file 1, and in Additional data file 3 were also used as reference (listed as pcr-GS in Additional data file 8).
To assess specificity and sensitivity of event detection in the CNV-typing data, segmentation thresholds were titrated at the 732 McCarroll reference CNVs, and at 6,578 putative CNVs from the genome scan. Any false positives or false negatives in the McCarroll reference event calls will artificially lower the estimates of sensitivity or specificity, respectively, of the CNV-typing array. Figure S6 in Additional data file 1 summarizes the results at seven threshold values that ranged from 0.35 (-0.35) to 0.10 (-0.10), and shows the trade-off between higher specificity and lower sensitivity. Event thresholds of -0.175 and 0.175 for loss and gain calls, respectively, were chosen; based on further titrations, second-level thresholds of -0.70 and 0.45 were chosen for homozygous deletions and multi-copy gain events, respectively. For each individual Yoruba sample, sets of probes for each CNV were analyzed separately by CBS, and segments with log2 ratios above or below the thresholds were called as events. Probes in the CNV-typing design were grouped into partitions corresponding to known or putative CNVs, where a given CNV may be represented by more than one partition (Figure S1B in Additional data file 1). Although the CNVs vary in the number and density (probes per base-pair) of corresponding probes, the degree of discrimination of log2 ratios above or below the event thresholds was comparable across a range of event lengths and numbers of probes, with only slight loss of discrimination at longer lengths and fewer probes (Figure S7 in Additional data file 1). Microarray raw intensities and chip library files are available at ArrayExpress under accession number E-TABM-838. Reported CNVs are displayed at the DGV.
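The two levels of thresholds chosen for the CNV-typing array amount to a simple classification of each segment's mean log2 ratio. The Python sketch below illustrates this with the threshold values stated above; the function name, the category labels, and the use of strict inequalities at the boundaries are assumptions made for the example.

```python
def classify_segment(mean_log2):
    """Classify a CBS segment by its mean log2 ratio.

    Thresholds are those reported for the CNV-typing array; strict
    inequalities at the boundaries are an assumption.
    """
    if mean_log2 < -0.70:
        return "homozygous_deletion"
    if mean_log2 < -0.175:
        return "loss"
    if mean_log2 > 0.45:
        return "multi_copy_gain"
    if mean_log2 > 0.175:
        return "gain"
    return "no_event"
```

For example, a segment with a mean log2 ratio of -0.3 would be called a loss and one at 0.6 a multi-copy gain, while a segment at 0.1 falls between the first-level thresholds and is not called as an event.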
Additional data files
The following additional data are available with the online version of this paper: Figures S1 to S7 and Tables S3, S5, S6 and S7 (Additional data file 1); a table listing confirmed and putative CNVs (Additional data file 2); a table listing PCR validation results at 44 regions along with gel images, which correspond to 4% agarose (E-gel), gradient polyacrylamide (PA gel), and 1% agarose (1% gel) electrophoresis gels (Additional data file 3); a list of event calls and consensus reference in trios (Additional data file 4); a list of primer sequences, along with sizes of the expected amplicons (Additional data file 5); integer copy number events observed on the CNV-typing array in 90 Yoruba at 6,368 confirmed CNVs (Additional data file 6); observed events on the CNV-typing array in the 90 Yoruba at 1,153 CNVs reported in the McCarroll et al. study (listed as chp-McCarroll2008) and at regions from the Levy et al. and Wheeler et al. studies as summarized in Table 4 (listed as chp-LevyWheel, chp-LevyOnly, and chp-WheelerOnly) (Additional data file 7); reported events from six papers that included at least one Yoruba (Additional data file 8).
BAC: bacterial artificial chromosome
CBS: circular binary segmentation
CGH: comparative genome hybridization
CNV: copy number variant
DGV: Database of Genomic Variants
ROC: receiver operating characteristic
We thank Alan Williams, Mike Mittmann, Mei-Mei Shen, Guoying Liu, and Gangwu Mei for chip design; Jim Veitch for analysis; Glenn McGall, Bob Kuimelis and Pilot Ops for array manufacturing; and Keith Jones and Tom Gingeras for critical reading.
1. IHMC: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
2. Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE: An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006, 16: 1182-1190. 10.1101/gr.4565806.
3. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 85-97. 10.1038/nrg1767.
4. 1000 Genomes Project. [http://www.1000genomes.org/page.php]
5. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29: 308-311. 10.1093/nar/29.1.308.
6. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet. 2004, 36: 949-951. 10.1038/ng1416.
7. WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
8. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimaki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, et al: Strong association of de novo copy number mutations with autism. Science. 2007, 316: 445-449. 10.1126/science.1138659.
9. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, Stray SM, Rippey CF, Roccanova P, Makarov V, Lakshmi B, Findling RL, Sikich L, Stromberg T, Merriman B, Gogtay N, Butler P, Eckstrand K, Noory L, Gochman P, Long R, Chen Z, Davis S, Baker C, Eichler EE, Meltzer PS, et al: Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008, 320: 539-543. 10.1126/science.1155174.
10. Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M: Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet. 2008, 40: 880-885. 10.1038/ng.162.
11. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science. 2004, 305: 525-528. 10.1126/science.1098918.
12. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
13. Perry GH, Ben-Dor A, Tsalenko A, Sampas N, Rodriguez-Revenga L, Tran CW, Scheffer A, Steinfeld I, Tsang P, Yamada NA, Park HS, Kim JI, Seo JS, Yakhini Z, Laderman S, Bruhn L, Lee C: The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet. 2008, 82: 685-695. 10.1016/j.ajhg.2007.12.010.
14. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D: Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008, 40: 1166-1174. 10.1038/ng.238.
15. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17: 1665-1674. 10.1101/gr.6861907.
16. Fodor SP, Read JL, Pirrung MC, Stryer L, Lu AT, Solas D: Light-directed, spatially addressable parallel chemical synthesis. Science. 1991, 251: 767-773. 10.1126/science.1990438.
17. Photocleavable Protecting Groups. [http://www.wipo.int/pctdb/en/wo.jsp?wo=2002020150]
18. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
19. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.
20. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.
21. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, et al: The diploid genome sequence of an Asian individual. Nature. 2008, 456: 60-65. 10.1038/nature07484.
22. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657-663. 10.1093/bioinformatics/btl646.
23. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.
24. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet. 2006, 38: 75-81. 10.1038/ng1697.
25. Cooper GM, Nickerson DA, Eichler EE: Mutational and selective effects on copy-number variants in the human genome. Nat Genet. 2007, 39: S22-29. 10.1038/ng2054.
26. de Smith AJ, Tsalenko A, Sampas N, Scheffer A, Yamada NA, Tsang P, Ben-Dor A, Yakhini Z, Ellis RJ, Bruhn L, Laderman S, Froguel P, Blakemore AI: Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet. 2007, 16: 2783-2794. 10.1093/hmg/ddm208.
27. Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe'er I: Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009, 19: 318-326. 10.1101/gr.081398.108.
28. Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA: Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet. 2006, 38: 82-85. 10.1038/ng1695.
29. Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB: Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008, 451: 998-1003. 10.1038/nature06742.
30. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tuzun E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, et al: Mapping and sequencing of structural variation from eight human genomes. Nature. 2008, 453: 56-64. 10.1038/nature06862.
31. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318: 420-426. 10.1126/science.1149504.
32. Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM, Eichler EE: Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet. 2006, 79: 275-290. 10.1086/505653.
33. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM: Common deletion polymorphisms in the human genome. Nat Genet. 2006, 38: 86-92. 10.1038/ng1696.
34. Pinto D, Marshall C, Feuk L, Scherer SW: Copy-number variation in control population cohorts. Hum Mol Genet. 2007, 16 (Spec No. 2): R168-173. 10.1093/hmg/ddm241.
35. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005, 77: 78-88. 10.1086/431652.
36. Simon-Sanchez J, Scholz S, Fung HC, Matarin M, Hernandez D, Gibbs JR, Britton A, de Vrieze FW, Peckham E, Gwinn-Hardy K, Crawley A, Keen JC, Nash J, Borgaonkar D, Hardy J, Singleton A: Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum Mol Genet. 2007, 16: 1-14. 10.1093/hmg/ddl436.
37. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Fine-scale structural variation of the human genome. Nat Genet. 2005, 37: 727-732. 10.1038/ng1562.
38. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet. 2007, 80: 91-104. 10.1086/510560.
39. Zogopoulos G, Ha KC, Naqib F, Moore S, Kim H, Montpetit A, Robidoux F, Laflamme P, Cotterchio M, Greenwood C, Scherer SW, Zanke B, Hudson TJ, Bader GD, Gallinger S: Germ-line DNA copy number variation frequencies in a large North American population. Hum Genet. 2007, 122: 345-353. 10.1007/s00439-007-0404-5.
40. Conrad DF, Hurles ME: The population genetics of structural variation. Nat Genet. 2007, 39: S30-36. 10.1038/ng2042.
41. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.
42. ArrayExpress. [http://www.ebi.ac.uk/microarray-as/ae/]
43. Coriell Institute for Medical Research. [http://www.coriell.org/]
44. Wang Y, Moorhead M, Karlin-Neumann G, Falkowski M, Chen C, Siddiqui F, Davis RW, Willis TD, Faham M: Allele quantification using molecular inversion probes (MIP). Nucleic Acids Res. 2005, 33: e183. 10.1093/nar/gni177.
45. Affymetrix: Affymetrix® Genome-Wide Human SNP Nsp/Sty 6.0 User Guide. 2007, [http://www.affymetrix.com/support/downloads/manuals/genomewidesnp6_manual.pdf]
46. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.
47. Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.
48. Komura D, Nishimura K, Ishikawa S, Panda B, Huang J, Nakamura H, Ihara S, Hirose M, Jones KW, Aburatani H: Noise reduction from genotyping microarrays using probe level information. In Silico Biol. 2006, 6: 79-92.
49. Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavare S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007, 8: R228. 10.1186/gb-2007-8-10-r228.
50. Bioconductor Open Source Software for Bioinformatics. [http://www.bioconductor.org/]
51. Database of Genomic Variants. [http://projects.tcag.ca/variation/]
52. Wang Y, Moorhead M, Karlin-Neumann G, Wang NJ, Ireland J, Lin S, Chen C, Heiser LM, Chin K, Esserman L, Gray JW, Spellman PT, Faham M: Analysis of molecular inversion probe performance for allele copy number determination. Genome Biol. 2007, 8: R246. 10.1186/gb-2007-8-11-r246.
53. UCSC Genome Browser. [http://www.genome.ucsc.edu/cgi-bin/hgLiftOver]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.