An optimization framework for unsupervised identification of rare copy number variation from SNP array data
 Gökhan Yavaş^{1},
 Mehmet Koyutürk^{1, 2},
 Meral Özsoyoğlu^{1},
 Meetha P Gould^{3} and
 Thomas LaFramboise^{2, 3, 4}Email author
DOI: 10.1186/gb20091010r119
© Yavaş et al.; licensee BioMed Central Ltd. 2009
Received: 21 September 2009
Accepted: 23 October 2009
Published: 23 October 2009
Abstract
Copy number variants (CNVs) have roles in human disease, and DNA microarrays are important tools for identifying them. In this paper, we frame CNV identification as an objective function optimization problem. We apply our method to data from hundreds of samples, and demonstrate its ability to detect CNVs at a high level of sensitivity without sacrificing specificity. Its performance compares favorably with currently available methods and it reveals previously unreported gains and losses.
Background
Identifying DNA variants that contribute to disease is a central aim in human genetics research. Pinpointing these causal loci requires the ability to accurately assess DNA sequence variation on a genomewide scale. In recent years, considerable progress has been made in identifying and cataloging singlenucleotide polymorphisms (SNPs) in many populations [1]. Commercial SNP microarray platforms can now genotype, with >99% accuracy, over one million SNPs in an individual in one assay [2, 3].
The discovery of copy number variants (CNVs) as a significant source of variation has complicated the identification of genetic differences among humans. CNVs are defined as chromosomal segments at least 1,000 bases (1 kb) in length that vary in number of copies from human to human [4]. Since their discovery, several highprofile studies have been published associating copy number variation in the genome with a variety of common diseases. Recent examples include Alzheimer's disease [5], Crohn's disease [6], autism [7], and schizophrenia [8]. The significance of the gains (copy number greater than two) and losses (copy number less than two) that comprise these variants is increasingly evident, and cataloging them and assessing their frequencies has become an important goal.
SNP arrays contain hundreds of thousands of unique nucleotide probe sequences, each designed to hybridize to a target DNA sequence. When a DNA sample is properly prepared and applied to the array, specialized equipment can produce a measure of the intensity of hybridization between each probe and its target in the sample. The underlying principle is that the hybridization intensity depends upon the amount of target DNA in the sample, as well as the affinity between target and probe. Extensive processing and analysis of these raw intensity measures yield estimates of some characteristic of the target sequences in the sample  either target quantity [9, 10], base composition [11, 12], or both. In copy number inference, the objective is to identify chromosomal regions at which the number of copies per cell deviates from two; these include gains and losses.
There is now a large body of literature describing algorithms to infer copy number from SNP array data. All such algorithms address one or more of the three general steps: normalization, raw copy extraction, and CNV calling. Normalization is performed on the raw array intensity data in order to be able to compare these values fairly, thereby taking into account differences in overall array brightness and additional sources of nuisance variation. Raw copy number extraction entails converting the multiple measurements for each genomic site into a single raw measure of copy number. The word 'raw' here indicates that measurements from surrounding loci are not yet taken into account, and the measure is permitted to be noninteger. Since gains and losses occur in discrete segments often encompassing several such loci, true copy number is locally constant. Consequently, the final CNV calling step takes advantage of this fact, smoothing or segmenting the raw copy numbers into discrete segments of consistent copy number.
The Affymetrix SNP array was originally designed so that each SNP is interrogated by 24 to 40 unique probes. Of these, half are perfectly complementary to the sequence harboring the SNP site (perfect match probes), while half mismatch the sequence at the probe's middle nucleotide (mismatch probes). The mismatch probes were intended to capture background effects such as crosshybridization. The perfect match/mismatch design was used for the 10,000, 100,000, and 500,000SNP versions of the array. Most recently, Affymetrix has introduced the SNP Array 6.0, which interrogates nearly one million SNPs and differs fundamentally from previous versions. First, each SNP on the 6.0 array is interrogated only by six or eight perfect match probes  three or four replicates of the same probe sequence for each of the two alleles. Therefore, intensity data for each SNP consist of three or four repeated pairs of measurements. Second, the SNP probe sets are augmented with nearly one million CNV probes, which are meant to interrogate regions of the genome that do not harbor SNPs, but that may be polymorphic with regard to copy number. Each such CNV site is interrogated by only one probe.
For the Affymetrix platform, the community has largely settled upon quantile normalization [13] as a simple but effective normalization method. The next step, raw copy number extraction, typically entails fitting some model to raw probe intensity data [14–17]. Methods devoted to the final step  making CNV calls from raw copy number data  are numerous, and employ various strategies. Three commonly used strategies are hidden Markov models (HMMs) [17, 18], circular binary segmentation [19, 20], and adapted weight smoothing [21, 22]. Although these methods appear to be quite different from one another in terms of the computational or statistical model they incorporate, at the core of each is an objective function whose optimum solution yields the method's copy number inference for a region. Each objective function is defined by the observed data (raw copy number) and is a function of inferred state (copy number call). The sequence of copy number calls (states) that optimizes the objective function gives the CNV call for each method.
In this paper, we present a general framework to call CNVs from raw copy number using optimization, based on an objective function that is composed of several explicitly formulated objective criteria. These criteria are carefully designed to quantify the desirability of a CNV assignment with respect to various biological insights and experimental considerations. Our general approach is to first apply a signal processing method to aggressively flag candidate gains and losses. The objective function is then optimized on each region and flanking sequence, yielding final CNV calls and boundaries. Note that the optimization process also filters out many candidate regions; that is, complete rejection of a candidate region is quite possible as it is part of the solution space for the corresponding optimization problem. This twostep procedure has the advantages of drastically reducing the computational time necessary to find the set of solutions, while identifying precise boundaries for each putative CNV. Indeed, for N markers and C CNV classes, the solution space of the optimal copy number assignment problem is of size O(C^{ N }). Exhaustively searching for the optimal solution is quite infeasible unless N becomes very small. In our case, N ≈ 1.8 million, so we adapt a simulated annealingbased algorithm that efficiently searches the solution space at nearinteractive rates.
We note here the distinction between CNVs and copy number polymorphisms (CNPs). CNPs are defined to be CNVs that are present and have identical boundaries (and are therefore likely identicalbydescent) in at least 1% of the human population [23]. Computationally, such higherfrequency polymorphisms present opportunities for detection that are not otherwise possible. A recent study [17] proposes separate methods to detect CNVs and CNPs, with the latter involving detecting correlations in raw copy numbers across samples. The current work is designed to address the problem of identifying rare and de novo CNVs, as it does not make use of multiple samples to convert raw copy number into CNV inferences.
A key feature of our method is that it is highly configurable, allowing researchers to define their own objective functions and tune parameters to emphasize the relative importance of different objective criteria. We demonstrate with a simple objective function involving a linear combination of variability, parsimony, and length, which performs surprisingly well. We evaluate the performance of our method on Affymetrix 6.0 array data from 270 HapMap individuals [1]. These samples are increasingly well characterized with regard to CNVs and include 60 motherfatherchild trios. Therefore, they serve as an excellent benchmark data set. We show via systematic in silico studies that the proposed method compares favorably with four methods that are currently publicly available. Furthermore, we experimentally validate, using laboratory techniques on genomic DNA, several CNVs newly discovered by our method. These results demonstrate the proposed method's potential to uncover human genetic variation that may be missed by other computational approaches.
Results and discussion
The distribution of identified CNVs by ethnicity
CEU  YRI  JPT  CHB  Total  

Gains  1,726  2,325  856  765  5,672 
Losses  3,500  3,443  1,760  1,753  10,456 
Total  5,226  5,768  2,616  2,518  16,128 
Trio discordance as a copy number variant detection assessment tool
Although CNVs can arise in a de novo manner, it is believed that at least 99% of all CNVs in an individual's genome are inherited [23]. The 60 motherfatherchild trios in the HapMap data set therefore provide an opportunity to assess the accuracy of CNV detection algorithms by measuring the rate of Mendelian concordance. A CNV in a trio child is said to be Mendelian concordant if it appears in at least one of the parents. Unless the CNV is de novo, any discordance is either the result of a false positive call in the child or a false negative call in one of the parents (in rare cases, discordance could also result from a parent harboring a duplication and a deletion at the same locus but on different chromosomal homologs). Discordance rate, while useful, is imperfect as an assessment measure. In particular, it is possible for a CNV identification algorithm to have artificially low discordance rates by calling each CNV in a large number of samples. Even if the samples in which a gain or loss is called are randomly selected, frequently called CNVs will have a lower discordance rate, simply by chance. Therefore, while comparing the performance of algorithms according to trio discordance rate, we also account for the number of frequently called CNVs, as discussed in the next subsection.
This measure provides a standard way of determining the similarity in the chromosomal location of two CNVs, regardless of the scale of the events. For our discordance and sensitivity analysis, we use the MRO measure with a threshold of 0.5 to decide whether two CNVs identified in two different individuals correspond to the same event. That is, at least half of c_{1} must be overlapping with c_{2} and vice versa for c_{1} and c_{2} to be considered as the same CNV in different samples.
Performance of ÇOKGEN in comparison to existing software
We compared the performance of our algorithm with that of four other software packages. The DNAChip Analyzer (dChip) [24] is a Windows software package for Affymetrix platform and highlevel analysis of gene expression microarrays and SNP microarrays [14, 25]. Birdseye [17] is a rare CNV identification tool based on HMMs, and is part of the Birdsuite platform [17]. QuantiSNP [26] is an analytical tool for the analysis of copy number variation using whole genome SNP genotyping data. It was originally developed for Illumina arrays, but version 1.1 of this software supports Affymetrix 6.0 data files with additional data conversion steps. PennCNV [27] is the last software tool that we use for CNV detection for our comparative analyses. Although it is also designed to handle signal intensity data from Illumina arrays, it currently supports Affymetrix.
Comprehensive experimental results show that ÇOKGEN outperforms all of these four CNV identification tools in terms of general trio discordance. Overall, ÇOKGEN has a 30.8% discordance rate whereas Birdseye, dChip, QuantiSNP and PennCNV demonstrate discordance rates of 42.6%, 94%, 74% and 32.9%, respectively, on the same array data. It is important to note that dChip was originally optimized for detecting somatic copy number aberrations in cancer cells from earlier versions of the Affymetrix platform, and QuantiSNP is designed for data obtained from the Illumina platform. Therefore, Birdseye, PennCNV, and ÇOKGEN's superior performance compared to dChip and QuantiSNP on Affymetrix 6.0 data is not surprising. For this reason, we restrict our assessment to ÇOKGEN, Birdseye and PennCNV in the remainder of this section.
Sensitivity comparison across methods
Trio discordance is a reasonable hybrid measure of sensitivity (recall) and specificity (precision), but these two measures cannot be easily decoupled based only on discordance rate. A recent study [28] assembled a 'stringent dataset' comprising CNVs identified by at least two independent algorithms. The dataset contains a total of 808 autosomal CNV regions reported by the study to be harbored in at least one of the 270 HapMap individuals. Another study [23] identified 1,292 autosomal CNP regions in 270 HapMap samples. We use these two as 'gold standard' data sets to evaluate the sensitivity of our method. We refer to sensitivity based on the data presented in [28] as sensitivityPinto and sensitivity based on the CNP data set presented in [23] as sensitivityMcCarroll.
In terms of sensitivityPinto, we observe that ÇOKGEN detects 696 of 808 (approximately 86.1%) CNVs from the study presented in [28]. PennCNV obtains the best result by a narrow margin, by identifying 716 of 808 (approximately 88.6%) CNVs. Birdseye achieves an 84.7% success rate, slightly less than that of our method. In terms of sensitivityMcCarroll, ÇOKGEN and PennCNV detect 20.7% and 25.5%, respectively. Birdseye detects 68.2%, which is the best sensitivity rate among all the methods compared for this data set; however, as mentioned in [23], Birdeye is one of the methods used for identifying the CNPs in this dataset. For this reason, this result is not surprising. PennCNV is slightly more sensitive than our method on this dataset, though this seems to be at the cost of a modest increase in trio discordance rate, as shown above.
Run time performance
To analyze the run time performances of ÇOKGEN, PennCNV, and Birdseye, we compare ÇOKGEN with PennCNV on a Windows system, and time both ÇOKGEN and Birdseye on a Linux system (Birdseye is not available in a Windows version). Performances are measured from the time at which the CEL file is taken as an input to the time at which the list of CNVs is output. On a Windows system that has an Intel Core 2 Quad CPU with a clock speed of 2.4 GHz and 4 gigabytes of memory, we observe that ÇOKGEN processes 22 chromosomes of a single HapMap sample in an average of 343 seconds compared to an average of 271 seconds for the PennCNV package.
The Linux experiments are done on a dual Intel Xeon 3 Ghz Centos 5 × 86 64bit machine with 4 gigabytes of memory. Since Birdsuite is designed to be run as a pipeline of consecutive steps, we are unable to run only Birdseye in isolation. Thus, we report the run time for the whole package rather than single steps, which may admittedly inflate the time that Birdseye would take to run alone. In this experiment, ÇOKGEN processes 22 chromosomes of a single sample in an average of 702 seconds compared to 2,232 seconds for the whole Birdsuite pipeline.
In addition to computational efficiency, these experiments also highlight the userfriendliness of our package. Indeed, ÇOKGEN is wholly contained in a single, simple (composed of three commands) R package, making it completely platformindependent and available to Windows, Mac, or Linux/UNIX users. In contrast to the competing software, ÇOKGEN does not require the installation of additional tools such as Active Perl [29] or Affymetrix Power Tools [30].
Experimental validation of copy number variants not previously reported
MLPA results for some of the nonpreviously reported regions identified by ÇOKGEN
Chromosome  Sample  Basepair start*  Basepair end*  Length (bp)  MLPA probe position  Type  MLPA 

5  NA11830  59753489  59816458  62969  59766589  Gain  2.4 
5  NA10846  101261596  101308054  46458  101261461  Loss  1.35 
5  NA12144  101256012  101308054  52042  101279312  Loss  1.18 
6  NA10846  99225525  99249603  24078  99237564  Loss  1.44 
6  NA12144  99225007  99245596  20589  99226748  Loss  1.3 
16  NA10839  77818007  77832838  14831  77819334  Loss  1.35 
2  NA10854  108944933  108952869  7936  108945672  Loss  1.33 
6  NA11830  97308635  97316868  8233  97311558  Loss  1.29 
The software package
Conclusions
We present a method to detect germline CNVs from Affymetrix 6.0 SNP array data. Our approach, with its accompanying software, will be useful for researchers querying constitutional DNA for association of gains and losses with disease. Indeed, CNVs are emerging as important factors in a growing number of diseases, and the 6.0 array has the highest genomewide resolution of current commercially available platforms. The current work shows that the problem of detecting CNVs from raw array data may be recast as an optimization problem with an explicit objective function. The objective function chosen here is quite simple and intuitive, but its effectiveness is clear. Our method is wholly contained in a freely available and flexible software package that efficiently processes raw probelevel .CEL files to produce lists of inferred gains and losses. The software allows the user to tune parameters for the desired specificitysensitivity balance. With detailed experimental studies on the HapMap dataset, we have demonstrated its sensitivity to detect both previously reported and novel CNVs, while keeping a low false positive rate, as demonstrated by high Mendelian consistency in trios.
The method described in this paper could also be adapted to other SNP arrays, including earlier versions of the Affymetrix platform, Illumina arrays, or array comparative genomic hybridization. Any platform that produces a measure of raw copy number at markers across the genome would be suitable. As SNP arrays continue to improve with regard to throughput and accuracy, our approach will be adaptable to handle the data as they become available.
The optimizationbased approach is the key to our method's flexibility. Although we have constructed our own default function to capture the criteria that we wish to emphasize, one may easily envision alternative criteria that other researchers would wish to incorporate. For example, since very long CNVs are quite rare in the human genome, researchers might wish to include a term in the objective function that takes into account the number of bases covered by a putative CNV region. Another possibility would be to incorporate allelic ratio intensity information at SNP markers, as is done in some HMM approaches [26, 27]. We anticipate that users will design their own objective functions and apply them, using our software, to their own specific applications and data.
It should be emphasized that previously established approaches may actually also be considered special cases of functional optimization. For example, HMMs often used in the copy number setting [14, 17, 26] entail finding 'state paths' (markerbymarker sequences of copynumber calls) that maximize a loglikelihood function. In HMM applications, however, the model parameters are often estimated simultaneously with the copy number states via a Viterbi algorithm [33], based on training samples. Precise parameter estimation relies on sufficient representation from each copy number state, which may be unrealistic for rare CNVs. Another popular approach to inferring CNVs from raw copy number data is circular binary segmentation [19]. Rather than explicitly representing copy number state as a solution to an optimization problem, circular binary segmentation aims to find change points from one copy state to another. It does so by maximizing functions of marker indices. The optimum values of the function determine the boundaries of the CNV regions. A third example is the GLAD (Gain and Loss Analysis of DNA) algorithm [22], which has been adapted extensively using methods developed to analyze tumor DNA [15, 34]. To find CNVs, GLAD explicitly models raw copy number as a function of position. The true underlying copy number is encoded in a positiondependent parameter. The CNV regions are inferred by maximizing a weighted likelihood function using an adaptive weights smoothing procedure [21]. Note that the objective functions in HMMs, binary segmentation and GLAD all make distributional assumptions about the raw copy number measurements. The function that we adopt in the current study makes no such assumptions, but could be modified to incorporate them. Furthermore, our CNV calling method is fully unsupervised in that it does not require any training samples in terms of known copy numbers. Lastly, rather than estimating and fixing parameters (thus fixing the performance of the algorithm), our method presents the opportunity to tune parameters, which makes it possible to adjust the performance of the algorithm to obtain the best results in a semiautomatic manner.
Three other studies have utilized various smoothing and edge detection algorithms: wavelet footprints [35], nonlinear diffusion filtering [36], and kernel smoothing [37]. We also apply an edge detection scheme on lowpass filtered data to identify regions that potentially correspond to aberrations. Unlike other approaches, however, we apply edge detection rather aggressively to identify all candidate regions that may correspond to aberrations. This is because the raw copy number signal is extremely noisy due to the artifacts of microarray technology, as seen in Figure 7a. Furthermore, since the markers are distributed unevenly across the genome, the onedimensional signal represents a nonuniform sample of the actual copy number signal. Consequently, it is not straightforward to choose a smoothing and edge detection scheme that will be most appropriate for all experiments, samples, chromosomes, or even chromosomal segments. For example, in Figure 7b, the edge detection scheme identifies a single duplication as two separate duplications, since the markers at the middle of the region exhibit relatively low raw copy numbers, probably due to noise. This problem can be alleviated by smoothing the signal more aggressively to eliminate such artifacts, although this might result in falsely eliminating many aberrations that span relatively less numbers of markers. Motivated by these considerations, we use edge detection to identify all potential candidates and then use an optimization scheme with adjustable parameters to eliminate false calls among these candidates.
We also note that ÇOKGEN works on each sample individually and is therefore suited for rare CNV identification at the expense of losing some information to detect CNPs. The importance of rare CNVs is underscored by the recent deep sequencing of the entire genome of a single individual [38]. In that study, some 30% of the discovered CNVs had not been previously reported by any other study.
In addition to presenting a new software tool, the current work also casts Mendelian concordance, as an assessment tool, in a new light. While concordance rate is valuable as a metric to evaluate methods for calling germline variation, it is best viewed as a function of overall variant call rate. As we have shown, concordance rate can be artificially boosted simply by calling variants at a high rate. When evaluating the performance of future methods on familybased data sets, researchers may compare trio discordance results as a function of call frequency to the null expectation that we derive in the Materials and methods section.
Materials and methods
Our method takes as input raw .CEL files and produces a table of inferred genomewide gains and losses. The software package, ÇOKGEN, provides a configurable platform for CNV identification, allowing users to: adjust the parameters of our default formulation to tune the behavior of the method to the target application (for example, aggressive versus conservative in calling CNVs); and to specify their own target objective functions. ÇOKGEN also produces 'zoomable' plots of raw copy number at the chromosome and subchromosome level for manual inspection of identified copy numbers. An overview of the methods implemented in ÇOKGEN is given in Figure 1. Details of each step are provided in this section.
Intensity extraction and normalization of raw data
The raw probe intensities for each array are encoded in the binary .CEL files output by the Affymetrix instrument, one file for each array. As a first step, we use the R package affxparser [39] to extract the intensities for each array locus from the .CEL files. Next, we quantile normalize [13] the intensities across all arrays in the experiment. This enables fair comparison of intensities, taking into account systematic nonbiological differences such as overall array brightness.
Raw copy number for SNP markers
The genomic loci interrogated on the Affymetrix 6.0 array fall into two categories  SNP markers and copy number (CN) markers. The array contains 887,876 autosomal CN and 869,224 autosomal SNP markers, for a total of 1,757,100 (we discard the X and Y chromosomes to avoid gender complications, as well as mitochondrial markers). The markers are ordered from i = 1 to ~1.8 million according to genomic coordinates. A SNP marker is interrogated by either six or eight probes  half for each of the A and B alleles  and hence produces six or eight normalized intensity measurements for each array. Since the vast majority of SNP markers have six probes, we present that case here. Let A_{i1}, A_{i2}, A_{i3}, B_{i1}, B_{i2}, and B_{i3 }denote the three A allele and three B allele measurements for a SNP marker i. Our aim is to produce allelespecific raw copy numbers A_{ i }and B_{ i }for the two alleles such that the distance from the origin in (A, B) Cartesian coordinates produces a raw measure of the copy number at the i^{th} marker. Toward this end, we linearly rescale the intensities so that is approximately equal to 2.0, regardless of genotype, for markers that are already deemed to have normal copy numbers (that is, two copies).
via leastsquares regression, where is the rescaled copy number for allele A at SNP i; for 1 ≤ 1 ≤ 3 are model parameters, and is the error term. More specifically, in the absence of copy number variation, is 2.0 for an AA genotype, for an AB genotype, and 0 for a BB genotype. The fitting procedure yields estimates for the model parameters. We model B allele copy number in a similar manner, and obtain estimates for the model parameters, quantifying the relationship between B allele copy number and the six probe intensities. The objective here is to capture the individual responsiveness of each probe to varying quantities of DNA harboring the A and B alleles.
we can safely assume in general that most of the middle two quartiles, across all samples, of PI_{ i }are from individuals with two copies of the chromosomal segment that contains marker i. In other words, the individuals that fall into these quartiles for the corresponding marker are likely to carry diploid genotypes AA, AB, or BB. Consequently, we fit the model based on these samples' genotypes.
Raw copy numbers for CN markers
Using these two separate procedures for SNP and CN markers yields raw copy numbers R_{ i }for all markers i from 1 to ~1.8 million, ordered along the genome according to hg18 (build 36 of the human genome) coordinates. All 270 HapMap samples are used to parameterize the regression model for raw copy number estimation of both SNP and CN markers. Figure 7a gives an example of raw copy numbers for a 394marker region.
Algorithm for copy number variant detection
Key to our approach is the observation that CNV identification can be formulated explicitly as an optimization problem without any requirement of reference models or training data. Based on general knowledge of the microarray technology and basic biological insights on copy number variation, we specify various quantitative measures that gauge the suitability of copy number assignments based on observed array intensities. We then formulate an objective function that captures the tradeoff between these measures, so that the minima of this function represent optimal CNV assignments. This function is characterized by userdefined parameters, allowing the user to tune the performance of algorithms based on the requirements of the specific application (for example, minimizing false positives due to the cost of experimental verification versus minimizing false negatives to capture existing variation comprehensively).
Formally, the objective of CNV identification is to find a mapping S: {1, ..., N} → , where {1, ..., N} denotes the ordered set of markers for the whole genome and = {C_{+}, C_{0}, C_{}} is the set of the gain, normal and loss classes, denoted respectively as C_{+}, C_{0} and C_{}. Thus, our objective is to assign a class value from C to each marker on a genome based on the R_{ i }values such that the class assignment of consecutive markers and their raw copy number estimates are as consistent as possible.
We next introduce the objective criteria that are included in the default objective function implemented in ÇOKGEN and the motivation behind these criteria. Researchers may wish to design an objective function of their choice, and indeed our software takes the objective function as an argument precisely to accommodate this. We describe the function as applied to a chromosome with M markers since each chromosome is processed separately.
Variability in raw copy numbers within each copy class should be minimized
Consequently, a desirable S is expected to minimize σ(S) (subject to other constraints). Note that this formulation does not make any assumption about the expected raw copy numbers of the markers and is therefore robust to any systematic bias that might be encountered in measurement and normalization of R_{ i }.
Parsimony principle: observed variability should be explained via minimum number of anomalies
Here I(.) denotes the indicator function (that is, it is equal to 1 if the statement being evaluated is true, and 0 otherwise).
Filtering out noise by eliminating smaller regions
as an objective criterion that penalizes shorter CNVs (e denotes the natural logarithmic base).
Filtering out noise by eliminating possible false positives
into the objective function to minimize the effects of the noisy signal. Here, T_{gain} and T_{loss} are userdefined parameters that basically define the upper and lower limits for the raw copy number of markers in the set Π(C_{0}) (that is, the set of markers assigned to the normal class). As T_{gain} is increased and T_{loss} is decreased, candidate regions are penalized more harshly. In our experiments, we use 2.25 and 1.75 for T_{gain} and T_{loss}, respectively, since these values provide reasonable performance.
Putting the pieces together: a single objective function for copy number variant identification
is minimized at S = S*. Here the tunable coefficients k_{ σ }, k_{ χ }, k_{ λ }, k_{ δ }adjust the relative importance of the objective criteria with respect to each other. In our experiments, for k_{ λ }and k_{ δ }, we choose large values such as 10^{5} and 10^{6}, respectively, to prohibitively eliminate candidate regions that are likely to be false positive during the course of the algorithm (as opposed to filtering them out in a postprocessing phase).
Two phase copy number variant identification
Since the solution space of the optimal copy number assignment problem is exponential, we require a good initial solution and a heuristic algorithm that iteratively improves the solution. For this purpose, we use a twophase algorithm: we first determine a set of candidate gain and deletion regions via a filtering and aggressive edge detection procedure that we consider as an initial CNV assignment, S^{(0)}; and then we employ an iterative improvement based algorithm to adjust the boundaries of duplications and deletions accurately, and eliminate false positives.
Consequently, introducing an adjustable repetition parameter W, we obtain as a smooth version of the copy number intensity for a user defined value of W. Here, larger W provides smoother signals, thereby eliminating false positives, at the cost of missing true CNVs that span a smaller number of markers. For the ÇOKGEN's default value, we chose W = 20, for which we obtain reasonable results. Figure 7b demonstrates how the raw copy number R_{ i }in Figure 7a is converted into a smooth signal using the low pass filter.
Identification of candidate copy number variation regions via edge detection
where denotes the derivative of at marker i. These markers are the approximate inflection points of the signal .
Now let Q_{ ij }denote the indices corresponding to the set of contiguous markers on the genome starting from marker i and ending at marker j, where i ≤ j. Given the user defined thresholding parameter T_{gain} (see above), we designate Q_{ ij }as a candidate gain region (that is, ∀ k ∈ Q_{ ij }, S^{(0)}(k) = C_{+}) if it satisfies the following conditions: i ∈ D_{max} and j ∈ D_{min}; there exists at least one marker p, i ≤ p ≤ j, such that ; max(Q_{ ij }∩ D_{max}) < min(Q_{ ij }∩ D_{min}); and Q_{ ij }is a maximal set of contiguous markers satisfying the first three conditions.
All markers m that are not included in a candidate loss or gain region are preliminarily designated as 'normal', that is, S^{(0)}(m) is set to C_{0}. As a special case, if a candidate gain/loss region identified by edge detection is very close to another candidate region of its type, then we merge these two candidate regions into a single region, since they are likely to correspond to the same aberration.
This procedure gives us an initial CNV identification assignment S^{(0)}. This solution is quite aggressive in the sense that many truly normal (copy number two) markers are likely to be placed in the gain or loss classes. To eliminate these false positives and obtain S*, we use an optimizationbased algorithm to tune the boundaries of candidate gain and deletion regions as discussed in the next section.
Fine tuning of the region boundaries using optimization with simulated annealing
where S^{(t+1) }denotes the copy number assignment if move ν is made and S^{(t) }is the current copy number assignment. We use a stochastic algorithm based on simulated annealing [40] to determine the move. Simulated annealing is an iterative improvement heuristic that proceeds by repeated moves to improve the quality of the solution. Key to its efficiency is the stochastic nature of the selection of moves. At each step, the algorithm first randomly chooses a candidate gain or loss region, Q_{ ij }, from the set ζ(S) and then chooses a move v from the set of all moves that are validly defined on Q_{ ij }. If the gain γ(v) associated with the candidate move is positive, then the move is made. If the gain is not positive, the move is still made with a certain probability, which is proportional to the gain and declines as a function of time in the course of the algorithm. Therefore, simulated annealing starts its course with aggressive moves to jump out of undesirable local optima, and becomes more conservative as the algorithm proceeds, smoothly converging to a locally optimum solution. The procedure is repeated until either there is no valid positive gain move left to be done on the current solution or a userdefined number of negative gain moves, τ, have already been done consecutively (we use τ = 5 as our default). The mapping obtained at the end of the procedure is reported as S*.
We note that our algorithm allows us to consider the candidate regions in ζ(S) independently (as opposed to the entire chromosome) because the candidate regions with potential aberrations are sparse, and we therefore work on local subproblems associated with each candidate region separately. This results in significant improvements in computational efficiency. Since the distribution of raw copy numbers in the neighborhood of Q_{ ij }∈ ζ(S) provides a good sampling of raw copy numbers in the entire chromosome, the quality of the solution to this local problem does not deviate significantly from the global problem. Indeed, in our experimental evaluations, we observe that there is no significant difference between solving the class assignment globally (applying the above algorithm to a whole chromosome) or locally (as we describe above) in terms of their specificity and sensitivity in predicting copy number variations.
Data
For the application of our method, we used Affymetrix 6.0 array data from a total of 270 HapMap individuals. In the data set, there are 30 motherfatherchild trios from the Yoruba people of Ibadan, Nigeria, 45 unrelated individuals from Tokyo, Japan, 45 unrelated individuals from Beijing, China and 30 Caucasian trios that were collected in 1980 from US residents with northern and western European ancestry by the Centre d'Etude du Polymorphisme Humain (CEPH).
Multiplex ligationdependent probe amplification method
Each MLPA probe was designed synthetically to match sequences within the region of interest avoiding all SNPs in the area. Control probes were used from previously published work [41]. Oligonucleotides were synthesized by IDT (Coralville, IA, USA) with 5'phosphorylation of each downstream probe and tagged with common PCR primer sequences [32]. Probes were hybridized with 100 ng of DNA sample using MLPA reagents (part number EK1, MRCHolland BV, Amsterdam, The Netherlands) in accordance with the recommended protocol. Samples were diluted 20fold and analyzed on a 3130xl Genetic Analyzer from Applied Biosystems (Foster City, CA, USA) with GeneMapper software. Control DNA used were male and female genomics DNA pools (Promega, Madison, WI, USA). Peak height ratios were normalized to the mean of the entire data set, with subsequent elimination of outlier samples from the calculation of the mean.
Other methods
For analysis using dChip [24], we downloaded the version with a build date of 21 August 2008. We used the HMM as the Inferred copy method option with 50% of the samples trimmed. For Birdseye [17], we used version 1.5.1 of the Birdsuite package, which can be downloaded from [42]. The default parameters for that package were used. The latest version of PennCNV software, available as of 18 November 2008, was downloaded from [43] for analysis using PennCNV. We followed the steps described at [44] for the PennCNVAffy protocol and used the default parameters for analysis. For QuantiSNP, we downloaded version 1.1 from [45], followed the steps described at QuantiSNP in the Affymetrix tutorial document located at [46] and used the default parameters.
We also note that we combined copy number 0 and 1 into one category  loss  and copy number greater than 2 into one category  gain  for the results obtained by all packages, in order to compare their results with ÇOKGEN's results fairly.
Computation of expected discordance rate
Abbreviations
 CN:

copy number
 CNP:

copy number polymorphism
 CNV:

copy number variant
 ÇOKGEN (chokgen):

amalgamation of the Turkish words ÇOK = multiple and GEN = gene
 HMM:

hidden Markov model
 MLPA:

multiplex ligationdependent probe amplification
 SNP:

single nucleotide polymorphism.
Declarations
Acknowledgements
The authors express thanks to Sivakumaran Arumugam and Katherine Wilkins for running the Birdseye and dChip software, respectively. We also thank Simone Edelheit and the CWRU Genomics Core Facility for assistance with MLPA analysis. This work was supported by National Cancer Institute Grant R01 CA131341 to TL.
Authors’ Affiliations
References
 International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 12991320. 10.1038/4371241a.View ArticleGoogle Scholar
 Affymetrix: GenomeWide Human SNP Array 6.0 Data Sheet. 2007, Santa Clara, California: AffymetrixGoogle Scholar
 Illumina: Human1Mduo Beadchip Data Sheet. 2007, San Diego, CA: IlluminaGoogle Scholar
 Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 8597. 10.1038/nrg1767.PubMedView ArticleGoogle Scholar
 RoveletLecrux A, Hannequin D, Raux G, Le Meur N, Laquerrière A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant earlyonset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet. 2006, 38: 2426. 10.1038/ng1718.PubMedView ArticleGoogle Scholar
 Fellermann K, Stange DE, Schaeffeler E, Schmalzl H, Wehkamp J, Bevins CL, Reinisch W, Teml A, Schwab M, Lichter P, Radlwimmer B, Stange EF: A chromosome 8 genecluster polymorphism with low human betadefensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet. 2006, 79: 439448. 10.1086/505915.PubMedPubMed CentralView ArticleGoogle Scholar
 Sebat J, Lakshmi B, Malhotra D, Troge J, LeseMartin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, et al: Strong association of de novo copy number mutations with autism. Science. 2007, 316: 445449. 10.1126/science.1138659.PubMedPubMed CentralView ArticleGoogle Scholar
 Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M: Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet. 2008, 40: 880885. 10.1038/ng.162.PubMedView ArticleGoogle Scholar
 Zhao X, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004, 64: 30603071. 10.1158/00085472.CAN033308.PubMedView ArticleGoogle Scholar
 Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, Cheung SW, Shen RM, Barker DL, Gunderson KL: Highresolution genomic profiling of chromosomal aberrations using Infinium wholegenome genotyping. Genome Res. 2006, 16: 11361148. 10.1101/gr.5402306.PubMedPubMed CentralView ArticleGoogle Scholar
 Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS: A genomewide scalable SNP genotyping assay using microarray technology. Nat Genet. 2005, 37: 549554. 10.1038/ng1547.PubMedView ArticleGoogle Scholar
 LindbladToh K, Tanenbaum DM, Daly MJ, Winchester E, Lui WO, Villapakkam A, Stanton SE, Larsson C, Hudson TJ, Johnson BE, Lander ES, Meyerson M: Lossofheterozygosity analysis of smallcell lung carcinomas using singlenucleotide polymorphism arrays. Nat Biotechnol. 2000, 18: 10011005. 10.1038/79269.PubMedView ArticleGoogle Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185193. 10.1093/bioinformatics/19.2.185.PubMedView ArticleGoogle Scholar
 Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNParraybased lossofheterozygosity data. Bioinformatics. 2004, 20: 12331240. 10.1093/bioinformatics/bth069.PubMedView ArticleGoogle Scholar
 LaFramboise T, Harrington D, Weir BA: PLASQ: a generalized linear modelbased procedure to determine allelic dosage in cancer cells from SNP array data. Biostatistics. 2007, 8: 323336. 10.1093/biostatistics/kxl012.PubMedView ArticleGoogle Scholar
 Bengtsson H, Irizarry R, Carvalho B, Speed TP: Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008, 24: 759767. 10.1093/bioinformatics/btn016.PubMedView ArticleGoogle Scholar
 Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E, Veitch J, Collins PJ, Darvishi K, Lee C, Nizzari MM, Gabriel SB, Purcell S, Daly MJ, Altshuler D: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40: 12531260. 10.1038/ng.237.PubMedPubMed CentralView ArticleGoogle Scholar
 Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG, Sugarbaker D, Chen F, Rubin MA, Jänne PA, Girard L, Minna J, Christiani D, Li C, Sellers WR, Meyerson M: Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Res. 2005, 65: 55615570. 10.1158/00085472.CAN044603.PubMedView ArticleGoogle Scholar
 Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics. 2004, 5: 557572. 10.1093/biostatistics/kxh008.PubMedView ArticleGoogle Scholar
 Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657663. 10.1093/bioinformatics/btl646.PubMedView ArticleGoogle Scholar
 Polzehl J, Spokoiny S: Adaptive weights smoothing with applications to image restoration. J R Stat Soc, Ser B. 2000, 62: 335354. 10.1111/14679868.00235.View ArticleGoogle Scholar
 Hupé P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20: 34133422. 10.1093/bioinformatics/bth418.PubMedView ArticleGoogle Scholar
 McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D: Integrated detection and populationgenetic analysis of SNPs and copy number variation. Nat Genet. 2008, 40: 11661174. 10.1038/ng.238.PubMedView ArticleGoogle Scholar
 dChip Software Website. [http://www.dchip.org]
 Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 3136. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
 Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J: QuantiSNP: an objective Bayes hiddenMarkov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007, 35: 20132025. 10.1093/nar/gkm076.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for highresolution copy number variation detection in wholegenome SNP genotyping data. Genome Res. 2007, 17: 16651674. 10.1101/gr.6861907.PubMedPubMed CentralView ArticleGoogle Scholar
 Pinto D, Marshall C, Feuk L, Scherer SW: Copynumber variation in control population cohorts. Hum Mol Genet. 2007, 16: R168173. 10.1093/hmg/ddm241.PubMedView ArticleGoogle Scholar
 Active Perl. [http://www.activestate.com/activeperl/]
 Affymetrix Power Tools Software Package. [http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx]
 Database of Genomic Variants. [http://projects.tcag.ca/variation/]
 Schouten JP, McElgunn CJ, Waaijer R, Zwijnenburg D, Diepvens F, Pals G: Relative quantification of 40 nucleic acid sequences by multiplex ligationdependent probe amplification. Nucleic Acids Res. 2002, 30: e5710.1093/nar/gnf056.PubMedPubMed CentralView ArticleGoogle Scholar
 Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc. 1977, 39: 138.Google Scholar
 Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellinghoff IK, Hofer MD, Descazeaud A, Rubin MA, Meyerson M, Wong WH, Sellers WR, Li C: Inferring lossofheterozygosity from unpaired tumors using highdensity oligonucleotide SNP arrays. PLoS Comput Biol. 2006, 2: e4110.1371/journal.pcbi.0020041.PubMedPubMed CentralView ArticleGoogle Scholar
 PiqueRegi R, Tsau ES, Ortega A, Seeger R, Asgharzadeh S: Wavelet footprints and sparse bayesian learning for DNA copy number change analysis. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 1520 April 2007; Honolulu, HI. 2007, IEEE Signal Processing Society Editors: IEEE Publications, Piscataway, NJ, USA, 1: 353356.Google Scholar
 Alqallaf A, Tewfik A, Selleck S, Johnson R: Framework for the analysis of genetic variations across multiple DNA copy number samples. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 30 March4 April 2008; Las Vegas, NV. 2008, IEEE Signal Processing Society Editors: IEEE Publications, Piscataway, NJ, USA, 1: 553556.View ArticleGoogle Scholar
 Shen F, Huang J, Fitch KR, Truong VB, Kirby A, Chen W, Zhang J, Liu G, McCarroll SA, Jones KW, Shapero MH: Improved detection of global copy number variation using high density, nonpolymorphic oligonucleotide probes. BMC Genet. 2008, 9: 2710.1186/14712156927.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al: The diploid genome sequence of an Asian individual. Nature. 2008, 456: 6065. 10.1038/nature07484.PubMedPubMed CentralView ArticleGoogle Scholar
 Affymetrix File Parsing SDK. [http://www.bioconductor.org/packages/2.2/bioc/html/affxparser.html]
 Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science. 1983, 220: 671680. 10.1126/science.220.4598.671.PubMedView ArticleGoogle Scholar
 Macconaill LE, Aldred MA, Lu X, Laframboise T: Toward accurate highthroughput SNP genotyping in the presence of inherited copy number variation. BMC Genomics. 2007, 8: 21110.1186/147121648211.PubMedPubMed CentralView ArticleGoogle Scholar
 Birdsuite: Downloads. [http://www.broad.mit.edu/science/programs/medicalandpopulationgenetics/birdsuite/birdsuitedownloads0]
 PennCNV Download. [http://www.openbioinformatics.org/penncnv/penncnv_download.html]
 PennCNVAffy Tutorials. [http://www.openbioinformatics.org/penncnv/penncnv_tutorial_affy_gw6.html]
 QuantiSNP Download. [http://www.well.ox.ac.uk/~ioannisr/quantisnp/]
 QuantiSNP for Affymetrix Tutorials. [http://groups.google.co.uk/group/quantisnp/files]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.