CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing

High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, as a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs. Electronic supplementary material The online version of this article (10.1186/s13059-018-1578-y) contains supplementary material, which is available to authorized users.

1. For each sample j, fit a smoothing spline Y :j /N j old exp(g ⇥ h T j: ) ⇠ GC to get f j (GC).

For each exon
3. Denote Z = Nf(GC) new . Apply SVD to row-centered log(Y/Z) to obtain the K right singular vectors and use as h old .
(a) For each i, fit Poisson log-linear regression with Y i: as response, h old as covariates, log(Z i: ) as fixed o↵set to obtain updated estimates as {g i1 , · · · , g iK }. (b) For each j, fit Poisson log-linear regression with Y :j as response, g as covariates, log(Z :j ) as fixed o↵set to obtain updated estimates as {h new j1 , · · · , h new jK }. (c) Center each row of g ⇥ (h new ) T and apply SVD to the row-centered matrix to obtain the K right singular vectors to update h new . (d) Repeat steps (a) to (c) with h old = h new till convergence to obtain h, g.

4.
Repeat steps 1 to 3 with old = new till convergence.
Algorithm S2. Parameter estimation algorithm for CODEX2 with negative control samples.
Let J c be the indices of the control samples.
1. Apply CODEX to the null samples Y :Jc and GC using Supplementary Algorithm 1 to get an estimate of h Jc: , g, and f Jc (GC).
Algorithm S3. Parameter estimation algorithm for CODEX2 without negative control samples.
Let I ⇤ be the indices of exons with CNVs with high population frequencies.
1. Apply CODEX to the null regions Y I ⇤ : and GC I ⇤ using Supplementary Algorithm 1 to get an estimate of h, g I ⇤ , I ⇤ and f (GC I ⇤ ).
2. Use the estimated non-parametric smooth spline function to estimate the GC content bias in the non-null regions f (GC I ⇤ ) with the corresponding GC I ⇤ .
(a) E-step: whereμ i ⇤ and⇡ i ⇤ are from the previous M-step and P (Y i ⇤ j | i ⇤ j ) is the probability for Poisson distribution with i ⇤ j calculated based on the given parameters.
whereẐ i ⇤ j is from the previous E-step. Run Poisson log-linear regression with Y i ⇤ : as response, h andẐ i ⇤ : as covariates, and log(N ⇥ f (GC)) i ⇤ : as fixed o↵sets to obtain estimates ofˆ i ⇤ as intercept, {ĝ i ⇤ 1 , ...,ĝ i ⇤ K } andμ i ⇤ as coe cients. Figure S1. WGS CNV calls from Phase 3 release by two callers. WGS CNV calls from Phase 3 release by two callers. Number of deletion and duplication calls are reported in the cohort of 90 HapMap samples that we used to benchmark. There is substantial discrepancy between the results by the two callers. These CNV calls are contaminated by false positives.  Figure S2. WGS CNV calls in trio dataset by CNVnator and CODEX2. WGS CNV calls in family trio dataset by (A) CNVnator and (B) CODEX2. CNVnator and CODEX2 return on average 1011 and 188 CNVs per individual, respectively. Results by CODEX2 show higher similarity between child and parents and lower similarity between non-related individuals.

Caller1
(A) (B) CNVnator CODEX2 Figure S3. Performance assessment of WGS CNV calls. Performance assessment of WGS CNV calls by CNVnator and CODEX2. CNVs that are true positives are more likely to be shared between related individuals than between unrelated individuals. (A) Proportion of CNVs detected in mother that are also in father. Except for common CNVs that are shared between non-related individuals, this proportion should be low and is indicative of specificity.
(B) Proportion of CNVs detected in child that are also in either parents. This Mendelian concordance should be high and is indicative of sensitivity. (C) Ratio of concordance between related individuals from (A) and unrelated individuals from (B) as a joint measurement of sensitivity and specificity.  Figure S4. Model selection and parameter estimation in spike-in studies. Model selection and parameter estimation in spike-in studies. (A) Number of latent factors is determined by variance reduction and BIC based on the null read depth after quality control procedures and is kept the same throughout the spike-in studies. (B-D) True underlying parameters and estimated parameters by CODEX and CODEX2. CODEX biasedly estimates the exonspecific latent factor {g 1 , g 2 , g 3 } for the exons, where common CNV signals are in silico added. Figure S5. Biased parameter estimation by CODEX with common CNV signals. Parameter estimation by CODEX when the null data is mixed with alternatives with di↵erent frequencies. Here we focus on a specific exon i ⇤ and try to estimate g i ⇤ 1 from Poisson generalized linear model taking h 1 as known from previous iteration.   Figure S6. Assessment of precision and recall rates with di↵erent CNV lengths. Assessment of precision and recall rates with di↵erent CNV lengths. CODEX2 has nearly perfect performance compared to CODEX and SVD-based method, which su↵er from low recall and precision rates for large and common CNV singals. Figure S7. Relationship between CNV state and batch e↵ect. Relationship between CNV state and batch e↵ect. (A) The null read depth from the 1000 Genomes Project after quality control procedures show distinct batch e↵ect between two centers, where the samples were sequenced. An example spike-in run where the CNVs are mostly added to the samples from one center. The CNV state and the batch e↵ect are highly correlated. (B) An example of Poisson log-linear regression to estimate the exon-specific latent factor for a specific exon in a common CNV region. The vertical axis is Y (in this case the spike-in read depth) while the horizontal axis is the (estimated null read depth by CODEX and CODEX2). If there is no CNV, these two quantities should be approximately equal along the diagonal line. CODEX's fitted results separate the carriers and non-carriers at two sides of the diagonal line.