Whole exome sequencing coupled with unbiased functional analysis reveals new Hirschsprung disease genes

Background Hirschsprung disease (HSCR), which is congenital obstruction of the bowel, results from a failure of enteric nervous system (ENS) progenitors to migrate, proliferate, differentiate, or survive within the distal intestine. Previous studies that have searched for genes underlying HSCR have focused on ENS-related pathways and genes not fitting the current knowledge have thus often been ignored. We identify and validate novel HSCR genes using whole exome sequencing (WES), burden tests, in silico prediction, unbiased in vivo analyses of the mutated genes in zebrafish, and expression analyses in zebrafish, mouse, and human. Results We performed de novo mutation (DNM) screening on 24 HSCR trios. We identify 28 DNMs in 21 different genes. Eight of the DNMs we identified occur in RET, the main HSCR gene, and the remaining 20 DNMs reside in genes not reported in the ENS. Knockdown of all 12 genes with missense or loss-of-function DNMs showed that the orthologs of four genes (DENND3, NCLN, NUP98, and TBATA) are indispensable for ENS development in zebrafish, and these results were confirmed by CRISPR knockout. These genes are also expressed in human and mouse gut and/or ENS progenitors. Importantly, the encoded proteins are linked to neuronal processes shared by the central nervous system and the ENS. Conclusions Our data open new fields of investigation into HSCR pathology and provide novel insights into the development of the ENS. Moreover, the study demonstrates that functional analyses of genes carrying DNMs are warranted to delineate the full genetic architecture of rare complex diseases. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1174-6) contains supplementary material, which is available to authorized users.


Generation of ENS candidate genes
Candidate genes were selected by a literature review on Hirschsprung disease research, which included both genetic and functional studies. Most of them were also covered in Jiang et al. [58] and Gui et al. [59], which previously summarized possible genes related to HSCR or involved in ENS development. The genes were categorized into 4 major types, genes selected based on: genetic linkage, genetic association, microarray expression, and animal models. In total 116 genes were selected that fit more than 1 category (Additional file 7: Table S6). A few of these genes fall into the same pathways previously implicated in neural crest cell migration, proliferation and differentiation. Three pathways (RET signaling pathway, EDNRB signaling pathway and KBP signaling pathway) were key partners involved in ENS development [60].

Quality assessment and control for exome variants
Concrete criterions in quality assessment (QA) include: total number of variants; dbSNP137 coverage; Transition/Transversion (Ti/Tv) ratio; genotype concordance rate and cross-sample identical-by-decent (IBD) relatedness [53]. Two complementary steps were applied in quality control (QC), including variant-level filtering (hard filtration or variant quality recalibration (VQSR)) and genotype-level filtering. In detail, we annotated GATK-called variants as low quality SNPs ("QD <2.0" or "MQ <40.0" or "FS >60.0" or "HaplotypeScore >13.0" or "MQRankSum <-12.5" or "ReadPosRankSum <-8.0" in their 'info' field) and low quality Indels ("QD <2.0" or "ReadPosRankSum <-20.0" or "InbreedingCoeff <-0.8" or "FS >200.0 in 'info' field); in addition, VQSR differentiated a few relatively low quality SNVs (labeled as "TruthSensitivityTranche99.90to100.00" after Gaussian mixture modeling at true sensitivity 99%) from other passed SNVs. On the other hand, individual genotypes were evaluated by quality parameters in the field of genotyping, mainly reflecting the likelihood of three possible genotypes (reference homozygous, heterozygous and alternative homozygous). A heterozygous genotype was kept only if it was supported by >4 total reads, and the ratio for alternative allele is above 0.25. Comparatively, a reference or alternative homozygous genotype was accepted if it was supported by > 4 total reads, and ratio for reference or alternative allele is above 0.95.

Mutation validation and prediction
Each DNM candidate was manually inspected using the Integrative Genomic Viewer (IGV) and they were categorized into five different groups: probably true positive, possibly true positive, unclear, possibly false positive and probably false positive. Two lists of putative DNM candidates were generated for confirmation by Sanger sequencing. The first list contains 74 variants with high confidence ranking (probably true positive and possibly true positive). Raw data were then re-evaluated to generate 48 candidates with relatively low-confidence (unclear), especially for those trios without any confirmed DNM in the first round. Rare (minor allele frequency < 0.01 in public databases) predicted damaging variants in genes carrying confirmed de novo mutations were extracted from exome calls and submitted for Sanger validation. The allele origin was determined by checking the mutation site in both parents. Phasing of DNM and inherited variants in the same gene was also performed by Sanger sequencing. Rare damaging inherited variants located in 116 ENS candidate genes were extracted from exome reads using the same pipeline (Additional file 2: Figure S2); and the transmission patterns of these variants were determined by referring to parental and maternal genotypes at the same site.
Stepwise logistic regression was used to select effective predictors of the de novo status in a trio and for the presence or absence of a mutation in a given individual. The performance of these prediction models was evaluated using 10-fold cross validation by the software WEKA.
For model fitting to DNM status in the trios, genotype quality (represented by normalized phred likelihood score for the second most likely genotype) in the child and alternative allelic ratio in the parents were prioritized. The Area Under the Receiver Operating Characteristic Curve (AUC) was 0.959 (Additional file 4: Table S3) which suggests that the model predicts the DNM status accurately. This model was then adopted to test all other unvalidated de novo candidates (falling under the "unclear", "possibly false positive" or "probably false positive" categories), which all turned out to be negatives. For model fitting to the presence or absence of a variant in the patients, genotype quality and alternative allelic ratio in each individual were retained. The AUC was 0.824 (Additional file 4: Table S3). This second model was then used to help predict the presence of rare variants in the DNM genes or ENS genes. Only those variants predicted as positive candidates were shown (Additional file 6: Table S5). Figure S1 Flow chart of the study design I: statistical evidence from gene-wise burden analysis (detail in Additional file 7: Table S6); II: bioinformatics prediction of the mutation impact and the gene network (detail in Additional file 8: Table S7 and Additional file 2: Figure S7); III: mutation profile (de novo mutations, rare damaging variants in ENS candidate genes, and risk RET enhancer common SNP) for each patient (detail in Additional file 6: Table S5

Figure S7: Connection of DNM genes and ENS genes at pathway/network level
Ingenuity Pathway Analysis (IPA) was used to link 116 ENS candidate genes (left, Additional file 9: Table S8) with the 20 newly found genes harboring de novo mutations (right). Solid and dotted lines represent direct and indirect interactions, respectively.

Figure S8: qPCR confirmation of gene knockdown by SBMO
Relative expression of the candidate genes between SBMO-injected (black bar) and control morpholino-injected embryos (grey bar) by qPCR.