Metastatic tumor evolution and organoid modeling implicate TGFBR2 as a cancer driver in diffuse gastric cancer

Background Gastric cancer is the second-leading cause of global cancer deaths, with metastatic disease representing the primary cause of mortality. To identify candidate drivers involved in oncogenesis and tumor evolution, we conduct an extensive genome sequencing analysis of metastatic progression in a diffuse gastric cancer. This involves a comparison between a primary tumor from a hereditary diffuse gastric cancer syndrome proband and its recurrence as an ovarian metastasis. Results Both the primary tumor and ovarian metastasis have common biallelic loss-of-function of both the CDH1 and TP53 tumor suppressors, indicating a common genetic origin. While the primary tumor exhibits amplification of the Fibroblast growth factor receptor 2 (FGFR2) gene, the metastasis notably lacks FGFR2 amplification but rather possesses unique biallelic alterations of Transforming growth factor-beta receptor 2 (TGFBR2), indicating the divergent in vivo evolution of a TGFBR2-mutant metastatic clonal population in this patient. As TGFBR2 mutations have not previously been functionally validated in gastric cancer, we modeled the metastatic potential of TGFBR2 loss in a murine three-dimensional primary gastric organoid culture. The Tgfbr2 shRNA knockdown within Cdh1-/-; Tp53-/- organoids generates invasion in vitro and robust metastatic tumorigenicity in vivo, confirming Tgfbr2 metastasis suppressor activity. Conclusions We document the metastatic differentiation and genetic heterogeneity of diffuse gastric cancer and reveal the potential metastatic role of TGFBR2 loss-of-function. In support of this study, we apply a murine primary organoid culture method capable of recapitulating in vivo metastatic gastric cancer. Overall, we describe an integrated approach to identify and functionally validate putative cancer drivers involved in metastasis. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0428-9) contains supplementary material, which is available to authorized users.


Index patient
The patient (III-1) had a family history of gastric cancer consistent with hereditary diffuse gastric cancer (HDGC) syndrome [51]. The family's pedigree is shown in Figure 1. The patient's sister (III-2) had diffuse gastric cancer at the age of 46 years and was also found to be a carrier of the familial CDH1 mutation. The maternal uncle had pancreatic cancer (II-2). The patient's maternal aunt (II-3) was diagnosed with gastric carcinoid at the age of 56 years and had a recurrence at 62 years.

Whole genome and exome sequencing, alignment and coverage analysis
Exome libraries were prepared with the Roche EZ SeqCap exome enrichment probes v2.0 [52].
We used an Illumina HiSeq 2000 instrument with 100 base reads. Whole genome and exome sequencing libraries were clonally amplified through cluster-generation on an Illumina cBot using paired end flowcells and Illumina TruSeq v2 chemistry. Using PicoGreen assay (Invitrogen, San Diego CA) for quantitating the amount of library, we prepared samples for input according to the Illumina cBot User Guide. We loaded sample sequencing libraries at a final concentration of 1.5 pM; we generated an average cluster density of 450K clusters/mm 2 . The clustered flowcell was sequenced on an Illumina HiSeq 2000 for 2 X 100 cycle reads with indexing, using Illumina TruSeq v2 reagents.
Sequence reads aligned to the human genome version hg19 using Burrows-Wheeler Aligner (BWA) [53]. The Genome Analysis Toolkit (GATK) was used to determine overall sequencing coverage for both whole and exome sequencing data sets [54]. Sequencing coverage is listed in Table S1. For the exome capture, the dataset used to determine the overlaps was the annotation BED file for the NimbleGen SeqCap EZ platform (v2) (www.nimblegen.com/products/seqcap/ez/index.html) for the exome capture. The library covers over 30,000 genes across a 36 Mb region, and the probes cover a total of 44.1 Mb.

Variant calling from cancer exomes and whole genome sequencing
Somatic single-nucleotide variants (SNVs) and insertion deletions (indels) were called using the variant caller programs GATK (vs 1.0.577) [54] and Varscan [53]. Both variant callers were run to improve our sensitivity of detection. Varscan was run with the suggested parameters for tumor-normal pair comparison [55]. For Varscan calls, we required evidence on both forward and reverse strands and applied a p-value cut-off of 0.001.
We used the following parameters for analysis with GATK [56]. Base recalibration relied on CountCovariates and TableRecalibration functions. Four different covariates settings were used for the "CountCovariates" option of CountCovariates, ReadGroupCovariate, QualityScoreCovariate, CycleCovariate and DinucCovariate. The output of CountCovariates was used with GATKs TableRecalibration to recalibrate quality scores. The default options were used for this application. In between steps, BAM sequence files were resorted and indexed with Picard sequence analysis suite (http://picard.sourceforge.net/index.shtml) as needed. GATK's UnifiedGenotyper was used for variant calling with parameters recommended by the Broad Institute's best practices for variant discovery guidelines for coverage > 10 (-stand_call_conf 30.0 -stand_emit_conf 10.0). SNVs and indels were called together using the "BOTH" option for the "glm" parameter of the UnifiedGenotyper. Filters were applied to flag poor quality/alignment artifact SNVs. Filters that were applied were for SNV clusters (--clusterWindowSize 10) and hard to validate (MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1), strand bias (SB >= -1.0) and high depth (DP > 50 for genome, DP > 80 for exome). From the VCF output files we required a Phred scaled probability score of greater than 30 and a GATK filter setting of "PASS". We used the BED file for exome coverage within 100 bases of an exon boundary to limit variants to coding and adjacent regions.
For the next step post-variant calling, we filtered out the germline variants from cancer-specific mutations identified among the primary and metastatic cancer genome. Cancer-specific SNV and indel mutations were additionally categorized by potential functional relevance. For the primary gastric tumor, we aggregated the mutation results between the exome and whole genome sequencing. Not surprisingly, we found that the exome sequencing produced higher quality calls from exon regions. Whole genome sequence was useful for delineating mutations outside of exons. Given the lower coverage of the whole genome sequencing of the metastatic tumor, we exclusively relied on the exome sequencing for cancer mutation analysis. Our results are listed in Table S2.
All filtered GATK and Varscan calls were annotated using SeattleSeq131 (http://snp.gs.washington.edu/SeattleSeqAnnotation134/) to identify single nucleotide polymorphisms (SNPs) versus novel variants. Part of this annotation includes the comparison of variant positions with dbSNP130 [57, 58] and variants generated from the 1,000 Genomes Project [59]. We considered tumor-specific, novel variants overlapping with coding sequences, splice sites and RNA genes as potential mutations and these candidates were validated as described later.
Of the total variants detected in whole genome analysis by GATK, 92% of SNVs and 5% of indels were described in dbSNP130. Similarly, 94% of SNVs called by GATK in the exome data, and 6% of indels were present in dbSNP130. In the exome data, GATK predicted 49 nongermline, somatic variants present in both tumor and metastasis and 49 metastasis-specific variants. Varscan predicted 94 non-germline, somatic variants present in both tumor and metastasis and 160 metastasis-specific variants. Given that the concordance between SNP array genotypes and GATK filtered calls was 99% in the normal sample, we generally used GATK calls and included a subset of Varscan mutation calls not identified from GATK for subsequent validation with an independent targeted resequencing analysis.

Analysis of chromosomal allelic imbalance and copy number variation
We used several methods to assess loss-of-heterozygosity (LOH) and somatic copy number [61]. Our analysis relied on minor allelic frequency (MAF) data. The MAF is a ratio comparison of allelic read depths from heterozygous SNVs identified from the normal genome compared to the same position from the tumor. First, from each genome the GATK SNV calls are filtered to remove low quality data (quality < 30). Using read depths available from the VCF outputs for each of these heterozygous SNPs, we determined the MAF value by dividing the minor allele frequency over the major allele frequency for each genome (e.g. normal or cancer). For graphic display, we used a smoothed MAF value based on a window average of 10 contiguous SNPs from each genome. Cancer-related allelic imbalance alterations were based on deriving the log2 value of the cancer MAF over the normal MAF.
For statistical calling of the genomic intervals of specific LOH events, we first employed the analysis method published by Chen et al. on SNVs determined from the whole genome sequencing analysis of the two cancer genomes [62]. The associated R package for this analysis method is available under the project name "PSCN" at the following URL (http://rforge.r-project.org/). The input data was the same as used in visualization and graphics. This data set involved 2.1 million SNPs that were noted to be heterozygote in the normal genome.
Our analysis involved a comparison with the same positions in the primary or metastatic tumor genomes. The PSCN algorithm was applied one chromosome at a time or applying the method to the whole genome without separating by chromosomes. We used the GATK-based read counts for the each allele of the SNP comparing the tumor to the normal sample to determine the intervals of allelic imbalance. Chen et al.'s method is also able to discriminate allelic imbalances even in the context of intermixed normal genomic DNA.
To determine somatic copy number alterations and the affected genomic intervals from whole genome sequencing data, we also used the published SeqCBS method [63]. This analysis method was informative in distinguishing LOH events attributable to genomic deletions versus copy number neutral alterations. SeqCBS uses circular binary segmentation (CBS) to find regions whose coverage differs significantly from neighboring regions [63]. The software implementation is available as an open-source R package named SeqCBS (http://cran.rproject.org). The CNV analysis used an R script that reads a configuration file listing the sequence data sets to be compared, namely the case (normal) versus the control (cancer). We also conducted an analysis comparing the primary versus the metastatic tumor as additional evidence of CNV differences. The algorithm then performs the segmentation on these two files, compares them, and produces both local and whole-chromosome CNV plots (Figure 2, Additional file 1; Figure S1). For any such region, there is a general test statistic and a relative gain or loss copy number value. Generally, we required a test statistic > 1,000 as a basic cutoff and a copy number value of greater than 2.5 or less than 1.00 as our thresholds for marking an event as a significant amplification or loss. We determined the genes within CNV intervals specific to the primary or metastatic tumor.

Cancer genome inter-chromosomal translocations, intra-chromosomal rearrangements and other structural variations
On the primary, metastatic and normal genome sequence, we used the Breakdancer program [64] to assess for inter-chromosomal translocations, large indels, rearrangements and other structural variants (SVs). For inter-chromosomal translocations, we performed separate primary tumor/normal tissue and metastatic tumor/normal tissue comparisons using breakdancer_max.
To be considered as a potential variant, we required an anchor sequence of 20 base pairs on each side of a rearrangement breakpoint (breakdancer_max -t -s 20 -r 10 configfile). We also filtered out implausible cases (e.g. involving Y in female), required a minimum number of reads (20 for an individual genome finding or 10 common to two different genomes) and eliminated calls seen in the normal germline genome. For cancer-specific, intra-chromosomal events such as large genomic deletions, we required at least 20 reads to cover the putative breakpoint of the event that were seen in the primary or metastatic cancer but not the normal germline genome.
As a final filter, we eliminated putative SVs where the anchor sequence occurred in highly repetitive sequences that were a potential source of mapping errors.
We also employed an integrated structural variation (SV) analysis pipeline [68]  . HugeSeq calls a rearrangement if it is identified by at least two algorithms with at least 50% overlap of the genomic coordinates for the variants called. We separated the SV calls into high-confidence and low-confidence calls. As a default, this analysis generates cancer-specific calls by taking the set of variants called in the cancer by at least two algorithms and subtracting out those variants called in the normal tissue by at least two algorithms. We defined as high-confidence the tumor-specific SVs called by two algorithms in the primary tumor that were not identified in the normal genome. Filtering putative translocations whose anchors occur in repetitive regions left only three candidate translocations. All of these were found to occur either in normal only or in both normal and tumor and hence are not somatic. This included a number of candidates including 146 deletions, 93 insertions, and 11 inversions. In total, 66 SVs were chosen for separate validation based on their proximity to coding regions. This validation set included all 43 putative events overlapping exons (17 high-confidence) and an additional 23 events overlapping non-exonic elements of genes involved in cancer (4 high confidence).

Targeted resequencing and validation of cancer mutations and rearrangements affecting candidate cancer genes
From patient 525, we identified a total of 345 SNVs, indels, and rearrangements overlapping with coding and regulatory regions and were in the primary and/or metastatic tumor. These cancer mutations and genomic aberrations were chose for independent targeted resequencing validation. As a general quality control, we included twelve control SNVs at random from a list of heterozygous germline SNVs variants present in both the tumor and the normal tissue. All but one of the randomly chosen control SNVs were in dbSNP.
The targeted resequencing assay as fully described by Myllykangas et al. For structural variants, we used reads from the putative breakpoints that had a Phred-like score greater than 25. We aggregated the sequence reads for each rearrangement loci, eliminated reads that matched the reference genome and then conducted a localized assembly using Velvet [71] with the remaining reads. Parameters for Velvet included a hash length of 19, a contig length-minimum of 50 and a contig coverage depth minimum of 4. The generated Velvet contigs were aligned with megablast and filtered based on location on the correct chromosome, discontinuous sequence alignment starting from the breakpoint and appearance only in the cancer genome. We validated rearrangements based on contigs demonstrating clear breakpoint after this analysis.

Cancer gene mutation and genomic aberration interpretation
After targeted resequencing validation of specific mutations and genomic aberrations (e.g. CNVs), the confirmed cancer-specific variants were assessed their potential impact on protein function, mutations occurring in coding sequences were evaluated by SIFT {Kumar, 2009 #308} and PolyPhen2 version 2.1.0 [16]. Both analysis methods were run with default parameters.
For SIFT, we consider an amino acid substitution to be damaging when the score is <= 0.05, and tolerated if the score is > 0.05. For PolyPhen2, we used the following thresholds for pathogenic classification: 5% / 10% false positive rate (FPR) for the HumDiv model and 10% / 20% FPR for HumVarmodel. Mutations with their posterior probability scores associated with estimated false positive rates at or below the first lower FPR value were predicted to be probably damaging. Mutations with the posterior probabilities associated with false positive rates at or below the second (higher) FPR value were predicted to be possibly damaging.
Mutations with estimated false positive rates above the second (higher) FPR value were classified as benign. We identified the subset of deleterious mutations that were identified with either method.
We utilized several approaches to determine the mutations, somatic CNVs and cancer rearrangements affected known or putative cancer driver genes. First, our gene list with pathogenic cancer mutations was compared to the Cancer Census gene list and the Catalogue of Somatic Mutations in Cancer (COSMIC), cancer-associated gene lists curated as part of the Cancer Genome Project at the Sanger Institute [23]. Second, we researched the function of every cancer mutation-associated gene using NCBI Gene (http://www.ncbi.nlm.nih.gov/gene) and PubMed searches. Each search involved using "(gene name) cancer" as the search term.

Allelic analysis of mutations in the gastric cancer genomes
To estimate the clonal derivation between the two cancer genomes, we used an allelic frequency analysis for mutations found either in the primary versus metastatic tumor. We considered all mutations with potential biological impact for the primary and metastatic tumors Additional file 1, Table S5). Total read depths for the targeted resequencing validation are also listed. Based on the method of Ding et al. [74], this analysis involves determining the proportion of reads containing the cancer mutation-of-interest among the different cancer genomes in comparison to the normal tissue. First, based on pathologic examination, we estimated the tumor sample cellularity given the presence of normal stroma. From the aligned BAM files, we extracted and counted the sequence reads containing either the mutation or wildtype allele for any given position. For each tumor genome, we calculated an allelic ratio for each mutation using the counts of the mutation allele reads versus the wildtype allele reads, adjusting for the tumor composition. From this analysis, we report the minor allelic frequency of each mutation for the primary and metastatic tumor. This analysis was conducted on the whole genome, exome and targeted resequencing data sets for the normal, primary tumor and metastatic tumor. Among all three sequencing approaches and their data sets, our results regarding fractional composition were concordant. We generally used the targeted resequencing data set and we also report the total read depth for each cancer somatic variant position that was evaluated (Additional file 1; Table S5).

SNP array analysis
Standard protocols for DNA preparation, array hybridization and scanning were used to analyze the normal, primary tumor and metastatic tumor samples using SNP 6.0 arrays (Affymetrix, Santa Clara, CA). We used 1 µg of genomic DNA from our samples for array hybridization.
Data analysis was performed using the Genotyping Console software and Birdseed V2 algorithm. We used 30 additional microarray data sets in concert with the studied samples to assess the quality of the SNP calls. SNP 6.0 array data was filtered using P-value threshold of 0.01. We used the Matlab programming language (Mathworks) for quantile probe normalization, probe level summarization and estimation of the copy number between a tumor and normal samples using log2 ratio of the normalized signals (http://www.mathworks.com/products/bioinfo/examples.html?file=/products/demos/shipping/bioi nfo/affysnpcnvdemo.html). Normal DNA from peripheral leukocytes and normal gastric tissue was used for SNP analysis and compared to insure concordance.    Figure S1. Nadauld, et al. Figure S1. Biallelic events in the TP53 loci and gene. For Chromosome 17, copy number variation (CNV) and loss-ofheterozygosity (LOH) is mapped with a cancer driver mutation in TP53. For CNVs, the X axis indicate the CNVs of the cancer to matched normal genome iusing a normalized read depth. The Y axis designates position on the respective chromosome which is shown to the left of the copy number profile. LOH is indicated by a SNP minor allele fraction ratio over the metastasis over the normal.Mutations are shown as boxed arrows with the gene symbol. Both the primary tumor and metastasis had biallelic events involving a common TP53 point mutation and deletions. In the case of the primary tumor a genomic deletion occurred in the p arm as shown by a CNV value of -1 which indicates hemizygous loss. In the metastasis, this deletion was extended to involve most of the q arm as seen by extensive LOH with a value of -1 when comparing the metastasis to the normal genome SNPs. b Figure S2. Deletion of CDH1 and TRP53 expression in gastric organoids.
(a) Cdh1 fl/fl ; Trp53 fl/fl gastric organoid cultures were infected with Fc-expressing adenovirus and imaged by immunofluorescence with antibodies against CDH1 and TRP53.
C /P C /P /T *