The 3D genome organization of Drosophila melanogaster through data integration

Genome structures are dynamic and non-randomly organized in the nucleus of higher eukaryotes. To maximize the accuracy and coverage of 3D genome structural models, it is important to integrate all available sources of experimental information about a genome’s organization. It remains a major challenge to integrate such data from various complementary experimental methods. Here, we present an approach for data integration to determine a population of complete 3D genome structures that are statistically consistent with data from both genome-wide chromosome conformation capture (Hi-C) and lamina-DamID experiments. Our structures resolve the genome at the resolution of topological domains, and reproduce simultaneously both sets of experimental data. Importantly, this framework allows for structural heterogeneity between cells, and hence accounts for the expected plasticity of genome structures. As a case study we choose Drosophila melanogaster embryonic cells, for which both data types are available. Our 3D geome structures have strong predictive power for structural features not directly visible in the initial data sets, and reproduce experimental hallmarks of the D. melanogaster genome organization from independent and our own imaging experiments. Also they reveal a number of new insights about the genome organization and its functional relevance, including the preferred locations of heterochromatic satellites of differnet chromosomes, and observations about homologous pairing that cannot be directly observed in the original Hi-C or lamina-DamID data. To our knowledge our approach is the first that allows systematic integration of Hi-C and lamina DamID data for complete 3D genome structure calculation, while also explicitly considering genome structural variability.


Abstract
Genome structures are dynamic and non-randomly organized in the nucleus of higher eukaryotes. To maximize the accuracy and coverage of 3D genome structural models, it is important to integrate all available sources of experimental information about a genome's organization. It remains a major challenge to integrate such data from various complementary experimental methods. Here, we present an approach for data integration to determine a population of complete 3D genome structures that are statistically consistent with data from both genome-wide chromosome conformation capture (Hi-C) and lamina-DamID experiments. Our structures resolve the genome at the resolution of topological domains, and reproduce simultaneously both sets of experimental data. Importantly, this framework allows for structural heterogeneity between cells, and hence accounts for the expected plasticity of genome structures. As a case study we choose Drosophila melanogaster embryonic cells, for which both data types are available. Our 3D geome structures have strong predictive power for structural features not directly visible in the initial data sets, and reproduce experimental hallmarks of the D. melanogaster genome organization from independent and our own imaging experiments. Also they reveal a number of new insights about the genome organization and its functional relevance, including the preferred locations of heterochromatic satellites of differnet chromosomes, and observations about homologous pairing that cannot be directly observed in the original Hi-C or lamina-DamID data. To our knowledge our approach is the first that allows systematic integration of Hi-C and lamina DamID data for complete 3D genome structure calculation, while also explicitly considering genome structural variability.

Introduction
It has become increasingly clear that a chromosome's three-dimensional organization influences the regulation of gene expression and other genome functions. Early microscopy and biochemical studies showed that chromosomes in higher eukaryotes form distinct territories, which although stochastically organized tend to be located at preferred positions within the nucleus. For example, lamina-DamID experiments have identified specific chromatin domains with a high propensity to be located at the nuclear envelope (NE), confirming the important role of the NE in spatial genome organization and gene regulation in Drosophila, human and mouse [1][2][3]. Chromosome conformation capture experiments (Hi-C and variants) detect chromatin interactions at genome-wide scale [4][5][6][7][8][9][10] and reveal a hierarchical chromosome organization: the chromatin can be segmented into domains, which in turn combine to form subcompartments of functionally related chromatin [10][11][12]. Topological associated domains (TADs) are defined by observing an increased probability of interaction between chromatin regions in a domain relative to interactions between domains. In addition, it has been shown that the border regions between domains are enriched in specific insulator proteins, such as CTCF and ZNF143 in mammalian cells and BEAF, CTCF and CP190 in Drosophila cells. However, the precision of domain border detection depend to some extent on the sequencing depth as well as algorithmic parameter settings. At increased sequencing depth it is possible to detect reliably individual chromatin loops, which often demarcate contact domains (at ~100kb domain length) [5].
Computational approaches can aid in mapping the global 3D structures of genomes at various scales [13][14][15][16][17][18][19]. These are currently divided into data-driven and de novo modeling techniques [20]. Data-driven models use experimental information, often Hi-C data, to generate 3D genome structures that are constrained to be consistent with the data. Data-driven models can be further subdivided into two classes. The first represents the genome as a consensus structure most consistent with the data. This approach often assumes an anti-correlation between the Hi-C contact probability of two chromatin regions and their average spatial distance. In contrast, the second class of methdos explicitly models the large variability of genome structures between isogenic cells (even within a sample of synchronized cells) by creating a population of thousands of model structures, in which the accumulated chromatin contacts in all structures reproduce the observed Hi-C matrix rather than each structure individually. These approaches do not need to assume any functional relationship between contact frequencies and spatial distances.
We introduced a method for population-based modeling to analyze the structure of complete diploid genomes from Hi-C data [6,21,22]. Our method uses an iterative, probabilistic optimization framework to deconvolve the Hi-C data into a population of individual structures by inferring cooperative chromatin interactions that are likely to cooccur in the same cells. Our method generates a large number of genome structures whose chromatin contacts in the models are statistically consistent with those from the Hi-C data. Other ensemble-based methods have been introduced and applied to individual chromosomes or chromatin domains [13,19].
So far, computational models of genome structures have typically relied on just one data type, such as Hi-C, even though a single experimental method cannot capture all aspects of the spatial genome organization. However, data are available from a wide range of technologies with complementary strengths and limitations. Integrating all these different data types would greatly increase the accuracy and coverage of genome structure models. Moreover, such models would offer a way to cross-validate the consistency of data obtained from complementary technologies. For example, lamina-DamID experiments show a chromatin region's probability to be close to the lamina at the nuclear envelope [23,24], while Hi-C experiments reveal the probability that two chromatin regions are in spatial proximity. Large-scale 3D fluorescence in situ hybridization (FISH) experiments show the distance between loci directly, and can be used to measure the distribution of distances across a population of cells.
It remains a major challenge to develop hybrid methods that can systematically integrate data from many different technologies to generate structural maps of the genome. In this paper, we present a method for integrating data from Hi-C and lamina-DamID experiments to maximize the accuracy of population-based 3D genome structural models. We apply this approach to model the diploid genome of Drosophila.
Drosophila melanogaster is a popular model organism to study the organization and functional relevance of 3D genome structure, owing to its relative small genome and the availability of many genetic tools. A variety of microscopy-based experiments have already studied the nuclear organization of D. melanogaster, and elucidated some regulatory mechanisms [25][26][27][28][29]. For example, the pairing of homologous chromosomes has been observed in the somatic cells of D. melanogaster and other dipteran insects [30][31][32][33]. This kind of pairing can influence gene expression by forming interactions between regulatory elements on homologous chromosomes, a process called transvection [26,34]. Although transvection is common in Drosophila, not every gene region with homologue pairing shows transvection. Therefore, questions remain as to whether somatic homolog pairing has other regulatory roles. In Drosophila, the centromeres tend to cluster and relocate close to the periphery of the nucleolus during interphase [35]. Centromere clustering is also observed in many other organisms, including yeast, mouse and human, and this process is thought to play an important role in determining the overall genome architecture [36,37].
Over the past ten years, high-throughput genetic and genomic techniques have generated genome-wide maps of histone modifications, transcription factor binding, and chromatin interactions for D. melanogaster [1,7,8,38,39]. Pickersgill et al. used lamina-DamID experiments combined with a microarray technique to detect the binding signals of genome-wide chromatin to the lamina matrix [1]. Around 500 genes were detected to interact with the lamina. These genes were transcriptionally silenced and late-replicating. Pickersgill et al. then used FISH experiments to confirm that the laminatargeted loci were more frequently located at the nuclear envelope than other loci.
Recently, genome-wide chromatin contacts have been determined for 16-18 hr Drosophila embryos using the Hi-C technique [8]. The euchromatin genome was divided into 1169 physical domains based on Hi-C interaction profiles. These physical domains (which would be referred to as topological associated domain, or TADs, in mammalian cells) were assigned to four functional classes based on their average epigenetic signatures: Null, Active, Polycomb-Group (Pc-G) and HP1/Centromere. Despite all this work, the global 3D nuclear architecture of the D. melanogaster genome is still unknown. Because both Hi-C and lamina-DamID data are available for Drosophila embryonic cells, we use these data to test our integration method. Each diploid genome structure in our population-based model is defined by the 3D positions of all 1169 TADs.
The structures are generated by optimizing a likelihood function, so that the ensemble is statistically consistent with both the experimentally derived contact probabilities between all chromatin domains from Hi-C data and the probability that a given chromatin domain is close to the NE from lamina-DamID data.
We validated our 3D genome models against independent experimental data and known structural features. Our models confirm the formation of distinct chromosome territories, with relatively low rates of intermingling between chromosomes [40,41]. In addition, our models often show a polarized organization of chromosomes in the nucleus [27,42,43]. Analysis of the model population leads to a number of new insights about the nuclear organization of D. melanogaster and its functional relevance. For instance, our models reveal the preferred locations of heterochromatin and the nucleolus, which we were able to confirm by 3D FISH experiments. The nucleolus serves as an anchor for chromosomes, and is surrounded by pericentromeric heterochromatin. The distance of pericentromeric heterochromatin regions from the periphery varies by chromosome, with chromosomes 4 and X heterochomatin more peripheral relative to Pericentromeric regions of other chromosomes. Interestingly, the frequency of homologous pairing varies along the chromosomes with the lowest frequencies observed in our models for domains enriched in protein binding sites for Mrg15. These observations support the model that Mrg15 plays a role in the dissociation of homologous chromosome pairs during interphase, as previously suggested [44]. Finally, the structure population suggests that homologous chromosome pairing plays a functional role in transcriptional activity and DNA replication program.

Population-based genome structure modeling from data integration
Our goal is to determine a population of 3D genome structures for Drosophila melanogaster that is consistent with data from Hi-C and lamina-DamID experiments.
Suppose A is a probability matrix derived from Hi-C data, and E is a probability vector derived from lamina-DamID data. The elements of A describe how frequently a given pair of TADs are in contact with each other in an ensemble of cells, and E describes how frequently a given TAD is in contact with the nuclear envelope (NE). The goal is to generate a population of genome structures X , whose TAD-TAD and TAD-NE contact frequencies are statistically consistent with both A and E . We formulate the genome structure modeling problem as a maximization of the likelihood P A, E X ( ) .
More specifically, the structure population is defined as a set of M diploid genome structures X = {X 1 ,X 2 ,...,X M } , where the m-th structure X m is a set of 3-dimensional vectors representing the center coordinates of 2N domain spheres and defines the probability for each TAD to be localized at the nuclear envelope (NE).
With known A and E , we calculate the structure population X such that the likelihood The Hi-C and lamina-DamID experiments provide data that is averaged over a large population of cells, so they cannot reveal which contacts co-exist in the same 3D structure. Therefore, both A and E are interpreted as ensemble averages. To represent information derived from individual cells, we introduce two latent variables W and V . The "contact indicator tensor" W = (w ijm ) 2 N ×2 N ×M is a binary, 3rd-order tensor. It contains the information missing from the Hi-C data A , namely which domain contacts belong to each of the M structures in the model population and also which homologous chromosome copies are involved ( w ijm = 1 indicates a contact between domain spheres i and j in structure m; w ijm = 0 otherwise). W is a detailed expansion of A into a diploid, single-structure representation of the data. The structure population X is consistent with W . Therefore, the dependence relationship between these three variables is given , specifies which domain is located near the NE in each structure of the population and also distinguishes between the two homologous TAD copies ( v im = 1 indicates that TAD i is located near the NE in structure m; v im = 0 otherwise). The dependence relationship between X , V and E is given as X → V → E , because X is the structure population consistent with V and V is a detailed expansion of E at a diploid and single-structure representation of the data.
In addition to the Hi-C and lamina-DamID data, we also consider additional information specific for Drosophila genome organization, e.g. the nuclear volume, an upper bound for homolog chromosome pairing, constraints connecting consecutive domains (including heterochromatin domains) as well as costraints for anchoring centromeres to the nucleolus (see the detailed description in the Materials and Methods section).
Thus, the optimization problem is expressed as: subject to spatial constraint I: nuclear volume constraints spatial constraint II: excluded volume constraints spatial constraint III: chromosome pairing upper bound spatial contraint IV: consecutive domain constraint The log likelihood can be expanded as We have developed a variant of the EM method to iteratively optimize the log likelihood [22]. Each iteration consists of two steps (Fig. 1A): • Assignment step (A-step): Given the current model X ( i) , estimate the latent variables W (i+1) and V (i+1) by maximizing the log-likelihood over all possible values of W and V .
The detailed implementation of the A-step and M-step are described in Materials and Methods. We follow the step-wise optimization strategy described previously [22], and gradually increase the optimization hardness by adding contact constraints at a decreasing contact probability threshold.

A population of Drosophila genome structures at the TAD level
The euchromatin regions of Drosophila melanogaster chromosomes 2, 3, 4, and X are partitioned into 1169 TADs, as previously described [8]. and S10).

Validation of the structure population
Reproducing the Hi-C contact probabilities. We first assessed the consistency between  Table 1). This relatively weak agreement between the model and the experimental data for inter-arm and inter-chromosome interactions can be explained by the following argument. In the Hi-C data, inter-arm and inter-chromosome interactions are relatively infrequent and unstructured, indicating that contacts between chromosomes are predominantly random. Due to their low occurrence, these interactions are also less reproducible than intra-arm interactions, especially at low sequencing depth. This reasoning is confirmed by comparing two Hi-C experiments performed with two different restriction enzymes [6,48]. The differences in contact frequencies between the two experiments are generally much larger for interchromosome arm interactions than for intra-chromosome arm interactions.
Another quality measure for our models is how well we can predict the frequencies of chromatin interactions that were not included as constraints in the optimization. In our models, we did not impose constraints for any pair of TADs whose contact probability was lower than a ij =0.06. Very low contact probabilities are expected to contain a higher fraction of experimental noise. Such pairs include ~99.99% of all inter-chromosome and inter-chromosome arm interactions. However, our structure population is capable of predicting the missing data ( Fig. 2B right panel). Many of the low-frequency contacts are formed as a consequence of imposing more significant interactions (with contact probabilities a ij >0.06), and their correct prediction is a good indicator of the model quality.
Reproducing the lamina-DamID binding frequency. Lamina-DamID experiments identify the probability that a locus is associated with the NE (more precisely, with the lamina protein located at the NE). We first assess the consistency between our structure population and the lamina-DamID experiment (a TAD domain-NE contact is defined when the domain surface is less than 50 nm from the NE). The association probabilities are in excellent agreement, with a Pearson's correlation of 0.95 (Figs. 2C, 2D and S1A).
Recalling that the TADs of Drosophila are divided into four functional classes, we find that TADs in the "Active" class are less frequently in contact with the NE than those from the other three classes (HP1, PcG, and Null) (Fig. S1A). This result agrees with prior observations in the literature that the genes interacting with lamina are usually transcriptionally silent and lack active histone marks [1]. The control population generated using only Hi-C data also shows good (albeit substantially lower) correlations between its NE association probabilities and the lamina-DamID experiments (Pearson's correlation is 0.64, with p-value < 2.2e−16) (Figs. 2D and S1B). This relatively high correlation value in the control population shows a strong consistency between the Hi-C based models with the independent lamina-DamID data and confirms the generally good quality of our Hi-C based structure modeling.
Agreement with FISH experiments. Our genome structures also predict well the NE association frequencies observed by independent FISH mapping of 11 different genomic loci [1]. The Spearman's rank correlation coefficient between experiment and model is 0.642 for these loci, with a significant p-value = 0.03312 (Fig. S2A). The corresponding correlation with the control structure population is substantially lower (Spearman's rank correlation coefficient = 0.376 with p-value = 0.2542) (Fig. S2B), demonstrating the benefit of data integration to generate more accurate genome structures.
Presence of chromosome arm territories. Chromosome territories have been observed directly in higher eukaryotes, including mammalian cells [49,50]. In Drosophila, chromosome territories can be inferred from the fact that Hi-C contact frequencies between chromatin regions in the same chromosome arms are substantially higher than those between chromosome arms [7,8]. Previous 4C experiments on larval brain tissue confirm the limited nature of interactions between genes on different chromosome arms [41]. FISH experiments have also suggested chromosome territories in Drosophila [40].
In our models, we analyze the formation of chromosome territories by calculating a territory index (TI), which measures the extent of chromosome mixing [24]. To calculate TI in each structure, first we define the spanning volume of each chromosome, which is the surface convex hull of all its domain positions [24]. TI is then defined as the percentage of all domains occupying the chromosome spanning volume of the target chromosome (Suppl. Methods C.2). By definition, the maximum TI value of 1 indicates that the chromosome's spanning volume is exclusively occupied by its own domains, and therefore experiences limited chromosome mixing. When considering domains from homologue chromosome copies, the territorial index ranges between 0.96 and 1.0 for all the chromosome arms ( Fig. S3A and Suppl. Table 2). When separating the homologue chromosomes, however, the TI values range between 0.62 and 1.0 for the larger chromosome arms (Fig. S3B), suggesting that homologue chromosome pairs share almost the same territory due to strong homologue pairing.
Residual polarized organization. In a polarized genome organization, each chromosome occupies an elongated territory with the centromere at one nuclear pole and telomeres on the opposite side of the nucleus. Such an organization, called Rabl, typically occurs after mitosis and has been observed in a variety of plants [23], yeast, and both polytene and non-polytene Drosophila nuclei; it is also common in Drosophila embryos [27,42,43]. In the majority of our genome structures (67.4%, Suppl. Methods C.3), more than half of the chromosomes arms (chr2L, chr2R,chr3L, chr3R and chrX) are organized with their centromeres and telomeres located in opposite nuclear hemispheres (Figs. S4B, C, and D). This organization is also apparent when calculating the localization probabilities of chromosomes, which are highest for the telomeres in a region near the NE opposite to their respective centromeres ( Figs. 3A and B). Taken together, these results suggest that interphase chromosomes retain some features of Rabl organization.
Nuclear colocalization of Hox gene clusters. In Drosophila, the two PcG-regulated Hox gene clusters (Antennapedia complex and Bithorax complex) tend to co-localize in the head of 10-11 stage embryos [51], despite being separated by 10Mb in sequence on chromosome 3 (Fig. 1B). To test their spatial colocalization in our models, we calculate the pairwise spatial distances between the two gene clusters in every structure of the population (Suppl. Methods C.4). As a random control, we also calculate the pairwise probability was below 0.06 ( Fig. S5 top panel). Therefore, these results support the predictive power of our model.

White gene localizing near pericentromatic heterochromatin. Position-effect variegation
(PEV) is a process whereby a euchromatic gene is deactivated through an abnormal juxtaposition with heterochromatin, due to chromosome rearrangements or transpositions. PEV has been intensively studied for the Drosophila white gene [52,53], which is on the distal end of chromosome X and separated by more than 19 Mb from the pericentromeric heterochromatin region (Fig. 1B). A chromosome inversion can insert the white gene in sequence next to the pericentromeric heterochromatin, which leads to its repression. Hence, such chromosomal rearrangement may be favored if the white gene has an increased chance of being in spatial proximity to the heterochromatin.
However, technical limitations prevent us from directly measuring contacts between the white gene and heterochromatin with Hi-C experiments. Using our structure population, we can measure how often the white gene is located close to the pericentromeric heterochromatin of chromosome X. As a control set, we took the four domains that are located at equivalent sequence distances to the heterochromatin regions on chromosomes 2 and 3.
Interestingly, the spatial distance between the white gene and the X chromosome heterochromatin is significantly smaller than the corresponding distances of the control groups (one-tailed Welch's two sample t-test, p-value < 2.2e−16) (Fig. S6A). Although it is unlikely for distal loci to come together in 3D, we found that in ~1.3% of structures the white gene and the heterochromatin were positioned (within a distance of 200 nm) (Fig.   S6B). This frequency is nine times larger than the colocalization frequency in the control sets (0.14% of structures). Therefore, our models reveal an increased propensity for the white gene to be located near the pericentromeric heterochromatin, compared to equivalent sites on other chromosomes. This result suggests that spatial proximity facilitates the occurrence of this translocation in living cells.

Different chromosome domains have different preferred locations in the nucleus
The evidence listed above demonstrates the consistency of our models with experimental data and known properties of the Drosophila genome organization. Next, we describe new findings on the nuclear architecture and its functional significance based on our analysis of the model structure population. Importantly, we validated this model prediction in vivo, using Drosophila Kc cells (Fig.   S7). Immunofluorescence analysis of nucleoli and pericentromeric heterochromatin confirms that the average distance between the center of the nucleolus and the nuclear periphery is less than half of the nuclear radius (Fig. S7B). Interestingly, the nucleolus is positioned close to the nuclear periphery in 68% of cells, and close to the center of the nucleus in the remaining cells, revealing a bimodal distribution (Figs. S7A, C). In most cells, pericentromeric heterochromatin partially encloses the nucleolus (Fig. S7A). Euchromatic regions are either active or repressed, and can be divided into 4 classes based on their epigenetic profiles: Null, Active, Polycomb-Group (PcG), and HP1 [8] (Suppl. Table 3). The TADs of the Null, Active, and PcG classes have similar average radial positions (Fig. 5B).
Moreover, the paired chromosomes touch only at a few specific interstitial sites [31]. In our structures, we define a domain as being paired if the surface-to-surface distance between the two homologs is less than 200nm (Fig. 6A). Interestingly, the pairing frequencies of homologous domains show distinct and reproducible variation along the chromosomes (Fig. 6B left panel). The active class shows smallest homologous pairing frequency for each chromosome (Fig. 6B right panel) is TAD-specific and highly reproducible in independently calculated structure populations (Fig. 6C). This effect is an indirect consequence of the genome-wide Hi-C and lamina-DamID constraints imposed on the structures.
The consistency of this pairing behavior raises the question of why certain regions attain higher levels of pairing. One clue is that we find a small but significant correlation between pairing frequency and the location of the TAD in the nucleus. Pearson's correlation between the frequency of pairing and the frequency of being in proximity to the NE is 0.34 (p-value < 2.2e-16) (a TAD-NE contact is defined when its domain surface is is less than 50nm from the NE). We hypothesize that genomic regions that are often positioned near the NE may be more restricted in their movements, which may facilitate homolog pairing. We also investigated whether the local crowdedness around the domains could influence the spatial distances between homologues, and found that in the majority of structures the local crowdedness is not different between paired domains and unpaired domains (Supp. Methods C.6).
Mrg15 is enriched in active domains and depleted in repressive domains. Several proteins have been reported to affect somatic homolog pairing in Drosophila [32,33,44].
Among them is Mrg15, which binds to chromatin and recruits CAP-H2 protein to mediate homolog unpairing [44]. Interestingly, we find an anticorrelation between Mrg15 binding enrichment in a domain and a domain's homologous pairing frequency, even though this information is not imposed as an input constraint in our models (Fig. 6D).
The higher the Mrg15 enrichment signal in a domain, the lower the fraction of paired homologues in the structure population (Fig. 6D). Pearson's correlation coefficient between the binned Mrg15 binding signal and the averaged frequency of homologous pairing for each bin is −0.81, with p-value = 7.59e−06 (Fig. 6D). In the control model (using only Hi-C data), the Pearson's correlation coefficient between them is -0.697 with p-value = 0.000446. We also divided the domains into three subsets based on their Mrg15 scores. The average pairing frequency for domains enriched with Mrg15 is significantly less than that for domains with lower Mrg15 scores (one-tailed Mann-Whitney U test, p-value < 2.2e-16) (Fig. S8A).
Among the four TAD classes, "Active" domains are generally more enriched with Mrg15binding sites (Fig. 6E right panel). Appropriately, we observe that transcriptionally active domains have a lower pairing frequency than the three repressive classes (Fig. 6E left panel). The most intuitive explanation is that a loose pairing makes an active domain more accessible to regulatory factors. PcG domains, which are enriched with polycomb group proteins, show higher levels of homologous pairing in our models than the active domains (one-tailed Welch's two sample t-test, p-value = 2.09e−9). Therefore, our structure population supports the notion that PcG domains form tight pairs to enhance gene silencing (reviewed in ref. [26]).
While active domains generally have low frequencies of homologous pairing, our models also have some clear and reproducible counterexamples of active domains with extremely high frequencies of homologous pairing (the specific TADs with this behavior are reproducible in independently generated structure populations) (Fig. 6C). Therefore, we divided the active domains into two subclasses, labeled "active-tight" and "activeloose". Interestingly, domains in the active-loose subclass have significantly higher Mrg15 enrichment than domains in the active-tight subclass (one-tailed Mann-Whitney U test, p-value = 0.03436) (Fig. S8B). It is interesting that our model further supports a role for Mrg15 in disrupting homologue pairing, even though the structures were generated without any locus-specific constraints on the separation of homologous domains. Importantly, the anticorrelation between homologue pairing frequency and Mrg15 binding signal further increases when lamina-DamID data is integrated in the model, which indicates that data integration helps generate more accurate genome structures.
Active-tight domains show higher transcriptional efficiency. Interestingly, we found significant functional differences between active-loose and active-tight domain subclasses. Active-tight domains contain more genes (Fig. 7A). Surprisingly, the activetight subclass shows significantly lower binding levels of the TATA-binding protein (TBP) and RNA polymerase II, as well as lower H3K4me2 signals (one-tailed Mann-Whitney U test, p-values are 1.17e−04, 2.95e−03 and 5.19e−04 respectively) (Fig. 7B) (Fig. 7C). This indicates that chromatin in the active-loose domains is more depleted of nucleosomes, and hence these domains contain a higher density of regulatory chromatin complexes. In Drosophila, the organization of nucleosomes plays an important role in determining origin recognition complex (ORC) binding sites [55]. and PCG respectively), indicating that DNA replication is often initiated in the chromatin of the active class (Fig. S9B). Strikingly, we discovered that ORC-binding regions are much more frequent in active-loose domains than in active-tight domains (one-tailed Mann-Whitney U test, p-value=1.54e−4) (Fig. 7D), supporting the model that chromatin in the active-loose subclass replicates significantly earlier (i.e., in early S-phase as opposed to late S-phase). Homologous pairing has been studied for years, and it has been found to play a large role in gene regulation. Transvection is a phenomenon whereby gene expression is modulated by the physical pairing of homologous loci. A case study showed that more transcripts are produced when both alleles of the gene Ubx are paired than when they are spatially separated [28]. A possible explanation is that each gene copy can be activated by both its own and the other copy's enhancer [26]. Interestingly, when we compare actively transcribed genes in chromatin regions with very high or very low levels of homologous pairing, the former show significantly lower signals in RNAPII and TATA protein binding, but at the same time similar levels of transcripts. This observation indicates that a more efficient transcription of genes occurs when pairing is frequent.

Discussion
Our model also shows that regions with looser homologue pairing initiate replication earlier than regions with tighter homologue pairing.

Conclusions
In this study, we address one of the principal challenges of genome structure analysis:

General description
The population-based approach is a probabilistic framework to generate a large number of 3D genome structures (i.e., the structure population) whose chromatin domain contacts are statistically consistent with experimental Hi-C data and other spatial constraints derived from a priori knowledge and/or independent data types. Our model is a deconvolution of the ensemble-averaged Hi-C data, and the resulting structures can be considered the most likely representation of the true structure population over a population of cells, given all the available data. Our method distinguishes between interactions involving chromosome homologues, so it can generate structure populations representing entire diploid genomes. Further, because the generated population contains many different structural states, this approach can accommodate all experimentally observed chromatin interactions, including those that would be mutually exclusive for a single structure. Compared to our previous research, which introduce the population-based approach using Hi-C data alone, in this study we also integrate lamina-DamID data to generate an improved structure population.

Chromosome representation
The nuclear architecture of Drosophila cells consists of the nuclear envelope (NE), the nucleolus, and eight individual chromosomes (the diploid pairs chr2, chr3, chr4 and chrX). Chr2 and chr3 each have two arms, labeled 2L-2R and 3L-3R, connected by centromeres (Fig. 1B).
Each chromosome contains three main regions: euchromatin, pericentromeric heterochromatin, and a centromere (Fig. 1B). Euchromatin regions in chromosome arms 2L, 2R, 3L, 3R, 4 and X are linearly partitioned into a total of 1169 well demarcated physical domains [8], which are represented as spheres in the model [22].
A domain sphere is characterized by two radii: (1) its hard (excluded volume) radius, which is estimated from the DNA sequence length and the nuclear occupancy of the genome; and (2) its soft (contact) radius which is twice the hard radius. A contact between two spheres is defined as an overlap between the spheres' soft radii. This two- For every chromosome, the centromere is modeled as a sphere with 5% the volume of its corresponding heterochromatin domain (or sum of two heterochromatin domains for chr2 and chr3).
The nuclear radius is set to 2 microns (µm) as suggested by fluorescence imaging experiments ( [35,46] and Figs. 4C, S7A). The nucleolus radius is set to 1/6 of the nuclear radius (Fig. S7A). Centromeres are clustered together and attached to the nucleolus [35]. Pericentromeric heterochromatin of chrX surrounds the rDNA cluster regions, so it lies in close proximity to the nucleolus. (Suppl . Table 3 lists all domain radii in the model.) All these units are represented by a total of 2359 spheres (see Table 1). The outlines of the chromosomes are depicted in Fig. 1B. In the next section, we briefly describe the chromosome model and list all of the structural constraints that we imposed while optimizing the population.

Probabilistic platform for data integration
Our method closely follows our recent population-based modeling framework [22].
However, we now generalize this framework to support the integration of lamina-DamID data with Hi-C data. The Hi-C data is contained in the ensemble contact probability matrix A , and the lamina-DamID data is contained in the ensemble chromatin-NE contact probability vector E .
We aim to generate a structure population X that maximizes the likelihood P A, E X ( ) .
We introduce two latent variables W and V , which represent features of individual cells that aggregate into the ensemble information A and E , respectively. W = (w ijm ) 2 N ×2 N ×M is the contact indicator tensor, which contains the missing information in the Hi-C data  With these probabilistic models, we can maximize the log-likelihood log P(A, E, W, V | X) , expressed as follows: We assume that a pair of spheres (i, j) are in contact in structure m if and only if their center distance is between certain lower and upper bounds, .
The lower bound is the sum of their hard radii, , and the upper bound is the sum of their soft radii, . We modeled the probability of a contact between two domain spheres i and j as a variant of the rectified or truncated normal distribution, expressed as follows.
with very small variance, e.g. .
The probability for a domain to reside near the NE is described as (7) where to ensure that the enforced TAD is at the inside surface of the NE, and likewise .

Additional spatial constraints for the Drosophila genome
In addition to the data from Hi-C and lamina-DamID experiments, we include the following additional information as spatial constraints: 1. Nuclear volume constraint: All 2359 spheres are constrained to lie completely inside a sphere with radius R nuc , i.e. . Without loss of generality, we use the origin (0,0,0) as the nuclear center, so is the distance from the nuclear center. are usually close to each other [30][31][32][33]. Therefore, we constraint the distance between 2 homologous domains to be less than an upper bound, which is four times the sum of their radii i.e. .

Consecutive TAD constraint:
To ensure chromosomal integrity, we apply an upper bound to the distance between two consecutive TAD domains, which is derived from the experimentally determined contact probability "# . The upper Therefore, we constrain the centromere spheres to be close to the spherical volume representing the nucleolus, defined as where is the radius of the nucleolus volume.

Distance threshold method for estimating W and V
We adopt the distance threshold method introduced elsewhere [22

Optimization
As described elsewhere [22], we used step-wise optimization and the A/M iteration algorithm to generate the structure population. We first generated a Modeling Platform (IMP) [57].

Data collection and processing
Our processing methods for Hi-C, lamina-DamID and other epigenetics data are described in the Supplemental Material.

Analysis of the structure population
Our statistical analysis of the structure population and details on all statistical tests are described in the Supplemental Material.

Cell culture and immunofluorescence
Kc cells were maintained at 25°C as logarithmically growing cultures in Schneider's medium (Sigma) + FBS (Gemini), and fixed and stained as previously described [46].
FISH probes were purchased from Integrated DNA Technologies, and designed as described in [58].

Imaging and image analysis
All images were captured using a Deltavision fluorescence microscopy system equipped with a CoolsnapHQ2 camera, using 60x and 100x objectives and 10-12 Z stacks with Z-intervals of 0.2-0.4. Images were deconvolved with softWorx software (Applied Precision/GE Healthcare) using the conservative algorithm with five iterations.
The distances between signals in 3D volume reconstructions of Kc cells or in individual Z stacks of larval tissues were calculated with softWorx. All distances were normalized to the nuclear diameter of their respective cells. Quantification of FISH signals in larval brains was limited to cells that displayed clear homologous pairing, defined as proximal or overlapping FISH signals for each probe.

Additional files
Additional file 1: This .docx file contains the supplementary methods.
Legends for these figures are presented under each figure.
Additional file 3: This .xls file constains three supplementary spreedsheets, each included as a separate tab: Suppl. Tables 1 (Summary of the Pearson's correlation between contact probability from structure models and Hi-C experiment), Suppl. Table 2 (Summary of chromosomal territory index (TI) for individual arms and pairs of homologous arms.) and Suppl.

A.1 Bin-level contact frequency
The sequencing data were downloaded from Gene Expression Omnibus under accession number GSE34453 [1]. We adopted the pipeline developed by Leonid Mirny lab [2] to process the Hi-C data. First, the two sides of each read were mapped to the Drosophila melanogaster genome (assembly dm3) independently using bowtie2 [3] with "very-sensitive" option. We truncated the reads to 20bp, and then remapped the nonmapped and multiple mapped reads by increasing truncation length with 5bp gradually.
The truncating step significantly yields more double-sided mapped reads. 216,199,696 uniquely double-sided mapped reads are retained from the original 362,669,793 paired reads (with the mapping ratio at ~60%).
Then, the reads alignments for the artificial or non-informative contacts were filtered out, including self-ligation products, the products without ligation junction and PCR duplicates. After filtering process, 14,481,367 interactions are left for downstream analysis. Third, the valid double-sided alignments were used to construct the genomewide contact matrix at 40k bin size, resulting in a 3012*3012 matrix. Before correcting the matrix for biases, we performed four types of bin-level filtering which will affect the normalization procedure. We removed the contacts between loci located within the same bin; removed bins with more than 50% are N's in the reference genome; removed 1% of bins with low coverage; and truncated top 0.05% of inter-chromosomal counts which truncated values of the top 0.05% to be the same as highest value among the rest 99.95%. Finally, the iterative correction was performed on the filtered genome-wide contact map to get a normalized map, denoted as " = ( &' ) " * " .

A.2 Bin-level contact probability
The contact probability is defined as the probability for observing a given contact in the structure population. We define a threshold value *+, , which defines the frequency at which a contact is formed in 100% of the structure population. We assume the contact frequencies that constitute any stable TAD can serve as reference where the interaction could exist in 100% of the cells. First, we register all contact frequencies inside TADs.
Then we apply an R (CRAN statistical software) function boxplot.stats on this contact frequency set to get the extreme lower whisker of the boxplot and set it as the *+, .
Namely, in R The contact probability between bin i and bin j is derived as *+, , 1} We applied the *+, calculation method to the normalized frequency matrix to obtain the contact probability at 40 kb-binned matrix.

A.3 Domain-level contact probability
Based on the Hi-C contact frequency map, 1169 physical domains or topological associated domains (TADs) are detected by a quantitative probabilistic approach [1].

Reproducing lamina-DamID binding frequency
We define a chromatin domain-lamina contact, if the distance between the domain's surface and the NE is less than 50nm. The NE-domain association probability is the fraction of structures in the population, in which the domain is in contact to the NE.

C.2 Chromosome territory index
To quantify how effectively one chromosome excludes other chromosomes from the volume it occupies in the 3D space, we adopted the quantity called chromosome territory. There is no universal definition of chromosome territory, but we follow the definition in a recent publication about the structure modeling of Drosophila polytene chromosomes [6]. Chromosome territory index (TI) is defined as the fraction of domains inside a convex hull that belongs to the chromosome used for its construction.

C.3 Residual polarized organization
The polarized (Rabl-like) organization shows that each chromosome occupies an elongated territory, with the centromere in one nuclear hemisphere and telomere in the opposite hemisphere. We investigated the position of each centromere and its corresponding telomere and obtain the number of polarized configuration chromosomes or arms (chr4 are excluded, therefore the number ranges from 0 to 10). To identify the presence of this organization, we measure the angle between each centromere, the nuclear center, and its corresponding telomere. If the cosine of the angle is positive, then centromere and telomere are in the same hemisphere. Otherwise, they occupy opposite hemispheres, forming polarized organization. If more than half of the chromosome arms (>=6) in one nucleus are in polarized organization, we consider this as a polarized nuclear structure.

C.4 Nuclear colocalizations of Hox gene clusters
The All the contact probability of clusters are lower than 6%, which means the constraints are not imposed for those clusters in our models. If the closest surface-tosurface distance between two clusters in one group is less than 200 nm, we consider these clusters colocalized.

C.5 Pericentromeric heterochromatin cluster detection
We calculated all pairwise surface-to-surface distances (normalized by the sum of the radii of the domain spheres) among the 12 heterochromatin spheres, for each structure in the population, then obtained the average pairwise distances in the matrix. Hierarchical cluster analysis is performed on the average distance matrix by using hcluster function in R.

C.6 Homologous pairing
A domain is defined as paired if the surface-to-surface distance between two homologs is less than 200 nm. Then the pairing frequency is defined as how often a domain is paired among the structure population. The domains with frequency higher than Top 3 rd quantile are determined to be "tight", otherwise, to be "loose". We have 71 "active-tight" domains and 423 "active-loose" domains.

Pairing is not determined by the positon and the neighborhood crowdedness in the nucleus
The variation of the distance along homologous chromosomes and among nuclei raises the question of why certain regions attain higher level of homologous pairing than others in certain nuclei. First, we notice that the linear distance to centromere or to heterochromatin does not influence the extent of pairing of a domain (Fig. 6B We followed this 4-class annotation in our structure analysis. We also collected the data of histone modifications and binding of chromatin proteins in the study [4] from http://research.nki.nl/vansteensellab/Drosophila_53_chromatin_proteins.htm. Wig files were downloaded, and then transferred into bigwig format. BigwigSummary program (from USC Genome Browser) was used to extract the individual signals for requested regions (the defined 1169 TADs in this study). The signal was calculated as an average per domain, avoiding the bias of the genomic length.

C.8 Transcription analyses
Gene expression data (embryonic samples collected at 16-18h) were obtained from the modENCODE project [7]. 1169 physical domains covered 12947 genes with available expression data.  [4]. BigwigSummary program was used to calculate the average signal for each defined domain.

C.9 DNA replication analyses
Data for ORC-binding regions and early activating replication origins for Kc167 cell line were downloaded from modENCODE (accession no. GSE20889 and GSE17285, respectively). BigwigSummary program was used to calculate the average signal for each defined domain.

C.10 Statistical test
The association test between two paired signals is done by cor.test function in R using Pearson's product moment correlation coefficient. For example, the positive correlation between the frequency near NE derived from our population of structures and the lamina binding signal from lamina-DamID experiment provides a validation for our models; the negative correlation between the frequencies of homologous pairing derived from our population of structures and the Mrg15 protein binding signal from lamina-DamID experiment matches the unpairing function of Mrg15 protein and also validate our models.
The difference test between two sets is done by wilcox.test function in R, which performs a Wilcoxon rank sum test (equivalent to the Mann-Whitney test) when the population cannot be assumed to be normally distributed. For example, we found the ORC binding signals are much stronger in the active-loose than in the active-tight domains. Figure S1.