MoDLE implementation and design overview
MoDLE is implemented in C++17 and is compiled with CMake. MoDLE uses a producer-consumer architecture where a single producer (a thread) communicates with multiple consumers through asynchronous message passing. The producer thread is responsible for reading input files and generating a set of simulation tasks to be consumed by a pool of worker threads. Tasks are implemented as light-weight C++ structs that are computationally cheap to generate and consume. A single task contains all the information needed for simulating DNA loop extrusion on a single chromosome in a specific simulation instance. Simulation instances are for the most part independent from each other and can thus run in parallel. We designed MoDLE such that each simulation instance requires less than 10 MB of memory to simulate loop extrusion on large mammalian chromosomes, such as chromosome 1 from the human genome. The space complexity of the thread–local state is linear with respect to the number of LEFs or extrusion barriers, whichever is largest. For a more detailed overview of MoDLE’s implementation, see Additional file 1: Section 1.
Most of MoDLE’s memory budget is used to store molecular contacts generated by loop extrusion. MoDLE stores one instance of its custom contact matrix data structure for each chromosome that is being actively simulated. The space complexity of a contact matrix instance depends on the chromosome length, diagonal width and bin size. With default settings, representing contacts for chromosome 1 of the human genome requires approximately 120 MB of memory. Common operations on the contact matrix class are made thread-safe using lock striping implemented through hashing. For more details regarding the specialized contact matrix data structure, refer to Additional file 1: Section 2.
To achieve high-performance, MoDLE stores most of its data in contiguous memory using simple data structures such as vectors and bitsets (see Additional file 1: Section 3). Data is indexed such that extrusion barriers and extrusion units in a simulation instance can be efficiently traversed in 5′-3′ and 3′-5′ directions (see Additional file 1: Section 8). This allows MoDLE to bind/release LEFs, process collisions, register contacts, and extrude DNA in linear time-complexity and with good locality of reference. The only step relying on an algorithm with super-linear time complexity is indexing, which has a worst-case time complexity of O(n log n) while approaching O(n) for the typical case.
More design and implementation details are available in Additional file 1. The latest MoDLE source code can be obtained in MoDLE’s GitHub repository: github.com/paulsengroup/modle
Running a simulation instance
The entire simulation instance is executed by a single worker thread and consists of the following phases:
-
Wait until one or more tasks are available on the task queue.
-
Setup the simulation internal state based on the task specification, this includes seeding the PRNG and setting the initial state for the extrusion barriers based on the occupancy (see Additional file 1: Sections 1, 3, and 4).
-
Run the simulation loop until a stopping criterion is met.
A single simulation epoch is articulated in the following steps:
-
Select (inactive) LEFs that are currently not associated with DNA, and activate them. This is done by loading LEFs to a random position on the chromosome that is being simulated. The position is sampled from a uniform distribution (see Additional file 1: Section 5).
-
Index extrusion units moving in the same direction so that they can be visited in 5′-3′ and 3′-5′ order (see Additional file 1: Section 8).
-
Randomly select a subset of the active LEFs and use their position along the chromosome to generate molecular contacts in the chromosome contact matrix (see Additional file 1: Section 9).
-
Generate candidate moves for each extrusion unit (see Additional file 1: Section 10).
-
Update the extrusion barrier states by computing the next state in the Markov chain used to model extrusion barriers (see Additional file 1: Section 6).
-
Detect collision events between LEFs and extrusion barriers as well as between LEFs (see Additional file 1: Sections 12b-d and g).
-
Update the candidate moves for extrusion units involved in collision events to satisfy the constraints imposed by the collision events (see Additional file 1: Sections 12e-g).
-
Advance LEFs’ extrusion units by their respective moves (see Additional file 1: Section 5). Because of the preceding steps, this will yield a new valid simulation state, as moves have been updated to satisfy all the constraints imposed by collision events.
-
Iterate over active LEFs and release them based on the outcome of a Bernoulli trial whose probability of success is computed based on the average LEF processivity and LEF state (e.g., LEFs whose extrusion units are involved in collision events with a pair of extrusion barriers in convergent orientation have a lower probability of being released). LEFs that are being released go back in the pool of available LEFs and will be loaded on a new genomic region during the next epoch (see Additional file 1: Section 5).
MoDLE will continue iterating through the above steps until one of the simulation stopping criteria is met:
Both stopping criteria can be modified by users. By default, MoDLE will simulate loop extrusion until reaching an average contact density of 1 contact per pixel.
Hardware specifications
Analysis and benchmark code used to generate the data accompanying was run using the hardware specifications listed in Table 1.
MoDLE simulations
MoDLE’s data used for the heat map comparison shown in Fig. 2 were generated using the heatmap_comparison_pt1 Nextflow [47] workflow available on GitHub [48] and archived on Zenodo [49].
The list of candidate extrusion barrier positions and directions were generated by running MAST from the MEME suite [50] on GRCh38.p13 (GCF_000001405.39 [51] using the CTCF frequency matrix MA0139.1 from JASPAR 2022 [52].
The list of candidate barriers was then filtered using CTCF and RAD21 ChIP-seq data (fold-change over control and optimal IDR thresholded peaks). In brief, candidate barriers were intersected with the narrow-peak BED files for CTCF and RAD21. Then, each filtered barrier region was assigned with an occupancy computed by passing the RAD21 fold-change over control signal through a logistic function. Finally, the output of the logistic function was binned at 1 kbp to yield a barrier occupancy that is proportional to the number of CTCF motif instances as well as RAD21 fold-change over control signal. This procedure is largely based on [Fudenberg 2016]. The result of the procedure outlined above is a list of extrusion barrier occupancies binned at 1 kbp resolution. CTCF and RAD21 ChIP-seq for H1-hESC data was downloaded from ENCODE [53, 54] (ENCFF255FRL [55], ENCFF473IZV [56], ENCFF821AQO [57], and ENCFF913JGA [58].
Contact matrices were generated using MoDLE v1.0.0-rc.7 with the parameters from Additional file 3: Table S1. Parameters not listed in the table were left at default.
Contact matrices produced by MoDLE were then subsampled to an average contact density of 3 using cooltools random-sample v0.5.1 [59]. The resulting cooler files were then converted to multi-resolution cooler files using cooler zoomify [40]. Finally, multi–resolution contact matrices were visualized in HiGlass (v1.11.7) [60].
Molecular dynamics (OpenMM) simulations
Molecular dynamics data used for the heat map comparison in Fig. 2 were generated using the heatmap_comparison_pt1 Nextflow workflow available on GitHub [48] and archived on Zenodo [49]. This workflow uses OpenMM [37] to run MD simulations.
Simulation code is largely based on [61]. Simulations were carried out on 10 Mbp regions from chromosomes 2, 3, 5, 7, and 17 using a monomer size of 1 kbp and 200 kbp for LEF processivity and separation. Extrusion barrier positions, directions, and occupancy were generated following the procedure outlined in the “Methods” section (part 1).
Contact matrices were generated with Polychrom [62] using a bin size of 5 kbp. The resulting cooler files were then converted to multi-resolution cooler files using cooler zoomify v0.8.11 [40].
Assessing loop extrusion feature similarities from contact frequencies
To objectively compare the contact matrices produced by MoDLE with contact matrices generated from Micro-C experiments and MD simulations, we developed a specialized scoring algorithm. The algorithm was inspired by Stripenn [63].
The score is computed on rows and columns of a pair of contact matrices of identical resolutions transformed as follows.
First, both matrices are convolved using the difference of Gaussian (DoG). This highlights stripe and dot patterns found in contact matrices. Next, the transformed contact matrices are discretized using a step function mapping values below a certain threshold to 0 and all the others to 1. This results in two binary matrices, where non-zero pixels can be interpreted as part of a stripe or dot. Finally, we take advantage of the fact that stripes produced by loop extrusion always should start from the matrix diagonal. Thus, given a row or column of pixels starting on the matrix diagonal, and extending away from it, we stipulate that the last non-zero pixel in the vector of values represents the end of a stripe produced by DNA loop extrusion.
Given the above, we can measure the similarity of stripes between two contact matrices by considering the same row of pixels in a pair of contact matrices, computing the last non-zero pixels in both rows, and counting the number of matches. The same approach can be applied to columns of pixels. Finally, counting mismatches instead of matches can be used as a measure of dissimilarity. Contact matrix convolution and discretization, as well as computing this special score, can be done using MoDLE’s helper tools (modle_tools transform and modle_tools evaluate respectively).
Contact matrix comparison
For comparison with MoDLE and OpenMM output, we used available Hi-C and Micro-C data from H1-hESC because these were of high resolution and had accompanying ChIP-seq data for both CTCF and RAD21 (4DNFIFJH2524 [64], 4DNFI9GMP2J8 [65], ENCFF255FRL [55], ENCFF473IZV [56], ENCFF821AQO [57], and ENCFF913JGA [58]). To assess stripe similarity of a pair of contact matrices, we used the scoring algorithm described in the “Methods” section (part 6). The score was computed using Micro-C data as the ground truth. Pixel accuracy was computed as the ratio of correctly classified pixels to the total number of pixels in a 3 Mbp subdiagonal window around each barrier. The Pearson correlation between OpenMM and MoDLE was calculated based on all corresponding 5 kbp-pixel values in the 3 Mbp subdiagonal window of the OpenMM simulation regions.
Benchmark methodology
Benchmarks were run on a computing cluster using the run_benchmarks Nextflow workflow available on GitHub [48] and archived on Zenodo [49].
We ran two suites of benchmarks to assess the performance of MoDLE and compare it with that of molecular dynamics simulations based on OpenMM.
The first suite (Fig. 3A–C) compared the performance of MoDLE and OpenMM when simulating loop extrusion on an artificial chromosome with increasing length (ranging from 1 to 250 Mbp).
This benchmark was run using MoDLE (1 and 52 CPU cores) as well as OpenMM GPU and CPU implementation (1 CPU core, 1 GPU, and 8 CPU cores respectively). CPU benchmarks were run on server C while benchmarks relying on GPU acceleration were run on server D (see Table 1). For OpenMM CPU implementation, we limited the number of CPU cores to 8 (16 SMT cores) as the CPU implementation is known to not scale well past 16 threads [66]. OpenMM CPU implementation was used to simulate chromosome lengths up to 5 Mbp for practical reasons. MoDLE was run with default settings except for the number of cells, which was set to 104 to match the maximum number of available SMT cores.
OpenMM simulations were run using a monomer size of 2 kbp and LEF processivity and separation of 200 kbp.
The second suite of benchmarks involved simulating loop extrusion on the human genome (GRCh38) using MoDLE with a number of CPU cores ranging from 1 to 52. MoDLE was run with default settings except for the number of cells which was set to 104. The extrusion barrier annotation was generated as described in the “Methods” section (part 1).
In both cases, measurements were repeated 10 times for MoDLE and 5 times for OpenMM.
Genome-wide extrusion barrier parameter optimization
The genome-wide optimization of parameters affecting extrusion barrier occupancies was carried out using the gw_param_optimization Nextflow workflow available on GitHub [48] and archived on Zenodo [49].
The first step in the optimization procedure is running Stripenn v1.1.65.7 [63] on the H1-hESC Micro-C (4DNFI9GMP2J8 [67]) dataset to identify architectural stripes, which resulted in the identification of 5254 stripes. A small subset of these stripes were visually validated by comparing the annotated stripes with stripes that are visible from Micro-C data. Annotated stripes were split into two equally sized datasets by random sampling without replacement. One dataset was used for parameter optimization while the other was used for validation.
Parameter optimization is performed through the Bayesian optimization from scikit-optimize v0.9.0 [68] using an objective function based on the scoring metric described in Methods (part 6).
The parameters that are being optimized are the extrusion barrier occupancy (πB) and PUU, the self-transition probability of the unbound state.
The evaluation of the objective function proceeds as follows:
-
A new genome-wide simulation is performed using the parameters proposed by the optimizer.
-
The resulting cooler file is transformed with modle_tools transform by applying the difference of Gaussians followed by a binary discretization step, where pixel values above a certain threshold are discretized to 1 and all the others to 0.
-
The score described in Methods (part 6) is then computed row and column-wise on the entire genome using modle_tools eval, producing two BigWig files. Here, the transformed Micro-C 4DNFI9GMP2J8 [67] dataset is used as reference.
-
Scores are intersected with the extrusion barrier dataset for optimization and validation considering stripe direction (i.e., vertical stripes are intersected with column-wise scores while horizontal stripes are intersected with row-wise scores).
-
Scores resulting from the intersection are then averaged, producing a floating-point number that is then returned to the optimizer, which will try to minimize this number.
In the transformation step, a σ of 1.0 and 1.6 are used to generate the less and more blurry contact matrices to subtract when computing the difference of Gaussians. For the binary discretization of the Micro-C data, a threshold of 1.5 was used, while simulated data was discretized using 0.75 as threshold.
The optimizer evaluated the objective function 400 times, each time computing the average score for the training and validation datasets.
Finally, the parameters that yielded the best score on the training dataset were used to generate a contact matrix in cooler format (see Fig. 4D, bottom panel).
Local extrusion barrier parameter optimization
The local extrusion barrier parameter optimization was carried out using the extrusion_barrier_param_optimization Nextflow workflow available on GitHub [48] and archived on Zenodo [49].
In brief, this workflow takes as input an extrusion barrier annotation consisting of barrier position and direction, and then optimizes the parameters for each individual barrier to maximize similarities with a reference HiC matrix.
The optimization approach is based on evolutionary algorithms (EAs) and was developed using primitives from the DEAP library [69].
Optimization was performed on a 5 Mbp region of the human chromosome 1 (20–25 Mbp, GRCh38) using the list of candidate CTCF binding sites overlapping this region as extrusion barrier annotation, for a total of 2103 extrusion barriers. Candidate CTCF binding sites were annotated using MAST as described in Methods (part 4). The H1-hESC Micro-C (4DNFI9GMP2J8 [65]) matrix was used as reference.
At a high level, the optimization workflow consists of running the same optimization script three times, using the output of an optimization run as input for the next run. The first run is tuned to favor exploration over exploitation, while the second and third runs used more conservative optimization parameters to progressively reduce the rate of exploration and favor exploitation.
The following is an overview of how the optimization strategy was developed:
-
The optimization uses μ, λ as evolution strategy, where μ is the population size and λ is the number of offspring produced each generation. With this strategy, offspring that make it through the selection stage replace the previous population entirely. By default, μ = 256 and λ = 512.
-
Individuals are represented as two lists of real numbers of size N, where N is the number of extrusion barriers to be optimized. The first list of numbers represent s extrusion barrier occupancies (πB), while the second list represents the self-transition probability of the unbound state (PUU).
-
Individuals are mutated by adding a relatively small offsets to \(\overrightarrow{\pi_B}\) and \(\overrightarrow{P_{UU}}\). Offsets are drawn from a normal distribution with μ = 0 and σ set based on the desired degree of exploration. Values are clamped between 0.0 and 1.0, so mutating an individual always leads to another valid individual.
-
The two-point crossover operator is used for mating.
-
During selection, offsprings are sorted based on their fitness, and the top μ offsprings are selected to proceed to the next generation.
-
The population is initialized differently depending on whether results from a previous optimization run are available.
-
Results from previous optimization are available: population initialized through random sampling with replacement from the set of fittest individuals that ever lived in the previous optimization run.
-
Otherwise, population is randomly initialized by generating μ individuals with πB and PUU set to random numbers drawn from the uniform distribution U(0.0,1.0).
-
Fitness is computed using a slightly modified version of the scoring function \(f\left(\overrightarrow{x}\right)\)described in Methods (part 6). Function \(f\left(\overrightarrow{x}\right)\) is not effective at guiding the optimization when occupancy is relatively low (e.g., < 0.5), and there are no stripes or dots in the reference matrix, as any parameter combination resulting in such a low occupancy produces no visible stripe or dot. To this end, we define a penalty function p(πB) that returns a coefficient between 1.0 and 2.0. The returned coefficient is close to 2.0 when πB approaches 0.5 and rapidly falls to 1.0 when πB moves towards 0.0 or 1.0. πB very close to 1.0 are also penalized. See Additional file 2: Fig. S12 for more details regarding the penalty function p(πB).
-
The output of the scoring function \(f\left(\overrightarrow{x}\right)\) and penalty function p(πB) are multiplied together to produce the score used to compute the fitness of an individual \(s\Big(\overrightarrow{x}\), \({\pi}_B\Big)=f\left(\overrightarrow{x}\right)\cdot p\left({\pi}_B\right)\). The fitness of an individual is computed as the average of the scores \(s\Big(\overrightarrow{x}\), πB) computed in correspondence of each extrusion barrier object of the optimization.
-
The optimization runs until one of the following conditions is met:
-
A target number of generations have been simulated (i.e., 1000 generations).
-
Optimizer failed to significantly improve the population fitness (e.g., less than 1% fitness improvement over the last 25 generations).
-
The population variability approaches 0.
To improve the performance of the optimizer on these regions, we split the population into mainland population and one or more insular populations and change some aspects of the optimization strategy.
First, we initialize and optimize the mainland population (μ = 256 and λ = 512). When one of the stopping criteria is met, the fittest individuals from mainland are used to initialize the population of m islands. For each island, we randomly select and mask k consecutive alleles or barriers. k is generated by rounding a number drawn from a normal distribution with μ = 25 and σ = 5.0. Crucially, masked barriers are inactive and are not allowed to mutate.
For one of the m islands, instead of masking a random stretch of extrusion barriers, we inactivate all weak barriers when initializing the population. Thus, we replace alleles with πB < 0.5 with the πB = 0.0; PUU = 1.0 allele. In this case, all loci are allowed to mutate. Islands have μ = 128 and λ = 256. Islands evolve independently from each other and from the mainland and follow the same stopping criteria used for the mainland.
Once all islands have been optimized, half of the mainland individuals are replaced with individuals from any of the islands. Island individuals are selected using fitness proportionate selection (i.e., random sampling with replacement, weighted by fitness).
Mainland and island optimization continue alternating until a total target number of mainland generations have been simulated, or when an optimization cycle fails to significantly improve the average mainland population fitness.
Simulations to predict the effect of TAD border alterations
Data for this section was generated using the comparison_with_mut Nextflow workflow available on GitHub [48] and archived on Zenodo [49].
Simulations were carried out using GRCm38.p6 as reference genome (GCF_000001635.26 [70].
CTCF and RAD21 ChIP-seq fold-change over control for JM8.N4 was generated by processing data from GSE90994 [71] (SRR5085152 [72], SRR5085153 [73], SRR5085154 [74], SRR5085155 [75], SRR5085156 [76], SRR5085157 [77]) using the ENCODE ChIP-seq pipeline v2 [78] and using ENCODE4 genomic datasets for GRCm38.
The wild-type extrusion barrier annotation was generated following the procedure outlined in the “Methods” section (part 4).
The barrier annotation was further refined using the parameter optimization strategy described in the “Methods” section (part 10) using a JM8.N4 Micro-C dataset as reference (4DNFINNZDDXV [79]).
The optimized extrusion barrier annotation was then mutated by removing extrusion barriers overlapping the del1-13d9lac and delattP-Rel5d9lac regions from Rodríguez-Carballo 2017 [43].