A comprehensive map of human glucokinase variant activity

Background Glucokinase (GCK) regulates insulin secretion to maintain appropriate blood glucose levels. Sequence variants can alter GCK activity to cause hyperinsulinemic hypoglycemia or hyperglycemia associated with GCK-maturity-onset diabetes of the young (GCK-MODY), collectively affecting up to 10 million people worldwide. Patients with GCK-MODY are frequently misdiagnosed and treated unnecessarily. Genetic testing can prevent this but is hampered by the challenge of interpreting novel missense variants. Result Here, we exploit a multiplexed yeast complementation assay to measure both hyper- and hypoactive GCK variation, capturing 97% of all possible missense and nonsense variants. Activity scores correlate with in vitro catalytic efficiency, fasting glucose levels in carriers of GCK variants and with evolutionary conservation. Hypoactive variants are concentrated at buried positions, near the active site, and at a region of known importance for GCK conformational dynamics. Some hyperactive variants shift the conformational equilibrium towards the active state through a relative destabilization of the inactive conformation. Conclusion Our comprehensive assessment of GCK variant activity promises to facilitate variant interpretation and diagnosis, expand our mechanistic understanding of hyperactive variants, and inform development of therapeutics targeting GCK. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02935-8.

diabetes [6][7][8], leading to unnecessary treatment and surveillance [3]. Correct diagnosis of GCK-MODY can therefore terminate pharmaceutical treatments and decrease surveillance [3], with both economic and lifestyle benefits. Diagnosis of GCK-MODY can be achieved by identifying pathogenic variants in the gene encoding glucokinase (GCK,.
GCK regulates glucose levels by catalyzing the first step of glycolysis -the phosphorylation of glucose to form glucose-6-phosphate. Glucose phosphorylation is the ratelimiting step in insulin secretion in pancreatic β-cells [9][10][11] and glycogen synthesis in liver cells [12], and these processes are therefore regulated by GCK activity.
In addition to GCK, there are three other human hexokinases. These hexokinases have a high affinity for glucose and show hyperbolic kinetics. In contrast, GCK has a low affinity for glucose (S 0.5 7.5-10 mM) [13] and sigmoidal kinetics, which together allow GCK to respond rapidly to changes in glucose levels in the physiological range. The sigmoidal kinetics of GCK are due to positive cooperativity with glucose (Hill coefficient 1.7) [14]. This positive cooperativity is unusual as GCK functions as a monomer and contains only one glucose binding-site. Instead, the sigmoidal response to glucose is caused by intrinsic protein conformational dynamics [15][16][17].
GCK is a dynamic 52-kDa enzyme consisting of 465 amino acid residues, which fold into a large and a small domain. Between the two domains is a cleft forming the active site where glucose binds. The orientation of the two domains is not static, as GCK exists in multiple conformational ensembles [18,19]. These ensembles are often described as three conformations: the super-open, open and closed conformation [15]. At low glucose levels, GCK predominantly exists in the super-open conformation. Upon glucose binding, GCK shifts into the open conformation [20,21], and subsequently catalysis takes place in the closed conformation. Due to slow conversion between the super-open and open states [22], the population of each of the three conformations depends on glucose levels. At low glucose levels, GCK shifts into the super-open conformation prior to binding a new glucose molecule following catalysis, which results in slow glucose turnover. Turnover increases with higher glucose levels as GCK bypasses the super-open state and cycles between the open and closed conformations. The ratio between the slow and fast catalytic cycles, dependent on glucose concentration, gives rise to the positive cooperativity that is the basis for GCK sigmoidal kinetics and function [15,23].
Underscoring the importance of GCK function for glucose homeostasis, variants that alter GCK activity are associated with several diseases [24]. Gain-of-function variants that increase GCK activity cluster at an allosteric activator site [25,26] and cause hyperinsulinemic hypoglycemia (HH, MIM# 601820), which is characterized by increased insulin secretion even at low blood glucose levels [25,27]. In contrast, loss-of-function variants where GCK activity is eliminated or decreased cause hyperglycemia. Inactivation of both GCK alleles can result in the severe permanent neonatal diabetes mellitus (PNDM, MIM# 606176) [28,29], while heterozygous mutations cause a mild form of diabetes known as GCK-MODY [1,2]. GCK-MODY has an estimated population prevalence of 0.11%-0.21% [30,31], suggesting about 10 million people worldwide have GCK-MODY. Patients have mild and stable fasting hyperglycemia within 6-8 mM that does not usually require treatment, in contrast to other types of diabetes. Due to the population prevalence of GCK-MODY, and given that early diagnosis can limit unnecessary treatment and surveillance due to misdiagnosis [3], GCK has been proposed for inclusion in population screening programs [32]. Therefore, a comprehensive dataset of GCK activity and an increased understanding of GCK variants is highly warranted.
To address this challenge, we generated a variant effect map of GCK using a multiplexed assay of variant effects [33,34]. We assessed GCK activity using functional complementation in yeast, scoring both hypo-and hyperactive variants. The variant effect map recapitulates both the active site and a known allosteric activator site and includes 9003 of the 9280 (97%) possible missense and nonsense variants. The activity scores correlate with previous in vitro measurements of the catalytic efficiency and fasting blood glucose levels in patients. Furthermore, the map correctly classifies 78% of functionally characterized pathogenic variants that were previously curated [24]. To substantiate the map more broadly, we analyzed evolutionary conservation and conformational free energies. Conservation analysis generally agreed with the activity scores but was unable to capture hyperactive variants. When we examined these variants mechanistically, we found that some hyperactive variants likely shift GCK towards the active conformation through differential destabilization of the super-open and closed conformations. In conclusion, we present a comprehensive map of GCK activity to aid in variant understanding and interpretation.

Assessing human GCK activity using yeast complementation
To measure the activity of human GCK variants at large scale, we coupled yeast growth to human GCK activity using yeast complementation. To test complementation, we constructed a yeast strain deleted for all three yeast hexokinase genes (hxk1Δ hxk2Δ glk1Δ) that is unable to grow on glucose medium (Fig. 1A). This growth deficiency was rescued by expressing human pancreatic GCK (Fig. 1B), as previously shown [35].
Next, we examined whether the complementation assay could separate both hypoand hyperactive pathogenic variants from non-pathogenic GCK variants. We selected an initial test set of eight variants. Two variants (p.D217N and p.E279Q) were used as non-damaging controls, as they were two of the three most common alleles in gnomAD [36], while a third variant (p.D205H) has previously been shown to be catalytically dead [15,37] and was used as a loss-of-function control. The five remaining test-set variants were pathogenic variants from the ClinVar database [38], including both one variant (p.W99R) associated with HH [26] and four variants (p.F150S, p.R191W, p.G223S and p.G261R) with compelling evidence for linkage to GCK-MODY [39][40][41][42]. To test complementation of the eight variants, they were expressed in the hxk1Δ hxk2Δ glk1Δ yeast strain (Additional file 1: Fig. S1). On glucose medium, the two common variants showed growth that was similar to reference ('WT') GCK, while the catalytically dead variant (p.D205H) grew comparable to the vector control (Fig. 1B). The hyperactive variant (p.W99R) grew faster than WT, while the four disease-associated variants with reduced activity (p.F150S, p.R191W, p.G223S and p.G261R) grew slower than WT (Fig. 1B). The growth of the eight GCK variants was consistent with their level of activity, showing that the complementation assay could assess GCK activity.
GCK activity depends on the concentration of its substrate glucose. To find a glucose concentration that enabled detection of variants with both decreased and increased glucose affinities, we next tested complementation on media with varying glucose concentrations (Fig. 1B). A concentration of 0.2% (11.1 mM) glucose enabled detection of the hyperactive variant (p.W99R), while retaining a good dynamic range between WT and variants with decreased activity (p.F150S, p.R191W, p.G223S and p.G261R) (Fig. 1B). This concentration is close to GCK's affinity for glucose (S 0.5 = 7.5-10 mM) [13,24].
To further establish the fidelity of the yeast system, we tested whether glucokinase regulatory protein (GKRP), a known inhibitor of hepatic GCK in humans [43,44], could repress GCK activity in the yeast system. We co-expressed human GKRP and GCK in the hxk1Δ hxk2Δ glk1Δ yeast strain and examined growth on glucose medium (Fig. 1C). Co-expression of GKRP on glucose medium led to reduced growth (Fig. 1C). As GCK expression levels were comparable (Additional file 1: Fig. S2), human GKRP inhibited GCK activity in yeast, further validating the relevance of the yeast system to assay human GCK activity.
Having established an assay coupling yeast growth to GCK enzymatic activity, we used the assay to score the activity of a saturated library of human GCK variants produced by codon-randomization. For library construction, we divided the GCK sequence into three regions that were separately mutagenized, assayed and sequenced. Regional libraries were mutagenized using pools of oligos, each containing a central NNK-degeneracy targeting a codon within the targeted region. Subsequently, we cloned the mutagenized regional libraries en masse into a yeast expression vector and transformed the resulting plasmid libraries into the hxk1Δ hxk2Δ glk1Δ yeast strain. The yeast libraries were grown on 0.2% glucose media in duplicate to select for GCK activity. Before and after selection, plasmids were extracted from yeast libraries and each region was deeply sequenced in tiles of ~ 150 bp (~ 1.6 M-4.8 M reads per sequenced tile) (Fig. 1D) [45]. The resulting reads were used to quantify the relative frequency of each variant both before and after selection, and thus calculate an activity score and an associated measurement error for each variant. Activity scores were rescaled such that synonymous variants had a scores centered on one while nonsense variants had scores centered on zero. The resulting dataset contained scores for 9003 (97%) of the 9280 possible single amino acid GCK variants (including stop codons) ( Fig. 2A), and most of the remaining variants could be imputed using the A score of one (white) corresponds to the activity of wild-type-like synonymous variants, a score of zero (red) corresponds to total loss-of-function (nonsense) variants and a score above one (blue) corresponds to activity above synonymous variants. The median (MED) activity score at each position is included. Yellow indicates the wild-type amino acid and variants with missing data ('missing variants') are shown in grey. The secondary structure of GCK is shown next to the activity map with β-sheets shown as arrows and α-helices shown as waves. A map where missing variants have been imputed can be found in Additional file 1: Fig. S3. B Distributions of activity scores for synonymous variants (green), nonsense variants (red) and missense variants (grey). The activity score of the hyperactive variant p.W99R (blue) and the median activity score of the catalytic residue p.D205 (red) are shown. C The median activity score at each position was mapped onto the closed glucose-bound conformation of GCK (PDB 1V4S). The color scheme is the same as in panel A Human Protein Variant Effect Map Imputation Toolkit [45,46] (Additional file 1: Fig.  S3AB). The activity scores of nonsense and synonymous variants were well separated (Fig. 2B), and the distribution of missense variants spanned from total loss of function to abovewildtype activity (Fig. 2B). The activity of our test-set variants was similar to the lowthroughput growth assays, such that the common variants (p.D217N and p.E279Q) had wild-type-like activity, the HH-associated variant (p.W99R) had increased activity and the GCK-MODY-associated variants (p.F150S, p.R191W, p.G223S and p.G261R) had decreased activity. Although the catalytic site variant p.D205H was not included in the map, all other variants at this position showed decreased activity as expected.
To examine variant effects structurally, we mapped the median activity score at each position onto the structure of glucose-bound GCK [15] (Fig. 2C, Additional file 1: Fig.  S4AB). The resulting activity-colored structure was consistent with characteristics of proteins in general and GCK specifically. First, surface residues were generally resistant to mutations, while active site and buried residues were mutation-sensitive (Fig. 2C, Additional file 1: Fig. S4AB). Second, several positions where mutations on average increased activity clustered at a known allosteric activator site [25]. Together with the expected behavior of our test-set variants, these observations support that the map reflects human GCK activity.

Correlations with enzyme kinetics, fasting blood glucose levels, and clinical genetics
To examine what aspects of GCK activity the map reflects, we examined the correlation between our assay scores and previously published kinetic parameters of 38 variants characterized in vitro [47]. Assay scores correlated with the catalytic efficiency (k cat /S 0.5 ) of GCK variants (Fig. 3A, r s = 0.76, 95% CI [0.58, 0.88]), indicating that our yeast assay captures GCK catalytic efficiency with a dynamic range that includes both decreased and increased values.
We next examined whether the scores of previously classified GCK-MODY, HH, and benign variants separated into distinct classes. Our dataset included 71 variants with experimentally determined activity scores: 68 (60 GCK-MODY, 8 HH) pathogenic variants which have previously been functionally characterized [24] and three benign variants [38,48,49]. We were unable to generate a score for the benign variant p.G68D and, due to the already limited number of benign variants, used the imputed score for this variant obtained using the Human Protein Variant Effect Map Imputation Toolkit [45,46]. Although the assay did not correctly classify all variants, variants belonging to each of the three classes showed distinct distributions centered on either scores comparable to synonymous mutations (benign variants), a hyperactive score (HH variants) or a decreased activity score (GCK-MODY variants) (Fig. 3C). Furthermore, variants with a high allele frequency (> 10 -4 ) in gnomAD [36] had WT-like scores while rarer variants displayed a wide range of activities (Fig. 3D). To determine threshold values to classify variants as either GCK-MODY or HH, we applied receiver-operating characteristic  [47] and the black dotted line shows the best fitting curve. B The activity score plotted against the fasting glucose level of 33 carriers with an identified single variant in the GCK gene. The black dotted line indicates the best fitting curve. Spearmans' ρ is -0.58. C Raincloud plot showing the distributions of activity scores for 60 functionally characterized glucokinase-maturity-onset diabetes of the young (GCK-MODY) variants [24], eight functionally characterized hyperinsulinemic hypoglycemia (HH) variants [24] and three benign variants [38,48,49]. Due to the limited number of benign variants, the imputed activity score of the benign variant p.G68D was used, as no experimental score was obtained for this variant. The black dotted lines show the thresholds for variants associated with GCK-MODY (0.66, AUC = 0.88) and HH (1.18, AUC = 0.94) based on receiver-operating characteristic (ROC) analyses. D Plot showing the activity scores and the associated gnomAD allele frequency [36] for GCK variants present in gnomAD. As in panel C, the imputed activity score of p.G68D was used. In addition, the pathogenic variants from panel C that were not present in gnomAD are plotted to the left of the stippled line (ROC) analyses to the set of previously classified pathogenic and benign variants, noting that the very small number of benign variants hampers a detailed analysis. The threshold for variants associated with GCK-MODY was 0.66 (AUC = 0.88), while variants with a score above 1.18 were predicted to be associated with HH (AUC = 0.94). Using these cutoff values, the activity assay correctly identifies 76.7% of the analyzed GCK-MODY variants and 87.5% of the analyzed HH variants.
Although our assay was able to detect the vast majority of pathogenic variants, some reported pathogenic variants appeared as benign. There were several causes for this misclassification, but as examples we include three variants (p.V62M, p.T65I, and p.H137R) that show different molecular mechanisms causing them to score as benign in our assay. The genetic evidence for p.V62M as a loss-of-function mutation is compelling but initial in vitro kinetic characterization demonstrated increased affinity for glucose. Subsequent functional studies demonstrated that it is thermally labile with evidence for defective regulation by both GKRP and allosteric activators [59]. Similarly, the HH variant p.T65I has an increased affinity for glucose (reduced S 0.5 ) and therefore induces insulin secretion at inappropriate glucose levels [26]. However, p.T65I also has a decreased turnover number, resulting in a WT-like catalytic efficiency [47], which is the aspect of GCK activity that our assay captures. Finally, the GCK-MODY variant p.H137R has been previously reported to have WT-like kinetics but mildly decreased thermal stability [60]. Although our assay was not able to detect the p.H137R instability variant, it could detect severely unstable variants such as p.E300K [61]. Thus, based on an evaluation of these three mutations, which display complex mechanisms for their pathogenicity, we note that our assay may not detect all pathogenic variants involving complex mechanisms, including modest instability, especially for those with WT-like intrinsic catalytic efficiency.

Evolutionary conservation predicts glucokinase variant effects
While our experiments probed nearly all possible GCK variants, only a limited number of variants have previously been characterized experimentally. We therefore examined GCK activity scores more broadly by analyzing evolutionary conservation across species. Conservation analysis can assess the mutational tolerance of each position in a protein [62,63], and is widely used to assess the effect of protein variants [64][65][66][67]. We analyzed GCK evolutionary conservation using a sequence alignment of homologous proteins, evaluating the evolutionary distance between the GCK WT sequence and the single mutant variant, using an alignment that included both hexokinases and glucokinases more widely. The resulting evolutionary distance score (ΔE) quantifies the likelihood of a given substitution (Additional file 1: Fig. S5AB). Therefore, a score close to zero suggests that a substitution is compatible with function and does not affect the structural stability of the native conformations, while variants with a highly negative score are likely to be detrimental. Activity scores correlated weakly with ΔE (r p = 0.44, 95% CI [0.43, 0.46], Additional file 1: Fig. S6A). However, the correlation increased when we compared the two datasets using the residue median score (r p = 0.64, 95% CI [0.57, 0.68], Additional file 1: Fig. S6B), as averaging decreases noise in both datasets. Residueaveraged ΔE and activity scores agreed on regions where mutations severely decreased activity as well as regions that were tolerant towards mutations (Fig. 4A, Additional  Fig. 4 Interpreting the glucokinase activity scores using analyses of evolutionary conservation and conformational free energies. A Plots showing the residue median activity scores and evolutionary conservation scores for GCK. For activity scores, a score of one corresponds to a wild-type-like activity, a score of zero corresponds to total loss of function, and a score above one indicates increased activity. For GEMME scores (ΔE score), a score of zero suggests compatibility with function, while a high negative score indicates that mutations are detrimental.  Fig. S6E). However, the evolutionary conservation analysis did not detect gain-offunction positions nor the mutational sensitivity of the ~ 150-200 region that is likely specific for GCK compared to other hexokinases (Fig. 4A, Additional file 1: Fig. S6E). Residues in this region include the 151-179 loop that undergoes a disorder to order transition upon glucose binding [68]. The glucose-induced conformational changes in GCK are the basis for the sigmoidal kinetics and low glucose affinity that distinguishe GCK from other hexokinases. Hence, the ~ 150-200 region and gain-of-function variants are likely not captured by the evolutionary analysis due to the broadness of the multiple sequence alignment (MSA). When we restricted our analysis to include only residues with a median assay score below 1.18, correlation with ΔE further increased (r p = 0.73, 95% CI [0.69, 0.77], Additional file 1: Fig. S6CD). The remaining deviation between ΔE and activity scores could stem from assay conditions buffering mutation effects, such as the high expression level, the temperature, or the short timeframe of selection compared to evolution, making subtle fitness effects difficult to discern [69]. In conclusion, analysis of evolutionary conservation supports the activity assay. However, our conservation analysis does not include effects that are GCK-specific, such as substitutions that increase activity or affect conformational regulation.

Mechanistic evaluation of variant effects
To explore variants with functional impacts that were not detected via conservation analysis, we examined GCK variant effects mechanistically. We speculated that some variants shift the equilibrium of the conformational ensemble towards the catalytically inactive (or active) conformation, thereby decreasing (or increasing) GCK activity. This shift might arise if a given variant differentially affects the stability of the two conformations. To identify such variants, we used Rosetta [71] to predict changes in thermodynamic protein stability (ΔΔG) for both the super-open and closed GCK structure [15] (Additional file 1: Fig. S7AB), and then calculated the difference between the two structures (ΔΔG super-open -ΔΔG closed , Additional file 1: Fig. S7C). A negative score indicates that a given variant destabilizes the closed conformation relative to the super-open, while positive scores indicate variants that destabilize the super-open conformation relative to the closed. However, this shift is only relevant for variants that do not overly destabilize both structures, which would lead to disruption of the enzyme structure, likely degradation and loss of function.
Variants at most positions destabilized the two conformations equally (Fig. 4B white lines, Additional file 1: Fig. S8A). However, variants shifting the equilibrium towards the super-open inactive conformation were concentrated in two regions surrounding position 150 and 450 respectively (Fig. 4B lower panel, Additional file 1: Fig. S8A). The region surrounding position 450 mapped to an α-helix that is surface-exposed in the super-open state but buried in the closed state (Fig. 4C), consistent with variants causing a greater destabilization of the closed state. Similarly, the residues surrounding position 150 were buried in the active conformation (Fig. 4C). In addition, this region is N-terminal to residues 151-179 that form a disordered loop in the super-open conformation and therefore are not included in the stability calculations. The disordered loop folds into a β-hairpin in the closed conformation [15,68]. Due to this disorder-order transition, mutations in the 151-179 loop/β-hairpin are likely to destabilize the closed state relative to the super-open state. A conformational shift towards the inactive state is therefore a potential mechanism for mutations in the ~ 150-200 region. Positions where variants shifted the equilibrium towards the closed active conformation (Fig. 4B blue lines) were spread throughout the GCK sequence. However, these positions concentrated in the 3D structure at the allosteric activator site (Fig. 4C), which is also evident from the structure of GCK bound to a synthetic allosteric activator (Additional file 1: Fig. S8BC). Previously, several hyperactive variants were examined to determine their mechanism using kinetic analysis [22,70]. By examining enzyme kinetics, Heredia et al. found that p.T65I, p.Y215A and p.A456V mainly accelerate glucose binding, while p.W99R, p.Y214A, p.Y214C and p.V455M predominantly increase the conformational preference towards the active state. Although p.T65I, p.Y215A and p.V455M affect the equilibrium mildly, all variants except p.A456V to some extent increased conversion to the closed state [70]. Consistently, p.A456V shifted the equilibrium towards the super-open conformation according to our stability predictions (Fig. 4D). In addition, except for the p.T65I variant, stability calculations predicted the remaining five variants (p.W99R, p.Y214A, p.Y214C, p.Y215A and p.V455M) to shift the equilibrium towards the active conformation (Fig. 4D). Thus, for six out of seven variants our stability predictions were consistent with prior kinetic and mechanistic analysis, thereby validating our mechanistic predictions. We next used the stability calculations to assess how many hyperactive variants potentially shift the equilibrium towards the active state. The supplementary data contains a list of 467 hyperactive variants (activity score > 1.18) which are predicted to both be stable (ΔΔG closed < 2 kcal/mol) and promote increased populations of the closed active conformation (Δ(ΔΔG) > 0.25 kcal/mol) (see Additional file 2).

Discussion
Recent developments in molecular biology and sequencing technologies have made it possible to perform deep mutational scanning experiments, in which thousands of gene variants are assayed in a single, multiplexed experiment. Here we applied this technology to GCK, a gene of central importance in metabolism and where missense and nonsense variants are associated with several diseases. The resulting activity map covers 97% of all possible single amino acid variants, and activity scores correlate with in vitro catalytic efficiency, fasting glucose levels in individuals carrying GCK variants, and evolutionary conservation.
There are a number of limitations to our study. First, due to our mutagenesis strategy, each variant in the library may contain multiple mutations within the mutagenized region. As each regional library is sequenced in tiles, only in cis mutations that occur within the same ~ 150 bp tile are detected. To limit the impact of potential background mutations that co-occur in the same clone(s), the regional libraries have a low mutational density, with an estimated average of 0.3-0.4 mutations per mutagenized region, and a high coverage so that each variant's score is based on a large number of independent clones. In this way, the impact of secondary variants within a clone can be averaged out. However, a few variant scores may be affected by background mutations, as potentially seen for synonymous variants. The activity score distributions of synonymous and nonsense variants show a slight overlap (Fig. 2B). For synonymous variants with an unexpectedly low activity score, the vast majority are associated with a high estimated standard deviation, and the remaining variants might be explained by background mutations. For nonsense variants associated with an unexpectedly high activity score, the majority are associated with a high standard deviation or are in the extreme C-terminus where they appear compatible with activity. The remaining high-scoring nonsense variants could be due to stop codon readthrough, which is known to occur contextdependently in yeast [72,73], and/or the presence of yeast cells transformed with vectors expressing multiple distinct variants, although we attempted to limit this by outgrowth for three days following transformation [74]. We also note that apparently functional nonsense variants have previously been observed in high throughput cell-based assays [45,[75][76][77][78]. A second limitation is that apparently-functional nonsense variants in our study should not be confidently inferred to provide normal function in humans, as our assay does not provide a faithful model for impacts due to nonsense-mediated decay in a human cell. A third limitation is that our mutagenesis libraries include only synonymous, missense and nonsense variants. Other types of pathogenic variation have previously been identified in the GCK gene, including insertions [79,80], a promoter variant [81], and splice site variants. A map of missense variants is, however, still highly relevant. First, the effects of missense variants can be challenging to interpret without functional data. Second, missense variants constitute 437 out of the total 794 current GCK ClinVar entries [38] (accessed on February 20 th 2023), and of the 201 GCK variants that have been annotated as variants of uncertain significance, 161 (80%) of these are missense variants. A fourth limitation of our study is that 22.1% of the analyzed pathogenic variants have scores that are similar to those of wild-type-like synonymous variants. The sensitivity of the assay could potentially be increased by repeating the assay using different expression levels and glucose concentrations as well as increasing the number of replicates. However, others have found that pathogenic variants, even when thoroughly examined in low-throughput assays, can appear wild-type-like [47,82,83]. Multiple assays characterizing different aspects of GCK function might therefore be necessary to capture all pathogenic variants.
The hyperactive variants identified in our assay clustered at a site distal from the active site, known as the allosteric activator site [25]. The allosteric site is being pursued as a target for treatment of type 2 diabetes using drugs known as glucokinase activators (GKAs) [84,85]. The development of GKAs has faced several problematic effects including accumulation of triglycerides in the liver [86,87] and hypoglycemia [88,89]. However, recently both a dual-acting and a hepatoselective GKA have shown promising results [90,91]. Most GKAs bind the allosteric activator site and mimic the effect of known hyperactive variants. Previously, a few hyperactive variants were found to increase GCK activity by favoring the conformational change to the closed active state [22,70]. We extended the mechanistic analysis of hyperactive variants using predictions of protein stability for the super-open and closed conformations. Our findings are in accordance with prior kinetic analyses [22,70] and we identify 467 hyperactive variants that are predicted to shift GCK towards the closed state. Furthermore, our results indicate that these variants facilitate isomerization by destabilizing the super-open conformation relative to the closed. In conclusion, our study provides a comprehensive mapping of the allosteric activator site and an improved mechanistic understanding of some hyperactive variants. Our results may aid in refining drug design and the development of GKAs, specifically to address the problem of hypoglycemia resulting from GKA treatment.
In contrast to hyperactive variants we found that, in the region spanning residues ~ 150-200, nearly all substitutions substantially decreased activity. Variants previously identified in the region are associated with loss of activity and elevated fasting plasma glucose levels [24,47], and this region is central in the conformational dynamics of GCK [17,68]. Hence, substitutions in this region might interfere with the transitions required for activity. In addition, some variants might destabilize the closed active conformation relative to the super-open state, thereby leading to increased population of the inactive state. However, the mechanisms underlying the mutational sensitivity of this region are not entirely clear and require further studies.

Conclusions
Here, we provide the first comprehensive map of GCK variant activity, measuring the functional consequences of thousands of previously uncharacterized GCK variants. More than 1 in 1000 people are estimated to suffer from GCK-MODY, and although the functional evidence from our variant effect map cannot alone classify a variant as pathogenic, it increases our understanding of GCK variants including those causing GCK-MODY.

Yeast growth assays
For yeast growth assays on solid medium, cultures were grown overnight (30 °C, vigorous agitation) until reaching exponential phase. The yeast cells were washed once in sterile water (1200 g, 5 min, RT) and resuspended in sterile water. The OD 600nm of all cultures were adjusted to 0.4, and the cultures were used for a five-fold serial dilution in water. Serial-diluted cultures were spotted in 5 µL drops onto agar plates, which were briefly dried and incubated at 30 °C for 2-4 days.

Protein extraction from yeast cells
For extraction of proteins from yeast, cultures were grown overnight (30 °C, vigorous agitation) until reaching exponential phase, at which point 100-125 × 10 6 cells were harvested (1200 g, 5 min, 4 °C). The pelleted cells were washed in 25 mL ice-cold water (1200 g, 5 min, 4 °C). The cell-pellet was resuspended in 1 mL 20% ice-cold TCA, transferred to an Eppendorf tube and centrifuged (4000 g, 5 min, 4 °C). The supernatant was discarded and the pellet was resuspended in 200 µL 20% ice-cold TCA. The resuspension was transferred to a screw cap tube containing 0.5 mL glass beads (Sigma), and cells were lysed using a Mini Bead Beater (BioSpec Products) by three 15 s cycles with 5 min incubations on ice between each burst. Then, 400 µL ice-cold 5% TCA was added, the tubes were punctured at the bottom using a needle and transferred to a 15 mL Falcon tube containing a 1.5 mL Eppendorf tube without the lid. The sample was isolated from the glass beads by centrifugation (1000 g, 5 min, 4 °C). The Eppendorf tube containing the sample was centrifuged (10,000 g, 5 min, 4 °C) and the resulting pellet was washed in 500 µL 80% acetone (10,000 g, 5 min, 4 °C). The acetone was removed and the pellet was dried for 5 min before resuspension in 100 µL SDS sample buffer (1.5x) and 25 µL 1 M Tris/HCl, pH 9. The samples were boiled for 5 min and cleared by centrifugation (5000 g, 5 min, RT). The supernatant was transferred to an Eppendorf tube and was analysed by SDS-PAGE and Western blotting.

Electrophoresis and blotting
SDS-PAGE was done using 12.5% acrylamide gels. Following SDS-PAGE, 0.2 µm nitrocellulose membranes were used for the Western blotting procedure. After protein transfer, membranes were blocked in 5% fat-free milk powder, 5 mM NaN 3 and 0.1% Tween-20 in PBS.

Library mutagenesis
To construct a library of GCK variants, oligos containing a central degenerate NNK codon were designed for each codon in the GCK sequence. An online tool (http:// llama. mshri. on. ca/ cgi/ popco deSui te/ main) [45] was used to obtain oligo sequences, and oligos were obtained from Eurofins. The GCK sequence was divided into three regions spanning aa 2-171 (region 1), 172-337 (region 2) and 338-465 (region 3). Oligos were pooled for each region. The three regional oligo pools were phosphorylated using T4 Polynucleotide Kinase (NEB) as recommended by the provider using 300 pmol of each oligo pool and incubation for one hour at 37 °C. Then, the phosphorylated oligos were annealed to the WT GCK sequence. For each region, the following was combined in a PCR tube: 25 fmoles pENTR221-GCK, 3 µL 10 µM SKG_1, 5.6 µL oligo pool, 10.1 µL nuclease-free water. The reactions were denatured (95 °C, 3 min) and cooled (4 °C, 5 min). Following template annealing, 5 µL of each reaction was combined with 5 µL Phusion Hot Start Flex 2 × Master Mix (NEB) and sequences were extended (95 °C 3 min, 4 °C 5 min, 50 °C 120 min). To each reaction, the following was added: 1.5 µL Taq DNA ligase buffer (NEB), 0.5 µL Taq Ligase (NEB), 3 µL nuclease-free water, followed by incubation (45 °C 20 min). Next, 1 µL of each reaction was combined with the following: The resulting PCR products containing Gateway attB-sites were gel purified and used for Gateway cloning.

Library cloning
Next, PCR products were cloned into pDONR221 to generate three regional pENTR221 libraries. For each of the three regions a 25 µL Gateway BP reaction was prepared: 114 ng PCR product, 375 ng pDONR221, 5 µL Gateway BP Clonase II enzyme mix (ThermoFisher), TE Buffer pH 8.0 to 25 µL. Reactions were incubated overnight at RT. The following day, 3.1 µL proteinase K was added and reactions were incubated for 10 min at 37 °C. For each region, 4 µL BP reaction was transformed into 100 µL of NEB 10-beta electrocompetent E. coli cells using electroporation. The cells were recovered in 3900 µL NEB 10-beta outgrowth medium in 50 mL Falcon tubes at 37 °C for 1 h. Then, cells were plated on LB containing kanamycin and were incubated at 37 °C overnight. A minimum of 500,000 colonies were obtained for each region. The following day, the cells were scraped from the plates using sterile water and plasmid DNA was extracted from 400 OD 600nm units (Nucleobond Xtra Midiprep Kit, Macherey-Nagel).
The resulting pENTR221-GCK libraries were used in large-scale Gateway LR reactions to clone the libraries into the pAG416GPD-EGFP vector. For each region, the following was mixed: 216.9 ng pENTR221-GCK library, 450 ng pAG416GPD-EGFP vector, 6 µL Gateway LR Clonase II enzyme mix (ThermoFisher), TE Buffer pH 8.0 to 30 µL.
The reactions were incubated at RT overnight. Next day, 3 µL proteinase K was added to each reaction and tubes were incubated at 37 °C for 10 min. The LR reactions were transformed into NEB 10-beta electrocompetent E. coli cells using electroporation of 4 µL reaction per 100 µL cells. Cells were recovered in NEB 10-beta outgrowth medium for 1 h at 37 °C, and were then plated on LB containing ampicillin for incubation at 37 °C overnight. A minimum of 500,000 colonies were obtained for each regional library. The following day, cells were scraped from plates using sterile water and plasmid DNA was extracted from 400 OD 600nm units (Nucleobond Xtra Midiprep Kit, Macherey-Nagel). The resulting plasmid DNA was used for yeast transformation.

Library yeast transformation
The GCK expression libraries were transformed into the hxk1Δ hxk2Δ glk1Δ strain as described in [94] using a 30 × scale-up and 30-60 µg of plasmid DNA. Small aliquots of each transformation were plated on SC-URA galactose in duplicate for colony counting. The rest of each transformation was diluted in SC-URA galactose to an OD 600nm of 0.2, and incubated at 30 °C with shaking until saturated. A minimum of 500,000 colonies were obtained for each regional library. For each regional library, 36 OD 600nm units of yeast cells were harvested in duplicate to serve as pre-selection samples. Pellets were stored at -20 °C before plasmid DNA extraction. In parallel with library transformations, a vector (pAG416GPD-EGFP-ccdB) and GCK WT control were transformed in small scale, and 36 OD 600nm units in duplicate were stored at -20 °C for DNA extraction.

Library selection
Yeast transformations were next used for selection of GCK activity. For each regional library, 20 OD 600nm units of yeast cells were harvested in duplicate to serve as two biological repeats. The cells were washed three times in sterile water and resuspended in 500 µL sterile water. The cells were then plated on large plates (500 cm 2 ) of SC-URA 0.2% glucose. Plates were incubated at 30 °C for three days. After selection, yeast cells were scraped off plates using 30 mL milliQ water and 36 OD 600nm units from each plate were harvested to serve as post-selection samples. Cell pellets were stored at -20 °C prior to plasmid DNA extraction. In parallel 2.6 OD 600nm units of vector (pAG416GPD-EGFP-ccdB) and GCK WT yeast transformations were washed, plated and incubated for three days at 30 °C in duplicate. After incubation, 36 OD 600nm units of cells were harvested and stored at -20 °C prior to plasmid DNA extraction.

Library sequencing
To determine the change in frequency of variants after selection for GCK activity, we sequenced the GCK ORF before and after growth on glucose medium. The GCK ORF was sequenced in 14 tiles, such that each tile could be sequenced on both strands to reduce base-calling errors. Region one (tile 1-5) and two (tile 6-10) both spanned five tiles, while region three was sequenced in four tiles (tile [10][11][12][13][14]. First, plasmid DNA was extracted from yeast cells for both duplicates of the regional libraries and the GCK WT control pre-and post-selection. Plasmid DNA was extracted from 9 OD 600nm units using the ChargeSwitch Plasmid Yeast Mini kit (Invitrogen). Plasmid DNA was adjusted to equal concentrations, and was then used for two rounds of PCR to first amplify each tile and then add index sequences to allow for multiplexing. In the first PCR, each tile was amplified with primers containing a binding site for Illumina sequencing adapters. For each tiling PCR, the following was mixed: 20 µL Phusion High-Fidelity PCR Master Mix with HF Buffer (NEB), 1 µL 10 µM forward primer, 1 µL 10 µM reverse primer, 18 µL plasmid library template. The sequences of forward and reverse primers for each tile are listed in the supplemental material (SKG_tilenumber_fw/rev). Tiles were amplified using the following program: 98 °C 30 s, 21 cycles of 98 °C 10 s, 63 °C 30 s, 72 °C 60 s, followed by 72 °C 7 min and 4 °C hold.
In the next PCR, Illumina index adapters were added to all amplified tiles. Each PCR consisted of: 20 µL Phusion High-Fidelity PCR Master Mix with HF Buffer (NEB), 2 µL 10 µM i5 indexing adapter, 2 µL 10 µM i7 indexing adapter, 1 µL 1:10 diluted PCR product, 15 µL nuclease-free water. Tiles were amplified using the following program: 98 °C 30 s, 7 cycles of 98 °C 15 s, 65 °C 30 s, 72 °C 120 s, followed by 72 °C 7 min and hold at 4 °C. An equal volume of each PCR was then pooled and 100 µL were used for gel extraction from a 4% E-Gel EX Agarose Gel (Invitrogen). The fragment size and quality of the extracted DNA were tested using a 2100 Bioanalyzer system (Agilent), and DNA concentration was determined using Qubit (ThermoFisher). Finally, libraries were paired-end sequenced using an Illumina NextSeq 550.

Statistical analyses
The empirically determined standard deviations of the activity scores were likely imprecise, as they were based on only two replicates. To obtain more robust estimates of standard deviations, we used Bayesian regularization or refinement as described by Baldi and Long [95]. The prior estimate of the standard deviation was obtained by linear regression based on the fitness score and the read counts from the permissive condition. The prior was combined with the empirical standard deviation using Baldi and Long's original formula: Here, σ 0 is the prior estimate obtained from regression, v 0 represents the degrees of freedom assigned to the prior estimate, n is the number of experimental replicates, and s is the empirical standard deviation. The full code used for performing error regularization can be found on Github (https:// github. com/ jweile/ tiles eqMave, version 0.6.0.9000).
Confidence intervals (CIs) were obtained using the SciPy bootstrap function with 10,000 resamples.

Imputation of missing activity scores
The Human Protein Variant Effect Map Imputation Toolkit webserver [45,46], was used to impute activity scores for missing variants. The webserver was run using standard parameters and with equal quality index on all variant scores. The original and imputed refined scores showed a Pearson's correlation of 0.985. The imputed scores were only used for Additional file 1: Fig. S3 except for the benign variant p.G68D. The imputed score of p.G68D was used for Fig. 3CD and receiver-operating characteristic (ROC) analyses, due to the limited number of benign variants.

Evolutionary conservation analysis
The HHblits suite [96,97] and GEMME package [98] were used to evaluate evolutionary distances from the WT GCK sequence (Uniprot ID: P35557-isoform 1). The MSA was generated using HHblits with an E-value threshold of 10 -20 and using UniClust30 as sequence database. The resulting MSA contained 1179 sequences. Two additional filters were applied to the HHblits output MSA: First, all the columns that were not present in the WT GCK sequence were removed. Second, all the sequences with more than 50% gaps were removed, leaving 1079 sequences in the MSA. The GEMME package was run using standard parameters. Median scores were evaluated for each position using all the 19 substitutions.

Thermodynamic stability measurements (ΔΔG)
Rosetta package (GitHub SHA1 c7009b3115c22daa9efe2805d9d1ebba08426a54) with Cartesian ΔΔG protocol [71,99] was used to predict changes in thermodynamic stability from the crystal structures [15] of super-open (PDB 1V4T) and closed (PDB 1V4S) conformations of GCK. The values obtained from Rosetta in internal Energy Unit were divided by 2.9 to convert the unit to kcal/mol [71]. Median scores were evaluated for each position using all the 19 substitutions.

Fasting plasma glucose study population
Variants in GCK were identified using sequencing. Samples were collected from a population-based cohort of 6,058 individuals both with and without diabetes [50], 2,930 patients with newly-diagnosed diabetes [51], patients diagnosed with GCK-MODY [53] and from a population of 1,146 Danish children [52]. Individuals were included if they carried one missense GCK variant according to transcript NM_000162 and if a measure of fasting plasma glucose was available. Measures of fasting plasma glucose were examined using a glucose oxidase method (Granutest; Merck, Darmstadt, Germany) in the population based cohort and in samples from patients with known GCK-MODY [50,53], an enzymatic hexokinase method (Gluco-quant Glucose/HK, Roche Diagnostics) in newly diagnosed diabetes patients [51], and using a Dimension Vista ® 1500 Analyzer (Siemens, Erlangen, Germany) in children [52]. Samples were excluded if fasting plasma glucose level exceeded 9 mM.