A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog

Background The accurate description of ancestry is essential to interpret and integrate human genomics data, and to ensure that advances in the field of genomics benefit individuals from all ancestral backgrounds. However, there are no established guidelines for the consistent, unambiguous and standardized description of ancestry. To fill this gap, we provide a framework, designed for the representation of ancestry in GWAS data, but with wider application to studies and resources involving human subjects.

Results Here we describe our framework and its application to the representation of ancestry data in a widely used, publicly available genomics resource, the NHGRI-EBI GWAS Catalog. We present the first analyses of GWAS data using our ancestry categories, demonstrating the validity of the framework to facilitate the tracking of ancestry in big data sets. We demonstrate the broader relevance and integration potential of our method by using it to describe the well-established HapMap and 1000 Genomes reference populations. Finally, to encourage adoption, we outline recommendations for authors to implement when describing samples.

Conclusions While the known bias towards inclusion of European ancestry individuals in GWA studies persists, African and Hispanic or Latin American ancestry populations contribute a disproportionately high number of associations, suggesting that analyses including these groups may be more effective at identifying new associations. We believe the widespread adoption of our framework will increase standardization of ancestry data, thus enabling improved analysis, interpretation and integration of human genomics data and furthering our understanding of disease.


Detailed description
The detailed description aims to represent accurately the ancestry or genealogy of each distinct group analyzed in a specific study, as reported by the author. Information about the homogeneity of the samples, including whether the cohort is admixed or taken from a founder or isolated population, is included. In the GWAS Catalog, the majority of the detailed descriptions include terms that describe the location of participants' ancestors over the past few generations ("French", "Japanese"), while admixed populations are primarily described using ethnic descriptors ("Hispanic"). Isolated populations are described using either location or ethnicity terms in addition to being described explicitly as genetically isolated ("Old Order Amish (founder or genetic isolate) population", "Norfolk Island (founder or genetic isolate) population").

Ancestry categories
Ancestry category assignment from the list presented in Table 1 requires careful consideration. When clearly stated, author-reported categories are extracted, with precedence given to genetically inferred data. If a category is not stated, curators infer the category based on the detailed description for the sample, which, as noted above, represents author-provided information.

In the absence of any ancestry data, the category "Not Reported" is assigned, unless geographical location of sample recruitment is stated. In such instances, curators infer ancestry from external sources, such as the United Nations [15] and The World Factbook [16]. Selecting a category for samples that derive from a country with a homogeneous demographic composition, such as Japan, is straightforward. However, for samples from populations with limited known genetic genealogy, such as Azerbaijan, or for samples recruited in countries with ancestral diversity, such as Singapore, assigning a category is more challenging. These sources are particularly useful to obtain geographical and country-specific population information. The World Factbook is a regularly updated, comprehensive compendium of worldwide demographic data, covering all countries and territories of the world. However, since it does not necessarily provide ancestry data, the World Factbook is consulted only when the country of recruitment is the sole information available about the samples. We expect that as increased care is taken to accurately report ancestry data, reliance on this resource will decrease. Peer-reviewed population genetic studies that characterize the genetic background of a given population may also be consulted. This is particularly helpful in cases where the cohort's ancestry is self-reported or is described using geographical or ethno-cultural terms, such as "Scandinavian" or "Punjabi Sikh". Supplementary Table 3 provides a list of countries for which external sources were consulted.

If the ancestry data provided in publications do not allow the resolution of samples into ancestrally distinct sets, more than one category may be selected from the list in Table 1
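The order of precedence described above can be sketched in code. This is a minimal illustration rather than the Catalog's actual curation tooling: the `SampleSet` fields, the `assign_category` function and both lookup tables are hypothetical names introduced here, and the category label "East Asian" is assumed to appear in Table 1. Where the text says assignment is challenging (for example Singapore), the sketch returns no category and leaves the decision to a curator.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NOT_REPORTED = "Not Reported"

# Hypothetical lookup tables, for illustration only. Real curation relies on
# Table 1, curator judgement and the external sources cited in the text.
DESCRIPTION_TO_CATEGORY = {"French": "European", "Japanese": "East Asian"}
COUNTRY_TO_CATEGORY = {"Japan": "East Asian"}  # demographically homogeneous


@dataclass
class SampleSet:
    genetically_inferred: Optional[str] = None    # category stated from genetic data
    author_stated: Optional[str] = None           # category stated by other means
    detailed_description: Optional[str] = None    # e.g. "Japanese"
    country_of_recruitment: Optional[str] = None  # e.g. "Singapore"


def assign_category(sample: SampleSet) -> Tuple[Optional[str], str]:
    """Return (category, basis); category is None when a curator must decide."""
    # 1. Author-reported categories, genetically inferred data first.
    if sample.genetically_inferred:
        return sample.genetically_inferred, "author-reported, genetically inferred"
    if sample.author_stated:
        return sample.author_stated, "author-reported"
    # 2. Otherwise infer from the author-provided detailed description.
    if sample.detailed_description:
        inferred = DESCRIPTION_TO_CATEGORY.get(sample.detailed_description)
        return inferred, "inferred from detailed description"
    # 3. With only a recruitment country, fall back on external sources.
    if sample.country_of_recruitment:
        inferred = COUNTRY_TO_CATEGORY.get(sample.country_of_recruitment)
        return inferred, "inferred from country of recruitment"
    # 4. No ancestry data and no location.
    return NOT_REPORTED, "no ancestry data"
```

Returning the basis alongside the category keeps a record of how each assignment was reached, which is the distinction the Methods later draw between genetically assessed, inferred and unreported ancestry.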

Country information
Country of recruitment (Figure 1) and country of origin provide additional demographic information and are extracted for each distinct sample set. Country of origin or recruitment is author-reported and not inferred from ancestry data. An exception is made for occasions when authors combine country of recruitment with an ancestry description ("Singaporean Chinese"). In these cases, we infer the country of recruitment ("Singapore") although it is

Catalog, but focused exclusively on the detailed descriptions, which are heterogeneous as they are based on the authors' language. Here we present the first analyses using our ancestry categories and demonstrate the validity of our framework to facilitate the tracking of ancestry in big data sets.

The analysis of over 3,000 GWAS publications revealed inconsistent and ambiguous reporting of ancestry data, with a significant percentage of studies (~4%) not reporting any ancestry information at all. Given that there are no established guidelines for the description of ancestry, and in an effort to assist the community as it seeks to improve in this area, we here provide a set of specific recommendations for authors, also summarized in Box 1. We believe implementation of these recommendations will improve the quality of reporting and have a positive impact on the interpretation of published results, data re-use and reproducibility.

We recommend that authors make every effort to generate a detailed description for each distinct set of individuals included in their studies. Authors should also assess whether the genetic diversity of each distinct set is representative of one of the known populations listed and defined in Table 1,

populations. This will greatly facilitate integration of studies involving these populations with data included in the Catalog, and, indeed, with any other resource that implements our framework. We demonstrate the utility of the ancestry categories to simplify the tracking of efforts towards diversity, allowing the identification of gaps and highlighting specific areas for improvement. Interestingly, in addition to confirming known biases, our category-based analyses revealed that African and Hispanic or Latin American ancestry populations contribute a disproportionately high number of associations, suggesting that analyses including these groups may be more effective at identifying new associations. Finally, stemming from our extensive manual review of publications, we note a lack of current standards with regard to ancestry reporting and offer recommendations to authors to implement when describing their samples. This, we believe, will increase consistency and reduce ambiguity, facilitating the interpretation of results.

We do not view our categories as exhaustive or static. We envision that as more cohorts from diverse populations are characterized, there might arise a need to create additional categories or sub-categories. In addition, anticipating that admixture is likely to increase in the future, due to migration, for example, we also created categories to represent known (for example, "Hispanic or Latin American") and emerging (for example, "Other admixed ancestries") admixed groups. We recognize that classification of admixed samples is particularly challenging.
The degree and type of admixture may vary within the population, and the accuracy of classification requires well-defined reference samples, which are lacking for some groups. As the community moves towards genetically inferred ancestry descriptions, our categories are likely to become more precise and granular over time.

European populations. Of the commonly studied traits, the largest diversity of backgrounds was found for common anthropometric traits, heart disease, and type 2 diabetes. This is perhaps not surprising considering that metrics for these traits are easy to obtain, and the two diseases are among the top ten causes of death around the world, according to the World Health

which global disease burden is substantial tend to lead to increased funding and research infrastructure. While we are encouraged by the trend we have seen in recent years towards increased diversity, we note that there are still very clear gaps as some groups continue to be underserved or ignored. We strongly urge the scientific community to expand their efforts to assemble and analyze cohorts, especially from underrepresented communities.

To determine the distribution of individuals, associations and traits by ancestry category, we first downloaded all Catalog data in tabular form [14]. All data (gwas-catalog-associations_ontology-annotated.tsv, gwas-catalog-ancestry.tsv, gwas-catalog-studies_ontology-associated.tsv, gwas-efo-trait-mappings.tsv) included in these analyses were curated from GWA studies

Inferred from limited ancestry-related information (e.g. country information), 5. No ancestry information reported and 6. Mixed method (when a combination of methods was utilized to describe the study samples).
Publications classified as "Genetically assessed" include those where the author had clearly identified the genetic ancestry or admixture of the population, for example by using methods such as those described in Supplementary Box 1. They also include those that confirmed self-reported information or defined samples based on self-reports but then excluded genetic outliers. Publications where no ancestry was stated, but curators inferred an ancestry based on country information, are included in the fourth classification. In many cases authors used a statistical method to assess or control for ancestry or population stratification, without assigning individuals to a particular category, for example using a continuous axis of genetic variation from PCA to compute the association statistic. However, since this did not add any information that curators could use to assign a population ancestry to the study, it was not included under category 2.
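The tally of individuals by ancestry category described in these Methods can be sketched as follows. This is an illustrative outline, not the authors' analysis code: the function names are hypothetical, the column headers `BROAD ANCESTRAL CATEGORY` and `NUMBER OF INDIVIDUALS` are assumptions about the layout of gwas-catalog-ancestry.tsv and should be checked against the downloaded file, and the rows below are toy values rather than Catalog data. Treating an empty category cell as "Not Reported" is likewise an assumption.

```python
import csv
import io
from collections import Counter


def tally_individuals(tsv_file,
                      category_col="BROAD ANCESTRAL CATEGORY",  # assumed header
                      count_col="NUMBER OF INDIVIDUALS"):       # assumed header
    """Sum sample sizes per ancestry category from a tab-separated file object."""
    totals = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        raw = (row.get(count_col) or "").strip()
        if not raw.isdigit():  # skip rows with a missing or non-numeric count
            continue
        # Assumption: an empty category cell is counted as "Not Reported".
        category = (row.get(category_col) or "").strip() or "Not Reported"
        totals[category] += int(raw)
    return totals


def proportions(totals):
    """Convert per-category totals into fractions of all counted individuals."""
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {cat: n / grand_total for cat, n in totals.items()}


# Toy rows for demonstration only; these are not real Catalog values.
toy_rows = [
    ("STUDY", "BROAD ANCESTRAL CATEGORY", "NUMBER OF INDIVIDUALS"),
    ("s1", "European", "1000"),
    ("s2", "European", "500"),
    ("s3", "Hispanic or Latin American", "300"),
    ("s4", "", "200"),
    ("s5", "European", "NR"),
]
toy_tsv = io.StringIO("\n".join("\t".join(r) for r in toy_rows))
totals = tally_individuals(toy_tsv)
shares = proportions(totals)
```

Sample sets annotated with more than one category are kept here under their combined label rather than split, so that no individual is counted twice; an analysis that needs per-category figures would have to decide how to apportion such rows.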

Declarations
Ethics approval and consent to participate

The datasets generated and/or analyzed during the current study are available on the NHGRI-EBI GWAS Catalog search interface [4] and in spreadsheet form [14].