Last updated: 2026-03-31

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20220216)

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 82a1d86

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 82a1d86. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    BC2GM/
    Ignored:    BioC.dtd
    Ignored:    FormatConverter.jar
    Ignored:    FormatConverter.zip
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    doccano_venv/
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/supplement/
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacy_venv_requirements.txt
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  code/full_text_conversion/html_to_xml.R
    Untracked:  code/text_mining_models/tokenise_data.py

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/index.Rmd) and HTML (docs/index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	82a1d86	IJbeasley	2026-03-31	Fix broken link in index page
html	ead0cde	IJbeasley	2026-03-24	Build site.
Rmd	d2fc105	IJbeasley	2026-03-24	Update index page
html	27c86b7	IJbeasley	2026-01-12	Build site.
Rmd	7acc413	IJbeasley	2026-01-12	Update index page
html	1a9afee	IJbeasley	2026-01-03	Build site.
Rmd	04bbd01	IJbeasley	2026-01-03	Update index page
html	326619e	IJbeasley	2025-10-27	Build site.
Rmd	8d734db	IJbeasley	2025-10-27	Update index page
html	52b8c49	IJbeasley	2025-10-09	Build site.
Rmd	97db3a5	IJbeasley	2025-10-09	Update index page
html	fba0bc8	IJbeasley	2025-09-24	Build site.
Rmd	a04e62d	IJbeasley	2025-09-24	More fixing diseases
html	5067e95	IJbeasley	2025-09-23	Build site.
html	6f76afa	IJbeasley	2025-09-23	Build site.
html	1a571a5	IJbeasley	2025-09-22	Build site.
html	99347f5	IJbeasley	2025-09-22	Build site.
Rmd	e15c4b8	IJbeasley	2025-09-22	Even more typo etc.
html	ff6e030	IJbeasley	2025-09-22	Build site.
Rmd	661b9c1	IJbeasley	2025-09-22	More typo correcting
html	0f48914	IJbeasley	2025-09-17	Build site.
html	1e5f203	IJbeasley	2025-09-17	Build site.
html	9fdfd2e	IJbeasley	2025-09-16	Build site.
html	6ed4b80	IJbeasley	2025-09-16	Build site.
html	74451e3	IJbeasley	2025-09-16	Build site.
html	38a161d	IJbeasley	2025-09-15	Build site.
html	6477659	IJbeasley	2025-09-15	Build site.
html	10775a9	IJbeasley	2025-09-15	Build site.
Rmd	7733a7b	IJbeasley	2025-09-15	Add disease terms unique number to index workflowr page
html	846312b	IJbeasley	2025-09-15	Build site.
Rmd	1bb743c	IJbeasley	2025-09-15	workflowr::wflow_publish("analysis/index.Rmd")
html	784ab91	IJbeasley	2025-09-15	Build site.
Rmd	cf46ce3	IJbeasley	2025-09-15	Update workflow index
html	153e9c5	IJbeasley	2025-09-15	Build site.
Rmd	dd214e1	IJbeasley	2025-09-15	Update workflow index
html	df2faf1	IJbeasley	2025-09-14	Build site.
Rmd	edc356d	IJbeasley	2025-09-14	Fix typo on index page
html	98018fa	IJbeasley	2025-09-14	Build site.
Rmd	d94474f	IJbeasley	2025-09-14	workflowr::wflow_publish("analysis/index.Rmd")
html	6b00297	IJbeasley	2025-09-09	Build site.
Rmd	906b603	IJbeasley	2025-09-09	Update index page
html	7e1b0fb	IJbeasley	2025-08-25	Build site.
Rmd	b293dab	IJbeasley	2025-08-25	Update homepage links
html	d5a6d6a	IJbeasley	2025-08-21	Build site.
Rmd	6e60a4b	IJbeasley	2025-08-21	Woops typos in index page
html	f6f4371	IJbeasley	2025-08-21	Build site.
Rmd	19d0887	IJbeasley	2025-08-21	Woops typos in index page
html	c055fb3	IJbeasley	2025-08-21	Build site.
Rmd	5dc8894	IJbeasley	2025-08-21	Updating index page
html	d49291e	IJbeasley	2025-08-20	Build site.
Rmd	d44c981	IJbeasley	2025-08-20	Update links in index
Rmd	4c3a4fb	IJbeasley	2025-08-19	Adding more cohort simplification
html	925266a	IJbeasley	2025-08-05	Build site.
Rmd	f58709b	IJbeasley	2025-08-05	Add icite to homepage
html	01adb87	IJBeasley	2025-07-30	Build site.
Rmd	92fa3c6	IJBeasley	2025-07-30	workflowr::wflow_publish("analysis/index.Rmd")
Rmd	7cb9ee8	Isobel Beasley	2022-02-16	Start workflowr project.

GitHub repository link

1 Find GWAS studies of Lancet Commission Priority Diseases

1.1 Identify GWAS studies of diseases in general

I identify GWAS Catalog studies of diseases by looking for disease-related EFO terms in the ‘MAPPED_TRAIT’ column of the GWAS Catalog metadata. I use the EFO ontology to categorise traits into disease, response, measurement etc.

[1] "Number of pubmed ids studying at least one disease"

[1] 4652

[1] "Number of unique disease terms: 2509"
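
As a rough illustration of the filter described above (not the code used in this project), the sketch below flags studies whose ‘MAPPED_TRAIT’ entry contains at least one disease term. The toy data, the `disease_terms` lookup and the assumption that multiple mapped terms are comma-separated are all mine; in the real analysis the disease/non-disease split comes from the EFO ontology.

```r
# Toy stand-in for GWAS Catalog study metadata (values are made up).
studies <- data.frame(
  PUBMED_ID    = c("101", "101", "102", "103"),
  MAPPED_TRAIT = c("asthma", "body height",
                   "type 2 diabetes mellitus, asthma", "body mass index"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of EFO terms already classified as diseases.
disease_terms <- c("asthma", "type 2 diabetes mellitus")

# Assumption: several mapped terms are comma-separated; split and trim them.
split_traits <- lapply(strsplit(studies$MAPPED_TRAIT, ",", fixed = TRUE), trimws)

# A study counts as a disease study if any of its mapped terms is a disease term.
studies$is_disease <- vapply(
  split_traits,
  function(terms) any(terms %in% disease_terms),
  logical(1)
)

# Summary counts analogous to the ones printed above.
n_disease_pmids <- length(unique(studies$PUBMED_ID[studies$is_disease]))
n_disease_terms <- length(intersect(unique(unlist(split_traits)), disease_terms))
```

Counting unique PubMed IDs rather than rows matters because one publication can contribute several study entries.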

1.2 Mapping traits to ICD 10 Codes

Next, for GWAS studies of disease I harmonise disease trait labels to reduce redundancy due to typos, synonyms etc.

Harmonising disease trait labels - typos, synonyms etc.

[1] "Number of unique disease terms: 2381"

Grouping cancer disease traits (removing disease subtypes etc)

[1] "Number of unique disease terms: 2192"

Grouping non-cancer disease traits (removing disease subtypes etc)

[1] "Number of unique disease terms: 1995"
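
The three harmonisation passes above each reduce the number of unique disease terms. A minimal sketch of the idea, using made-up labels and hypothetical `synonyms` and `parents` lookup tables (the real tables are much larger and curated separately):

```r
# Toy disease labels with case differences, synonyms and a subtype.
labels <- c("Breast cancer", "breast carcinoma ", "ER-positive breast cancer",
            "Type 2 diabetes", "type II diabetes")

# Pass 1: normalise case and surrounding whitespace.
clean <- tolower(trimws(labels))

# Pass 2: hypothetical synonym table mapping variant spellings to one label.
synonyms <- c("breast carcinoma" = "breast cancer",
              "type ii diabetes" = "type 2 diabetes")
is_synonym <- clean %in% names(synonyms)
clean[is_synonym] <- unname(synonyms[clean[is_synonym]])

# Pass 3: hypothetical subtype table collapsing subtypes into the parent disease.
parents <- c("er-positive breast cancer" = "breast cancer")
is_subtype <- clean %in% names(parents)
clean[is_subtype] <- unname(parents[clean[is_subtype]])

# Number of unique terms before and after harmonisation.
n_before <- length(unique(labels))
n_after  <- length(unique(clean))
```

Keeping synonym correction and subtype grouping as separate passes makes it possible to report the unique-term count after each one, as done above.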

1.2.1 Now: Mapping traits to ICD 10 Codes

Then, I map the harmonised disease trait labels to ICD-10 codes using the following step-wise procedure:

1. Where available, extract author-provided ICD-10 codes from the GWAS Catalog DISEASE/TRAIT metadata.
2. Where available, extract author-provided PheCodes from the GWAS Catalog DISEASE/TRAIT metadata. These PheCodes are converted to ICD-10 codes using the PheWAS R package international mapping file, and checked against PheCode to ICD-10-CM mappings (from https://phenomics.va.ornl.gov/phecodemap/).
3. If author-provided ICD-10 codes or PheCodes are not available, match DISEASE/TRAIT labels to ICD-10 code descriptions.
4. If DISEASE/TRAIT labels cannot be matched to ICD-10 code descriptions, try to match them to PheCode descriptions.
5. If DISEASE/TRAIT labels cannot be matched to PheCode descriptions, try to match them to UMLS terms.
6. Use a manually created Excel mapping file (ICD-10 to DISEASE/TRAIT terms, from the WHO ICD-10 2019 index: https://icd.who.int/browse10/2019).
7. Repeat Steps 3-5, but instead of matching terms to DISEASE/TRAIT labels, match them to collected_all_disease_terms labels (these are processed EFO and other ontology terms).
8. Use author-provided ICD-10 codes from studies of the same disease (same collected_all_disease_terms).
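
The procedure above is a priority-ordered fallback: each trait takes its code from the first source that provides one. The sketch below illustrates only that logic, with three of the sources and made-up values. The column names are hypothetical and this is not the project's mapping code.

```r
# Toy table: one row per harmonised trait, one column per mapping source.
# NA means that source produced no ICD-10 code for the trait.
traits <- data.frame(
  trait                = c("asthma", "gout", "psoriasis", "rare trait"),
  author_icd10         = c("J45", NA, NA, NA),        # author-provided ICD-10
  author_phecode_icd10 = c(NA, "M10", NA, NA),        # PheCode converted to ICD-10
  icd10_description    = c("J45", "M10", "L40", NA),  # label matched to description
  stringsAsFactors = FALSE
)

# Sources in priority order (earlier sources win).
sources <- c("author_icd10", "author_phecode_icd10", "icd10_description")

traits$icd10 <- NA_character_
traits$icd10_source <- NA_character_

for (src in sources) {
  # Only fill traits that are still unmapped and that this source can map.
  fill <- is.na(traits$icd10) & !is.na(traits[[src]])
  traits$icd10[fill] <- traits[[src]][fill]
  traits$icd10_source[fill] <- src
}
```

Recording `icd10_source` alongside the code keeps track of how many traits were mapped by author-provided codes versus text matching, and which traits remain unmapped.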

The code creating the final map of disease traits to ICD-10 codes is below:

1.3 Find GWAS publications of Lancet Commission Priority Diseases

Use the above mapping of disease traits to ICD-10 codes to find GWAS Catalog studies of the Lancet Commission Priority Diseases. This is done by matching the ICD-10 codes of the diseases to the ICD-10 codes of the GWAS Catalog disease traits.

Matching GWAS traits to GBD causes / diseases
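
A minimal sketch of the ICD-10 matching step, assuming (my assumption, not stated above) that codes are compared at the three-character category level. The `priority` table and study rows are made up for illustration.

```r
# Hypothetical priority-disease definitions as three-character ICD-10 categories.
priority <- data.frame(
  disease = c("asthma", "type 2 diabetes"),
  icd10   = c("J45", "E11"),
  stringsAsFactors = FALSE
)

# Toy GWAS Catalog traits with ICD-10 codes from the mapping step.
gwas_traits <- data.frame(
  PUBMED_ID = c("101", "102", "103"),
  icd10     = c("J45.9", "E11", "M10.0"),
  stringsAsFactors = FALSE
)

# Truncate to the three-character category so J45.9 matches J45.
gwas_traits$icd10_category <- substr(gwas_traits$icd10, 1, 3)

# Inner join: keep only studies whose category is a priority disease.
matched <- merge(gwas_traits, priority,
                 by.x = "icd10_category", by.y = "icd10")
```

Here study 103 (M10.0) is dropped because its category is not in the priority list.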

2 Extracting cohort labels

Cohort labels allow us to deal with overlapping samples across studies.

Fixing & harmonising GWAS Catalog cohort metadata

Mostly just correcting typos and different ways of writing the same cohort name (e.g. Finland vs FINLAND).

Missing cohort metadata in the GWAS Catalog

Investigation into missingness of GWAS Catalog cohort metadata: is it more common in older or more recent studies?

Distribution of cohort information exploration

Explore the structure of GWAS Catalog metadata and how cohort labels are distributed across studies.
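
The cohort-name clean-up mentioned above (e.g. Finland vs FINLAND) can be sketched as follows. The raw labels and the `aliases` table are hypothetical; the real corrections are in the linked page.

```r
# Toy cohort labels with case, whitespace and spelling variants.
cohort_raw <- c("Finland", "FINLAND", " UK Biobank", "UKB", "ukbiobank")

# Normalise case and surrounding whitespace first.
cohort <- toupper(trimws(cohort_raw))

# Hypothetical alias table mapping remaining variants to one canonical label.
aliases <- c("UK BIOBANK" = "UKB", "UKBIOBANK" = "UKB")
has_alias <- cohort %in% names(aliases)
cohort[has_alias] <- unname(aliases[cohort[has_alias]])

# Number of distinct cohorts after harmonisation.
n_cohorts <- length(unique(cohort))
```

Harmonised labels like these are what make it possible to spot studies that draw on the same cohort.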

2.1 Fine tune text mining model

2.1.1 Get text to fine tune model

Pre-process article text for text mining:

Convert text to sentences: code/extract_text/spacy_obtain_sentences.py
Get initial annotated training sentences

2.1.2 Other investigations

Extracting dbGAP accessions from GWAS Catalog studies
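
dbGaP study accessions follow the pattern `phs` plus six digits, optionally followed by version and participant-set suffixes (e.g. `.v1.p1`), so they can be pulled out of text with a regular expression. A minimal sketch on made-up sentences (not the code used in the linked page):

```r
# Toy text snippets; the sentences are invented for illustration.
text <- c("Data are available from dbGaP (phs000007.v32.p13).",
          "No accession reported.",
          "See phs000280 and phs000209.v13.p3 for details.")

# phs + six digits, with optional .v<version> and .p<participant set> suffixes.
pattern <- "phs[0-9]{6}(\\.v[0-9]+)?(\\.p[0-9]+)?"

# One character vector of matches per input string (empty if none found).
accessions <- regmatches(text, gregexpr(pattern, text))

# Study-level identifiers, ignoring version and participant-set suffixes.
study_ids <- unique(substr(unlist(accessions), 1, 9))
```

Stripping the suffixes lets different versions of the same dbGaP study be treated as one dataset.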

2.2 Calculating ancestry biases

Replicate observed ancestry biases figures in GWAS dataset

Replicating Martin et al. 2019 Figure
Disease investigated by ancestry

3 Exploratory analyses

Not included in analysis/study.

3.1 Disease burden statistics

Global Burden of Disease Study data - DALYs, deaths etc.

Global Burden of Disease Study (GBD) global statistics on incidence, prevalence, DALYs (Disability-Adjusted Life Years) and PAF (Population Attributable Fraction) for diseases and risk factors.

3.1.1 Archived / not currently being used:

Looking into: can we identify UK Biobank studies from citations?

3.1.1.1 Paper citation metrics

iCite - citation metrics for GWAS Catalog Papers

Replicating Reales & Wallace, 2023 - sharing GWAS data results in more citations

3.1.1.2 Old grouping / filtering disease traits pages:

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] workflowr_1.7.2

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0      dplyr_1.1.4         compiler_4.3.1     
 [4] BiocManager_1.30.26 renv_1.1.8          promises_1.3.3     
 [7] tidyselect_1.2.1    Rcpp_1.1.0          stringr_1.6.0      
[10] git2r_0.36.2        callr_3.7.6         later_1.4.4        
[13] jquerylib_0.1.4     yaml_2.3.10         fastmap_1.2.0      
[16] here_1.0.1          R6_2.6.1            generics_0.1.4     
[19] knitr_1.50          tibble_3.3.0        rprojroot_2.1.0    
[22] bslib_0.9.0         pillar_1.11.1       rlang_1.1.6        
[25] cachem_1.1.0        stringi_1.8.7       httpuv_1.6.16      
[28] xfun_0.55           getPass_0.2-4       fs_1.6.6           
[31] sass_0.4.10         cli_3.6.5           withr_3.0.2        
[34] magrittr_2.0.4      ps_1.9.1            digest_0.6.37      
[37] processx_3.8.6      rstudioapi_0.17.1   lifecycle_1.0.4    
[40] vctrs_0.6.5         data.table_1.17.8   evaluate_1.0.5     
[43] glue_1.8.0          whisker_0.4.1       rmarkdown_2.30     
[46] httr_1.4.7          tools_4.3.1         pkgconfig_2.0.3    
[49] htmltools_0.5.8.1

GWAS Meta-science ideas exploration