Last updated: 2026-03-31

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 82a1d86. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    BC2GM/
    Ignored:    BioC.dtd
    Ignored:    FormatConverter.jar
    Ignored:    FormatConverter.zip
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    doccano_venv/
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/supplement/
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacy_venv_requirements.txt
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  code/full_text_conversion/html_to_xml.R
    Untracked:  code/text_mining_models/tokenise_data.py

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/index.Rmd) and HTML (docs/index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 82a1d86 IJbeasley 2026-03-31 Fix broken link in index page
html ead0cde IJbeasley 2026-03-24 Build site.
Rmd d2fc105 IJbeasley 2026-03-24 Update index page
html 27c86b7 IJbeasley 2026-01-12 Build site.
Rmd 7acc413 IJbeasley 2026-01-12 Update index page
html 1a9afee IJbeasley 2026-01-03 Build site.
Rmd 04bbd01 IJbeasley 2026-01-03 Update index page
html 326619e IJbeasley 2025-10-27 Build site.
Rmd 8d734db IJbeasley 2025-10-27 Update index page
html 52b8c49 IJbeasley 2025-10-09 Build site.
Rmd 97db3a5 IJbeasley 2025-10-09 Update index page
html fba0bc8 IJbeasley 2025-09-24 Build site.
Rmd a04e62d IJbeasley 2025-09-24 More fixing diseases
html 5067e95 IJbeasley 2025-09-23 Build site.
html 6f76afa IJbeasley 2025-09-23 Build site.
html 1a571a5 IJbeasley 2025-09-22 Build site.
html 99347f5 IJbeasley 2025-09-22 Build site.
Rmd e15c4b8 IJbeasley 2025-09-22 Even more typo etc.
html ff6e030 IJbeasley 2025-09-22 Build site.
Rmd 661b9c1 IJbeasley 2025-09-22 More typo correcting
html 0f48914 IJbeasley 2025-09-17 Build site.
html 1e5f203 IJbeasley 2025-09-17 Build site.
html 9fdfd2e IJbeasley 2025-09-16 Build site.
html 6ed4b80 IJbeasley 2025-09-16 Build site.
html 74451e3 IJbeasley 2025-09-16 Build site.
html 38a161d IJbeasley 2025-09-15 Build site.
html 6477659 IJbeasley 2025-09-15 Build site.
html 10775a9 IJbeasley 2025-09-15 Build site.
Rmd 7733a7b IJbeasley 2025-09-15 Add disease terms unique number to index workflowr page
html 846312b IJbeasley 2025-09-15 Build site.
Rmd 1bb743c IJbeasley 2025-09-15 workflowr::wflow_publish("analysis/index.Rmd")
html 784ab91 IJbeasley 2025-09-15 Build site.
Rmd cf46ce3 IJbeasley 2025-09-15 Update workflow index
html 153e9c5 IJbeasley 2025-09-15 Build site.
Rmd dd214e1 IJbeasley 2025-09-15 Update workflow index
html df2faf1 IJbeasley 2025-09-14 Build site.
Rmd edc356d IJbeasley 2025-09-14 Fix typo on index page
html 98018fa IJbeasley 2025-09-14 Build site.
Rmd d94474f IJbeasley 2025-09-14 workflowr::wflow_publish("analysis/index.Rmd")
html 6b00297 IJbeasley 2025-09-09 Build site.
Rmd 906b603 IJbeasley 2025-09-09 Update index page
html 7e1b0fb IJbeasley 2025-08-25 Build site.
Rmd b293dab IJbeasley 2025-08-25 Update homepage links
html d5a6d6a IJbeasley 2025-08-21 Build site.
Rmd 6e60a4b IJbeasley 2025-08-21 Woops typos in index page
html f6f4371 IJbeasley 2025-08-21 Build site.
Rmd 19d0887 IJbeasley 2025-08-21 Woops typos in index page
html c055fb3 IJbeasley 2025-08-21 Build site.
Rmd 5dc8894 IJbeasley 2025-08-21 Updating index page
html d49291e IJbeasley 2025-08-20 Build site.
Rmd d44c981 IJbeasley 2025-08-20 Update links in index
Rmd 4c3a4fb IJbeasley 2025-08-19 Adding more cohort simplification
html 925266a IJbeasley 2025-08-05 Build site.
Rmd f58709b IJbeasley 2025-08-05 Add icite to homepage
html 01adb87 IJBeasley 2025-07-30 Build site.
Rmd 92fa3c6 IJBeasley 2025-07-30 workflowr::wflow_publish("analysis/index.Rmd")
Rmd 7cb9ee8 Isobel Beasley 2022-02-16 Start workflowr project.

github repository link



1 Find GWAS studies of Lancet Commission Priority Diseases

1.1 Identify GWAS studies of diseases in general

I find GWAS Catalog studies of diseases by looking for disease-related EFO terms in the ‘MAPPED_TRAIT’ column of the GWAS Catalog metadata. I categorize traits into disease, response, measurement, etc. using the EFO ontology.

[1] "Number of pubmed ids studying at least one disease"
[1] 4652
[1] "Number of unique disease terms: 2509"
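
As a rough illustration of the filtering described above, the following sketch flags studies whose mapped terms include at least one disease. It is not the project's actual code: the toy data, the `efo_category` lookup and the assumption that `MAPPED_TRAIT` holds comma-separated terms are all hypothetical stand-ins (`PUBMEDID` and `MAPPED_TRAIT` are GWAS Catalog column names).

```r
# Toy stand-in for GWAS Catalog study metadata (the real file has many more columns).
gwas_studies <- data.frame(
  PUBMEDID = c("111", "111", "222", "333"),
  MAPPED_TRAIT = c("type 2 diabetes mellitus", "body mass index",
                   "response to statin", "asthma, body height"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from EFO term label to a broad trait category.
efo_category <- c(
  "type 2 diabetes mellitus" = "disease",
  "asthma" = "disease",
  "body mass index" = "measurement",
  "body height" = "measurement",
  "response to statin" = "response"
)

# Assumption: a row can hold several comma-separated terms, so split before lookup.
split_terms <- strsplit(gwas_studies$MAPPED_TRAIT, ",\\s*")

# A row counts as a disease study if any of its mapped terms is a disease.
gwas_studies$is_disease <- vapply(
  split_terms,
  function(terms) any(efo_category[terms] == "disease", na.rm = TRUE),
  logical(1)
)

disease_studies <- gwas_studies[gwas_studies$is_disease, ]
n_disease_pmids <- length(unique(disease_studies$PUBMEDID))

# Unique disease terms across all rows.
all_terms <- unique(unlist(split_terms))
disease_terms <- all_terms[efo_category[all_terms] %in% "disease"]

print("Number of pubmed ids studying at least one disease")
print(n_disease_pmids)
print(paste("Number of unique disease terms:", length(disease_terms)))
```

Terms missing from the lookup are treated as non-disease here; a real pipeline would more likely derive the category from the EFO hierarchy itself.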

1.2 Mapping traits to ICD 10 Codes

Next, for GWAS studies of disease, I harmonise disease trait labels to reduce redundancy due to typos, synonyms, etc.

[1] "Number of unique disease terms: 2381"
[1] "Number of unique disease terms: 2192"
[1] "Number of unique disease terms: 1995"
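
The falling counts above reflect successive harmonisation passes. A minimal sketch of what one such pass might look like is given here, assuming simple text normalisation plus a hand-written synonym table. The `synonyms` table and the `harmonise_label()` helper are hypothetical and only illustrate the idea.

```r
# Hypothetical synonym table: variant label -> preferred label.
synonyms <- c(
  "type ii diabetes" = "type 2 diabetes",
  "t2d" = "type 2 diabetes",
  "alzheimers disease" = "alzheimer disease"
)

harmonise_label <- function(x) {
  x <- tolower(trimws(x))            # case and surrounding whitespace
  x <- gsub("'", "", x, fixed = TRUE) # drop apostrophes
  x <- gsub("[[:space:]]+", " ", x)   # collapse repeated whitespace
  hit <- x %in% names(synonyms)       # replace known variants
  x[hit] <- unname(synonyms[x[hit]])
  x
}

raw_labels <- c("Type II diabetes", "T2D ", "type 2  diabetes",
                "Alzheimer's disease", "alzheimer disease")
harmonised <- harmonise_label(raw_labels)

print(paste("Number of unique disease terms:", length(unique(raw_labels))))
print(paste("Number of unique disease terms:", length(unique(harmonised))))
```

Each pass can only merge labels, so the number of unique terms never increases from one pass to the next.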

1.2.1 Now: Mapping traits to ICD 10 Codes

Then, I map the harmonised disease trait labels to ICD-10 codes, by the following step-wise procedure:

  1. Where available, extract author-provided ICD-10 codes from the GWAS Catalog DISEASE/TRAIT metadata.

  2. Where available, extract author-provided PheCodes from the GWAS Catalog DISEASE/TRAIT metadata. These PheCodes are converted to ICD-10 codes using the PheWAS R package international mapping file, and checked against PheCode to ICD-10-CM mappings (from https://phenomics.va.ornl.gov/phecodemap/).

  3. If author-provided ICD-10 codes or PheCodes are not available, match DISEASE/TRAIT labels to ICD-10 code descriptions.

  4. If DISEASE/TRAIT labels cannot be matched to ICD-10 code descriptions, try to match them to PheCode descriptions.

  5. If DISEASE/TRAIT labels cannot be matched to PheCode descriptions, try to match them to UMLS terms.

  6. Use a manually created Excel mapping file (ICD-10 to DISEASE/TRAIT terms, from the WHO ICD-10 2019 index: https://icd.who.int/browse10/2019).

  7. Repeat Steps 3-5, but instead of matching terms to DISEASE/TRAIT labels, match them to collected_all_disease_terms labels (these are processed EFO and other ontology terms).

  8. Use author-provided ICD-10 codes from studies of the same disease (same collected_all_disease_terms).
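
The step-wise procedure above amounts to a fallback cascade: each step only fills in traits that earlier steps left unmapped. The sketch below shows that pattern for three of the steps (author-provided code, description match, manual mapping). All helper names, lookup tables and labels are hypothetical, and exact string matching stands in for whatever matching the real pipeline uses.

```r
# Pull an author-provided ICD-10-style code (e.g. "J45" or "E11.2") out of a label.
extract_author_icd10 <- function(label) {
  m <- regmatches(
    label,
    regexpr("\\b[A-Z][0-9]{2}(\\.[0-9]{1,2})?\\b", label, perl = TRUE)
  )
  if (length(m) == 0) NA_character_ else m
}

# Take the first non-missing value across candidate vectors, in priority order.
first_non_na <- function(...) {
  candidates <- list(...)
  out <- candidates[[1]]
  for (candidate in candidates[-1]) {
    missing <- is.na(out)
    out[missing] <- candidate[missing]
  }
  out
}

# Toy lookups standing in for ICD-10 descriptions and the manual mapping file.
icd10_description_lookup <- c("gout" = "M10", "asthma" = "J45")
manual_lookup <- c("essential hypertension" = "I10")

labels <- c("asthma (ICD10 J45)", "gout", "essential hypertension", "unmapped trait")

author_code <- vapply(labels, extract_author_icd10, character(1), USE.NAMES = FALSE)
description_code <- unname(icd10_description_lookup[tolower(labels)])
manual_code <- unname(manual_lookup[tolower(labels)])

icd10 <- first_non_na(author_code, description_code, manual_code)

# Record which step supplied each code, which helps when auditing the map.
step_used <- ifelse(!is.na(author_code), "author ICD-10",
             ifelse(!is.na(description_code), "ICD-10 description",
             ifelse(!is.na(manual_code), "manual mapping", NA_character_)))

print(data.frame(label = labels, icd10 = icd10, step = step_used))
```

Keeping the source step alongside each code makes it easy to review the lower-confidence mappings separately from the author-provided ones.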

The code creating the final map of disease traits to ICD-10 codes is below:


1.3 Find GWAS publications of Lancet Commission Priority Diseases

Use the above mapping of disease traits to ICD-10 codes to find GWAS Catalog studies of the Lancet Commission Priority Diseases. This is done by matching the ICD-10 codes of the diseases to the ICD-10 codes of the GWAS Catalog disease traits.
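
A minimal sketch of this matching step follows, assuming priority diseases are defined at the three-character ICD-10 category level and GWAS traits may carry more specific codes. The tables, column names and example diseases are hypothetical placeholders, not the actual Lancet Commission list.

```r
# Hypothetical priority-disease table, one ICD-10 category per disease.
priority_diseases <- data.frame(
  disease = c("asthma", "type 2 diabetes mellitus"),
  icd10 = c("J45", "E11"),
  stringsAsFactors = FALSE
)

# Hypothetical output of the trait-to-ICD-10 mapping.
gwas_trait_codes <- data.frame(
  PUBMEDID = c("111", "222", "333"),
  trait = c("childhood asthma", "type 2 diabetes with renal complications", "gout"),
  icd10 = c("J45.9", "E11.2", "M10"),
  stringsAsFactors = FALSE
)

# Truncate GWAS trait codes to the three-character category before joining.
gwas_trait_codes$icd10_category <- substr(gwas_trait_codes$icd10, 1, 3)

# Inner join: keep only GWAS traits whose category is a priority disease.
matched <- merge(
  gwas_trait_codes, priority_diseases,
  by.x = "icd10_category", by.y = "icd10"
)

print(matched[, c("PUBMEDID", "trait", "icd10", "disease")])
```

Matching at the category level means a subcode such as J45.9 still counts towards its parent disease. Priority diseases defined by code ranges would need the range expanded into categories first.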


2 Extracting cohort labels

Cohort labels allow us to deal with overlapping samples.

2.1 Fine tune text mining model

2.1.1 Get text to fine tune model

Pre-process article text for text mining:

2.2 Calculating ancestry biases




3 Exploratory analyses

Not included in analysis/study.

3.1 Disease burden statistics



3.1.1 Archived / not currently being used:

3.1.1.1 Paper citation metrics

3.1.1.2 Old grouping / filtering disease traits pages:


sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] workflowr_1.7.2

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0      dplyr_1.1.4         compiler_4.3.1     
 [4] BiocManager_1.30.26 renv_1.1.8          promises_1.3.3     
 [7] tidyselect_1.2.1    Rcpp_1.1.0          stringr_1.6.0      
[10] git2r_0.36.2        callr_3.7.6         later_1.4.4        
[13] jquerylib_0.1.4     yaml_2.3.10         fastmap_1.2.0      
[16] here_1.0.1          R6_2.6.1            generics_0.1.4     
[19] knitr_1.50          tibble_3.3.0        rprojroot_2.1.0    
[22] bslib_0.9.0         pillar_1.11.1       rlang_1.1.6        
[25] cachem_1.1.0        stringi_1.8.7       httpuv_1.6.16      
[28] xfun_0.55           getPass_0.2-4       fs_1.6.6           
[31] sass_0.4.10         cli_3.6.5           withr_3.0.2        
[34] magrittr_2.0.4      ps_1.9.1            digest_0.6.37      
[37] processx_3.8.6      rstudioapi_0.17.1   lifecycle_1.0.4    
[40] vctrs_0.6.5         data.table_1.17.8   evaluate_1.0.5     
[43] glue_1.8.0          whisker_0.4.1       rmarkdown_2.30     
[46] httr_1.4.7          tools_4.3.1         pkgconfig_2.0.3    
[49] htmltools_0.5.8.1