Last updated: 2026-01-03
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
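As a small illustration of the point about seeding, the following base R sketch resets the seed reported above before two random draws and checks that they match. The `sample()` call is only a stand-in for whatever random step an analysis might contain.

```r
# Reset the seed reported above before each draw
set.seed(20220216)
first_draw <- sample(100, 5)

set.seed(20220216)
second_draw <- sample(100, 5)

# Same seed and same call: the two draws are identical
identical(first_draw, second_draw)
```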
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1a9afee. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: data/.DS_Store
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: data/icd/~$lancet_conditions_icd10.xlsx
Ignored: data/icd/~$phecode_to_icd10_manual_mapping.xlsx
Ignored: figures/
Ignored: human_dictionary/
Ignored: igsr_populations.tsv
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/trait_ontology/
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: r-spacyr/
Ignored: renv/
Ignored: venv/
Ignored: visualization.Rdata
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/get_full_text.Rmd
Modified: analysis/missing_cohort_info.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/text_for_cohort_labels.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown
(analysis/other_disease_filtering.Rmd) and HTML
(docs/other_disease_filtering.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| html | 645bae7 | IJbeasley | 2025-12-29 | Build site. |
| Rmd | dba8329 | IJbeasley | 2025-12-29 | Archiving old GWAS trait conversion |
| html | da60ca5 | IJbeasley | 2025-10-08 | Build site. |
| Rmd | e254694 | IJbeasley | 2025-10-08 | No longer removing many infectious diseases |
| html | 6929eb4 | IJbeasley | 2025-10-03 | Build site. |
| html | fbca90d | IJbeasley | 2025-09-29 | Build site. |
| Rmd | 468cd00 | IJbeasley | 2025-09-29 | workflowr::wflow_publish("analysis/other_disease_filtering.Rmd") |
| html | 10f63d9 | IJbeasley | 2025-09-28 | Build site. |
| Rmd | 316d1dd | IJbeasley | 2025-09-28 | Using ICD / PheCode mapping |
| html | 57cbb9c | IJbeasley | 2025-09-26 | Build site. |
| Rmd | a1895f9 | IJbeasley | 2025-09-26 | New approach to grouping disease traits |
| html | 9fe901a | IJbeasley | 2025-09-24 | Build site. |
| Rmd | 662707b | IJbeasley | 2025-09-24 | More fixing diseases |
library(data.table)
library(dplyr)
library(stringr)
gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_disease_trait_filtered.csv"))
source(here::here("code/get_term_descendants.R"))
aesthetic facial traits (36035146)
gwas_study_info |>
  filter(PUBMED_ID == "36035146") |>
  select(STUDY, PUBMED_ID, `DISEASE/TRAIT`)
# for all studies with PubMed ID 36035146, make collected_all_disease_terms
# blank and set DISEASE_STUDY to FALSE
gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms =
           ifelse(PUBMED_ID == "36035146",
                  "",
                  collected_all_disease_terms
           ),
         DISEASE_STUDY =
           ifelse(PUBMED_ID == "36035146",
                  FALSE,
                  DISEASE_STUDY
           )
  )
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_replace_all(collected_all_disease_terms,
vec_to_grep_pattern("carbon monoxide poisoning"),
"poisoning"
))
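`vec_to_grep_pattern()` is defined in `code/get_term_descendants.R`, which is not shown on this page. As a rough guide to what such a helper might do, the sketch below builds one regular expression from a vector of terms, assuming it escapes regex metacharacters and wraps the alternatives in word boundaries. The name `vec_to_grep_pattern_sketch` and this behaviour are assumptions for illustration, not the project's actual implementation.

```r
# Hypothetical stand-in for vec_to_grep_pattern(); the real helper may differ
vec_to_grep_pattern_sketch <- function(terms) {
  # Escape regex metacharacters so each term is matched literally
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", terms)
  # Join the terms into one alternation bounded by word boundaries
  paste0("\\b(", paste(escaped, collapse = "|"), ")\\b")
}

# Toy values standing in for collected_all_disease_terms
terms_before <- c("asthma, carbon monoxide poisoning", "lead poisoning")
terms_after <- gsub(vec_to_grep_pattern_sketch("carbon monoxide poisoning"),
                    "poisoning", terms_before, perl = TRUE)
terms_after
```

Under these assumptions, the word boundaries keep a short term such as "edema" from matching inside a longer word such as "lymphedema".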
url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0003931/descendants"
bone_fracture_terms <- get_descendants(url)
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_replace_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(bone_fracture_terms),
"bone fracture"
))
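The OLS4 URLs on this page embed the term IRI in doubly percent-encoded form (`:` becomes `%253A` and `/` becomes `%252F`). The sketch below rebuilds those URL strings from the plain IRIs with base R's `utils::URLencode()`. The helper name `ols4_descendants_url` is hypothetical, and it only builds the string; it does not call the API or reproduce `get_descendants()`.

```r
# Hypothetical helper: build an OLS4 "descendants" URL from an ontology id
# and a term IRI by percent-encoding the IRI twice
ols4_descendants_url <- function(ontology, iri) {
  once <- utils::URLencode(iri, reserved = TRUE)
  # repeated = TRUE forces a second pass over the already-encoded string
  twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)
  paste0("http://www.ebi.ac.uk/ols4/api/ontologies/", ontology,
         "/terms/", twice, "/descendants")
}

efo_url <- ols4_descendants_url("efo", "http://www.ebi.ac.uk/efo/EFO_0003931")
mesh_url <- ols4_descendants_url("mesh", "http://id.nlm.nih.gov/mesh/D005530")
efo_url
```

Building the URL from the readable IRI makes it easier to see which ontology term a call refers to than reading the encoded string directly.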
toxicity_terms = c("cardiotoxicity",
"dermatological toxicity",
"immune system toxicity",
"neurotoxicity"
)
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_replace_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(toxicity_terms),
"toxicity"
))
gwas_study_info |>
dplyr::filter(PUBMED_ID == "38509478") |>
select(`DISEASE/TRAIT`, MAPPED_TRAIT, collected_all_disease_terms)
# for PubMed ID 38509478, replace "nausea and vomiting" with "vomiting of pregnancy"
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(PUBMED_ID == "38509478",
stringr::str_replace_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("nausea and vomiting")),
"vomiting of pregnancy"),
collected_all_disease_terms
)
)
# where MAPPED_TRAIT is "nausea and vomiting of pregnancy severity measurement" (PubMed ID 38509478), append vomiting of pregnancy
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(MAPPED_TRAIT == "nausea and vomiting of pregnancy severity measurement" &
PUBMED_ID == "38509478",
paste0(collected_all_disease_terms, ", vomiting of pregnancy"),
collected_all_disease_terms)
)
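The two steps above (a study-specific replacement followed by a conditional append) can be tried on toy data. The sketch below uses base R with literal matching instead of `vec_to_grep_pattern()`, and every row value is made up for illustration. The helper `append_term` is hypothetical. Unlike a plain `paste0()`, it skips the separator when the existing value is empty.

```r
# Toy rows standing in for gwas_study_info (values are illustrative only)
toy <- data.frame(
  PUBMED_ID = c("38509478", "38509478", "38509478", "00000000"),
  MAPPED_TRAIT = c("nausea and vomiting",
                   "nausea and vomiting of pregnancy severity measurement",
                   "nausea and vomiting of pregnancy severity measurement",
                   "nausea and vomiting"),
  collected_all_disease_terms = c("nausea and vomiting", "", "nausea",
                                  "nausea and vomiting"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: append a term without a leading ", " on empty values
append_term <- function(existing, new_term) {
  ifelse(existing == "", new_term, paste0(existing, ", ", new_term))
}

# Step 1: replace the term only within the one study
in_study <- toy$PUBMED_ID == "38509478"
toy$collected_all_disease_terms[in_study] <- gsub(
  "nausea and vomiting", "vomiting of pregnancy",
  toy$collected_all_disease_terms[in_study], fixed = TRUE
)

# Step 2: append the term for the severity measurement rows of that study
severity <- in_study &
  toy$MAPPED_TRAIT == "nausea and vomiting of pregnancy severity measurement"
toy$collected_all_disease_terms[severity] <- append_term(
  toy$collected_all_disease_terms[severity], "vomiting of pregnancy"
)
toy$collected_all_disease_terms
```

The row from the other study is left untouched, which is the purpose of restricting both steps by `PUBMED_ID`.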
# if DISEASE/TRAIT contains Asthma in any disease, remove disease from collected_all_disease_terms
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Asthma in any disease", `DISEASE/TRAIT`, ignore.case = TRUE),
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("disease"))
),
collected_all_disease_terms)
)
# if DISEASE/TRAIT contains ICD10 R69, remove disease from collected_all_disease_terms
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("ICD10 R69", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# if DISEASE/TRAIT contains PheCode 1019, remove disease from collected_all_disease_terms
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("PheCode 1019", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Unknown and unspecified causes of morbidity
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Unknown and unspecified causes of morbidity", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Non cancer illness - year age first occurred
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Non cancer illness - year age first occurred", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# ICD10 Z87.8: Personal history of other specified conditions
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("ICD10 Z87.8: Personal history of other specified conditions", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Number of non-cancer illnesses
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Number of non-cancer illnesses", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Disease burden
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Disease burden", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Diagnosed with life threatening illness
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Diagnosed with life threatening illness", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Interpolated Age of participant when non cancer illness
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Interpolated Age of participant when non cancer illness", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# illness, injury, bereavement, stress in last 2 years
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Illness, injury, bereavement, stress in last 2 years", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# ICD10 Z87.89: Personal history of other specified conditions
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("ICD10 Z87.89: Personal history of other specified conditions", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# ICD10 Z86: Personal history of certain other diseases
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("ICD10 Z86: Personal history of certain other diseases", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# ICD10 Z87.828.: Personal history of other (healed) physical injury and trauma
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Personal history of other \\(healed\\) physical injury and trauma", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# Other serious medical condition/disability diagnosed by doctor
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("Other serious medical condition/disability diagnosed by doctor", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
# if DISEASE/TRAIT contains ICD10 Z86.6, remove disease from collected_all_disease_terms
gwas_study_info =
gwas_study_info |>
mutate(collected_all_disease_terms =
ifelse(grepl("ICD10 Z86.6", `DISEASE/TRAIT`, ignore.case = TRUE) &
collected_all_disease_terms == "disease",
stringr::str_remove_all(
collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
),
collected_all_disease_terms)
)
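The blocks above all apply the same rule with a different trait pattern. One possible way to express that rule once is sketched below in base R on toy data. The column `trait` stands in for `DISEASE/TRAIT`, the row values are made up, and the sketch assumes that removing the only term leaves an empty string. Patterns containing regex metacharacters (for example the `.` in `Z87.8` or literal parentheses) would need escaping before being joined this way.

```r
# Toy rows standing in for gwas_study_info (values are illustrative only)
toy <- data.frame(
  trait = c("ICD10 R69: Illness, unspecified", "Disease burden",
            "Diagnosed with life threatening illness", "Asthma"),
  collected_all_disease_terms = c("disease", "disease",
                                  "disease, cancer", "asthma"),
  stringsAsFactors = FALSE
)

# A few of the non-specific trait patterns used above
generic_traits <- c("ICD10 R69", "PheCode 1019", "Disease burden",
                    "Diagnosed with life threatening illness")

# One combined, case-insensitive match instead of one block per pattern
matches_generic <- grepl(paste(generic_traits, collapse = "|"),
                         toy$trait, ignore.case = TRUE)
only_generic_term <- toy$collected_all_disease_terms == "disease"

# Blank the term only where it is exactly "disease"
toy$collected_all_disease_terms[matches_generic & only_generic_term] <- ""
toy$collected_all_disease_terms
```

Rows whose terms contain anything beyond the bare "disease" are kept as they are, mirroring the `collected_all_disease_terms == "disease"` condition in the blocks above.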
diseases_to_fix = c("calculus of lower urinary tract" = "lower urinary tract calculus",
"urinary calculus" = "urinary calculus",
"miscarriage; stillbirth" = "miscarriage; stillbirth",
"chronic ischemic heart disease" = "chronic ischemic heart disease",
"pericarditis" = "pericarditis",
"retention of urine" = "urinary retention",
"back pain" = "back pain",
"prostatitis" = "prostatitis",
"dysuria" = "dysuria",
"renal colic" = "renal colic",
"cervicitis and endocervicitis" = "cervicitis, endocervicitis",
"edema" = "edema",
"hypertensive chronic kidney disease" = "hypertensive chronic kidney disease",
"substance abuse" = "substance abuse",
"pyogenic arthritis" = "pyogenic arthritis",
"hematuria" = "hematuria",
"pilonidal cyst" = "pilonidal cyst",
"atopic/contact dermatitis" = "atopic dermatitis, contact dermatitis",
"arterial embolism and thrombosis" = "arterial embolism, arterial thrombosis",
"functional disorders of bladder" = "urinary disorder",
"spondylosis" = "spondylosis",
"abnormal involuntary movements" = "abnormal involuntary movements",
"degenerative disease of the spinal cord" = "degenerative disease of the spinal cord",
"voice disturbance" = "voice disturbance",
"diseases of nail" = "nail disorder",
"visual disturbances" = "visual disturbances",
"late pregnancy and failed induction" = "late pregnancy and failed induction",
"other diseases of respiratory system" = "respiratory system disease",
"malposition and malpresentation of fetus or obstruction" = "obstructed labor",
"Hemorrhage in early pregnancy" = "early pregnancy hemorrhage",
"Chronic pharyngitis and nasopharyngitis" = "chronic pharyngitis, nasopharyngitis",
"Chronic tonsillitis and adenoiditis" = "chronic tonsillitis, adenoiditis",
"fetal abnormality affecting management of mother" = "fetal abnormality",
"Other disorders of synovium and tendon" = "synovium and tendon disorder",
"Noninfectious disorders of lymphatic channels" = "lymphatic system disorder",
"Noninflammatory disorders of cervix" = "cervical disorder",
"Noninflammatory disorders of ovary, fallopian tube, and broad ligament" = "noninflammatory disorder of ovary fallopian tube and broad ligament",
"Other inflammatory spondylopathies" = "inflammatory spondylopathy",
"Inflammatory disease of breast" = "inflammatory breast disease",
"Lump or mass in breast" = "lump or mass in breast",
"Corns and callosities" = "keratosis",
"Other acquired musculoskeletal deformity" = "acquired musculoskeletal deformity",
"Other derangement of joint" = "joint derangement",
"Other and unspecified disc disorder" = "disc disorder",
"Internal derangement of knee" = "joint derangement",
"Diseases of the musculoskeletal system and connective tissue" = "musculoskeletal diseases",
"diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism" = "blood disorder",
"endocrine, nutritional and metabolic diseases" = "endocrine and metabolic disease",
"congenital malformations" = "congenital malformations",
"Localized superficial swelling" = "localized superficial swelling or mass"
)
# if collected_all_disease_terms is exactly "disease" and DISEASE/TRAIT matches
# one of the names of diseases_to_fix (case-insensitive), replace "disease"
# with the corresponding specific term in collected_all_disease_terms
for (specific_disease in seq_along(diseases_to_fix)) {
  grep_pattern = names(diseases_to_fix)[specific_disease]
  # [[ ]] drops the name so a plain string is passed as the replacement
  replacement = diseases_to_fix[[specific_disease]]
  gwas_study_info =
    gwas_study_info |>
    mutate(collected_all_disease_terms =
             ifelse(grepl(grep_pattern, `DISEASE/TRAIT`, ignore.case = TRUE) &
                      collected_all_disease_terms == "disease",
                    stringr::str_replace_all(
                      collected_all_disease_terms,
                      pattern = vec_to_grep_pattern(c("disease")),
                      replacement = replacement
                    ),
                    collected_all_disease_terms)
    )
}
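To see how the lookup loop behaves, here is a minimal base R sketch on toy vectors. The entries in `fixes` are taken from `diseases_to_fix`, while `traits` and `terms` are made-up stand-ins for the `DISEASE/TRAIT` and `collected_all_disease_terms` columns.

```r
# A few entries from diseases_to_fix: names are patterns, values are terms
fixes <- c("retention of urine" = "urinary retention",
           "diseases of nail" = "nail disorder",
           "back pain" = "back pain")

# Toy stand-ins for the DISEASE/TRAIT and collected_all_disease_terms columns
traits <- c("Retention of urine", "Diseases of nail", "Back pain", "Height")
terms <- c("disease", "disease", "dorsalgia", "disease")

for (i in seq_along(fixes)) {
  # Only rows that match the pattern and still hold the bare "disease" term
  hit <- grepl(names(fixes)[i], traits, ignore.case = TRUE) & terms == "disease"
  terms[hit] <- fixes[[i]]
}
terms
```

Because a row stops being equal to "disease" after its first replacement, the first matching pattern wins, so the order of entries matters when two patterns could match the same trait.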
url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mesh/terms/http%253A%252F%252Fid.nlm.nih.gov%252Fmesh%252FD005530/descendants"
foot_abnormality_terms <- get_descendants(url)
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_replace_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(foot_abnormality_terms),
"foot deformity"
))
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_remove_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("widows peak"))
))
gwas_study_info = gwas_study_info |>
mutate(collected_all_disease_terms =
stringr::str_replace_all(collected_all_disease_terms,
pattern = vec_to_grep_pattern(c("genu varum",
"genu valgum")),
"bowing of the legs"
))
# number of distinct studies (PubMed IDs) per combination of disease terms
n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(collected_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(collected_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))
head(n_studies_trait)
dim(n_studies_trait)
diseases <- gwas_study_info$collected_all_disease_terms[gwas_study_info$collected_all_disease_terms != ""] |>
  stringr::str_split(pattern = ", ") |>
  unlist() |>
  stringr::str_trim()
length(unique(diseases))
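The split-and-count step can be checked on a toy vector. The sketch below uses base R equivalents (`strsplit()` and `trimws()`) of the stringr calls above, and the term strings are made up for illustration.

```r
# Toy stand-in for collected_all_disease_terms (values are illustrative only)
terms <- c("asthma, bone fracture", "", "bone fracture", "toxicity, asthma")

# Drop empty entries, split on the ", " separator, and trim whitespace
diseases <- trimws(unlist(strsplit(terms[terms != ""], ", ", fixed = TRUE)))

# Number of distinct individual terms
n_unique <- length(unique(diseases))

# How often each individual term appears across rows
term_counts <- sort(table(diseases), decreasing = TRUE)
n_unique
```

Counting individual terms this way differs from `n_studies_trait`, which groups by the whole comma-separated string, so a study with "asthma, bone fracture" contributes to both terms here but to a single combined group there.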
fwrite(gwas_study_info,
here::here("output/gwas_cat/gwas_study_info_disease_trait_filtered_v2.csv"))
sessionInfo()