Last updated: 2026-01-03

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1a9afee. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    data/.DS_Store
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    data/icd/~$lancet_conditions_icd10.xlsx
    Ignored:    data/icd/~$phecode_to_icd10_manual_mapping.xlsx
    Ignored:    figures/
    Ignored:    human_dictionary/
    Ignored:    igsr_populations.tsv
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/
    Ignored:    visualization.Rdata

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/get_full_text.Rmd
    Modified:   analysis/missing_cohort_info.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/other_disease_filtering.Rmd) and HTML (docs/other_disease_filtering.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 645bae7 IJbeasley 2025-12-29 Build site.
Rmd dba8329 IJbeasley 2025-12-29 Archiving old GWAS trait conversion
html da60ca5 IJbeasley 2025-10-08 Build site.
Rmd e254694 IJbeasley 2025-10-08 No longer removing many infectious diseases
html 6929eb4 IJbeasley 2025-10-03 Build site.
html fbca90d IJbeasley 2025-09-29 Build site.
Rmd 468cd00 IJbeasley 2025-09-29 workflowr::wflow_publish("analysis/other_disease_filtering.Rmd")
html 10f63d9 IJbeasley 2025-09-28 Build site.
Rmd 316d1dd IJbeasley 2025-09-28 Using ICD / PheCode mapping
html 57cbb9c IJbeasley 2025-09-26 Build site.
Rmd a1895f9 IJbeasley 2025-09-26 New approach to grouping disease traits
html 9fe901a IJbeasley 2025-09-24 Build site.
Rmd 662707b IJbeasley 2025-09-24 More fixing diseases

Set up

library(data.table)
library(dplyr)
library(stringr)

gwas_study_info <- fread(here::here("output/gwas_cat/gwas_study_info_disease_trait_filtered.csv"))

Ontology help - for getting disease subtypes

source(here::here("code/get_term_descendants.R"))
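
The chunks that follow call vec_to_grep_pattern(), whose definition is not shown on this page. As a rough illustration only, a helper of that kind could look like the sketch below. The name vec_to_grep_pattern_sketch and the choice to match a term only when it fills a whole ", "-separated slot are assumptions, not the project's actual implementation.

```r
# Hypothetical stand-in; the project's own vec_to_grep_pattern() may differ.
vec_to_grep_pattern_sketch <- function(terms) {
  # Escape regex metacharacters so every term is matched literally
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", terms)
  # Match a term only when it fills a whole ", "-separated slot
  paste0("(?:^|(?<=, ))(?:", paste(escaped, collapse = "|"), ")(?=$|, )")
}

pattern <- vec_to_grep_pattern_sketch("carbon monoxide poisoning")

# perl = TRUE is needed because the pattern uses lookarounds
gsub(pattern, "poisoning", "carbon monoxide poisoning, asthma", perl = TRUE)
```

With slot-based matching, a short term such as "disease" would not touch a longer term such as "heart disease"; whether the real helper behaves this way is not confirmed by the page.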

aesthetic facial traits (36035146)

gwas_study_info |> 
  filter(PUBMED_ID == "36035146") |> 
  select(STUDY, PUBMED_ID, `DISEASE/TRAIT`)


# make collected_all_disease_terms blank for all studies with pubmed id 36035146
gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(PUBMED_ID == "36035146",
         "",
         collected_all_disease_terms
         )
  )

gwas_study_info = 
  gwas_study_info |>
  mutate(DISEASE_STUDY = 
           ifelse(PUBMED_ID == "36035146",
                  FALSE,
                  DISEASE_STUDY
           )
  )

Injuries

Poisoning

Combining Poisoning

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          vec_to_grep_pattern("carbon monoxide poisoning"),
                          "poisoning"
         ))

Bone fracture

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0003931/descendants"

bone_fracture_terms <- get_descendants(url)

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(bone_fracture_terms),
                          "bone fracture"
         ))
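
The descendants URL used for bone fracture embeds the EFO term IRI URL-encoded twice, which is why it contains "%253A" rather than "%3A". A minimal base R sketch of how such a URL could be assembled is shown here; ols_descendants_url() is a hypothetical helper introduced for illustration and is not part of the project code.

```r
# Hypothetical helper: build an OLS4 descendants URL from an ontology id
# and a term IRI by URL-encoding the IRI twice.
ols_descendants_url <- function(ontology, term_iri) {
  # First pass escapes ":" and "/"
  once <- utils::URLencode(term_iri, reserved = TRUE)
  # Second pass escapes the "%" signs produced by the first pass;
  # repeated = TRUE stops URLencode() from skipping already-encoded input
  twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)
  paste0("http://www.ebi.ac.uk/ols4/api/ontologies/", ontology,
         "/terms/", twice, "/descendants")
}

ols_descendants_url("efo", "http://www.ebi.ac.uk/efo/EFO_0003931")
```

Building the URL from the plain IRI keeps the term identifier readable in the source and avoids hand-editing percent escapes.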

Toxicity

toxicity_terms = c("cardiotoxicity",
                   "dermatological toxicity",
                   "immune system toxicity",  
                   "neurotoxicity"
                   )

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(toxicity_terms),
                          "toxicity"
         ))

Name fixes:

Nausea, vomiting

Pregnancy nausea

gwas_study_info |> 
  dplyr::filter(PUBMED_ID == "38509478") |> 
  select(`DISEASE/TRAIT`, MAPPED_TRAIT, collected_all_disease_terms)


# replace nausea and vomiting, with vomiting of pregnancy

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(PUBMED_ID == "38509478",
         stringr::str_replace_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(c("nausea and vomiting")),
                          "vomiting of pregnancy"),
         collected_all_disease_terms
         )
  )

# where MAPPED_TRAIT is "nausea and vomiting of pregnancy severity measurement", append "vomiting of pregnancy"

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(MAPPED_TRAIT == "nausea and vomiting of pregnancy severity measurement" & 
                 PUBMED_ID == "38509478",
         paste0(collected_all_disease_terms, ", vomiting of pregnancy"),
         collected_all_disease_terms)
  )

Unspecified disease

Asthma, and other disease

# if DISEASE/TRAIT contains "Asthma in any disease", remove the generic term "disease" from collected_all_disease_terms

gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Asthma in any disease", `DISEASE/TRAIT`, ignore.case = TRUE),
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("disease"))
                          ),
         collected_all_disease_terms)
  )

ICD10 R69: Illness, unspecified

# if DISEASE/TRAIT contains ICD10 R69, remove disease from collected_all_disease_terms

gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("ICD10 R69", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

PheCode 1019 - unspecified illness

# if DISEASE/TRAIT contains PheCode 1019, remove disease from collected_all_disease_terms
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("PheCode 1019", `DISEASE/TRAIT`, ignore.case = TRUE) &
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

Number of non-cancer illnesses

gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Number of non-cancer illnesses", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

Other unspecified disease terms

# Unknown and unspecified causes of morbidity
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Unknown and unspecified causes of morbidity", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )


# Non cancer illness - year age first occurred
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Non cancer illness - year age first occurred", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )


# ICD10 Z87.8: Personal history of other specified conditions
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("ICD10 Z87.8: Personal history of other specified conditions", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

#  Disease burden
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Disease burden", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# Diagnosed with life threatening illness 
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Diagnosed with life threatening illness", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# Interpolated Age of participant when non cancer illness 
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Interpolated Age of participant when non cancer illness", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# illness, injury, bereavement, stress in last 2 years
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Illness, injury, bereavement, stress in last 2 years", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# ICD10 Z87.89: Personal history of other specified conditions
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("ICD10 Z87.89: Personal history of other specified conditions", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# ICD10 Z86: Personal history of certain other diseases
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("ICD10 Z86: Personal history of certain other diseases", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )


# ICD10 Z87.828: Personal history of other (healed) physical injury and trauma
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Personal history of other \\(healed\\) physical injury and trauma", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

# Other serious medical condition/disability diagnosed by doctor
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("Other serious medical condition/disability diagnosed by doctor", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )

ICD10 Z86.6: Personal history of diseases of the nervous system and sense organs (tentative: does not fit cleanly into a GBD cause, so the term is removed)

# if DISEASE/TRAIT contains ICD10 Z86.6, remove disease from collected_all_disease_terms
gwas_study_info  = 
 gwas_study_info |>
  mutate(collected_all_disease_terms  = 
          ifelse(grepl("ICD10 Z86.6", `DISEASE/TRAIT`, ignore.case = TRUE) & 
                   collected_all_disease_terms == "disease",
         stringr::str_remove_all(
           collected_all_disease_terms,
            pattern = vec_to_grep_pattern(c("illness", "unspecified", "disease"))
                          ),
         collected_all_disease_terms)
  )
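
The chunks in this section all apply one rule: when DISEASE/TRAIT contains a given phrase and collected_all_disease_terms is exactly "disease", the generic term is dropped. As a sketch of how that rule could be expressed once, here is a base R version on toy data. The helper name blank_unspecified, the toy trait strings, the use of literal (non-regex) matching, and the assumption that removal leaves an empty string are all illustrative choices rather than the author's code.

```r
# Hypothetical helper: blank out the generic term "disease" for rows whose
# DISEASE/TRAIT text contains any of the given phrases (case-insensitive).
blank_unspecified <- function(traits, terms, phrases) {
  hit <- rep(FALSE, length(traits))
  for (phrase in phrases) {
    # fixed = TRUE matches literally, so "." in "Z86.6" is not a wildcard;
    # it cannot be combined with ignore.case, hence tolower() on both sides
    hit <- hit | grepl(tolower(phrase), tolower(traits), fixed = TRUE)
  }
  # which() drops NA comparisons, so rows with missing terms are left alone
  terms[which(hit & terms == "disease")] <- ""
  terms
}

phrases <- c("ICD10 R69", "PheCode 1019", "Disease burden", "ICD10 Z86.6")

# Toy rows, not taken from the GWAS Catalog
toy_traits <- c("ICD10 R69: Illness, unspecified", "Asthma", "disease burden index")
toy_terms  <- c("disease", "disease", "disease, asthma")

blank_unspecified(toy_traits, toy_terms, phrases)
```

Only the first toy row is blanked: the second has no matching phrase, and the third carries more than the bare term "disease", so the guard leaves it untouched.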

Fix diseases that should be specified:

diseases_to_fix = c("calculus of lower urinary tract" = "lower urinary tract calculus",
                    "urinary calculus" = "urinary calculus",
                    "miscarriage; stillbirth" = "miscarriage; stillbirth",
                    "chronic ischemic heart disease" = "chronic ischemic heart disease",
                    "pericarditis" = "pericarditis",
                    "retention of urine" = "urinary retention",
                    "back pain" = "back pain",
                    "prostatitis" = "prostatitis",
                    "dysuria" = "dysuria",
                    "renal colic" = "renal colic",
                    "cervicitis and endocervicitis" = "cervicitis, endocervicitis",
                    "edema" = "edema",
                    "hypertensive chronic kidney disease" = "hypertensive chronic kidney disease",
                    "substance abuse" = "substance abuse",
                    "pyogenic arthritis" = "pyogenic arthritis",
                    "hematuria" = "hematuria",
                    "pilonidal cyst" = "pilonidal cyst",
                    "atopic/contact dermatitis" = "atopic dermatitis, contact dermatitis",
                    "arterial embolism and thrombosis" = "arterial embolism, arterial thrombosis",
                    "functional disorders of bladder" = "urinary disorder",
                    "spondylosis" = "spondylosis",
                    "abnormal involuntary movements" = "abnormal involuntary movements",
                    "degenerative disease of the spinal cord" = "degenerative disease of the spinal cord",
                    "voice disturbance" = "voice disturbance",
                    "diseases of nail" = "nail disorder",
                    "visual disturbances" = "visual disturbances",
                    "late pregnancy and failed induction" = "late pregnancy and failed induction",
                    "other diseases of respiratory system" = "respiratory system disease",
                    "malposition and malpresentation of fetus or obstruction" = "obstructed labor",
                    "Hemorrhage in early pregnancy" = "early pregnancy hemorrhage",
                    "Chronic pharyngitis and nasopharyngitis" = "chronic pharyngitis, nasopharyngitis",
                    "Chronic tonsillitis and adenoiditis" = "chronic tonsillitis, adenoiditis",
                    "fetal abnormality affecting management of mother" = "fetal abnormality",
                    "Other disorders of synovium and tendon" = "synovium and tendon disorder",
                    "Noninfectious disorders of lymphatic channels" = "lymphatic system disorder",
                    "Noninflammatory disorders of cervix" = "cervical disorder",
                    "Noninflammatory disorders of ovary, fallopian tube, and broad ligament" = "noninflammatory disorder of ovary fallopian tube and broad ligament",
                    "Other inflammatory spondylopathies" = "inflammatory spondylopathy",
                    "Inflammatory disease of breast" = "inflammatory breast disease",
                    "Lump or mass in breast" = "lump or mass in breast",
                    "Corns and callosities" = "keratosis",
                    "Other acquired musculoskeletal deformity" = "acquired musculoskeletal deformity",
                    "Other derangement of joint" = "joint derangement",
                    "Other and unspecified disc disorder" = "disc disorder",
                    "Internal derangement of knee" = "joint derangement",
                    "Diseases of the musculoskeletal system and connective tissue" = "musculoskeletal diseases",
                    "diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism" = "blood disorder",
                    "endocrine, nutritional and metabolic diseases" = "endocrine and metabolic disease",
                    "congenital malformations" = "congenital malformations",
                    "Localized superficial swelling" = "localized superficial swelling or mass"
                    )

# if collected_all_disease_terms == disease,
# but grep one of diseases_to_fix in DISEASE/TRAIT, replace disease with specific disease
# in collected_all_disease_terms


for (specific_disease in seq_along(diseases_to_fix)) {
  
  # the name is the phrase to look for in DISEASE/TRAIT; the value is the term to record
  grep_pattern = names(diseases_to_fix)[specific_disease]
  replacement = diseases_to_fix[[specific_disease]]
  
  gwas_study_info  = 
   gwas_study_info |>
    mutate(collected_all_disease_terms  = 
            ifelse(grepl(grep_pattern, `DISEASE/TRAIT`, ignore.case = TRUE) &
                   collected_all_disease_terms == "disease",
           stringr::str_replace_all(
             collected_all_disease_terms,
              pattern = vec_to_grep_pattern(c("disease")),
              replacement = replacement
                            ),
           collected_all_disease_terms)
    )
  
}
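
To make the mechanics of the loop above concrete, the sketch below runs the same name-to-value lookup on a three-row toy example in base R. The toy trait strings and the toy_ variable names are made up for illustration; only the two dictionary entries come from diseases_to_fix.

```r
# Two entries copied from diseases_to_fix: name = phrase to find,
# value = specific term that replaces the generic "disease"
toy_fixes <- c("retention of urine" = "urinary retention",
               "back pain" = "back pain")

# Toy rows, not taken from the GWAS Catalog
toy_traits <- c("Retention of urine", "Back pain", "Height")
toy_terms  <- c("disease", "disease", "disease")

for (i in seq_along(toy_fixes)) {
  # Only rows still holding the bare term "disease" are eligible
  hit <- grepl(names(toy_fixes)[i], toy_traits, ignore.case = TRUE) &
    toy_terms == "disease"
  # [[i]] returns the unnamed replacement string
  toy_terms[which(hit)] <- toy_fixes[[i]]
}

toy_terms
```

Rows whose trait text matches no entry keep the generic term, so they remain visible for a later manual check.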

Foot deformity

url <- "http://www.ebi.ac.uk/ols4/api/ontologies/mesh/terms/http%253A%252F%252Fid.nlm.nih.gov%252Fmesh%252FD005530/descendants"

foot_abnormality_terms <- get_descendants(url)

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(foot_abnormality_terms),
                          "foot deformity"
         ))

Other phenotypes to remove

Widow’s peak

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_remove_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(c("widows peak"))
         ))

Other phenotypes to group

Bowing legs

gwas_study_info = gwas_study_info |>
  mutate(collected_all_disease_terms  = 
         stringr::str_replace_all(collected_all_disease_terms,
                          pattern = vec_to_grep_pattern(c("genu varum", 
                                                          "genu valgum")),
                          "bowing of the legs"
         ))

Final summary - number of unique study terms

n_studies_trait = gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(collected_all_disease_terms, PUBMED_ID) |>
  dplyr::distinct() |>
  dplyr::group_by(collected_all_disease_terms) |>
  dplyr::summarise(n_studies = dplyr::n()) |>
  dplyr::arrange(desc(n_studies))

head(n_studies_trait)

dim(n_studies_trait)

When splitting studies with multiple terms

diseases <- stringr::str_split(pattern = ", ", 
 gwas_study_info$collected_all_disease_terms[gwas_study_info$collected_all_disease_terms != ""])  |> 
            unlist() |>
            stringr::str_trim()

length(unique(diseases))
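
As a small worked example of this count, the base R sketch below splits a toy vector of comma-separated terms and tallies the distinct values. The toy_ names and values are invented for illustration and do not reflect the real data.

```r
# Toy stand-in for gwas_study_info$collected_all_disease_terms
toy_terms <- c("asthma, bone fracture", "", "asthma", "toxicity, poisoning")

# Drop empty entries, split on the ", " separator, and trim stray whitespace
toy_diseases <- trimws(unlist(strsplit(toy_terms[toy_terms != ""], ", ", fixed = TRUE)))

# Number of distinct terms once multi-term studies are split apart
length(unique(toy_diseases))

# How often each term occurs across studies
sort(table(toy_diseases), decreasing = TRUE)
```

Here "asthma" is counted once among the distinct terms even though it appears in two entries.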

Save output

fwrite(gwas_study_info,
here::here("output/gwas_cat/gwas_study_info_disease_trait_filtered_v2.csv"))

sessionInfo()