Last updated: 2026-03-31

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version bddceeb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    BC2GM/
    Ignored:    BioC.dtd
    Ignored:    FormatConverter.jar
    Ignored:    FormatConverter.zip
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    doccano_venv/
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/supplement/
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacy_venv_requirements.txt
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  code/full_text_conversion/html_to_xml.R
    Untracked:  code/text_mining_models/tokenise_data.py

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_dbgap_ids.Rmd) and HTML (docs/get_dbgap_ids.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd bddceeb IJbeasley 2026-03-31 Add ega ids exploration to genome archive information
html 694387b IJbeasley 2026-03-30 Build site.
Rmd 4e37aff IJbeasley 2026-03-30 Update getting dbgap ids
html 079a806 IJbeasley 2025-10-27 Build site.
Rmd 0e28f49 IJbeasley 2025-10-27 Compare dbgap sources
html dcc17fc IJbeasley 2025-10-27 Build site.
Rmd a418670 IJbeasley 2025-10-27 Testing overlap with annotation + text mining dbgap
html 3feefe9 IJbeasley 2025-10-20 Build site.
Rmd bdb7fae IJbeasley 2025-10-20 Extracting all dbgap id info
html 47e3b12 IJbeasley 2025-10-20 Build site.
Rmd 5909160 IJbeasley 2025-10-20 Extracting dbgap ids

library(dplyr)
library(data.table)
library(rentrez)
library(purrr)
library(stringr)

Get from full text

Functions to extract dbGaP (and other archive) ids from full text

# patterns to extract: 
# get dbgap ids
# get EGA ids
# get JGAS / JGAD ids
# get PRJEB / PRJNA ids
patterns <- list(
  dbgap = "\\bphs\\d+(?:\\.v\\d+)?(?:\\.p\\d+)?",
  ega   = "\\bEGA[CDFS]\\d+",
  jgas  = "\\bJGA[SD]\\d+",
  prj   = "\\bPRJ(?:EB|NA)\\d+"
)

grep_pattern <- paste(patterns, collapse = "|")
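As a quick sanity check, the combined pattern can be exercised against a made-up sentence (the sentence and accession ids below are hypothetical, not from the corpus); `perl = TRUE` is needed because the non-capturing groups are PCRE syntax:

```r
# Combined pattern, equivalent to grep_pattern above
pat <- "\\bphs\\d+(?:\\.v\\d+)?(?:\\.p\\d+)?|\\bEGA[CDFS]\\d+|\\bJGA[SD]\\d+|\\bPRJ(?:EB|NA)\\d+"

# Hypothetical example sentence
example <- "Data are in dbGaP (phs000424.v8.p2), EGA (EGAS00001000123) and ENA (PRJEB1234)."

hits <- regmatches(example, gregexpr(pat, example, perl = TRUE))[[1]]
hits
# [1] "phs000424.v8.p2" "EGAS00001000123" "PRJEB1234"
```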

library(tokenizers)

# get sentences with dbgap / ega ids etc. in them
get_grep_sentences <- function(file_path,
                               grep_pattern) {
  
  txt_in <- tryCatch(
    xml2::read_xml(file_path),
    error = function(e) {
      message("Error reading ", file_path, ": ", conditionMessage(e))
      NULL
    }
  )
  
  if (is.null(txt_in)) {
    return(character(0))
  }
  
  get_patterns <- c(".//*[local-name()='body']//*[local-name()='p']",
                    ".//*[local-name()='abstract']//*[local-name()='p']",
                    ".//*[local-name()='back']//*[local-name()='p']",
                    ".//*[local-name()='fig']//*[local-name()='caption']",
                    ".//*[local-name()='table-wrap']//*[local-name()='p']",
                    ".//*[local-name()='sec' and @sec-type='data-availability']//*[local-name()='p']",
                    ".//*[local-name()='supplementary-material']//*[local-name()='p']",
                    ".//passage/text")
  
  candidate_text <- txt_in |>
                    xml2::xml_find_all(paste(get_patterns, collapse = " | ")) |>
                    xml2::xml_text()
  
  if (length(candidate_text) == 0) {
    print(paste0("No candidate text found in ", file_path))
    return(character(0))
  }
  
  # tokenize into sentences, then keep only sentences matching the id patterns
  sentences <- unlist(
    tokenize_sentences(paste(candidate_text, collapse = " "))
  )
  
  grep_sentences <- sentences[grepl(grep_pattern, sentences)]
  
  return(grep_sentences)
}
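To illustrate what `get_grep_sentences()` keeps, here is a toy run of the tokenise-then-grep step (the text is invented; assumes the tokenizers package is installed):

```r
library(tokenizers)

# Invented two-sentence text; only the second mentions an accession
toy <- paste("We analysed genotype data from several cohorts.",
             "Genotypes are available in dbGaP under accession phs000007.v33.p14.")

toy_sentences <- unlist(tokenize_sentences(toy))

kept <- toy_sentences[grepl("\\bphs\\d+", toy_sentences)]
kept
# the second sentence (containing phs000007) is returned
```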

Text file paths to load

full_text_files <-
  list.files(here::here("output/fulltexts"), 
             recursive = T, 
             pattern = "\\.xml$") 

# some duplicates come from converted xml files 
# i.e. from proprietary wiley and elsevier xml
# let's keep only the converted files:
full_text_files <- 
  grep("elsevier/elsevier_xml|wiley/wiley_xml", 
       full_text_files, 
       value = TRUE,
       invert = TRUE)


# get pmcids of these files
all_ids <-
  full_text_files |>
  gsub(pattern = ".*/", replacement = "") |>
  gsub(pattern = "\\.xml$", replacement = "") |>
  gsub(pattern = "_bioc", replacement = "") |>
  gsub(pattern = "\\.pdf\\.tei", replacement = "")

# some duplicates come from ncbi and europe pmc
# let's keep the europe pmc files
duplicate_ids <-
data.frame(ids = all_ids) |>
  group_by(ids) |>
  summarise(count = n()) |>
  filter(count > 1) |>
  pull(ids)
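The same duplicate scan can be written more directly with base R's `duplicated()`; a toy illustration (ids invented):

```r
# Toy ids: PMC2 and PMC3 each appear twice
toy_ids <- c("PMC1", "PMC2", "PMC2", "PMC3", "PMC3")

toy_duplicates <- unique(toy_ids[duplicated(toy_ids)])
toy_duplicates
# [1] "PMC2" "PMC3"
```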

grep(pattern = duplicate_ids[1], # or paste(duplicate_ids, collapse = "|")
     x = full_text_files, 
     value = TRUE)
[1] "europe_pmc/PMC10067501.xml" "ncbi_cloud/PMC10067501.xml"
duplicates_to_remove <-
  grep(pattern = paste(duplicate_ids, collapse = "|"),
       x = full_text_files, 
       value = TRUE) |>
  grep(pattern = "ncbi", value = TRUE)

full_text_files <- full_text_files[!(full_text_files %in% duplicates_to_remove)]

# convert to file paths 
full_text_files <- here::here(paste0("output/fulltexts/", 
                                     full_text_files)
                              )

# get pmids and pmcids of relevant files to load
relevant_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")) 

relevant_ids <- c(relevant_ids$PMID, relevant_ids$pmcids)

all_text_ids <-   full_text_files |>
  gsub(pattern = ".*/", replacement = "") |>
  gsub(pattern = "\\.xml$", replacement = "") |>
  gsub(pattern = "_bioc", replacement = "") |>
  gsub(pattern = "\\.pdf\\.tei", replacement = "")

text_files_to_get <-
data.frame(full_text_files = full_text_files,
           all_text_ids = all_text_ids) |>
  filter(all_text_ids %in% relevant_ids)

full_text_files <- text_files_to_get$full_text_files

print("Number of full text files to load:")
[1] "Number of full text files to load:"
print(length(full_text_files))
[1] 790

Extract dbGaP accessions from full text

pmcid_dbgap_sentences <- 
purrr::map(full_text_files,
           ~get_grep_sentences(.x, 
                               grep_pattern = grep_pattern
                               )
           ) 

all_text_ids <-   full_text_files |>
  gsub(pattern = ".*/", replacement = "") |>
  gsub(pattern = "\\.xml$", replacement = "")  |>
  gsub(pattern = "_bioc", replacement = "") |>
  gsub(pattern = "\\.pdf\\.tei", replacement = "")

names(pmcid_dbgap_sentences) <- all_text_ids

keep_pmcid_dbgap_sentences <-
  purrr::discard(
    pmcid_dbgap_sentences,
    ~rlang::is_empty(.x)
  ) 


# Convert to data frame
sentences_df <- 
  purrr::imap(keep_pmcid_dbgap_sentences,
              ~data.frame(pmcid = .y,
                          sentence = unique(.x),
                        stringsAsFactors = FALSE)
              )|>
  dplyr::bind_rows()

# Extract all dbGaP IDs from the sentences
sentences_df <- sentences_df |>
  mutate(
    dbgap_id = str_extract_all(sentence, patterns[["dbgap"]]),
    ega_id   = str_extract_all(sentence, patterns[["ega"]]),
    jgas_id  = str_extract_all(sentence, patterns[["jgas"]]),
    prj_id   = str_extract_all(sentence, patterns[["prj"]])
  )
             
sentences_df = 
  sentences_df |>
  distinct()

print("Number of unique PMCIDs with dbgap / ega ids found in full text:")
[1] "Number of unique PMCIDs with dbgap / ega ids found in full text:"
sentences_df$pmcid |> unique() |> length()
[1] 122
# number of pubmed ids with dbgap ids found
print("Number of unique PMCIDs with dbgap ids found in full text:")
[1] "Number of unique PMCIDs with dbgap ids found in full text:"
sentences_df %>%
  rowwise() %>%
  filter(!rlang::is_empty(dbgap_id)) %>%
  filter(length(dbgap_id) > 0) %>%
  ungroup() %>%
  pull(pmcid) %>%
  unique() %>%
  length()
[1] 103
print("Number of unique PMCIDs with EGA ids found in full text:")
[1] "Number of unique PMCIDs with EGA ids found in full text:"
sentences_df %>%
  rowwise() %>%
  filter(!rlang::is_empty(ega_id)) %>%
  filter(length(ega_id) > 0) %>%
  pull(pmcid) %>%
  unique() %>%
  length()
[1] 17
convert_pmid_df  <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")) |>
  rename(PMCID = pmcids)

# add pmid column
sentences_df =
  left_join(sentences_df,
            convert_pmid_df |> 
              dplyr::select(pmcid = PMCID,
                            pmid = PMID),
            by = "pmcid"
            ) |>
  mutate(pmid = ifelse(is.na(pmid), 
                      pmcid,
                      pmid
                      )
  ) |> 
  mutate(pmcid = ifelse(grepl("PMC", pmid),
                        pmid,
                        pmcid)
         ) |>
  mutate(pmcid = ifelse(grepl("PMC", pmcid), 
                        pmcid, 
                        ""
                        )
  )


sentences_df = 
  left_join(sentences_df |> 
              select(-pmcid) |> 
              mutate(pmid = as.integer(pmid)),
            convert_pmid_df |> 
              dplyr::select(pmcid = PMCID,
                            pmid = PMID),
            by = "pmid"
            ) 

# typo in text for PMC8486151:
# replace the truncated "phs000" with 
# "phs00056.v1.p1"

sentences_df |> filter(pmcid == "PMC8486151") |> pull(dbgap_id)
[[1]]
[1] "phs000090.v1.p1" "phs000810.v1.p1" "phs000293.v1"    "phs000"         
[5] "phs000200.v1"   
fix_dbgap_ids <- 
  c(sentences_df |> filter(pmcid == "PMC8486151") |> pull(dbgap_id),
    "phs00056.v1.p1"
    ) |>
  unlist()

fix_dbgap_ids <- fix_dbgap_ids[fix_dbgap_ids != "phs000"]

sentences_df <-
  sentences_df |>
  mutate(dbgap_id = ifelse(pmcid == "PMC8486151",
                          list(fix_dbgap_ids),
                          dbgap_id
                          )
         )
         

data.table::fwrite(sentences_df,
                   here::here("output/gwas_cat/gwas_study_dbgap_ega_sentences.csv")
                   )

Get from Europe PMC list

dbgap_info <-
data.table::fread(here::here("data/europe_pmc/dbgap.csv")) 

pmids <- convert_pmid_df$PMID 

pmcids <- convert_pmid_df$PMCID

relevant_pmid_dbgaps = dbgap_info |>
  filter(EXTID %in% pmids) |>
  distinct() 

print("Number of unique PMIDs with dbgap ids found in Europe PMC:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC:"
relevant_pmid_dbgaps |>
  pull(EXTID) |>
  unique() |>
  length()
[1] 59
relevant_pmcids_dbgaps = dbgap_info |>
  filter(PMCID %in% pmcids) |>
  filter(PMCID != "") |>
  distinct()

print("Number of unique PMCIDs with dbgap ids found in Europe PMC:")
[1] "Number of unique PMCIDs with dbgap ids found in Europe PMC:"
relevant_pmcids_dbgaps |>
  pull(PMCID) |>
  unique() |>
  length()
[1] 59
# overlap between full text extraction and europe pmc annotations
print("Number of unique PMIDs with dbgap ids found in Europe PMC that also have dbgap ids found in full text:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC that also have dbgap ids found in full text:"
(unique(sentences_df$pmid) %in% unique(relevant_pmid_dbgaps$EXTID)) |>
  sum() 
[1] 57
# 57 / 59 = 97% of pmids with dbgap ids in europe pmc annotations are also found in full text

print("Number of unique PMIDs with dbgap ids found in full text, but not in Europe PMC:")
[1] "Number of unique PMIDs with dbgap ids found in full text, but not in Europe PMC:"
setdiff(unique(sentences_df$pmid), 
        unique(relevant_pmid_dbgaps$EXTID)) |> length()
[1] 65
# 65 / 122 = 53% of pmids with dbgap/ega ids found in full text are not in europe pmc dbgap annotations

print("Number of unique PMIDs with dbgap ids found in Europe PMC that do not have dbgap ids found in full text:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC that do not have dbgap ids found in full text:"
setdiff(unique(relevant_pmid_dbgaps$EXTID), 
        unique(sentences_df$pmid)) |> length()
[1] 2
full_text_dbgaps = 
sentences_df  |>
  select(dbgap_id, pmid, pmcid) |>
  tidyr::unnest(dbgap_id) |>
  mutate(dbgap_id = str_trim(dbgap_id)) |>
  mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
  # mutate(dbgap_id = str_remove(dbgap_id, "\\.v\\d+\\.p\\d+.*$")) |> 
  group_by(pmid, pmcid) |>
  summarise(dbgap_full_text = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) |>
#  tidyr::nest(dbgap_full_text = dbgap_id) |>
  mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
europe_pmc_dbgaps =
relevant_pmid_dbgaps |>
  select(dbgap_id = dbgap,
         pmcid = PMCID,
         pmid = EXTID) |>
    group_by(pmid, pmcid) |>
  summarise(dbgap_epmc = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) 
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
 # tidyr::nest(dbgap_epmc = dbgap_id) 

joined <-
full_join(full_text_dbgaps,
          europe_pmc_dbgaps,
          by = c("pmid", "pmcid")
          )

# includes more dbgap ids from full text than europe pmc annotations
joined |> 
  filter(!is.na(dbgap_full_text) & !is.na(dbgap_epmc)) |> 
  filter(dbgap_full_text != dbgap_epmc) 
# A tibble: 10 × 4
# Groups:   pmid [10]
   pmid     pmcid       dbgap_full_text                               dbgap_epmc
   <chr>    <chr>       <chr>                                         <chr>     
 1 26301688 PMC4863040  phs000019, phs000091, phs000092, phs000125, … phs000019…
 2 27790247 PMC5061751  phs000007, phs000090, phs000209, phs000280, … phs000007…
 3 28044437 PMC5476850  phs000007, phs000342                          phs000007 
 4 28604730 PMC5510465  phs000876, phs001273                          phs000876 
 5 28838971 PMC5652606  phs000424, phs001375, phs001388, phs001393, … phs000424 
 6 30510157 PMC6277418  phs000424, phs000930                          phs000424 
 7 34727735 PMC8885789  phs000964, phs000974, phs001211, phs001237, … phs000964…
 8 35769078 PMC9234217  phs000090, phs000342                          phs000090 
 9 36929942 PMC10248849 phs000178, phs000187, phs000196, phs000206, … phs000178…
10 40210677 PMC11985957 phs001672, phs003894                          phs001672 
joined |>
  filter(!is.na(dbgap_full_text) & is.na(dbgap_epmc))
# A tibble: 46 × 4
# Groups:   pmid [46]
   pmid     pmcid        dbgap_full_text                              dbgap_epmc
   <chr>    <chr>        <chr>                                        <chr>     
 1 17903298 "PMC1995610" phs000007                                    <NA>      
 2 17903303 "PMC1995605" phs000007                                    <NA>      
 3 19458352 "PMC2857316" phs000183                                    <NA>      
 4 19875614 "PMC2809960" phs000086                                    <NA>      
 5 25102180 "PMC4125087" phs000200                                    <NA>      
 6 26634245 "PMC4668640" phs000179                                    <NA>      
 7 26959717 ""           phs000346                                    <NA>      
 8 28043905 "PMC5367948" phs000346                                    <NA>      
 9 28346466 "PMC5367689" phs000090, phs000209, phs000280, phs000284,… <NA>      
10 29404672 "PMC5876265" phs000086                                    <NA>      
# ℹ 36 more rows
# seems good extraction: 
# pubmed id: 39024449
# pubmed id: 26959717
# pubmed id: 31537701
# pubmed id: 37167549
# pubmed id: 30207284
# pubmed id: 33135175
# pubmed id: 37069358

joined |>
  filter(is.na(dbgap_full_text) & !is.na(dbgap_epmc))
# A tibble: 2 × 4
# Groups:   pmid [2]
  pmid     pmcid      dbgap_full_text dbgap_epmc
  <chr>    <chr>      <chr>           <chr>     
1 22384361 PMC3276159 <NA>            phs000292 
2 31697830 PMC7236628 <NA>            phs000424 

Get from Entrez

Extract internal dbGaP ids

Sys.getenv("ENTREZ_KEY")

get_internal_dbgap_ids = function(pubmed_id) {
  
  Sys.sleep(0.1)

dbgap_links <- entrez_link(
            dbfrom = "pubmed",
            db = "gap",
            id = pubmed_id
)

dbgap_ids <- unlist(dbgap_links$links$pubmed_gap)

if(is.null(dbgap_ids)){
  dbgap_ids = NA
}

return(data.frame(PUBMED_ID = pubmed_id,
                  DBGAP_ID = str_flatten(dbgap_ids,
                                         collapse = ",",
                                         na.rm = TRUE)
                  )
)

}
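Entrez calls can fail transiently (timeouts, rate limits). One defensive option, not part of the original run, is to wrap the function with purrr::possibly() so a single failing PMID yields NULL (which dplyr::bind_rows() silently drops) instead of aborting the whole map:

```r
library(purrr)

# Sketch: assumes get_internal_dbgap_ids() defined above;
# on error the wrapper returns NULL instead of raising
safe_get_internal_dbgap_ids <- possibly(
  get_internal_dbgap_ids,
  otherwise = NULL
)
```

Then `purrr::map(pmids, safe_get_internal_dbgap_ids, .progress = TRUE)` proceeds past individual failures.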

pubmed_dbgap_mapping <-
purrr::map(pmids,
           get_internal_dbgap_ids,
           .progress = TRUE
           ) |>
  dplyr::bind_rows()

Save dbGaP id mapping

data.table::fwrite(pubmed_dbgap_mapping,
                   here::here("output/gwas_cohorts/gwas_study_dbgap_ids.csv")
                   )

Percentage of studies with dbGaP ids

# load previously saved mapping
pubmed_dbgap_mapping <- data.table::fread(
  here::here("output/gwas_cohorts/gwas_study_dbgap_ids.csv")
)

n_with_dbgap = pubmed_dbgap_mapping |>
  dplyr::filter(DBGAP_ID != "") |>
  nrow()

print("Number of unique PMIDs with dbgap ids found via Entrez:")
[1] "Number of unique PMIDs with dbgap ids found via Entrez:"
n_with_dbgap
[1] 64
percent_with_dbgap = n_with_dbgap / nrow(pubmed_dbgap_mapping) * 100

print("Percentage of PMIDs with dbgap ids found via Entrez:")
[1] "Percentage of PMIDs with dbgap ids found via Entrez:"
percent_with_dbgap
[1] 7.729469

Extract dbGaP study accessions

entrez_dbgap_ids = pubmed_dbgap_mapping$DBGAP_ID

entrez_dbgap_ids = entrez_dbgap_ids[entrez_dbgap_ids != ""]

entrez_dbgap_ids = entrez_dbgap_ids |>
  strsplit(",") |>
  unlist() |>
  unique()

print("Number of unique internal dbgap ids to retrieve accessions for:")
print(length(entrez_dbgap_ids))


get_dbgap_accession = function(internal_dbgap_id) {
  
  
  summary = entrez_summary(db = "gap", 
                           id = internal_dbgap_id)
  
  
  accession = summary$d_study_results$d_study_id
  
  return(data.frame(
    INTERNAL_DBGAP_ID = internal_dbgap_id,
    DBGAP_ACCESSION = str_flatten(accession, 
                                  collapse = ",", 
                                  na.rm = TRUE)
  )
  )

} 

dbgap_accession_mapping <-
  purrr::map(entrez_dbgap_ids,
             get_dbgap_accession,
             .progress = TRUE
             ) |>
  dplyr::bind_rows()

Save dbGaP accession mapping

  data.table::fwrite(
    dbgap_accession_mapping,
     here::here("output/gwas_cohorts/gwas_study_dbgap_accessions.csv")
                     )

Percentage of internal dbGaP ids for which we could not retrieve accessions

dbgap_accession_mapping <- data.table::fread(
  here::here("output/gwas_cohorts/gwas_study_dbgap_accessions.csv")
)

# many internal dbgap ids do not return study accessions
# perhaps because they are old dbgap versions?
print("Number of internal dbgap ids for which we could not retrieve accessions:")
[1] "Number of internal dbgap ids for which we could not retrieve accessions:"
dbgap_accession_mapping |> 
  filter(DBGAP_ACCESSION == "") |>
  nrow()
[1] 28
pubmed_dbgap_mapping =
  pubmed_dbgap_mapping |>
  tidyr::separate_longer_delim(cols = "DBGAP_ID", 
                               delim = ",")

dbgap_accession_mapping = 
  dbgap_accession_mapping |>
  tidyr::separate_longer_delim(cols = "INTERNAL_DBGAP_ID", 
                               delim = ",")

dbgap_accession_mapping = 
left_join(pubmed_dbgap_mapping, 
     dbgap_accession_mapping,
     by = c("DBGAP_ID" = "INTERNAL_DBGAP_ID")
     ) 

print("Number of unique PMIDs with dbgap accessions found via Entrez:")
[1] "Number of unique PMIDs with dbgap accessions found via Entrez:"
dbgap_accession_mapping |>
  filter(DBGAP_ACCESSION != "" & DBGAP_ID != "") |>
  pull(PUBMED_ID) |>
  unique() |>
  length()
[1] 34

Save combined dbGaP mapping

combined_dbgap_pubmed_mapping = 
  dbgap_accession_mapping |>
  filter(DBGAP_ACCESSION != "" & DBGAP_ID != "") |>
  select(PUBMED_ID, DBGAP_ACCESSION) |>
  distinct()

data.table::fwrite(
  combined_dbgap_pubmed_mapping,
  here::here("output/gwas_cohorts/gwas_study_combined_dbgap_pubmed_mapping.csv")
)

Compare dbGaP ids retrieved from full text vs via Entrez

combined_dbgap_pubmed_mapping <- data.table::fread(
  here::here("output/gwas_cohorts/gwas_study_combined_dbgap_pubmed_mapping.csv")
)

entrez_dbgaps <- 
  combined_dbgap_pubmed_mapping |>
  rename(pmid = PUBMED_ID,
         dbgap_id = DBGAP_ACCESSION
         ) |>
  mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
  filter(dbgap_id != "") |>
  group_by(pmid) |>
  summarise(dbgap_annotation = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))


joined_dbgaps =
full_join(entrez_dbgaps,
          full_text_dbgaps,
          by = "pmid")

print("Number of PMIDs with dbgap annotations from Entrez that do not match dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez that do not match dbgap ids found in full text:"
joined_dbgaps |>
  filter(!is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
  filter(dbgap_annotation != dbgap_full_text)
# A tibble: 5 × 4
  pmid     dbgap_annotation                pmcid      dbgap_full_text           
  <chr>    <chr>                           <chr>      <chr>                     
1 18978792 phs000180                       PMC2635556 phs000018                 
2 19430480 phs000180, phs000910, phs000911 PMC2889014 phs000018                 
3 22101970 phs000888, phs000984            PMC3277617 phs000170, phs000188, phs…
4 25102180 phs000888, phs000925            PMC4125087 phs000200                 
5 25751624 phs000911                       PMC4380767 phs000180                 
print("Number of PMIDs with dbgap annotations from Entrez that match dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez that match dbgap ids found in full text:"
joined_dbgaps |>
  filter(!is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
  filter(dbgap_annotation == dbgap_full_text) |>
  nrow()
[1] 0
print("Number of PMIDs with dbgap annotations from Entrez but no dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez but no dbgap ids found in full text:"
joined_dbgaps |>
  filter(!is.na(dbgap_annotation) & is.na(dbgap_full_text)) |>
  nrow()
[1] 29
print("Number of PMIDs with dbgap found in full text but not in Entrez annotations:")
[1] "Number of PMIDs with dbgap found in full text but not in Entrez annotations:"
joined_dbgaps |>
  filter(is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
  nrow()
[1] 98

Combine and save

# europe_pmc_dbgaps
# entrez_dbgaps 
# full_text_dbgaps

full_text_dbgaps = 
sentences_df  |>
  select(dbgap_id, pmid, pmcid) |>
  tidyr::unnest(dbgap_id) |>
  mutate(dbgap_id = str_trim(dbgap_id)) |>
  # mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
  # mutate(dbgap_id = str_remove(dbgap_id, "\\.v\\d+\\.p\\d+.*$")) |> 
  group_by(pmid, pmcid) |>
  summarise(dbgap_full_text = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) |>
#  tidyr::nest(dbgap_full_text = dbgap_id) |>
  mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
europe_pmc_dbgaps =
relevant_pmid_dbgaps |>
  select(dbgap_id = dbgap,
         pmcid = PMCID,
         pmid = EXTID) |>
    group_by(pmid, pmcid) |>
  summarise(dbgap_epmc = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) 
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
entrez_dbgaps <- 
  combined_dbgap_pubmed_mapping |>
  rename(pmid = PUBMED_ID,
         dbgap_id = DBGAP_ACCESSION
         ) |>
  # mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
  filter(dbgap_id != "") |>
  group_by(pmid) |>
  summarise(dbgap_annotation = str_flatten(unique(sort(dbgap_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))

combined_annotations <-
full_join(full_text_dbgaps,
          europe_pmc_dbgaps,
          by = c("pmid", "pmcid")
          ) |>
  full_join(entrez_dbgaps,
            by = c("pmid")
            ) |>
  tidyr::separate_longer_delim(cols = c(dbgap_full_text),
                               delim = ", ")|>
  tidyr::separate_longer_delim(cols = c(dbgap_epmc),
                               delim = ", ")|>
  tidyr::separate_longer_delim(cols = c(dbgap_annotation),
                               delim = ", ")|>
  tidyr::pivot_longer(cols = c(dbgap_full_text, dbgap_epmc, dbgap_annotation),
               names_to = "source",
               values_to = "dbgap_id") |>
  filter(!is.na(dbgap_id) & dbgap_id != "") |>
  mutate(source = stringr::str_remove_all(source, "dbgap_")) |>
  mutate(source = ifelse(source == "annotation", "entrez", source)) |>
  distinct()


combined_annotations$pmid |> unique() |> length()
[1] 134
n_pubmeds_per_dbgap =
combined_annotations |>
  mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")) |>
  select(pmid, broad_dbgap) |>
  distinct() |>
  group_by(broad_dbgap) |>
  summarise(n_pubmeds = n()) |>
  arrange(desc(n_pubmeds))

n_pubmeds_per_dbgap |>
  head(
  )
# A tibble: 6 × 2
  broad_dbgap n_pubmeds
  <chr>           <int>
1 phs000424          11
2 phs001273           8
3 phs000007           7
4 phs000180           7
5 phs001672           7
6 phs000179           6
n_pubmeds_per_dbgap$n_pubmeds |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   2.039   2.000  11.000 

Are there any dbgap ids not in our cohort description mapping?

cohort_dbgap_mapping <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
                                 sheet = 1)
New names:
• `` -> `...15`
annotated_dbgap_ids =
  combined_annotations |>
  pull(dbgap_id) |>
  str_trim() |>
  str_extract(pattern = "^phs\\d+") |>
  unique()

cohort_desc_dbgap_ids = 
  cohort_dbgap_mapping |>
  pull(dbGaP) |>
  strsplit(",") |>
  unlist() |>
  str_trim() |>
  str_extract(pattern = "^phs\\d+") |> 
  unique()

cohort_desc_dbgap_ids_p2 = 
    cohort_dbgap_mapping |>
  pull(old_dbGaP) |>
  strsplit(",") |>
  unlist() |>
  str_trim() |>
  str_extract(pattern = "^phs\\d+") |> 
  unique()

cohort_desc_dbgap_ids <- unique(c(cohort_desc_dbgap_ids, cohort_desc_dbgap_ids_p2))

not_found_dbgap_ids <- annotated_dbgap_ids[!(annotated_dbgap_ids %in% cohort_desc_dbgap_ids)]

not_found_dbgap_ids  <- sort(not_found_dbgap_ids)

not_found_dbgap_ids |> length()
[1] 0
not_found_dbgap_ids
character(0)
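The bracket-and-`%in%` check above is equivalent to `setdiff()`; a sketch with hypothetical ID vectors:

``` r
# Hypothetical annotated and known accession sets
annotated <- c("phs000424", "phs000007")
known     <- c("phs000424", "phs000007", "phs000180")

# IDs annotated in papers but absent from the cohort mapping
setdiff(annotated, known)
# character(0): every annotated ID is covered
```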
get_dbgap_cohort_names <- 
cohort_dbgap_mapping |>
  tidyr::separate_longer_delim(cols = c(dbGaP),
                               delim = ",")  |>
  tidyr::separate_longer_delim(cols = c(old_dbGaP),
                               delim = ",")  |>
  tidyr::pivot_longer(cols = c(dbGaP, old_dbGaP),
               names_to = "source",
               values_to = "dbgap_id") |>
  select(cohort, dbgap_id) |>
  mutate(dbgap_id = str_trim(dbgap_id)) |>
  mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")) |>
  filter(!is.na(dbgap_id)) |>
  select(-dbgap_id) |>
  distinct()

get_dbgap_cohort_names |>
  group_by(broad_dbgap) |>
  summarise(n = n(),
            cohorts = stringr::str_flatten(cohort, collapse = ", ", na.rm = T)) |>
  filter(n > 1)
# A tibble: 30 × 3
   broad_dbgap     n cohorts                          
   <chr>       <int> <chr>                            
 1 phs000086       4 DCCT, DCCT/EDIC, EDIC, TEDDY     
 2 phs000091       3 HPFS, NHS, NHS/HPFS              
 3 phs000171       2 GeneMSA, PMRP                    
 4 phs000181       2 OZ-ALC, OZALC                    
 5 phs000215       2 BLSA, InCHIANTI                  
 6 phs000221       3 FamHS, FamHS-Visit1, FamHS-Visit2
 7 phs000237       2 NUGENE, NUgene                   
 8 phs000280       2 ARIC-CARe, CARe                  
 9 phs000303       2 KORA, KORA-gen                   
10 phs000404       2 COGEND, UW-TTURC                 
# ℹ 20 more rows
get_dbgap_cohort_names <- 
  get_dbgap_cohort_names |>
  group_by(broad_dbgap) |>
  summarise(cohort = stringr::str_flatten(cohort, collapse = "|", na.rm = T))

dbgap_cohort_names =
left_join(
combined_annotations |> 
  mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")),
get_dbgap_cohort_names,
by = c("broad_dbgap")
) 

``` r
data.table::fwrite(dbgap_cohort_names, here::here("output/gwas_cohorts/gwas_study_dbgap_cohort_names.csv"))
```




# EGA IDs


``` r
ega_info <-
data.table::fread(here::here("data/europe_pmc/ega.csv")) 

full_text_ega_ids =
sentences_df  |>
  select(ega_id, pmid, pmcid) |>
  tidyr::unnest(ega_id) |>
  mutate(ega_id = str_trim(ega_id)) |>
  group_by(pmid, pmcid) |>
  summarise(ega_full_text = str_flatten(unique(sort(ega_id)), 
                          collapse = ", ", 
                          na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
epmc_ega_ids =
ega_info |>
  filter(EXTID %in% pmids) |>
  select(ega_id = ega, 
         pmid = EXTID,
         pmcid = PMCID) |>
  group_by(pmid, pmcid) |>
  summarise(ega_epmc = str_flatten(unique(sort(ega_id)), 
                          collapse = ", ", 
                          na.rm = TRUE))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
joined_ega =
full_join(full_text_ega_ids,
          epmc_ega_ids,
          by = c("pmid", "pmcid")
          )

joined_ega |>
  filter(!is.na(ega_full_text) & !is.na(ega_epmc)) |>
  filter(ega_full_text != ega_epmc)
# A tibble: 1 × 4
# Groups:   pmid [1]
  pmid     pmcid       ega_full_text                                    ega_epmc
  <chr>    <chr>       <chr>                                            <chr>   
1 39939757 PMC11903302 EGAD00010002057, EGAD50000000933, EGAD500000009… EGAD000…
# Europe PMC IDs are a direct subset of the full-text extracted EGA IDs, which is good
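The subset claim can be checked directly by splitting the comma-joined strings; a sketch using a made-up pair of accession lists:

``` r
library(stringr)

full_text <- "EGAD00010002057, EGAD50000000933"
epmc      <- "EGAD00010002057"

# TRUE if every Europe PMC ID also appears in the full-text extraction
all(str_split_1(epmc, ", ") %in% str_split_1(full_text, ", "))
# [1] TRUE
```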
get_ega_cohort_names <-
cohort_dbgap_mapping |>
  tidyr::separate_longer_delim(cols = c(EGA),
                               delim = ",")  |>
  tidyr::pivot_longer(cols = c(EGA),
               names_to = "source",
               values_to = "ega_id") |>
  select(cohort, ega_id) |>
  mutate(ega_id = str_trim(ega_id)) |>
  filter(!is.na(ega_id)) |>
  group_by(ega_id) |>
  summarise(cohort = stringr::str_flatten(cohort, collapse = "|", na.rm = T)) |>
  distinct()

joined_ega <-
  joined_ega |>
  tidyr::pivot_longer(cols = c(ega_full_text, ega_epmc),
               names_to = "source",
               values_to = "ega_id") |>
  tidyr::separate_longer_delim(cols = c(ega_id),
                               delim = ", ") |>
  filter(!is.na(ega_id) & ega_id != "")

ega_cohort_names =
left_join(joined_ega,
get_ega_cohort_names, 
by = "ega_id"
)

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] tokenizers_0.3.0  stringr_1.6.0     purrr_1.1.0       rentrez_1.2.4    
[5] data.table_1.17.8 dplyr_1.1.4       workflowr_1.7.2  

loaded via a namespace (and not attached):
 [1] sass_0.4.10         utf8_1.2.6          generics_0.1.4     
 [4] tidyr_1.3.1         renv_1.1.8          xml2_1.4.0         
 [7] stringi_1.8.7       digest_0.6.37       magrittr_2.0.4     
[10] evaluate_1.0.5      fastmap_1.2.0       cellranger_1.1.0   
[13] rprojroot_2.1.0     jsonlite_2.0.0      processx_3.8.6     
[16] whisker_0.4.1       ps_1.9.1            promises_1.3.3     
[19] BiocManager_1.30.26 httr_1.4.7          XML_3.99-0.19      
[22] jquerylib_0.1.4     cli_3.6.5           rlang_1.1.6        
[25] withr_3.0.2         cachem_1.1.0        yaml_2.3.10        
[28] tools_4.3.1         httpuv_1.6.16       here_1.0.1         
[31] vctrs_0.6.5         R6_2.6.1            lifecycle_1.0.4    
[34] git2r_0.36.2        fs_1.6.6            pkgconfig_2.0.3    
[37] callr_3.7.6         pillar_1.11.1       bslib_0.9.0        
[40] later_1.4.4         glue_1.8.0          Rcpp_1.1.0         
[43] xfun_0.55           tibble_3.3.0        tidyselect_1.2.1   
[46] rstudioapi_0.17.1   knitr_1.50          htmltools_0.5.8.1  
[49] SnowballC_0.7.1     rmarkdown_2.30      compiler_4.3.1     
[52] getPass_0.2-4       readxl_1.4.5