Last updated: 2026-03-31
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bddceeb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: BC2GM/
Ignored: BioC.dtd
Ignored: FormatConverter.jar
Ignored: FormatConverter.zip
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: code/.DS_Store
Ignored: code/full_text_conversion/.DS_Store
Ignored: data/.DS_Store
Ignored: data/RCDCFundingSummary_01042026.xlsx
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/epmc/
Ignored: data/europe_pmc/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/gbd_2019_california_percent_deaths.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: doccano_venv/
Ignored: figures/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/pubmedbert_entity_predictions.csv
Ignored: output/pubmedbert_entity_predictions.jsonl
Ignored: output/pubmedbert_predictions.csv
Ignored: output/pubmedbert_predictions.jsonl
Ignored: output/supplement/
Ignored: output/text_mining_predictions/
Ignored: output/trait_ontology/
Ignored: population_description_terms.txt
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: renv/
Ignored: spacy_venv_requirements.txt
Ignored: spacyr_venv/
Untracked files:
Untracked: code/full_text_conversion/html_to_xml.R
Untracked: code/text_mining_models/tokenise_data.py
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/gwas_to_gbd.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/specific_aims_stats.Rmd
Modified: analysis/text_for_cohort_labels.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/get_dbgap_ids.Rmd) and
HTML (docs/get_dbgap_ids.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | bddceeb | IJbeasley | 2026-03-31 | Add ega ids exploration to genome archive information |
| html | 694387b | IJbeasley | 2026-03-30 | Build site. |
| Rmd | 4e37aff | IJbeasley | 2026-03-30 | Update getting dbgap ids |
| html | 079a806 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 0e28f49 | IJbeasley | 2025-10-27 | Compare dbgap sources |
| html | dcc17fc | IJbeasley | 2025-10-27 | Build site. |
| Rmd | a418670 | IJbeasley | 2025-10-27 | Testing overlap with annotation + text mining dbgap |
| html | 3feefe9 | IJbeasley | 2025-10-20 | Build site. |
| Rmd | bdb7fae | IJbeasley | 2025-10-20 | Extracting all dbgap id info |
| html | 47e3b12 | IJbeasley | 2025-10-20 | Build site. |
| Rmd | 5909160 | IJbeasley | 2025-10-20 | Extracting dbgap ids |
library(dplyr)
library(data.table)
library(rentrez)
library(purrr)
library(stringr)
# patterns to extract:
# get dbgap ids
# get EGA ids
# get JGAS
# get PRJEB / PRJNA ids
patterns <- list(
  dbgap = "\\bphs\\d+(?:\\.v\\d+)?(?:\\.p\\d+)?",
  ega   = "\\bEGA[CDFS]\\d+",
  jgas  = "\\b(?:JGAS|JGAD)\\d+",
  prj   = "\\b(?:PRJEB|PRJNA)\\d+"
)
grep_pattern <- paste(patterns, collapse = "|")
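As a quick sanity check (not part of the original pipeline), the combined pattern can be exercised against a few synthetic accession strings:

``` r
# hypothetical example sentences, not taken from any real paper
test_strings <- c(
  "Genotypes are available in dbGaP (phs000424.v8.p2).",
  "Data were deposited at the EGA under accession EGAS00001000123.",
  "Sequence data: JGAS000001 / PRJEB12345.",
  "No accession mentioned here."
)
grepl(grep_pattern, test_strings)
#> [1]  TRUE  TRUE  TRUE FALSE
```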
library(tokenizers)
# get sentences with dbgap / ega ids etc. in them
get_grep_sentences <- function(file_path,
                               grep_pattern) {
  txt_in <- tryCatch(
    suppressWarnings(xml2::read_xml(file_path)),
    error = function(e) {
      print(paste0("error with ", file_path))
      return(NULL)
    }
  )
  if (is.null(txt_in)) {
    return(character(0))
  }
  # XPaths covering JATS body text, abstracts, back matter, figure
  # captions, tables, data-availability sections, supplementary
  # material, and BioC passage text
  get_patterns <- c(".//*[local-name()='body']//*[local-name()='p']",
                    ".//*[local-name()='abstract']//*[local-name()='p']",
                    ".//*[local-name()='back']//*[local-name()='p']",
                    ".//*[local-name()='fig']//*[local-name()='caption']",
                    ".//*[local-name()='table-wrap']//*[local-name()='p']",
                    ".//*[local-name()='sec' and @sec-type='data-availability']//*[local-name()='p']",
                    ".//*[local-name()='supplementary-material']//*[local-name()='p']",
                    ".//passage/text")
  candidate_text <- txt_in |>
    xml2::xml_find_all(paste(get_patterns, collapse = " | ")) |>
    xml2::xml_text()
  if(length(candidate_text) == 0) {
    print(paste0("No candidate text found in ", file_path))
    return(character(0))
  }
  # tokenize into sentences, then keep sentences matching any id pattern
  sentences <- unlist(
    tokenize_sentences(paste(candidate_text, collapse = " "))
  )
  grep_sentences <- sentences[grepl(grep_pattern, sentences)]
  return(grep_sentences)
}
full_text_files <-
list.files(here::here("output/fulltexts"),
recursive = T,
pattern = "\\.xml$")
# some duplicates come from converted xml files
# i.e. from proprietary wiley and elsevier xml
# let's keep only the converted files:
full_text_files <-
grep("elsevier/elsevier_xml|wiley/wiley_xml",
full_text_files,
value = TRUE,
invert = TRUE)
# get pmcids of these files
all_ids <-
full_text_files |>
gsub(pattern = ".*/", replacement = "") |>
gsub(pattern = "\\.xml$", replacement = "") |>
gsub(pattern = "_bioc", replacement = "") |>
gsub(pattern = "\\.pdf.tei", replacement = "")
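This filename-to-id cleanup is repeated verbatim twice more below; one way to avoid the duplication (a refactoring sketch, not part of the original pipeline) is a small helper:

``` r
# strip directory, extension, and source-specific suffixes
# to recover the PMC / PubMed id from a file path
path_to_text_id <- function(paths) {
  paths |>
    gsub(pattern = ".*/", replacement = "") |>
    gsub(pattern = "\\.xml$", replacement = "") |>
    gsub(pattern = "_bioc", replacement = "") |>
    gsub(pattern = "\\.pdf\\.tei", replacement = "")
}

path_to_text_id("europe_pmc/PMC10067501.xml")
#> [1] "PMC10067501"
```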
# some duplicates come from ncbi and europe pmc
# let's keep the europe pmc files
duplicate_ids <-
data.frame(ids = all_ids) |>
group_by(ids) |>
summarise(count = n()) |>
filter(count > 1) |>
pull(ids)
# example: the first duplicated id appears in two source directories
grep(full_text_files,
     pattern = duplicate_ids[1],
     value = T)
[1] "europe_pmc/PMC10067501.xml" "ncbi_cloud/PMC10067501.xml"
duplicates_to_remove <-
grep(full_text_files,
pattern = paste(duplicate_ids, collapse = "|"),
value = T) |>
grep(pattern = "ncbi", value = T)
full_text_files <- full_text_files[!c(full_text_files %in% duplicates_to_remove)]
# convert to file paths
full_text_files <- here::here(paste0("output/fulltexts/",
full_text_files)
)
# get pmids and pmcids of relevant files to load
relevant_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))
relevant_ids <- c(relevant_ids$PMID, relevant_ids$pmcids)
all_text_ids <- full_text_files |>
gsub(pattern = ".*/", replacement = "") |>
gsub(pattern = "\\.xml$", replacement = "") |>
gsub(pattern = "_bioc", replacement = "") |>
gsub(pattern = "\\.pdf.tei", replacement = "")
text_files_to_get <-
data.frame(full_text_files = full_text_files,
all_text_ids = all_text_ids) |>
filter(all_text_ids %in% relevant_ids)
full_text_files <- text_files_to_get$full_text_files
print("Number of full text files to load:")
[1] "Number of full text files to load:"
print(length(full_text_files))
[1] 790
pmcid_dbgap_sentences <-
purrr::map(full_text_files,
~get_grep_sentences(.x,
grep_pattern = grep_pattern
)
)
all_text_ids <- full_text_files |>
gsub(pattern = ".*/", replacement = "") |>
gsub(pattern = "\\.xml$", replacement = "") |>
gsub(pattern = "_bioc", replacement = "") |>
gsub(pattern = "\\.pdf.tei", replacement = "")
names(pmcid_dbgap_sentences) <- all_text_ids
keep_pmcid_dbgap_sentences <-
purrr::discard(
pmcid_dbgap_sentences,
~rlang::is_empty(.x)
)
# Convert to data frame
sentences_df <-
purrr::imap(keep_pmcid_dbgap_sentences,
~data.frame(pmcid = .y,
sentence = unique(.x),
stringsAsFactors = FALSE)
)|>
dplyr::bind_rows()
# Extract all dbGaP IDs from the sentences
sentences_df <- sentences_df |>
  mutate(
    dbgap_id = str_extract_all(sentence, patterns[["dbgap"]]),
    ega_id = str_extract_all(sentence, patterns[["ega"]]),
    jgas_id = str_extract_all(sentence, patterns[["jgas"]]),
    prj_id = str_extract_all(sentence, patterns[["prj"]])
  )
sentences_df =
sentences_df |>
distinct()
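Note that `str_extract_all()` returns a list-column (one character vector per sentence, possibly empty), which is why the ids are `unnest()`ed later on. A minimal illustration with synthetic data:

``` r
library(stringr)
library(tidyr)

demo <- data.frame(sentence = c("See phs000007 and phs000090.",
                                "No accession here."))
demo$dbgap_id <- str_extract_all(demo$sentence, "\\bphs\\d+")
demo$dbgap_id
#> [[1]]
#> [1] "phs000007" "phs000090"
#>
#> [[2]]
#> character(0)

# unnest() expands the list-column to one row per id and
# drops the row with no matches
tidyr::unnest(demo, dbgap_id)
```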
print("Number of unique PMCIDs with dbgap / ega ids found in full text:")
[1] "Number of unique PMCIDs with dbgap / ega ids found in full text:"
sentences_df$pmcid |> unique() |> length()
[1] 122
# number of pubmed ids with dbgap ids found
print("Number of unique PMCIDs with dbgap ids found in full text:")
[1] "Number of unique PMCIDs with dbgap ids found in full text:"
sentences_df %>%
  rowwise() %>%
  filter(length(dbgap_id) > 0) %>%
  ungroup() %>%
  pull(pmcid) %>%
  unique() %>%
  length()
[1] 103
print("Number of unique PMCIDs with EGA ids found in full text:")
[1] "Number of unique PMCIDs with EGA ids found in full text:"
sentences_df %>%
  rowwise() %>%
  filter(length(ega_id) > 0) %>%
  ungroup() %>%
  pull(pmcid) %>%
  unique() %>%
  length()
[1] 17
convert_pmid_df <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")) |>
rename(PMCID = pmcids)
# add pmid column; where a file is keyed by PMCID with no PMID mapping,
# fall back to the PMCID, then keep only true PMC ids in the pmcid column
sentences_df =
left_join(sentences_df,
convert_pmid_df |>
dplyr::select(pmcid = PMCID,
pmid = PMID),
by = "pmcid"
) |>
mutate(pmid = ifelse(is.na(pmid),
pmcid,
pmid
)
) |>
mutate(pmcid = ifelse(grepl("PMC", pmid),
pmid,
pmcid)
) |>
mutate(pmcid = ifelse(grepl("PMC", pmcid),
pmcid,
""
)
)
sentences_df =
left_join(sentences_df |>
select(-pmcid) |>
mutate(pmid = as.integer(pmid)),
convert_pmid_df |>
dplyr::select(pmcid = PMCID,
pmid = PMID),
by = "pmid"
)
# typo in text for PMC8486151:
# replace the truncated "phs000" with
# "phs00056.v1.p1"
sentences_df |> filter(pmcid == "PMC8486151") |> pull(dbgap_id)
[[1]]
[1] "phs000090.v1.p1" "phs000810.v1.p1" "phs000293.v1" "phs000"
[5] "phs000200.v1"
fix_dbgap_ids <-
c(sentences_df |> filter(pmcid == "PMC8486151") |> pull(dbgap_id),
"phs00056.v1.p1"
) |>
unlist()
fix_dbgap_ids <- fix_dbgap_ids[fix_dbgap_ids != "phs000"]
sentences_df <-
sentences_df |>
mutate(dbgap_id = ifelse(pmcid == "PMC8486151",
list(fix_dbgap_ids),
dbgap_id
)
)
data.table::fwrite(sentences_df,
here::here("output/gwas_cat/gwas_study_dbgap_ega_sentences.csv")
)
dbgap_info <-
data.table::fread(here::here("data/europe_pmc/dbgap.csv"))
pmids <- convert_pmid_df$PMID
pmcids <- convert_pmid_df$PMCID
relevant_pmid_dbgaps = dbgap_info |>
filter(EXTID %in% pmids) |>
distinct()
print("Number of unique PMIDs with dbgap ids found in Europe PMC:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC:"
relevant_pmid_dbgaps |>
pull(EXTID) |>
unique() |>
length()
[1] 59
relevant_pmcids_dbgaps = dbgap_info |>
filter(PMCID %in% pmcids) |>
filter(PMCID != "") |>
distinct()
print("Number of unique PMCIDs with dbgap ids found in Europe PMC:")
[1] "Number of unique PMCIDs with dbgap ids found in Europe PMC:"
relevant_pmcids_dbgaps |>
pull(PMCID) |>
unique() |>
length()
[1] 59
# overlap between pmid and pmcid dbgap annotations in europe pmc
print("Number of unique PMIDs with dbgap ids found in Europe PMC that also have dbgap ids found in full text:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC that also have dbgap ids found in full text:"
unique(sentences_df$pmid) %in% unique(relevant_pmid_dbgaps$EXTID) |>
sum()
[1] 57
# overlap between full-text extraction and europe pmc annotations
# 57 / 59 = 97% of pmids with dbgap ids in europe pmc annotations are also found in full text
print("Number of unique PMIDs with dbgap ids found in full text, but not in Europe PMC:")
[1] "Number of unique PMIDs with dbgap ids found in full text, but not in Europe PMC:"
setdiff(unique(sentences_df$pmid),
unique(relevant_pmid_dbgaps$EXTID)) |> length()
[1] 65
# 65 / 122 = 53% of pmids with ids found in full text are not found in europe pmc annotations
print("Number of unique PMIDs with dbgap ids found in Europe PMC that do not have dbgap ids found in full text:")
[1] "Number of unique PMIDs with dbgap ids found in Europe PMC that do not have dbgap ids found in full text:"
setdiff(unique(relevant_pmid_dbgaps$EXTID),
unique(sentences_df$pmid)) |> length()
[1] 2
full_text_dbgaps =
  sentences_df |>
  select(dbgap_id, pmid, pmcid) |>
  tidyr::unnest(dbgap_id) |>
  mutate(dbgap_id = str_trim(dbgap_id)) |>
  # keep only the base accession, dropping version / participant-set suffixes
  mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
  group_by(pmid, pmcid) |>
  summarise(dbgap_full_text = str_flatten(unique(sort(dbgap_id)),
                                          collapse = ", ",
                                          na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
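The `summarise()` messages here (and below) can be silenced by stating the grouping intent explicitly via `.groups`. For example, with the built-in `mtcars` data:

``` r
library(dplyr)

# .groups = "drop" silences the message and returns an ungrouped tibble
mtcars |>
  group_by(cyl, gear) |>
  summarise(n = n(), .groups = "drop")
```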
europe_pmc_dbgaps =
relevant_pmid_dbgaps |>
select(dbgap_id = dbgap,
pmcid = PMCID,
pmid = EXTID) |>
group_by(pmid, pmcid) |>
summarise(dbgap_epmc = str_flatten(unique(sort(dbgap_id)),
collapse = ", ",
na.rm = TRUE))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
joined <-
full_join(full_text_dbgaps,
europe_pmc_dbgaps,
by = c("pmid", "pmcid")
)
# includes more dbgap ids from full text than europe pmc annotations
joined |>
filter(!is.na(dbgap_full_text) & !is.na(dbgap_epmc)) |>
filter(dbgap_full_text != dbgap_epmc)
# A tibble: 10 × 4
# Groups: pmid [10]
pmid pmcid dbgap_full_text dbgap_epmc
<chr> <chr> <chr> <chr>
1 26301688 PMC4863040 phs000019, phs000091, phs000092, phs000125, … phs000019…
2 27790247 PMC5061751 phs000007, phs000090, phs000209, phs000280, … phs000007…
3 28044437 PMC5476850 phs000007, phs000342 phs000007
4 28604730 PMC5510465 phs000876, phs001273 phs000876
5 28838971 PMC5652606 phs000424, phs001375, phs001388, phs001393, … phs000424
6 30510157 PMC6277418 phs000424, phs000930 phs000424
7 34727735 PMC8885789 phs000964, phs000974, phs001211, phs001237, … phs000964…
8 35769078 PMC9234217 phs000090, phs000342 phs000090
9 36929942 PMC10248849 phs000178, phs000187, phs000196, phs000206, … phs000178…
10 40210677 PMC11985957 phs001672, phs003894 phs001672
joined |>
filter(!is.na(dbgap_full_text) & is.na(dbgap_epmc))
# A tibble: 46 × 4
# Groups: pmid [46]
pmid pmcid dbgap_full_text dbgap_epmc
<chr> <chr> <chr> <chr>
1 17903298 "PMC1995610" phs000007 <NA>
2 17903303 "PMC1995605" phs000007 <NA>
3 19458352 "PMC2857316" phs000183 <NA>
4 19875614 "PMC2809960" phs000086 <NA>
5 25102180 "PMC4125087" phs000200 <NA>
6 26634245 "PMC4668640" phs000179 <NA>
7 26959717 "" phs000346 <NA>
8 28043905 "PMC5367948" phs000346 <NA>
9 28346466 "PMC5367689" phs000090, phs000209, phs000280, phs000284,… <NA>
10 29404672 "PMC5876265" phs000086 <NA>
# ℹ 36 more rows
# extraction looks good for these spot-checked papers:
# pubmed id: 39024449
# pubmed id: 26959717
# pubmed id: 31537701
# pubmed id: 37167549
# pubmed id: 30207284
# pubmed id: 33135175
# pubmed id: 37069358
joined |>
filter(is.na(dbgap_full_text) & !is.na(dbgap_epmc))
# A tibble: 2 × 4
# Groups: pmid [2]
pmid pmcid dbgap_full_text dbgap_epmc
<chr> <chr> <chr> <chr>
1 22384361 PMC3276159 <NA> phs000292
2 31697830 PMC7236628 <NA> phs000424
Sys.getenv("ENTREZ_KEY")
get_internal_dbgap_ids = function(pubmed_id) {
Sys.sleep(0.1)
dbgap_links <- entrez_link(
dbfrom = "pubmed",
db = "gap",
id = pubmed_id
)
dbgap_ids <- unlist(dbgap_links$links$pubmed_gap)
if(is.null(dbgap_ids)){
dbgap_ids = NA
}
return(data.frame(PUBMED_ID = pubmed_id,
DBGAP_ID = str_flatten(dbgap_ids,
collapse = ",",
na.rm = TRUE)
)
)
}
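Entrez calls can fail transiently (rate limits, network blips). One way to make the lookup more robust (a sketch with a hypothetical `with_retries` helper; the original pipeline calls `entrez_link()` directly) is:

``` r
# retry an Entrez call a few times before giving up
with_retries <- function(fn, times = 3, wait = 1) {
  for (i in seq_len(times)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    Sys.sleep(wait * i)  # simple linear backoff
  }
  stop(result)
}

# usage:
# dbgap_links <- with_retries(function() {
#   entrez_link(dbfrom = "pubmed", db = "gap", id = pubmed_id)
# })
```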
pubmed_dbgap_mapping <-
purrr::map(pmids,
get_internal_dbgap_ids,
.progress = TRUE
) |>
dplyr::bind_rows()
data.table::fwrite(pubmed_dbgap_mapping,
here::here("output/gwas_cohorts/gwas_study_dbgap_ids.csv")
)
# load previously saved mapping
pubmed_dbgap_mapping <- data.table::fread(
here::here("output/gwas_cohorts/gwas_study_dbgap_ids.csv")
)
n_with_dbgap = pubmed_dbgap_mapping |>
dplyr::filter(DBGAP_ID != "") |>
nrow()
print("Number of unique PMIDs with dbgap ids found via Entrez:")
[1] "Number of unique PMIDs with dbgap ids found via Entrez:"
n_with_dbgap
[1] 64
percent_with_dbgap = n_with_dbgap / nrow(pubmed_dbgap_mapping) * 100
print("Percentage of PMIDs with dbgap ids found via Entrez:")
[1] "Percentage of PMIDs with dbgap ids found via Entrez:"
percent_with_dbgap
[1] 7.729469
entrez_dbgap_ids = pubmed_dbgap_mapping$DBGAP_ID
entrez_dbgap_ids = entrez_dbgap_ids[entrez_dbgap_ids != ""]
entrez_dbgap_ids = entrez_dbgap_ids |>
strsplit(",") |>
unlist() |>
unique()
print("Number of unique internal dbgap ids to retrieve accessions for:")
print(length(entrez_dbgap_ids))
get_dbgap_accession = function(internal_dbgap_id) {
summary = entrez_summary(db = "gap",
id = internal_dbgap_id)
accession = summary$d_study_results$d_study_id
return(data.frame(
INTERNAL_DBGAP_ID = internal_dbgap_id,
DBGAP_ACCESSION = str_flatten(accession,
collapse = ",",
na.rm = TRUE)
)
)
}
dbgap_accession_mapping <-
purrr::map(entrez_dbgap_ids,
get_dbgap_accession,
.progress = TRUE
) |>
dplyr::bind_rows()
data.table::fwrite(
dbgap_accession_mapping,
here::here("output/gwas_cohorts/gwas_study_dbgap_accessions.csv")
)
dbgap_accession_mapping <- data.table::fread(
here::here("output/gwas_cohorts/gwas_study_dbgap_accessions.csv")
)
# many internal dbgap ids could not be resolved to study accessions,
# perhaps because they point to old dbgap versions?
print("Number of internal dbgap ids for which we could not retrieve accessions:")
[1] "Number of internal dbgap ids for which we could not retrieve accessions:"
dbgap_accession_mapping |>
filter(DBGAP_ACCESSION == "") |>
nrow()
[1] 28
pubmed_dbgap_mapping =
pubmed_dbgap_mapping |>
tidyr::separate_longer_delim(cols = "DBGAP_ID",
delim = ",")
dbgap_accession_mapping =
dbgap_accession_mapping |>
tidyr::separate_longer_delim(cols = "INTERNAL_DBGAP_ID",
delim = ",")
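As a small illustration (synthetic data, not from the pipeline), `separate_longer_delim()` expands each comma-delimited string into one row per id:

``` r
library(tidyr)

df <- data.frame(PUBMED_ID = 123,
                 DBGAP_ID = "1001,1002")
tidyr::separate_longer_delim(df, cols = "DBGAP_ID", delim = ",")
# two rows: (123, "1001") and (123, "1002")
```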
dbgap_accession_mapping =
left_join(pubmed_dbgap_mapping,
dbgap_accession_mapping,
by = c("DBGAP_ID" = "INTERNAL_DBGAP_ID")
)
print("Number of unique PMIDs with dbgap accessions found via Entrez:")
[1] "Number of unique PMIDs with dbgap accessions found via Entrez:"
dbgap_accession_mapping |>
filter(DBGAP_ACCESSION != "" & DBGAP_ID != "") |>
pull(PUBMED_ID) |>
unique() |>
length()
[1] 34
combined_dbgap_pubmed_mapping =
dbgap_accession_mapping |>
filter(DBGAP_ACCESSION != "" & DBGAP_ID != "") |>
select(PUBMED_ID, DBGAP_ACCESSION) |>
distinct()
data.table::fwrite(
combined_dbgap_pubmed_mapping,
here::here("output/gwas_cohorts/gwas_study_combined_dbgap_pubmed_mapping.csv")
)
combined_dbgap_pubmed_mapping <- data.table::fread(
here::here("output/gwas_cohorts/gwas_study_combined_dbgap_pubmed_mapping.csv")
)
entrez_dbgaps <-
combined_dbgap_pubmed_mapping |>
rename(pmid = PUBMED_ID,
dbgap_id = DBGAP_ACCESSION
) |>
mutate(dbgap_id = str_extract(dbgap_id, "^phs\\d+")) |>
filter(dbgap_id != "") |>
group_by(pmid) |>
summarise(dbgap_annotation = str_flatten(unique(sort(dbgap_id)),
collapse = ", ",
na.rm = TRUE)) |>
mutate(pmid = as.character(pmid))
joined_dbgaps =
full_join(entrez_dbgaps,
full_text_dbgaps,
by = "pmid")
print("Number of PMIDs with dbgap annotations from Entrez that do not match dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez that do not match dbgap ids found in full text:"
joined_dbgaps |>
filter(!is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
filter(dbgap_annotation != dbgap_full_text)
# A tibble: 5 × 4
pmid dbgap_annotation pmcid dbgap_full_text
<chr> <chr> <chr> <chr>
1 18978792 phs000180 PMC2635556 phs000018
2 19430480 phs000180, phs000910, phs000911 PMC2889014 phs000018
3 22101970 phs000888, phs000984 PMC3277617 phs000170, phs000188, phs…
4 25102180 phs000888, phs000925 PMC4125087 phs000200
5 25751624 phs000911 PMC4380767 phs000180
print("Number of PMIDs with dbgap annotations from Entrez that match dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez that match dbgap ids found in full text:"
joined_dbgaps |>
filter(!is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
filter(dbgap_annotation == dbgap_full_text) |>
nrow()
[1] 0
print("Number of PMIDs with dbgap annotations from Entrez but no dbgap ids found in full text:")
[1] "Number of PMIDs with dbgap annotations from Entrez but no dbgap ids found in full text:"
joined_dbgaps |>
filter(!is.na(dbgap_annotation) & is.na(dbgap_full_text)) |>
nrow()
[1] 29
print("Number of PMIDs with dbgap found in full text but not in Entrez annotations:")
[1] "Number of PMIDs with dbgap found in full text but not in Entrez annotations:"
joined_dbgaps |>
filter(is.na(dbgap_annotation) & !is.na(dbgap_full_text)) |>
nrow()
[1] 98
# europe_pmc_dbgaps
# entrez_dbgaps
# full_text_dbgaps
full_text_dbgaps =
  sentences_df |>
  select(dbgap_id, pmid, pmcid) |>
  tidyr::unnest(dbgap_id) |>
  mutate(dbgap_id = str_trim(dbgap_id)) |>
  # note: unlike above, keep the full versioned accessions here
  group_by(pmid, pmcid) |>
  summarise(dbgap_full_text = str_flatten(unique(sort(dbgap_id)),
                                          collapse = ", ",
                                          na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
europe_pmc_dbgaps =
relevant_pmid_dbgaps |>
select(dbgap_id = dbgap,
pmcid = PMCID,
pmid = EXTID) |>
group_by(pmid, pmcid) |>
summarise(dbgap_epmc = str_flatten(unique(sort(dbgap_id)),
collapse = ", ",
na.rm = TRUE))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
entrez_dbgaps <-
  combined_dbgap_pubmed_mapping |>
  rename(pmid = PUBMED_ID,
         dbgap_id = DBGAP_ACCESSION
         ) |>
  # keep the full versioned accessions here as well
  filter(dbgap_id != "") |>
  group_by(pmid) |>
  summarise(dbgap_annotation = str_flatten(unique(sort(dbgap_id)),
                                           collapse = ", ",
                                           na.rm = TRUE)) |>
  mutate(pmid = as.character(pmid))
combined_annotations <-
full_join(full_text_dbgaps,
europe_pmc_dbgaps,
by = c("pmid", "pmcid")
) |>
full_join(entrez_dbgaps,
by = c("pmid")
) |>
tidyr::separate_longer_delim(cols = c(dbgap_full_text),
delim = ", ")|>
tidyr::separate_longer_delim(cols = c(dbgap_epmc),
delim = ", ")|>
tidyr::separate_longer_delim(cols = c(dbgap_annotation),
delim = ", ")|>
tidyr::pivot_longer(cols = c(dbgap_full_text, dbgap_epmc, dbgap_annotation),
names_to = "source",
values_to = "dbgap_id") |>
filter(!is.na(dbgap_id) & dbgap_id != "") |>
mutate(source = stringr::str_remove_all(source, "dbgap_")) |>
mutate(source = ifelse(source == "annotation", "entrez", source)) |>
distinct()
combined_annotations$pmid |> unique() |> length()
[1] 134
n_pubmeds_per_dbgap =
combined_annotations |>
mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")) |>
select(pmid, broad_dbgap) |>
distinct() |>
group_by(broad_dbgap) |>
summarise(n_pubmeds = n()) |>
arrange(desc(n_pubmeds))
n_pubmeds_per_dbgap |>
head(
)
# A tibble: 6 × 2
broad_dbgap n_pubmeds
<chr> <int>
1 phs000424 11
2 phs001273 8
3 phs000007 7
4 phs000180 7
5 phs001672 7
6 phs000179 6
n_pubmeds_per_dbgap$n_pubmeds |> summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 2.039 2.000 11.000
cohort_dbgap_mapping <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
sheet = 1)
New names:
• `` -> `...15`
annotated_dbgap_ids =
combined_annotations |>
pull(dbgap_id) |>
str_trim() |>
str_extract(pattern = "^phs\\d+") |>
unique()
cohort_desc_dbgap_ids =
cohort_dbgap_mapping |>
pull(dbGaP) |>
strsplit(",") |>
unlist() |>
str_trim() |>
str_extract(pattern = "^phs\\d+") |>
unique()
cohort_desc_dbgap_ids_p2 =
cohort_dbgap_mapping |>
pull(old_dbGaP) |>
strsplit(",") |>
unlist() |>
str_trim() |>
str_extract(pattern = "^phs\\d+") |>
unique()
cohort_desc_dbgap_ids <- unique(c(cohort_desc_dbgap_ids, cohort_desc_dbgap_ids_p2))
not_found_dbgap_ids <- annotated_dbgap_ids[!c(annotated_dbgap_ids %in% cohort_desc_dbgap_ids)]
not_found_dbgap_ids <- sort(not_found_dbgap_ids)
not_found_dbgap_ids |> length()
[1] 0
not_found_dbgap_ids
character(0)
get_dbgap_cohort_names <-
cohort_dbgap_mapping |>
tidyr::separate_longer_delim(cols = c(dbGaP),
delim = ",") |>
tidyr::separate_longer_delim(cols = c(old_dbGaP),
delim = ",") |>
tidyr::pivot_longer(cols = c(dbGaP, old_dbGaP),
names_to = "source",
values_to = "dbgap_id") |>
select(cohort, dbgap_id) |>
mutate(dbgap_id = str_trim(dbgap_id)) |>
mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")) |>
filter(!is.na(dbgap_id)) |>
select(-dbgap_id) |>
distinct()
get_dbgap_cohort_names |>
group_by(broad_dbgap) |>
summarise(n = n(),
cohorts = stringr::str_flatten(cohort, collapse = ", ", na.rm = T)) |>
filter(n > 1)
# A tibble: 30 × 3
broad_dbgap n cohorts
<chr> <int> <chr>
1 phs000086 4 DCCT, DCCT/EDIC, EDIC, TEDDY
2 phs000091 3 HPFS, NHS, NHS/HPFS
3 phs000171 2 GeneMSA, PMRP
4 phs000181 2 OZ-ALC, OZALC
5 phs000215 2 BLSA, InCHIANTI
6 phs000221 3 FamHS, FamHS-Visit1, FamHS-Visit2
7 phs000237 2 NUGENE, NUgene
8 phs000280 2 ARIC-CARe, CARe
9 phs000303 2 KORA, KORA-gen
10 phs000404 2 COGEND, UW-TTURC
# ℹ 20 more rows
get_dbgap_cohort_names <-
get_dbgap_cohort_names |>
group_by(broad_dbgap) |>
summarise(cohort = stringr::str_flatten(cohort, collapse = "|", na.rm = T))
dbgap_cohort_names =
left_join(
combined_annotations |>
mutate(broad_dbgap = str_extract(dbgap_id, "^phs\\d+")),
get_dbgap_cohort_names,
by = c("broad_dbgap")
)
``` r
data.table::fwrite(dbgap_cohort_names,
                   here::here("output/gwas_cohorts/gwas_study_dbgap_cohort_names.csv")
)
# EGA ids
``` r
ega_info <-
data.table::fread(here::here("data/europe_pmc/ega.csv"))
full_text_ega_ids =
sentences_df |>
select(ega_id, pmid, pmcid) |>
tidyr::unnest(ega_id) |>
mutate(ega_id = str_trim(ega_id)) |>
group_by(pmid, pmcid) |>
summarise(ega_full_text = str_flatten(unique(sort(ega_id)),
collapse = ", ",
na.rm = TRUE)) |>
mutate(pmid = as.character(pmid))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
epmc_ega_ids =
ega_info |>
filter(EXTID %in% pmids) |>
select(ega_id = ega,
pmid = EXTID,
pmcid = PMCID) |>
group_by(pmid, pmcid) |>
summarise(ega_epmc = str_flatten(unique(sort(ega_id)),
collapse = ", ",
na.rm = TRUE))
`summarise()` has grouped output by 'pmid'. You can override using the
`.groups` argument.
joined_ega =
full_join(full_text_ega_ids,
epmc_ega_ids,
by = c("pmid", "pmcid")
)
joined_ega |>
filter(!is.na(ega_full_text) & !is.na(ega_epmc)) |>
filter(ega_full_text != ega_epmc)
# A tibble: 1 × 4
# Groups: pmid [1]
pmid pmcid ega_full_text ega_epmc
<chr> <chr> <chr> <chr>
1 39939757 PMC11903302 EGAD00010002057, EGAD50000000933, EGAD500000009… EGAD000…
# the europe pmc EGA annotations are a subset of the ega ids extracted from full text, which is good
get_ega_cohort_names <-
cohort_dbgap_mapping |>
tidyr::separate_longer_delim(cols = c(EGA),
delim = ",") |>
tidyr::pivot_longer(cols = c(EGA),
names_to = "source",
values_to = "ega_id") |>
select(cohort, ega_id) |>
mutate(ega_id = str_trim(ega_id)) |>
filter(!is.na(ega_id)) |>
group_by(ega_id) |>
summarise(cohort = stringr::str_flatten(cohort, collapse = "|", na.rm = T)) |>
distinct()
joined_ega <-
joined_ega |>
tidyr::pivot_longer(cols = c(ega_full_text, ega_epmc),
names_to = "source",
values_to = "ega_id") |>
tidyr::separate_longer_delim(cols = c(ega_id),
delim = ", ") |>
filter(!is.na(ega_id) & ega_id != "")
ega_cohort_names =
left_join(joined_ega,
get_ega_cohort_names,
by = "ega_id"
)
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] tokenizers_0.3.0 stringr_1.6.0 purrr_1.1.0 rentrez_1.2.4
[5] data.table_1.17.8 dplyr_1.1.4 workflowr_1.7.2
loaded via a namespace (and not attached):
[1] sass_0.4.10 utf8_1.2.6 generics_0.1.4
[4] tidyr_1.3.1 renv_1.1.8 xml2_1.4.0
[7] stringi_1.8.7 digest_0.6.37 magrittr_2.0.4
[10] evaluate_1.0.5 fastmap_1.2.0 cellranger_1.1.0
[13] rprojroot_2.1.0 jsonlite_2.0.0 processx_3.8.6
[16] whisker_0.4.1 ps_1.9.1 promises_1.3.3
[19] BiocManager_1.30.26 httr_1.4.7 XML_3.99-0.19
[22] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6
[25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
[28] tools_4.3.1 httpuv_1.6.16 here_1.0.1
[31] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
[34] git2r_0.36.2 fs_1.6.6 pkgconfig_2.0.3
[37] callr_3.7.6 pillar_1.11.1 bslib_0.9.0
[40] later_1.4.4 glue_1.8.0 Rcpp_1.1.0
[43] xfun_0.55 tibble_3.3.0 tidyselect_1.2.1
[46] rstudioapi_0.17.1 knitr_1.50 htmltools_0.5.8.1
[49] SnowballC_0.7.1 rmarkdown_2.30 compiler_4.3.1
[52] getPass_0.2-4 readxl_1.4.5