Last updated: 2026-03-30
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 7a17bd6. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: BC2GM/
Ignored: BioC.dtd
Ignored: FormatConverter.jar
Ignored: FormatConverter.zip
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: code/.DS_Store
Ignored: code/full_text_conversion/.DS_Store
Ignored: data/.DS_Store
Ignored: data/RCDCFundingSummary_01042026.xlsx
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/epmc/
Ignored: data/europe_pmc/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/gbd_2019_california_percent_deaths.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: doccano_venv/
Ignored: figures/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/pubmedbert_entity_predictions.csv
Ignored: output/pubmedbert_entity_predictions.jsonl
Ignored: output/pubmedbert_predictions.csv
Ignored: output/pubmedbert_predictions.jsonl
Ignored: output/supplement/
Ignored: output/text_mining_predictions/
Ignored: output/trait_ontology/
Ignored: population_description_terms.txt
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: renv/
Ignored: spacy_venv_requirements.txt
Ignored: spacyr_venv/
Untracked files:
Untracked: code/full_text_conversion/html_to_xml.R
Untracked: code/text_mining_models/tokenise_data.py
Unstaged changes:
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/gwas_to_gbd.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/specific_aims_stats.Rmd
Modified: analysis/text_for_cohort_labels.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/get_full_text.Rmd) and
HTML (docs/get_full_text.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 7a17bd6 | IJbeasley | 2026-03-30 | Fix springer nature download bug |
| html | a1dd9e2 | IJbeasley | 2026-03-30 | Build site. |
| Rmd | ecea90d | IJbeasley | 2026-03-30 | More comphrensive full text download |
| html | a80d8cb | IJbeasley | 2026-03-25 | Build site. |
| Rmd | 2585a14 | IJbeasley | 2026-03-25 | Update full text downloading |
| html | 5e29024 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | e7de25d | IJbeasley | 2026-02-04 | Adding totals + to manually review |
| html | 5c9d397 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | 456acb1 | IJbeasley | 2026-02-04 | Fixing percentage downloaded stats |
| html | 55f6763 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | c0dc676 | IJbeasley | 2026-02-04 | Getting full text from publisher APIs |
| html | 1898c02 | IJbeasley | 2026-02-04 | Build site. |
| Rmd | d214580 | IJbeasley | 2026-02-04 | Getting full text from publisher APIs |
| html | 6ba1e1f | IJbeasley | 2026-01-12 | Build site. |
| Rmd | b43e9a9 | IJbeasley | 2026-01-12 | Update getting full text |
| html | ac0d1a7 | IJbeasley | 2025-10-27 | Build site. |
| html | 8642872 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | da4d730 | IJbeasley | 2025-10-27 | Now run on all texts |
| html | fb5cfd9 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 8ed4c37 | IJbeasley | 2025-10-27 | Now run on all texts |
| html | 8610283 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 7d504e3 | IJbeasley | 2025-10-27 | More fixing of download full text |
| html | 16f4c19 | IJbeasley | 2025-10-27 | Build site. |
| Rmd | 3df4096 | IJbeasley | 2025-10-27 | Update + improve full text downloading - test run |
| html | 1439951 | IJbeasley | 2025-10-24 | Build site. |
| Rmd | 481aebe | IJbeasley | 2025-10-24 | Update code for getting full texts |
library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)
## Step 1:
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))
gwas_study_info = gwas_study_info |>
dplyr::rename_with(~ gsub(" ", "_", .x))
# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
dplyr::filter(!cause %in% c("HIV/AIDS",
"Tuberculosis",
"Malaria",
"Lower respiratory infections",
"Diarrhoeal diseases",
"Neonatal disorders",
"Tetanus",
"Diphtheria",
"Pertussis" ,
"Measles",
"Maternal disorders")) |>
dplyr::filter(cause != "")
# gwas_study_info <- gwas_study_info |>
# dplyr::filter(DISEASE_STUDY == TRUE)
print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
all_pmids <- unique(gwas_study_info$PUBMED_ID)
length(all_pmids)
[1] 828
# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
convert_pmid_df <- convert_pmid_df |>
dplyr::rename(pmcids = PMCID
) |>
dplyr::mutate(pmcids = ifelse(is.na(pmcids),
"",
pmcids
)
)
convert_pmid_df <-
convert_pmid_df |>
dplyr::filter(!is.na(PMID))
converted_ids =
convert_pmid_df |>
filter(PMID %in% all_pmids)
data.table::fwrite(converted_ids,
here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")
)
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))
print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
PMID pmcids DOI
<int> <char> <char>
1: 17223258 https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876 https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697 https://doi.org/10.1038/ng2043
5: 17463246 https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617 https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 828 3
length(all_pmids)
[1] 828
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <-
converted_ids |>
filter(pmcids == "") |>
pull(PMID)
print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 175
pmcids <-
converted_ids$pmcids |>
unique()
pmcids <- pmcids[pmcids != ""]
print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 653
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(all_pmids), digits = 2)
[1] 78.86
Downloading full text xmls from the Europe PMC RESTful API requires PMCIDs. Thus, this step can only be applied to papers with a PMCID.
# Function to download full text xml from Europe PMC Restful API
download_pmc_text <- function(pmcid,
out_dir = here::here("output/fulltexts/europe_pmc/")
) {
# check if file already exists
if(file.exists(paste0(out_dir, pmcid, ".xml"))){
return(TRUE)
}
url_xml <- paste0("https://www.ebi.ac.uk/",
"europepmc/webservices/rest/",
pmcid,
"/fullTextXML"
)
resp <- GET(url_xml)
# ---- Fallback URL ----
if(status_code(resp) != 200){
url_xml <- paste0("https://europepmc.org/",
"oai.cgi?verb=GetRecord",
"&metadataPrefix=pmc",
"&identifier=oai:europepmc.org:",
pmcid)
resp <- GET(url_xml)
}
# ---- Fail if still bad ----
if(status_code(resp) != 200){
return(NULL)
}
# ---- Parse XML ----
xml_content <- read_xml(
content(resp,
as = "text",
encoding = "UTF-8")
)
article_node = xml_find_first(xml_content,
"//*[local-name() = 'article']"
)
if (is.na(article_node)) {
message("No <article> node found for ", pmcid)
return(NULL)
}
# --- Save ---
write_xml(article_node,
paste0(out_dir, pmcid, ".xml")
)
}
for(article in pmcids[pmcids != ""]){
download_pmc_text(article)
}
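To stay well within Europe PMC’s rate limits, the same loop can pause briefly between requests (a sketch; the 0.5 s delay is an assumption, not part of the original run):
for(article in pmcids){
  download_pmc_text(article)
  Sys.sleep(0.5) # assumed politeness delay between API calls
}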
euro_pmcids <-list.files(here::here("output/fulltexts/europe_pmc/"),
pattern = "\\.xml$")
euro_pmcids <- gsub("\\.xml$",
"",
euro_pmcids
)
euro_pmcids <- pmcids[pmcids %in% euro_pmcids]
n_euro_pmc <- length(euro_pmcids)
print("Number of downloaded full text files from European PMC:")
[1] "Number of downloaded full text files from European PMC:"
print(n_euro_pmc)
[1] 435
print("Percentage of pmids with full text from European PMC:")
[1] "Percentage of pmids with full text from European PMC:"
round(100 * n_euro_pmc / length(all_pmids), digits = 2)
[1] 52.54
For the remaining PMCIDs / PMIDs without full text, try downloading from the PMC Open Access datasets hosted on AWS (the NCBI cloud service).
# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text - non-commercial use
# oa_noncomm
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
# get list of pmcids with full text, other list (no machine-readable license)
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/ --no-sign-request
europeanpmc_full_texts <-
list.files(here::here("output/fulltexts/europe_pmc"),
pattern = "\\.xml"
)
# get pmcids of these files
europeanpmc_full_texts <-
gsub("\\.xml$",
"",
europeanpmc_full_texts
)
left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]
print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 218
print("+ Number of pmids without pmcid mapping:")
[1] "+ Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 175
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))
oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))
oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))
author_manu_to_get <-
author_manu |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 136
oa_noncomm_to_get =
oa_noncomm |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
# remove any overlaps between sections
oa_noncomm_to_get <-
oa_noncomm_to_get |>
dplyr::filter(!c(PMID %in% author_manu_to_get$PMID))
print("Number of additional papers to download in the Non-commericial Open Access PMC section:")
[1] "Number of additional papers to download in the Non-commericial Open Access PMC section:"
nrow(oa_noncomm_to_get)
[1] 0
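The overlap-removal filters in this and the next chunk can also be written with dplyr::anti_join, which states the intent more directly (an equivalent sketch, not the original code):
# keep only rows whose PMID does not already appear in the author-manuscript set
oa_noncomm_to_get <-
  oa_noncomm |>
  dplyr::filter(AccessionID %in% left_over_pmcids |
                  PMID %in% not_converted_pmids) |>
  dplyr::anti_join(author_manu_to_get, by = "PMID")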
oa_comm_to_get =
oa_comm |>
dplyr::filter(AccessionID %in% left_over_pmcids |
PMID %in% not_converted_pmids)
# remove any overlaps between sections
oa_comm_to_get <-
  oa_comm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID)) |>
  dplyr::filter(!c(PMID %in% oa_noncomm_to_get$PMID))
print("Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:")
[1] "Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps with Author Manuscripts:"
nrow(oa_comm_to_get)
[1] 3
file_paths =
c(oa_noncomm_to_get$Key,
oa_comm_to_get$Key,
author_manu_to_get$Key)
# the filelist Keys point at the plain-text copies; switch the paths to the
# corresponding XML versions before downloading
file_paths <- str_replace_all(file_paths,
                              pattern = "txt",
                              replacement = "xml")
writeLines(
file_paths,
here::here("output/fulltexts/aws_locations/selected_paths.txt")
)
system(
paste(
"xargs -I {} aws s3 cp",
"s3://pmc-oa-opendata/{}",
shQuote(here::here("output/fulltexts/ncbi_cloud/")),
"--no-sign-request",
"<",
shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
)
)
# not_available = left_over_pmcids[!c(left_over_pmcids %in%
# c(oa_noncomm_to_get$AccessionID,
# oa_comm_to_get$AccessionID,
# author_manu_to_get$AccessionID)
# )]
# get list of pmcids already retrieved
ncbi_pmcids_retrieved <-
list.files(c(#here::here("output/fulltexts/europe_pmc"),
here::here("output/fulltexts/ncbi_cloud/")
),
pattern = "\\.xml$"
)
ncbi_pmcids_retrieved <-
gsub("\\.xml$",
"",
ncbi_pmcids_retrieved
)
pmids_retrieved <-
converted_ids |>
filter(pmcids %in% c(ncbi_pmcids_retrieved, euro_pmcids)) |>
pull(PMID)
not_available <- all_pmids[!c(all_pmids %in% pmids_retrieved)]
print("Percentage of pmids with full text from NCBI Cloud Service:")
[1] "Percentage of pmids with full text from NCBI Cloud Service:"
100 * (length(all_pmids) - n_euro_pmc - length(not_available)) / length(all_pmids)
[1] 16.42512
print("Percentage of pmids without full text from either European PMC or NCBI Cloud Service:")
[1] "Percentage of pmids without full text from either European PMC or NCBI Cloud Service:"
100 * length(not_available) / length(all_pmids)
[1] 31.03865
doi_information <-
converted_ids |>
filter(PMID %in% not_available)
library(rcrossref)
library(httr)
# Get download links from Crossref
get_crossref_links <- function(doi) {
# Query Crossref for the article
works <- cr_works(dois = doi)
# keep links for xml or text-mining
links <- works$data$link[[1]]
if(is.null(links)){
link_data <- data.frame(doi = doi,
URL = NA,
content.type = NA,
content.version = NA,
intended.application = NA)
return(link_data)
}
links <-
links |>
filter(intended.application == "text-mining" | content.type == "application/xml"
)
if(nrow(links) == 0){
link_data <- data.frame(doi = doi,
URL = NA,
content.type = NA,
content.version = NA,
intended.application = NA)
} else{
link_data <-
data.frame(doi = doi,
links)
}
return(link_data)
}
# get all crossref links
# NB: this redefinition replaces the single-DOI helper above; cr_works()
# accepts a vector of DOIs, and the links are returned unfiltered
get_crossref_links <- function(doi) {
# Query Crossref for the article
works <- cr_works(dois = doi)
# keep links for xml or text-mining
links <- works$data$link
if(is.null(links)){
link_data <- data.frame(doi = doi,
URL = NA,
content.type = NA,
content.version = NA,
intended.application = NA)
return(link_data)
}
return(links)
}
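For a single DOI, this returns whatever link metadata Crossref holds for the article; a quick usage sketch, using a DOI from the mapping table above:
get_crossref_links("10.1038/nature05616") |>
  bind_rows()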
# elsevier dois:
elsevier_doi_patterns <- "10.1016|10.1053|10.1086|10.1194|10.1593|10.1097/jto.|10.1182|10.1038/sj.ki"
elsevier_dois <- grep(elsevier_doi_patterns,
doi_information$DOI,
value = TRUE
)
print("Number of papers potentially can get from Elsevier:")
[1] "Number of papers potentially can get from Elsevier:"
length(elsevier_dois)
[1] 42
elsevier_api_key <- Sys.getenv("ELSEVIER_API_KEY")
elsevier_doi_info <- str_remove_all(pattern = "https://doi.org/",
string = elsevier_dois)
# get pmids for elsevier dois
pmids_elsevier <- doi_information |>
filter(DOI %in% elsevier_dois) |>
mutate(DOI = str_remove_all(DOI,
pattern = "https://doi.org/"
)
) |>
rename_with(~tolower(.x))
# get elsevier full text links from crossref
elsevier_link_df <- purrr::map(elsevier_doi_info,
~get_crossref_links(.x)
) |>
bind_rows()
print("Number of Elsevier links retrieved from Crossref:")
nrow(elsevier_link_df)
print("Number of xml links retrieved from Elsevier links:")
elsevier_link_df |>
filter(content.type == "text/xml") |>
nrow()
elsevier_links <- elsevier_link_df |>
filter(!is.na(URL))
elsevier_links <- elsevier_links |>
left_join(pmids_elsevier,
by = c("doi")
)
# get only xml links
elsevier_links <-
elsevier_links |>
filter(content.type == "text/xml")
download_elsevier_text <- function(url,
api_key,
pmid,
out_dir = here::here("output/fulltexts/elsevier/elsevier_xml/")) {
# if(file.exists(paste0(out_dir, pmid, ".xml"))|file.exists(paste0(out_dir, pmid, ".txt"))
# ){
# return(TRUE)
# }
response <- GET(url,
add_headers("X-ELS-APIKey" = api_key)
)
# if (status_code(response) != 200) {
# message("Failed to fetch text for ", pmid)
# return(FALSE)
#
# }
ct <- headers(response)[["content-type"]]
#print(ct)
if(grepl("text/plain", ct)){
message("Received plain text for ", pmid,
" - skipping for now."
)
return(TRUE)
# text_content <- content(response, type = "text/plain")
#
# writeLines(text_content,
# paste0(out_dir, pmid, ".txt"),
# useBytes = TRUE)
} else {
xml_content <- content(response,
encoding = "UTF-8",
type = "text/xml")
article_node <- xml2::xml_find_first(
xml_content,
".//*[local-name()='originalText']"
)
xml2::write_xml(article_node,
file = paste0(out_dir, pmid, ".xml")
)
}
# writeLines(text_content,
# paste0(out_dir, pmid, ".txt"),
# useBytes = TRUE)
}
purrr::walk2(elsevier_links$URL,
elsevier_links$pmid,
~download_elsevier_text(url = .x,
api_key = elsevier_api_key,
pmid = .y)
)
Convert Elsevier xmls to JATS xml files
mkdir -p output/fulltexts/elsevier/xml
for file in output/fulltexts/elsevier/elsevier_xml/*.xml; do
filename=$(basename "$file")
Rscript code/full_text_conversion/elsevier_to_jats_v5.R "$file" "output/fulltexts/elsevier/xml/${filename%.xml}.xml"
done
print("Number of downloaded full text files (xml) from Elsevier:")
[1] "Number of downloaded full text files (xml) from Elsevier:"
list.files(here::here("output/fulltexts/elsevier/elsevier_xml/"),
pattern = "\\.xml$"
) |>
length()
[1] 42
Policies:
sage_doi_patterns <- "10.1177|10.1089"
sage_links <-
grep(sage_doi_patterns,
doi_information$DOI,
value = TRUE)
sage_links <- str_remove_all(pattern = "https://doi.org/",
string = sage_links)
sage_link_df <- purrr::map(sage_links,
~get_crossref_links(.x)) |>
bind_rows()
# then had to download manually using provided xml links
# to use institutional login details
# http://www.liebertpub.com/doi/full-xml/10.1089/omi.2017.0019
# https://journals.sagepub.com/doi/full-xml/10.1177/00220345211051967
# https://journals.sagepub.com/doi/full-xml/10.1177/0271678X211066299
# these are in JATS .xml format
# saved to output/fulltexts/sage
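# A sketch (not part of the original run) of assembling the full-xml links
# for the manual downloads above; note the Liebert DOI (10.1089) is hosted
# on liebertpub.com rather than journals.sagepub.com:
sage_xml_urls <- paste0("https://journals.sagepub.com/doi/full-xml/",
                        grep("^10\\.1177", sage_links, value = TRUE))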
length(sage_links)
print("Number of downloaded full text files (xml) from Sage:")
[1] "Number of downloaded full text files (xml) from Sage:"
length(list.files(here::here("output/fulltexts/sage"),
pattern = "\\.xml$"
)
)
[1] 3
springer_nature_links <-
  grep("nature|10\\.1038/|10\\.1007/|10\\.1245",
       doi_information$DOI,
       value = TRUE)
print("Number of papers potentially can get from Springer Nature:")
length(springer_nature_links)
springer_to_get <- doi_information %>%
filter(DOI %in% paste0("https://doi.org/",
springer_nature_links))
springer_nature_links <- springer_to_get$DOI |>
str_remove_all(pattern = "https://doi.org/")
springer_nature_pmids <- springer_to_get$PMID
check_springer_oa <- function(doi,
api_key,
pmids,
out_dir = here::here("output/fulltexts/springer_nature/")) {
# if(file.exists(paste0(out_dir, pmids, ".xml"))){
# return(data.frame(doi = doi,
# openaccess = TRUE)
# )
# }
query <- glue::glue('(doi:"{doi}")')
url <- modify_url(
"https://api.springernature.com/openaccess/jats",
query = list(
      api_key = api_key, # use the function argument (callers pass the env var)
callback = "",
s = 1,
p = 1,
q = query
)
)
# url<- paste0('https://api.springernature.com/openaccess/',
# 'jats?',
# 'api_key=', api_key,
# '&callback=&s=1&p=1',
# '&q=(doi:', '"', doi, '"', ")"
# )
  # url <- 'https://api.springernature.com/openaccess/jats?api_key=<API_KEY>&q=(doi="10.1038/nature05616")'
  # print(url) # avoid printing the request URL: it embeds the API key
response <- GET(url)
# if the request fails, return data.frame with doi and oa = F
if (status_code(response) != 200) {
return(data.frame(doi = doi,
openaccess = FALSE)
)
} else {
xml_content <- content(response)
    article_node <- xml2::xml_find_all(xml_content, ".//records")
    # treat a missing or empty <records> node as not open access
    if (length(article_node) == 0 || xml2::xml_text(article_node) == "") {
return(data.frame(doi = doi,
openaccess = FALSE)
)
}
}
xml2::write_xml(article_node,
paste0(out_dir, pmids, ".xml")
)
return(data.frame(doi = doi,
openaccess = TRUE)
)
}
check_springer_oa(doi = springer_nature_links[1],
api_key = Sys.getenv("NATURE_SPRINGER_OA_API_KEY"),
pmids = springer_nature_pmids[1]
)
oa_status <-
purrr::map2(springer_nature_links,
springer_nature_pmids,
~check_springer_oa(doi = .x,
api_key = Sys.getenv("NATURE_SPRINGER_OA_API_KEY"),
pmids = .y)
)
oa_status_df <- oa_status |> bind_rows()
oa_status_df |> group_by(openaccess) |>
summarise(n = n())
oa_status_df |>
filter(openaccess == FALSE)
oa_status_df |>
filter(openaccess == TRUE) |>
nrow()
auto-corpus -b NATURE_GENETICS -t "output/fulltexts/springer_nature" -f "output/fulltexts/springer_nature/html" -o XML
print("Number of downloaded full text files (xml) from Springer Nature:")
[1] "Number of downloaded full text files (xml) from Springer Nature:"
length(list.files(here::here("output/fulltexts/springer_nature"),
pattern = "\\.xml$"
)
)
[1] 75
print("Number of downloaded html files from Springer Nature:")
[1] "Number of downloaded html files from Springer Nature:"
length(list.files(here::here("output/fulltexts/springer_nature"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 72
Wiley Text & Data-mining Policy: https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining
These xmls are in Wiley’s proprietary XML format, not JATS.
wiley_dois <- grep("10\\.1002/|10\\.1111/",
doi_information$DOI,
value = TRUE)
wiley_dois <- str_remove_all(wiley_dois, "https://doi.org/")
print("Number of papers potentially can get from Wiley:")
[1] "Number of papers potentially can get from Wiley:"
length(wiley_dois)
[1] 25
pmids_wiley_dois <- doi_information %>%
filter(DOI %in% paste0("https://doi.org/",
wiley_dois)
) %>%
pull(PMID)
download_wiley_pdf<- function(doi,
api_key,
pmids,
output_dir = here::here("output/fulltexts/wiley/pdf/")){
# check files doesn't already exist
if(file.exists(paste0(output_dir, pmids, ".pdf"))){
return(NULL)
}
print(pmids)
curl_command <- paste0('curl -L -H "Wiley-TDM-Client-Token:',
api_key,
'" https://api.wiley.com/onlinelibrary/tdm/v1/articles/',
doi,
' -o ', output_dir, pmids, '.pdf'
)
  # print(curl_command) # avoid printing: the command embeds the API key
system(curl_command)
}
purrr::walk2(wiley_dois,
pmids_wiley_dois,
~ download_wiley_pdf(.x, Sys.getenv("WILEY_API_KEY"), .y)
)
# remove zero-byte files - these appear to be failed downloads (articles that are not open access)
system("find output/fulltexts/wiley/pdf -type f -size 0 -delete")
# xmls downloaded manually using https://onlinelibrary.wiley.com/doi/full-xml/[DOI]
# downloaded to fulltexts/wiley/wiley_xml
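Since the full-xml URL pattern is deterministic, the links for those manual downloads can be generated programmatically (a sketch; the output file name is an assumption):
wiley_xml_urls <- paste0("https://onlinelibrary.wiley.com/doi/full-xml/",
                         wiley_dois)
writeLines(wiley_xml_urls,
           here::here("output/fulltexts/wiley/wiley_xml_links.txt")) # assumed file name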
As these xml files are in Wiley’s format, convert them to JATS XML (1.1) format to be consistent with PubMed etc.
mkdir -p output/fulltexts/wiley/xml
for file in output/fulltexts/wiley/wiley_xml/*.xml; do
filename=$(basename "$file")
Rscript code/full_text_conversion/wiley_to_jats.R "$file" "output/fulltexts/wiley/xml/${filename%.xml}.xml"
done
# how many wiley full text xml downloaded
print("Number of downloaded full text files (xml) from Wiley:")
[1] "Number of downloaded full text files (xml) from Wiley:"
length(list.files(here::here("output/fulltexts/wiley/xml/"),
recursive = TRUE,
pattern = "\\.xml$"))
[1] 24
# how many wiley pdfs downloaded
print("Number of downloaded full text files (pdf) from Wiley:")
[1] "Number of downloaded full text files (pdf) from Wiley:"
length(list.files(here::here("output/fulltexts/wiley/"),
recursive = TRUE,
pattern = "\\.pdf$"))
[1] 0
Published by e-Century, under policy: https://e-century.us/web/journal_author_info.php?journal=ajcr
“All PDF, XML and html files for all articles published in this journal are the property of the publisher, e-Century Publishing Corporation (www.e-Century.org). Authors and readers are granted the right to freely use these files for all academic purposes.”
# pubmed ids: 34522458
# download pdf
print("Number of downloaded full text files (pdf) from American Journal of Cancer Research:")
[1] "Number of downloaded full text files (pdf) from American Journal of Cancer Research:"
length(list.files(here::here("output/fulltexts/ajcr"),
pattern = "\\.pdf$"
)
)
[1] 1
# converted to xml using Grobid, saved to output/fulltexts/ajcr
print("Number of converted full text files (xml) from American Journal of Cancer Research:")
[1] "Number of converted full text files (xml) from American Journal of Cancer Research:"
length(list.files(here::here("output/fulltexts/ajcr"),
pattern = "\\.xml$"
)
)
[1] 1
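The Grobid PDF-to-XML conversion recurs for several of the PDF-only sources below (ASM, ERS, eScholarship, In vivo, Karger, Medica, Ovid). A minimal sketch of that step, assuming a local Grobid server on its default port (8070); the helper name and output naming are illustrative:
library(httr)
# POST a PDF to Grobid's full-text endpoint and save the returned TEI XML
grobid_pdf_to_xml <- function(pdf_path,
                              out_dir,
                              grobid_url = "http://localhost:8070/api/processFulltextDocument") {
  resp <- POST(grobid_url,
               body = list(input = upload_file(pdf_path)),
               encode = "multipart")
  if (status_code(resp) != 200) {
    message("Grobid failed for ", pdf_path)
    return(invisible(NULL))
  }
  # name the output <file>.pdf.tei.xml, matching the suffix stripped later on this page
  out_file <- file.path(out_dir,
                        paste0(basename(pdf_path), ".tei.xml"))
  writeLines(content(resp, as = "text", encoding = "UTF-8"), out_file)
}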
asm_doi_patterns <- "10.1128|10.1093/infdis|10.1093/cid"
asm_links <-
grep(asm_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from ASM:")
length(asm_links)
# download pdf content from webpage
# use crossref links:
asm_link_df <-
get_crossref_links(asm_links) |>
bind_rows()
print("ASM links to retrieve")
asm_link_df
print("Number of downloaded full text files (pdf) from ASM:")
[1] "Number of downloaded full text files (pdf) from ASM:"
length(list.files(here::here("output/fulltexts/asm"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 1
# converted to xml using Grobid, saved to output/fulltexts/asm
print("Number of converted full text files (xml) from ASM:")
[1] "Number of converted full text files (xml) from ASM:"
length(list.files(here::here("output/fulltexts/asm"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 1
TDM policy: https://bmjgroup.com/text-and-data-mining-tdm-policy/
bmj_doi_patterns <- "10.1136/gutjnl|10.1136/jmedgenet"
bmj_links <-
grep(bmj_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from BMJ:")
[1] "Number of papers potentially can get from BMJ:"
length(bmj_links)
[1] 3
# download html content from webpage
# save to output/fulltexts/bmj
auto-corpus -b BMJ -t "output/fulltexts/bmj" -f "output/fulltexts/bmj/html" -o XML
print("Number of downloaded full text files (html) from BMJ:")
[1] "Number of downloaded full text files (html) from BMJ:"
length(list.files(here::here("output/fulltexts/bmj"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 3
print("Number of convert full text files (xml) from BMJ:")
[1] "Number of convert full text files (xml) from BMJ:"
length(list.files(here::here("output/fulltexts/bmj"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 4
Policies: https://www.cambridge.org/core/services/open-research/text-and-data-mining
cambridge_doi_patterns <- "10.1017"
cambridge_links <-
grep(cambridge_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Cambridge:")
[1] "Number of papers potentially can get from Cambridge:"
length(cambridge_links)
[1] 1
# obtain html content from webpage
auto-corpus -b CAMBRIDGE_CORE -t "output/fulltexts/cambridge" -f "output/fulltexts/cambridge/html" -o XML
print("Number of downloaded full text files (html) from Cambridge:")
[1] "Number of downloaded full text files (html) from Cambridge:"
length(list.files(here::here("output/fulltexts/cambridge"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 1
print("Number of converted full text files (xml) from Cambridge:")
[1] "Number of converted full text files (xml) from Cambridge:"
length(list.files(here::here("output/fulltexts/cambridge"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 1
ers_doi_patterns <- "10.1183"
ers_links <-
grep(ers_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from ERS:")
[1] "Number of papers potentially can get from ERS:"
length(ers_links)
[1] 1
ers_links
[1] "https://doi.org/10.1183/13993003.00521-2019"
# this article is distributed under the terms of the Creative Commons Attribution Non-Commercial Licence 4.0, which permits text-mining; the PDF is available at https://erj.ersjournals.com/content/erj/53/2/1801258.full.pdf
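# A sketch (not part of the original run) of fetching that PDF directly;
# the destination file name is taken from the URL:
download.file("https://erj.ersjournals.com/content/erj/53/2/1801258.full.pdf",
              destfile = here::here("output/fulltexts/ers/1801258.full.pdf"),
              mode = "wb")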
print("Number of downloaded full text files (pdf) from ERS:")
[1] "Number of downloaded full text files (pdf) from ERS:"
length(list.files(here::here("output/fulltexts/ers"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 1
# converted to xml using Grobid, saved to output/fulltexts/ers
print("Number of converted full text files (xml) from ERS:")
[1] "Number of converted full text files (xml) from ERS:"
length(list.files(here::here("output/fulltexts/ers"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 1
# 39024449
# https://doi.org/10.1126/science.adj1182 https://escholarship.org/content/qt53c9n629/qt53c9n629.pdf
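# A sketch (not part of the original run) of fetching that PDF, named by its
# PMID to match the convention used elsewhere on this page:
download.file("https://escholarship.org/content/qt53c9n629/qt53c9n629.pdf",
              destfile = here::here("output/fulltexts/escholarship/39024449.pdf"),
              mode = "wb")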
print("Number of papers potentially can get from eScholarship:")
[1] "Number of papers potentially can get from eScholarship:"
list.files(here::here("output/fulltexts/escholarship"),
pattern = "\\.pdf$"
) |>
length()
[1] 1
print("Number of downloaded full text files (pdf) from eScholarship:")
[1] "Number of downloaded full text files (pdf) from eScholarship:"
length(list.files(here::here("output/fulltexts/escholarship"),
pattern = "\\.pdf$"
)
)
[1] 1
# converted to xml using Grobid, saved to output/fulltexts/escholarship
print("Number of converted full text files (xml) from eScholarship:")
[1] "Number of converted full text files (xml) from eScholarship:"
length(list.files(here::here("output/fulltexts/escholarship"),
pattern = "\\.xml$"
)
)
[1] 1
jama_doi_patterns <- "10.1001/jama"
jama_links <-
grep(jama_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from JAMA:")
[1] "Number of papers potentially can get from JAMA:"
length(jama_links)
[1] 2
auto-corpus -b JAMA -t "output/fulltexts/jama" -f "output/fulltexts/jama/html" -o XML
print("Number of downloaded full text files (html) from JAMA:")
[1] "Number of downloaded full text files (html) from JAMA:"
length(list.files(here::here("output/fulltexts/jama"),
recursive = TRUE,
pattern = "\\.html$"
)
)
[1] 2
print("Number of converted full text files (xml) from JAMA:")
[1] "Number of converted full text files (xml) from JAMA:"
length(list.files(here::here("output/fulltexts/jama"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 2
In Vivo uses a CC BY-NC-ND 4.0 license, so its articles can be downloaded and used for non-commercial purposes.
invivo_doi_patterns <- "10.21873/invivo"
invivo_links <-
grep(invivo_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from In vivo:")
[1] "Number of papers potentially can get from In vivo:"
length(invivo_links)
[1] 1
print("Number of downloaded full text files (pdf) from In vivo:")
[1] "Number of downloaded full text files (pdf) from In vivo:"
length(list.files(here::here("output/fulltexts/in_vivo"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 1
# converted to xml using Grobid, saved to output/fulltexts/in_vivo
print("Number of converted full text files (xml) from In vivo:")
[1] "Number of converted full text files (xml) from In vivo:"
length(list.files(here::here("output/fulltexts/in_vivo"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 1
karger_doi_patterns <- "10.1159"
karger_links <-
grep(karger_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Karger:")
[1] "Number of papers potentially can get from Karger:"
length(karger_links)
[1] 2
# download pdf content from webpage
# save to output/fulltexts/karger
# get links from crossref
karger_link_df <-
get_crossref_links(karger_links) |>
bind_rows()
print("Karger links to retrieve")
[1] "Karger links to retrieve"
karger_link_df
# A tibble: 5 × 4
URL content.type content.version intended.application
<chr> <chr> <chr> <chr>
1 https://www.karger.com/Arti… unspecified vor text-mining
2 https://www.karger.com/Arti… application… vor text-mining
3 https://www.karger.com/Arti… unspecified vor similarity-checking
4 https://www.karger.com/Arti… application… vor text-mining
5 https://www.karger.com/Arti… unspecified vor similarity-checking
print("Number of downloaded full text files (pdf) from Karger:")
[1] "Number of downloaded full text files (pdf) from Karger:"
length(list.files(here::here("output/fulltexts/karger"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 2
# converted to xml using Grobid, saved to output/fulltexts/karger
print("Number of converted full text files (xml) from Karger:")
[1] "Number of converted full text files (xml) from Karger:"
length(list.files(here::here("output/fulltexts/karger"),
recursive = TRUE,
pattern = "\\.xml$"
)
)
[1] 2
medica_doi_patterns <- "10.1159|10.5603" # NB: 10.1159 (Karger) is also matched in the previous section; 10.5603 is Via Medica
medica_links <-
grep(medica_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Medica:")
[1] "Number of papers potentially can get from Medica:"
length(medica_links)
[1] 3
# download pdf content from webpage
# save to output/fulltexts/medica
# uses a CC BY license, so can download and use for non-commercial purposes
print("Number of downloaded full text files (pdf) from Medica:")
[1] "Number of downloaded full text files (pdf) from Medica:"
length(list.files(here::here("output/fulltexts/medica"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 1
print("Number of converted full text files (xml) from Medica:")
[1] "Number of converted full text files (xml) from Medica:"
length(list.files(here::here("output/fulltexts/medica"),
recursive = TRUE,
pattern = "\\.xml$")
)
[1] 1
From UCSF Library, license allows text-mining (for Ovid Lippincott Williams & Wilkins Total Access Collection):
“Data Mining: Yes, text mining / data mining activities for legitimate academic research and education purposes.”
ovid_doi_patterns <- "10.1097/fpc|10.1212|10.1681|10.1161/circgen|10.1161/strokeaha"
# 24001895
# https://doi.org/10.1161/circgen.119.002670
# 29748315
# https://doi.org/10.1161/circgen.117.001992
# https://doi.org/10.1227/neu.0000000000002082 ... should work but errors at the moment
ovid_links <-
grep(ovid_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Ovid Lippincott Williams & Wilkins Total Access Collection:")
[1] "Number of papers potentially can get from Ovid Lippincott Williams & Wilkins Total Access Collection:"
length(ovid_links)
[1] 14
print("Number of downloaded full text files (pdf) from Ovid Lippincott Williams & Wilkins Total Access Collection:")
[1] "Number of downloaded full text files (pdf) from Ovid Lippincott Williams & Wilkins Total Access Collection:"
length(list.files(here::here("output/fulltexts/ovid"),
recursive = TRUE,
pattern = "\\.pdf$"
)
)
[1] 13
# converted to xml using Grobid, saved to output/fulltexts/ovid
print("Number of converted full text files (xml) from Ovid Lippincott Williams & Wilkins Total Access Collection:")
[1] "Number of converted full text files (xml) from Ovid Lippincott Williams & Wilkins Total Access Collection:"
length(list.files(here::here("output/fulltexts/ovid"),
recursive = TRUE,
pattern = "\\.xml$"
))
[1] 12
Oxford Academic TDM policy: https://academic.oup.com/pages/purchasing/rights-and-permissions/text-and-data-mining
* TODO: reach out to confirm UCSF rights / possibly get xml formats
ats_doi_patterns <- "10.1164|10.1165"
# ? ATS: doi: 10.1164, 10.1165 (moving to Oxford Academic in March 2026)
# go doi pages, and download html manually
oxford_dois <- grep("10.1093|10.1136/amiajnl|10.1210|10.1513|10.1164|10.1165",
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Oxford Academic:")
[1] "Number of papers potentially can get from Oxford Academic:"
length(oxford_dois)
[1] 49
# then had to download manually using institutional login
# then saved to output/fulltexts/oxford_academic/html
oxford_htmls <- list.files(here::here("output/fulltexts/oxford_academic/html/"),
pattern = "\\.html$"
)
print("Number of downloaded full text files (html) from Oxford Academic:")
[1] "Number of downloaded full text files (html) from Oxford Academic:"
length(oxford_htmls)
[1] 49
auto-corpus -b OXFORD_ACADEMIC -t "output/fulltexts/oxford_academic" -f "output/fulltexts/oxford_academic/html" -o XML
print("Number of downloaded full text files (html) from Oxford Academic:")
[1] "Number of downloaded full text files (html) from Oxford Academic:"
length(list.files(here::here("output/fulltexts/oxford_academic/html"),
pattern = "\\.html$"
)
)
[1] 49
print("Number of converted (xml) from Oxford Academic:")
[1] "Number of converted (xml) from Oxford Academic:"
length(list.files(here::here("output/fulltexts/oxford_academic"),
pattern = "\\.xml$"
)
)
[1] 49
TDM policy / information: https://taylorandfrancis.com/our-policies/textanddatamining/
taylor_francis_dois <- grep("10.1080|10.2217|10.3109",
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Taylor & Francis:")
length(taylor_francis_dois)
# then had to download manually using institutional login
# as html
# saved to output/fulltexts/taylor_and_francis/html
auto-corpus -b TAYLOR_AND_FRANCIS -t "output/fulltexts/taylor_and_francis" -f "output/fulltexts/taylor_and_francis/html" -o XML
print("Number of downloaded full text files (html) from Taylor & Francis:")
[1] "Number of downloaded full text files (html) from Taylor & Francis:"
length(list.files(here::here("output/fulltexts/taylor_and_francis/html"),
pattern = "\\.html$"
)
)
[1] 4
print("Number of downloaded full text files (xml) from Taylor & Francis:")
[1] "Number of downloaded full text files (xml) from Taylor & Francis:"
length(list.files(here::here("output/fulltexts/taylor_and_francis"),
recursive = TRUE,
pattern = "\\.xml$")
)
[1] 4
? UCSF Library - “Other Use Restrictions (Public Note): TDM: Permitted (interpreted)”
aacr_doi_patterns <- "10.1158"
aacr_links <-
grep(aacr_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from AACR:")
[1] "Number of papers potentially can get from AACR:"
length(aacr_links)
[1] 4
10.1158/0008-5472.can-10-1493
10.1158/1078-0432.ccr-10-2394
10.1158/1078-0432.ccr-13-2835
10.1158/1078-0432.ccr-17-2537
# check, how many papers:
aps_doi_patterns <- "10.1152"
aps_links <-
grep(aps_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from APS:")
[1] "Number of papers potentially can get from APS:"
length(aps_links)
[1] 2
aha_doi_patterns <- "10.1161"
aha_links <-
grep(aha_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from AHA:")
[1] "Number of papers potentially can get from AHA:"
length(aha_links)
[1] 5
ash_doi_patterns <- "10.1182"
ash_links <-
grep(ash_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from ASH:")
[1] "Number of papers potentially can get from ASH:"
length(ash_links)
[1] 1
asco_doi_patterns <- "10.1200"
asco_links <-
grep(asco_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from ASCO:")
[1] "Number of papers potentially can get from ASCO:"
length(asco_links)
[1] 1
jstage_doi_patterns <- "10.1248"
jstage_links <-
grep(jstage_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from J-STAGE:")
[1] "Number of papers potentially can get from J-STAGE:"
length(jstage_links)
[1] 1
License for Non-Commercial Reuse, Version 1.0: “Unless otherwise indicated, the American Diabetes Association (ADA) holds copyright on all content published in ADA journals. Individual readers may use the content as long as the work is properly cited and linked to the version of record, the use is educational and not for profit, and the work is not altered. ADA permission is required to post articles on third-party websites (unless otherwise specified below under ‘Article sharing and access’) or to include articles in educational materials that are sold to students or used in courses for which tuition or other fees are charged.”
Agreeing to the Publisher’s license enables the Licensee to use the article anywhere in the world for non-commercial purposes, provided that the Licensee:
- Cites the article using an appropriate bibliographic citation, e.g., authors, article title, journal, volume, page numbers, DOI, and the link to the definitive published version
- Uses the article for educational and not-for-profit purposes only
- Maintains the integrity of the work by making no alterations
- Retains copyright notices and links to these terms and conditions
- Ensures that, for any content in the article that is identified as belonging to a third party, any re-use complies with the copyright policies of that third party.
diabetes_doi_patterns <- "10.2337"
diabetes_links <-
grep(diabetes_doi_patterns,
doi_information$DOI,
value = TRUE)
print("Number of papers potentially can get from Diabetes:")
[1] "Number of papers potentially can get from Diabetes:"
length(diabetes_links)
[1] 12
full_text_files <-
list.files(here::here("output/fulltexts"),
recursive = T,
pattern = "\\.xml$|\\.html$|\\.pdf$")
# full_text_files <-
# list.files(here::here("output/fulltexts"),
# recursive = T,
# pattern = "\\.xml$")
full_text_files <- basename(full_text_files) |>
stringr::str_remove_all("\\.pdf.tei.xml$|\\.html$|\\.xml$|\\.pdf$") |>
stringr::str_remove_all("_bioc") |>
unique()
# convert pmcids to pmids
converted_fulltext_pmcids <-
converted_ids |>
filter(pmcids %in% full_text_files) |>
pull(PMID) |>
unique()
full_text_files <- c(full_text_files,
converted_fulltext_pmcids)
full_text_pmids <- grep("PMC",
full_text_files,
invert = T,
value = T)
full_text_pmids = unique(full_text_pmids)
print("Number of PMIDs with full texts downloaded:")
[1] "Number of PMIDs with full texts downloaded:"
sum(all_pmids %in% full_text_files)
[1] 792
print("% of total PMIDs with full texts downloaded:")
[1] "% of total PMIDs with full texts downloaded:"
100 * sum(all_pmids %in% full_text_files) / length(all_pmids)
[1] 95.65217
The papers I can’t get full texts for automatically (either through Europe PMC, NCBI Cloud Service, or publisher TDM policies) will need to be manually reviewed to identify study cohorts.
print("Number of PMIDs without full texts downloaded (to manually review):")
[1] "Number of PMIDs without full texts downloaded (to manually review):"
n_manual_review = length(all_pmids) - sum(all_pmids %in% full_text_files)
n_manual_review
[1] 36
print("Assuming 10 minutes per paper to review, total time (hours):")
[1] "Assuming 10 minutes per paper to review, total time (hours):"
n_manual_review * 10 / 60
[1] 6
# NCBI cloud xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/ncbi_cloud \
output/fulltexts/methods_sections
# Europe PMC xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/europe_pmc \
output/fulltexts/methods_sections
# Sage xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/sage \
output/fulltexts/methods_sections
# Springer nature xmls
code/extract_text/batch_process_methods.sh \
output/fulltexts/springer_nature \
output/fulltexts/methods_sections
# wiley converted xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/wiley/xml \
output/fulltexts/methods_sections
# Elsevier converted xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/elsevier/xml \
output/fulltexts/methods_sections
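Equivalently, these per-directory calls can be driven from R in a single loop (a sketch of the same commands):
method_dirs <- c("output/fulltexts/ncbi_cloud",
                 "output/fulltexts/europe_pmc",
                 "output/fulltexts/sage",
                 "output/fulltexts/springer_nature",
                 "output/fulltexts/wiley/xml",
                 "output/fulltexts/elsevier/xml")
for (d in method_dirs) {
  system(paste("code/extract_text/batch_process_methods.sh",
               d,
               "output/fulltexts/methods_sections"))
}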
code/extract_text/batch_check_methods.sh \
output/fulltexts/methods_sections
# NB: this chunk relies on open_alex_works and remaining_doi_info, which are
# created in the exploratory OpenAlex chunks below; it looks for open-access
# copies hosted on onlinelibrary.wiley.com
open_alex_wiley_urls <-
open_alex_works |>
filter(doi %in% paste0("https://doi.org/", remaining_doi_info)) |>
filter(is_oa_anywhere == T) |>
filter(grepl("onlinelibrary.wiley.com", oa_url))
open_alex_wiley_dois <-
open_alex_wiley_urls |>
pull(doi)
doi_information |>
filter(DOI %in% open_alex_wiley_dois
) |>
pull(PMID) -> pmids_open_alex_wiley
# note: the open-access URL is passed as the doi argument, so
# download_wiley_pdf appends it to the Wiley TDM endpoint; check the
# resulting URLs before relying on this step
purrr::walk2(open_alex_wiley_urls$oa_url,
             pmids_open_alex_wiley,
             ~ download_wiley_pdf(doi = .x,
                                  api_key = Sys.getenv("WILEY_API_KEY"),
                                  pmids = .y)
             )
doi_information <-
converted_ids |>
filter(!c(PMID %in% full_text_pmids)) |>
pull(DOI)
doi_information <- doi_information[doi_information != ""]
library(openalexR)
# get open alex works for pmids
open_alex_works <- oa_fetch(
doi = unique(doi_information),
entity = "works",
options = list(select = c("doi",
"open_access"))
)
oa_fetch(
doi = unique(doi_information)[10],
entity = "works",
options = list(select = c("doi",
"licenses"))
)
library(openalexR)
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))
not_convertable_pmids <- converted_ids |>
filter(pmcids == "") |>
pull(PMID)
doi_information <-
convert_pmid_df |>
filter(PMID %in% not_convertable_pmids)
doi_information |>
filter(DOI == "")
doi_information$PMID |> unique() |> length()
length(not_convertable_pmids)
# get open alex works for pmids
open_alex_works <- oa_fetch(
doi = unique(doi_information$DOI),
entity = "works",
options = list(select = c("doi",
"open_access"))
)
# no best open access location:
open_alex_works |>
filter(is.na(oa_url)) |>
nrow()
# pdf link available:
open_alex_works |>
filter(grepl("pdf", oa_url)) |>
nrow()
to_download_pdfs <-
open_alex_works |>
filter(grepl(".pdf", oa_url)) |>
pull(oa_url)
writeLines(
to_download_pdfs,
here::here("output/fulltexts/pdfs/pdf_links_to_download.txt"))
cd output/fulltexts/pdfs
while read -r url; do
curl -O "$url"
done < pdf_links_to_download.txt
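If it is useful to keep the downloaded PDFs named by PMID, as in the publisher-specific steps above, the same downloads can be done from R (a sketch; the join assumes doi_information still holds the PMID-DOI mapping built in the previous chunk, with both DOI columns in the https://doi.org/ form):
# match each open-access URL back to its PMID via the DOI
pdf_targets <- open_alex_works |>
  dplyr::filter(grepl("\\.pdf", oa_url)) |>
  dplyr::left_join(doi_information |> dplyr::rename(doi = DOI),
                   by = "doi")
purrr::walk2(pdf_targets$oa_url,
             pdf_targets$PMID,
             ~ try(download.file(.x,
                                 destfile = here::here("output/fulltexts/pdfs",
                                                       paste0(.y, ".pdf")),
                                 mode = "wb")))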
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rcrossref_1.2.1 data.table_1.17.8 dplyr_1.1.4 here_1.0.1
[5] stringr_1.6.0 xml2_1.4.0 httr_1.4.7 workflowr_1.7.2
loaded via a namespace (and not attached):
[1] utf8_1.2.6 sass_0.4.10 generics_0.1.4
[4] renv_1.1.8 stringi_1.8.7 httpcode_0.3.0
[7] digest_0.6.37 magrittr_2.0.4 evaluate_1.0.5
[10] fastmap_1.2.0 plyr_1.8.9 rprojroot_2.1.0
[13] jsonlite_2.0.0 processx_3.8.6 whisker_0.4.1
[16] crul_1.6.0 urltools_1.7.3.1 ps_1.9.1
[19] promises_1.3.3 BiocManager_1.30.26 jquerylib_0.1.4
[22] cli_3.6.5 shiny_1.11.1 rlang_1.1.6
[25] triebeard_0.4.1 withr_3.0.2 cachem_1.1.0
[28] yaml_2.3.10 tools_4.3.1 httpuv_1.6.16
[31] DT_0.34.0 curl_7.0.0 vctrs_0.6.5
[34] R6_2.6.1 mime_0.13 lifecycle_1.0.4
[37] git2r_0.36.2 fs_1.6.6 htmlwidgets_1.6.4
[40] miniUI_0.1.2 pkgconfig_2.0.3 callr_3.7.6
[43] pillar_1.11.1 bslib_0.9.0 later_1.4.4
[46] glue_1.8.0 Rcpp_1.1.0 xfun_0.55
[49] tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1
[52] knitr_1.50 xtable_1.8-4 htmltools_0.5.8.1
[55] rmarkdown_2.30 compiler_4.3.1 getPass_0.2-4