Last updated: 2026-03-30


Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7a17bd6. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    .venv/
    Ignored:    BC2GM/
    Ignored:    BioC.dtd
    Ignored:    FormatConverter.jar
    Ignored:    FormatConverter.zip
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    code/.DS_Store
    Ignored:    code/full_text_conversion/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/RCDCFundingSummary_01042026.xlsx
    Ignored:    data/cdc/
    Ignored:    data/cohort/
    Ignored:    data/epmc/
    Ignored:    data/europe_pmc/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
    Ignored:    data/gbd/gbd_2019_california_percent_deaths.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/2025AA/
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/hp_umls_mapping.csv
    Ignored:    data/icd/lancet_conditions_icd10.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/mondo_umls_mapping.csv
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/phecode_to_icd10_manual_mapping.xlsx
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
    Ignored:    data/icd/umls-2025AA-mrconso.zip
    Ignored:    doccano_venv/
    Ignored:    figures/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/pubmedbert_entity_predictions.csv
    Ignored:    output/pubmedbert_entity_predictions.jsonl
    Ignored:    output/pubmedbert_predictions.csv
    Ignored:    output/pubmedbert_predictions.jsonl
    Ignored:    output/supplement/
    Ignored:    output/text_mining_predictions/
    Ignored:    output/trait_ontology/
    Ignored:    population_description_terms.txt
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    renv/
    Ignored:    spacy_venv_requirements.txt
    Ignored:    spacyr_venv/

Untracked files:
    Untracked:  code/full_text_conversion/html_to_xml.R
    Untracked:  code/text_mining_models/tokenise_data.py

Unstaged changes:
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/replication_ancestry_bias.Rmd
    Modified:   analysis/specific_aims_stats.Rmd
    Modified:   analysis/text_for_cohort_labels.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/get_full_text.Rmd) and HTML (docs/get_full_text.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 7a17bd6 IJbeasley 2026-03-30 Fix springer nature download bug
html a1dd9e2 IJbeasley 2026-03-30 Build site.
Rmd ecea90d IJbeasley 2026-03-30 More comphrensive full text download
html a80d8cb IJbeasley 2026-03-25 Build site.
Rmd 2585a14 IJbeasley 2026-03-25 Update full text downloading
html 5e29024 IJbeasley 2026-02-04 Build site.
Rmd e7de25d IJbeasley 2026-02-04 Adding totals + to manually review
html 5c9d397 IJbeasley 2026-02-04 Build site.
Rmd 456acb1 IJbeasley 2026-02-04 Fixing percentage downloaded stats
html 55f6763 IJbeasley 2026-02-04 Build site.
Rmd c0dc676 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 1898c02 IJbeasley 2026-02-04 Build site.
Rmd d214580 IJbeasley 2026-02-04 Getting full text from publisher APIs
html 6ba1e1f IJbeasley 2026-01-12 Build site.
Rmd b43e9a9 IJbeasley 2026-01-12 Update getting full text
html ac0d1a7 IJbeasley 2025-10-27 Build site.
html 8642872 IJbeasley 2025-10-27 Build site.
Rmd da4d730 IJbeasley 2025-10-27 Now run on all texts
html fb5cfd9 IJbeasley 2025-10-27 Build site.
Rmd 8ed4c37 IJbeasley 2025-10-27 Now run on all texts
html 8610283 IJbeasley 2025-10-27 Build site.
Rmd 7d504e3 IJbeasley 2025-10-27 More fixing of download full text
html 16f4c19 IJbeasley 2025-10-27 Build site.
Rmd 3df4096 IJbeasley 2025-10-27 Update + improve full text downloading - test run
html 1439951 IJbeasley 2025-10-24 Build site.
Rmd 481aebe IJbeasley 2025-10-24 Update code for getting full texts

Required packages

library(httr)
library(xml2)
library(stringr)
library(here)
library(dplyr)
library(data.table)

Get PMCIDs

Get PubMed IDs from the GWAS Catalog


## Step 1: 
# get only relevant disease studies
# gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))

gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
    dplyr::filter(!cause %in% c("HIV/AIDS",
                             "Tuberculosis",
                             "Malaria",
                             "Lower respiratory infections",
                             "Diarrhoeal diseases",
                             "Neonatal disorders",
                             "Tetanus",
                             "Diphtheria",
                             "Pertussis" ,
                             "Measles",
                             "Maternal disorders")) |>
  dplyr::filter(cause != "")

# gwas_study_info <- gwas_study_info |>
#   dplyr::filter(DISEASE_STUDY == TRUE)

print("Number of disease studies to get full texts for:")
[1] "Number of disease studies to get full texts for:"
all_pmids <- unique(gwas_study_info$PUBMED_ID)
length(all_pmids)
[1] 828

Convert PubMed IDs to PMCIDs

# get PMID to PMCID mapping using Europe PMC file:
convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

convert_pmid_df <- convert_pmid_df |>
  dplyr::rename(pmcids = PMCID
                ) |>
  dplyr::mutate(pmcids = ifelse(is.na(pmcids),
                                "",
                                pmcids
                                )
                )

convert_pmid_df <-
  convert_pmid_df |>
  dplyr::filter(!is.na(PMID))

converted_ids = 
  convert_pmid_df |>
  filter(PMID %in% all_pmids)

data.table::fwrite(converted_ids,
                   here::here("output/fulltexts/pmid_to_pmcid_mapping.csv")
                   )
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))

print("Head of pmid to pmcid mapping data.frame:")
[1] "Head of pmid to pmcid mapping data.frame:"
head(converted_ids)
       PMID     pmcids                                           DOI
      <int>     <char>                                        <char>
1: 17223258             https://doi.org/10.1016/j.canlet.2006.11.029
2: 17293876                      https://doi.org/10.1038/nature05616
3: 17434096 PMC2613843 https://doi.org/10.1016/s1474-4422(07)70081-9
4: 17460697                           https://doi.org/10.1038/ng2043
5: 17463246                  https://doi.org/10.1126/science.1142358
6: 17463248 PMC3214617       https://doi.org/10.1126/science.1142382
print("Dimensions of pmid to pmcid mapping data.frame:")
[1] "Dimensions of pmid to pmcid mapping data.frame:"
dim(converted_ids)
[1] 828   3
length(all_pmids)
[1] 828
print("All pmids are in this data.frame, but some don't have pmcid mapping")
[1] "All pmids are in this data.frame, but some don't have pmcid mapping"
not_converted_pmids <-
converted_ids |>
  filter(pmcids == "")  |>
  pull(PMID)

print("Number of pmids without pmcid mapping:")
[1] "Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 175
pmcids <-
converted_ids$pmcids |>
  unique()

pmcids <- pmcids[pmcids != ""]

print("Number of pmids with pmcid mapping:")
[1] "Number of pmids with pmcid mapping:"
length(pmcids)
[1] 653
print("Percentage of pmids with pmcid:")
[1] "Percentage of pmids with pmcid:"
round(100 * length(pmcids) / length(all_pmids), digits = 2)
[1] 78.86

Download full texts from Europe PMC

Downloading full-text XMLs from the Europe PMC RESTful API requires PMCIDs, so this step can only be applied to papers with a PMCID mapping.

# Function to download full text xml from Europe PMC Restful API
download_pmc_text <- function(pmcid, 
                              out_dir = here::here("output/fulltexts/europe_pmc/")
                              ) {
  
  # check if file already exists
  if(file.exists(paste0(out_dir, pmcid, ".xml"))){
    return(TRUE)
  }


  url_xml <- paste0("https://www.ebi.ac.uk/",
                    "europepmc/webservices/rest/",
                    pmcid,
                    "/fullTextXML"
                    )
  
  resp <- GET(url_xml)
  
  # ---- Fallback URL ----
  if(status_code(resp) != 200){
    
    url_xml <- paste0("https://europepmc.org/",
                       "oai.cgi?verb=GetRecord",
                       "&metadataPrefix=pmc",
                       "&identifier=oai:europepmc.org:",
                       pmcid)
    
    resp <- GET(url_xml)
  
  }
  
  # ---- Fail if still bad ----
  if(status_code(resp) != 200){
    
  return(NULL)
    
  }
  
  # ---- Parse XML ----
  xml_content <- read_xml(
    content(resp, 
            as = "text", 
            encoding = "UTF-8")
  )
  
  article_node = xml_find_first(xml_content, 
                               "//*[local-name() = 'article']"
                               )
  
   if (is.na(article_node)) {
    message("No <article> node found for ", pmcid)
     
    return(NULL)
   }
  
  # --- Save ---
  write_xml(article_node, 
            paste0(out_dir, pmcid, ".xml")
            )
  
  return(TRUE)
  
} 


# pmcids already excludes empty strings (filtered above)
for (article in pmcids) {
  download_pmc_text(article)
}
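The loop above issues back-to-back API requests and silently drops failures (`download_pmc_text()` returns `NULL` on a bad response). A more defensive sketch, with a `failed_pmcids` vector (a name introduced here) collecting PMCIDs for a later retry and a short pause between requests:

```r
# Sketch: polite, failure-tracking version of the download loop.
# download_pmc_text() is the function defined above.
failed_pmcids <- character(0)

for (article in pmcids) {
  result <- tryCatch(
    download_pmc_text(article),
    error = function(e) {
      message("Download failed for ", article, ": ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(result)) failed_pmcids <- c(failed_pmcids, article)
  Sys.sleep(0.5)  # brief pause between API requests
}
```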
euro_pmcids <- list.files(here::here("output/fulltexts/europe_pmc/"),
                          pattern = "\\.xml$")

euro_pmcids <- gsub("\\.xml$", 
       "", 
       euro_pmcids
       )

euro_pmcids <- pmcids[pmcids %in% euro_pmcids]

n_euro_pmc <- length(euro_pmcids)

print("Number of downloaded full text files from Europe PMC:")
[1] "Number of downloaded full text files from Europe PMC:"
print(n_euro_pmc)
[1] 435
print("Percentage of pmids with full text from Europe PMC:")
[1] "Percentage of pmids with full text from Europe PMC:"
round(100 * n_euro_pmc / length(all_pmids), digits = 2)
[1] 52.54

Download full texts from NCBI Cloud Service

For the remaining PMCIDs/PMIDs without full text, try downloading from the NCBI Cloud Service.


# get list of pmcids with full text - author_manuscript available
# available in XML and plain text for text mining purposes.
aws s3 cp s3://pmc-oa-opendata/author_manuscript/txt/metadata/txt/author_manuscript.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text - non-commercial use
# oa_noncomm 
aws s3 cp s3://pmc-oa-opendata/oa_noncomm/txt/metadata/txt/oa_noncomm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, commercial list
# oa_comm
aws s3 cp s3://pmc-oa-opendata/oa_comm/txt/metadata/txt/oa_comm.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

# get list of pmcids with full text, other license category
# oa_other
aws s3 cp s3://pmc-oa-opendata/oa_other/txt/metadata/txt/oa_other.filelist.txt output/fulltexts/aws_locations/  --no-sign-request 

Identify full texts already downloaded through Europe PMC

europeanpmc_full_texts <- 
list.files(here::here("output/fulltexts/europe_pmc"),
                  pattern = "\\.xml$"
           )

# get pmcids of these files
europeanpmc_full_texts <-
  gsub("\\.xml$", 
       "", 
       europeanpmc_full_texts
       ) 

Get NCBI download paths for remaining full texts (where available)

left_over_pmcids = pmcids[!pmcids %in% europeanpmc_full_texts]

print("Number of remaining pmcids without full text:")
[1] "Number of remaining pmcids without full text:"
length(left_over_pmcids)
[1] 218
print("+ Number of pmids without pmcid mapping:")
[1] "+ Number of pmids without pmcid mapping:"
length(not_converted_pmids)
[1] 175
author_manu = data.table::fread(here::here("output/fulltexts/aws_locations/author_manuscript.filelist.txt"))

oa_noncomm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_noncomm.filelist.txt"))

oa_comm = data.table::fread(here::here("output/fulltexts/aws_locations/oa_comm.filelist.txt"))

author_manu_to_get <-
author_manu |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

print("Number of papers to download in Author Manuscripts section:")
[1] "Number of papers to download in Author Manuscripts section:"
nrow(author_manu_to_get)
[1] 136
oa_noncomm_to_get = 
oa_noncomm |>
  dplyr::filter(AccessionID %in% left_over_pmcids | 
                PMID %in% not_converted_pmids)

# remove any overlaps between sections
oa_noncomm_to_get <-
  oa_noncomm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID))

print("Number of additional papers to download in the Non-commercial Open Access PMC section:")
[1] "Number of additional papers to download in the Non-commercial Open Access PMC section:"
nrow(oa_noncomm_to_get)
[1] 0
oa_comm_to_get = 
oa_comm |>
  dplyr::filter(AccessionID %in% left_over_pmcids |
                PMID %in% not_converted_pmids)

# remove any overlaps between sections
oa_comm_to_get <-
  oa_comm_to_get |>
  dplyr::filter(!c(PMID %in% author_manu_to_get$PMID)) |>
  dplyr::filter(!c(PMID %in% oa_noncomm_to_get$PMID))

print("Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps:")
[1] "Number of additional papers to download in the Commercial Open Access PMC section after removing overlaps:"
nrow(oa_comm_to_get)
[1] 3
file_paths = 
c(oa_noncomm_to_get$Key,
  oa_comm_to_get$Key,
  author_manu_to_get$Key)

# the filelists point at the plain-text copies; the XML copies sit at the
# same path with "txt" replaced by "xml" (both the directory and the extension)
file_paths <- str_replace_all(file_paths,
                              pattern = stringr::fixed("txt"),
                              replacement = "xml")

writeLines(
  file_paths,
  here::here("output/fulltexts/aws_locations/selected_paths.txt")
)

Download remaining full texts from NCBI Cloud Service

system(
  paste(
    "xargs -I {} aws s3 cp",
    "s3://pmc-oa-opendata/{}",
    shQuote(here::here("output/fulltexts/ncbi_cloud/")),
    "--no-sign-request",
    "<",
    shQuote(here::here("output/fulltexts/aws_locations/selected_paths.txt"))
  )
)
# not_available = left_over_pmcids[!c(left_over_pmcids %in% 
#                                           c(oa_noncomm_to_get$AccessionID, 
#                                             oa_comm_to_get$AccessionID,
#                                             author_manu_to_get$AccessionID)
#                                           )]

# get list of pmcids already retrieved
ncbi_pmcids_retrieved <- 
  list.files(here::here("output/fulltexts/ncbi_cloud/"),
             pattern = "\\.xml$"
             )

ncbi_pmcids_retrieved <-
  gsub("\\.xml$", 
       "", 
       ncbi_pmcids_retrieved
       )

pmids_retrieved <-
converted_ids  |>
  filter(pmcids %in% c(ncbi_pmcids_retrieved, euro_pmcids)) |>
  pull(PMID)

not_available <- all_pmids[!c(all_pmids %in% pmids_retrieved)] 

print("Percentage of pmids with full text from NCBI Cloud Service:")
[1] "Percentage of pmids with full text from NCBI Cloud Service:"
100 * (length(all_pmids) - n_euro_pmc - length(not_available)) / length(all_pmids)
[1] 16.42512
print("Percentage of pmids without full text from either Europe PMC or NCBI Cloud Service:")
[1] "Percentage of pmids without full text from either Europe PMC or NCBI Cloud Service:"
100 * length(not_available) / length(all_pmids)
[1] 31.03865

Download from publishers (uses DOIs)

Get DOIs for the remaining articles without full text

doi_information <-
converted_ids |>
  filter(PMID %in% not_available)

Publishers to get/check text-mining license info for:

American Association for Cancer Research, doi: 10.1158

? UCSF Library - “Other Use Restrictions (Public Note): TDM: Permitted (interpreted)”

aacr_doi_patterns <- "10.1158"

aacr_links <-
  grep(aacr_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AACR:")
[1] "Number of papers potentially available from AACR:"
length(aacr_links)
[1] 4

10.1158/0008-5472.can-10-1493 10.1158/1078-0432.ccr-10-2394 10.1158/1078-0432.ccr-13-2835 10.1158/1078-0432.ccr-17-2537

American Physiological Society, doi: 10.1152

# check, how many papers:
aps_doi_patterns <- "10.1152"

aps_links <-
  grep(aps_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from APS:")
[1] "Number of papers potentially available from APS:"
length(aps_links)
[1] 2

AHA, doi: 10.1161

aha_doi_patterns <- "10.1161"

aha_links <-
  grep(aha_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from AHA:")
[1] "Number of papers potentially available from AHA:"
length(aha_links)
[1] 5

ASH, doi: 10.1182

ash_doi_patterns <- "10.1182"

ash_links <-
  grep(ash_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASH:")
[1] "Number of papers potentially available from ASH:"
length(ash_links)
[1] 1

ASCO, doi: 10.1200

asco_doi_patterns <- "10.1200"

asco_links <-
  grep(asco_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from ASCO:")
[1] "Number of papers potentially available from ASCO:"
length(asco_links)
[1] 1

J-STAGE, doi: 10.1248

jstage_doi_patterns <- "10.1248"

jstage_links <-
  grep(jstage_doi_patterns,
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from J-STAGE:")
[1] "Number of papers potentially available from J-STAGE:"
length(jstage_links)
[1] 1

(ADA) Diabetes, doi: 10.2337

License for Non-Commercial Reuse, Version 1.0 Unless otherwise indicated, the American Diabetes Association (ADA) holds copyright on all content published in ADA journals, unless otherwise noted. Individual readers may use the content as long as the work is properly cited and linked to the version of record, the use is educational and not for profit, and the work is not altered. ADA permission is required to post articles on third-party websites (unless otherwise specified below under “Article sharing and access”) or to include articles in educational materials that are sold to students or used in courses for which tuition or other fees are charged.

Agreeing to the Publisher’s license enables the Licensee to use the article anywhere in the world for non-commercial purposes, provided that the Licensee:

  • Cites the article using an appropriate bibliographic citation, e.g., authors, article title, journal, volume, page numbers, DOI, and the link to the definitive published version
  • Uses the article for educational and not-for-profit purposes only
  • Maintains the integrity of the work by making no alterations
  • Retains copyright notices and links to these terms and conditions
  • Ensures that, for any content in the article that is identified as belonging to a third party, any re-use complies with the copyright policies of that third party.

diabetes_doi_patterns <- "10.2337"

diabetes_links <-
  grep(diabetes_doi_patterns, 
       doi_information$DOI,
       value = TRUE)

print("Number of papers potentially available from Diabetes:")
[1] "Number of papers potentially available from Diabetes:"
length(diabetes_links)
[1] 12
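The per-publisher chunks above repeat the same `grep()` call. As a compact alternative (a sketch; `publisher_prefixes` is a name introduced here), the remaining DOIs can be tallied against every prefix in one pass:

```r
# DOI prefix per publisher, mirroring the sections above
publisher_prefixes <- c(
  AACR      = "10.1158",
  APS       = "10.1152",
  AHA       = "10.1161",
  ASH       = "10.1182",
  ASCO      = "10.1200",
  `J-STAGE` = "10.1248",
  Diabetes  = "10.2337"
)

# count matching DOIs for each prefix (fixed = TRUE so "." is literal)
sapply(publisher_prefixes,
       function(prefix) sum(grepl(prefix, doi_information$DOI, fixed = TRUE)))
```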

Internal non-commercial research purposes

PNAS:

AAAS / Science:

  • https://www.science.org/content/page/institutional-license-agreement “Non-commercial Licensees’ Authorized Users may use the Licensed Materials (excluding any Complimentary Resources) for text and data mining, for purely internal non-commercial research purposes, for as long as the Licensee maintains a subscription to the Licensed Materials, subject to the terms and conditions in ANNEX A”

Total number of downloaded full texts

full_text_files <-
  list.files(here::here("output/fulltexts"), 
             recursive = T, 
             pattern = "\\.xml$|\\.html$|\\.pdf$") 

# full_text_files <-
#   list.files(here::here("output/fulltexts"),
#              recursive = T,
#              pattern = "\\.xml$")

full_text_files <- basename(full_text_files) |> 
  stringr::str_remove_all("\\.pdf.tei.xml$|\\.html$|\\.xml$|\\.pdf$") |>
  stringr::str_remove_all("_bioc") |>
  unique()

# convert pmcids to pmids 
converted_fulltext_pmcids <-
  converted_ids |>
  filter(pmcids %in% full_text_files) |>
  pull(PMID) |>
  unique() 

full_text_files  <- c(full_text_files, 
                      converted_fulltext_pmcids)

full_text_pmids <- grep("PMC", 
                        full_text_files, 
                        invert = T, 
                        value = T)

full_text_pmids = unique(full_text_pmids)

print("Number of PMIDs with full texts downloaded:")
[1] "Number of PMIDs with full texts downloaded:"
sum(all_pmids %in% full_text_files)
[1] 792
print("% of total PMIDs with full texts downloaded:")
[1] "% of total PMIDs with full texts downloaded:"
100 * sum(all_pmids %in% full_text_files) / length(all_pmids)
[1] 95.65217

Number of papers to manually review

The papers I can’t get full texts for automatically (either through Europe PMC, NCBI Cloud Service, or publisher TDM policies) will need to be manually reviewed to identify study cohorts.

print("Number of PMIDs without full texts downloaded (to manually review):")
[1] "Number of PMIDs without full texts downloaded (to manually review):"
n_manual_review = length(all_pmids) - sum(all_pmids %in% full_text_files)
n_manual_review
[1] 36
print("Assuming 10 minutes per paper to review, total time (hours):")
[1] "Assuming 10 minutes per paper to review, total time (hours):"
n_manual_review * 10 / 60
[1] 6

Extracting methods sections


# NCBI cloud xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/ncbi_cloud \
output/fulltexts/methods_sections

# Europe PMC xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/europe_pmc \
output/fulltexts/methods_sections

# Sage xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/sage \
output/fulltexts/methods_sections

# Springer nature xmls
code/extract_text/batch_process_methods.sh \
output/fulltexts/springer_nature \
output/fulltexts/methods_sections

# wiley converted xmls:
code/extract_text/batch_process_methods.sh \
output/fulltexts/wiley/xml \
output/fulltexts/methods_sections

# Elsevier converted xmls: 
code/extract_text/batch_process_methods.sh \
output/fulltexts/elsevier/xml \
output/fulltexts/methods_sections

code/extract_text/batch_check_methods.sh \
output/fulltexts/methods_sections 
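The batch scripts themselves are not shown here. As a rough sketch of what per-file methods extraction might look like (assuming JATS-style XML; `extract_methods` is a hypothetical helper, and section tagging varies by publisher, so the real scripts likely also match on section titles):

```r
library(xml2)

# Sketch: pull the methods section out of one JATS XML full text.
extract_methods <- function(xml_path) {
  doc <- read_xml(xml_path)

  # first, look for a <sec> explicitly typed as methods
  sec <- xml_find_first(doc, "//*[local-name() = 'sec'][@sec-type = 'methods']")

  # fall back to a <sec> whose <title> contains 'Methods'
  if (is.na(sec)) {
    sec <- xml_find_first(
      doc,
      "//*[local-name() = 'sec'][*[local-name() = 'title'][contains(., 'Methods')]]"
    )
  }

  if (is.na(sec)) return(NA_character_)
  xml_text(sec)
}
```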

Testing:

Download PDFs using Open Access information from OpenAlex

doi_information <-
converted_ids |>
  filter(!c(PMID %in% full_text_pmids)) |>
  pull(DOI)

doi_information <- doi_information[doi_information != ""]

library(openalexR)

# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

    
oa_fetch(
  doi = unique(doi_information)[10],
  entity = "works",
  options = list(select = c("doi",
                            "licenses"))
)
library(openalexR)

convert_pmid_df <- fread(here::here("data/europe_pmc/PMID_PMCID_DOI.csv"))

not_convertable_pmids <- converted_ids |> 
                         filter(pmcids == "") |> 
                         pull(PMID)

doi_information <-
convert_pmid_df |>
  filter(PMID %in% not_convertable_pmids)

doi_information |>
  filter(DOI == "")

doi_information$PMID |> unique() |> length()

length(not_convertable_pmids)

# get open alex works for pmids
open_alex_works <- oa_fetch(
  doi = unique(doi_information$DOI),
  entity = "works",
  options = list(select = c("doi", 
                            "open_access"))
)

# no best open access location: 
open_alex_works |> 
  filter(is.na(oa_url)) |>
  nrow()

# pdf link available:
open_alex_works |> 
  filter(grepl("pdf", oa_url)) |>
  nrow()

to_download_pdfs <-
open_alex_works |> 
  filter(grepl(".pdf", oa_url)) |>
  pull(oa_url)

writeLines(
  to_download_pdfs,
  here::here("output/fulltexts/pdfs/pdf_links_to_download.txt")
)

cd output/fulltexts/pdfs

# -L follows redirects, since many open-access URLs redirect to the hosted PDF
while read -r url; do
  curl -LO "$url"
done < pdf_links_to_download.txt

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] rcrossref_1.2.1   data.table_1.17.8 dplyr_1.1.4       here_1.0.1       
[5] stringr_1.6.0     xml2_1.4.0        httr_1.4.7        workflowr_1.7.2  

loaded via a namespace (and not attached):
 [1] utf8_1.2.6          sass_0.4.10         generics_0.1.4     
 [4] renv_1.1.8          stringi_1.8.7       httpcode_0.3.0     
 [7] digest_0.6.37       magrittr_2.0.4      evaluate_1.0.5     
[10] fastmap_1.2.0       plyr_1.8.9          rprojroot_2.1.0    
[13] jsonlite_2.0.0      processx_3.8.6      whisker_0.4.1      
[16] crul_1.6.0          urltools_1.7.3.1    ps_1.9.1           
[19] promises_1.3.3      BiocManager_1.30.26 jquerylib_0.1.4    
[22] cli_3.6.5           shiny_1.11.1        rlang_1.1.6        
[25] triebeard_0.4.1     withr_3.0.2         cachem_1.1.0       
[28] yaml_2.3.10         tools_4.3.1         httpuv_1.6.16      
[31] DT_0.34.0           curl_7.0.0          vctrs_0.6.5        
[34] R6_2.6.1            mime_0.13           lifecycle_1.0.4    
[37] git2r_0.36.2        fs_1.6.6            htmlwidgets_1.6.4  
[40] miniUI_0.1.2        pkgconfig_2.0.3     callr_3.7.6        
[43] pillar_1.11.1       bslib_0.9.0         later_1.4.4        
[46] glue_1.8.0          Rcpp_1.1.0          xfun_0.55          
[49] tibble_3.3.0        tidyselect_1.2.1    rstudioapi_0.17.1  
[52] knitr_1.50          xtable_1.8-4        htmltools_0.5.8.1  
[55] rmarkdown_2.30      compiler_4.3.1      getPass_0.2-4