Last updated: 2026-02-23
Checks: 7 passed, 0 failed
Knit directory: genomics_ancest_disease_dispar/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220216) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version de0b128. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: .venv/
Ignored: Asthma_Bothsex_inv_var_meta_GBMI_052021_nbbkgt1.txt.gz
Ignored: Aus_School_Profile.xlsx
Ignored: BC2GM/
Ignored: SeniorSecondaryCompletionandAchievementInformation_2025.xlsx
Ignored: analysis/.DS_Store
Ignored: ancestry_dispar_env/
Ignored: bc2GMtest_1.0.tar.gz
Ignored: code/.DS_Store
Ignored: code/full_text_conversion/.DS_Store
Ignored: cohort_sentences.json
Ignored: data/.DS_Store
Ignored: data/RCDCFundingSummary_01042026.xlsx
Ignored: data/cdc/
Ignored: data/cohort/
Ignored: data/epmc/
Ignored: data/europe_pmc/
Ignored: data/gbd/.DS_Store
Ignored: data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
Ignored: data/gbd/IHME-GBD_2023_DATA-73cc01fd-1.csv
Ignored: data/gbd/gbd_2019_california_percent_deaths.csv
Ignored: data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
Ignored: data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
Ignored: data/gwas_catalog/
Ignored: data/icd/.DS_Store
Ignored: data/icd/2025AA/
Ignored: data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
Ignored: data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
Ignored: data/icd/UK_Biobank_master_file.tsv
Ignored: data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
Ignored: data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
Ignored: data/icd/hp_umls_mapping.csv
Ignored: data/icd/lancet_conditions_icd10.xlsx
Ignored: data/icd/manual_disease_icd10_mappings.xlsx
Ignored: data/icd/mondo_umls_mapping.csv
Ignored: data/icd/phecode_international_version_unrolled.csv
Ignored: data/icd/phecode_to_icd10_manual_mapping.xlsx
Ignored: data/icd/semiautomatic_ICD-pheno.txt
Ignored: data/icd/semiautomatic_ICD-pheno_UKB_subset.txt
Ignored: data/icd/umls-2025AA-mrconso.zip
Ignored: figures/
Ignored: output/.DS_Store
Ignored: output/abstracts/
Ignored: output/doccano/
Ignored: output/fulltexts/
Ignored: output/gwas_cat/
Ignored: output/gwas_cohorts/
Ignored: output/icd_map/
Ignored: output/pubmedbert_entity_predictions.csv
Ignored: output/pubmedbert_entity_predictions.jsonl
Ignored: output/pubmedbert_predictions.csv
Ignored: output/pubmedbert_predictions.jsonl
Ignored: output/text_mining_predictions/
Ignored: output/trait_ontology/
Ignored: population_description_terms.txt
Ignored: pubmedbert-cohort-ner-model/
Ignored: pubmedbert-cohort-ner/
Ignored: renv/
Ignored: spacyr_venv/
Untracked files:
Untracked: code/extract_text/sentence_embeddings.py
Untracked: schools.R
Untracked: testing.R
Unstaged changes:
Modified: .gitignore
Modified: analysis/disease_inves_by_ancest.Rmd
Modified: analysis/get_dbgap_ids.Rmd
Modified: analysis/get_full_text.Rmd
Modified: analysis/index.Rmd
Modified: analysis/map_trait_to_icd10.Rmd
Modified: analysis/missing_cohort_info.Rmd
Modified: analysis/replication_ancestry_bias.Rmd
Modified: analysis/specific_aims_stats.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown
(analysis/text_for_cohort_labels.Rmd) and HTML
(docs/text_for_cohort_labels.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | de0b128 | IJbeasley | 2026-02-23 | Save cohort sentences |
| html | 1c2ecd8 | IJbeasley | 2026-02-23 | Build site. |
| Rmd | 0ea67c6 | IJbeasley | 2026-02-23 | … actually (plz) fixing adding meta-data to doccano |
| html | a240263 | IJbeasley | 2026-02-23 | Build site. |
| Rmd | b4b420f | IJbeasley | 2026-02-23 | Fixing adding meta-data to doccanno |
| html | ceacd1e | IJbeasley | 2026-02-23 | Build site. |
| Rmd | 1fc02e6 | IJbeasley | 2026-02-23 | Update text-extraction |
| html | db3e0bd | IJbeasley | 2025-12-28 | Build site. |
| Rmd | 79191ff | IJbeasley | 2025-12-28 | Update text cohort extracting to print out information on numbers of abstracts |
| html | cb5cf9e | IJbeasley | 2025-12-28 | Build site. |
| Rmd | 697dbb1 | IJbeasley | 2025-12-28 | Update text cohort extracting |
| html | 2593d6a | IJbeasley | 2025-12-28 | Build site. |
| Rmd | 410a36a | IJbeasley | 2025-12-28 | Include GWAS Catalog cohorts in grep search for cohort sentences |
| html | 238486e | IJbeasley | 2025-10-24 | Build site. |
| Rmd | 0d8b872 | IJbeasley | 2025-10-24 | Cleaning up abstract collecting code again |
| html | 2afc108 | IJbeasley | 2025-10-24 | Build site. |
| Rmd | 748dac2 | IJbeasley | 2025-10-24 | Cleaning up abstract collecting |
library(stringr)
library(readxl)
library(dplyr)
library(stringi)
library(httr)
library(rentrez)
library(xml2)
library(jsonlite)
library(tokenizers)
## Step 1:
# get only relevant disease studies
gwas_study_info <- data.table::fread(here::here("output/icd_map/gwas_study_gbd_causes.csv"))
gwas_study_info = gwas_study_info |>
dplyr::rename_with(~ gsub(" ", "_", .x))
# filter out infectious diseases
gwas_study_info <- gwas_study_info |>
dplyr::filter(!cause %in% c("HIV/AIDS",
"Tuberculosis",
"Malaria",
"Lower respiratory infections",
"Diarrhoeal diseases",
"Neonatal disorders",
"Tetanus",
"Diphtheria",
"Pertussis" ,
"Measles",
"Maternal disorders"))
# group conditions into broader categories to ensure we have enough papers in each category for sampling
lancet_cause_mapping <- readxl::read_xlsx(here::here("data/icd/lancet_conditions_icd10.xlsx"),
sheet = 1) |>
dplyr::rename_with(~ gsub(" ", "_", .x))
set.seed(500)
training_sample = gwas_study_info |>
left_join(lancet_cause_mapping |> select(cause = mapped_gbd_term,
lancet_condition),
by = "cause",
relationship = "many-to-many") |>
select(PUBMED_ID,
lancet_condition) |>
distinct() |>
dplyr::group_by(lancet_condition) |>
dplyr::slice_sample(prop = 0.25) |>
pull(PUBMED_ID) |>
unique()
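The stratified sampling step above can be illustrated on toy data: `slice_sample(prop = 0.25)` draws 25% of rows *within each group*, so small disease categories still contribute to the training sample. (Hypothetical data, not from the analysis.)

```r
library(dplyr)

# Toy data: two categories with 8 papers each (hypothetical, for illustration)
set.seed(1)
toy <- data.frame(grp = rep(c("a", "b"), each = 8), id = 1:16)

sampled <- toy |>
  group_by(grp) |>
  slice_sample(prop = 0.25) |>   # 25% drawn within each group
  ungroup()

nrow(sampled)   # 4: two rows sampled from each of the two groups
```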
pmids <- training_sample
gwas_study_info =
gwas_study_info |>
filter(PUBMED_ID %in% training_sample)
print("Number of unique pubmed ids for disease studies:")
[1] "Number of unique pubmed ids for disease studies:"
gwas_study_info$PUBMED_ID |> unique() |> length()
[1] 241
converted_ids <- data.table::fread(here::here("output/fulltexts/pmid_to_pmcid_mapping.csv"))
full_text_files <-
list.files(here::here("output/fulltexts"),
recursive = T,
pattern = "\\.html$|\\.xml$")
full_text_files <- basename(full_text_files) |>
stringr::str_remove_all("\\.html$|\\.xml$") |>
unique()
# convert pmcids to pmids
converted_fulltext_pmcids <-
converted_ids |>
filter(pmcids %in% full_text_files) |>
pull(PMID) |>
unique()
full_text_files <- c(full_text_files,
converted_fulltext_pmcids)
full_text_pmids <- grep("PMC",
full_text_files,
invert = T,
value = T)
print("Number of training sample papers with full text retrieved:")
[1] "Number of training sample papers with full text retrieved:"
sum(training_sample %in% full_text_pmids)
[1] 231
print("Number of unique pubmed ids for disease studies with full text retrieved:")
[1] "Number of unique pubmed ids for disease studies with full text retrieved:"
gwas_study_info |>
filter(PUBMED_ID %in% full_text_pmids) |>
select(PUBMED_ID, cause) |>
distinct() |>
group_by(cause) |>
summarise(n = n())
# A tibble: 21 × 2
cause n
<chr> <int>
1 Cervical cancer 22
2 Chronic kidney disease due to diabetes mellitus type 1 3
3 Chronic kidney disease due to diabetes mellitus type 2 5
4 Chronic obstructive pulmonary disease 17
5 Cirrhosis and other chronic liver diseases 80
6 Diabetes mellitus 87
7 Intracerebral hemorrhage 7
8 Ischemic heart disease 73
9 Ischemic stroke 31
10 Larynx cancer 18
# ℹ 11 more rows
The GWAS Catalog ancestry data frame contains: country of recruitment, ancestry labels, and population descriptors.
gwas_ancest_info <- data.table::fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-ancestries-r2025-07-21.tsv"),
sep = "\t",
quote = "")
gwas_ancest_info = gwas_ancest_info |>
dplyr::rename_with(~ gsub(" ", "_", .x))
countries <-
gwas_ancest_info$COUNTRY_OF_RECRUITMENT |>
stringr::str_split("\\,") |>
unlist() |>
str_trim() |>
unique()
countries <- countries[countries != "NR"]
countries <- c(countries,
"United States",
"United Kingdom",
"Korea",
"America",
"USA",
"United States of America",
"Latin America")
population_descriptors <-
c(gwas_ancest_info$BROAD_ANCESTRAL_CATEGORY, gwas_ancest_info$ADDITIONAL_ANCESTRY_DESCRIPTION) |>
str_replace_all("Recruitment in North, Central, and South America, Asia, and Europe",
"North America, Central America, Asia, Europe") |>
  stringr::str_split("\\,|;|heritage:\\|") |>  # split on commas, semicolons, or "heritage:|"
unlist() |>
str_trim()
# processing population descriptors to remove extra information and get more general terms to match cohort descriptions in abstracts (e.g. "European" instead of "European (non-Finnish)")
population_descriptors <- population_descriptors |>
str_remove_all("\\.$") |> # Escape the dot to make it literal
str_remove_all("^and ") |>
str_remove_all(" \\(founder/genetic isolate\\)$|\\(founder/genetic isolate$") |>
#str_remove_all("^[0-9+]% ") |>
str_remove_all("^\\d+% ") |> # \d is shorthand for [0-9]
str_remove_all("^[‡†*]+") |> # Remove dagger symbols at start
str_remove_all(" - Korea Association Resource \\(KARE\\)$") |>
str_remove_all("or unknown ancestry$|\\(middle eastern$") |>
str_remove_all("\\(see Springelkamp 2017\\)") |>
str_remove_all("parents & grandparents born in ") |>
str_remove_all(" cell lines$| cell line$") |>
str_remove_all(" cohort$") |>
str_remove_all(" population$") |>
str_remove_all("^Erasmus Rucphen in |^Erasmus Rucphen") |>
str_remove_all("^including ") |>
str_remove_all("\\(Middle Eastern$") |>
unique()
# removing population descriptors that are not valid to match cohort descriptions in abstracts (e.g. "cases", "controls", "Bipolar disorder")
not_valid_pop_descriptors <- c("NR", "N.R", "Other", "other", "", "unspecified",
"East", "Euopre",
"UKB", "UKBB", "DECODE", "Controls from UKBiobank", "UKBB and deCODE",
"NHAPC", "GeneID-I", " Family",
"non-Hispaniic white", "Zimbabweian", "Portugese", "Europen American",
"(see Moffatt et al 2010)", "See Wu J. H",
"et al. 2013. PMID: 23362303", "See Wu et al 23362303", "23362303",
"See Locke (PubMed 25673413) for BMI and Shungin (PubMed 25673412) for WHR")
disease_descriptors <- grep("cases|Bipolar|controls|Schizophrenia|disorder",
population_descriptors,
value = T)
population_descriptors <- population_descriptors[!(population_descriptors %in%
c(not_valid_pop_descriptors,
disease_descriptors
)
)]
# Get all terms in one request (page size 1000 covers the full ontology)
response <- GET("https://www.ebi.ac.uk/ols4/api/ontologies/hancestro/terms?size=1000")
data <- fromJSON(content(response, "text"), flatten = TRUE)
hancestro_terms <- data$`_embedded`$terms$label
# Remove terms with specific biobanks / datasets in brackets, as these are unlikely to be used in cohort descriptions in abstracts / texts
hancestro_terms <- str_remove_all(hancestro_terms,
" \\(SGDP\\)$| \\(1KGP\\)$| \\(HGDP\\)$| \\(GGVP\\)$")
# Remove obsolete terms
hancestro_terms <- grep("obsolete|obsolescence",
hancestro_terms,
value = TRUE,
invert = TRUE)
# remove terms with specific regions in brackets, as these are unlikely to be used in cohort descriptions in abstracts / texts
hancestro_terms <- str_remove_all(hancestro_terms, pattern = "\\(Carmel\\)$|\\(pre1989\\)$|\\(Bergamo\\)$|\\(Negev\\)$|\\(Central\\)|\\(Caucasus\\)")
# replace underscores with spaces
hancestro_terms <- str_replace_all(hancestro_terms,
pattern = "_",
replacement = " ")
hancestro_terms <- unique(hancestro_terms)
# specific hancestro terms to exclude:
terms_to_exclude <- c("ancestry category",
"ancestry status",
"BFO 0000006",
"Country",
"continent",
"continuant",
"curation status specification",
"denotator type",
"entity",
"ethnicity category",
"ethnicity descriptor",
"geographic location",
"geographic descriptor",
"geography-based population category",
"immaterial entity",
"independent continuant",
"!Kung",
"material entity",
"quality",
"reference population",
"region",
"organization",
"population",
"specifically dependent continuant",
"Thing",
"uncategorised population",
"undefined ancestry population")
hancestro_terms <- hancestro_terms[!(hancestro_terms %in% terms_to_exclude)]
Population descriptor sources: countries, R packages, and the population descriptors listed in: https://pmc.ncbi.nlm.nih.gov/articles/PMC8715140/#sec2
# combine terms from different sources, and add some extra terms based on looking at cohort descriptions in abstracts
cohort_context_terms <-
c(countries, hancestro_terms, population_descriptors) |>
unique() |>
sort()
# terms to add:
cohort_context_terms <- c(cohort_context_terms,
"Scandinavians",
"Native Hawaiians")
# terms to remove ... "Qatar", "UK", "Taiwan", "Korean"
cohort_context_terms <- cohort_context_terms[!cohort_context_terms %in% c("Qatar",
"UK",
"Taiwan",
"Korean",
"population")]
# escape brackets as \\( and \\) so they match literally in regex patterns
cohort_context_terms <- cohort_context_terms |>
str_replace_all(pattern = "\\(", replacement = "\\\\(") |>
str_replace_all(pattern = "\\)", replacement = "\\\\)")
# re-add the removed ambiguous terms with negative lookaheads, so they match
# only when not part of a biobank name
cohort_context_terms <- c(cohort_context_terms,
                          "UK\\b(?! Biobank)",
                          "Qatar\\b(?! Biobank)",
                          "Taiwan\\b(?! Biobank)",
                          "Korean\\b(?! Biobank)"
                          )
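These lookahead patterns can be sanity-checked directly; stringr's default ICU regex engine supports negative lookaheads. (Toy sentences, for illustration only.)

```r
library(stringr)

# Negative lookahead: match "UK" only when NOT followed by " Biobank"
pattern <- "UK\\b(?! Biobank)"

str_detect("Samples from the UK were genotyped", regex(pattern))  # TRUE
str_detect("We used UK Biobank participants", regex(pattern))     # FALSE
```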
# sort by length of term (longest first) to match longest names first
cohort_context_terms <- cohort_context_terms[order(-nchar(cohort_context_terms))]
writeLines(cohort_context_terms,
here::here("output/gwas_cat/ancestry_population_terms.txt"))
cohort_context_terms <- readLines(here::here("output/gwas_cat/ancestry_population_terms.txt"))
print("Number of cohort context terms:")
[1] "Number of cohort context terms:"
length(cohort_context_terms)
[1] 924
gwas_study_info_cohort =
data.table::fread(here::here("output/gwas_cohorts/gwas_cohort_name_corrected.csv"))
gwas_study_info_cohort =
gwas_study_info_cohort |>
dplyr::rename_with(~ gsub(" ", "_", .x))
gwas_study_info_cohort =
gwas_study_info_cohort |>
select(PUBMED_ID,
COHORT) |>
distinct()
print("Check adding cohort information has only added columns, not rows:")
[1] "Check adding cohort information has only added columns, not rows:"
print("Before adding:")
[1] "Before adding:"
dim(gwas_study_info)
[1] 8241 13
gwas_study_info =
left_join(gwas_study_info,
gwas_study_info_cohort,
by = "PUBMED_ID"
)
print("After adding:")
[1] "After adding:"
dim(gwas_study_info)
[1] 8241 14
print("Check adding cohort information has not increased number of unique pubmed ids:")
[1] "Check adding cohort information has not increased number of unique pubmed ids:"
gwas_study_info |>
pull(PUBMED_ID) |>
unique() |>
length()
[1] 241
print("Number of unique pubmed ids for disease studies with cohort info:")
[1] "Number of unique pubmed ids for disease studies with cohort info:"
gwas_study_info |>
filter(COHORT != "") |>
pull(PUBMED_ID) |>
unique() |>
length()
[1] 52
gwas_ancest_info =
gwas_ancest_info |>
select(PUBMED_ID,
BROAD_ANCESTRAL_CATEGORY,
COUNTRY_OF_RECRUITMENT) |>
distinct() |>
group_by(PUBMED_ID) |>
summarise(
BROAD_ANCESTRAL_CATEGORY = paste(
unique(
unlist(strsplit(BROAD_ANCESTRAL_CATEGORY, split = "\\|"))
),
collapse = "|"
),
COUNTRY_OF_RECRUITMENT = paste(
unique(
unlist(strsplit(COUNTRY_OF_RECRUITMENT, split = "\\|"))
),
collapse = "|"
)
)
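The `summarise()` calls above rely on a split–unique–collapse idiom for pipe-delimited strings; a minimal sketch with toy values:

```r
# Merge pipe-delimited ancestry strings into one de-duplicated string,
# as done per PUBMED_ID above (toy values, for illustration)
x <- c("European|East Asian", "European", "East Asian|African")
merged <- paste(unique(unlist(strsplit(x, split = "\\|"))), collapse = "|")
merged
# "European|East Asian|African"
```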
print("Check adding ancestry information has only added columns, not rows:")
[1] "Check adding ancestry information has only added columns, not rows:"
print("Before adding:")
[1] "Before adding:"
dim(gwas_study_info)
[1] 8241 14
gwas_study_info =
left_join(gwas_study_info,
gwas_ancest_info,
by = "PUBMED_ID"
)
print("After adding:")
[1] "After adding:"
dim(gwas_study_info)
[1] 8241 16
gwas_study_info <-
gwas_study_info |>
#filter(COHORT != "") |>
select(PUBMED_ID,
COHORT,
YEAR,
BROAD_ANCESTRAL_CATEGORY,
COUNTRY_OF_RECRUITMENT) |>
distinct() |>
group_by(PUBMED_ID,
YEAR,
BROAD_ANCESTRAL_CATEGORY,
COUNTRY_OF_RECRUITMENT) |>
summarise(
COHORT = paste(
unique(
unlist(strsplit(COHORT, split = "\\|"))
),
collapse = "|"
)
)
`summarise()` has grouped output by 'PUBMED_ID', 'YEAR',
'BROAD_ANCESTRAL_CATEGORY'. You can override using the `.groups` argument.
gwas_study_info =
gwas_study_info |>
ungroup() |>
arrange(PUBMED_ID)
pmids = gwas_study_info$PUBMED_ID
cohort = gwas_study_info$COHORT
names(cohort) = pmids
date = gwas_study_info$YEAR
names(date) = pmids
country = gwas_study_info$COUNTRY_OF_RECRUITMENT
names(country) = pmids
ancestry = gwas_study_info$BROAD_ANCESTRAL_CATEGORY
names(ancestry) = pmids
print("Number of papers without cohort information:")
[1] "Number of papers without cohort information:"
gwas_study_info |>
filter(COHORT == "") |>
nrow()
[1] 189
print("Number of papers with cohort information:")
[1] "Number of papers with cohort information:"
gwas_study_info |>
filter(COHORT != "") |>
nrow()
[1] 52
# this xlsx was built from looking at acronyms / cohort names in the gwas catalog
# and finding the corresponding full names / details of cohorts
cohort_names <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
sheet = 1) |>
mutate(across(everything(),
~stringr::str_replace_all(.x,
pattern = "\u00A0",
replacement = " "))
)
New names:
• `` -> `...15`
cohort_full_names = cohort_names$full_name[!is.na(cohort_names$full_name)]
cohort_full_names <- str_trim(cohort_full_names)
cohort_full_names <- iconv(cohort_full_names, to = "UTF-8")
cohort_full_names <- gsub("[\u00A0\r\n]", " ", cohort_full_names) # replace non-breaking spaces, CR, LF with space
cohort_full_names <- str_squish(cohort_full_names) # trims and removes extra spaces
# sort by length of name (longest first) to match longest names first
cohort_full_names <- cohort_full_names[order(-nchar(cohort_full_names))]
cohort_full_names <- cohort_full_names[cohort_full_names != "Not Reported"]
cohort_full_names <- unique(cohort_full_names)
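A small illustration (toy terms, not from the dictionary) of why the names are sorted longest-first: when terms are collapsed into a single alternation pattern, the regex engine takes the first alternative that matches at a position, so a short term can shadow a longer name that starts the same way.

```r
library(stringr)

terms <- c("UK", "UK Biobank")

# Short term first: the prefix shadows the full cohort name
short_first <- str_extract("the UK Biobank cohort",
                           paste(terms, collapse = "|"))
short_first   # "UK"

# Longest first: the full name is matched
long_first <- str_extract("the UK Biobank cohort",
                          paste(terms[order(-nchar(terms))], collapse = "|"))
long_first    # "UK Biobank"
```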
print("Number of unique cohort full names:")
[1] "Number of unique cohort full names:"
length(cohort_full_names)
[1] 767
cohort_abbr_names = cohort_names$cohort[!is.na(cohort_names$cohort)]
cohort_abbr_names <- str_trim(cohort_abbr_names)
cohort_abbr_names <- iconv(cohort_abbr_names, to = "UTF-8")
cohort_abbr_names <- gsub("[\u00A0\r\n]", " ", cohort_abbr_names) # replace non-breaking spaces, CR, LF with space
cohort_abbr_names <- str_squish(cohort_abbr_names) # trims and removes extra spaces
# remove abbreviations that are too short
# cohort_abbr_names <- cohort_abbr_names[nchar(cohort_abbr_names) >= 4]
# small_abbr_to_keep <- c("C4D",
# "BBJ",
# "UKB",
# "MVP",
# "TWB",
# "QBB",
# "MEC",
# "WHI"
# )
# cohort_abbr_names <- unique(c(cohort_abbr_names,
# small_abbr_to_keep
# ))
cohort_abbr_names <- cohort_abbr_names[!str_detect(cohort_abbr_names,
                                                   pattern = "\\?")]
# sort by length of name (longest first) to match longest names first
cohort_abbr_names <- cohort_abbr_names[order(-nchar(cohort_abbr_names))]
# add cohort names from GWAS catalog not yet added to data-dictionary
gwas_cat_cohorts = gwas_study_info_cohort$COHORT
gwas_cat_cohorts = unlist(strsplit(gwas_cat_cohorts, "\\|"))
gwas_cat_cohorts = gwas_cat_cohorts[!(gwas_cat_cohorts %in% c("", "other", "multiple"))]
cohort_abbr_names = unique(c(cohort_abbr_names,
gwas_cat_cohorts))
# remove small abbreviations that are likely to be false positives
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "DN"]
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "CHB"]
cohort_abbr_names <- cohort_abbr_names[cohort_abbr_names != "FG"]
print("Number of unique cohort abbreviation names:")
[1] "Number of unique cohort abbreviation names:"
length(cohort_abbr_names)
[1] 1029
source spacyr_venv/bin/activate
python3 code/extract_text/spacy_obtain_sentences.py \
--input_dir output/abstracts \
--output_dir output/abstracts
abstract_files <- list.files(here::here("output/abstracts/"),
pattern = "*.json",
full.names = FALSE
) |>
sort()
abstract_pmids = str_remove_all(abstract_files, "_sentences.json$")
abstract_pmids = abstract_pmids[abstract_pmids %in% pmids]
abstracts <- sapply(abstract_pmids,
                    function(pmid) {
                      json_data <- fromJSON(here::here(paste0("output/abstracts/", pmid, "_sentences.json")))
                      # drop empty sentences
                      json_data[json_data != ""]
                    }
                    )
# check lengths of these vectors are the same
print("Check lengths of vectors are the same:")
[1] "Check lengths of vectors are the same:"
print(paste("Length of pmids:", length(pmids)))
[1] "Length of pmids: 241"
print(paste("Length of abstract pmids:", length(abstract_pmids)))
[1] "Length of abstract pmids: 239"
print(paste("Length of abstracts:", length(abstracts)))
[1] "Length of abstracts: 239"
print(paste("Length of cohort:", length(cohort)))
[1] "Length of cohort: 241"
print(paste("Length of date:", length(date)))
[1] "Length of date: 241"
missing_abstracts = pmids[!(pmids %in% abstract_pmids)]
print("Number of missing abstracts:")
[1] "Number of missing abstracts:"
length(missing_abstracts)
[1] 2
#pmids <- pmids[pmids %in% abstract_pmids]
extract_cohort_sentences <- function(abstract_list,
cohort_names,
column_name = "COHORT",
tokenize = FALSE,
ignore_case) {
cohort_names_grep <- paste0("\\b", cohort_names, "\\b") # add word boundaries to match whole words only
results <- lapply(seq_along(abstract_list), function(i) {
#abstract <- text_vector[i]
# Split abstract into sentences
if(tokenize) {
sentences <- tokenizers::tokenize_sentences(abstract_list[[i]])[[1]]
} else {
sentences <- abstract_list[[i]]
}
# For each sentence, find all matching cohort names
lapply(seq_along(sentences), function(s) {
sentence <- sentences[s]
# Identify cohort names present in this sentence
matched_cohorts <- cohort_names[str_detect(sentence,
regex(cohort_names_grep,
ignore_case = ignore_case))]
cohort_df <-
data.frame(
article_id = i,
sentence_id = s,
sentence = sentence,
has_cohort = length(matched_cohorts) > 0,
#!!column_name = if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else "",
#COHORT = if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else "",
stringsAsFactors = FALSE
)
# Add dynamic column
cohort_df[[column_name]] <- if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = T) else ""
return(cohort_df)
}) |> bind_rows()
})
results <- bind_rows(results)
results$pubmed_id <- names(abstract_list)[results$article_id]
return(results) #returns a df of sentences, with labelled columns for whether they contain cohort names, and which cohort names they contain (if any)
# id of sentence, and abstract / article
}
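A quick sanity check (toy sentences, not from the corpus) of the `\\b` word boundaries that `extract_cohort_sentences()` adds around each term, which prevent abbreviation matches inside longer tokens:

```r
library(stringr)

pat <- regex("\\bDCCT\\b")

str_detect("participants from DCCT were genotyped", pat)  # TRUE
str_detect("the ABCDCCTX identifier", pat)                # FALSE: embedded in a longer token
```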
# extract sentences with cohort full names (case-insensitive matching, as full names are less likely to be ambiguous)
cohort_sentences_df_p1 <- extract_cohort_sentences(abstracts,
cohort_full_names,
ignore_case = TRUE
)
# extract sentences with cohort abbreviation names (case-sensitive matching, as abbreviations are more likely to be ambiguous)
cohort_sentences_df_p2 = extract_cohort_sentences(abstracts,
cohort_abbr_names,
ignore_case = FALSE
)
cohort_sentences_df =
bind_rows(cohort_sentences_df_p1,
cohort_sentences_df_p2
)
cohort_sentences_df =
cohort_sentences_df |>
distinct()
separate_cohorts <- function(COHORT) {
if (any(grepl("\\|", COHORT))) {
return(unlist(strsplit(COHORT, "\\|")))
} else {
return(COHORT)
}
}
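For reference, `separate_cohorts()` behaves as follows on toy inputs (the function is reproduced so the example is self-contained):

```r
separate_cohorts <- function(COHORT) {
  if (any(grepl("\\|", COHORT))) {
    return(unlist(strsplit(COHORT, "\\|")))
  } else {
    return(COHORT)
  }
}

separate_cohorts("UK Biobank|FinnGen")  # c("UK Biobank", "FinnGen")
separate_cohorts(c("A|B", "C"))         # c("A", "B", "C")
```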
cohort_sentences_df =
cohort_sentences_df |>
group_by(pubmed_id,
article_id,
sentence_id,
sentence) |>
summarise(
COHORT = str_flatten(unique(separate_cohorts(COHORT)),
collapse = "|",
na.rm = T),
has_cohort = ifelse(any(COHORT != ""), TRUE, FALSE)
) |>
ungroup()
`summarise()` has grouped output by 'pubmed_id', 'article_id', 'sentence_id'.
You can override using the `.groups` argument.
cohort_sentences_df =
cohort_sentences_df |>
mutate(COHORT = str_remove_all(COHORT,
pattern = "\\|$|^\\|"
)
)
json_list <-
cohort_sentences_df |>
filter(has_cohort) |>
pull(sentence) |>
as.list()
# Write to JSON file
writeLines(toJSON(json_list,
auto_unbox = TRUE,
pretty = TRUE),
here::here("output/doccano/abstract_cohort_sentences.json"))
json_list <-
cohort_sentences_df |>
filter(!has_cohort) |>
pull(sentence) |>
as.list()
# Write to JSON file
writeLines(toJSON(json_list,
auto_unbox = TRUE,
pretty = TRUE),
here::here("output/doccano/abstract_non_cohort_sentences.json"))
# remove stop words and common words that are not informative for cohort context
removal_words <- c("the", "and", "of", "in", "to", "with", "a",
"for", "on", "by", "is", "are", "was", "were", "each", "all", "had", "have", "it",
"as", "from", "that", "this", "which", "be", "at", "or", "an", "then", "than", "into",
"if", "not", "only", "both", "same", "after", "across", "between", "out", "up", "any",
"we", "our", "us", "these", "within", "per",
"used", "using", "use",
"0.01", "0.05", "0.5", "1", "one", "2", "two", "3", "three", "iii",
"4", "5", "6", "p", "8", "10", "value", "significant", "30",
"i", "r", "wide"
)
cohort_sentences_words <-
cohort_sentences_df |>
filter(has_cohort) |>
pull(sentence) |>
tokenizers::tokenize_words() |>
unlist()
cohort_sentence_words_df =
data.frame(word = cohort_sentences_words) |>
filter(!(tolower(word) %in% removal_words)) |>
group_by(word) |>
summarise(n_in_cohort = n()) |>
filter(n_in_cohort > 5)
non_cohort_sentences_words <-
cohort_sentences_df |>
filter(!has_cohort) |>
pull(sentence) |>
tokenizers::tokenize_words() |>
unlist()
# sample down non-cohort sentences words
set.seed(500)
non_cohort_sentences_words_sample <- sample(non_cohort_sentences_words,
size = length(cohort_sentences_words),
replace = FALSE)
non_cohort_sentence_words_df =
data.frame(word = non_cohort_sentences_words_sample) |>
filter(!(tolower(word) %in% removal_words)) |>
group_by(word) |>
summarise(n_not_cohort = n())
n_sentences <- cohort_sentences_df |>
filter(COHORT != "") |>
select(article_id, sentence_id) |>
distinct() |>
nrow()
print("Words more common in sentences with cohort names than sentences without cohort names:")
[1] "Words more common in sentences with cohort names than sentences without cohort names:"
left_join(cohort_sentence_words_df,
non_cohort_sentence_words_df,
by = "word"
) |>
mutate(n_not_cohort = ifelse(is.na(n_not_cohort), 0, n_not_cohort)) |>
mutate(delta_n = n_in_cohort - n_not_cohort) |>
mutate(recall = n_in_cohort / n_sentences) |>
arrange(desc(delta_n)) |>
head(20)
# A tibble: 20 × 5
word n_in_cohort n_not_cohort delta_n recall
<chr> <int> <dbl> <dbl> <dbl>
1 biobank 53 0 53 0.310
2 study 62 19 43 0.363
3 uk 41 1 40 0.240
4 data 38 9 29 0.222
5 n 29 4 25 0.170
6 cases 35 16 19 0.205
7 controls 32 15 17 0.187
8 consortium 15 0 15 0.0877
9 cohorts 15 2 13 0.0877
10 project 13 0 13 0.0760
11 african 12 0 12 0.0702
12 based 18 7 11 0.105
13 genomes 11 0 11 0.0643
14 individuals 25 14 11 0.146
15 performed 25 14 11 0.146
16 1000 10 0 10 0.0585
17 19 10 0 10 0.0585
18 ancestry 14 4 10 0.0819
19 analysis 40 31 9 0.234
20 cohort 15 6 9 0.0877
print("Number of abstracts containing probable cohort reference")
[1] "Number of abstracts containing probable cohort reference"
abstracts_cohorts <- cohort_sentences_df |>
filter(COHORT != "") |>
pull(article_id) |>
unique()
n_abstracts_cohorts = abstracts_cohorts |>
length()
print(n_abstracts_cohorts)
[1] 99
print("Percentage of sampled abstracts containing probable cohort reference:")
[1] "Percentage of sampled abstracts containing probable cohort reference:"
100 * n_abstracts_cohorts / length(unique(pmids))
[1] 41.07884
print("Number of sentences containing probable cohort reference")
[1] "Number of sentences containing probable cohort reference"
cohort_sentences_df |>
filter(COHORT != "") |>
select(article_id, sentence_id) |>
distinct() |>
nrow()
[1] 171
print("Number of entities")
[1] "Number of entities"
cohort_sentences_df |>
filter(COHORT != "") |>
pull(COHORT) |>
str_split("\\|") |>
unlist() |>
length()
[1] 279
print("Number of unique cohorts referenced (not unique cohort names, but unique strings in the COHORT column):")
[1] "Number of unique cohorts referenced (not unique cohort names, but unique strings in the COHORT column):"
detected_cohorts <- cohort_sentences_df |>
filter(COHORT != "") |>
pull(COHORT) |>
str_split("\\|") |>
unlist()
n_cohorts <- detected_cohorts |>
unique() |>
length()
print(n_cohorts)
[1] 115
print("Most common cohort names detected (by number of sentences they are mentioned in):")
[1] "Most common cohort names detected (by number of sentences they are mentioned in):"
data.frame(cohort = detected_cohorts) |>
group_by(cohort) |>
summarise(n = n()) |>
arrange(desc(n)) |>
head(10)
# A tibble: 10 × 2
cohort n
<chr> <int>
1 UK Biobank 38
2 1000 Genomes 10
3 Biobank Japan 7
4 HTN 7
5 WTCCC 7
6 Taiwan 6
7 AA-DHS 5
8 COPDGene 5
9 DCCT 5
10 DHS 5
print("Most common cohort names detected (by number of abstracts they are mentioned in):")
[1] "Most common cohort names detected (by number of abstracts they are mentioned in):"
cohort_sentences_df |>
filter(COHORT != "") |>
select(COHORT, article_id) |>
tidyr::separate_rows(COHORT, sep = "\\|") |>
distinct() |>
group_by(COHORT) |>
summarise(n_abstracts = n()) |>
arrange(desc(n_abstracts)) |>
head(5)
# A tibble: 5 × 2
COHORT n_abstracts
<chr> <int>
1 UK Biobank 29
2 1000 Genomes 9
3 Biobank Japan 6
4 FinnGen 4
5 Taiwan 4
The terms below are used to correct for the over-representation of sentences without cohort names, by identifying well-matched control sentences.
# words that indicate sample or cohort is being discussed:
sample_terms <- c(
# Your original list
"ancestry", "biobank", "cases", "cohort", "controls",
"consortium", "consortia", "descent", "founder",
"enrolled", "enrollment", "ethnicity", "heritage",
"isolate", "isolated", "individuals", "participants",
"patients", "population", "recruitment", "registry",
"sample", "study", "subjects", "volunteer",
# Additional core terms
"recruited", "enroll", "ascertained", "volunteers",
"probands", "affected", "unaffected", "families", "twins",
"cohorts", "populations", "samples", "subgroup", "subset",
"subcohort", "demographic", "ethnic", "racial",
"admixed", "admixture", "ancestral", "origin",
"case-control", "prospective", "genotyped", "sequenced",
"meta-analysis", "pooled", "residing", "nationwide",
# Useful variants
"enrollees", "sampling", "lineage", "self-reported",
"self-identified", "replication", "discovery", "validation"
)
control_cohort_sentences_df <-
cohort_sentences_df |>
filter(COHORT == "" & CONTEXT == "") |>
filter(grepl(paste(sample_terms, collapse = "|"),
sentence,
ignore.case = TRUE)
)
print("Number of sentences without (detected) cohort names or cohort-context but with sample-related context:")
[1] "Number of sentences without (detected) cohort names or cohort-context but with sample-related context:"
nrow(control_cohort_sentences_df)
[1] 706
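One caveat with `grepl(paste(sample_terms, collapse = "|"), ...)` is that plain alternation matches substrings, so "cases" also fires inside "staircases". If stricter matching were needed, word boundaries (`\b`) would restrict hits to whole words; the R equivalent would be wrapping the pattern in `\\b(...)\\b`. An illustrative Python `re` sketch of the difference (toy terms, not the full list):

```python
import re

sample_terms = ["cases", "cohort", "study"]

# Plain alternation matches substrings: "cases" hits inside "staircases"
loose = re.compile("|".join(map(re.escape, sample_terms)), re.IGNORECASE)

# Word boundaries (\b) restrict matches to whole words only
strict = re.compile(r"\b(?:" + "|".join(map(re.escape, sample_terms)) + r")\b",
                    re.IGNORECASE)

sentence = "The staircases were steep."
# loose matches this sentence; strict does not
```

For this filter a looser match errs on the side of keeping candidate sentences, which is arguably the right trade-off for building an annotation set.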
filtered_cohort_sentences_df <-
bind_rows(filtered_cohort_sentences_df,
control_cohort_sentences_df
) |>
distinct()
print("Percentage of sentences with cohort names vs other cohort-related context")
[1] "Percentage of sentences with cohort names vs other cohort-related context"
total_n_sentences <- nrow(filtered_cohort_sentences_df)
filtered_cohort_sentences_df |>
group_by(has_cohort) |>
summarise(n = n(),
percentage = round(100 * n/total_n_sentences, digits = 2)
)
# A tibble: 2 × 3
has_cohort n percentage
<lgl> <int> <dbl>
1 FALSE 920 84.3
2 TRUE 171 15.7
# file path for intermediate json file output
json_file <- here::here("output/doccano/abstracts_with_cohort_info.json")
# file path for final jsonl file output
jsonl_file <- here::here("output/doccano/abstracts_with_cohort_info.jsonl")
convert_to_doccano_json_sentence_level <- function(cohort_sentences_df) {
  # set up json list
  doccano_list <- list()
  example_id <- 1
  for (current_sentence in unique(cohort_sentences_df$sentence)) {
    # All rows that share this sentence (a sentence can recur across articles)
    df <- cohort_sentences_df |>
      dplyr::filter(sentence == current_sentence)
    for (i in seq_len(nrow(df))) {
      matched_cohort <- df$COHORT[i]
      pubmed_id_val <- df$pubmed_id[i]
      date_val <- df$date[i]
      cohort_val <- df$cohort[i]
      country_val <- df$country[i]
      if (matched_cohort == "") {
        doccano_list[[example_id]] <- list(
          text = current_sentence,
          pubmed_id = pubmed_id_val,
          date = date_val,
          country = country_val,
          gwas_cat_cohort_label = cohort_val,
          label = list()
        )
      } else {
        # Find the locations of all matches of the cohort name(s) in the sentence
        if (grepl("\\|", matched_cohort)) {
          # Multiple cohort names are pipe-delimited: split and locate each
          matched_cohort <- unlist(strsplit(matched_cohort, split = "\\|"))
          match_locations <- list()
          for (current_matched_cohort in matched_cohort) {
            matches <- str_locate_all(current_sentence,
                                      fixed(current_matched_cohort,
                                            ignore_case = TRUE)
                                      )[[1]]
            match_locations <- append(match_locations, list(matches))
          }
          # Combine all match locations into a single matrix
          matches <- do.call(rbind, match_locations)
        } else {
          matches <- str_locate_all(current_sentence,
                                    fixed(matched_cohort,
                                          ignore_case = TRUE)
                                    )[[1]]
        }
        # Convert starts to 0-based indexing for doccano; str_locate_all()
        # ends are 1-based inclusive, which equals doccano's exclusive end
        matches[, "start"] <- matches[, "start"] - 1
        # Turn match locations into an entity list
        entities <- list()
        for (k in seq_len(nrow(matches))) {
          entities <- append(entities, list(list(
            start_offset = matches[k, "start"],
            end_offset = matches[k, "end"],
            label = "COHORT"
          )))
        }
        # Create the doccano JSON entry
        doccano_list[[example_id]] <- list(
          text = current_sentence,
          pubmed_id = pubmed_id_val,
          date = date_val,
          country = country_val,
          gwas_cat_cohort_label = cohort_val,
          label = entities
        )
      }
      # Increment inside the row loop so rows that share a sentence
      # do not overwrite each other's entries
      example_id <- example_id + 1
    }
  }
  # Convert named-list labels to doccano's vector format [start, end, label]
  doccano_list <- lapply(doccano_list, function(x) {
    x$label <- lapply(x$label, function(l) {
      c(l$start_offset, l$end_offset, l$label)
    })
    x
  })
  return(doccano_list)
}
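The offset arithmetic is worth a sanity check: `str_locate_all()` returns 1-based inclusive (start, end) positions, and doccano expects 0-based starts with an exclusive end. Subtracting 1 from the start only yields exactly that convention. An illustrative Python check on a toy sentence (not real data):

```python
sentence = "We analysed UK Biobank participants."
cohort = "UK Biobank"

# Case-insensitive fixed-string search, as in the R code
start_1based = sentence.lower().find(cohort.lower()) + 1   # 1-based inclusive
end_1based = start_1based + len(cohort) - 1                # 1-based inclusive

# Doccano wants 0-based start and exclusive end: subtract 1 from the
# start only and keep the 1-based inclusive end unchanged
span = [start_1based - 1, end_1based, "COHORT"]

# A 0-based, end-exclusive slice with those offsets recovers the name
assert sentence[span[0]:span[1]] == cohort
```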
# Join metadata onto sentences before passing to function
meta_data_df <- data.frame(
pubmed_id = as.character(pmids),
cohort = cohort,
date = date,
country = country
)
filtered_cohort_sentences_df <- left_join(
filtered_cohort_sentences_df,
meta_data_df,
by = "pubmed_id"
)
# Create JSON as a list
json_list <- convert_to_doccano_json_sentence_level(
cohort_sentences_df = filtered_cohort_sentences_df
)
# Write to JSON file
writeLines(toJSON(json_list,
auto_unbox = TRUE,
pretty = TRUE),
json_file)
json_data <- fromJSON(json_file,
simplifyVector = FALSE)
# Open connection to JSONL file
con <- file(jsonl_file, "w")
# Loop over each element (object) and write as one line
for (i in seq_along(json_data)) {
writeLines(toJSON(json_data[[i]], auto_unbox = TRUE), con)
}
# Close connection
close(con)
cat("JSONL saved to:", jsonl_file, "\n")
JSONL saved to: /Users/ibeasley/code/genomics_ancest_disease_dispar/output/doccano/abstracts_with_cohort_info.jsonl
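JSONL stores one self-contained JSON object per line, which is why the export loops over `json_data` and writes each element with a separate `writeLines()` call. A quick way to validate such a file is to parse each line independently; an illustrative Python sketch with toy records mimicking the exported fields (values are made up, not real results):

```python
import io
import json

# Toy records following the structure built by the R function above
records = [
    {"text": "We used UK Biobank data.", "pubmed_id": "123",
     "label": [[8, 18, "COHORT"]]},
    {"text": "No cohort mentioned here.", "pubmed_id": "456", "label": []},
]

# Write one compact JSON object per line (JSONL); a StringIO stands in
# for the output file
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Validate: every line must parse on its own, and each span's offsets
# must slice out the annotated cohort name
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```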
source spacyr_venv/bin/activate
python3 code/extract_text/spacy_obtain_sentences.py \
--input_dir output/fulltexts/methods_sections \
--output_dir output/fulltexts/methods_sentences
methods_sections <- list.files(here::here("output/fulltexts/methods_sentences/"),
pattern = "*.json",
full.names = TRUE
)
pmcids_with_methods <- methods_sections |>
gsub(pattern = ".*/", replacement = "") |>
gsub(pattern = "_methods_sentences\\.json$", replacement = "")
pubmeds_with_methods <- converted_ids |>
filter(pmcids %in% pmcids_with_methods) |>
pull(PMID)
pubmeds_with_methods <- c(pubmeds_with_methods,
grep("PMC", pmcids_with_methods, value = TRUE, invert = TRUE)
)
print("Number of papers with methods sections extracted:")
[1] "Number of papers with methods sections extracted:"
sum(pubmeds_with_methods %in% pmids)
[1] 212
pmids_methods_to_get <- pubmeds_with_methods[pubmeds_with_methods %in% pmids]
pmcids_methods_to_get <- converted_ids |>
filter(PMID %in% pmids_methods_to_get) |>
pull(pmcids)
ids_methods_to_get <- c(pmids_methods_to_get, pmcids_methods_to_get)
ids_methods_to_get <- ids_methods_to_get[ids_methods_to_get != ""]
files_to_get <-
paste0(ids_methods_to_get,
"_methods_sentences.json") |>
sort()
methods_sections_to_get <- methods_sections[basename(methods_sections) %in% files_to_get]
methods_sections_to_get <- basename(methods_sections_to_get) |>
gsub(pattern = "_methods_sentences\\.json$", replacement = "")
# read in methods sections for available papers
methods_texts <- sapply(methods_sections_to_get,
                        function(id) {
                          file <- here::here(paste0("output/fulltexts/methods_sentences/",
                                                    id,
                                                    "_methods_sentences.json"))
                          json_data <- fromJSON(file)
                          # drop empty sentences and return the rest
                          json_data[json_data != ""]
                        }
                        )
# convert pmcids to pubmeds
names(methods_texts) <- sapply(names(methods_texts), function(id) {
if(grepl("^PMC", id)) {
pmcid <- id
pubmed_id <- converted_ids$PMID[converted_ids$pmcids == pmcid]
if(length(pubmed_id) > 0) {
return(as.character(pubmed_id))
} else {
return(NA)
}
} else {
return(id)
}
})
# order methods texts by pmids
methods_texts <- methods_texts[order(names(methods_texts))]
cohort_sentences_words <-
methods_cohort_sentences_df |>
filter(has_cohort) |>
pull(sentence) |>
tokenizers::tokenize_words() |>
unlist()
cohort_sentence_words_df <-
data.frame(word = cohort_sentences_words) |>
filter(!(tolower(word) %in% removal_words)) |>
group_by(word) |>
summarise(n_in_cohort = n()) |>
filter(n_in_cohort > 5)
non_cohort_sentences_words <-
methods_cohort_sentences_df |>
filter(!has_cohort) |>
pull(sentence) |>
tokenizers::tokenize_words() |>
unlist()
# sample down non-cohort sentences words
set.seed(500)
non_cohort_sentences_words_sample <- sample(non_cohort_sentences_words,
size = length(cohort_sentences_words),
replace = FALSE)
non_cohort_sentence_words_df <-
data.frame(word = non_cohort_sentences_words_sample) |>
filter(!(tolower(word) %in% removal_words)) |>
group_by(word) |>
summarise(n_not_cohort = n())
print("Words more common in sentences with cohort names than sentences without cohort names:")
[1] "Words more common in sentences with cohort names than sentences without cohort names:"
left_join(cohort_sentence_words_df,
non_cohort_sentence_words_df,
by = "word"
) |>
mutate(n_not_cohort = ifelse(is.na(n_not_cohort), 0, n_not_cohort)) |>
mutate(delta_n = n_in_cohort - n_not_cohort) |>
arrange(desc(delta_n)) |>
head(20)
# A tibble: 20 × 4
word n_in_cohort n_not_cohort delta_n
<chr> <int> <dbl> <dbl>
1 study 523 189 334
2 biobank 300 6 294
3 uk 289 12 277
4 data 503 279 224
5 1000 184 7 177
6 genomes 177 3 174
7 project 165 12 153
8 cohort 212 67 145
9 reference 190 45 145
10 ukb 145 0 145
11 consortium 134 7 127
12 participants 194 82 112
13 cases 240 132 108
14 european 153 46 107
15 phase 118 13 105
16 panel 115 26 89
17 samples 216 129 87
18 individuals 193 109 84
19 ukbb 84 0 84
20 studies 149 68 81
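The `delta_n` column is simple counter arithmetic: per-word counts in cohort sentences minus counts in the downsampled non-cohort sentences, with missing words treated as zero (the `ifelse(is.na(...), 0, ...)` step). The same logic in Python on toy token lists (illustrative only):

```python
from collections import Counter

# Toy tokens from cohort-name sentences and non-cohort sentences
cohort_words = ["biobank", "uk", "study", "data", "study"]
non_cohort_words = ["data", "data", "study"]

in_cohort = Counter(cohort_words)
not_cohort = Counter(non_cohort_words)

# n_in_cohort - n_not_cohort per word; Counter lookups return 0 for
# missing words, mirroring the NA-to-zero replacement in the R code
delta = {w: in_cohort[w] - not_cohort[w] for w in in_cohort}

# Sorting delta by value descending reproduces the arrange(desc(delta_n)) view
```

Negative deltas (words more common in non-cohort sentences) are possible here just as in the R output.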
control_methods_cohort_sentences_df <-
methods_cohort_sentences_df |>
filter(COHORT == "" & CONTEXT == "") |>
filter(grepl(paste(sample_terms, collapse = "|"),
sentence,
ignore.case = TRUE)
)
print("Number of sentences without (detected) cohort names or cohort-context but with sample-related context:")
[1] "Number of sentences without (detected) cohort names or cohort-context but with sample-related context:"
nrow(control_methods_cohort_sentences_df)
[1] 4358
filtered_methods_cohort_sentences_df <-
bind_rows(filtered_methods_cohort_sentences_df,
control_methods_cohort_sentences_df
) |>
distinct()
print("Percentage of sentences with cohort names vs other cohort-related context")
[1] "Percentage of sentences with cohort names vs other cohort-related context"
total_n_sentences <- nrow(filtered_methods_cohort_sentences_df)
filtered_methods_cohort_sentences_df |>
group_by(has_cohort) |>
summarise(n = n(),
percentage = round(100 * n/total_n_sentences, digits = 2)
)
# A tibble: 2 × 3
has_cohort n percentage
<lgl> <int> <dbl>
1 FALSE 5510 74.6
2 TRUE 1878 25.4
# file path for intermediate json file output
json_file <- here::here("output/doccano/methods_with_cohort_info.json")
# file path for final jsonl file output
jsonl_file <- here::here("output/doccano/methods_with_cohort_info.jsonl")
filtered_methods_cohort_sentences_df <- left_join(
filtered_methods_cohort_sentences_df,
meta_data_df,
by = "pubmed_id"
)
# Create JSON as a list
json_list <- convert_to_doccano_json_sentence_level(
cohort_sentences_df = filtered_methods_cohort_sentences_df
)
# Write to JSON file
writeLines(toJSON(json_list,
auto_unbox = TRUE,
pretty = TRUE),
json_file)
json_data <- fromJSON(json_file,
simplifyVector = FALSE)
# Open connection to JSONL file
con <- file(jsonl_file, "w")
# Loop over each element (object) and write as one line
for (i in seq_along(json_data)) {
writeLines(toJSON(json_data[[i]], auto_unbox = TRUE), con)
}
# Close connection
close(con)
cat("JSONL saved to:", jsonl_file, "\n")
JSONL saved to: /Users/ibeasley/code/genomics_ancest_disease_dispar/output/doccano/methods_with_cohort_info.jsonl
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.7.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] tokenizers_0.3.0 jsonlite_2.0.0 xml2_1.4.0 rentrez_1.2.4
[5] httr_1.4.7 stringi_1.8.7 dplyr_1.1.4 readxl_1.4.5
[9] stringr_1.6.0 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] sass_0.4.10 utf8_1.2.6 generics_0.1.4
[4] tidyr_1.3.1 renv_1.0.3 digest_0.6.37
[7] magrittr_2.0.4 evaluate_1.0.5 fastmap_1.2.0
[10] cellranger_1.1.0 rprojroot_2.1.0 processx_3.8.6
[13] whisker_0.4.1 ps_1.9.1 promises_1.3.3
[16] BiocManager_1.30.26 purrr_1.1.0 XML_3.99-0.19
[19] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6
[22] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
[25] tools_4.3.1 httpuv_1.6.16 here_1.0.1
[28] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
[31] git2r_0.36.2 fs_1.6.6 pkgconfig_2.0.3
[34] callr_3.7.6 pillar_1.11.1 bslib_0.9.0
[37] later_1.4.4 glue_1.8.0 data.table_1.17.8
[40] Rcpp_1.1.0 xfun_0.55 tibble_3.3.0
[43] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50
[46] htmltools_0.5.8.1 SnowballC_0.7.1 rmarkdown_2.30
[49] compiler_4.3.1 getPass_0.2-4