tabulizer for parsing block-text from .pdf

rOpenSci package or resource used*

URL or code snippet for your use case*

Goal: extract certain block of data from different sections of the .pdf

Strategy: use split-apply-combine approach via locate_areas() (to get box coordinates of sections of interest) and then extract_tables() to get the data with the section.

locate_areas() demo tabulizer-demo|601x500

Code snippet

# |- fxn ----
# data munge function for map() later
CleanHeader <- function(tbl) {
  tbl %>%
    tidyr::pivot_longer(
      cols = -idx
    ) %>%
    dplyr::group_by(name) %>%
    dplyr::filter(value != "") %>%
    dplyr::summarise(value = paste0(value, collapse = ":")) %>%
    tidyr::separate(value, c("cat", "data"), sep = ":", extra = "merge") %>%
    dplyr::mutate_at(vars(data), stringr::str_remove, ":") %>% 
    dplyr::select(cat, data)
}

# |- data ----
file <- "./ropensci_white.pdf"

# get the box coordinates via interactive selection; this info is then used in extract_tables() area args
locate_areas(file, pages = 1)

# extract data
header_raw <- extract_tables(f, pages = 1,
                             area = list(c(74, 88, 128, 522)),
                             guess = FALSE)

# data munge; data/application specific
header_raw[[1]] %>% 
  tibble::as_tibble() %>% 
  janitor::remove_empty("cols") %>% 
  dplyr::mutate(idx = cumsum(stringr::str_detect(V1, "Section Marker"))) %>% 
  dlpyr::group_split(idx) %>% 
  purrr::map_dfr(CleanHeader)

Image

Sample .pdf where coloured boxes represent sections of interest to extract data from.

See follow up post below for picture (new users may only post 1 image/post 😅)

Sector

Energy.

Field(s) of application

Energy.

What did you do?

Use {tabulizer} from rOpenSci to extract data of interest from .pdf to save time and avoid data quality issue that may be introduced if it was done manually.

If you reported this use case and want to modify or remove it, use the "Suggest an edit" link in the sidebar to open a GitHub pull request, or contact us!