Pdftools 2.0: powerful pdf text extraction tools
A new version of pdftools has been released to CRAN. Go get it while it’s hot:
    install.packages("pdftools")
This version has two major improvements: low level text extraction and encoding improvements.
About PDF textboxes
A pdf document may seem to contain paragraphs or tables in a viewer, but this is not actually true. PDF is a printing format: a page consists of a series of unrelated lines, bitmaps, and textboxes with a given size, position and content. Hence a table in a pdf file is really just a large unordered set of lines and words that are nicely visually positioned. This makes sense for printing, but makes extracting text or data from a pdf file extremely difficult.
Because the pdf format has little semantic structure, the pdf_text() function in pdftools has to render the PDF to a text canvas in order to create the sentences or paragraphs. It does so pretty well, but some users have asked for something more low level.
"@opencpu any chance pdftext could return a data frame with one row for each text box with columns giving page, positions, and contents?" (Hadley Wickham, @hadleywickham, December 26, 2017)
Unfortunately this was not trivial because it required some work in the underlying poppler library. One year later, this functionality is finally available in the upcoming poppler version 0.73. The pdftools CRAN binary packages for Windows and macOS already contain a suitable libpoppler. Linux users, however, probably have to wait for the latest version of poppler to become available in their system package manager (or compile it from source).
Low-level text extraction
We use an example pdf file from the rOpenSci tabulizer package. This file contains a few standard datasets which have been printed as a pdf table. First let’s try the pdf_text() function, which returns a character vector of length equal to the number of pages in the file.
    library(pdftools)
    pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
    txt <- pdf_text(pdf_file)
    cat(txt)
                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ...
pdf_text() converts all text on a page to one large string, which works pretty well. However, if you want to parse this text into a data frame (using e.g. read.table) you run into a problem: the first column contains spaces within the values. Therefore we can’t use whitespace as the column delimiter (as is the default in read.table).
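To make the delimiter problem concrete, here is a small sketch that counts whitespace-separated fields in three rows. The row values are copied from the output above but truncated to the first three numeric columns, and the variable names are only illustrative.

```r
# Three rows from the printed table, cut down to the name plus three columns.
rows <- c(
  "Mazda RX4 21.0 6 160.0",
  "Mazda RX4 Wag 21.0 6 160.0",
  "Valiant 18.1 6 225.0"
)

# Count fields per line using whitespace as the separator,
# which is the same default that read.table uses.
con <- textConnection(rows)
fields <- count.fields(con)
close(con)

print(fields)
# 5 6 4
```

Every row holds one name and three numbers, yet the field counts differ because the name itself contains a varying number of spaces. A whitespace-delimited parser cannot tell where the first column ends.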
Hence to write a proper pdf table extractor, we have to infer the column from the physical position of the textbox, rather than rely on delimiting characters. The new pdf_data() function provides exactly this. For each page it returns a data frame with all textboxes on that page, including their width, height, and (x,y) position:
    # All textboxes on page 1
    pdf_data(pdf_file)[[1]]
    # A tibble: 430 x 6
       width height     x     y space text
       <int>  <int> <int> <int> <lgl> <chr>
     1    29      8   154   139 TRUE  Mazda
     2    19      8   187   139 FALSE RX4
     3    29      8   154   151 TRUE  Mazda
     4    19      8   187   151 TRUE  RX4
     5    19      8   210   151 FALSE Wag
     6    31      8   154   163 TRUE  Datsun
     7    14      8   189   163 FALSE 710
     8    30      8   154   176 TRUE  Hornet
     9     4      8   188   176 TRUE  4
    10    23      8   196   176 FALSE Drive
Converting this pdf data into the original data frame is left as an exercise for the reader :)
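As a possible starting point for that exercise, the sketch below rebuilds the row labels from the ten textboxes printed above. It uses a hand-typed copy of those rows instead of calling pdftools, and boxes_to_cells() is a hypothetical helper, not part of the package. It rests on two assumptions that may not hold for other files: boxes on the same line share exactly the same y value, and a box with space equal to FALSE ends a cell.

```r
# Hand-typed copy of the ten textboxes shown above (not read from the pdf).
boxes <- data.frame(
  width  = c(29, 19, 29, 19, 19, 31, 14, 30, 4, 23),
  height = rep(8, 10),
  x      = c(154, 187, 154, 187, 210, 154, 189, 154, 188, 196),
  y      = c(139, 139, 151, 151, 151, 163, 163, 176, 176, 176),
  space  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  text   = c("Mazda", "RX4", "Mazda", "RX4", "Wag", "Datsun", "710",
             "Hornet", "4", "Drive"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group textboxes into lines and cells.
# Assumption: boxes with an identical y value belong to the same line.
# Assumption: a box with space == FALSE is the last word of its cell.
boxes_to_cells <- function(boxes) {
  boxes <- boxes[order(boxes$y, boxes$x), ]
  lines <- split(boxes, boxes$y)
  lapply(lines, function(line) {
    # A cell starts at the first box and after every box not followed by a space
    starts <- c(TRUE, !head(line$space, -1))
    cell_id <- cumsum(starts)
    as.vector(tapply(line$text, cell_id, paste, collapse = " "))
  })
}

cells <- boxes_to_cells(boxes)

# The first cell of every line is the row label.
row_names <- vapply(cells, function(cell) cell[1], character(1))
print(unname(row_names))
# "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
```

For a real document the y values of one line may differ by a pixel or two, so some tolerance (for example rounding or clustering y) would likely be needed, and the numeric columns would still have to be assigned to columns based on their x positions.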
Encoding improvements
Apart from the new pdf_data() function, this release also fixes a few smaller problems related to text encoding, both in the pdftools package and in the underlying poppler library. The main issue was a bug related to mixing encodings, involving UTF-16LE, which is not something you ever want to worry about.
For most well-behaved pdf files there was no problem, but some files using rare encodings could yield an "Embedded NUL in string" error for metadata, or garbled title fields. If you encountered any of these problems in the past, please update pdftools and try again!
Other rOpenSci PDF packages
Besides pdftools we have two other packages that may be helpful to extract data from PDF files:
- The tesseract package provides R bindings to the Google Tesseract OCR C++ library. This allows for detecting text from scanned images.
- The tabulizer package provides R bindings to the Tabula Java library, which can also be used to extract tables from PDF documents. Note that this requires a Java installation.
Using rOpenSci packages? Tell us about your use case and how you make use of our software!