Pdftools 2.0: powerful pdf text extraction tools

doi:https://doi.org/10.59350/gnw3b-3wn92

Friday, December 14, 2018 From rOpenSci (https://ropensci.org/blog/2018/12/14/pdftools-20/). Except where otherwise noted, content on this site is licensed under the CC-BY license.

Pdftools 2.0: powerful pdf text extraction tools

By Jeroen Ooms

A new version of pdftools has been released to CRAN. Go get it while it’s hot:

install.packages("pdftools")

This version has two major improvements: low level text extraction and encoding improvements.

🔗 About PDF textboxes

A pdf document may seem to contain paragraphs or tables in a viewer, but this is not actually true. PDF is a printing format: a page consists of a series of unrelated lines, bitmaps, and textboxes with a given size, position and content. Hence a table in a pdf file is really just a large unordered set of lines and words that are nicely visually positioned. This makes sense for printing, but makes extracting text or data from a pdf file extremely difficult.

Because the pdf format has little semantic structure, the pdf_text() function in pdftools has to render the PDF to a text canvas, in order to create the sentences or paragraphs. It does so pretty well, but some users have asked for something more low level.

Unfortunately this was not trivial because it required some work in the underlying poppler library. One year later, and this functionality is now finally available in the upcoming poppler version 0.73. The pdftools CRAN binary packages for Windows and MacOS already contain a suitable libpoppler, however Linux users probably have to wait for the latest version of poppler to become available in their system package manager (or compile from source).

🔗 Low-level text extraction

We use an example pdf file from the rOpenSci tabulizer package. This file contains a few standard datasets which have been printed as a pdf table. First let’s try the pdf_text() function, which returns a character vector of length equal to the number of pages in the file.

library(pdftools)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
txt <- pdf_text(pdf_file)
cat(txt[1])

                    mpg  cyl    disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6   160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6   160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4   108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6   258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8   360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6   225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8   360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4   146.7  62 3.69 3.190 20.00  1  0    4    2
...

Hence pdf_text() converts all text on a page to a large string, which works pretty well. However if you would want to parse this text into a data frame (using e.g. read.table) you run into a problem: the first column contains spaces within the values. Therefore we can’t use the whitespace as the column delimiter (as is the default in read.table).

Hence to write a proper pdf table extractor, we have to infer the column from the physical position of the textbox, rather than rely on delimiting characters. The new pdf_data() provides exactly this. It returns a data frame with all textboxes in a page, including their width, height, and (x,y) position:

# All textboxes on page 1
pdf_data(pdf_file)[[1]]

# A tibble: 430 x 6
   width height     x     y space text  
   <int>  <int> <int> <int> <lgl> <chr> 
 1    29      8   154   139 TRUE  Mazda 
 2    19      8   187   139 FALSE RX4   
 3    29      8   154   151 TRUE  Mazda 
 4    19      8   187   151 TRUE  RX4   
 5    19      8   210   151 FALSE Wag   
 6    31      8   154   163 TRUE  Datsun
 7    14      8   189   163 FALSE 710   
 8    30      8   154   176 TRUE  Hornet
 9     4      8   188   176 TRUE  4     
10    23      8   196   176 FALSE Drive

Converting this pdf data into the original data frame is left as an exercise for the reader :)

🔗 Encoding enhancements

Apart from the new pdf_data() function, this release also fixes a few smaller problems related to text encoding, both in pdftools package as well as the underlying poppler library. The main issue was a bug related mixing UTF-16BE and UTF-16LE, which is not something you ever want to worry about.

For most well behaved pdf files there was no problem, but some files using rare encoding could yield an error Embedded NUL in string for metadata, or garbled author or title fields. If you encountered any of these problems in the past, please update your pdftools and try again!

🔗 Other rOpenSci PDF packages

Besides pdftools we have two other packages that may be helpful to extract data from PDF files:

The tesseract package provides R bindings to the Google Tesseract OCR C++ library. This allows for detecting text from scanned images.
The tabulizer package provides R bindings to the Tabula java library, which can also be used to extract tables from PDF documents. Note this requires you have a Java installation.

Using rOpenSci packages? Tell us about your use case and how you make use of our software!

Filed under

Suggest an edit Open a pull request

Cite
doi.org/10.59350/gnw3b-3wn92

Browse

Pdftools 2.0: powerful pdf text extraction tools

🔗 About PDF textboxes

🔗 Low-level text extraction

🔗 Encoding enhancements

🔗 Other rOpenSci PDF packages

Our Newsletter