April 24, 2019 From rOpenSci (https://ropensci.org/technotes/2019/04/24/pdftools-22/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the
pdf_data() function which was introduced in pdftools 2.0 is now available on all major systems.
It is now possible to split, join, and compress pdf files with pdftools. For example the
pdf_subset() function creates a new pdf file with a selection of the pages from the input file:
# Load pdftools library(pdftools) # extract some pages pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf', pages = 1:3, output = "subset.pdf") # Should say 3 pdf_length("subset.pdf")
pdf_combine() is used to join several pdf files into one.
# Generate another pdf pdf("test.pdf") plot(mtcars) dev.off() # Combine them with the other one pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf") # Should say 4 pdf_length("joined.pdf")
The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.
The pdftools 2.0 announcement post from December introduced the new
pdf_data() function for extracting individual text boxes from pdf files. However it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.
I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use
pdf_data on Ubuntu LTS servers, including Travis:
sudo add-apt-repository ppa:cran/poppler sudo apt-get update sudo apt-get install libpoppler-cpp-dev
Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.
Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so
pdf_data() will work on Debian (and hence CRAN) as well!