Tesseract and Magick: High Quality OCR in R

  Jeroen Ooms   | AUGUST 17, 2017

Last week we released an update of the tesseract package to CRAN. This package provides R bindings to Google’s OCR library Tesseract.

install.packages("tesseract")

The new version ships with the latest libtesseract 3.05.01 on Windows and MacOS. Furthermore it includes enhancements for managing language data and using tesseract together with the magick package.

Installing Language Data

The new version has several improvements for installing additional language data. On Windows and MacOS you use the tesseract_download() function to install additional languages:

tesseract_download("fra")

Language data are now stored in rappdirs::user_data_dir('tesseract') which makes it persist across updates of the package. To OCR french text:

french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)

Très Bien! Note that on Linux you should not use tesseract_download but instead install languages using apt-get (e.g. tesseract-ocr-fra) or yum (e.g. tesseract-langpack-fra).

Tesseract and Magick

The tesseract developers recommend to clean up the image before OCR’ing it to improve the quality of the output. This involves things like cropping out the text area, rescaling, increasing contrast, etc.

The rOpenSci magick package is perfectly suitable for this task. The latest version contains a convenient wrapper image_ocr() that works with pipes.

devtools::install_github("ropensci/magick")

Let’s give it a try on some example scans:

example

# Requires devel version of magick
# devtools::install_github("ropensci/magick")

# Test it
library(magick)
library(magrittr)

text <- image_read("https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/bowers.jpg") %>%
  image_resize("2000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()

cat(text)
The Llfe and Work of
Fredson Bowers
by
G. THOMAS TANSELLE

N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE ACCOM-
plishment and influence cause them to be the symbols of their age;
their careers and oeuvres become the touchstones by which the
field is measured and its history told. In the related pursuits of
analytical and descriptive bibliography, textual criticism, and scholarly
editing, Fredson Bowers was such a figure, dominating the four decades
after 1949, when his Principles of Bibliographical Description was pub-
lished. By 1973 the period was already being called “the age of Bowers”:
in that year Norman Sanders, writing the chapter on textual scholarship
for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to
a section of his essay. For most people, it would be achievement enough
to rise to such a position in a field as complex as Shakespearean textual
studies; but Bowers played an equally important role in other areas.
Editors of ninetcemh-cemury American authors, for example, would
also have to call the recent past “the age of Bowers," as would the writers
of descriptive bibliographies of authors and presses. His ubiquity in
the broad field of bibliographical and textual study, his seemingly com-
plete possession of it, distinguished him from his illustrious predeces-
sors and made him the personification of bibliographical scholarship in

his time.

\Vhen in 1969 Bowers was awarded the Gold Medal of the Biblio-
graphical Society in London, John Carter’s citation referred to the
Principles as “majestic," called Bowers's current projects “formidable,"
said that he had “imposed critical discipline" on the texts of several
authors, described Studies in Bibliography as a “great and continuing
achievement," and included among his characteristics "uncompromising
seriousness of purpose” and “professional intensity." Bowers was not
unaccustomed to such encomia, but he had also experienced his share of
attacks: his scholarly positions were not universally popular, and he
expressed them with an aggressiveness that almost seemed calculated to

Not bad but not perfect. Can you do a better job?