Tesseract 4 is here! State of the art OCR in R!

November 6, 2018

By:   Jeroen Ooms

Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki: “Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.”

Community Call - Working with images in R

October 24, 2018

By:   Stefanie Butland

rOpenSci’s software engineer / postdoc Jeroen Ooms will explain what images are, under the hood, and showcase several rOpenSci packages that form a modern toolkit for working with images in R, including opencv, av, tesseract, magick and pdftools.

🕘 Thursday, November 15, 2018, 10-11AM PST; 7-8PM CET (find your timezone)

☎️ Find all details for joining the call on our Community Calls page. Everyone is welcome. No RSVP needed.

What's this bird? Classify old natural history drawings with R

August 28, 2018

By:   Maëlle Salmon

In this new post, we’re taking a break from modern birding data in our birder’s series… let’s explore gorgeous drawings from a natural history collection! Armed with rOpenSci’s packages binding powerful C++ libraries and open taxonomy data, how much information can we automatically extract from images? Maybe not much, but we’ll at least have explored image manipulation, optical character recognition (OCR), language detection, and taxonomic name resolution with rOpenSci’s packages.

Lessons Learned from rtika, a Digital Babel Fish

April 25, 2018

By:   Sasha Goodman

The Apache Tika parser is like the Babel fish in Douglas Adams’s book, “The Hitchhiker’s Guide to the Galaxy”. The Babel fish translates any natural language to any other. Although Tika does not yet translate natural language, it starts to tame the Tower of Babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows an analyst to extract text and objects from Microsoft Word.

Support for hOCR and Tesseract 4 in R

February 14, 2018

By:   Jeroen Ooms

Earlier this month we released a new version of the tesseract package to CRAN. This package provides R bindings to Google’s open source optical character recognition (OCR) engine Tesseract. Two major new features are support for hOCR and support for the upcoming Tesseract 4.

hOCR output

Support for hOCR output was requested by one of our users on GitHub. The ocr() function gains a parameter HOCR which allows for returning results in hOCR format:

