Tesseract Update: Options and Languages
Installing Training Data
As explained in the first post, the Tesseract system is powered by language-specific training data. By default, only English training data is installed. Version 1.3 adds utilities that make it easier to install additional training data.
# Download French training data
tesseract_download("fra")
Note that this function is not needed on Linux, where you should install training data via your system package manager instead. For example, on Debian/Ubuntu:
sudo apt-get install tesseract-ocr-fra
And on Fedora/CentOS:
sudo yum install tesseract-langpack-fra
Use tesseract_info() to see which training data are currently installed.
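As a rough sketch of how these pieces fit together, the snippet below works out which of the languages you need are not yet installed before downloading anything. It assumes that tesseract_info() returns a list whose `available` element names the installed languages. The helper `missing_languages()` is a hypothetical name introduced here for illustration and is not part of the package.

```r
# Hypothetical helper (not part of tesseract):
# returns the wanted language codes that are not in the installed set.
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

if (requireNamespace("tesseract", quietly = TRUE)) {
  # Assumption: the `available` element lists installed training data
  installed <- tesseract::tesseract_info()$available
  todo <- missing_languages(c("eng", "fra"), installed)
  print(todo)
  # On Windows or macOS you could then fetch each missing language:
  # for (lang in todo) tesseract::tesseract_download(lang)
}
```

On Linux the same check tells you which system packages are still missing.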
OCR Engine Parameters
Tesseract supports many parameters to fine-tune the OCR engine. For example, you can limit the set of characters that can be recognized.
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr("image.png", engine = engine)
In the example above, Tesseract will only consider numeric characters. If you know in advance that the data is numeric (for example, an accounting spreadsheet), such options can tremendously improve accuracy.
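Real accounting data usually contains more than bare digits, so the whitelist may need a few extra characters. The following is a minimal sketch of building such a whitelist programmatically. The choice of extra characters (decimal point, comma, minus sign) is an assumption for illustration, so adjust it to your own data.

```r
# Digits plus a few separators often seen in financial figures
# (this particular set is an illustrative assumption)
whitelist <- paste(c(0:9, ".", ",", "-"), collapse = "")

if (requireNamespace("tesseract", quietly = TRUE)) {
  # Same option as in the example above, with the wider whitelist
  engine <- tesseract::tesseract(
    options = list(tessedit_char_whitelist = whitelist)
  )
  # Then run OCR as before, e.g.:
  # tesseract::ocr("image.png", engine = engine)
}
```

Keeping the whitelist as small as the data allows is what gives the engine fewer ways to misread a character.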
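library(magick)
library(tesseract)

# Read the image with magick and crop it to the region containing the text
image <- image_read("http://jeroen.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")

# Pass the magick image directly to ocr()
cat(ocr(image))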
We plan to add more integration between Magick and Tesseract in future versions.