rOpenSci tech notes

Tesseract and Magick: High Quality OCR in R

Jeroen Ooms — August 17, 2017
Last week we released an update of the tesseract package to CRAN. This package provides R bindings to Google's OCR library Tesseract. install.packages("tesseract") The new version ships with the latest libtesseract 3.05.01 on Windows and macOS. Furthermore, it includes enhancements for managing language data and using tesseract together with the magick package. Installing Language Data The new version has several improvements for installing additional language data. On Windows and macOS you use the tesseract_download() function...

elastic - Elasticsearch for R

Scott Chamberlain — August 2, 2017
elastic is an R client for Elasticsearch. elastic has been around since 2013, with the first commit in November 2013. Sidebar: 'elastic' was picked as the package name before the company now known as Elastic changed its name to Elastic. What is Elasticsearch? If you aren't familiar with Elasticsearch, it is a distributed, RESTful search and analytics engine. It's similar to Solr. It falls in the NoSQL bin of databases, holding data in JSON...

All the fake data that's fit to print

Scott Chamberlain — June 22, 2017
charlatan makes fake data. Excited to announce a new package called charlatan. While perusing packages from other programming languages, I saw a neat Python library called faker. charlatan is inspired by and ports many things from Python's https://github.com/joke2k/faker library. In turn, faker was inspired by PHP's faker, Perl's Faker, and Ruby's faker. It appears that the PHP library was the original - nice work PHP. Use cases What could you do with this package? Here's...

Random GeoJSON and WKT with randgeo

Scott Chamberlain, Noam Ross — April 20, 2017
randgeo generates random points and shapes in GeoJSON and WKT formats for use in examples, teaching, or statistical applications. Points and shapes are generated in the long/lat coordinate system and with appropriate spherical geometry; random points are distributed evenly across the globe, and random shapes are sized according to a maximum great-circle distance from the center of the shape. randgeo was adapted from https://github.com/tmcw/geojson-random to have a pure R implementation without any dependencies as well...

ccafs - client for CCAFS General Circulation Models data

Scott Chamberlain — March 1, 2017
I've recently released the new package ccafs, which provides access to Climate Change, Agriculture and Food Security (CCAFS; http://ccafs-climate.org/) General Circulation Models (GCM) data. GCMs are a particular type of climate model, used for weather forecasting and climate change forecasting - read more at https://en.wikipedia.org/wiki/General_circulation_model. ccafs falls in the data client camp - its focus is on getting users data - many rOpenSci packages fall into this area. These kinds of packages are...

Package evolution - changing stuff in your package

Scott Chamberlain — January 5, 2017
Making packages is a great way to organize R code, whether it’s a set of scripts for personal use, a set of functions for internal company use or a lab group, or to distribute your new cool framework foobar to the masses. There are a number of guides to writing packages, including http://r-pkgs.had.co.nz/. As you develop packages, there are a number of issues that don't often get much air time. I'll cover some of them here. Philosophy...

Update jsonlite 1.2

Jeroen Ooms — January 4, 2017
A new version of the jsonlite package has been released to CRAN. This is a maintenance release with enhancements and bug fixes. A summary of changes in v1.2 from the NEWS file:
Add read_json and write_json convenience wrappers, #161
Update modp_numtoa from upstream, fixes a rounding issue in #148
Ensure asJSON.POSIXt does not use sci notation for negative values, #155
Tweak num_to_char to properly print large negative numbers
Performance optimization for simplifying data frames (see below)
Use the Github...

finch - parse Darwin Core files

Scott Chamberlain — December 23, 2016
finch has just been released to CRAN (binaries should be up soon). finch is a package to parse Darwin Core files. Darwin Core (DwC) is: a body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their...

Announcing pdftools 1.0

Jeroen Ooms — December 9, 2016
This week we released version 1.0 of the ropensci pdftools package to CRAN. Pdftools provides utilities for extracting text, fonts, attachments and other data from PDF files. It also supports rendering of PDF files into bitmap images. This release has a few internal enhancements and fixes an annoying bug for landscape PDF pages. The version bump to 1.0 signifies that the package has undergone sufficient testing and the API is stable. Extracting Text As described...

Tesseract Update: Options and Languages

Jeroen Ooms — December 8, 2016
A few weeks ago we announced the first release of the tesseract package: a high quality OCR engine in R. We have now released an update with extra features. Installing Training Data As explained in the first post, the tesseract system is powered by language specific training data. By default only English training data is installed. Version 1.3 adds utilities to make it easier to install additional training data. # Download French training data tesseract_download("fra")...

fauxpas - HTTP conditions package

Scott Chamberlain — November 18, 2016
HTTP, or Hypertext Transfer Protocol, is the protocol by which most of us interact with the web. When we make requests to a website in a browser on desktop or mobile, or get some data from a server in R, all of that is using HTTP. HTTP has a rich suite of status codes describing different HTTP conditions, ranging from success to various client errors to server errors. R has a few HTTP client libraries...

crul - an HTTP client

Scott Chamberlain — November 9, 2016
A new package crul is on CRAN. crul is another HTTP client for R, but it is simpler than httr and is being built to link closely with webmockr and vcr. webmockr and vcr are packages ported from Ruby's webmock and vcr, respectively. They both make mocking HTTP requests really easy. A major use case for mocking HTTP requests is unit testing. Nearly all the packages I work on personally make HTTP requests...

Parse NOAA Integrated Surface Data Files

Scott Chamberlain — November 3, 2016
A new package isdparser is on CRAN. isdparser was in part liberated from rnoaa, then improved. We'll use isdparser in rnoaa soon. isdparser does not download files for you from NOAA's FTP servers. The package focuses on parsing the files, which are variable-length ASCII strings stored line by line, where each line has some mandatory data and any amount of optional data. The data is great, and includes, for example, wind speed and direction,...

Encryption and Digital Signatures in R using GPG

Jeroen Ooms — October 19, 2016
A new package gpg has appeared on CRAN. From the package description: Bindings to GnuPG for working with OpenGPG (RFC4880) cryptographic methods. Includes utilities for public key encryption, creating and verifying digital signatures, and managing your local keyring. Note that some functionality depends on the version of GnuPG that is installed on the system. In particular GnuPG 2 mandates the use of 'gpg-agent' for entering passphrases, which only works if R runs in a terminal...

Get air quality data for the United Kingdom using the rdefra package

Claudia Vitolo — October 6, 2016
Whether you are an environmental scientist, a pollution expert or just concerned about the air you breathe when cycling in the United Kingdom, the ropensci rdefra package can help find the information you need. This package gives you access to the UK-AIR database, hosted by the Department for Environment, Food & Rural Affairs in the United Kingdom, directly from R. The database comprises hundreds of air quality monitoring sites and each provides time series of...

New package graphql: A GraphQL Query Parser

Jeroen Ooms — October 5, 2016
The new ropensci graphql package is now on CRAN. It implements R bindings to the libgraphqlparser C++ library to parse GraphQL syntax and export the syntax tree in JSON format: graphql2json("{ field(complex: { a: { b: [ $var ] } }) }") A syntax parser is perhaps not super useful to most end-users, but can be used to validate graphql queries or implement a GraphQL API in R. We hope to add more related functionality...

Hunspell 2.0: High-Performance Stemmer, Tokenizer, and Spell Checker for R

Jeroen Ooms — September 12, 2016
A new version of the ropensci hunspell package has been released to CRAN. Hunspell is the spell checker library used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, Opera, RStudio and many others. It provides a system for tokenizing, stemming and spelling in almost any language or alphabet. The R package exposes both the high-level spell-checker and the low-level stemmers and tokenizers which analyze or extract individual words from various formats (text,...

New in Magick 0.3

Jeroen Ooms — September 8, 2016
A new version of the ropensci magick package has been released to CRAN. Magick is a package for Advanced Image-Processing in R. It wraps the ImageMagick STL which is perhaps the most comprehensive open-source image processing library available today. Our original announcement has more details. New features This new version now includes a beautiful vignette which gives an overview of the main functionality to get you started! It lists the various formats, transformations, effects, operations...