There is no problem in science quite as frustrating as other people’s data. Whether it’s malformed spreadsheets, disorganized documents, proprietary file formats, data without metadata, or any other data scenario created by someone else, scientists have taken to Twitter to complain about it. As a political scientist who regularly encounters so-called “open data” in PDFs, I find this problem particularly irritating. PDFs may have “portable” in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally....
Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.
I can (and do) go on and on about the treachery of messy/bad datasets. Though it’s substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail, either flagrantly (best case) or unnoticeably (considerably more probable and considerably more pernicious).
...Everybody talks about the weather, but nobody does anything about it. - Charles Dudley Warner
As a scientist who models plant diseases, I use a lot of weather data. Often this data is not available for areas of interest. Previously, I worked with the International Rice Research Institute (IRRI), and often the countries I was working with did not have weather data available, or I was working on a large area covering several countries and needed a single source of data to work from. Other scientists who use crop biophysical models to model crop yields have similar weather data needs and may experience similar issues with data availability.
...camsRad is a lightweight R client for the CAMS Radiation Service, which provides satellite-based time series of solar irradiation for the actual weather conditions as well as for clear-sky conditions. Satellite-based solar irradiation data have been around roughly as long as our modern-era satellites. But the price tag has been very high, in the range of several thousand euros per site. This has dampened research and development of downstream applications. With the CAMS Radiation Service coming online in 2016, this changed, as the services are provided under the (not yet fully implemented) European Union standpoint that data and services produced with public funding shall be provided on free and open grounds. The service is part of Copernicus, a European Union programme aimed at developing information services based on satellite earth observation and in situ data. All Copernicus information services are free and openly accessible....
After 2.5 years of development, version 1.0 of the mongolite package has been released to CRAN. The package is now stable, well documented, and will soon be submitted for peer review to be onboarded in the rOpenSci suite.
MongoDB in R and mongolite
I started working on mongolite in September 2014, and it was first announced at the rOpenSci unconf 2015. At that time, there were already two Mongo clients on CRAN: rmongodb (no longer works) and RMongo (depends on Java). However, I found both of them pretty clunky, and the MongoDB folks had just released 1.0 of their new C driver, so I decided to write a new client from scratch.
...