Monday, April 14, 2014. From rOpenSci (https://ropensci.org/blog/2014/04/14/webapis/).
We’ve received a number of questions from our users about dealing with the finer details of data sources on the web. Whether you’re reading data from local storage such as a csv file, a .Rdata
store, or possibly a proprietary file format, you’ve most likely run into some issues in the past. Common problems include passing incorrect paths, files being too big for memory, or requiring several packages to read files in incompatible formats. Reading data from the web entails a whole other set of challenges. Although there are many ways to obtain data from the web, this post primarily deals with retrieving data from Application Programming Interfaces also known as APIs.
To demystify some of the issues around getting data from the web in R, here's some background on the common issues we get questions about:
First, let's cover the types of APIs we work with. We prefer, and almost exclusively get data from providers via, REST APIs. REST stands for Representational State Transfer. This isn't a specific set of rules, but rather a set of principles that are followed. Requests are made against URLs, with parameters such as limit, page, etc. (see below). In short, REST APIs are nicer to build and use (in many cases).

Where are data actually stored? Data are stored in some sort of database, sometimes a relational column-row database such as a SQL database (MySQL, PostgreSQL, etc.), or perhaps a NoSQL database like CouchDB (NoSQL databases don't have a column-row setup, but store documents, like a text document, XML text, or even binary files). Nearly all of the data that can be obtained through one of our packages is likely stored in a relational database.
If the API supports search, data providers often implement it with an indexing tool like Solr or Elasticsearch, which gives much richer search over their data.
Databases can be hosted by the institutions serving the data, but most often they are located on some cloud service like Amazon Web Services.
Information providers differ in many ways, and one of them is what kind of information they return. Some provide data directly in response to the request. GBIF's API is a good example of this: if you open this URL http://api.gbif.org/v0.9/occurrence/search?taxonKey=1&limit=2 you will see JSON-formatted data in your browser window.
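Here's a minimal sketch of the same GBIF request made from R with the httr and jsonlite packages, just to show the plumbing (in practice our rgbif package wraps this kind of call for you):

```r
library(httr)
library(jsonlite)

res <- GET("http://api.gbif.org/v0.9/occurrence/search",
           query = list(taxonKey = 1, limit = 2))
stop_for_status(res)                        # error out if the request failed
dat <- fromJSON(content(res, as = "text"))  # parse the JSON into R objects
dat$count                                   # total number of matching records
```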
Other providers give back metadata in response to requests. Dryad is a good example of this. If you search for a dataset on Dryad, you get back metadata describing the data. Another call is required to get the data itself.
Another set of providers only has, and thus can only provide, metadata. For example, our rmetadata R package interacts with a number of APIs for scholarly metadata. These are metadata describing scholarly works, and don’t provide the content of the works they describe (e.g., full content of a journal article).
The issue of what is returned by the provider is separate from the above topic about the type of database: either kind of response can be served from a SQL (column-row) database or a NoSQL database.
Data providers often impose limits on the number of requests within a certain time frame to avoid slowing down the system, and to aid in supporting simultaneous data requests.
Given these limits, data providers also cap how many records can be returned in a single query. This parameter is usually called limit. The maximum number of records you can request at any given time is an arbitrary number set by the data provider. Because there is a limit, an additional parameter is often used, called offset (or start); offset is the record to start at in the returned data. Let's say a data provider you want data from has a maximum limit value of 1000, and you want 3000 records. To get them all, you'd have to make 3 different data requests, setting limit to 1000 for each request, and offset to 1 for the first request, 1001 for the second, and 2001 for the third.
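In R, that manual paging might look something like the following sketch. The endpoint api.example.org and the results field are made up for illustration; real providers differ in both. The idea is simply three requests with the same limit and an increasing offset:

```r
library(httr)
library(jsonlite)

base_url <- "http://api.example.org/records"   # hypothetical data provider
offsets  <- c(1, 1001, 2001)                   # starting record for each of the 3 calls

pages <- lapply(offsets, function(off) {
  res <- GET(base_url, query = list(limit = 1000, offset = off))
  stop_for_status(res)                         # fail early on a 4XX/5XX response
  fromJSON(content(res, as = "text"))$results  # the field holding records varies by provider
})
all_records <- do.call(rbind, pages)           # combine the three pages into one table
```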
Another approach uses a pair of parameters: page and page_size. For example, if you wanted 3000 records returned but the limit is 1000 records per page, you could set page_size to 1000 and page to 1, 2, and 3 on three separate requests. Some combination of the parameters limit, offset (or start), page, and page_size is available in many of the functions in rOpenSci R packages.
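The same three requests expressed with page-based parameters, continuing with the made-up endpoint from the sketch above, might look like:

```r
# Three requests of 1000 records each, selected by page number rather than offset.
pages <- lapply(1:3, function(p) {
  res <- GET("http://api.example.org/records",
             query = list(page_size = 1000, page = p))
  stop_for_status(res)
  fromJSON(content(res, as = "text"))$results
})
```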
In some functions in rOpenSci packages, we hide this detail from you and expose just the limit parameter, so that if you set limit = 15000 and the data provider allows a maximum of 1000, we figure out the calls needed and make them for you; many calls appear as one to the user. We haven't been consistent across our packages between handling this automatically and letting the user make multiple calls manually.
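Conceptually, such a wrapper might look like this sketch. Here get_page() is a stand-in for whatever single-request function a package uses internally; it just returns fake record IDs so the sketch is runnable:

```r
# Stand-in for a single API call that honors the provider's per-request cap.
get_page <- function(limit, offset) {
  data.frame(id = seq(offset, length.out = limit))
}

# Split a user-facing `limit` into however many capped requests are needed.
get_many <- function(limit, max_per_call = 1000) {
  starts <- seq(1, limit, by = max_per_call)         # first record of each call
  sizes  <- pmin(max_per_call, limit - starts + 1)   # records to ask for in each call
  pages  <- Map(get_page, limit = sizes, offset = starts)
  do.call(rbind, pages)                              # stitch the pages back together
}

nrow(get_many(15000))  # 15000 records, fetched as 15 requests of 1000 each
```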
Parameters can vary widely among data providers. The ones mentioned above, limit, page, and start, are relatively common. There is often a query parameter called query or simply q (Bond anyone?). If there is authentication (see below), there is often an api_key or key parameter. For searches of data that are geographically defined, there is often a geometry parameter (which often accepts Well-Known Text polygons), or bbox for a bounding box (which accepts NW, SE, SW, and NE points defining the box).

When you use the API of a data provider through an R package we have written, we may name parameters differently and match them up internally in our code, so be aware that the parameter you see in the R function is not necessarily the same as the parameter name in the API. For example, if the parameter defined in the API is theQuickBrownFoxJumpsOverTheLazyDog, we'd likely use a much shorter, easier-to-remember parameter in the function, like foxdog, as a shorthand.
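As an illustration of that renaming, here's a sketch of a function whose user-facing argument foxdog is translated to the (made-up) API parameter name when the query is built; the endpoint is also made up:

```r
# The short R argument `foxdog` maps to the provider's verbose parameter name.
search_provider <- function(foxdog, limit = 10) {
  args <- list(theQuickBrownFoxJumpsOverTheLazyDog = foxdog, limit = limit)
  res <- httr::GET("http://api.example.org/search", query = args)
  httr::stop_for_status(res)
  jsonlite::fromJSON(httr::content(res, as = "text"))
}
```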
HTTP codes are a system for signaling the status of an HTTP request with a three-digit number. APIs return such codes so that developers can quickly resolve a problem. We are starting to expose these HTTP codes in our R packages when something goes wrong, and mostly not showing them when the call worked as planned. Success is indicated by 2XX series codes, where 200 indicates a standard successful request, whereas 206 indicates that the provider is only giving partial content back. Here are the error codes you may get when something is wrong; it's good to be aware of them:
4XX series codes indicate a client error, or rather a user error. Common codes include 400 (bad request), 401 (unauthorized), 403 (forbidden), and 404 (not found).
5XX series codes indicate a server error, or a data provider error. Common ones include 500 (internal server error), 502 (bad gateway), 503 (service unavailable), and 504 (gateway timeout).
If you get any of the 4XX series errors using our R packages, check to make sure your query is correct, and if it is, get in touch with us as we may need to fix a bug in our code. If you get a 5XX series code, it suggests that something is wrong on the provider's end. If this problem persists, contact the data provider directly or let us know.
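If you want to check status codes yourself, httr makes them easy to inspect; here's a small sketch against the GBIF endpoint used earlier:

```r
library(httr)

res <- GET("http://api.gbif.org/v0.9/occurrence/search", query = list(limit = 2))
status_code(res)              # 200 on success; 4XX/5XX when something went wrong
if (status_code(res) >= 400) {
  stop("Request failed with HTTP status ", status_code(res))
}
# stop_for_status(res) performs the same check and raises an informative error.
```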
Most providers require authentication to limit access to paying customers, to impose limits on users, and/or simply to track data usage. These limits also prevent a few heavy users from monopolizing the resource.
Authentication usually happens in one of three ways:
Username/password combo: You usually pass these in as parameters to a function call. This is generally not a great way to do authentication, but sometimes data sources require this. In our experience, this method is rarely used.
API key: Pass in an alphanumeric key as a parameter in a function call.
OAuth: In R, a web page opens from your R session where you have to enter some personal information, or simply get redirected back to R saying you’re all set. This method is relatively easy, but is an extra step not required in the above two methods.
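For the first two methods, the credentials are just extra parameters on the request. Here's a sketch of passing an API key pulled from an environment variable (the variable name, endpoint, and parameter names are made up for illustration):

```r
# Reading the key from an environment variable (e.g. set in your .Renviron)
# keeps it out of your scripts and version control.
key <- Sys.getenv("EXAMPLE_API_KEY")
res <- httr::GET("http://api.example.org/search",
                 query = list(q = "Ursus americanus", api_key = key))
httr::stop_for_status(res)
```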
We hope this has been useful for those who weren't familiar with these topics previously. If you have questions about APIs, ask below, on Twitter, or email us.