Thursday, January 25, 2018 From rOpenSci (https://ropensci.org/blog/2018/01/25/nodbi/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
What is DBI? DBI is an R package. It defines an interface to relational database management systems (R/DBMS) that other R packages build upon to interact with a specific relational database, such as SQLite or PostgreSQL.
NoSQL databases are a very broad class of database that can include document databases such as CouchDB and MongoDB, key-value stores such as Redis, and more. They are generally not row-column relational stores though, though can include that. NoSQL is often thought of now as “not only SQL”.
You can imagine how it is relatively straight-forward to create a common interace to row-column oriented databases, and DBI is great for that.
However, a common interface to NoSQL datbases is a bit harder to wrap your head around for various reasons. One of the most obvious of which is that they don’t share a fundamental data structure like R/DMBS do.
nodbi (the no
for “N” and “O” in “NoSQL”) was started nearly 3 years ago during the 2015 rOpenSci Unconference on 26 March together with Rich FitzJohn and Jeroen Ooms. We’ve worked on it here and there, but it hasn’t seen any commits in nearly a year.
Given the success of dplyr it goes to reason that something that can abstract away the details for NoSQL databases - for a subset of the most common operations - could prove to be very useful.
nodbi takes a different approach than DBI (thus the “sort of” in this section title) in that instead of package X for database Y importing DBI and extending its methods, nodbi provides the connection objects to various databases, and a set of common methods for interacting with those databases. Whereas DBI doesn’t import or depend in any way on packages for individual databases, nodbi has packages for individual databases in Suggests. That is to say, nodbi can be installed and loaded without any specific database package, but to use any one database (e.g., Mongo) you need to install the package for that database (i.e., mongolite). Because database packages in nodbi are in Suggests, you can pick and choose the packages you want to have installed - and forget about the others.
Check out the package on GitHub.
Thus far, we’ve built nodbi around data.frame’s. That is, we’re focusing on the data.frame use case as it’s very common that R users are dealing solely with data.frame’s in their analysis pipelines.
Let us know if you have a strong need for lists or vectors, etc. in nodbi.
Connection functions to each database follow the pattern src_
following the pattern used by dplyr
src_mongo
src_redis
src_couchdb
src_elastic
src_etcd
The details of connecting to different databases vary as you’d imagine, but what’s returned from src_
functions can be passed to any of the below methods (for the most part).
docdb_create
): Create an objectdocdb_get
): Get an objectdocdb_delete
): Delete an objectdocdb_update
): Update an object. CouchDB only for nowOperations in nodbi follow a key-value pattern. That is, an object (read: data.frame) is stored in your choice of database with a key (read: character string). Not every operation above needs both the key and value - for example, deleting an object only requires the key.
Internally, a key maps differently to Redis vs. Mongo vs. CouchDB, etc. But the point is that we’re abstracting away the details so that the user can focus on getting things done faster.
You’ll need to start up the database you want to work with if it’s not started already. Let’s work with Mongo for this blog post.
Check out the mongolite book for details on installing MongoDB
Start up mongo on your cli (i.e., shell or terminal)
mongod # or sudo mongod
Install database packages
install.packages(c("sofa", "etseed", "elastic", "mongolite", "redux"))
Install nodbi
devtools::install_github("ropensci/nodbi")
library(nodbi)
Connect to Mongo with src_mongo
(src <- src_mongo(verbose = FALSE))
#> MongoDB 3.6.2 (uptime: 3017s)
#> URL: leothelion.local/test
With the connection object you can access the name of the Mongo database
src$db
#> [1] "test"
And the connection mongolite
connection object
src$con
#> <Mongo collection> 'test'
#> $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000)
#> $count(query = "{}")
#> $distinct(key, query = "{}")
#> $drop()
#> $export(con = stdout(), bson = FALSE)
#> $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000)
#> $import(con, bson = FALSE)
#> $index(add = NULL, remove = NULL)
#> $info()
#> $insert(data, pagesize = 1000, stop_on_error = TRUE, ...)
#> $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0)
#> $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL)
#> $remove(query, just_one = FALSE)
#> $rename(name, db = NULL)
#> $replace(query, update = "{}", upsert = FALSE)
#> $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)
Insert a data.frame (the returned object is from mongolite
) using docdb_create
library("ggplot2")
x <- docdb_create(src = src, key = "diamonds", value = diamonds)
x
#> List of 5
#> $ nInserted : num 53940
#> $ nMatched : num 0
#> $ nRemoved : num 0
#> $ nUpserted : num 0
#> $ writeErrors: list()
Get the data.frame back from Mongo using docdb_get
head(docdb_get(src, "diamonds"))
#> carat cut color clarity depth table price x y z
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Delete the object using docdb_delete
docdb_delete(src, "diamonds")
The same pattern as above can be used for any of the other database options, except the connecting with src_*
differs among databases.
The one that comes closest is storr by Rich FitzJohn - which shares Redis in common with nodbi as a backend. storr seems to be more focused on the caching use case - whereas, nodbi focuses on user X has a specific database they keep their data in and they’d like to not worry about the details of working with that database. Perhaps the distinction isn’t really there, I’m not sure.
We’d love to get some eyes on this; to sort out problems that will no doubt arise from real world scenarios; to flesh out new use cases we hadn’t thought of, etc.
Open an issue: https://github.com/ropensci/nodbi/issues/new.