Wednesday, August 2, 2017 From rOpenSci (https://ropensci.org/blog/2017/08/02/elasticsearch-client/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
elastic is an R client for Elasticsearch
elastic
has been around since 2013, with the first commit in November, 2013.
sidebar - ’elastic’ was picked as a package named before the company now known as Elastic changed their name to Elastic.
If you aren’t familiar with Elasticsearch, it is a distributed, RESTful search and analytics engine. It’s similar to Solr. It falls in the NoSQL bin of databases, holding data in JSON documents, instead of rows and columns. Elasticsearch has a concept of index, similar to a database in SQL-land. You can hold many documents of similar type within a single index. There is powerful search capabilities, including lots of different types of queries that can be done separately or combined. And best of all it’s super fast.
The Elastic company maintains some official clients, including the Python client elasticsearch-py, and it’s higher level DSL client elasticsearch-dsl.
I won’t talk much about it, but we have slowly been working on an R equivalent of the Python DSL client, called elasticdsl, for a human friendly way to compose Elasticsearch queries.
Check out the elastic introduction vignette and the search vignette to get started.
elastic
has nearly complete coverage of the Elasticsearch HTTP API. If there’s
anything missing you need in this client, let us know! Check out the
features label
for features we plan to add to the package.elastic
is not
failing well!docs_bulk
, our interface to the Elasticsearch bulk API
we make it easy to create documents in your Elasticsearch instance from R lists,
data.frame’s and from bulk format files on disk.elastic
works with most versions of Elasticsearch. We run the test suite on 11
versions of Elasticsearch, from v1.0.0
up to v5.5.0
. We strive to fail well
with useful messages when there is a feature no longer available or one that is
a new feature and not available in previous Elasticsearch versions.[elasticsearch]
and [r]
https://stackoverflow.com/questions/tagged/elasticsearch+rInstall elastic
install.packages("elastic")
Or get the development version:
devtools::install_github("ropensci/elastic")
library(elastic)
I’m running Elasticsearch version:
ping()$version$number
#> [1] "5.4.0"
Using connect()
elastic::connect()
#> transport: http
#> host: 127.0.0.1
#> port: 9200
#> path: NULL
#> username: NULL
#> password: <secret>
#> errors: simple
#> headers (names): NULL
By default, you connect to localhost
and port 9200
. There’s paramaters
for setting transport schema, username, password, and base search path (e.g.,
_search
or something else).
See bottom of post about possible changes in connections.
Elasticsearch has a bulk load API to load data in fast. The format is pretty weird
though. It’s sort of JSON, but would pass no JSON linter. I include a few data
sets in elastic
so it’s easy to get up and running, and so when you run examples
in this package they’ll actually run the same way (hopefully).
A dataset inluded in the elastic
package is metadata for PLOS scholarly articles.
Get the file path, then load:
plosdat <- system.file("examples", "plos_data.json", package = "elastic")
invisible(docs_bulk(plosdat))
The main search function is Search()
. Running it without any inputs searches
across all indices - in this case only the plos
index.
Search()
#> $took
#> [1] 1
#>
#> $timed_out
#> [1] FALSE
#>
#> $`_shards`
#> $`_shards`$total
#> [1] 5
#>
#> $`_shards`$successful
#> [1] 5
#>
#> $`_shards`$failed
#> [1] 0
...
Search just the plos
index and only return 1 result
Search(index = "plos", size = 1)$hits$hits
#> [[1]]
#> [[1]]$`_index`
#> [1] "plos"
#>
#> [[1]]$`_type`
#> [1] "article"
#>
#> [[1]]$`_id`
#> [1] "0"
#>
#> [[1]]$`_score`
#> [1] 1
#>
#> [[1]]$`_source`
#> [[1]]$`_source`$id
#> [1] "10.1371/journal.pone.0007737"
#>
#> [[1]]$`_source`$title
#> [1] "Phospholipase C-β4 Is Essential for the Progression of the Normal Sleep Sequence and Ultradian Body Temperature Rhythms in Mice"
Search the plos
index, and the article
document type, sort by title, and query for antibody, limit to 1 result.
First, with Elasticsearch v5
and greater, we need to set fielddata = true
if we want to search on or sort on a text field.
mapping_create("plos", "article", update_all_types = TRUE, body = '{
"properties": {
"title": {
"type": "text",
"fielddata": true
}
}
}')
#> $acknowledged
#> [1] TRUE
Search(index = "plos", type = "article", sort = "title", q = "antibody", size = 1)$hits$hits
#> [[1]]
#> [[1]]$`_index`
#> [1] "plos"
#>
#> [[1]]$`_type`
#> [1] "article"
#>
#> [[1]]$`_id`
#> [1] "568"
#>
#> [[1]]$`_score`
#> NULL
#>
#> [[1]]$`_source`
#> [[1]]$`_source`$id
#> [1] "10.1371/journal.pone.0085002"
#>
#> [[1]]$`_source`$title
#> [1] "Evaluation of 131I-Anti-Angiotensin II Type 1 Receptor Monoclonal Antibody as a Reporter for Hepatocellular Carcinoma"
#>
#>
#> [[1]]$sort
#> [[1]]$sort[[1]]
#> [1] "1"
Get document with id=1
docs_get(index = 'plos', type = 'article', id = 1)
#> $`_index`
#> [1] "plos"
#>
#> $`_type`
#> [1] "article"
#>
#> $`_id`
#> [1] "1"
#>
#> $`_version`
#> [1] 1
#>
#> $found
#> [1] TRUE
#>
#> $`_source`
#> $`_source`$id
#> [1] "10.1371/journal.pone.0098602"
#>
#> $`_source`$title
#> [1] "Population Genetic Structure of a Sandstone Specialist and a Generalist Heath Species at Two Levels of Sandstone Patchiness across the Strait of Gibraltar"
Get certain fields
docs_get(index = 'plos', type = 'article', id = 1, fields = 'id')
#> $`_index`
#> [1] "plos"
#>
#> $`_type`
#> [1] "article"
#>
#> $`_id`
#> [1] "1"
#>
#> $`_version`
#> [1] 1
#>
#> $found
#> [1] TRUE
You can optionally get back raw JSON from many functions by setting parameter raw=TRUE
.
For example, get raw JSON, then parse with jsonlite
(out <- docs_mget(index = "plos", type = "article", id = 5:6, raw = TRUE))
#> [1] "{\"docs\":[{\"_index\":\"plos\",\"_type\":\"article\",\"_id\":\"5\",\"_version\":1,\"found\":true,\"_source\":{\"id\":\"10.1371/journal.pone.0085123\",\"title\":\"MiR-21 Is under Control of STAT5 but Is Dispensable for Mammary Development and Lactation\"}},{\"_index\":\"plos\",\"_type\":\"article\",\"_id\":\"6\",\"_version\":1,\"found\":true,\"_source\":{\"id\":\"10.1371/journal.pone.0098600\",\"title\":\"Correction: Designing Mixed Species Tree Plantations for the Tropics: Balancing Ecological Attributes of Species with Landholder Preferences in the Philippines\"}}]}"
#> attr(,"class")
#> [1] "elastic_mget"
jsonlite::fromJSON(out)
#> $docs
#> _index _type _id _version found _source.id
#> 1 plos article 5 1 TRUE 10.1371/journal.pone.0085123
#> 2 plos article 6 1 TRUE 10.1371/journal.pone.0098600
#> _source.title
#> 1 MiR-21 Is under Control of STAT5 but Is Dispensable for Mammary Development and Lactation
#> 2 Correction: Designing Mixed Species Tree Plantations for the Tropics: Balancing Ecological Attributes of Species with Landholder Preferences in the Philippines
Here, we’ll use another dataset that comes with the package on Shakespeare plays.
gbifdat <- system.file("examples", "gbif_data.json", package = "elastic")
invisible(docs_bulk(gbifdat))
Define an aggregation query:
aggs <- '{
"aggs": {
"latbuckets" : {
"histogram" : {
"field" : "decimalLatitude",
"interval" : 5
}
}
}
}'
Search the gbif
index
res <- Search(index = "gbif", body = aggs, size = 0)$aggregations$latbuckets$buckets
do.call("rbind.data.frame", res)
#> key doc_count
#> 2 -35 1
#> 22 -30 0
#> 3 -25 0
#> 4 -20 0
#> 5 -15 0
#> 6 -10 0
#> 7 -5 1
#> 8 0 0
#> 9 5 0
#> 10 10 0
#> 11 15 0
#> 12 20 0
#> 13 25 4
#> 14 30 2
#> 15 35 3
#> 16 40 2
#> 17 45 66
#> 18 50 183
#> 19 55 487
#> 20 60 130
#> 21 65 20
When you want all the documents, your best bet is likely to be scrolling search.
Here’s an example. First, use Search()
, setting a value for the scroll
parameter.
res1 <- Search(index = 'shakespeare', scroll = "1m")
You get a scroll ID back when setting the scroll
parameter
res1$`_scroll_id`
#> [1] "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAElFnZ2X3FJVWEyUU1HQjl2cFpWUFl0cXcAAAAAAAABJBZ2dl9xSVVhMlFNR0I5dnBaVlBZdHF3AAAAAAAAAScWdnZfcUlVYTJRTUdCOXZwWlZQWXRxdwAAAAAAAAEmFnZ2X3FJVWEyUU1HQjl2cFpWUFl0cXcAAAAAAAABIxZ2dl9xSVVhMlFNR0I5dnBaVlBZdHF3"
Use a while loop to get all results
out1 <- list()
hits <- 1
while (hits != 0) {
tmp1 <- scroll(scroll_id = res1$`_scroll_id`)
hits <- length(tmp1$hits$hits)
if (hits > 0) {
out1 <- c(out1, tmp1$hits$hits)
}
}
Woohoo! Collected all 1 documents in very little time.
Now, get _source
from each document:
docs <- lapply(out1, "[[", "_source")
length(docs)
#> [1] 4988
vapply(docs[1:10], "[[", "", "text_entry")
#> [1] "Without much shame retold or spoken of."
#> [2] "For more uneven and unwelcome news"
#> [3] "And shape of likelihood, the news was told;"
#> [4] "Mordake the Earl of Fife, and eldest son"
#> [5] "It is a conquest for a prince to boast of."
#> [6] "Amongst a grove, the very straightest plant;"
#> [7] "That some night-tripping fairy had exchanged"
#> [8] "Then would I have his Harry, and he mine."
#> [9] "This is his uncles teaching; this is Worcester,"
#> [10] "Malevolent to you in all aspects;"
You’ve already seen the bulk docs API in action above. Above though, we were
using docs_bulk.character
- where the input is a character string that’s a
file path.
Here, I’ll describe briefly how you can insert any data.frame as documents in your
Elasticsearch instance. We’ll use the diamonds dataset from the ~54K row ggplot2
package.
#> $acknowledged
#> [1] TRUE
library(ggplot2)
invisible(docs_bulk(diamonds, "diam"))
#> |==================================| 100%
Search("diam")$hits$total
#> [1] 47375
That’s pretty easy! This function is used a lot, particularly with data.frame’s - so we get many questions/feedback on this so it will just keep getting better/faster.
We’re planning to roll out changes in how you connect to Elasticsearch from elastic
.
Right now, you can only connect to one Elasticsearch instance per R session -
your details are set and then recalled internally in each function. We plan to change
this to instantiate a client and then you either call functions on the client
(e.g., using R6
) or pass the client object onto functions.
Checkout issue #87 to follow progress or discuss.
crul
is a relatively new R http client - and has async baked in - as well as mocking.
Development should be easier with it as I can mock requests for test suites, and
allow users to toggle async more easily.
We can use your help! Elasticsearch development moves pretty fast - we’d love this client to work with every single Elasticsearch version to the extent possible - and we’d love to squash every bug and solve every feature request fast.
If you need to use Elasticsearch from R, please try out elastic
!