rOpenSci | taxadb: A High-Performance Local Taxonomic Database Interface

Dealing with taxonomic inconsistencies within and across datasets is a fundamental challenge of ecology and evolutionary biology. Accounting for species synonyms and for the splitting and merging of taxa is especially important as aggregation of data across time and different data sources becomes increasingly common. One potentially powerful approach for addressing these issues is to resolve scientific names to taxonomic identifiers that follow a consistent taxonomic concept. In such a workflow, data from one of the many taxonomic providers (e.g. Integrated Taxonomic Information System [1], Catalogue of Life [2], National Center for Biotechnology Information [3]) is integrated with biodiversity datasets to identify an accepted ID for each name. Multiple tools exist to facilitate this workflow, including R’s taxize package [4], which provides an interface to the APIs of taxonomic databases. However, because API queries are slow, limited in scope, and dependent on the current state of the remote database, it remains difficult to resolve names to a taxonomic authority in a quick, reproducible way. taxadb seeks to address these issues using a new approach: interfacing with taxonomic data via a local database of taxonomic providers.

The goal of this post is to illustrate the ease with which taxadb can be integrated into existing data munging workflows, and to give a taste of the other exploratory questions that the database backend makes possible.

🔗 Database backend

taxadb is built around a local database of taxonomic data from seven of the largest taxonomic providers. The tables of this database are standardized across providers and include information on accepted IDs, synonym mappings, and common names when available. The database is accessible to the user through a variety of database backends. Using a local database interface allows not only for quick queries to retrieve taxon IDs, but also for queries across the whole database. Because taxonomic providers are constantly updating their data, databases will be time-stamped and archived, allowing users to select the desired release for reproducible results.

🔗 taxadb framework

taxadb has three main families of functions:

  • queries that return vectors: get_ids() and its complement, get_names(),
  • queries that filter the underlying taxonomic data frames: filter_name(), filter_rank(), filter_id(), and filter_common(),
  • database functions: td_create(), td_connect(), and taxa_tbl().

Query functions trigger the automatic one-time setup of the local database for the chosen provider, but setup can also be triggered manually with td_create() for one or all providers.
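Manual setup can be wrapped in a small helper when several providers are needed up front. The sketch below is only an illustration: setup_providers() is a hypothetical function, not part of taxadb, and it assumes that td_create() accepts a provider abbreviation such as "col" as its first argument, as in the calls used elsewhere in this post.

```r
# A sketch only: setup_providers() is a hypothetical helper, not part of taxadb.
setup_providers <- function(providers = c("itis", "col")) {
  # Do nothing (and say so) when taxadb is not available.
  if (!requireNamespace("taxadb", quietly = TRUE)) {
    message("taxadb is not installed; skipping database setup.")
    return(invisible(FALSE))
  }
  for (p in providers) {
    # Assumed usage: one td_create() call per provider abbreviation.
    taxadb::td_create(p)
  }
  invisible(TRUE)
}

# Usage (this downloads data, so it is left commented out here):
# setup_providers("col")
```

Looping over providers one at a time keeps the sketch independent of how td_create() handles multiple providers in a single call.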

🔗 taxadb workflow

taxadb is designed for relatively painless local database setup and easy integration of taxonomic IDs into existing workflows. Consider, for example, the common scenario of merging two datasets that each follow their own taxonomic approach, such as matching trait data to data on IUCN status. Here we use snippets of data from the Elton Traits v1.0 database [5] and the IUCN Red List [6].

library(tidyverse)
library(taxadb)

status_data <- read_tsv(system.file("extdata", "status_data.tsv", package = "taxadb"))

| iucn_name | category |
| --- | --- |
| Pipile pipile | CR |
| Pipile cumanensis | LC |
| Pipile cujubi | LC |
| Pipile jacutinga | EN |
| Megapodius decollatus | LC |
| Scleroptila gutturalis | LC |
| Margaroperdix madagarensis | LC |
| Falcipennis falcipennis | NT |

trait_data <- read_tsv(system.file("extdata", "trait_data.tsv", package = "taxadb"))

| elton_name | mass |
| --- | --- |
| Aburria pipile | 1816.59 |
| Aburria cumanensis | 1239.22 |
| Aburria cujubi | 1195.82 |
| Aburria jacutinga | 1240.96 |
| Megapodius reinwardt | 666.34 |
| Francolinus levalliantoides | 376.69 |
| Margaroperdix madagascariensis | 245.00 |
| Catreus wallichii | 1436.88 |
| Falcipennis falcipennis | 685.61 |
| Falcipennis canadensis | 473.65 |

The common approach in this scenario is to simply join by scientific name:

joined <- full_join(trait_data, status_data, by = c("elton_name" = "iucn_name"))

| elton_name | mass | category |
| --- | --- | --- |
| Aburria pipile | 1816.59 | -- |
| Aburria cumanensis | 1239.22 | -- |
| Aburria cujubi | 1195.82 | -- |
| Aburria jacutinga | 1240.96 | -- |
| Megapodius reinwardt | 666.34 | -- |
| Francolinus levalliantoides | 376.69 | -- |
| Margaroperdix madagascariensis | 245.00 | -- |
| Catreus wallichii | 1436.88 | -- |
| Falcipennis falcipennis | 685.61 | NT |
| Falcipennis canadensis | 473.65 | -- |
| Pipile pipile | -- | CR |
| Pipile cumanensis | -- | LC |
| Pipile cujubi | -- | LC |
| Pipile jacutinga | -- | EN |
| Megapodius decollatus | -- | LC |
| Scleroptila gutturalis | -- | LC |
| Margaroperdix madagarensis | -- | LC |

This results in only one match between the two datasets, Falcipennis falcipennis. However, if we resolve names first to taxonomic identifiers, which account for synonyms and taxonomic changes, we see a different story.
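The single match can be checked without any join at all by intersecting the two name columns. The sketch below is a minimal base R illustration; the name vectors are typed in from the two data snippets shown above rather than read from the package files.

```r
# Names copied from the two data snippets shown above.
elton_name <- c("Aburria pipile", "Aburria cumanensis", "Aburria cujubi",
                "Aburria jacutinga", "Megapodius reinwardt",
                "Francolinus levalliantoides", "Margaroperdix madagascariensis",
                "Catreus wallichii", "Falcipennis falcipennis",
                "Falcipennis canadensis")
iucn_name <- c("Pipile pipile", "Pipile cumanensis", "Pipile cujubi",
               "Pipile jacutinga", "Megapodius decollatus",
               "Scleroptila gutturalis", "Margaroperdix madagarensis",
               "Falcipennis falcipennis")

# Names that appear verbatim in both datasets.
shared_names <- intersect(elton_name, iucn_name)
shared_names

# Number of distinct names overall, i.e. the row count of the full join.
n_joined_rows <- length(union(elton_name, iucn_name))
n_joined_rows
```

Exact string matching cannot see that Aburria pipile and Pipile pipile refer to the same bird, nor that Margaroperdix madagarensis is a spelling variant.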

First we get IDs for each dataset:

traits <- trait_data %>% mutate(id = get_ids(elton_name, "col"))
status <- status_data %>% mutate(id = get_ids(iucn_name, "col"))

And join on the ID:

joined <- full_join(traits, status, by = "id")

| elton_name | iucn_name | mass | category | id |
| --- | --- | --- | --- | --- |
| Aburria pipile | Pipile pipile | 1816.59 | CR | COL:35517887 |
| Aburria cumanensis | Pipile cumanensis | 1239.22 | LC | COL:35537158 |
| Aburria cujubi | Pipile cujubi | 1195.82 | LC | COL:35537159 |
| Aburria jacutinga | Pipile jacutinga | 1240.96 | EN | COL:35517886 |
| Megapodius reinwardt | -- | 666.34 | -- | COL:35521309 |
| Francolinus levalliantoides | -- | 376.69 | -- | COL:35518087 |
| Margaroperdix madagascariensis | Margaroperdix madagarensis | 245.00 | LC | COL:35521355 |
| Catreus wallichii | -- | 1436.88 | -- | COL:35518185 |
| Falcipennis falcipennis | Falcipennis falcipennis | 685.61 | NT | COL:35521380 |
| Falcipennis canadensis | -- | 473.65 | -- | COL:35521381 |
| -- | Megapodius decollatus | -- | LC | COL:35537166 |
| -- | Scleroptila gutturalis | -- | LC | -- |

Now we see six matches between the datasets, rather than the single match found by joining on names. In a workflow without taxonomic identifiers, resolving these additional matches would require a significant investment of time, as each name would need to be double-checked and matched manually.
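One caveat when joining on identifiers: by default, dplyr's joins treat missing values as equal to each other, so two unresolved names (each with an NA id) would be paired as if they were the same taxon. In the example above only one dataset contains an unresolved name, so no spurious match arises. The sketch below uses a small hypothetical pair of tables (the name "Unresolved name A" is invented for illustration) and dplyr's na_matches argument to show the difference.

```r
library(dplyr)

# Hypothetical toy data: NA marks a name the provider could not resolve.
traits <- tibble(
  elton_name = c("Aburria pipile", "Unresolved name A"),
  id = c("COL:35517887", NA)
)
status <- tibble(
  iucn_name = c("Pipile pipile", "Scleroptila gutturalis"),
  id = c("COL:35517887", NA)
)

# Default behaviour: the two NA ids are treated as a match.
default_join <- full_join(traits, status, by = "id")

# Stricter behaviour: NA never matches, so unresolved names stay on separate rows.
strict_join <- full_join(traits, status, by = "id", na_matches = "never")

nrow(default_join)
nrow(strict_join)
```

With na_matches = "never", the two unresolved names remain unmatched, which is usually the safer reading of "no identifier found".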

🔗 Database facilitated questions

The local database structure also allows us to ask general questions of the entire database, either across providers or across tables for one provider, that are not possible through an API. For example, which provider would be able to resolve the largest number of species names in our dataset?

provider_counts <- trait_data %>%
  select(elton_name) %>%
  mutate(
    gbif = get_ids(elton_name, "gbif"),
    col = get_ids(elton_name, "col"),
    itis = get_ids(elton_name, "itis"),
    ncbi = get_ids(elton_name, "ncbi"),
    wd = get_ids(elton_name, "wd"),
    iucn = get_ids(elton_name, "iucn"),
    ott = get_ids(elton_name, "ott")
  ) %>%
  purrr::map_dbl(function(x)
    sum(!is.na(x))) %>%
  tibble::enframe("provider", "ID_count")

| provider | ID_count |
| --- | --- |
| gbif | 10 |
| col | 10 |
| itis | 10 |
| ncbi | 1 |
| wd | 4 |
| iucn | 0 |
| ott | 10 |

Or, even more generally, which bird families have the most species?

bird_families <- filter_rank(name = "Aves", rank = "class", provider = "col") %>%
  filter(taxonomicStatus == "accepted", taxonRank=="species") %>% 
  group_by(family) %>%
  count(sort = TRUE) %>%
  head()

| family | n |
| --- | --- |
| Tyrannidae | 401 |
| Thraupidae | 374 |
| Psittacidae | 370 |
| Trochilidae | 338 |
| Muscicapidae | 314 |
| Columbidae | 312 |

And which species has the most synonyms?

most_synonyms <-
  taxa_tbl("col") %>%
  count(acceptedNameUsageID, sort=TRUE) %>%
  head() %>%
  collect()

| acceptedNameUsageID | n |
| --- | --- |
| COL:43082445 | 456 |
| COL:43081989 | 373 |
| COL:43124375 | 329 |
| COL:43353659 | 328 |
| COL:43223150 | 322 |
| COL:43337824 | 307 |

For the provider Catalogue of Life it is COL:43082445, the mint species Mentha longifolia.

In addition to facilitating quick and easy incorporation of taxonomic identifiers into standard research workflows, taxadb provides direct access to the underlying database of taxonomic providers. Users can therefore use familiar syntax to ask important exploratory questions of the providers rather than being dependent upon the kinds of queries allowed by an API. By providing both a simple interface to IDs and the potential for more in-depth exploration, we hope to encourage improved inclusion and understanding of taxonomic data by the biodiversity community.

For more details on the backend options, providers, and the above examples, please see our docs. We also welcome feedback on our manuscript.

🔗 Acknowledgements

taxadb was co-developed by Carl Boettiger. The package was greatly improved by the rOpenSci peer review process and reviewers Margaret Siple and Lindsay Platt.

🔗 References


  1. Retrieved [2019], from the Integrated Taxonomic Information System (ITIS) (http://www.itis.gov).

  2. Roskov Y., Ower G., Orrell T., Nicolson D., Bailly N., Kirk P.M., Bourgoin T., DeWalt R.E., Decock W., Nieukerken E. van, Zarucchi J., Penev L., eds. (2019). Species 2000 & ITIS Catalogue of Life, 2019 Annual Checklist. Digital resource at www.catalogueoflife.org/annual-checklist/2019. Species 2000: Naturalis, Leiden, the Netherlands. ISSN 2405-884X. https://www.catalogueoflife.org/annual-checklist/2019/

  3. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Yaschenko E, Ye J (2009). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009 Jan;37(Database issue):D5-15. Epub 2008 Oct 21. https://doi.org/10.1093/nar/gkn741

  4. Chamberlain S, Szoecs E, Foster Z, Arendsee Z, Boettiger C, Ram K, Bartomeus I, Baumgartner J, O’Donnell J, Oksanen J, Tzovaras BG, Marchand P, Tran V, Salmon M, Li G, Grenié M (2019). taxize: Taxonomic information from around the web. R package version 0.9.9, https://github.com/ropensci/taxize

  5. Wilman, H. et al. EltonTraits 1.0: Species-level foraging attributes of the world’s birds and mammals: Ecological Archives E095-178. Ecology 95, 2027–2027 (2014). https://doi.org/10.1890/13-1917.1

  6. IUCN 2019. The IUCN Red List of Threatened Species. Version 2019-3. http://www.iucnredlist.org