rOpenSci | When Field or Lab Work is not an Option - Leveraging Open Data Resources for Remote Research

The COVID-19 pandemic has dramatically impacted all of our lives in a very short period of time. Spring and summer are usually very busy as students prepare to go to the field to engage in various data collection efforts. The pandemic has also disrupted these carefully planned activities as travel is suspended and local and remote field stations have closed indefinitely. A lost field season can be a major setback for a dissertation timeline and students will have to improvise. One promising opportunity to continue research efforts during these unprecedented times is taking advantage of the massive amounts of open scientific data that are freely available. Open data can form the basis of a review, synthesis, or new research.

Inspired by tweets from Ethan White about “PhD research from a distance”, the rOpenSci team did an in-depth exploration of how we provide access to open data. Our goal is to inspire students to find research opportunities with open data and highlight some of the rOpenSci packages that already make programmatic access possible. We also highlight some examples of how specific collections of packages are being used right now in fields as varied as archaeology and climate science.

🔗 Exploring open data

Data are fundamental to scientific discovery, and leveraging new discoveries would not be possible without access to data [1]. Although people rarely develop new research entirely on open data, these datasets provide an opportunity to reproduce and validate existing results, improve models, and combine them with other data to generate new syntheses. The open science movement has been growing for over a decade, and that interest has surfaced numerous databases and repositories. The growing interest in reproducibility has also led to the creation of a plethora of open source software to access such data. rOpenSci’s core mission is to develop such tools and to date we have built over 120 robust data-access packages. These packages provide access to an impressive variety and quantity of data:
eBird offers up 700 million observations, Crossref has 108 million records of scholarly works which include articles and books, Dryad makes available 13 terabytes of data associated with published papers, and GBIF has over 1.3 billion records of species worldwide.

We hope that this post and these tools provide inspiration for you to explore new data sources and research topics.

🔗 Data sources for your research

Many of rOpenSci’s tools are developed by practicing scientists and have strong communities behind them. We invited university faculty from our community of developer-researchers to highlight sources of open data for research in their fields.

🔗 Climate and weather

Brooke Anderson, Colorado State University

Research on weather and climate, and on their impacts on humans and the environment, can draw on numerous excellent open data sources, many of them available through programmatic access to data collected and shared by institutions and monitoring networks. The US Geological Survey is a particularly exciting example: it provides not only APIs for accessing its data but also a full suite of R packages developed and shared through the USGS-R community. rOpenSci’s own rnoaa package provides access to data through a number of the US National Oceanic and Atmospheric Administration’s open data APIs, allowing fast and convenient access from R to national or worldwide data on, among others, meteorological observations, sea ice, and tides and currents, while rOpenSci’s bomrang package offers similar access to data from the Australian Government Bureau of Meteorology. Other rOpenSci packages provide access to weather- and climate-related data from the Iowa Environment Mesonet (riem), New Zealand’s National Climate Database (clifro), the US National Aeronautics and Space Administration’s Prediction of Worldwide Energy Resource (POWER) dataset (nasapower), the US National Centers for Environmental Information’s Global Surface Summary of the Day (GSOD) dataset (GSODR), the US National Hurricane Center (rrricanes), the Flanders Environment Agency and Flanders Hydraulics Research’s dataset (wateRinfo), and Environment and Climate Change Canada (ECCC) (weathercan). bowerbird is a general-purpose package for maintaining local copies of a range of satellite- and model-derived environmental and climate data.

🔗 Water

Louise Slater, University of Oxford, Sam Zipper, University of Kansas, Ilaria Prosdocimi, Ca’ Foscari University, Sam Albers, Government of British Columbia, and Claudia Vitolo, European Centre for Medium-Range Weather Forecasts

In hydrology, there has been rapid growth in the number of streamflow data archives made publicly available online by countries such as the UK (rnrfa package), USA (dataRetrieval package), Greece (rOpenSci’s hydroscoper package), and Canada (rOpenSci’s tidyhydat package), although most countries sadly do not yet apply an open policy to their hydrological data. The Task View on Hydrological Data and Modelling and the accompanying blog post Getting your toes wet in R: Hydrology, meteorology, and more provide an exciting overview of the most up-to-date R packages that are available for downloading, analysing, and modelling these data. For an overview of the many advantages of using R for hydrological research, see the paper “Using R in Hydrology” [2], which describes approaches to retrieve, analyse, map, model, and visualise hydrological data.
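
As a minimal sketch of what this kind of access can look like, the snippet below pulls daily streamflow for one Canadian gauge with tidyhydat and averages it by year. The helper names `fetch_daily_flows()` and `annual_mean_flow()` and the example station number are illustrative choices, not part of any package. The sketch also assumes that the HYDAT database has already been downloaded once with `tidyhydat::download_hydat()`, and that the returned table has `Date` and `Value` columns.

```r
# Sketch only: assumes tidyhydat is installed and HYDAT has been downloaded
# once with tidyhydat::download_hydat() (a large, one-off download).

# Hypothetical helper: daily flows for one Water Survey of Canada station.
fetch_daily_flows <- function(station) {
  tidyhydat::hy_daily_flows(station_number = station)
}

# Hypothetical helper: mean flow per calendar year.
# Expects a data frame with a Date column (class Date) and a numeric Value column.
annual_mean_flow <- function(flows) {
  flows$year <- as.integer(format(flows$Date, "%Y"))
  # The formula interface drops rows with a missing Value before averaging.
  aggregate(Value ~ year, data = flows, FUN = mean)
}

if (interactive()) {
  flows <- fetch_daily_flows("08MF005")  # example station ID, chosen for illustration
  head(annual_mean_flow(flows))
}
```

Keeping the download step separate from the summary step means the summary can be reused on streamflow tables from other sources, as long as they have the same two columns.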

🔗 Antarctic and Southern Ocean

Ben Raymond, Australian Antarctic Division, and Anton Van de Putte, Royal Belgian Institute of Natural Sciences

Antarctic science has a strong culture of open data: the Antarctic Treaty itself states that scientific observations and results from Antarctica should be openly shared, and the Scientific Committee on Antarctic Research has had an active data management group since the late 1980s. To find Antarctic and Southern Ocean data, search the Antarctic Master Directory (metadata catalogue) or portals such as the Antarctic Biodiversity portal or the Southern Ocean Observing System.

The Antarctic rOpenSci community is developing R resources to support Antarctic and Southern Ocean science, with a particular emphasis on simplifying data access and performing common analytical tasks. See this blog post and task view for an overview of some of the packages in development, and the types of analyses that we are aiming to support.

🔗 Archaeology

Ben Marwick, University of Washington

Research shuddered to a stop in the Geoarchaeology Lab in early March, with UW being one of the first US campuses to switch to remote work. No longer able to go to campus, we turned our attention to computational text analysis of a large corpus of archaeological conference abstracts to look at questions about gender imbalance and theory change in our field. Our quick pivot to this new area was only possible thanks to high-quality and well-documented software such as rOpenSci’s tesseract, pdftools and magick packages. These enabled us to generate data rapidly, giving us more time for exploring and testing hypotheses, and ensuring our students could get to the end of the term ready to share some really interesting results.

We’ve been keeping up with the literature through in-depth study of new journal articles, especially those that include open data. Archaeologists use specialised repositories such as the Digital Archaeological Record (tDAR) and Open Context, as well as several generic repositories, to share data (e.g. Zenodo, Figshare, Dataverse; each of these has an R package to access data). There are R packages for accessing data hosted by those archaeology repositories (tdar, opencontext), but many of our favourite recent articles (we keep a list here) had their data openly archived on the Open Science Framework data repository. While studying these articles we have enjoyed using rOpenSci’s osfr package to quickly and reproducibly access these materials for in-depth exploration. A favourite type of data for many archaeologists is radiocarbon ages, and our group has also been working with these with ease thanks to the c14bazAAR package. We’ve been using this package to get data to study radiocarbon dates from hundreds of archaeological sites in Australia. While we’re missing the lab, rOpenSci’s packages for acquiring archaeological data have been invaluable tools for efficiently enabling us to be active and engaged in our research.

Our task view for archaeological science shows the full range of tools we use, from data acquisition through environmental and geological analysis to writing reproducible manuscripts.
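
The osfr workflow described above might look roughly like the sketch below, which lists the top-level contents of a public OSF project and downloads them to a local folder. The helper name `download_osf_files()` is hypothetical, and the project identifier is left as a placeholder because it depends on the article you are studying.

```r
# Sketch only: assumes the osfr package is installed and the OSF project is public.

# Hypothetical helper: download the top-level files and folders of an OSF project.
# Note that osf_ls_files() lists at most 10 entries unless n_max is raised.
download_osf_files <- function(project_id, dest = tempdir()) {
  node  <- osfr::osf_retrieve_node(project_id)
  files <- osfr::osf_ls_files(node)
  # conflicts = "skip" leaves files that already exist locally untouched.
  osfr::osf_download(files, path = dest, conflicts = "skip")
}

# Usage, with a placeholder in place of a real five-character OSF project ID:
# download_osf_files("<project-id>", dest = "data-raw")
```

Putting the project ID in a script rather than downloading by hand keeps a record of exactly which materials an analysis started from.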

🔗 Transport

Robin Lovelace, University of Leeds

There has never been a better time for data-driven and reproducible transport research. The COVID-19 pandemic has disrupted transport patterns worldwide. This has led to changes, such as the construction of ‘pop-up’ active transport infrastructure, the prioritisation of which can be supported by reproducible and open data analysis, as outlined in a preprint on the topic (the analysis for which was undertaken in R) [3]. There is a wealth of data out there that can be found with careful search queries, along with many new datasets (like Uber’s micromobility datasets, released on May 6th of this year).

  • For downloading data representing transport networks, I recommend heading to the overpass website and for R users checking out osmdata and the in-development geofabric (to be renamed) R packages.

  • For open origin-destination data there are many resources but the PCT package provides a way to access national-scale datasets quickly from the R command line, as outlined in stplanr’s Origin-destination vignette.

  • For road safety data there is a lack of open data in many countries but you can access national road casualty data, with 60+ variables and 100,000+ records each year with the stats19 package.

  • For links to additional resources I recommend Chapter 12 of Geocomputation with R and Chapter 11 of QGIS for transport researchers.

  • For inspiration, I recommend checking out the Propensity to Cycle Tool, an interactive free and open web app that is being used to inform active transport investment plans in dozens of cities across the UK (it also has many data download options at zone, route and route network levels).
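
As one illustration of the first recommendation above, the sketch below uses osmdata to query the Overpass API for dedicated cycleways inside a place's bounding box. The helper name `fetch_cycleways()`, the place name and the choice of the `highway = cycleway` tag are assumptions made for this example. It also assumes the sf package is installed, since the result is returned as sf objects.

```r
# Sketch only: assumes the osmdata and sf packages are installed and the
# Overpass API is reachable.

# Hypothetical helper: dedicated cycleways within a named place's bounding box.
fetch_cycleways <- function(place) {
  query <- osmdata::opq(bbox = osmdata::getbb(place))
  query <- osmdata::add_osm_feature(query, key = "highway", value = "cycleway")
  osmdata::osmdata_sf(query)$osm_lines  # sf data frame of line geometries
}

if (interactive()) {
  cycleways <- fetch_cycleways("Leeds, UK")  # place name chosen for illustration
  nrow(cycleways)
}
```

Changing the `key` and `value` arguments gives other parts of the transport network, such as `highway = primary` for main roads.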

🔗 Taxonomy, biodiversity, ecology

rOpenSci has its roots in software for biodiversity research, with many packages in the areas of taxonomy, biological occurrences, and natural history/traits.

  • taxonomy: A good place to start is the taxonomy task view, covering many options for working with online taxonomy data

  • occurrences: Occurrence data forms the basis of much ecological research. The largest source of occurrence data, GBIF, can be accessed with the rgbif package. Many more are listed in the README for the package spocc.

  • natural history/traits: Conservation researchers may want to fetch data from the IUCN Red List via rredlist, Fishbase life history data from rfishbase, bird data from auk or rebird, or trait data from various marine taxa in WoRMS (called “attributes” by WoRMS; worrms).

A good general resource for rOpenSci packages on biodiversity is the rOpenSci Community Call from March 2019: Research Applications of rOpenSci Taxonomy and Biodiversity Tools.
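
For a sense of how little code an occurrence query needs, here is a minimal sketch using rgbif to fetch georeferenced GBIF records for a single species. The helper names `fetch_occurrences()` and `keep_columns()` are hypothetical, and the columns requested at the end are assumed to be present in the result, which is why the sketch only keeps those that actually exist.

```r
# Sketch only: assumes the rgbif package is installed and GBIF is reachable.

# Hypothetical helper: up to n georeferenced occurrence records for one species.
fetch_occurrences <- function(species, n = 50) {
  res <- rgbif::occ_search(scientificName = species, hasCoordinate = TRUE, limit = n)
  res$data
}

# Hypothetical helper: keep only the requested columns that are present,
# in the order requested.
keep_columns <- function(df, wanted) {
  df[, intersect(wanted, names(df)), drop = FALSE]
}

if (interactive()) {
  occ <- fetch_occurrences("Colibri cyanotus")
  head(keep_columns(occ, c("scientificName", "decimalLatitude",
                           "decimalLongitude", "eventDate")))
}
```

GBIF returns many columns per record, and which ones appear varies by dataset, so selecting defensively avoids errors when a field is missing.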


Browse our table of > 100 data-access packages (under the bird) or jump ahead to see where you come in.

Lesser Violetear (Colibri cyanotus). Carlos Sanchez, Macaulay Library | eBird.


🔗 rOpenSci data-access packages

The table below shows a subset of our full suite of R packages. You can find scientific use cases for a package on our main page by clicking on a package name.

| R package | Data and source | Maintainer |
| --- | --- | --- |
| antanym | Antarctic geographic names. Composite Gazetteer of Antarctica | Ben Raymond |
| AntWeb | Ant data. AntWeb database from the California Academy of Sciences | Karthik Ram |
| auk | bird sighting records. | Matthew Strimas-Mackey |
| bikedata | Historic ride data from public hire bicycle systems. London, U.K.; San Francisco CA, New York City NY, Chicago IL, Washington DC, Boston MA, Los Angeles CA, Philadelphia PA, and Minnesota in the U.S.A.; Montreal, Canada; and Guadalajara, Mexico. | Mark Padgham |
| biomartr | genomic data retrieval. ‘NCBI RefSeq’, ‘NCBI Genbank’, ‘ENSEMBL’, and ‘UniProt’ databases, plus interface to ‘BioMart’ database | Hajk-Georg Drost |
| bittrex | Bittrex crypto-currency exchange. | Michael Kane |
| bold | Bold Systems for genetic barcode data. | Scott Chamberlain |
| brranching | phylogenetic data. ‘Phylomatic’ and ‘Phylocom’ | Scott Chamberlain |
| camsRad | Time series of global, direct, and diffuse irradiations on horizontal surface. Copernicus Atmosphere Monitoring Service (CAMS) | Lukas Lundstrom |
| ccafs | Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models. | Scott Chamberlain |
| chromer | Chromosome Counts Database. | Paula Andrea Martinez |
| clifro | New Zealand National Climate Database. | Blake Seers |
| comtradr | United Nations Comtrade data. | Chris Muir |
| cRegulome | transcription factor/microRNA-gene correlations (co-expression) in cancer. Cistrome Cancer Liu et al. (2011) doi:10.1186/gb-2011-12-8-r83 and ‘miRCancerdb’ databases (in press). | Mahmoud Ahmed |
| dbhydroR | South Florida Water Management District’s ‘DBHYDRO’ database. | Joseph Stachelek |
| DoOR.data | Drosophila odorant response data for DoOR.functions. | Daniel Münch |
| ecoengine | Georeferenced specimen records from the University of California, Berkeley’s Natural History Museums. | Karthik Ram |
| epubr | reading and parsing of internal e-book content from EPUB files. EPUB e-books. | Matthew Leonawicz |
| essurvey | European Social Survey data. | Jorge Cimentada |
| FedData | Geospatial data from several federated data sources (mainly sources maintained by the US federal government). The National Elevation Dataset, the National Hydrography Dataset (USGS), the Soil Survey Geographic (SSURGO) database, the Global Historical Climatology Network (GHCN), the Daymet gridded estimates of daily weather parameters, the International Tree Ring Data Bank, and the National Land Cover Database (NLCD). | R. Kyle Bocinsky |
| fingertipsR | Data for many indicators of public health in England. | Sebastian Fox |
| genderdata | Historical datasets of first names and dates of birth. | Lincoln Mullen |
| getCRUCLdata | University of East Anglia Climate Research Unit gridded climatology of monthly means. | Adam Sparks |
| getlandsat | Landsat 8 Data. | Scott Chamberlain |
| GSODR | Global Surface Summary of the Day (GSOD) weather data from USA National Centers for Environmental Information (NCEI). | Adam Sparks |
| gtfsr | public GTFS feeds. | Danton Noriega-Goodwin |
| gutenbergr | Project Gutenberg collection. | David Robinson |
| hathi | HathiTrust bibliographic API. | Scott Chamberlain |
| hddtools | hydrological data. various data providers | Claudia Vitolo |
| helminthR | London Natural History Museum’s host-parasite database. | Tad Dallas |
| historydata | sample data sets for historians on population, institutional, religious, military, and prosopographical data. | Lincoln Mullen |
| hydroscoper | Greek National Data Bank for Hydrological and Meteorological Information. | Konstantinos Vantas |
| internetarchive | Internet Archive. | Lincoln Mullen |
| isdparser | NOAA Integrated Surface Data. | Scott Chamberlain |
| jaod | Directory of Open Access Journals. | Scott Chamberlain |
| MODIStsp | time series of rasters from MODIS Satellite Land Products data. | Lorenzo Busetto |
| musemeta | museum metadata. Many different museums, including the MET, Getty Museum, and more | Scott Chamberlain |
| nasapower | NASA POWER (Prediction Of Worldwide Energy Resource) global meteorology and surface solar energy climatology data. | Adam H. Sparks |
| natserv | NatureServe. | Scott Chamberlain |
| neotoma | paleoecological datasets from the Neotoma Paleoecological Database. | Simon J. Goring |
| nomisr | UK official statistics from the Nomis database, including data from the Census, the Labour Force Survey, DWP benefit statistics and other economic and demographic data from the Office for National Statistics. | Evan Odell |
| onekp | Transcriptomes of over 1000 plant species. The 1000 Plants Initiative | Zebulun Arendsee |
| opencontext | Open Context data. | Ben Marwick |
| originr | Species origin data from multiple sources. Encyclopedia of Life, Flora ‘Europaea’, Global Invasive Species Database, the Native Species Resolver, Integrated Taxonomic Information Service, and Global Register of Introduced and Invasive Species | Scott Chamberlain |
| osmdata | OpenStreetMap data. | Mark Padgham |
| ots | Ocean time series datasets, including BATS, HOT, and more. | Scott Chamberlain |
| paleobioDB | PaleobioDB fossil data. | Sara Varela |
| pangaear | Pangaea Database. | Scott Chamberlain |
| phylotaR | Orthologous sequence clusters within taxonomic groups from GenBank. | Dom Bennett |
| pleiades | Pleiades data. | Scott Chamberlain |
| prism | Oregon State Prism climate data. | Alan Butler |
| qualtRics | Survey results from the Qualtrics API. | Julia Silge |
| rAvis | proyectoavis database. | Sara Varela |
| rbace | Bielefeld Academic Search Engine (BASE) of more than 150 million scholarly documents from more than 7000 sources. | Scott Chamberlain |
| rbhl | Biodiversity Heritage Library (BHL) of digitized literature on biodiversity studies. | Scott Chamberlain |
| rbison | USGS BISON database for species occurrence data from the United States. | Scott Chamberlain |
| rbraries | data from 36 different package managers for programming languages. | Scott Chamberlain |
| rcoreoa | CORE API aggregates open access research outputs from repositories and journals. | Scott Chamberlain |
| rdatacite | DataCite metadata. | Scott Chamberlain |
| rdataretriever | Data Retriever. | Henry Senyondo |
| rdefra | DEFRA’s UK-AIR website. | Claudia Vitolo |
| rdopa | DOPA (Digital Observatory for protected Areas) by the European Union Joint Research Centre. | Joona Lehtomaki |
| rdryad | Dryad Solr data underlying scientific publications. | Scott Chamberlain |
| rebird | eBird database of bird observations and locations. | Sebastian Pardo |
| rentrez | NCBI’s EUtils API for databases like GenBank and PubMed. | David Winter |
| rerddap | ERDDAP servers. | Scott Chamberlain |
| rfishbase | Fishbase data on over 30,000 species of fish, their biology, ecology, morphology and more. | Carl Boettiger |
| rfisheries | | Karthik Ram |
| rfna | Flora of North America website data. | Scott Chamberlain |
| rgbif | Global Biodiversity Information Facility (GBIF) data of species occurrence. | Scott Chamberlain |
| rglobi | Global Biotic Interactions (GloBI) data on spatial-temporal species interactions. | Jorrit Poelen |
| rgpdd | Global Population Dynamics Database. | Carl Boettiger |
| riem | Weather data from Automated Surface Observing System (ASOS) stations. Iowa Environment Mesonet website. | Maëlle Salmon |
| rif | Neuroscience Information Framework (NIF) data. | Scott Chamberlain |
| rinat | iNaturalist website of species occurrence data submitted by citizen scientists. | Stéphane Guillou |
| rnaturalearthdata | Vector map data. | Andy South |
| rnoaa | Many NOAA data sources including NCDC climate data, and data on sea ice, severe weather, historical metadata, storm and tornado data. | Scott Chamberlain |
| rnpn | National Phenology Network data on various life history events that occur at specific times. | Scott Chamberlain |
| ropenaq | air quality data from the OpenAQ platform. | Maëlle Salmon |
| rotl | Open Tree of Life data on phylogenetic trees. | Francois Michonneau |
| rperseus | Perseus Digital Library collection of classical texts. | David Ranzolin |
| rppo | Global Plant Phenology Data Portal. | John Deck |
| rredlist | IUCN Red List of threatened and endangered species. | Scott Chamberlain |
| rrricanes | Data on past and current hurricanes and tropical storms for the Atlantic and eastern Pacific oceans. | Tim Trice |
| rrricanesdata | Storm discussions, forecast/advisories, public advisories, wind speed probabilities, strike probabilities and more. National Hurricane Center | Tim Trice |
| rsnps | SNP datasets for SNPs, genotypes, and phenotypes. | Julia Gustavsen |
| rusda | United States Department of Agriculture (USDA) data from the Systematic Mycology and Microbiology Laboratory (SMML). | Franz-Sebastian Krah |
| rvertnet | archives including taxonomic names, places, and dates. | Scott Chamberlain |
| rWBclimate | Model predictions from 15 different global circulation models in 20 years. | Edmund Hart |
| skynet | air transport statistics from the Bureau of Transport Statistics (BTS) in the United States. | Filipe Teixeira |
| smapr | NASA Soil Moisture Active Passive (SMAP) data. | Maxwell Joseph |
| solrium | data from Solr. | Scott Chamberlain |
| spocc | species occurrence data sources, including Global Biodiversity Information. | Scott Chamberlain |
| suppdata | Supplementary materials from published manuscripts. | William D. Pearse |
| tidyhydat | Historical and real-time national hydrometric data from Water Survey of Canada data sources. | Sam Albers |
| tradestatistics | Access Open Trade Statistics API from R to download international trade data. | Mauricio Vargas |
| traits | Species trait data from many different sources, including sequence data from NCBI, plant trait data from BETYdb, plant data from the USDA plants database, data from EOL Traitbank, Coral traits data, Birdlife International, and more. | Scott Chamberlain |
| treebase | TreeBASE repository of phylogenetic trees (of species, population, or genes). | Carl Boettiger |
| USAboundaries | Boundaries for geographical units in the United States of America. U.S. Census Bureau, Newberry Library’s ‘Atlas of Historical County Boundaries’ | Lincoln Mullen |
| USAboundariesData | Higher resolution boundary data, for use in the USAboundaries package. U.S. Census Bureau, the Newberry Library’s ‘Historical Atlas of U.S. County Boundaries’, and Erik Steiner’s ‘United States Historical City Populations, 1790-2010’. | Lincoln Mullen |
| weathercan | Historical weather data from Environment and Climate Change Canada. | Steffi LaZerte |
| webchem | Chemical information from around the web. | Tamás Stirling |


🔗 This is where you come in!

Have you successfully used one or more of these data sources in your research? We want others to imagine what’s possible by seeing examples. Share your story in the comments and cite your paper or preprint if it’s published.

Is there a data source you want to access programmatically but there’s no R package to do that? Tell us about it in the comments.

Need help? Ask in our discussion forum and we’ll do our best to get you answers.

  1. Tierney, N. J., & Ram, K. (2020). A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility. arXiv preprint arXiv:2002.11626.

  2. Slater, L. J., Thirel, G., Harrigan, S., Delaigue, O., Hurley, A., Khouakhi, A., Prosdocimi, I., Vitolo, C., & Smith, K. (2019). Using R in hydrology: a review of recent developments and future directions. Hydrology and Earth System Sciences, 23(7), 2939-2963.

  3. Lovelace, R., Morgan, M., Talbot, J., & Lucas-Smith, M. (2020, May 11). Methods to prioritise pop-up active transport infrastructure.