rOpenSci | When Field or Lab Work is not an Option - Leveraging Open Data Resources for Remote Research

The COVID-19 pandemic has dramatically impacted all of our lives in a very short period of time. Spring and summer are usually very busy as students prepare to go to the field to engage in various data collection efforts. The pandemic has also disrupted these carefully planned activities as travel is suspended and local and remote field stations have closed indefinitely. A lost field season can be a major setback for a dissertation timeline and students will have to improvise. One promising opportunity to continue research efforts during these unprecedented times is taking advantage of the massive amounts of open scientific data that are freely available. Open data can form the basis of a review, synthesis, or new research.

Inspired by tweets from Ethan White about “PhD research from a distance”, the rOpenSci team did an in-depth exploration of how we provide access to open data. Our goal is to inspire students to find research opportunities with open data and highlight some of the rOpenSci packages that already make programmatic access possible. We also highlight some examples of how specific collections of packages are being used right now in fields as varied as archaeology and climate science.

🔗 Exploring open data

Data are fundamental to scientific discovery, and leveraging new discoveries would not be possible without access to data [1]. Although people rarely develop new research entirely on open data, these datasets provide an opportunity to reproduce and validate existing results, improve models, and generate new syntheses in combination with other data. The open science movement has been growing for over a decade, and that interest has surfaced numerous databases and repositories. The growing interest in reproducibility has also led to the creation of a plethora of open source software to access such data. rOpenSci’s core mission is to develop such tools and to date we have built over 120 robust data-access packages. These packages provide access to an impressive variety and quantity of data: eBird offers up 700 million observations, Crossref has 108 million records of scholarly works which include articles and books, Dryad makes available 13 terabytes of data associated with published papers, and GBIF has over 1.3 billion records of species worldwide.

We hope that this post and these tools provide inspiration for you to explore new data sources and research topics.

🔗 Data sources for your research

Many of rOpenSci’s tools are developed by practicing scientists and have strong communities behind them. We invited university faculty from our community of developer-researchers to highlight sources of open data for research in their fields.

🔗 Climate and weather

Brooke Anderson, Colorado State University

Research on weather and climate, and their impacts on humans and the environment, can draw on numerous excellent open data sources, including many made available through programmatic access to data collected and shared by institutions and monitoring networks. The US Geological Survey offers a particularly exciting example, providing not only APIs for accessing their data, but also a full suite of R packages developed and shared through the USGS-R community. rOpenSci’s own rnoaa package provides access to data through a number of the US National Oceanic and Atmospheric Administration’s open data APIs, allowing for fast and convenient access from R to national or worldwide data on, among others, meteorological observations, sea ice, and tides and currents, while its bomrang package offers similar access to data from the Australian Government Bureau of Meteorology. Other rOpenSci packages provide access to weather- and climate-related data from the Iowa Environment Mesonet (riem), New Zealand’s National Climate Database (clifro), the US National Aeronautics and Space Administration’s Prediction of Worldwide Energy Resource (POWER) dataset (nasapower), the US National Centers for Environmental Information’s Global Surface Summary of the Day (GSOD) dataset (GSODR), the US National Hurricane Center (rrricanes), the Flanders Environment Agency and Flanders Hydraulics Research’s waterinfo.be dataset (wateRinfo), and Environment and Climate Change Canada (ECCC) (weathercan). bowerbird is a general-purpose package for maintaining local copies of a range of satellite- and model-derived environmental and climate data.

🔗 Water

Louise Slater, University of Oxford; Sam Zipper, University of Kansas; Ilaria Prosdocimi, Ca’ Foscari University; Sam Albers, Government of British Columbia; and Claudia Vitolo, European Centre for Medium-Range Weather Forecasts

In hydrology, there has been rapid growth in the number of streamflow data archives made publicly available online by countries such as the UK (rnrfa package), USA (dataRetrieval package), Greece (rOpenSci’s hydroscoper package), and Canada (rOpenSci’s tidyhydat package), although most countries sadly do not yet apply an open policy to their hydrological data. The Task View on Hydrological Data and Modelling and the accompanying blog post Getting your toes wet in R: Hydrology, meteorology, and more provide an exciting overview of the most up-to-date R packages that are available for downloading, analysing, and modelling these data. For an overview of the many advantages of using R for hydrological research, see the paper “Using R in Hydrology” [2], which describes approaches to retrieve, analyse, map, model, and visualise hydrological data.

🔗 Antarctic and Southern Ocean

Ben Raymond, Australian Antarctic Division, and Anton Van de Putte, Royal Belgian Institute of Natural Sciences

Antarctic science has a strong culture of open data - the Antarctic Treaty itself states that scientific observations and results from Antarctica should be openly shared, and the Scientific Committee on Antarctic Research has had an active data management group since the late 1980s. To find Antarctic and Southern Ocean data, search the Antarctic Master Directory (a metadata catalogue) or portals such as the Antarctic Biodiversity Portal or the Southern Ocean Observing System.

The Antarctic rOpenSci community is developing R resources to support Antarctic and Southern Ocean science, with a particular emphasis on simplifying data access and performing common analytical tasks. See this blog post and task view for an overview of some of the packages in development, and the types of analyses that we are aiming to support.

🔗 Archaeology

Ben Marwick, University of Washington

Research shuddered to a stop in the Geoarchaeology Lab in early March, with UW being one of the first US campuses to switch to remote work. No longer able to go to campus, we turned our attention to computational text analysis of a large corpus of archaeological conference abstracts to look at questions about gender imbalance and theory change in our field. Our quick pivot to this new area was only possible thanks to high-quality and well-documented software such as rOpenSci’s tesseract, pdftools and magick packages. These enabled us to generate data rapidly, giving us more time for exploring and testing hypotheses, and ensuring our students could get to the end of the term ready to share some really interesting results.

We’ve been keeping up with the literature through in-depth study of new journal articles, especially those that include open data. Archaeologists share data through specialised repositories such as the Digital Archaeological Record (tDAR) and Open Context, as well as several generic repositories (e.g. Zenodo, Figshare, Dataverse - each of these has an R package for accessing data). There are R packages for accessing data hosted by those archaeology repositories (tdar, opencontext), but many of our favourite recent articles (we keep a list here) had their data openly archived on the Open Science Framework data repository. While studying these articles we have enjoyed using rOpenSci’s osfr package to quickly and reproducibly access these materials for in-depth exploration. A favourite type of data for many archaeologists is radiocarbon ages, and our group has been working with these with ease thanks to the c14bazAAR package. We’ve been using this package to get data to study radiocarbon dates from hundreds of archaeological sites in Australia. While we’re missing the lab, rOpenSci’s packages for acquiring archaeological data have been invaluable tools for keeping us active and engaged in our research.

Our task view for archaeological science shows the full range of tools we use, from data acquisition through environmental and geological analysis to writing reproducible manuscripts.

🔗 Transport

Robin Lovelace, University of Leeds

There has never been a better time for data-driven and reproducible transport research. The COVID-19 pandemic has disrupted transport patterns worldwide. This has led to changes, such as the construction of ‘pop-up’ active transport infrastructure, the prioritisation of which can be supported by reproducible and open data analysis, as outlined in a preprint on the topic (the analysis for which was undertaken in R) [3]. There is a wealth of data out there that can be found with careful search queries, along with many new datasets (like Uber’s micromobility datasets, released on May 6th of this year).

  • For downloading data representing transport networks, I recommend heading to the overpass website and for R users checking out osmdata and the in-development geofabric (to be renamed) R packages.

  • For open origin-destination data there are many resources, but the PCT package provides a way to access national-scale datasets quickly from the R command line, as outlined in stplanr’s Origin-destination vignette.

  • For road safety data there is a lack of open data in many countries, but you can access national road casualty data, with 60+ variables and 100,000+ records each year, with the stats19 package.

  • For links to additional resources I recommend Chapter 12 of Geocomputation with R and Chapter 11 of QGIS for transport researchers.

  • For inspiration, I recommend checking out the Propensity to Cycle Tool, an interactive free and open web app that is being used to inform active transport investment plans in dozens of cities across the UK (it also has many data download options at zone, route and route network levels).
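As a rough illustration of the first bullet, the sketch below uses osmdata to request cycleways for a named place from the Overpass API. It assumes the osmdata and sf packages are installed and that a network connection is available; the function name `fetch_cycleways`, the place name, and the choice of the `highway=cycleway` tag are illustrative assumptions, not part of the original post.

```r
# Sketch only: needs the osmdata and sf packages plus internet access.
# `fetch_cycleways` is a hypothetical helper name chosen for this example.
fetch_cycleways <- function(place) {
  # Validate input before touching the network
  if (!is.character(place) || length(place) != 1) {
    stop("`place` must be a single character string, e.g. 'Leeds, UK'")
  }
  if (!requireNamespace("osmdata", quietly = TRUE)) {
    stop("Please install the osmdata package first")
  }
  # Build an Overpass query for the bounding box of the named place
  query <- osmdata::opq(bbox = place)
  # Restrict the query to ways tagged highway=cycleway
  query <- osmdata::add_osm_feature(query, key = "highway", value = "cycleway")
  # Run the query and return the result as sf objects
  osmdata::osmdata_sf(query)
}

# Example usage (uncomment to run; needs network access):
# cycleways <- fetch_cycleways("Leeds, UK")
# nrow(cycleways$osm_lines)  # number of cycleway line features returned
```

Other OpenStreetMap tags can be swapped in for the `key` and `value` arguments to pull different parts of the transport network.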

🔗 Taxonomy, biodiversity, ecology

rOpenSci has its roots in software for biodiversity research, with many packages in the areas of taxonomy, biological occurrences, and natural history/traits.

  • taxonomy: A good place to start is the taxonomy task view, covering many options for working with online taxonomy data

  • occurrences: Occurrence data forms the basis of much ecological research. The largest source of occurrence data, GBIF, can be accessed with the rgbif package. Many more are listed in the README for the package spocc.

  • natural history/traits: Conservation researchers may want to fetch data from the IUCN Red List via rredlist, Fishbase life history data from rfishbase, bird data from auk or rebird, or trait data from various marine taxa in WoRMS (called “attributes” by WoRMS; worrms).

A good general resource for rOpenSci packages on biodiversity is the rOpenSci Community Call from March 2019: Research Applications of rOpenSci Taxonomy and Biodiversity Tools.
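To give a flavour of what working with occurrence data looks like, here is a minimal sketch using rgbif. It assumes rgbif is installed and a network connection is available. The helper names `fetch_occurrences` and `count_by_country`, the record limit, and the example species are illustrative assumptions; the summary step also assumes the returned records include a `country` column.

```r
# Sketch only: needs the rgbif package plus internet access.
# Both helper names below are hypothetical, chosen for this example.

# Fetch a small sample of georeferenced GBIF occurrence records for a species.
fetch_occurrences <- function(species, limit = 50) {
  if (!requireNamespace("rgbif", quietly = TRUE)) {
    stop("Please install the rgbif package first")
  }
  res <- rgbif::occ_search(
    scientificName = species,
    hasCoordinate = TRUE,  # keep only records with latitude/longitude
    limit = limit
  )
  res$data  # the occurrence records as a data frame
}

# Count records per country, most frequent first.
# Works on any data frame with a `country` column (an assumption here).
count_by_country <- function(occ) {
  sort(table(occ$country), decreasing = TRUE)
}

# Example usage (uncomment to run; needs network access):
# occ <- fetch_occurrences("Colibri cyanotus")
# count_by_country(occ)
```

Keeping the download and the summary in separate functions means the summary can be tried on any small data frame before querying GBIF.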

 

Browse our table of > 100 data-access packages (under the bird) or jump ahead to see where you come in.

Lesser Violetear (Colibri cyanotus). Carlos Sanchez, Macaulay Library | eBird.

 

🔗 rOpenSci data-access packages

The table below shows a subset of our full suite of R packages. You can find scientific use cases for a package on our main page by clicking on a package name.

| R package | Data and source | Maintainer |
|---|---|---|
| antanym | Antarctic geographic names. Composite Gazetteer of Antarctica | Ben Raymond |
| AntWeb | Ant data. AntWeb database from the California Academy of Sciences | Karthik Ram |
| auk | Bird sighting records. http://ebird.org | Matthew Strimas-Mackey |
| bikedata | Historic ride data from public hire bicycle systems. London, U.K.; from the U.S.A., San Francisco CA, New York City NY, Chicago IL, Washington DC, Boston MA, Los Angeles CA, Philadelphia PA, and Minnesota; Montreal, Canada; and Guadalajara, Mexico. | Mark Padgham |
| biomartr | Genomic data retrieval. ‘NCBI RefSeq’, ‘NCBI Genbank’, ‘ENSEMBL’, and ‘UniProt’ databases, plus interface to ‘BioMart’ database | Hajk-Georg Drost |
| bittrex | Bittrex crypto-currency exchange. https://bittrex.com | Michael Kane |
| bold | Bold Systems for genetic barcode data. http://www.boldsystems.org | Scott Chamberlain |
| brranching | Phylogenetic data. ‘Phylomatic’ http://phylodiversity.net/phylomatic, and ‘Phylocom’ https://github.com/phylocom/phylocom | Scott Chamberlain |
| camsRad | Time series of global, direct, and diffuse irradiations on horizontal surface. Copernicus Atmosphere Monitoring Service (CAMS) | Lukas Lundstrom |
| ccafs | Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models. | Scott Chamberlain |
| chromer | Chromosome Counts Database. http://ccdb.tau.ac.il | Paula Andrea Martinez |
| clifro | New Zealand National Climate Database. https://cliflo.niwa.co.nz | Blake Seers |
| comtradr | United Nations Comtrade data. https://comtrade.un.org/data | Chris Muir |
| cRegulome | Transcription factor/microRNA-gene correlations (co-expression) in cancer. Cistrome Cancer Liu et al. (2011) doi:10.1186/gb-2011-12-8-r83 and ‘miRCancerdb’ databases (in press). | Mahmoud Ahmed |
| dbhydroR | South Florida Water Management District’s ‘DBHYDRO’ database. https://www.sfwmd.gov/science-data/dbhydro | Joseph Stachelek |
| DoOR.data | Drosophila odorant response data for DoOR.functions. | Daniel Münch |
| ecoengine | Georeferenced specimen records from the University of California, Berkeley’s Natural History Museums. https://ecoengine.berkeley.edu | Karthik Ram |
| epubr | Reading and parsing of internal e-book content from EPUB files. EPUB e-books. | Matthew Leonawicz |
| essurvey | European Social Survey data. http://www.europeansocialsurvey.org | Jorge Cimentada |
| FedData | Geospatial data from several federated data sources (mainly sources maintained by the US federal government). National Elevation Dataset, National Hydrography Dataset (USGS), the Soil Survey Geographic (SSURGO) database, the Global Historical Climatology Network (GHCN), the Daymet gridded estimates of daily weather parameters, the International Tree Ring Data Bank, and the National Land Cover Database (NLCD). | R. Kyle Bocinsky |
| fingertipsR | Data for many indicators of public health in England. http://fingertips.phe.org.uk | Sebastian Fox |
| genderdata | Historical datasets of first names and dates of birth. | Lincoln Mullen |
| getCRUCLdata | University of East Anglia Climate Research Unit gridded climatology of monthly means. https://crudata.uea.ac.uk/cru/data/hrg/tmc/readme.txt | Adam Sparks |
| getlandsat | Landsat 8 Data. https://registry.opendata.aws/landsat-8 | Scott Chamberlain |
| GSODR | Global Surface Summary of the Day (GSOD) weather data from USA National Centers for Environmental Information (NCEI). http://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt | Adam Sparks |
| gtfsr | Public GTFS feeds. | Danton Noriega-Goodwin |
| gutenbergr | Project Gutenberg collection. http://www.gutenberg.org | David Robinson |
| hathi | HathiTrust bibliographic API. https://www.hathitrust.org | Scott Chamberlain |
| hddtools | Hydrological data. Various data providers | Claudia Vitolo |
| helminthR | London Natural History Museum’s host-parasite database. http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites | Tad Dallas |
| historydata | Sample data sets for historians on population, institutional, religious, military, and prosopographical data. | Lincoln Mullen |
| hydroscoper | Greek National Data Bank for Hydrological and Meteorological Information. http://www.hydroscope.gr | Konstantinos Vantas |
| internetarchive | Internet Archive. https://archive.org/ | Lincoln Mullen |
| isdparser | NOAA Integrated Surface Data. https://www.ncdc.noaa.gov/isd | Scott Chamberlain |
| jaod | Directory of Open Access Journals. https://doaj.org | Scott Chamberlain |
| MODIStsp | Time series of rasters from MODIS Satellite Land Products data. | Lorenzo Busetto |
| musemeta | Museum metadata. Many different museums, including the MET, Getty Museum, and more | Scott Chamberlain |
| nasapower | NASA POWER (Prediction Of Worldwide Energy Resource) global meteorology and surface solar energy climatology data. https://power.larc.nasa.gov | Adam H. Sparks |
| natserv | NatureServe. https://www.natureserve.org | Scott Chamberlain |
| neotoma | Paleoecological datasets from the Neotoma Paleoecological Database. http://api.neotomadb.org | Simon J. Goring |
| nomisr | UK official statistics from the Nomis database, including data from the Census, the Labour Force Survey, DWP benefit statistics and other economic and demographic data from the Office for National Statistics. https://www.nomisweb.co.uk/api/v01/help | Evan Odell |
| onekp | Transcriptomes of over 1000 plant species. The 1000 Plants Initiative (www.onekp.com) | Zebulun Arendsee |
| opencontext | Open Context data. https://opencontext.org | Ben Marwick |
| originr | Species origin data from multiple sources. Encyclopedia of Life (http://eol.org), Flora ‘Europaea’ (http://rbg-web2.rbge.org.uk/FE/fe.html), Global Invasive Species Database (http://www.iucngisd.org/gisd), the Native Species Resolver (http://bien.nceas.ucsb.edu/bien/tools/nsr/), Integrated Taxonomic Information Service (http://www.itis.gov/), and Global Register of Introduced and Invasive Species (http://www.griis.org/). | Scott Chamberlain |
| osmdata | OpenStreetMap data. https://openstreetmap.org | Mark Padgham |
| ots | Ocean time series datasets, including BATS, HOT, and more. | Scott Chamberlain |
| paleobioDB | PaleobioDB fossil data. http://paleobiodb.org/data1.1 | Sara Varela |
| pangaear | Pangaea Database. https://www.pangaea.de | Scott Chamberlain |
| phylotaR | Orthologous sequence clusters within taxonomic groups from GenBank. https://www.ncbi.nlm.nih.gov/genbank | Dom Bennett |
| pleiades | Pleiades data. https://pleiades.stoa.org | Scott Chamberlain |
| prism | Oregon State Prism climate data. http://www.prism.oregonstate.edu/ | Alan Butler |
| qualtRics | Survey results from the Qualtrics API. https://www.qualtrics.com/about | Julia Silge |
| rAvis | proyectoavis database. http://proyectoavis.com | Sara Varela |
| rbace | Bielefeld Academic Search Engine (BASE) of more than 150 million scholarly documents from more than 7000 sources. https://www.base-search.net | Scott Chamberlain |
| rbhl | Biodiversity Heritage Library (BHL) of digitized literature on biodiversity studies. https://www.biodiversitylibrary.org | Scott Chamberlain |
| rbison | USGS BISON database for species occurrence data from the United States. https://bison.usgs.gov | Scott Chamberlain |
| rbraries | Libraries.io data from 36 different package managers for programming languages. https://libraries.io/api | Scott Chamberlain |
| rcoreoa | CORE API aggregates open access research outputs from repositories and journals. https://core.ac.uk/docs | Scott Chamberlain |
| rdatacite | DataCite metadata. https://www.datacite.org | Scott Chamberlain |
| rdataretriever | Data Retriever. http://data-retriever.org | Henry Senyondo |
| rdefra | DEFRA’s UK-AIR website. https://uk-air.defra.gov.uk | Claudia Vitolo |
| rdopa | DOPA (Digital Observatory for Protected Areas) by the European Union Joint Research Centre. | Joona Lehtomaki |
| rdryad | Dryad ‘Solr’ data underlying scientific publications. https://datadryad.org | Scott Chamberlain |
| rebird | eBird database of bird observations and locations. https://ebird.org/home | Sebastian Pardo |
| rentrez | NCBI’s ‘EUtils’ API for databases like ‘GenBank’ and ‘PubMed’. https://www.ncbi.nlm.nih.gov/genbank https://www.ncbi.nlm.nih.gov/pubmed | David Winter |
| rerddap | ERDDAP servers. https://upwell.pfeg.noaa.gov/erddap/information.html | Scott Chamberlain |
| rfishbase | Fishbase data on over 30,000 species of fish, their biology, ecology, morphology and more. http://www.fishbase.org http://www.sealifebase.org | Carl Boettiger |
| rfisheries | openfisheries.org. http://www.openfisheries.org/ | Karthik Ram |
| rfna | Flora of North America website data. http://www.efloras.org | Scott Chamberlain |
| rgbif | Global Biodiversity Information Facility (GBIF) data of species occurrence. https://www.gbif.org/developer/summary | Scott Chamberlain |
| rglobi | Global Biotic Interactions (GloBI) data on spatial-temporal species interactions. https://www.globalbioticinteractions.org/ | Jorrit Poelen |
| rgpdd | Global Population Dynamics Database. https://ecologicaldata.org/wiki/global-population-dynamics-database | Carl Boettiger |
| riem | Weather data from Automated Surface Observing System (ASOS) stations. Iowa Environment Mesonet website. | Maëlle Salmon |
| rif | Neuroscience Information Framework (NIF) data. https://neuinfo.org | Scott Chamberlain |
| rinat | iNaturalist website of species occurrence data submitted by citizen scientists. http://inaturalist.org | Stéphane Guillou |
| rnaturalearthdata | Vector map data. http://www.naturalearthdata.com | Andy South |
| rnoaa | Many NOAA data sources including NCDC climate data, and data on sea ice, severe weather, historical metadata, storm and tornado data. https://www.ncdc.noaa.gov/cdo-web/webservices/v2 | Scott Chamberlain |
| rnpn | National Phenology Network data on various life history events that occur at specific times. https://usanpn.org | Scott Chamberlain |
| ropenaq | Air quality data from the OpenAQ platform. https://docs.openaq.org | Maëlle Salmon |
| rotl | Open Tree of Life data on phylogenetic trees. https://tree.opentreeoflife.org/ | Francois Michonneau |
| rperseus | Perseus Digital Library collection of classical texts. http://cts.perseids.org | David Ranzolin |
| rppo | Global Plant Phenology Data Portal. https://www.plantphenology.org | John Deck |
| rredlist | IUCN Red List of threatened and endangered species. http://apiv3.iucnredlist.org/api/v3/docs | Scott Chamberlain |
| rrricanes | Data on past and current hurricanes and tropical storms for the Atlantic and eastern Pacific oceans. https://www.nhc.noaa.gov/archive/1998/1998archive.shtml | Tim Trice |
| rrricanesdata | Storm discussions, forecast/advisories, public advisories, wind speed probabilities, strike probabilities and more. National Hurricane Center | Tim Trice |
| rsnps | SNP datasets for SNPs, genotypes, and phenotypes. https://opensnp.org https://www.ncbi.nlm.nih.gov/projects/SNP | Julia Gustavsen |
| rusda | United States Department of Agriculture (USDA) data from the Systematic Mycology and Microbiology Laboratory (SMML). | Franz-Sebastian Krah |
| rvertnet | VertNet.org archives including taxonomic names, places, and dates. http://vertnet.org | Scott Chamberlain |
| rWBclimate | Model predictions from 15 different global circulation models in 20 years. | Edmund Hart |
| skynet | Air transport statistics from the Bureau of Transport Statistics (BTS) in the United States. https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0 | Filipe Teixeira |
| smapr | NASA Soil Moisture Active Passive (SMAP) data. https://smap.jpl.nasa.gov/ | Maxwell Joseph |
| solrium | Data from Solr. https://lucene.apache.org/solr | Scott Chamberlain |
| spocc | Species occurrence data sources, including Global Biodiversity Information. | Scott Chamberlain |
| suppdata | Supplementary materials from published manuscripts. | William D. Pearse |
| tidyhydat | Historical and real-time national hydrometric data from Water Survey of Canada data sources. http://dd.weather.gc.ca/hydrometric/csv http://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www | Sam Albers |
| tradestatistics | Access Open Trade Statistics API from R to download international trade data. | Mauricio Vargas |
| traits | Species trait data from many different sources, including sequence data from NCBI, plant trait data from BETYdb, plant data from the USDA plants database, data from EOL Traitbank, Coral traits data, Birdlife International, and more. | Scott Chamberlain |
| treebase | TreeBASE repository of phylogenetic trees (of species, population, or genes). http://treebase.org | Carl Boettiger |
| USAboundaries | Boundaries for geographical units in the United States of America. U.S. Census Bureau, Newberry Library’s ‘Atlas of Historical County Boundaries’ | Lincoln Mullen |
| USAboundariesData | Higher resolution boundary data, for use in the USAboundaries package. U.S. Census Bureau, the Newberry Library’s ‘Historical Atlas of U.S. County Boundaries’, and Erik Steiner’s ‘United States Historical City Populations, 1790-2010’. | Lincoln Mullen |
| weathercan | Historical weather data from Environment and Climate Change Canada. http://climate.weather.gc.ca/historical_data/search_historic_data_e.html | Steffi LaZerte |
| webchem | Chemical information from around the web. | Tamás Stirling |

 

🔗 This is where you come in!

Have you successfully used one or more of these data sources in your research? We want others to imagine what’s possible by seeing examples. Share your story in the comments and cite your paper or preprint if it’s published.

Is there a data source you want to access programmatically but there’s no R package to do that? Tell us about it in the comments.

Need help? Ask in our discussion forum and we’ll do our best to get you answers.


  1. Tierney, N. J., & Ram, K. (2020). A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility. arXiv preprint arXiv:2002.11626. https://arxiv.org/abs/2002.11626

  2. Slater, L. J., Thirel, G., Harrigan, S., Delaigue, O., Hurley, A., Khouakhi, A., Prosdocimi, I., Vitolo, C., & Smith, K. (2019). Using R in hydrology: a review of recent developments and future directions. Hydrology and Earth System Sciences, 23(7), 2939-2963. https://www.hydrol-earth-syst-sci.net/23/2939/2019/

  3. Lovelace, R., Morgan, M., Talbot, J., & Lucas-Smith, M. (2020, May 11). Methods to prioritise pop-up active transport infrastructure. https://doi.org/10.31219/osf.io/7wjb6