Fostering the next generation of open science with R

Karthik Ram (@_inundata)

Supported by:

These data are hard to get to

Open Science

Source: PLOS, 2007

Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

Issuance of a new NSF Proposal & Award Policies and Procedures Guide (October 4th)

Why R?

The old way...

A better way

glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)

This is reproducible, repeatable, and can serve as an analytic workflow.
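As a minimal sketch of what "reproducible and repeatable" means for the call above: the data frame below is simulated and only stands in for `mydata` (the coefficients, sample size, and the helper `fit_model` are assumptions for illustration, not from the talk). Rerunning the same script on the same data gives identical estimates.

```r
# Simulated stand-in for `mydata`; column names mirror the formula on the slide.
set.seed(42)
n <- 200
mydata <- data.frame(
  a = rnorm(n),
  c = rnorm(n),
  z = rnorm(n)
)
# Made-up true effects, used only to generate a response for the example.
mydata$y <- 0.5 * mydata$a - 0.3 * mydata$c + 0.8 * mydata$z +
  0.2 * mydata$a * mydata$z + rnorm(n, sd = 0.1)

# Hypothetical helper wrapping the model call so the analysis is one function.
fit_model <- function(d) {
  glm(y ~ -1 + a + c + z + a:z, data = d, maxit = 30)
}

fit1 <- fit_model(mydata)
fit2 <- fit_model(mydata)

# The same code on the same data yields the same coefficients.
same_result <- identical(coef(fit1), coef(fit2))
```

Because the whole analysis lives in a script, anyone with the data can rerun it and check `same_result` for themselves.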

Open Science needs open source tools

Source: Revolution Analytics, 2010, Nature editorial, 2012

Open data + code

Source: Wolkovich et al. Global Change Biology, 2012.

Enable access to scientific data repositories, full-text of articles, and science metrics and also facilitate a culture shift in the scientific community.

More info @

Treebase, Fishbase, GBIF, Vertnet, Dryad, ITIS, NPN, Taxize

rAltmetric, rEML

Search full text of 100k+ open access articles - rplos

plot_throughtime(list("reproducible science"), 500)

Accessing data behind papers - dryad

# Get the URL for a data file
dryaddat <- download_url("10255/dryad.1759")

# Get a file given the URL
file <- dryad_getfile(dryaddat)

Mapping biodiversity data - rgbif

distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE, maxresults = 1000, latlongdf = TRUE)

Species distribution modeling

World Bank climate knowledge portal rWBclimate

eu_basin <- create_map_df(Eur_basin)
eu_basin_dat <- get_ensemble_temp(Eur_basin, "annualanom", 2080, 2100)

Resolve taxonomic names

splist <- c("Helanthus annuus", "Pinos contorta", "Collomia grandiflorra", "Abies magnificaa",
    "Rosa california", "Datura wrighti", "Mimulus bicolour", "Nicotiana glauca",
    "Maddia sativa", "Bartlettia scapposa")
splist_tnrs <- tnrs(query = splist, getpost = "POST", source_ = "iPlant_TNRS")

Taxize queries 11 different name resolution services

Encyclopedia of Life
Taxonomic Name Resolution Service
Integrated Taxonomic Information System
Global Names Resolver
Global Names Index
IUCN Red List
The Plant List (theplantlist.org)
Catalogue of Life
Global Invasive Species Database

Measure research impact in real time

Tracking altmetrics - rAltmetric, ALM

## Altmetrics on: "Future impact: Predicting scientific success" with altmetric_id: 942310 published in Nature.
##   provider count
## 1 Facebook     1
## 2    Feeds    10
## 3  Google+     1
## 4    Cited   179
## 5   Tweets   159
## 6 Accounts   171

Sharing unpublished data - (figshare)

Using figshare's API, it is now possible to share figures, data, and any other object generated in R directly to any figshare account.

# uses api keys to login
id <- fs_create()
fs_upload(id, r_objects)

Moving from individual data sources to data pipelines

Sharing robust data products

EML (Jones et al., 2001) is a comprehensive standard that has been adopted by a sector of the larger international ecological research community.

EML provides a common structure for these resources, to better enable ecologists to document, share, and interpret ecological data.

The EML standard enables data integration at the machine level (with little or no human intervention).

EML has four general descriptors at the top of the hierarchy. One can choose to describe a dataset, a protocol, a citation, or software.

Without metadata, a data table such as this one is useless.
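To make the point concrete, here is a small base R sketch, not rEML itself: a table whose column names say little on their own, paired with a column-level description. The table, descriptions, and units are all hypothetical and chosen only for illustration.

```r
# Hypothetical table: the column names alone do not say what was measured or in which units.
dat <- data.frame(
  year  = c(2001, 2002, 2003),
  catch = c(120.5, 98.2, 143.0)
)

# Hypothetical column-level metadata; descriptions and units are illustrative only.
col_meta <- data.frame(
  column      = c("year", "catch"),
  description = c("year of the reported catch", "total reported catch"),
  unit        = c("calendar year", "tonnes"),
  stringsAsFactors = FALSE
)

# Find any column in the data that has no metadata entry.
undocumented <- setdiff(names(dat), col_meta$column)
```

An empty `undocumented` means every column is described. A standard such as EML plays the same role, but in a form that machines can validate and exchange.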

A table with limited metadata

Valid EML

<?xml version="1.0"?>
<eml:eml xmlns:eml="eml://" xmlns:ds="eml://" xmlns:xs="" xmlns:xsi="" xmlns:stmml="" packageId="reml_3794487.58997023" system="reml">
    <title>reml example</title>
      <electronicMailAddress>[email protected]</electronicMailAddress>

Units are well defined


A live demo of rEML

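library(rfisheries)  # species_codes(), landings()
library(reshape2)    # melt()

# Look up species codes, then pull landings for four of them
species <- species_codes()
tunas <- grep("Tuna", species$english_name)
who <- c("TUX", "COD", "VET", "NPA")
by_species <- lapply(who, function(x) landings(species = x))
names(by_species) <- who
# Stack the per-species tables; the list names become the third column
dat <- melt(by_species, id = c("catch", "year"))
names(dat) <- c("catch", "year", "a3_code")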
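The reshaping step in the demo can be tried offline. The sketch below uses base R only and made-up numbers in place of the OpenFisheries results; it shows the same idea of stacking a named list of per-species tables into one long table with a species code column.

```r
# Stand-in for the per-species results from the demo (values are made up).
by_species <- list(
  TUX = data.frame(catch = c(10, 12), year = c(2000, 2001)),
  COD = data.frame(catch = c(30, 28), year = c(2000, 2001))
)

# Tag each table with its list name, then stack the tables row-wise.
tagged <- Map(function(d, code) {
  d$a3_code <- code
  d
}, by_species, names(by_species))

dat <- do.call(rbind, tagged)
rownames(dat) <- NULL
```

The result has one row per species and year, with columns `catch`, `year`, and `a3_code`, which is the shape the demo goes on to document and publish.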

Full example code

Writing valid EML and uploading to a persistent repo is simple

description <- "Landings data for several species by year, from the OpenFisheries database"

# `meta` holds the column-level metadata for `dat` and must be defined before this call
eml_write(dat = dat, meta, title = "Landings Data", description = description,
    creator = "Karthik Ram <[email protected]>", file = "landings.xml")

eml_publish("landings.xml", description = description, categories = "Ecology",
    tags = "fisheries", destination = "figshare", visibility = "public")

Species occurrence data (SPOCC)

Combine various data sources

Live Shiny app

Natively render geojson on GitHub

View on gists

Visualize data with Cartodb

Other data apps

A reproducible workflow in R

Load your own data

Load all raw, untransformed data.

Acquire additional data from the web

E.g., resolve taxonomic names or acquire additional datasets.

Document everything with metadata

The rEML package makes it easy to add valid EML to your data.

Submit to a persistent repository

Share your data by submitting it to figshare or to a repository at your institution.

Generate interactive maps and viewers.

rNeXML - extensible and verifiable comparative data

NeXML - extensible and verifiable comparative data

Upcoming projects

