Open tools for Open Science



Karthik Ram
NCEAS, June 24th


twitter @_inundata  



# Some packages to install
install.packages("devtools")
library(devtools)  # provides install_github()
install_github("ropensci/rgbif")
install_github("ropensci/rfisheries")
install_github("ropensci/rfigshare")
install.packages("treebase")
install.packages("knitr")
install.packages("formatR")






Science

Data Life Cycle

source: Michener, 2006 Ecoinformatics.


Open data + code

Source: Wolkovich et al. Global Change Biology, 2012.



Open Science




Source: PLOS, 2007



Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

Issuance of a new NSF Proposal & Award Policies and Procedures Guide (October 4th)




R packages are increasingly showing up in domain journals.



Source: Molecular Ecology, 2012

Why R?

The old way...

Why R?

A better way



glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)


This is reproducible, repeatable, and can serve as an analytic workflow.
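A minimal sketch of what makes the call above reproducible: the data, the seed, and the model all live in one script. The data frame here is simulated as a stand-in for `mydata`; the variable names follow the formula on the slide, but the values and effect sizes are assumptions made up for illustration.

```r
# Simulated stand-in for `mydata` (values are illustrative only).
set.seed(42)  # fixing the seed makes the simulated data repeatable
n <- 200
mydata <- data.frame(a = rnorm(n), c = rnorm(n), z = rnorm(n))
mydata$y <- 0.5 * mydata$a - 0.3 * mydata$c + 0.2 * mydata$z +
  0.4 * mydata$a * mydata$z + rnorm(n)

# Same model as on the slide: no intercept, three main effects,
# and an a-by-z interaction.
fit <- glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)
coef(fit)
```

Rerunning the script from top to bottom gives the same coefficients each time, which is the property a point-and-click analysis cannot easily offer.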



Open Science needs open source tools



Source: Revolution Analytics, 2010, Nature editorial, 2012





Founded by 3 EEB postdocs (Carl Boettiger, Scott Chamberlain and me).


Enable access to scientific data repositories, full-text of articles, and science metrics and also facilitate a culture shift in the scientific community.



way more info @ ropensci.org/packages

Data
Treebase, Fishbase, Flybase
GBIF, Vertnet
Dryad, ITIS
NPN, Taxize
opensnp

Journals
PLOS
Springer
textmine
pensoft

Hybrid
figshare
Mendeley
DataONE
rImpactStory
rAltmetric


R and APIs

API keys can be stored in a user's .Rprofile

 
	options(MendeleyKey = "uf5daib7wyil7ag5buc")
	options(MendeleyPrivateKey = "faj2os5dyd7jop2fok6")
	options(PlosApiKey = "ef3vip9yak7od3hud4g")
	options(SpringerMetdataKey = "ri9hi7woc6jax4vaf8w")
	





Note: These keys aren't real.
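Once a key is set with `options()`, any function can read it back with `getOption()`. The sketch below shows one way that lookup might work; `get_api_key` is a hypothetical helper written for this example, not a function from any rOpenSci package.

```r
# Normally this line lives in ~/.Rprofile; it is set here so the sketch
# runs on its own. The key is the placeholder from the slide, not a real one.
options(PlosApiKey = "ef3vip9yak7od3hud4g")

# Hypothetical helper: look up a key by option name and fail clearly
# if it has not been set.
get_api_key <- function(name) {
  key <- getOption(name)
  if (is.null(key) || !nzchar(key)) {
    stop("No API key found in options(): ", name, call. = FALSE)
  }
  key
}

get_api_key("PlosApiKey")
```

Keeping keys in .Rprofile rather than in the analysis script means the script can be shared publicly without exposing credentials.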

Mine open access journals - rplos


library(rplos)
plot_throughtime(list("reproducible science"), 500)

Visualize biodiversity data - rgbif

distribution <- occurrencelist(scientificname = "Danaus plexippus", coordinatestatus = TRUE, maxresults = 1000, latlongdf = TRUE)

Also see CartoDB's powerful mapping capabilities and R package.

dat <- occurrencelist(scientificname = "Accipiter erythronemius", coordinatestatus = TRUE, 
    maxresults = 100)
gbifdata(dat)
gbifmap(dat)


Validate results from papers - treebase

library(treebase)
tree <- search_treebase("Derryberry", "author")[[1]]
## http://purl.org/phylo/treebase/phylows/study/find?query=dcterms.contributor=Derryberry&format=rss1&recordSchema=tree
## Query resolved, looking at each matching resource...
## 4 resources found matching query
## Attempting try 1
## Looking for nexus files...
## Tree read in successfully
plot(tree)
Derryberry et al. recently published a study in Evolution on diversification in ovenbirds and woodcreepers.

Measure research impact in real time

ImpactStory.org


Tracking altmetrics - rImpactStory, rAltmetric

library(rAltmetric)
altmetrics("doi/10.1038/489201a")
## Altmetrics on: "Future impact: Predicting scientific success" with altmetric_id: 942310 published in Nature.
##   provider count
## 1 Facebook     1
## 2    Feeds    10
## 3  Google+     1
## 4    Cited   179
## 5   Tweets   159
## 6 Accounts   171


Tracking altmetrics - ImpactStory and Altmetric

doi_data <- read.csv("dois.csv", header = TRUE)
doi_data
##                          doi
## 1        10.1038/nature09210
## 2    10.1126/science.1187820
## 3 10.1016/j.tree.2011.01.009
## 4             10.1086/664183


Tracking altmetrics - ImpactStory and Altmetric

library(plyr)
library(rAltmetric)
# First, let's retrieve the metrics.
raw_metrics <- llply(doi_data$doi, altmetrics, .progress = "text")
# Now let's pull the data together.
metric_data <- ldply(raw_metrics, altmetric_data)
# Finally we save this to a spreadsheet for further
# analysis/visualization.
write.csv(metric_data, file = "metric_data.csv")
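When looping over many DOIs, a single failed lookup can abort the whole run. The sketch below shows one possible way to guard against that in base R. `fetch_metrics` is a hypothetical stand-in for a network call such as `altmetrics()`; it returns placeholder values and is made to fail on one DOI purely for illustration.

```r
# Hypothetical stand-in for a network call; the returned score is a
# placeholder and the failure on the last DOI is simulated.
fetch_metrics <- function(doi) {
  if (doi == "10.1086/664183") stop("no metrics found for ", doi)
  data.frame(doi = doi, score = 0, stringsAsFactors = FALSE)
}

dois <- c("10.1038/nature09210", "10.1126/science.1187820",
          "10.1016/j.tree.2011.01.009", "10.1086/664183")

# Wrap each call so that a failure yields NULL instead of stopping the loop.
raw <- lapply(dois, function(d) {
  tryCatch(fetch_metrics(d), error = function(e) NULL)
})

# Keep the successful results, stack them into one data frame,
# and record which DOIs failed.
ok <- !vapply(raw, is.null, logical(1))
metric_data <- do.call(rbind, raw[ok])
failed <- dois[!ok]
```

The `failed` vector can be retried later or reported alongside the saved spreadsheet.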


Accessing data behind papers - rdryad

# Get URL for a specific dataset
dryaddat <- download_url("10255/dryad.1759")

# Download the file from the Dryad servers
file <- dryad_getfile(dryaddat)

# Just first four columns
head(file[, 1:4])
  year nest.identity season clutch.size
1 2001             1      0           6
2 2001             1      0           6
3 2001             1      0           6
4 2001             1      0           6
5 2001             1      0           6
6 2001             1      0           6


Sharing unpublished data - figshare

Using figshare's new API it is now possible to share figures, data and any other object generated in R directly to one's figshare account.


> fs_upload(data)


USGS species occurrence data - (rbison)

library(rbison)
results <- bison(species = "Bison bison", type = "scientific_name", start = 0, 
    count = 10)
bison_data(results)
##   total observation fossil specimen unknown
## 1   761          30      4      709      18


USGS species occurrence data - (rbison)

head(bison_data(input = results, datatype = "counties"))
##   record_id total county_name      state
## 1     48295     7    Lipscomb      Texas
## 2     41025    15      Harney     Oregon
## 3     49017     8    Garfield       Utah
## 4     35031     2    McKinley New Mexico
## 5     56013     1     Fremont    Wyoming
## 6     40045     2       Ellis   Oklahoma


USGS species occurrence data - (rbison)

bisonmap(results, tomap = "county")


Paleoecological data - (neotoma)

# North America bounding box: LonW -167.3, LatS 5.4, LonE -52.2, LatN 83.2
all.NA <- get_datasets(loc = c(-167, 5.4, -52, 83))
# Now limit it to sites in North America with Spruce pollen:
all.NA.pic <- get_datasets(loc = c(-167, 5.4, -52, 83), taxonname = "Picea*")


Fisheries landing data - (rfisheries)

library(plyr)
library(rfisheries)
countries <- country_codes()
# let's take a small subset, say 5 random countries
c_list <- countries[sample(nrow(countries), 5), ]$iso3c
# and grab landings data for these countries
results <- ldply(c_list, function(x) {
    df <- landings(country = x)
    df$country <- x
    df
}, .progress = "text")


Fisheries landing data - (rfisheries)



Mashup data from multiple repositories (Shiny example)




Executable papers and pre-print servers

Rapid peer-to-peer sharing of code is great for science



Free repositories such as GitHub make it possible to share data, code, manuscripts in progress and easily collaborate with anyone while providing powerful revision control.

Manuscripts can be forked and merged by any collaborator.




R + collaborative writing


knitr + Markdown



Xie Y (2012). knitr: A general-purpose package for dynamic report generation in R.
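A minimal sketch of the knitr + Markdown idea, assuming the knitr package is installed: prose and R code sit in one source document, and `knit()` runs the code and weaves its output into plain Markdown. The document content below is made up for illustration.

```r
library(knitr)

# A tiny R Markdown document held in a character vector
# (illustrative content only).
rmd <- c(
  "# Clutch size summary",
  "",
  "```{r}",
  "clutch <- c(6, 6, 5, 7, 6)",
  "mean(clutch)",
  "```"
)

# knit() executes the chunk and returns Markdown containing both the
# code and its printed result.
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

In practice the source would be an .Rmd file on disk passed to `knit()` by path, so the manuscript and its analysis are versioned together.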

knitr + Markdown + GitHub

GitHub automatically renders Markdown and even provides syntax highlighting




knitr + Markdown + GitHub = executable paper



ropensci.org/workshops/
NCEAS

ropensci
@ropensci
