rOpenSci | targets: Democratizing Reproducible Analysis Pipelines

Make[1]-like pipelines enhance the integrity, transparency, shelf life, efficiency, and scale of large analysis projects. With pipelines, data science feels smoother and more rewarding, and the results are worthy of more trust.

…looking to get your project/s organised in the new year? hoping just to distract from feelings of impending doom/crushing loss of hope? I promise workflowing will make you feel better… and @wmlandau has made it SO EASY.

{targets} and its predecessors are visionary work. I can’t imagine making pipelines in a linear script ever again.

targets

install.packages("targets")

The targets[2] package is a new pipeline toolkit for R. It recently cleared software review, and it is now on CRAN. targets is the long-term successor of drake[3], which in turn succeeded Rich FitzJohn's groundbreaking remake[4] package. A chapter in the user manual explains the future of drake, the advantages of targets, and how to transition. The reference website explains how to get started, and the overview vignette describes the major features of targets and its user manual.

🔗 How it works

In targets, a data analysis pipeline is a collection of target objects that express the individual steps of the workflow, from upstream data processing to downstream R Markdown reports[5]. These targets live in a special script called _targets.R.

# _targets.R file
library(targets)
# tidyr supplies replace_na(), which read_clean() calls below.
tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))

# Most workflows have custom functions to support the targets.
read_clean <- function(path) {
  path %>%
    read_csv(col_types = cols()) %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

fit_model <- function(data) {
  biglm(Ozone ~ Wind + Temp, data)
}

create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}

# List of targets.
list(
  # airquality dataset in base R:
  tar_target(raw_data_file, "raw_data.csv", format = "file"),
  tar_target(data, read_clean(raw_data_file)),
  tar_target(fit, fit_model(data)),
  tar_target(hist, create_plot(data))
)

targets inspects your code and constructs a dependency graph.

# R console
tar_visnetwork()
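
To give a feel for what "inspects your code" can mean, here is a minimal base R sketch of static dependency detection. It is a toy illustration under my own assumptions, not the internals of targets: the helper find_upstream() is hypothetical, and it only looks for target names inside each command, whereas the real package also tracks the functions those commands call.

```r
# Toy sketch of static dependency detection (not the targets internals).
# Each command is an unevaluated R expression, as in tar_target().
commands <- list(
  raw_data_file = quote("raw_data.csv"),
  data = quote(read_clean(raw_data_file)),
  fit = quote(fit_model(data)),
  hist = quote(create_plot(data))
)

# Hypothetical helper: for each command, keep the symbols that
# match the name of another target. Those are the upstream edges.
find_upstream <- function(commands) {
  lapply(commands, function(command) {
    intersect(all.vars(command), names(commands))
  })
}

edges <- find_upstream(commands)
str(edges)
```

Here data depends on raw_data_file, and both fit and hist depend on data, which matches the graph that tar_visnetwork() draws for the pipeline above.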

tar_make() runs the correct targets in the correct order.

# R console
tar_make()
#> ● run target raw_data_file
#> ● run target data
#> ● run target fit
#> ● run target hist
#> ● end pipeline

Alternatives tar_make_clustermq() and tar_make_future() leverage clustermq[6] and future[7], respectively, to distribute targets on traditional schedulers such as SLURM[8]. It is only a matter of time before these backends become capable of sending jobs to the cloud[9].
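
As a rough sketch of what the setup can look like on a SLURM cluster: the template file name "slurm.tmpl" and the worker counts are placeholders I made up, and the exact configuration depends on your cluster and on the versions of clustermq, future, and future.batchtools you have installed.

```r
# _targets.R file (sketch): add ONE of these near the top,
# before the list of targets.

# Option A: clustermq backend.
# "slurm.tmpl" is a hypothetical scheduler template file.
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm.tmpl"
)

# Option B: future backend (requires the future.batchtools package).
# future::plan(future.batchtools::batchtools_slurm, template = "slurm.tmpl")

# R console: run the pipeline with two persistent workers (Option A)
# tar_make_clustermq(workers = 2)
# or with up to two transient workers (Option B).
# tar_make_future(workers = 2)
```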

You can store the results in the _targets/ folder (default) or Amazon S3 buckets. Either way, loading data back into R is the same.

# R console
tar_read(hist) # see also tar_load()
[Figure: histogram of ozone readings from the airquality dataset in base R]

Up-to-date targets do not rerun, which saves countless hours in computationally intense fields like machine learning, Bayesian statistics, and statistical genomics.

# R console
tar_make()
#> ✓ skip target raw_data_file
#> ✓ skip target data
#> ✓ skip target fit
#> ✓ skip target hist
#> ✓ skip pipeline
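
The idea behind skipping can be sketched in a few lines of base R. This is only a toy model under my own assumptions, not how targets is implemented: make_target(), cache, and run_log are hypothetical names, and the "fingerprint" here is simply the command text plus the upstream values, whereas the real package stores hashes on disk.

```r
# Toy sketch of skipping up-to-date work (not the targets internals).
cache <- new.env()
run_log <- character(0)

# Hypothetical helper: rerun a command only if its text or its
# upstream dependencies changed since the last run.
make_target <- function(name, command, deps = list()) {
  fingerprint <- list(command = deparse(command), deps = deps)
  cached <- cache[[name]]
  if (!is.null(cached) && identical(cached$fingerprint, fingerprint)) {
    run_log <<- c(run_log, paste("skip", name))
    return(cached$value)
  }
  run_log <<- c(run_log, paste("run", name))
  value <- eval(command, deps)
  cache[[name]] <- list(fingerprint = fingerprint, value = value)
  value
}

total <- make_target("total", quote(sum(values)), list(values = c(1, 2, 3)))
total <- make_target("total", quote(sum(values)), list(values = c(1, 2, 3)))
total <- make_target("total", quote(sum(values)), list(values = c(1, 2, 3, 4)))
print(run_log)
```

The first call runs, the second is skipped because nothing changed, and the third reruns because the upstream data changed.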

🔗 The next challenge

To help workflows scale, targets adopts the classical, pedantic, function-oriented perspective of the R language[10].

Nearly everything that happens in R results from a function call. Therefore, basic programming centers on creating and refining functions.

— John Chambers

The more often you write your own functions, the nicer your experience becomes.

But if your mind is on the domain knowledge, or if you feel pressure to work fast, then it can be hard to write functions for everything.

🔗 Target factories

The best way to write fewer functions is to write less code. To write less code, we need abstraction and automation. Target factories are package functions that return lists of pre-configured target objects, and they make specialized pipelines reusable.

# script inside example.package

#' @export
read_clean <- function(path) {
  path %>%
    read_csv(col_types = cols()) %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

#' @export
fit_model <- function(data) {
  biglm(Ozone ~ Wind + Temp, data)
}

#' @export
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}

#' @title Example target factory.
#' @description Concise shorthand to express our example pipeline.
#' @details
#'   Target factories should use `tar_target_raw()`.
#'   `tar_target()` is for users, and `tar_target_raw()` is for developers.
#'   The former quotes its arguments, while the latter evaluates them.
#' @export
biglm_factory <- function(file) {
  list(
    tar_target_raw("raw_data_file", as.expression(file), format = "file"),
    tar_target_raw("data", quote(example.package::read_clean(raw_data_file))),
    tar_target_raw("fit", quote(example.package::fit_model(data))),
    tar_target_raw("hist", quote(example.package::create_plot(data)))
  )
}

With the factory above, our long _targets.R file suddenly collapses down to three lines.

# _targets.R file
library(targets)
library(example.package)
biglm_factory("raw_data.csv")

And you still have complete freedom to add more targets to the list.

# _targets.R file
library(targets)
library(example.package)
run_model2 <- function(data) {...}
list( # Target lists can be arbitrarily nested.
  biglm_factory("raw_data.csv"),
  tar_target(model2, run_model2(data))
)
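
The roxygen note in the factory says that tar_target() quotes its arguments while tar_target_raw() evaluates them. In practice, that means a factory can build target names and commands programmatically before handing them over. Here is a minimal base R sketch of that step. make_command() is a hypothetical helper of my own, not part of targets, and the "ozone" prefix is just an example.

```r
# Build a target command as a language object, the kind of
# object that tar_target_raw() accepts for its command argument.
# make_command() is a hypothetical helper, not part of targets.
make_command <- function(fun, arg) {
  as.call(list(as.symbol(fun), as.symbol(arg)))
}

prefix <- "ozone"
data_name <- paste0(prefix, "_data")
fit_name <- paste0(prefix, "_fit")
fit_command <- make_command("fit_model", data_name)

print(fit_command)
#> fit_model(ozone_data)

# Inside a factory, the pieces would then be passed along as:
# tar_target_raw(fit_name, fit_command)
```

A factory written this way can be called several times in the same _targets.R file with different prefixes without the target names colliding.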

The R Targetopia

The R Targetopia[11] is an emerging ecosystem of packages that bring target factories to specific domains of statistics and data science.

🔗 stantargets

library(remotes)
install_github("wlandau/stantargets")
library(cmdstanr)
install_cmdstan()

stantargets[12] abstracts away most of the targets and functions required for a solid Bayesian data analysis with Stan[13]. With a single target factory and a single function to generate data, stantargets can give you an entire sensitivity analysis or an entire simulation study.

# _targets.R file
# Repeatedly simulate data from the prior predictive distribution
# and compute a 95% posterior interval for beta for each model run.
library(targets)
library(stantargets)

simulate_data <- function(n = 10L) {
  alpha <- rnorm(n = 1, mean = 0, sd = 1)
  beta <- rnorm(n = n, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, alpha + x * beta, 1)
  list(
    n = n,
    x = x,
    y = y
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(),
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = c("alpha", "beta"),
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    log = R.utils::nullfile()
  )
)

# R console
tar_visnetwork()

tarchetypes

install.packages("tarchetypes")

The tarchetypes[14] R Targetopia package is far more general than stantargets. Its target factories include tar_rep() for arbitrary simulation studies, tar_render() for dependency-aware literate programming, and tar_render_rep() for parameterized R Markdown. tar_plan() is a drake_plan()-like target factory to help drake users transition to targets.

# _targets.R file
library(targets)
library(tarchetypes)
tar_plan(
  tar_target(raw_data_file, "raw_data.csv", format = "file"),
  data = read_clean(raw_data_file),
  fit = fit_model(data),
  hist = create_plot(data)
)
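
The other factories mentioned above follow the same pattern. As a rough sketch, a small simulation study with a report might look like the following. simulate_once() and the file "report.Rmd" are placeholders I made up for illustration, and the report is assumed to call tar_read(sims) or tar_load(sims) so that tar_render() can detect the dependency.

```r
# _targets.R file (sketch)
library(targets)
library(tarchetypes)

# Hypothetical simulation function: one row per simulation rep.
simulate_once <- function() {
  data.frame(estimate = mean(rnorm(100)))
}

list(
  # 5 batches x 2 reps per batch = 10 simulated data frames,
  # row-bound together within each batch.
  tar_rep(sims, simulate_once(), batches = 5, reps = 2),
  # Re-render the report only when sims or report.Rmd changes.
  tar_render(report, "report.Rmd")
)
```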

🔗 You can help!

The R Targetopia has exciting potential for tidymodels[15], mlr3[16], keras[17], torch[18], PK/PD, spatial statistics, and beyond. If your field needs a friendly pipeline tool, please consider creating an R Targetopia package of your own. I am trying to make it easy, and I would be eager to get in touch.

🔗 Thanks

Volunteers drive the rOpenSci review process, and each review is an act of altruism. This was especially true for targets because of COVID-19, the overlap with the holidays, and the unusually copious workload. Despite the obstacles, everyone delivered incredible feedback that substantially improved targets and its documentation. Sam Oliver and TJ Mahr served as reviewers, and Mauro Lepore served as editor. Sam inspired a section on getting started, an overview vignette, more debugging advice, and a new tar_branches() function to show branch provenance. TJ suggested a new chapter on functions, helped me contrast the two styles of branching, and raised interesting questions about target names. Mauro was continuously diligent, responsive, thoughtful, and conscientious as he mediated the review process and ensured a successful outcome.

Thanks also to Matt Warkentin, Timing Liu, Miles McBain, Gorka Navarrete, Bruno Carlin, Noam Ross, Kendon Bell, and others who adopted targets early in development, proposed insightful ideas, and influenced the direction and behavior of the package.

My colleague Richard Payne was a serious drake user, and he built a proprietary drake_plan() generator for our team. His package was the major inspiration for target factories and the R Targetopia.

Everyone who contributed to drake is part of targets. Four years of pull requests, issues, rOpenSci discussions, RStudio Community posts, and Stack Overflow threads are materializing in this new suite of tools.

🔗 Disclaimer

The views in this post do not necessarily reflect those of my employer.

🔗 References


  1. Stallman, R. (1998). GNU Make, Version 3.77. Free Software Foundation. ISBN: 1882114809

  2. Landau, W. M. (2021). The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. https://doi.org/10.21105/joss.02959

  3. Landau, W. M. (2018). The drake R package: a pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 3(21), 550. https://doi.org/10.21105/joss.00550

  4. FitzJohn, R. (2021). remake: Make-like build management. R package version 0.3.0.

  5. Allaire, J. J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., and Iannone, R. (2021). rmarkdown: Dynamic Documents for R. R package version 2.6.4. https://rmarkdown.rstudio.com

  6. Schubert, M. (2019). clustermq enables efficient parallelization of genomic analyses. Bioinformatics, 35(21), 4493–4495. https://doi.org/10.1093/bioinformatics/btz284

  7. Bengtsson, H. (2020). A unifying framework for parallel and distributed processing in R using futures. https://arxiv.org/abs/2008.00553

  8. Yoo, A. B., Jette, M. A., and Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. In: Feitelson, D., Rudolph, L., and Schwiegelshohn, U. (eds), Job Scheduling Strategies for Parallel Processing. JSSPP 2003. Lecture Notes in Computer Science, vol 2862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10968987_3

  9. Amazon Web Services (2020). Overview of Amazon Web Services. https://d1.awsstatic.com/whitepapers/aws-overview.pdf

  10. Chambers, J. (2008). "Programming with R: The Basics." In Software for Data Analysis: Programming with R, 37–76. Springer. https://link.springer.com/chapter/10.1007/978-0-387-75936-4_3

  11. Landau, W. M. (2021). The R Targetopia: an R package ecosystem for democratized reproducible pipelines at scale. https://wlandau.github.io/targetopia/

  12. Landau, W. M. (2021). stantargets: Targets for Stan Workflows. https://wlandau.github.io/stantargets/, https://github.com/wlandau/stantargets

  13. Stan Development Team (2012). Stan: a C++ library for probability and sampling. https://mc-stan.org

  14. Landau, W. M. (2021). tarchetypes: Archetypes for Targets. https://docs.rOpenSci.org/tarchetypes/, https://github.com/rOpenSci/tarchetypes

  15. Kuhn et al. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

  16. Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., and Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software. https://doi.org/10.21105/joss.01903

  17. Allaire, J. J. and Chollet, F. (2020). keras: R Interface to 'Keras'. R package version 2.3.0.0. https://CRAN.R-project.org/package=keras

  18. Falbel, D. and Luraschi, J. (2020). torch: Tensors and Neural Networks with 'GPU' Acceleration. R package version 0.2.0. https://CRAN.R-project.org/package=torch