rOpenSci | drake transformed

drake transformed

Version 7.0.0 of drake just arrived on CRAN, and it is faster and easier to use than previous releases.


🔗 Recap

Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

For projects in R, the workflow package drake can help. It analyzes your project, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

Other pipeline tools such as Make, Snakemake, Airflow, Luigi, and Dask accomplish similar goals, but drake is uniquely suited to R. It features

🔗 Plans

drake’s DSL is still experimental, but it is more powerful than old interface functions such as evaluate_plan(). A new section of the manual walks through the details, and the revised gross state product example is a motivating use case.

Three transformations form the core of the new syntax.

TransformationTidyverse equivalent
map()pmap() from purrr
cross()crossing() from tidyr
combine()summarize() from dplyr

These transformations help declare batches of multiple targets.


# short names to tell get_data() what to download:
source <- c("census", "gapminder")

plan <- drake_plan(
  data = target(
    get_data(from), # custom function
    transform = map(from = !!source, .id = FALSE)
  run = target(
    transform = cross(data, analysis_function = c(lda, rf))
  out = target(
    transform = combine(run, .by = analysis_function)

#> # A tibble: 8 x 2
#>   target         command                                   
#>   <chr>          <expr>                                    
#> 1 data           get_data("census")                        
#> 2 data_2         get_data("gapminder")                     
#> 3 run_lda_data   lda(data)                                 
#> 4 run_rf_data    rf(data)                                  
#> 5 run_lda_data_2 lda(data_2)                               
#> 6 run_rf_data_2  rf(data_2)                                
#> 7 out_lda        my_summaries(run_lda_data, run_lda_data_2)
#> 8 out_rf         my_summaries(run_rf_data, run_rf_data_2)

Above, note the <expr> label underneath the command header. For the sake of faster and cleaner metaprogramming, plan$command is now a list of expressions by default. (However, you can still supply a character vector if you wish.)

🔗 Performance

After serious profiling and benchmarking, make() and drake_config() are substantially faster in workflows with thousands of targets. The changes affect drake’s internal storage protocol, so targets built with previous versions are not up to date anymore, but make() pauses to let you downgrade to an earlier version or start from scratch.

In addition, to increase clarity and ease of use, the parallelism argument of make() has fewer choices. Of the previous 11 options, only the best 3 remain. See the updated high-performance computing guide for details.

"loop"No parallel computing
"clustermq"Persistent workers
"future"Transient workers

🔗 Reproducibility

By default, drake watches your session and environment for dependencies. This behavior frees drake to fully focus on R, enhancing interactivity, flexibility, and independence from cumbersome configuration files. However, a fully reproducible use of drake requires care. Version 7.0.0 comes with new workarounds and safeguards.

🔗 Interactive sessions

A serious drake workflow should be consistent and reliable, ideally with the help of a master R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode helps ensure all this goes according to plan. If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets.

To combine interactivity with reproducibility, version 7.0.0 has a new experimental callr-like interface. Functions such as r_make(), r_outdated(), and r_drake_graph_info() each run drake in a transient callr session so that accidental changes to your interactive session do not break your results.

For more information, please see the newly refreshed chapter on drake projects. For example code, you can download the updated main example (drake_example("main")) and experiment with files _drake.R and interactive.R.

🔗 Self-invalidation

An additional safeguard prevents workflows from invalidating themselves.

plan <- drake_plan(
  x = {
  y = mean(x)

#> # A tibble: 2 x 2
#>   target command                            
#>   <chr>  <expr>                             
#> 1 x      {     data(mtcars)     mtcars$mpg }
#> 2 y      mean(x)

#> target x
#> fail x
#> Error: Target `x` failed. Call `diagnose(x)` for details. Error message:
#>   cannot add bindings to a locked environment. 
#> Please read the "Self-invalidation" section of the make() help file.

The error above comes from the call to data(mtcars). Without guardrails, the very act of building x changes x’s dependencies. In other words, x is still invalid after make() completes.

make(plan, lock_envir = FALSE)
#> target x
#> target y

make(plan, lock_envir = FALSE)
#> target x

There are legitimate use cases for lock_envir = FALSE, but most projects should stick with the default lock_envir = TRUE.

🔗 Remarks

drake enhances reproducibility, but not in all respects. Literate programming, local library managers, containerization, and strict session managers offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques working together.

🔗 Thanks

drake thrives on active participation from the open source community. Most of the inspiration for version 7.0.0 arose from new GitHub issues and conversations at the 2018 and 2019 RStudio Conferences. Many thanks to the following people for their insight, guidance, and code patches.

🔗 Disclaimer

This announcement is a product of my own opinions and does not necessarily represent the official views of my employer.