Most of us who work in R just want to Get Stuff Done™. We want a minimum amount of friction between ourselves and the data we need to wrangle, analyze, and visualize. We're focused on solving a problem or gaining insights into a new area of research. We rely on a rich, community-driven ecosystem of packages to help get our work done and likely make an unconscious assumption that there is a safety net out there, protecting us from harm.
Unfortunately, I get to be "that guy" who comes along and shatters such assumptions. It's time to put our hard hats on, get our clipboards out, and take a safety inspection tour of R. Along the way, we'll introduce features and design concepts of our rOpenSci #runconf17 project — the
notary package — that are aimed at making working in R a bit safer and more secure.
Since we say "we" quite a bit in this post, here are the folks that are represented by those two letters:
(Since "Bob" typed out the post, I get to insert what a privilege it was to work with those four folks. They're incredibly talented individuals doing really great work for the R community.)
Before we go into the concept of package trust, we'd like you to put one finger on this blog post (to hold the page) and switch over to your R console and verify what CRAN mirror you are using. Since you're down to one hand you can copy and paste this snippet:
options("repos") and review the results.
If any URL in that list doesn't start with
https:// replace it with one from this official mirror list that does (you will likely need to use both hands for that, so make sure you leave the browser tab open). If you don't use a crytographically secure method of installing packages, then everyone from your ISP, to your employer, to the government (depending on where you reside) can see what packages you're downloading and installing. Furthermore, using plain ol'
http:// means it's far easier for those who would seek to do you harm to intercept and switch out the contents of what you're retrieving.
Now, that you're sure you're using
https://, consider how much you know about the CRAN mirror you just picked. Are you sure that you can either trust the site or at least trust that the site is maintained sufficiently to deter attackers who would seek to do you (or the community) harm? Running a secure site is non-trivial and, like it or not, "data science" is one of the fastest growth areas in virtually every modern organization (commercial or academic). Such a condition is a natural attractor for attackers and while the R package ecosystem may not be in the top ten most sinister threat scenarios (for now), it will be easy to take advantage of in its current state.
To that end, the team came up with the concept of signing packages (hence the
notary name). Without taking you down a deep dive into digital signatures, you're already familiar with this concept if you use something like an iOS-based device (i.e. iPhone or iPad) and have downloaded an app from the Apple app store. A developer applies for a developer account with Apple. They get a key. They make an app. They digitally sign the app with the key they received. Apple reviews the app and (usually)eventually approves it. The signed app goes into the store and your iOS device (if you haven't "rooted" it) is configured to only run signed and approved apps.
There are three functions in
notary to help facilitate a more secure package ecosystem —
available_packages() — each of which is a thin wrapper around their base, dotted counterparts which ultimately will require modifications to CRAN mirrors to house digital signatures for packages and CRAN mirror sites themselves.
Why all this extra infrastructure and scaffolding? If we think of the R Core/CRAN team as the R equivalent of the Apple app store guardians, then when they review and approve a package that version becomes the gold standard. But, there's no current, easy, complete way to know for sure that what's on
cran.r-project.org is also what's on one of the mirror sites.
By having a similar set of signing and validation idioms, it will be possible to ensure that what you think you're getting from a CRAN repository is what was approved by the CRAN team. We still need to get one "secure" mirror setup to enable a proof-of-concept, so stay tuned for advancements in this area.
While the CRAN distribution model is not perfect, it's Fort Knox compared to GitHub.
Oh, but before we go into that, you should check out some extremely cutting edge functionality Hadley and others are putting into
purrr. Just do a quick
devtools::install_github("hadlley/purrr") bring up the help for the new threaded parallel execution of
Now, you know this is a post about security & R so hopefully your Spidey-sense was triggered and you knew enough not to even try that or you caught the
ll before you did the copy/paste. If you did end up doing the install attempt, be a teensy bit thankful that I deleted the
hadlley account before I wrote the post.
GitHub (and other public code repositories) are wonderful places where folks can collaborate and share creations. They are also fraught with peril. This is easily demonstrated by this proof-of-concept R package
rpwnd. Since GitHub is the most popular public R package development area, we'll focus on it for the remainder of this section.
One way to begin to mitigate the threat of GitHub package distribution is to impose some rules and provide a means to ensure some level of authenticity at the author and release level. To that end, we have two core functions:
validate_release() that rely on a setting that most of you likely do not have enabled in GitHub - PGP keys. You can read up on GitHub & PGP but you should really keep one finger on this page (again) and go check out A Git Horror Story: Repository Integrity With Signed Commits.
Back? Good. Let's continue.
The premise is simple: only install actual releases (which is a good idea anyway) and only, then, install signed releases. This is not a panacea and does not fix all the security & integrity problems associated with the GitHub distribution model, but if combined with some manual inspection of the repository and repo owner profile it will help ensure that you're somewhat closer to getting benign code.
This functionality is available today. So go setup your own PGP keys, add them to your owner profile and start generating signed releases.
Rounding out the feature set are two functions
sys_source_safe_sign() which are more secure (well, at least safer) wrappers for their dotted base siblings.
I literally break down into tears when I see a
source() suggestion posted anywhere, especially to non-
https:// URLs. Why? Even if you did a manual inspection at one point in time that the code is not malicious, how do you know that it hasn't been modified since then? Their
devtools counterparts (
source_url()) are a tad better, provided they you use
sha1 parameter to ensure that what you think you are sourcing hasn't changed.
notary sourcers go one step further and use a
sodium-based signature to verify the integrity of the source code you so desperately want to use via this methodology. These functions need some kinder, gentler companion functions to make it easier for all users to sign scripts, so you'll have to check back for those as we continue to poke at the project.
While we have a great start at building a foundation of safer and more secure R package and code delivery, the best part of building the
notary package was working with a team who genuinely wants to help ensure that the R community can operate as safely as possible without garish, creativity-crushing impediments. Rich, Oliver, Stephanie and Jeroen all had clever ideas for tough problems and we'll hopefully be able to continue to make small steps towards progress.
Hopefully we've helped folks understand some of the dangers that are out there and further demonstrated that we've begun to address some of them with the
notary package. If the idea of helping find ways to keep data science folks safer has piqued your interest, please do not hesitate to contact any of the team. We'd love to engage with more of the community on
notary, and would love feedback on usability and ideas for new or improved functionality.
Thank you, again, to rOpenSci for the opportunity to come together and collaborate on this project.