Data validation with the assertr package
Version 2.0 of my data set validation package
assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using
assertr, the text below will point out the improvements.
I can (and have) go on and on about the treachery of messy/bad datasets. Though its substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, it’s success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).
assertr is a R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alert you of any deviation from those assumptions.
I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package;
assertr is not (and can never be) a “one-stop shop” for all of your data validation needs. The specific kind of checks individuals or teams have to perform any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although, the
assertive meta-package may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I’m trying to sell
assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that
assertr attempts to impose a grammar of error checking for datasets.
In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:
- For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates.
assertrcalls this verb
- For every element in a column, you want to make sure certain criteria are met but the criteria can be decided only after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires computation on the elements prior to inform the criteria to check for.
assertrcalls this verb
- For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated.
assertrcalls this verb
- For every row of a dataset, you want to make sure certain assumptions hold but the criteria can be decided only after looking at the entire column as a whole. This closely mirrors the distinction between
insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows are within n number of standard deviations of the mean distance.
assertrcalls this verb
- You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc…
assertrcalls this last verb
Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…
Pretend that, before finding the average miles per gallon for each number of engine cylinders in the
mtcars dataset, we wanted to confirm the following dataset assumptions…
- that it has the columns
- that the dataset contains more than 10 observations
- that the column for ‘miles per gallon’ (mpg) is a positive number
- that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
- that the
vscolumns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
- each row contains at most 2 NAs
- each row is unique jointly between the
- each row’s mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)
assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the
chain_end functions; all assumptions within a chain (below a call to
chain_start and above
chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
Another major improvement since the last version of
assertr on CRAN is that
assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell
assertr to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct
assertr to just give a warning instead of throwing a fatal error. For more information on this, see
Beyond these examples
Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using
assertr to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!) In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There is, however, two good pieces of news.
The first is that there is another package, assertive that greatly enhances the
assertr experience. The predicates (functions that start with
is_) from this (meta)package can be used in
assertr pipelines just as easily as
assertr’s own predicates. And
assertive has an enormous amount of them! Some specialized and exciting examples include
The second is if
assertive doesn’t have what you’re looking for, with just a little bit of studying the
assertr grammar, you can whip up your own predicates with relative ease. Using some these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.