rOpenSci | A package for tidying nested lists

Data == knowledge! Much of the data we use, whether it be from government repositories, social media, GitHub, or e-commerce sites comes from public-facing APIs. The quantity of data available is truly staggering, but munging JSON output into a format that is easily analyzable in R is an equally staggering undertaking. When JSON is turned into an R object, it usually becomes a deeply nested list riddled with missing values that is difficult to untangle into a tidy format. Moreover, every API presents its own challenges; code you’ve written to clean up data from GitHub isn’t necessarily going to work on Twitter data, as each API spews data out in its own unique, headache-inducing nested list structure. To ease and generalize this process, Amanda Dobbyn proposed an unconf18 project for a general API response tidier! Welcome roomba, our first stab at easing the process of tidying nested lists!

Drawing

roomba will eventually be able to walk nested lists in a variety of different structures from JSON output, replace NULL or .empty values with NAs or a user-specified value, and return a tibble with names matching a user-specified list. Of course, in two days we haven’t fully achieved this vision, but we’re off to a promising start.

🔗 The birth of roomba

It was clear Amanda was on to something good by the lively discussion in the #runconf18 issues repository leading up to the unconf. Thanks to input from Jenny Bryan, Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam Ross, we had a lot of ideas to work with when the unconf began. Fortunately, Jim already had a function called dfs_idx() (here) written to perform depth-first searches of nested lists from the GitNub GraphQL API. With the core list-traversal code out of the way, we split our efforts between developing a usable interface, stockpiling .JSON files to test on, and developing a Shiny app.

🔗 What’s working

We’ve got the basic structure of roomba sorted out, and you should install it from GitHub to try out! Here are a few of the examples we’ve put together.

library(roomba)
#load twitter data example
data(twitter_data)

#roomba-fy!
roomba(twitter_data, c("created_at", "name"))

## # A tibble: 24 x 2
##    name                 created_at                    
##    <chr>                <chr>                         
##  1 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  2 Ben Lorica <U+7F57><U+745E><U+5361>    Mon Dec 22 22:06:18 +0000 2008
##  3 Dan Sholler          Thu Apr 03 20:09:24 +0000 2014
##  4 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  5 FiveThirtyEight      Tue Jan 21 21:39:32 +0000 2014
##  6 Digital Impact       Wed Oct 07 21:10:53 +0000 2009
##  7 Drew Williams        Thu Aug 07 18:41:29 +0000 2014
##  8 joe                  Fri May 29 13:25:25 +0000 2009
##  9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick       Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows

And just the first element of the twitter_data list will show you that roomba has simplified this process quite a bit.

twitter_data[[1]]

## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
## 
## $id
## [1] 9.98624e+17
## 
## $id_str
## [1] "998623997397876743"
## 
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl… https://t.co/7tVf1qmNmq"
## 
## $truncated
## [1] TRUE
## 
## $entities
## $entities$hashtags
## list()
## 
## $entities$symbols
## list()
## 
## $entities$user_mentions
## list()
## 
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
## 
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
## 
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9…"
## 
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
## 
## $entities$urls[[1]]$indices[[2]]
## [1] 140
## 
## 
## 
## 
## 
## $source
## [1] "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>"
## 
## $in_reply_to_status_id
## NULL
## 
## $in_reply_to_status_id_str
## NULL
## 
## $in_reply_to_user_id
## NULL
## 
## $in_reply_to_user_id_str
## NULL
## 
## $in_reply_to_screen_name
## NULL
## 
## $user
## $user$id
## [1] 64482503
## 
## $user$id_str
## [1] "64482503"
## 
## $user$name
## [1] "Code for America"
## 
## $user$screen_name
## [1] "codeforamerica"
## 
## $user$location
## [1] "San Francisco, California"
## 
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
## 
## $user$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
## 
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
## 
## 
## 
## 
## 
## $user$entities$description
## $user$entities$description$urls
## list()
## 
## 
## 
## $user$protected
## [1] FALSE
## 
## $user$followers_count
## [1] 49202
## 
## $user$friends_count
## [1] 1716
## 
## $user$listed_count
## [1] 2659
## 
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
## 
## $user$favourites_count
## [1] 4490
## 
## $user$utc_offset
## [1] -25200
## 
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
## 
## $user$geo_enabled
## [1] TRUE
## 
## $user$verified
## [1] TRUE
## 
## $user$statuses_count
## [1] 15912
## 
## $user$lang
## [1] "en"
## 
## $user$contributors_enabled
## [1] FALSE
## 
## $user$is_translator
## [1] FALSE
## 
## $user$is_translation_enabled
## [1] FALSE
## 
## $user$profile_background_color
## [1] "EBEBEB"
## 
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_tile
## [1] FALSE
## 
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
## 
## $user$profile_link_color
## [1] "CF1B41"
## 
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
## 
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
## 
## $user$profile_text_color
## [1] "333333"
## 
## $user$profile_use_background_image
## [1] FALSE
## 
## $user$has_extended_profile
## [1] FALSE
## 
## $user$default_profile
## [1] FALSE
## 
## $user$default_profile_image
## [1] FALSE
## 
## $user$following
## [1] TRUE
## 
## $user$follow_request_sent
## [1] FALSE
## 
## $user$notifications
## [1] FALSE
## 
## $user$translator_type
## [1] "none"
## 
## 
## $geo
## NULL
## 
## $coordinates
## NULL
## 
## $place
## NULL
## 
## $contributors
## NULL
## 
## $is_quote_status
## [1] FALSE
## 
## $retweet_count
## [1] 0
## 
## $favorite_count
## [1] 0
## 
## $favorited
## [1] FALSE
## 
## $retweeted
## [1] FALSE
## 
## $possibly_sensitive
## [1] FALSE
## 
## $possibly_sensitive_appealable
## [1] FALSE
## 
## $lang
## [1] "en"

We created a Shiny app too, which in its current state allows you to select a .Rda or .JSON file, pick two variables, and create a scatterplot of them.

Run the app like this:

shiny_roomba()

🔗 What’s not

Of course, in two days we weren’t able to build a magical one-size-fits-all solution to every API response data headache. Right now, the main barrier to usability is that both the roomba() function and shiny_roomba() app only work on sub-list items of the same length and same data type stored at the same depth. To illustrate on the twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))

## # A tibble: 1,007 x 1
##    user      
##    <list>    
##  1 <int [1]> 
##  2 <chr [1]> 
##  3 <chr [1]> 
##  4 <chr [1]> 
##  5 <chr [1]> 
##  6 <chr [1]> 
##  7 <chr [1]> 
##  8 <list [2]>
##  9 <lgl [1]> 
## 10 <int [1]> 
## # ... with 997 more rows

#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))

## # A tibble: 0 x 0

In addition, we’ve got some features we want to add, such as handling a larger variety of column names (i.e. passing a string for a single column name, keeping all values even if they are all NULL). We would love your feedback on other things we can add (open an issue in our Git repository)!

🔗 The team

Amanda Dobbyn

Job: Data Scientist at Earlybird Software

Project contributions: initial GH issue, package name, wrapper for dfs_idx()

Jim Hester

Job: Software Engineer at RStudio

Project contributions: dfs_idx() and remove_nulls() functions, package building, README, and debugging

Christine Stawitz

Job: Postdoctoral researcher at University of Washington’s School of Aquatic and Fishery Sciences

Project contributions: Shiny app, README and blog post writing

Laura DeCicco

Job: Data Scientist at U.S. Geological Survey

Project contributions: Fixing merge conflicts :)

Isabella Velasquez

Job: Data Analyst at the Bill & Melinda Gates Foundation

Project contributions: hex sticker!