rOpenSci | Distinguish yourself in CRAN person() with ORCID

Distinguish yourself in CRAN person() with ORCID

Proper identification of individuals is crucial for acknowledging and studying their scientific work, be it journal articles or pieces of software. In this tech note, one year after CRAN started supporting ORCIDs, we shall explain why and how to use unique author identifiers in DESCRIPTION files.

🔗 Why use ORCIDs on CRAN?

When analyzing the authorship of CRAN packages, one can look at authors’ names and email addresses. Names can be written with and without quotes, email addresses change, which makes it all tricky as noted by David Smith when he looked for the most prolific CRAN authors (notice our very own Scott Chamberlain and Jeroen Ooms in that scoreboard by the way?). Besides, several people can have the same name!

So all in all, using an unique ID per author makes sense. That’s something offered for the academic world and beyond by ORCID, ORCID meaning “Open Researcher and Contributor ID”. Anyone can set up an ORCID account for free and link it up to the rest of their online persona, as well as to their employment and works. When the authors of a scientific paper give their ORCID ID to the publisher, these can be included in the html and PDF versions of a paper. Note that there’s no, say, Keybase verification for ORCIDs, and often the input of an ORCID ID for a paper is declarative as well. Read more about ORCID here.

About one year ago, CRAN added support for ORCID in DESCRIPTION files by a minimally invasive hack. One can add the ORCID ID of an author as a comment field in person(), see e.g. Carl Boettiger’s ID in codemetar DESCRIPTION, and one gets a nice little green bubble near their name on CRAN, see e.g. codemetar CRAN page.

screenshot of\n<a href=\https://cran.r-project.org/web/packages/codemetar/index.html" src="/img/blog-images/2018-10-08-orcid/codemetar-1.png">

The advantages of using ORCID IDs in DESCRIPTION are:

  • Unique identifiers for free! Sure, there’s no verification so someone could use your identity, but at least, it should help assigning your own work to you.

  • Clickable name on CRAN pages, METACRAN pages and pkgdown websites! E.g. when clicking on Carl’s green bubble on codemetar CRAN page, one gets to his ORCID profile including a link to his personal website. See also codemetar METACRAN page and codemetar pkgdown website.

screenshot of\n<a href=\https://orcid.org/0000-0002-1642-628X" src="/img/blog-images/2018-10-08-orcid/carl-1.png">

Even for non academics, making one’s online persona more accessible from a CRAN DESCRIPTION can’t hurt.

🔗 How to manipulate ORCIDs in DESCRIPTION?

So, how does one add add and interact with ORCID in DESCRIPTION? Well, thankfully, the current dev version of the RStudio desc package by Gábor Csárdi has some direct support of ORCIDs! Gábor too is well ranked in the scoreboard mentioned earlier!

remotes::install_github("r-lib/desc")

🔗 Search authors by ORCID

desc offers many functions for manipulating authors: adding or deleting them or their roles. Now you can perform these operations using ORCID to identify authors to be modified. E.g. if you want to add a role to an existing author, say Carl who worked hard enough to earn the “aut” role, you’d run

desc::desc_add_role(role = "aut",
                    orcid = "0000-0002-1642-628X")

Obviously this very example above has probably little real-life applicability, but you could imagine manipulating DESCRIPTION files en masse using a table of ORCIDs and new roles.

🔗 Add ORCID to authors of a package

The new add_orcid() method and desc_add_orcid() functions make it possible to add ORCID IDs to authors directly instead of via the comment argument. So you could write some script

desc::desc_add_orcid("0000-0002-2815-0399",
                     given = "Maëlle",
                     family = "Salmon")

and run it inside all your local package folders to add your ORCID. It will add this ORCID ID to the author whose given name is “Maëlle” and whose family name is “Salmon”. If you only write

desc::desc_add_orcid("0000-0002-2815-0399",
                     given = "Maëlle")

and there are two Maëlle’s, the function will error and you’ll need to run it with more specific arguments.

🔗 Add all of your identity at once!

The coolest feature of desc ORCID support might be that desc::desc_add_me() will now use the ORCID of the environment variable ORCID_ID if you created it. Hence, once per machine it’d now be recommended to

  • install the whoami package necessary for desc::desc_add_me() to work. install.packages("whoami").

  • create an environment variable ORCID_ID, editing .Renviron by calling e.g. usethis::edit_r_environ(). Here’s how the corresponding .Renviron line looks for me:

ORCID_ID="0000-0002-2815-0399"

desc::desc_add_me() looks for your name and email addresses in your existing configuration (git configuration, GitHub profile), see the whoami package docs. So say you’ve just been told to add yourself as contributor of a package you’ve improved: inside the package directory, you’d just need to run desc::desc_add_me() and voilà ! No risk to wrongly copy-paste your ORCID.

As a conclusion of this section, not only are ORCIDs useful, but they’re getting citizen status in the world of as-automatic-as-possible package development. Now, we shall see how one could use ORCIDs inside R to gather information about package authors.

🔗 How to study CRAN authors via their ORCID

As mentioned earlier, rOpenSci’s Scott Chamberlain is the most prolific CRAN package maintainer, so perhaps unsurprisingly, he maintains a package interacting with ORCID’s API, rorcid!

The API requires authentication so the first step after installing the package is to run:

rorcid::orcid_auth()

and to save the resulting token as an ORCID_TOKEN environment variable in .Renviron (again, usethis::edit_r_environ() can help with that). Authentication warrants your having an ORCID account, so if you haven’t one yet, register via this link.

Then, we’ll query all comment’s of Authors@R of CRAN packages. Some CRAN packages use the Author field instead, which is not recommended and doesn’t allow using comment.

Side-note, in general, how would one analyze stuff over the whole CRAN collection?

Thanks to Gábor, who maintains METACRAN, for patiently answering my questions about meta-CRAN things. If we were after the comments section of just a few packages, we could query METACRAN database like so:

url <- "https://crandb.r-pkg.org/-/latest?keys=[%22desc%22,%22codemetar%22]"
obj  <- jsonlite::fromJSON(url, simplifyVector = FALSE)
lapply(obj, function(x) eval(parse(text = x$`Authors@R`))$comment)

but in our case, when wanting to get all the current CRAN authors, it’s easier to use tools::CRAN_package_db(). As a side-note, if you find a better solution than the functions defined below parse_authors() and parse_author() (some purrr magic?) to coerce all authors information to a data.frame, feel free to comment under the post!

library("magrittr")
all_info <- tools::CRAN_package_db()
all_info <- all_info[!is.na(all_info$`Authors@R`),]
all_info <- all_info[, c("Package", "Authors@R")]
names(all_info)[2] <- "authors_at_r"

null_to_na <- function(x){
  if(is.null(x)){
    NA
  }else{
    x
  }
}

parse_author <- function(author){
  author <- as.person(author)
  tibble::tibble(given = null_to_na(author$given),
                 family = null_to_na(author$family),
                 email = null_to_na(author$email),
                 comment = as.character(null_to_na(author$comment)),
                 orcid = as.character(null_to_na(author$comment["ORCID"])))
}

parse_authors <- function(authors_at_r){
  authors <- try(eval(parse(text = authors_at_r)),
                silent = TRUE)
  if(inherits(authors, "try-error")){
    return(list(NA))
  }
  
  else{
    list(purrr::map_dfr(authors, parse_author))
  }
  
}

cran_authors <- all_info %>%
  as.data.frame() %>%
  dplyr::select(Package, authors_at_r) %>% 
  dplyr::rowwise() %>%
  dplyr::mutate(authors = parse_authors(authors_at_r)) %>%
  tidyr::unnest(authors)

This is how the resulting table looks like. Out of 13147 packages on CRAN we are looking at 5421, i.e. packages using the Authors@R field. There is one line per author in a package.

str(cran_authors)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16272 obs. of  7 variables:
##  $ Package     : chr  "abbyyR" "abc" "abc" "abc" ...
##  $ authors_at_r: chr  "person(\"Gaurav\", \"Sood\", email = \"[email protected]\", role = c(\"aut\", \"cre\"))" "c( \n    person(\"Csillery\", \"Katalin\", role = \"aut\", email=\"[email protected]\"),\n    person(\"Le"| __truncated__ "c( \n    person(\"Csillery\", \"Katalin\", role = \"aut\", email=\"[email protected]\"),\n    person(\"Le"| __truncated__ "c( \n    person(\"Csillery\", \"Katalin\", role = \"aut\", email=\"[email protected]\"),\n    person(\"Le"| __truncated__ ...
##  $ given       : chr  "Gaurav" "Csillery" "Lemaire" "Francois" ...
##  $ family      : chr  "Sood" "Katalin" "Louisiane" "Olivier" ...
##  $ email       : chr  "[email protected]" "[email protected]" NA NA ...
##  $ comment     : chr  NA NA NA NA ...
##  $ orcid       : chr  NA NA NA NA ...

We can count the number of occurrences of ORCIDs.

(cran_authors %>%
  dplyr::filter(!is.na(orcid)) %>%
  dplyr::pull(Package) %>%
  unique() %>%
  length() -> orcid_pkg_no)
## [1] 613

Only 613 packages have at least an ORCID ID out of 13147 CRAN packages (you can’t have an ORCID ID without using the Authors@R field).

There are 512 unique ORCID IDs. Let’s look at the most prolific authors with an ORCID ID.

cran_authors %>%
  dplyr::filter(!is.na(orcid)) %>%
  dplyr::count(orcid, sort = TRUE) %>%
  head(n = 10) -> most_prolific
knitr::kable(most_prolific)
orcidn
0000-0003-1444-913529
0000-0002-4035-028921
0000-0003-0918-376621
0000-0003-4198-991119
0000-0003-4097-632616
0000-0001-8301-047113
0000-0003-0645-566612
0000-0001-5243-233X11
0000-0001-5670-264011
0000-0002-8584-459X10

Then we can use rorcid to get a glimpse of their profile.

rorcid::as.orcid(most_prolific$orcid)
## [[1]]
## <ORCID> 0000-0003-1444-9135
##   Name: Chamberlain, Scott
##   URL (first): /
##   Country: US
##   Keywords: ecology, open access, bioinformatics, evolution, R
## 
## [[2]]
## <ORCID> 0000-0002-4035-0289
##   Name: Ooms, Jeroen
##   URL (first): https://github.com/jeroen
##   Country: NL
##   Country: US
##   Keywords: 
## 
## [[3]]
## <ORCID> 0000-0003-0918-3766
##   Name: Zeileis, Achim
##   URL (first): https://eeecon.uibk.ac.at/~zeileis/
##   Country: AT
##   Keywords: 
## 
## [[4]]
## <ORCID> 0000-0003-4198-9911
##   Name: Hornik, Kurt
##   URL (first): 
##   Country: 
##   Keywords: 
## 
## [[5]]
## <ORCID> 0000-0003-4097-6326
##   Name: Leeper, Thomas
##   URL (first): http://www.lse.ac.uk/government/whosWho/Academic%20profiles/ThomasLeeper.aspx
##   Country: GB
##   Keywords: Public Opinion, Survey Experiments, Political Behaviour, R, Research Design
## 
## [[6]]
## <ORCID> 0000-0001-8301-0471
##   Name: Hothorn, Torsten
##   URL (first): 
##   Country: 
##   Keywords: 
## 
## [[7]]
## <ORCID> 0000-0003-0645-5666
##   Name: Xie, Yihui
##   URL (first): https://yihui.name
##   Country: 
##   Keywords: 
## 
## [[8]]
## <ORCID> 0000-0001-5243-233X
##   Name: Tang, Yuan
##   URL (first): https://terrytangyuan.github.io/about/
##   Country: US
##   Keywords: Machine Learning, Data Visualization, Open Source, Software Engineering
## 
## [[9]]
## <ORCID> 0000-0001-5670-2640
##   Name: Rudis, Bob
##   URL (first): http://rud.is/b
##   Country: US
##   Keywords: cybersecurity, r, data visualization, web crawling/scraping
## 
## [[10]]
## <ORCID> 0000-0002-8584-459X
##   Name: You, Kisung
##   URL (first): https://kisungyou.github.io/
##   Country: KR
##   Keywords:

Interestingly we recognize some names from the most prolific CRAN maintainers (Scott and Jeroen, again!) but not only. Note that another difference with David Smith’s post is that the list above includes authors no matter their role.

Over all authors with an ORCID, using rorcid one could extract their location and much more, but this is beyond the scope of this tech note.

Inversely, if one wanted to find CRAN packages by ORCID ID, instead of ORCID IDs by CRAN package, one could make the most of METACRAN. There is no search by ORCID, but since everything in DESCRIPTION is indexed by METACRAN, one can use the ORCID ID as search term. Say you want to find all packages by Scott Chamberlain:

🔗 Conclusion

In this tech note we made the case for using ORCID IDs as identifiers for the authors of CRAN packages, for academics and non-academics as well. We described the current desc support for ORCID. We also gave a small insight into the wealth of information one can get via rorcid.

For academics, another aspect of making their software contributions valuable is getting the software they write as work inside their ORCID profile. One can do that by hand, but one can also

  • get a legit DOI for each package release by activating Zenodo for the repository,

  • write a software paper, e.g. via JOSS, that’d have its own DOI.

But this is probably a topic for another time, such as would be the topic of CRAN potentially adopting other IDs like GitHub username, ideally via Keybase. By the way if you’d like to see a Keybase integration of ORCID happen, please “+1” this issue by Steph Locke. In the meantime, let’s hope ORCID IDs will be more adopted by academics and non-academics alike for:

  • Making CRAN package authors easier to identify and their online persona easier to find.

  • Making academic CRAN package authors’ scientific contributions easier to access when studying packages.

Have you started desc::add_orcid()-ing? Don’t hesitate to comment below!

Many thanks to 0000-0002-1642-628X and 0000-0003-1419-2405 for their feedback on this note.