rOpenSci | How r-universe searches for packages on CRAN / Bioconductor

This post is part of a series of technotes about r-universe, a new umbrella project by rOpenSci under which we experiment with various ideas for improving publication and discovery of research software in R. As the project evolves, we will post updates to document features and technical details. For more information, visit the r-universe project page.

How packages appear in r-universe

Last month we explained how r-universe makes it easy to search and browse through the countless R packages, articles, and datasets to let you discover and learn new things. We are continuously growing this database by adding more R projects, to guide you through everything the R ecosystem has to offer.

Currently r-universe is tracking and indexing over 18,000 R packages. These are a mix of packages found on popular networks like CRAN or Bioconductor, and packages that were registered by users.

In previous posts we already explained how to create your personal CRAN-like repository and publish packages on r-universe yourself. This post explains the other part: how the scraper automatically finds packages on CRAN and Bioconductor that should be included in r-universe.

Why we look for the upstream package source

For R packages to be trackable by r-universe, the source has to be publicly accessible via Git [1]. Most packages in r-universe are found on GitHub, but in fact any Git server is allowed.

We strongly prefer tracking projects from their official upstream Git source, where the authors commit changes and where users report bugs. The Git source provides a lot of useful information such as:

  • The latest changes
  • Who is the owner of the project
  • If there is an active bug tracker
  • Historical statistics on development activity and contributors
  • Project metadata, such as keywords

R-universe automatically analyses all this information, uses it to rank and classify packages, and presents the data via the web user interfaces and APIs. For this reason we really want to know the official Git URL and owner, even when a copy of the package exists on CRAN or Bioconductor.

How we determine the upstream source

For all R packages on CRAN and Bioconductor we perform the following steps to try to find the upstream Git source URL:

  1. Inspect the URL and BugReports fields in the DESCRIPTION file to look for a GitHub/GitLab/Bitbucket/R-Forge URL. If the package can be found here, this is the preferred method.
  2. If this fails, but the maintainer has a GitHub account, we look for the package under this GitHub account. If the package is found and its version is equal to or greater than the version on CRAN or Bioconductor, this is treated as the official source.
  3. Finally, if the maintainer has a GitHub account but we could not find the package there either, we add a copy of the package from metacran to the universe of the maintainer. As explained earlier, for this set of packages we do not have upstream metadata, so we can only index the package content and some maintainer information.
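
The three steps above can be sketched in a few lines of Python. This is only an illustration of the order of the checks, not the actual scraper: the `find_upstream` and `version_key` names, the regular expression, and the `maintainer_repos` dictionary (a stand-in for a real lookup on GitHub) are all assumptions made for this example, and R-Forge URLs are left out to keep it short.

```python
import re

# Git hosts named in step 1; the pattern itself is an illustrative assumption.
GIT_HOST = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def version_key(version):
    """Split an R-style version such as '1.2-3' into comparable integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def find_upstream(description, maintainer_login=None, maintainer_repos=None):
    """Return (method, source URL or None), or None if the package is skipped.

    description      -- dict of DESCRIPTION fields (Package, Version, URL, ...)
    maintainer_login -- GitHub login of the maintainer, if one is known
    maintainer_repos -- hypothetical stand-in for a GitHub lookup:
                        {package name: version found in that repository}
    """
    # Step 1: a recognised Git host in the URL or BugReports field.
    for field in ("URL", "BugReports"):
        match = GIT_HOST.search(description.get(field, ""))
        if match:
            host, owner, repo = match.groups()
            return ("description", f"https://{host.lower()}/{owner}/{repo}")

    # No Git source and no known owner: the package cannot be placed.
    if maintainer_login is None:
        return None

    # Step 2: the same package under the maintainer's account, and not older
    # than the version published on CRAN or Bioconductor.
    name = description["Package"]
    found = (maintainer_repos or {}).get(name)
    if found and version_key(found) >= version_key(description["Version"]):
        return ("maintainer", f"https://github.com/{maintainer_login}/{name}")

    # Step 3: fall back to a mirror copy in the maintainer's universe;
    # no upstream URL is known in this case.
    return ("mirror", None)
```

For example, `find_upstream({"Package": "bar", "Version": "1.2-3"}, "bob", {"bar": "1.2-4"})` takes the second branch, because the (hypothetical) repository under `bob` holds a newer version than the published one.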

This list of package URLs is updated every night and published in crantogit. Today’s statistics are:

  • 10,805 CRAN/Bioc packages found at the Git URL mentioned in the DESCRIPTION file (yay, you rule!)
  • 1,983 packages found under the maintainer’s personal GitHub account
  • 4,613 packages ingested from the CRAN/Bioc mirror in the maintainer’s universe

Currently we do not process CRAN/Bioc packages that have no public Git source and whose maintainer has no GitHub account, because we cannot determine the owner (and hence the r-universe subdomain).

This is roughly how it works, but there are some caveats. For example, the scraper may not be able to find a package if it is stored in an unusual subdirectory within a Git repository. Also, CRAN has an unusual practice of unpredictably archiving and unarchiving packages. Therefore, packages that are archived on CRAN and are not part of any other registry remain on r-universe for another 2 months.

Tips for package authors

If you maintain an R package, regardless of where you publish it, I highly recommend these two things to let us (and others) identify the official source and maintainer of the project:

  • Put the GitHub/GitLab/R-Forge/Bitbucket home of your project in the URL and/or BugReports fields in the package DESCRIPTION file when you publish on CRAN/Bioconductor [2]. This makes it clear where to report bugs, and also prevents confusion about the official source if someone forks your package or creates a package with the same name.
  • If you have a GitHub account (even if you never use it!), do register your maintainer email address(es) in your GitHub account settings (see also this FAQ). This way the maintainer can be linked to your GitHub account/picture, and systems like r-universe and metacran can correctly identify ownership and contributions.
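
To make the first tip concrete, here is a small Python sketch that reads a DESCRIPTION file and reports whether the URL and BugReports fields are set. The package name, the `alice` account, and the example URLs are made up, and `parse_description` is a simplified parser written for this example, not a tool used by r-universe.

```python
def parse_description(text):
    """Parse DESCRIPTION text: 'Field: value' lines plus indented continuations."""
    fields, current = {}, None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and current:
            # Indented lines continue the value of the previous field.
            fields[current] += " " + line.strip()
        elif ":" in line:
            current, value = line.split(":", 1)
            fields[current] = value.strip()
    return fields


# Hypothetical DESCRIPTION file that follows the recommendation.
EXAMPLE = """\
Package: mypkg
Version: 0.1.0
Title: An Example Package
URL: https://github.com/alice/mypkg,
    https://alice.example.org/mypkg
BugReports: https://github.com/alice/mypkg/issues
"""

fields = parse_description(EXAMPLE)
missing = [name for name in ("URL", "BugReports") if name not in fields]
print(missing or "URL and BugReports are both set")
```

The URL field may list several comma-separated addresses, so the project homepage and the Git repository can both be included.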

Finally, I want to emphasize again that packages do not need to be on CRAN or Bioconductor to be included in r-universe. It is super easy to set up your own universe and get the same benefits!

  1. One notable exception is r-forge, which uses SVN but has a live Git mirror.

  2. You can do it manually or by running usethis::use_github_links().
