citecorp: working with open citations

  Scott Chamberlain SEPTEMBER 17, 2019

citecorp is a new (hit CRAN in late August) R package for working with data from the OpenCitations Corpus (OCC). OpenCitations, run by David Shotton and Silvio Peroni, houses the OCC, an open repository of scholarly citation data under the very open CC0 license. The I4OC (Initiative for Open Citations) is a collaboration between many parties, with the aim of promoting “unrestricted availability of scholarly citation data”. Citation data is available through Crossref, and available in R via our packages rcrossref, fulltext and crminer. Citation data is also available via the OCC; and this OCC data is now available in R through the new package citecorp.

How much citation data does the OCC have?

Quoting the OpenCitations website (as of today):

the OCC has ingested the references from 326,743 citing bibliographic resources, and contains information about 13,964,148 citation links to 7,565,367 cited resources

Why citation data? Citations are the links among scholarly works (articles, books, etc.), leading to many important uses including finding related articles, calculating article impact, and even use in academic hiring decisions.

Why open citation data? Until recently most citation data has been locked behind publisher walls. Unfortunately, many publishers see giving away citation data as losing potential profits. Through the I4OC, many publishers have made their citation metadata public, but some of the largest publishers still have not done so: Elsevier, American Chemical Society, IEEE. Without all citation data being open, any work that builds on citation data only has a sub-sample of all citations; you are drawing conclusions about citations from a rather small subset of all existing citations. Nonetheless, the currently available open citation data is an important resource; and can be amended with citation data behind paywalls for those that have access.

About citecorp and the OCC

OpenCitations created their own identifiers called Open Citation Identifiers (oci), e.g.,

020010009033611182421271436182433010601-02001030701361924302723102137251614233701000005090307

You are probably not going to be using oci identifiers, but rather DOIs and/or PMIDs (PubMed identifier) and/or PMCIDs (PubMed Central identifier) (see the PubMed Wikipedia entry for more). See ?citecorp::oc_lookup for methods for cross-walking among identifier types.

OpenCitations has a Sparql endpoint for querying their data; you can find that at http://opencitations.net/sparql; we do interface with the OCC Sparql endpoint in citecorp, but we don’t provide a user interface directly to it in citecorp.

The OCC is also available as data dumps.


citecorp source code: https://github.com/ropenscilabs/citecorp

citecorp on CRAN: https://cloud.r-project.org/web/packages/citecorp/



Installation

Install from CRAN

install.packages("citecorp")

Development version

remotes::install_github("ropenscilabs/citecorp")

Load citecorp

library(citecorp)


Converting among identifiers

Three functions are provided for converting among different identifier types; each function gives back a data.frame containing the url for the article, PMID, PMCID and DOI:

  • oc_doi2ids
  • oc_pmid2ids
  • oc_pmcid2ids
oc_doi2ids("10.1097/igc.0000000000000609")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2  pmid                        26645990
#> 3 pmcid                      PMC4679344
#> 4   doi    10.1097/igc.0000000000000609
oc_pmid2ids("26645990")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2   doi    10.1097/igc.0000000000000609
#> 3 pmcid                      PMC4679344
#> 4  pmid                        26645990
oc_pmcid2ids("PMC4679344")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2   doi    10.1097/igc.0000000000000609
#> 3  pmid                        26645990
#> 4 pmcid                      PMC4679344

Under the hood we interact with their Sparql endpoint to do these queries.

COCI methods

A series of three more functions are meant for fetching references of, citations to, or metadata for individual scholarly works.

Here, we look for data for the DOI 10.1108/jd-12-2013-0166

Peroni, S., Dutton, A., Gray, T. and Shotton, D. (2015), “Setting our bibliographic references free: towards open citation data”, Journal of Documentation, Vol. 71 No. 2, pp. 253-277.

Note: If you don’t load tibble you get normal data.frame’s

library(tibble)
doi <- "10.1108/jd-12-2013-0166"

references: the works cited within the paper

oc_coci_refs(doi)
#> # A tibble: 36 x 7
#>    oci           timespan citing   creation author_sc journal_sc cited     
#>  * <chr>         <chr>    <chr>    <chr>    <chr>     <chr>      <chr>     
#>  1 020010100083… P9Y2M5D  10.1108… 2015-03… no        no         10.1001/j…
#>  2 020010100083… P41Y8M   10.1108… 2015-03… no        no         10.1002/a…
#>  3 020010100083… P25Y6M   10.1108… 2015-03… no        no         10.1002/(…
#>  4 020010100083… P17Y2M   10.1108… 2015-03… no        no         10.1007/b…
#>  5 020010100083… P2Y2M3D  10.1108… 2015-03… no        no         10.1007/s…
#>  6 020010100083… P5Y8M27D 10.1108… 2015-03… no        no         10.1007/s…
#>  7 020010100083… P2Y3M    10.1108… 2015-03… no        no         10.1016/j…
#>  8 020010100083… P1Y10M   10.1108… 2015-03… no        no         10.1016/j…
#>  9 020010100083… P12Y     10.1108… 2015-03… no        no         10.1023/a…
#> 10 020010100083… P13Y10M  10.1108… 2015-03… no        no         10.1038/3…
#> # … with 26 more rows

citations: the works that cite the paper

oc_coci_cites(doi)
#> # A tibble: 13 x 7
#>    oci             timespan citing    creation author_sc journal_sc cited  
#>  * <chr>           <chr>    <chr>     <chr>    <chr>     <chr>      <chr>  
#>  1 02001010707360… P1Y4M1D  10.1177/… 2016-07… no        no         10.110…
#>  2 02007050504361… P2Y11M2… 10.7554/… 2018-03… no        no         10.110…
#>  3 02001010405360… P2Y      10.1145/… 2018     no        no         10.110…
#>  4 02001000903361… P2Y3M4D  10.1093/… 2017-06… no        no         10.110…
#>  5 02001000007360… P1Y      10.1007/… 2017     no        no         10.110…
#>  6 02003030406361… P0Y      10.3346/… 2015     no        no         10.110…
#>  7 02001000007360… P2Y9M12D 10.1007/… 2017-12… no        no         10.110…
#>  8 02003020303362… P1Y14D   10.3233/… 2016-03… no        no         10.110…
#>  9 02003020303362… P3Y5M4D  10.3233/… 2018-08… no        no         10.110…
#> 10 02001000007360… P2Y      10.1007/… 2018     no        no         10.110…
#> 11 02001010402362… P3Y4M21D 10.1142/… 2018-07… no        no         10.110…
#> 12 02001000007360… P1Y      10.1007/… 2017     no        no         10.110…
#> 13 02001000507362… P2Y4M    10.1057/… 2017-08  no        no         10.110…

metadata: including the ISSN, volumne, title, authors, etc.

oc_coci_meta(doi)
#> # A tibble: 1 x 13
#>   reference source_id volume title author source_title oa_link citation
#> * <chr>     <chr>     <chr>  <chr> <chr>  <chr>        <chr>   <chr>   
#> 1 10.1001/… issn:002… 71     Sett… Peron… Journal Of … ""      10.1177…
#> # … with 5 more variables: page <chr>, citation_count <chr>, issue <chr>,
#> #   year <chr>, doi <chr>


Use cases

There are many example use cases using the OCC already in the literature. Here are a few of those, not necessarily using R:


To do


Get in touch

Get in touch if you have any citecorp questions in the issue tracker or the rOpenSci discussion forum.