Wednesday, January 17, 2018 From rOpenSci (https://ropensci.org/blog/2018/01/17/fulltext-v1/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Text-mining - the art of answering questions by extracting patterns, data, etc. out of the published literature - is not easy.
It’s made incredibly difficult because of publishers. It is a fact that the vast majority of publicly funded research across the globe is published in paywall journals. That is, taxpayers pay twice for research: once for the grant to fund the work, then again to be able to read it. These paywalls mean that every potential person text-mining will have different access: some have access through their university, some may have access through their company, and others may only have access to whatever happens to be open access. On top of that, access for paywall journals often depends on your IP address - something not generally on top of mind for most people.
Another hardship with text-mining is the huge number of publishers together with no standardized way to figure out the URL for full text versions of a scholarly work. There is the DOI (Digital Object Identifier) system used by Crossref, Datacite and others, but those generally help you sort out the location of the scholarly work on a web page - the html version. What one probably wants for text-mining is the PDF or XML version if available. Publishers can optionally choose to include URLs for full text (PDF and/or XML) with Crossref’s metadata (e.g., see this Crossref API call and search for “link” on the page), but the problem is that it’s optional.
fulltext
is a package to help R users address the above problems, and get published literature from the web in it’s many forms, and across all publishers.
fulltext
tries to make the following use cases as easy as possible:
fulltext
organizes functions around the above use cases, then provides flexiblity to query many data sources within that use case (i.e. function). For example fulltext::ft_search
searches for articles - you can choose among one or more of many data sources to search, passing options to each source as needed.
ft_search()
ft_get()
using the output of the previous stepft_collect()
ft_chunks()
, orquanteda
or similar text-mining packagesfulltext
has undergone a re-organization, which includes a bump in the major version to v1
to reinforce the large changes the package has undergone. Changes include:
ft_
prefix. e,g, chunks
is now ft_chunks
ft_get
has undergone major re-organization - biggest of which may be that all full text XML/plain text/PDF goes to disk to simplify the user interface. Along with this we’ve changed to using DOIs/IDs as the file namesstorr
is now imported to manage mapping between real DOIs and file paths that include normalized DOIs - and aids in the function ft_table()
for creating a data.frame of text resultsft_get()
overhaul, the only option is to write to disk. Before we attempted to provide many different options for saving XML and PDF data, but it was too complicated. This has implications for using the output of ft_get()
- the output includes the paths to the files - use ft_collect()
to collect the text if you want to use ft_chunks()
or other fulltext
functions downstream.ft_abstract
is introduced to fetch abstracts for when you just need abstractsft_table
has been introduced to gather all your documents into a data.frame to make it easy to do downstream analyses with other packagesft_search()
and ft_abstract()
ft_get_ls()
, ft_links_ls()
, and ft_search_ls()
We’ve battle tested ft_get()
on a lot of DOIs - but there still may be errors - let us know if you have any problems.
Along with an overhual of the package we have made a new manual for fulltext
. Check it out at https://books.ropensci.org/fulltext//
Install fulltext
at the time of writing binaries are not yet available on CRAN, so you’ll have to install from source from CRAN (which shouldn’t provide any problems since there’s no compiled code in the package), or install from GitHub
install.packages("fulltext")
Or get the development version:
devtools::install_github("ropensci/fulltext")
library(fulltext)
Below I’ll discuss some of the new features of the package, and not do an exhaustive tutorial to the package. Check out the manual for more details: https://books.ropensci.org/fulltext//
ft_abstract()
is a new function in fulltext
. It gives you access to absracts from the following data sources:
A quick example. Search for articles in PLOS.
res <- ft_search(query = 'biology', from = 'plos', limit = 10,
plosopts = list(fq = list('doc_type:full', '-article_type:correction',
'-article_type:viewpoints')))
Now pass the DOIs to ft_abstract()
to get abstracts:
ft_abstract(x = res$plos$data$id, from = "plos")
## <fulltext abstracts>
## Found:
## [PLOS: 90; Scopus: 0; Microsoft: 0; Crossref: 0]
The function ft_get
is the workhorse for getting full text articles.
Using this function can be tricky depending on where you want to
get articles from. While searching (ft_search
) usually doesn’t present any
barriers or stumbling blocks because search portals are generally open
(except Web Of Science), ft_get
can get frustrating because so many
publishers paywall their articles. The combination of paywalls and their
patchwork of who gets to get through them means that we can’t easily predict
who will run into problems with Elsevier, Wiley, Springer, etc. (well, mostly
those big three because they publish such a large portion of the papers).
With this version we’ve tried to bulk up the documentation as much as possible (see the manual) to make jumping over these barriers as painless as possible.
Let’s do an example to demonstrate how to use ft_get()
and some of the new
features.
Get DOIs from PLOS (excluding partial document types)
library("rplos")
dois <- searchplos(q="*:*", fl='id',
fq=list('doc_type:full',"article_type:\"research article\""),
limit=5)$data$id
Once we have DOIs we can go to ft_get()
:
res <- ft_get(dois, from = "plos")
Internally, ft_get()
attemps to write the file to disk if we can successfully
access the file - if an error occurs for any reason (see ft_get errors in the manual) we delete that file so you
don’t end up with partial/empty files.
Since ft_get()
writes files to your machine’s disk, even if a function call
to ft_get()
fails at some point in the process, the articles that we’ve
successfully retrieved aren’t affected.
In addition, we have fixed reusing cached files on disk. Thus, even if you get a failure
in a call to ft_get()
you can rerun it again and those files already retrieved will
make the function call faster.
Having a look at the output of ft_get()
, we can see that only
one list element (plos
) has data in it because we only searched for articles
from one publisher.
vapply(res, function(x) is.null(x$data), logical(1))
## plos entrez elife pensoft arxiv biorxiv elsevier wiley
## FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The output for elements searched for are the following in a list:
ft_collect()
The backend
can only be one value - ext
, representing file extension
.
We’re retaining that information for now because we may decide to add
additional backends in the future.
In the last version of fulltext
you could get the extracted text from
XML or PDF in the output of ft_get()
. This is changed. You now only
get metadata and the path to the file on disk. To get text into the object
is a separate function call to ft_collect()
ft_collect(res)
Which returns the same class object as ft_get()
returns, but the data
slot is now populated with the text.
Remember that wit this change you now should use ft_collect()
before passing
the output of ft_get()
to ft_chunks()
ft_get(dois, from = "plos") %>%
ft_collect() %>%
ft_chunks(c("doi","history")) %>%
ft_tabularize()
ft_extract()
used to have options for extracting text from PDF’s with different pieces of software. To simplify the function it only uses the pdftools package.
path <- system.file("examples", "example1.pdf", package = "fulltext")
ft_extract(path)
## <document>/Library/Frameworks/R.framework/Versions/3.4/Resources/library/fulltext/examples/example1.pdf
## Title: Suffering and mental health among older people living in nursing homes---a mixed-methods study
## Producer: pdfTeX-1.40.10
## Creation date: 2015-07-17
ft_table()
is a new function to gather the text from all your
articles into a data.frame. This should simplify analysis for most users.
(x <- ft_table())
## # A tibble: 192 x 4
## dois ids_norm text paths
## * <chr> <chr> <chr> <chr>
## 1 10.1002/9783527696109.ch41 10_1002_9783527696109_ch41 " … /Use…
## 2 10.1002/chin.199038056 10_1002_chin_199038056 "ChemInfor… /Use…
## 3 10.1002/cite.330221605 10_1002_cite_330221605 " Versamml… /Use…
## 4 10.1002/dvg.22402 10_1002_dvg_22402 "C 2013 Wi… /Use…
## 5 10.1002/jctb.5010090209 10_1002_jctb_5010090209 " … /Use…
## 6 10.1002/qua.560200801 10_1002_qua_560200801 "Internati… /Use…
## 7 10.1002/risk.200590063 10_1002_risk_200590063 " … /Use…
## 8 10.1002/scin.5591692420 10_1002_scin_5591692420 "Books\n … /Use…
## 9 10.1006/bbrc.1994.2001 10_1006_bbrc_1994_2001 "http://ap… /Use…
## 10 10.1007/11946465_42 10_1007_11946465_42 " Hoon Cho… /Use…
## # ... with 182 more rows
(you can optionally only extract text from PDFs, or only from XMLs)
We give the the DOI, the normalized DOI that we used for the file path,
the text, and the file path. You can then use this output in quanteda
or other text-mining packages (the function quanteda::kwic()
is for locating
keywords in context):
library(quanteda)
z <- corpus(x, docid_field="dois", text_field="text")
head(quanteda::kwic(z, "cell"))
##
## [10.1002/9783527696109.ch41, 253] Basic Concepts A lithium-ion battery |
## [10.1002/9783527696109.ch41, 397] in a typical lithium-ion battery |
## [10.1002/9783527696109.ch41, 461] in a typical lithium-ion battery |
## [10.1002/9783527696109.ch41, 764] material 1 Lithium LCO LiCoO2 |
## [10.1002/9783527696109.ch41, 2744] of about 3.6-3.8 V per |
## [10.1002/9783527696109.ch41, 6237] nate/ anode and hour |
##
## cell | consists of a positive electrode
## cell | . material. During discharging
## cell | . The main reactions occurring
## Cell | phones, High capacity cobalt
## cell | and highest energy densities with
## cell | cathode 4. Electrovaya,
We have lots of ideas to make fulltext
even better. Check out what we’ll be working on in the issue tracker.
Please do upgrade/install fulltext
v1.0.0
and let us know what you think.