Friday, January 13, 2023 From rOpenSci (https://ropensci.org/blog/2023/01/13/curl5-release/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
A new major version of the curl package has been released to CRAN. This release brings both internal improvements and new user-facing functionality, in particular with respect to concurrent downloads. From the NEWS file:
curl 5.0.0
- New function multi_download() which supports concurrent downloads and resuming download for large files, while giving detailed progress information.
- Windows: updated libcurl to 7.84.0 + nghttp2
- Windows: default to CURLSSLOPT_NATIVE_CA when using openssl unless an envvar with CURL_CA_BUNDLE is set.
- Use the new optiontype API for type checking if available (libcurl 7.73.0)
The curl package is used by most other R packages for performing HTTP requests. Over 60% of rOpenSci packages directly or indirectly depend on curl for network interaction, hence improvements and bugs in curl have a big impact on the entire ecosystem.
The most exciting new feature is multi_download(): an advanced alternative to curl_download(). It can perform many requests concurrently, with nice progress updates and support for interrupting and resuming large files. This function does not raise an error if individual requests fail; instead it returns a data frame with information about the status of each request.
pkg <- 'curl'
mirror <- 'https://cloud.r-project.org'
db <- available.packages(repos = mirror)
packages <- c(pkg, tools::package_dependencies(pkg, db = db, reverse = TRUE)[[pkg]])
versions <- db[packages,'Version']
urls <- sprintf("%s/src/contrib/%s_%s.tar.gz", mirror, packages, versions)
res <- curl::multi_download(urls)
res
# A tibble: 316 × 10
#    success status_…¹ resum…² url   destf…³ error type  modified             time
#    <lgl>       <int>   <dbl> <chr> <chr>   <chr> <chr> <dttm>              <dbl>
#  1 TRUE          200       0 http… curl_5… NA    appl… 2023-01-12 18:10:03 0.260
#  2 TRUE          200       0 http… abbyyR… NA    appl… 2019-06-25 04:30:07 0.713
#  3 TRUE          200       0 http… addins… NA    appl… 2021-01-10 18:50:12 0.214
#  4 TRUE          200       0 http… alfred… NA    appl… 2021-07-26 10:20:03 0.225
#  5 TRUE          200       0 http… allcon… NA    appl… 2022-08-11 10:00:07 0.226
#  6 TRUE          200       0 http… AMAPVo… NA    appl… 2022-12-05 09:22:33 0.266
#  7 TRUE          200       0 http… AnnoPr… NA    appl… 2022-11-14 08:30:13 0.891
#  8 TRUE          200       0 http… anyfli… NA    appl… 2022-08-12 15:40:03 0.394
#  9 TRUE          200       0 http… anyLib… NA    appl… 2018-11-05 15:00:04 0.282
# 10 TRUE          200       0 http… aopdat… NA    appl… 2022-08-31 13:10:04 0.237
# … with 306 more rows, 1 more variable: headers <list>, and abbreviated
#   variable names ¹status_code, ²resumefrom, ³destfile
all.equal(unname(tools::md5sum(res$destfile)), unname(db[packages, 'MD5sum']))
# TRUE
Above is a small example from the ?multi_download manual, which downloads all reverse dependencies for a given CRAN package. It downloads 316 files, 261.41 MB in total. On a fast server, the multi_download() part takes about 1 or 2 seconds.
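Because multi_download() does not raise an error when individual requests fail, it is worth inspecting the returned data frame before relying on the files. Below is a minimal sketch, assuming the column names shown in the output above (success, status_code, url, destfile, error). The helper name failed_downloads() and the mock data frame are hypothetical and only serve as illustration. The sketch also checks status_code, on the assumption that a request can complete yet still return an error page; the ?multi_download manual is the place to verify the exact meaning of each column.

```r
# Hypothetical helper: return the rows of a multi_download() result that
# did not complete with a usable response. Assumes the columns 'success',
# 'status_code', 'url', 'destfile' and 'error' from the printed output above.
failed_downloads <- function(res) {
  # A row counts as OK only if the request completed and the status is < 400;
  # NA values are treated as failures
  ok <- res$success %in% TRUE &
    !is.na(res$status_code) &
    res$status_code < 400
  res[!ok, c("url", "destfile", "status_code", "error"), drop = FALSE]
}

# Mock result standing in for the output of curl::multi_download()
mock <- data.frame(
  success = c(TRUE, TRUE, FALSE),
  status_code = c(200L, 404L, 0L),
  url = c("https://example.org/a", "https://example.org/b", "https://example.org/c"),
  destfile = c("a", "b", "c"),
  error = c(NA, NA, "Timeout was reached"),
  stringsAsFactors = FALSE
)

failed <- failed_downloads(mock)
print(failed$url)
```

With a real result, the urls and destfiles of the failed rows could be passed to multi_download() again, possibly with resume = TRUE for large files that were interrupted.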
The function scales well in terms of the number of requests. Below is an example, which downloads the DESCRIPTION file for the first 3000 CRAN packages. On a fast server (with HTTP/2 support) this again takes about 2 or 3 seconds.
mirror <- 'https://cloud.r-project.org'
pkgs <- row.names(available.packages(repos = mirror))[1:3000]
urls <- sprintf('%s/web/packages/%s/DESCRIPTION', mirror, pkgs)
files <- sprintf('descriptions/%s.txt', pkgs)
dir.create('descriptions', showWarnings = FALSE)
res <- curl::multi_download(urls, files)
This second example will especially benefit from HTTP/2 support because there are many small files that can be multiplexed, whereas with HTTP/1.1 these need to be requested one after another.
The Windows binaries now use libcurl 7.84.0 with nghttp2 1.51.0. The latter brings support for HTTP/2, but only when using the OpenSSL TLS back-end, which is not (yet) the default. You can change this by setting the CURL_SSL_BACKEND environment variable in your ~/.Renviron file and then restarting R. The Windows vignette explains this in more detail.
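To make this configuration step a bit more concrete, here is a small sketch for checking the setting from within R. It assumes the value openssl for CURL_SSL_BACKEND; the Windows vignette remains the authoritative source for the accepted values. The helper name ssl_backend_setting() is hypothetical.

```r
# Assumed line in ~/.Renviron (restart R afterwards):
# CURL_SSL_BACKEND=openssl

# Hypothetical helper: report what the environment variable is set to
ssl_backend_setting <- function(value = Sys.getenv("CURL_SSL_BACKEND")) {
  if (nzchar(value)) value else "(not set, libcurl default is used)"
}

print(ssl_backend_setting())

# If the curl package is installed, show the TLS back-ends and the HTTP/2
# support that libcurl itself reports
if (requireNamespace("curl", quietly = TRUE)) {
  info <- curl::curl_version()
  print(info$ssl_version)
  print(info$http2)
}
```

Note that the http2 field only says whether libcurl was built with HTTP/2 support; whether a given request actually uses HTTP/2 still depends on the active TLS back-end and on the server.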
To test if HTTP/2 is working you can perform a verbose request:
library(curl)
multi_download('https://httpbin.org/get', tempfile(), verbose = TRUE)
And the output will show HTTP/2 200 somewhere in the response:
...
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 200
...
Right now OpenSSL is not the default, because the native Windows TLS back-end may be more robust, which has to do with the next topic.
As mentioned above, libcurl on Windows can use one of two SSL back-ends (for https): SecureChannel (the native Windows TLS implementation) or OpenSSL. OpenSSL is also used by most other operating systems and is therefore better tested; moreover, it supports HTTP/2. However, there was always a big limitation with OpenSSL on Windows: it required us to ship a ca-bundle with root certificates, which gets outdated quickly and may not work well on corporate networks that use custom SSL certificates.
This has now changed because libcurl has gained a new experimental option, CURLSSLOPT_NATIVE_CA, which lets OpenSSL import the root certificates from the native Windows certificate store instead of a custom ca-bundle. The R package now enables this option by default when using the OpenSSL back-end. As a result, curl in R should support the same TLS connections regardless of which SSL back-end is in use. This might make OpenSSL once again the preferable option, and if this works well we may make it the default in a future version of the R package.
The final topic is mostly an internal change, but I’m pretty proud of it because it is based on functionality in libcurl that I proposed myself, and is now finally widely available.
At the curl-up 2020 conference I gave a presentation titled "5 years of libcurl bindings for R", after which we had a discussion on potential improvements for language bindings, such as the R package. Eventually this led to the proposal of a new API that exposes a list of supported libcurl options and their types to the language binding. This is important so that when users in R set an option in new_handle(), it can be verified that the option is valid and has the correct type (e.g. string, number, vector), because passing invalid types to libcurl will result in a crash.
The proposal was merged later in 2020, and is now (2 years later) available in the stable versions of most operating systems. Version 5.0.0 of the R package (conditionally) uses this API if available, which makes the bindings safer to use.
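The kind of checking described above can be observed from R with a small sketch. It assumes the curl package is installed. The helper name setopt_fails() and the option name not_a_real_option are hypothetical, and because the exact error messages may differ between versions, the code only records whether an error occurred.

```r
library(curl)

# Hypothetical helper: TRUE if creating a handle with these options errors
setopt_fails <- function(...) {
  tryCatch({
    new_handle(...)
    FALSE
  }, error = function(e) TRUE)
}

# Valid option names with values of the expected type
print(setopt_fails(verbose = FALSE, useragent = "my-agent/1.0"))

# An option name that libcurl does not know about
print(setopt_fails(not_a_real_option = 1))

# A string where a number is expected
print(setopt_fails(timeout = "ten seconds"))

# curl_options() lists the option names known to the linked libcurl
print(head(names(curl_options("timeout"))))
```

The point of rejecting such values in R is that the mistake surfaces as an ordinary R error instead of reaching libcurl.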
It's been a journey, but with help from friends like @opencpu, today we landed a new API to libcurl to query it for details about "easy options". This should allow for better libcurl bindings in the future.
Suitable staring point for reading up: https://t.co/OlAWuBuDaR
daniel:// stenberg:// (@bagder), August 27, 2020