Thursday, December 19, 2019 From rOpenSci (https://ropensci.org/blog/2019/12/19/urls-tidying/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Last year we reported on the joy of using commonmark and xml2 to parse Markdown content, like the source of this website built with Hugo, in particular to extract links, at the time merely to count them. How about we go a bit further and use the same approach to find links to be fixed? In this tech note we shall report our experience using R to find broken/suboptimal links and fix them.
We tackled a few URL issues:
We had used absolute links (using our domain name) instead of
relative
links.
https://ropensci.org/blog/
should be /blog/
.
Some internal and external links were broken, but we did not know which ones.
A few links were short links (bit.ly/blabla
) whereas it’s best to
store the actual link because the short link could break too.
Some links were http links, although the same https link might work and would be preferred over http for security.
There were links to ropensci.github.io documentation websites that can be replaced with links to our brand-new docs server.
There are three main ingredients to our website spring/fall cleaning: R tools, elbow grease and version control! Most changes happened in a branch, and although one can’t possibly look in detail at a diff of more than one hundred files, we tried to be as careful as possible.
To remove the absolute links, we resorted to using regular expressions.
library("magrittr")
# Identify the Markdown files to be examined
mds <- fs::dir_ls("content", recurse = TRUE, glob = "*.md")
mds <- mds[!grepl("\\/tutorials\\/", mds)]
# Function to fix each file if needed
fix_ropensci <- function(filepath){
readLines(filepath) -> text
# We only edit files that had the issue
if (any(grepl("http(s)?\\:\\/\\/ropensci\\.org\\/", text))){
text %>%
stringr::str_replace_all("http(s)?\\:\\/\\/ropensci\\.org\\/", "/") %>%
writeLines(filepath)
}
}
purrr::walk(mds, fix_ropensci)
Voilà!
Now, what about the links that do not link to anything? We started by extracting all links together with the relevant file paths.
library("magrittr")
website_source <- "/home/maelle/Documents/ropensci/roweb2"
mds <- fs::dir_ls(website_source, recurse = TRUE, glob = "*.md")
mds <- mds[!grepl("\\/tutorials\\/", mds)]
get_links <- function(filepath){
readLines(filepath) %>%
glue::glue_collapse(sep = "\n") %>%
commonmark::markdown_html(normalize = TRUE,
extensions = TRUE) %>%
xml2::read_html() %>%
xml2::xml_find_all("//a") %>%
xml2::xml_attr("href") -> urls
tibble::tibble(filepath = filepath,
url = urls)
}
all_urls <- purrr::map_df(mds, get_links)
all_urls <- all_urls %>%
dplyr::mutate(url = stringr::str_remove_all(url, "#.*"),
url = stringr::str_remove(url, "\\/$"))
all_urls
## # A tibble: 14,234 x 2
## filepath url
## <chr> <chr>
## 1 /home/maelle/Documents/ropensci/roweb2/content/aut… https://adamhsparks.netl…
## 2 /home/maelle/Documents/ropensci/roweb2/content/aut… https://aldocompagnoni.w…
## 3 /home/maelle/Documents/ropensci/roweb2/content/aut… http://robitalec.ca
## 4 /home/maelle/Documents/ropensci/roweb2/content/aut… https://alison.rbind.io
## 5 /home/maelle/Documents/ropensci/roweb2/content/aut… https://dobb.ae
## 6 /home/maelle/Documents/ropensci/roweb2/content/aut… https://thestudyofthehou…
## 7 /home/maelle/Documents/ropensci/roweb2/content/aut… https://annakrystalli.me
## 8 /home/maelle/Documents/ropensci/roweb2/content/aut… https://paleantology.com…
## 9 /home/maelle/Documents/ropensci/roweb2/content/aut… https://aurielfournier.g…
## 10 /home/maelle/Documents/ropensci/roweb2/content/aut… https://faculty.washingt…
## # … with 14,224 more rows
We chose a different method to find those within and outside of our website.
When building a Hugo website, one gets a sitemap, which is basically a collection of links to all the pages of the website. If an internal link is not in the sitemap, it does not exist.
We generated the sitemap from within the website folder to extract links.
cwd <- getwd()
setwd(website_source)
p <- processx::process$new("hugo", args = "server", echo = TRUE)
## Running hugo server
Sys.sleep(120)
localhost <- "http://localhost:1313"
browseURL(localhost)
paste0(localhost, "/sitemap.xml") %>%
xml2::read_xml() %>%
xml2::xml_ns_strip() %>%
xml2::xml_find_all("//loc") %>%
xml2::xml_text() %>%
stringr::str_remove_all(localhost) %>%
stringr::str_remove("\\/$") -> links
p$kill()
## [1] TRUE
setwd(cwd)
head(links)
## [1] "/authors/scott-chamberlain" "/tags/api"
## [3] "/authors" "/tags/http"
## [5] "/technotes/2019/12/11/http-testing" "/tags/mocking"
So these are the existing internal links. We could also have extracted them using the multi-request features of curl.
Let’s now extract the internal links we used in the content.
all_urls %>%
dplyr::filter(!grepl("^http", url)) ->
internal_urls
head(internal_urls)
## # A tibble: 6 x 2
## filepath url
## <chr> <chr>
## 1 /home/maelle/Documents/ropensci/roweb2/content/blog… /community
## 2 /home/maelle/Documents/ropensci/roweb2/content/blog… /community
## 3 /home/maelle/Documents/ropensci/roweb2/content/blog… /blog/2013/05/10/introdu…
## 4 /home/maelle/Documents/ropensci/roweb2/content/blog… /about
## 5 /home/maelle/Documents/ropensci/roweb2/content/blog… /community
## 6 /home/maelle/Documents/ropensci/roweb2/content/blog… /contact
So, what are the missing ones? We used the code below to identify them and then we manually fixed or removed them.
internal_urls %>%
dplyr::filter(!url %in% links)
To identify broken external URLs, we ran
crul::ok()
on all
of them and created a big spreadsheet of URLs to look at.
external_urls <- dplyr::anti_join(all_urls, internal_urls,
by = c("filepath", "url"))
unique_urls <- unique(external_urls[, "url"])
ok <- memoise::memoise(
ratelimitr::limit_rate(crul::ok,
ratelimitr::rate(1, 1)))
get_ok <- function(url){
message(url)
ok(url)
}
unique_urls <- unique_urls %>%
dplyr::group_by(url) %>%
dplyr::summarise(ok = get_ok(url))
external_urls <- dplyr::left_join(external_urls, unique_urls,
by = "url")
external_urls <- dplyr::arrange(external_urls, url)
parse_one_post <- function(path){
if (grepl("\\_index", path)){
return(NULL)
}
lines <- suppressWarnings(readLines(path, encoding = "UTF-8"))
yaml <- blogdown:::split_yaml_body(lines)$yaml
yaml <- glue::glue_collapse(yaml, sep = "\n")
yaml <- yaml::yaml.load(yaml)
meta <- tibble::tibble(date = anytime::anydate(yaml$date),
author = yaml$authors,
title = yaml$title,
software_peer_review = "Software Peer Review" %in% yaml$tags,
type = dplyr::if_else(grepl("\\/blog\\/", path),
"blog post", "tech note"),
filepath = path)
meta
}
info <- purrr::map_df(mds[grepl("blog", mds)|grepl("technotes",mds)], parse_one_post)
info <- dplyr::group_by(info, filepath) %>%
dplyr::summarize(date = date[1],
author = toString(author),
title = title[1],
type = type[1])
bad_urls <- dplyr::filter(external_urls, !ok)
bad_urls <- dplyr::left_join(bad_urls, info, by = "filepath")
readr::write_csv(bad_urls, "urls.csv")
From that spreadsheet hundreds of links were examined manually! When there was a replacement link, we used it thanks to a code looping over all links. For the about 50 links without replacement, we amended the posts by hand to make sure to take context into account (e.g. removing the link vs. removing the whole sentence presenting it).
There were quite a few false positives i.e. actually valid URLs. This
lead to some edits in
crul::ok()
and the
following wisdom:
Sometimes you’ll get an error for the HEAD request but not the GET request.
# use get verb instead of head
crul::ok("http://animalnexus.ca")
## [1] FALSE
crul::ok("http://animalnexus.ca", verb = "get")
## [1] TRUE
Sometimes you’ll need an user-agent whose name does not contain “curl”,
which the default user-agent of crul contains (crul:::make_ua()
is
libcurl/7.58.0 r-curl/4.3 crul/0.9.1.9991).
# some urls will require a different useragent string
# they probably regex the useragent string
crul::ok("https://doi.org/10.1093/chemse/bjq042")
## GnuTLS recv error (-54): Error in the pull function.
## [1] FALSE
crul::ok("https://doi.org/10.1093/chemse/bjq042", verb = "get", useragent = "foobar")
## [1] TRUE
We only identified short links using the bit.ly service. We found the corresponding link by running the function below. There were actually only 4 short links so that was quick.
get_long <- function(url){
crul::HttpClient$new(url)$get()$url
}
get_long("http://bit.ly/2JfrzmE")
## [1] "https://www.timeanddate.com/worldclock/fixedtime.html?msg=rOpenSci+Community+Call+on+Reproducible+Research+with+R&iso=20190730T09&p1=791&ah=1"
HTTPS: HTTP + security pic.twitter.com/pkk7ZVzjz3
— 🔎Julia Evans🔍 (@b0rk) August 9, 2019
We proceeded as previously when checking external links, except we used
better settings for crul::ok()
.
http <- dplyr::filter(all_urls, grepl("http\\:", url))
http <- dplyr::mutate(http, https = sub("http\\:", "https:", url))
unique_urls <- unique(http[, "https"])
ok <- memoise::memoise(
ratelimitr::limit_rate(crul::ok,
ratelimitr::rate(1, 1)))
get_ok <- function(url){
message(url)
ok(url, verb = "get", useragent = "Maëlle Salmon checking links")
}
unique_urls <- unique_urls %>%
dplyr::group_by(https) %>%
dplyr::summarise(ok = get_ok(https))
http <- dplyr::left_join(http, unique_urls, by = "https")
httpsok <- dplyr::filter(http, ok)
modify_url <- function(index, df = httpsok) {
row <- df[index,]
readLines(row$filepath) %>%
stringr::str_replace_all(row$url, row$https) %>%
writeLines(row$filepath)
}
purrr::walk(seq_len(nrow(httpsok)), modify_url)
You can browse the related
PR. Note that in the
above, we could have used
urltools
to parse URLs and
extract their scheme (http or https).
To replace some ropensci.github.io links with docs.ropensci.org links, we used the brute force approach below (there were only about 80 such links).
dotgithub <- dplyr::filter(all_urls, urltools::domain(url) == "ropensci.github.io")
make_docs_url <- function(url, ropensci_pkgs = ropensci_pkgs) {
message(url)
newurl <- url
urltools::domain(newurl) <- "docs.ropensci.org"
if (crul::ok(newurl, verb = "get", useragent = "Maëlle Salmon checking links")) {
return(newurl)
} else {
return(url)
}
}
dotgithub <- dotgithub %>%
dplyr::group_by(url) %>%
dplyr::mutate(newurl = make_docs_url(url))
modify_url <- function(index, df = dotgithub) {
row <- df[index,]
readLines(row$filepath) %>%
stringr::str_replace_all(row$url, row$newurl) %>%
writeLines(row$filepath)
}
purrr::walk(seq_len(nrow(dotgithub)), modify_url)
In this tech note we saw how to use a combination of regular expressions, commonmark, xml2 and crul to identify links to be fixed in Markdown content. For html content, check out the experimental checker package by François Michonneau. For packages, have a look at Bob Rudis’ RStudio add-in.
Some of the issues we fixed, like using relative rather than absolute links, and not storing shortlinks, could be avoided in the future by stricter URL guidelines for new content. We also plan to stop using Click here links, refer to this page about why Click here links are bad.
Now, a remaining issue is the frequency at which URL cleaning should occur. In our dev guide, we clean links before each release, but this website has no such schedule, so let’s hope we remember to clean URLs once in a while. Maybe some old pages could also be “archived” like this example. When do you clean URLs in your content, and how?