Thursday, September 18, 2025 From rOpenSci (https://ropensci.org/blog/2025/09/18/markdown-programmatic-parsing/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
If life gives you a bunch of Markdown files to analyse or edit, do you warm up your regex muscles and get going? How about using more specific tools instead? In this post, we shall give an overview of programmatic ways to parse and edit Markdown files: Markdown, R Markdown, Quarto, Hugo files, you name it.
Markdown is a (punny, eh) markup language created by John Gruber and Aaron Swartz. Here is an example:
# My first header
Some content, with parts in **bold** or *italic*.
Let me add a [link](https://ropensci.org).
Different Markdown files can lead to the same output, for instance this is equivalent to our first example:
My first header
===============
Some content, with parts in __bold__ or _italic_. Let me add a [link](https://ropensci.org).
Furthermore, there are different flavours or specifications (specs) of Markdown [1], which add some extended syntax, like emojis written with colons.
R users will commonly interact with different Markdown flavors through their usual tools:
Many tools using Markdown also accept frontmatter: metadata at the top of Markdown files, for instance YAML, TOML or JSON. Here is an example with a YAML frontmatter:
---
title: My cool thing
author: Myself
---
Some content, *nice* content.
Most often R users will write Markdown manually, or with the help of an editor such as the Positron visual editor or the RStudio IDE visual editor. But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time. This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming 🧙!
In Markdown you can add code chunks that will be properly formatted and highlighted, using the following syntax:
```r
1 + 1
```
Tools for literate programming such as knitr (for R Markdown and Quarto) will let you add code chunks that will be executed to render the document:
```{r}
#| label: my-chunk
#| echo: true
1 + 1
```
The latter syntax, for executable code chunks, is not necessarily handled properly by off-the-shelf “normal” tools like Pandoc. This is something to keep in mind if you’re dealing with documents that contain executable code chunks.
Imagine you need to create a bunch of different R Markdown files, for instance for students to use as personalized exercises. In that case, you can create a boilerplate document as a template, and create its different output versions using a templating tool.
Templating tools include knitr::knit_expand() by Yihui Xie, the whisker package, the brew package, and Pandoc. The simplest example of the whisker package might furthermore remind you of the glue package.
A common workflow would be to write a template with placeholders such as {{name}}, then render it once per set of values. Here’s an example Markdown file that we can use as a template:
---
title: "Homework assignment 1"
author: "{{name}}"
---
Create a normal distribution with a mean of {{mean}} and a standard deviation of {{sd}}:
```{r solution-1}
# hint: use the rnorm function
```
Using the workflow below, we can create different Markdown documents corresponding to different students.
# generate student variables ----
students <- c("Maëlle", "Christophe", "Zhian")
n <- length(students)
key <- data.frame(
  name = students,
  mean = rpois(n, 5),
  sd = sprintf("%.1f", runif(n)),
  file = sprintf("%s-hw.md", students)
)
# render and write assignment from template ----
make_assignment <- function(key, template) {
  # iterate over the rows of `key` instead of relying on the global `n`
  lapply(seq_len(nrow(key)), function(i) {
    new <- whisker::whisker.render(template, data = key[i, ])
    brio::write_lines(new, key$file[i])
  })
  return(invisible())
}
md <- brio::read_lines("hw-template.md")
make_assignment(key, template = md)
print(key)
#>         name mean  sd             file
#> 1     Maëlle    3 0.9     Maëlle-hw.md
#> 2 Christophe    4 0.2 Christophe-hw.md
#> 3      Zhian    8 0.8      Zhian-hw.md
Here’s what Zhian’s homework looks like:
---
title: "Homework assignment 1"
author: "Zhian"
---
Create a normal distribution with a mean of 8 and a standard deviation of 0.8:
```{r solution-1}
# hint: use the rnorm function
```
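The same kind of substitution can be sketched with knitr::knit_expand(), the first templating tool listed above. This is only a minimal illustration, assuming the knitr package is installed; the shortened template and the values are made up for the example and the result is kept in memory rather than written to a file.

```r
# A shortened, in-memory version of the homework template (illustrative only).
template <- c(
  "---",
  'title: "Homework assignment 1"',
  'author: "{{name}}"',
  "---",
  "",
  "Create a normal distribution with a mean of {{mean}} and a standard deviation of {{sd}}."
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit_expand() evaluates what sits between {{ and }} using the named arguments.
  homework <- knitr::knit_expand(
    text = template,
    name = "Zhian",
    mean = 8,
    sd = "0.8"
  )
  cat(homework, sep = "\n")
}
```

Unlike whisker, knit_expand() evaluates the placeholder content as R code, so it is best kept for templates you wrote yourself.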
You can use string manipulation tools to parse Markdown if you are sure of the Markdown variants your code will get as input, or if you are willing to grow your codebase to accommodate many edge cases… which in the end means you are writing an actual Markdown parser. Not for the faint of heart… nor necessary if you read the section after this one. 😌
You’d detect headings using, for instance, grep("^#", markdown_lines) [2].
Examples of string manipulation tools include base R (sub(), grep() and friends), stringr (and stringi), and xfun::gsub_file().
Although string manipulation tools are of limited usefulness when parsing Markdown, they can complement the actual parsing tools. Even if using specific Markdown parsing tools will help you write fewer regular expressions yourself… they won’t completely free you from them.
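To make the heading example concrete, here is a small base R sketch of why the naive pattern misfires, together with one possible workaround. The sample document is invented, and the fence-tracking trick is only an illustration of how quickly edge cases pile up, not a recommended parser.

```r
# Build the code fence programmatically so this example can itself live in a fenced block.
fence <- strrep("`", 3)

markdown_lines <- c(
  "# My first header",
  "",
  "Some content.",
  "",
  paste0(fence, "r"),
  "# a code comment, not a heading",
  "1 + 1",
  fence,
  "",
  "## A real subheading"
)

# Naive approach: any line starting with "#" is treated as a heading.
naive <- grep("^#", markdown_lines)

# Workaround: an odd running count of fence lines means the line is
# a fence opener or sits inside a fenced code block.
is_fence <- startsWith(markdown_lines, fence)
in_code <- cumsum(is_fence) %% 2 == 1
fence_aware <- which(grepl("^#{1,6} ", markdown_lines) & !in_code)

naive        # 1 6 10: line 6 is a code comment
fence_aware  # 1 10
```

Even this version ignores headings underlined with = or -, fences written with tildes, and indented code blocks, which is exactly the slippery slope towards writing your own parser.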
Abstract representation manipulation tools are fantastic, and numerous. These translate the Markdown document into a data structure called an Abstract Syntax Tree (AST) that gives you fine-grained control over specific elements of the document (e.g. individual headings or links regardless of how they are written). With a formal data structure, you can programmatically manipulate the Markdown document by adding, removing, or manipulating pieces of Markdown in a standardized way.
Some of these tools allow you to read, edit and write back to the document.
We will only mention the tools you can directly use from R.
Let’s say you have created a bunch of tutorials that link to a website containing a gallery of extensions for a popular plotting package. Let’s also say that one day, someone discovers that the link to the website is suddenly redirecting to a potentially malicious site that is most certainly not related to the grammar of graphics, and you need to replace all instances of that link to **redacted**. Since links in Markdown could be written any number of ways, regex is not going to help you, but a fine-grained Markdown parser will!
A workflow for this situation would be:
The tinkr package, dreamed up by Maëlle Salmon and maintained by Zhian Kamvar, parses Markdown to XML using Commonmark and allows you to extract and manipulate Markdown using XPath via the xml2 package. Tinkr writes the XML back to Markdown using XSLT. The YAML metadata is available as a string. Tinkr supports executable code chunks.
The tinkr package is used in the babeldown and aeolus packages.
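As a rough sketch of the link replacement scenario with tinkr, assuming the tinkr and xml2 packages are installed: the URLs, the file contents and the variable names below are all invented for the example, and only inline links are covered.

```r
if (requireNamespace("tinkr", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {
  # Hypothetical URLs, for illustration only.
  old_url <- "https://old.example.com/gallery"
  new_url <- "https://new.example.com/gallery"

  # A small Markdown file with the same link written in two ways.
  path <- tempfile(fileext = ".md")
  writeLines(c(
    "# Tutorial",
    "",
    paste0("Browse the [gallery](", old_url, ")."),
    "",
    paste0("See the [extensions](", old_url, " \"Gallery\") too.")
  ), path)

  # Parse the file to XML, then select link nodes by destination with XPath.
  doc <- tinkr::yarn$new(path)
  xpath <- sprintf(".//md:link[@destination='%s']", old_url)
  links <- xml2::xml_find_all(doc$body, xpath, ns = tinkr::md_ns())

  # Edit the nodes in place and write the document back to Markdown.
  xml2::xml_set_attr(links, "destination", new_url)
  out <- tempfile(fileext = ".md")
  doc$write(out)
  updated <- readLines(out)
}
```

The XPath query targets the link destination itself, so it does not matter what the link text or title looks like.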
The md4r package is a recent experimental package maintained by Colin Rundel. It is an R wrapper around the MD4C (Markdown for C) library and represents the AST as a nested list with attributes in R. The development version of the package has utilities for constructing Markdown documents programmatically.
With Pandoc, which we presented in a tech note, you can parse a Markdown file to a Pandoc Abstract Syntax Tree (either in its native format, or in JSON).
How would you use Pandoc to edit and write back a Markdown file?
Using Lua filters: Pandoc converts the document to an AST in its native format, Lua filters let you process and tweak it, and then Pandoc can write it back to Markdown.
Using JSON filters: Pandoc converts the document to an AST and outputs a JSON representation of it; any tool can then modify this JSON file and provide a modified version to Pandoc to convert back to Markdown.
Note that Pandoc does not support executable code chunks, as it won’t be able to parse an executable code chunk as a CodeBlock.
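The JSON route can be sketched from R as follows. This is only an illustration, assuming the pandoc command-line tool is on the PATH and the jsonlite package is installed; the sample document and the edit (demoting every heading by one level) are made up for the example.

```r
if (nzchar(Sys.which("pandoc")) &&
    requireNamespace("jsonlite", quietly = TRUE)) {
  md_in <- tempfile(fileext = ".md")
  writeLines(c("## Title", "", "Some *text*."), md_in)

  # 1. Markdown -> JSON representation of the Pandoc AST.
  ast_json <- system2(
    "pandoc",
    c("--from", "markdown", "--to", "json", shQuote(md_in)),
    stdout = TRUE
  )
  ast <- jsonlite::fromJSON(
    paste(ast_json, collapse = "\n"),
    simplifyVector = FALSE
  )

  # 2. Edit the AST: the first element of a Header's content is its level.
  ast$blocks <- lapply(ast$blocks, function(block) {
    if (identical(block$t, "Header")) {
      block$c[[1]] <- block$c[[1]] + 1L
    }
    block
  })

  # 3. JSON -> Markdown.
  json_out <- tempfile(fileext = ".json")
  writeLines(jsonlite::toJSON(ast, auto_unbox = TRUE), json_out)
  md_out <- system2(
    "pandoc",
    c("--from", "json", "--to", "markdown", shQuote(json_out)),
    stdout = TRUE
  )
}
```

Reading the JSON with simplifyVector = FALSE keeps the AST as nested lists, which makes it easier to write it back without changing its shape.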
Nic Crane has an experimental package called parseqmd that uses this strategy, parsing the output with the jsonlite package. You can also parse to, say, HTML, and then back to Markdown. The benefit of parsing to HTML is that you can use a package such as xml2 or rvest to extract and manipulate the elements.
The parsermd package is another package maintained by Colin Rundel and is an “implementation of a formal grammar and parser for R Markdown documents using the Boost Spirit X3 library. It also includes a collection of high level functions for working with the resulting abstract syntax tree.”
This package has functionality for a tidy workflow allowing you to select different sections of the document. One useful feature is the function rmd_check_template(), which allows you to compare student Markdown submissions against a standard template. You can watch his RStudio::conf(2021) talk about it.
The parsermd package even allows you to modify documents.
The lightparser package by Sébastien Rochette “splits your rmarkdown or quarto files by sections into a tibble: titles, text, chunks; rebuilds the file from the tibble”. It can be used to translate documents for instance.
When parsing and editing Markdown, then writing it back to Markdown, some undesired changes might appear. For instance, with tinkr, list items all start with a - even if in the original document they started with a *. With md4r, lists that are indented with extra space will be readjusted.
Depending on your use case you might want to find ways to mitigate such losses, for instance only re-writing the lines you made intentional edits to.
You can choose a parser based on what it lets you manipulate the Markdown with: if you prefer XML [3] and HTML to nested lists for instance, you might prefer using tinkr or Pandoc. If the high-level functions of md4r or parsermd are suitable for your use case, you might prefer one of them.
Importantly, if your documents contain executable code chunks, you need to use a tool that supports them, such as parsermd, lightparser or tinkr.
Another important criterion is to choose a parser that’s as close as possible to the use case of your Markdown files. If you are only going to work with Markdown files for GitHub, commonmark/tinkr is an excellent choice since GitHub itself uses commonmark. Now, your work might encompass different sorts of Markdown files that will be used by different tools. For instance, the babeldown package processes any Markdown file [4]: Markdown, R Markdown, Quarto, Hugo. In that case, or if there is no R parser doing exactly what your Markdown’s end user does, you need to pay attention to the quirks of that end user. Maybe you have to throw Pandoc raw attributes around a Hugo shortcode, for instance. Furthermore, if you need to parse certain elements, like again Hugo shortcodes, you might need to write the parsing code yourself, that is, regular expressions.
Programmatically parsing and editing R code is out of the scope of this post, but closely related enough to throw in a few tips.
As with Markdown, you might need to use regular expressions, but that’s a risky approach as for instance plot (x) and plot(x) are both valid function calls in R.
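A quick way to see the difference is to let R’s own parser find the calls. The sketch below uses only base R (parse() and utils::getParseData()); the sample code lines are invented for the example.

```r
code <- c(
  "plot (x)",
  "plot(x)",
  "# plot(x) in a comment",
  "label <- 'plot(x)'"
)

# Keep the source references so that token positions are available.
parsed <- parse(text = code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# Function calls are tagged by the parser, whatever the spacing.
calls <- tokens[
  tokens$token == "SYMBOL_FUNCTION_CALL",
  c("line1", "col1", "text")
]
calls
```

Both calls on lines 1 and 2 are reported, while the comment and the string are not, and the line1 and col1 columns tell you where each call sits in the original code.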
You can parse the code to XML using base R parsing and xmlparsedata, then manipulate the XML with XPath. To write code back, you can make use of the attributes of each node that indicate the original lines and columns.
So a possible workflow, as exemplified in Maëlle’s blog post is:
treesitter by Davis Vaughan “provides R bindings to tree-sitter, an incremental parsing system”.
We dedicated this whole post to the body of Markdown documents. What about the metadata contained in their frontmatter, like:
---
title: "Cool doc"
author: "Jane Doe"
---
To extract or edit YAML/TOML/JSON metadata, you first need to decapitate Markdown documents. For instance, rmarkdown has a function called rmarkdown::yaml_front_matter() to extract the YAML metadata of an R Markdown document; the quarto R package has a function called quarto::quarto_inspect() that among other things outputs the metadata.
You might read the lines of the Markdown document using readLines() or brio::read_lines(), before resorting to regular expressions to identify the start and end of the frontmatter depending on its format.
Then, to handle YAML you’d use {yaml}, to handle TOML you could use {tomledit} or {RcppTOML}, to handle JSON you could use {jsonlite}.
Finally, if you need to write back the Markdown document, you’d write back its lines using writeLines() or brio::write_lines().
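The steps above could look like the following for a YAML frontmatter. The helper split_frontmatter() is a hypothetical function written for this example, not part of any package, and it assumes the frontmatter is delimited by --- lines with the first one on line 1; the editing step additionally assumes the yaml package is installed.

```r
# Hypothetical helper: split the lines of a document into frontmatter and body.
split_frontmatter <- function(lines) {
  delimiters <- which(lines == "---")
  has_yaml <- length(delimiters) >= 2 && delimiters[1] == 1
  if (!has_yaml) {
    return(list(frontmatter = character(), body = lines))
  }
  end <- delimiters[2]
  list(
    frontmatter = lines[seq_len(end - 1)[-1]], # lines between the delimiters
    body = lines[-seq_len(end)]                # everything after the closing ---
  )
}

# In practice these lines would come from readLines() or brio::read_lines().
doc <- c(
  "---",
  'title: "Cool doc"',
  'author: "Jane Doe"',
  "---",
  "",
  "Some content, *nice* content."
)
parts <- split_frontmatter(doc)

if (requireNamespace("yaml", quietly = TRUE)) {
  # Parse, edit and re-serialise the metadata.
  meta <- yaml::yaml.load(paste(parts$frontmatter, collapse = "\n"))
  meta$date <- "2025-09-18"
  new_yaml <- strsplit(yaml::as.yaml(meta), "\n")[[1]]

  # Reassemble the document; writeLines() would then write it back to disk.
  doc <- c("---", new_yaml, "---", parts$body)
}
```

Note that re-serialising the metadata may change its formatting (quoting, for instance), the same kind of round-trip loss mentioned earlier for the document body.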
The pegboard package, created by Zhian Kamvar and maintained by The Carpentries, parses and validates Carpentries’ lessons for structural Markdown elements, including valid links, alt-text, and known fenced-divs thanks to tinkr. This package was instrumental in converting all of The Carpentries lesson infrastructure from Jekyll’s Markdown syntax to Pandoc’s Markdown [5].
The babeldown package maintained by Maëlle Salmon transforms Markdown to XML, sends it to DeepL API for translation, and writes the results back to Markdown, also using tinkr.
In this post we explained how to best parse and edit Markdown files. To create boilerplate documents (think: mailmerge), we recommended templating tools such as knitr::knit_expand(), the whisker package, the brew package, and Pandoc. To edit small parts of a document, we brought up string manipulation tools, i.e. regular expressions, with base R (sub(), grep() and friends), stringr (and stringi), and xfun::gsub_file(). For heavier, and safer, manipulation, we listed tools that manipulate the abstract representation of documents: tinkr, md4r, Pandoc, parseqmd, parsermd, lightparser. We also mentioned tools for working with the R code inside code cells, and for working with the YAML/TOML/JSON frontmatter.
What do you use to handle Markdown files?
[1] As of 2024-06-20, there are 76 programs that parse Markdown, some with their own unique flavour.
[2] But this would also detect code comments! Don’t do this!
[3] Both Maëlle and Zhian are huge fans of XML and XPath (see: https://masalmon.eu/2022/04/08/xml-xpath/ and https://zkamvar.netlify.app/blog/gh-task-lists/).
[4] Or at least it’s supposed to 😅 Thankfully users report edge cases that are not covered yet.
[5] For examples, see The Carpentries Workbench Transition Guide.