Thursday, February 22, 2024 From rOpenSci (https://ropensci.org/blog/2024/02/22/beautiful-code/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
rOpenSci’s second cohort of champions was onboarded! Their training started with a session on code style, which we summarize in this post. Knowing more about code quality is relevant to all Champion projects, be it creating a new package, submitting a package to software review, or reviewing a package. This training session consisted of a talk and discussion, whereas the next package development training sessions will be more hands-on.
Although your code will be executed by machines, it will be read by humans. Those humans, whether they are future you, collaborators you know or collaborators you don’t know, will need to understand your code to check that it contains no mistakes, to fix potential bugs, and to build upon it by adding new features. Making your code easier to understand is therefore crucial.
In the first part, we shared tips that make code more “well-proportioned”. This is not only a matter of aesthetics: well-proportioned code is easier for humans to parse.
Compare
starwars%>%
  select(name,height, mass,homeworld) %>%
  mutate(
    mass=NULL,
    height =height *0.0328084 # convert to feet
  )
to
starwars %>%
  select(name, height, mass, homeworld) %>%
  mutate(
    mass = NULL,
    height = height * 0.0328084 # convert to feet
  )
In the first chunk, spacing between elements is irregular.
For instance, there is no space before height, and no space around the equal sign that comes after mass.
We instead recommend following spacing (and line-breaking!) rules consistently. Unless you have a strongly differing opinion, the easiest strategy is to follow your collaborators’ style guide, or a popular style guide like the tidyverse style guide.
So how do you implement these rules in practice?
First you’ll need to be accustomed to using a particular style.
Automatic tools like the styler package or your IDE can help you.
For example, in the RStudio IDE, the keyboard shortcut Ctrl+I fixes indentation.
A traditional rule is to have no more than 80 characters per line. The exact number isn’t important; what’s important is to prevent too much horizontal scrolling!
The lintr package can warn you about lines that are too wide, among many other things. Unlike styler, lintr does not fix things itself.
There is also a setting in the RStudio IDE to show a margin at 80 characters (Code > Display > Show Margin).
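As a minimal sketch of how the two tools differ, the snippet below restyles part of the messy chunk from earlier with styler and then asks lintr to flag an overly long line. It assumes both packages are installed; the names messy_code and long_line, and the 80-character limit passed to the linter, are illustrative choices.

```r
# Sketch only: assumes the styler and lintr packages are installed.
messy_code <- c(
  "starwars%>%",
  "select(name,height, mass,homeworld)"
)

if (requireNamespace("styler", quietly = TRUE)) {
  # styler rewrites the code for you (tidyverse style by default).
  print(styler::style_text(messy_code))
}

# Build a deliberately long line of code (well over 80 characters).
long_line <- paste0("x <- c(", paste(rep("1", 60), collapse = ", "), ")")

if (requireNamespace("lintr", quietly = TRUE)) {
  # lintr only reports problems; fixing them is up to you.
  print(lintr::lint(text = long_line, linters = lintr::line_length_linter(80)))
}
```

In a real project you would more likely style or lint whole files or a whole package rather than text snippets.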
Vertical space is limited in code both by the screen and by what the reader can see at a glance (never mind limits to how much they can hold in their head).
One way to make your code shorter, but still easy to parse, is to use code paragraphs. Line breaks are not free since they take up vertical space. Use line breaks to separate blocks of code that each do one related thing. As in prose, one paragraph should roughly correspond to one idea. For instance, in the example code below, the first block does something related to a website page head, while the second block handles the body of the website page.
head <- collect_metadata(website)
head_string <- stringify(head)

body <- create_content(website)
body_string <- stringify(body)
A second way to make your code less long is to break down your code into functions.
In a main function, you can outsource tasks to other functions.
This way, a reader can see at a glance what the main function does, and then head to the other functions to read more details, as in the example below where create_content() calls other functions to create a title, a page, and then create its output that combines the two.
create_content <- function(website) {
  title <- create_title(website)
  page <- create_page(website)
  combine_elements(title = title, page = page)
}
In their book Learn AI-Assisted Python Programming, Leo Porter and Daniel Zingaro share the attributes of good functions: One clear task to perform, clearly defined behavior, short in number of lines of code, clear input and output, general value over specific use.
It is also helpful to know how to quickly navigate between functions in your IDE!
In the RStudio IDE, you can use Ctrl+click on the function name, or type its name in the search bar accessed with Ctrl+. (Ctrl and a period).
A third way to shorten your code is to use existing functions from base R or add-on packages.
For instance, to combine a list of default values with a list of custom values, you can use the modifyList() function.
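A small sketch of that idea, using made-up option lists (default_options and custom_options are hypothetical names, not from the training):

```r
# Hypothetical defaults and user overrides for a plot.
default_options <- list(color = "blue", size = 12, bold = FALSE)
custom_options <- list(size = 14, bold = TRUE)

# Values in custom_options replace those with the same name in
# default_options; names absent from custom_options keep their defaults.
plot_options <- utils::modifyList(default_options, custom_options)
str(plot_options)
```

Here color keeps its default value while size and bold take the custom values, without any hand-written loop over names.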
As with human languages, we learn more R words over time, by reading other people’s code and having them read our code.
This part of the training was a shorter version of the R-hub blog post Why comment your code as little (and as well) as possible.
Code comments are not a narrator’s voice-over of the code; they should be little alerts. The more comments there are, the more likely it is that the reader will skip them.
Code comments should not be a band-aid for bad naming or overly complex code: instead of adding a comment, can you rename a variable or refactor a piece of code?
A useful idea is to use self-explanatory functions or variables, where code like
if (!is.na(x) && nzchar(x)) {
  use_string(x)
}
becomes
x_is_not_empty_string <- (!is.na(x) && nzchar(x))
if (x_is_not_empty_string) {
  use_string(x)
}
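The same check can also live in a self-explanatory function, which helps when it is needed in several places. The sketch below is one possible version: the name is_non_empty_string() is hypothetical, and the extra type and length checks are an assumption added here so that the function is safe to call on any input.

```r
# Hypothetical helper: TRUE only for a single, non-missing, non-empty string.
is_non_empty_string <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x) && nzchar(x)
}

is_non_empty_string("hello")
is_non_empty_string("")
is_non_empty_string(NA_character_)
```

At the call site, if (is_non_empty_string(x)) then reads almost like a sentence, so no comment is needed.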
Of course code comments remain important when needed! Examples of good comments include:
# This query can not be done via GraphQL, so have to use v3 REST API
In the second part of the training, we shared tips that improve code clarity.
Naming things is notoriously hard. We shared these ideas:
Follow fashion, meaning, use the same words as others in your field or programming language.
Felienne Hermans, in her book The Programmer’s Brain, advises choosing the concepts that go into the name, the words to say it, then putting them together. This approach in three steps is a good way to get unstuck.
Following the previous advice, names should be consistent across the code base, and name molds are a very good tool for that. Name molds are patterns in which the elements of a name are combined. For example, if you calculate the maximum value of crop yield, you need to agree on whether maximum will be max or maximum, and whether the word will be at the beginning or at the end of the variable name: should it be maxYield or yieldMax? By normalizing how to name things, our code will be easier to read.
“The greater the distance between a name’s declaration and its uses, the longer the name should be” (Andrew Gerrand). However, no matter how close to its definition you use a variable, don’t use a clever, very short abbreviation.
There are several ways to write variable names. camelCase style leads to higher accuracy when reading code (Dave Binkley, 2009) and is better for reading the code with screen readers. We know it is difficult to change the style of an existing project, but if you are in a situation where you can decide from scratch, consider using camelCase. If you’re not sure about case names, refer to Allison Horst’s cartoon of cases (scroll down to “Cartoon representations of common cases in coding”).
A name is clear if the person reviewing your code agrees. 😉
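To make the idea of a name mold concrete, here is a small sketch with made-up data. The mold chosen here (quantity first, then statistic, in camelCase) is only one possible convention; what matters is applying the same one everywhere.

```r
# Hypothetical yearly measurements.
yield <- c(3.2, 4.1, 3.8)
rainfall <- c(510, 640, 580)

# One mold throughout: <quantity><Statistic>, written in camelCase.
yieldMax <- max(yield)
yieldMin <- min(yield)
rainfallMax <- max(rainfall)

# Mixing molds (maxYield, yield_minimum, rainfallMax) would force readers
# to remember each name separately.
```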
A further tip is that it’s absolutely ok to create functions that wrap existing functions just to change their name. This strategy is commonly used to change the argument order, but it is fine for naming too. Say you prefer your function names to be actions (verbs) rather than passive descriptions; you can have:
# In utils.R
remove_extension <- function(path) {
  tools::file_path_sans_ext(path)
}

# In other scripts
remove_extension(path)
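The argument-order case mentioned above can be sketched the same way. The wrapper name detect_pattern() is hypothetical; it only swaps the order of the arguments of base R’s grepl() so that the data comes first.

```r
# grepl() takes the pattern first; this hypothetical wrapper takes the
# string first, which some people find easier to read and to pipe into.
detect_pattern <- function(string, pattern) {
  grepl(pattern, string)
}

detect_pattern(c("beautiful code", "messy script"), "code")
```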
return(), switch()
In a function,
do_thing <- function(x) {
  if (is.na(x)) {
    NA
  } else {
    x + 1
  }
}
is equivalent to
do_thing <- function(x) {
  if (is.na(x)) {
    return(NA)
  }

  x + 1
}
but the latter, with the early return(), has less nesting and emphasizes the “happy path”.
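The benefit grows with the number of special cases. In the sketch below, a hypothetical cm_to_feet() function handles two special cases up front, so the main computation stays unindented at the end. The function name and the specific checks are illustrative assumptions; the conversion factor is the one used in the first example of this post.

```r
# Hypothetical function: each special case exits early, so the
# "happy path" is the last, unindented expression.
cm_to_feet <- function(height_cm) {
  if (!is.numeric(height_cm)) {
    stop("`height_cm` must be numeric.")
  }
  if (length(height_cm) == 0) {
    return(numeric(0))
  }

  height_cm * 0.0328084
}

cm_to_feet(c(172, 96))
```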
The switch() function can also help you remove nested if-else.
With it,
if (type == "mean") {
  mean(x)
} else if (type == "median") {
  median(x)
} else if (type == "trimmed") {
  mean(x, trim = .1)
}
becomes
switch(type,
  mean = mean(x),
  median = median(x),
  trimmed = mean(x, trim = .1)
)
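One thing to keep in mind: when type matches none of the names, both versions above silently return NULL. A possible refinement, sketched here with a hypothetical summarise_values() wrapper, is to add an unnamed last argument to switch(), which acts as the fallback.

```r
# Hypothetical wrapper around the switch() call above.
summarise_values <- function(x, type) {
  switch(type,
    mean = mean(x),
    median = median(x),
    trimmed = mean(x, trim = .1),
    # An unnamed last argument is used when nothing else matches.
    stop("Unknown `type`: ", type)
  )
}

summarise_values(c(1, 2, 3, 10), "median")
```

Calling summarise_values(x, "mode") now fails with an informative error instead of quietly returning NULL.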
The code you don’t write has no bug (that you are responsible for) and does not need to be read. 🎉
First of all, be strict about the scope of what you are trying to accomplish.
Second, use trusted dependencies to outsource part of the work. The “Dependencies: Mindset and Background” chapter of the R Packages book by Hadley Wickham and Jenny Bryan is a great read on the topic.
In practice, how do you apply your code style learnings? And how do you update your legacy codebases created before you knew about some of these aspects?
Maybe you can work on code styling and refactoring regularly.
Once a year? Andy Teucher wrote an interesting blog post about the tidyverse spring cleaning.
More often?
A good strategy is also to work a bit on refactoring every time you enter a codebase to fix a bug or add a feature. The refactoring does not need to go into the same commit or branch: keep your code changes atomic and easy to review.
The lintr package is a fantastic package. Its linters, or rules, will remind you of, or teach you about, elements to fix that you didn’t know about or couldn’t keep in your head. You can run it every once in a while or have it run on continuous integration.
Even simply reading through its reference might show you functions or patterns you were not aware of. A true gem of the R ecosystem!
Other humans will have a good external perspective on your code and probably good tips for you!
Read your colleagues’ code and vice versa! The tidyverse team has a code review guide.
At rOpenSci, we run a software peer-review system of packages 😁
These are the references for most of the training content. 😸
Jenny Bryan’s talk Code Smells and Feels
Book The Art of Readable Code by Dustin Boswell and Trevor Foucher
Book Tidy Design by Hadley Wickham, in progress, with associated newsletter