A package for dimensionality reduction of large data
Note: Recently, two new UMAP R packages have appeared. These new packages provide more features than umapr does and they are more actively developed. These packages are:
umap, which provides the same Python wrapping function as umapr and also an R implementation, removing the need for the Python version to be installed. It is available on CRAN.
uwot, which also provides an R implementation, removing the need for the Python version to be installed.
A few weeks ago, as part of the rOpenSci Unconference, a group of us (Sean Hughes, Malisa Smith, Angela Li, Ju Kim, and Ted Laderas) decided to work on making the UMAP algorithm accessible within R. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that reduces high-dimensional data (many columns) to a smaller number of columns for visualization purposes (github, arxiv). It is similar to both Principal Components Analysis (PCA) and t-SNE, which are techniques often used in the single-cell omics world (such as genomics, flow cytometry, proteomics) to visualize high-dimensional data. t-SNE is quite a slow algorithm; one of the advantages of UMAP is that it runs faster. Because the data.frames that are typically run through these algorithms can reach millions of rows, efficiency is important.
We decided to start working on the umapr package to make this technique accessible within R. As with most rOpenSci Unconf projects, this started with an issue entry in the rOpenSci unconf repo:
I recently read about a new non-linear dimensionality reduction algorithm called UMAP, which is much faster than t-SNE, while producing two-dimensional visualizations that share many characteristics with t-SNE. I initially found out about it in the context of use on high-dimensional single-cell data in this paper.
My thought is that the ideal would be a package focused on UMAP specifically, implemented in R or Rcpp. Unfortunately I am not at all an expert in this topic or familiar with the mathematics involved, so the best I would be able to do is try to translate the Python implementation into R.
We all met at the unconference on the first day and decided that this was a project worth working on. Since t-SNE is so widely used in the single cell and flow-cytometry community, we thought that an alternative that was just as good but faster to run would be helpful.
Making a Development Plan
Rather than recreate the UMAP code completely from scratch in R, we decided to use the reticulate package to call the implementation in Python from R. It was tempting to just wrap the function’s arguments with ... and let the user refer to the Python documentation. However, we didn’t really think that was in the spirit of the unconf. We wanted to make UMAP much more accessible.
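To give a sense of what wrapping the Python implementation with reticulate can look like, here is a minimal sketch rather than umapr's actual source. It assumes the Python module umap (from the umap-learn package) is installed in the Python that reticulate finds; the function name umap_sketch and the selection of exposed arguments are hypothetical.

```r
# Hypothetical wrapper: spell out a few UMAP parameters as R arguments
# instead of passing everything through `...`.
umap_sketch <- function(data,
                        n_neighbors = 15L,
                        n_components = 2L,
                        min_dist = 0.1,
                        metric = "euclidean") {
  # Import the Python module; this fails if umap-learn is not installed.
  umap_module <- reticulate::import("umap")

  # Python expects integers for these two parameters, so coerce explicitly.
  model <- umap_module$UMAP(
    n_neighbors = as.integer(n_neighbors),
    n_components = as.integer(n_components),
    min_dist = min_dist,
    metric = metric
  )

  # fit_transform() returns a NumPy array, which reticulate converts
  # to an R matrix with one row per observation.
  embedding <- model$fit_transform(as.matrix(data))
  colnames(embedding) <- paste0("UMAP", seq_len(ncol(embedding)))
  as.data.frame(embedding)
}

# Only run the example when reticulate and the Python module are available.
if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_module_available("umap")) {
  embedding <- umap_sketch(iris[, 1:4], n_neighbors = 10L)
  print(head(embedding))
}
```

Naming the parameters in the R signature means they show up in R's own help and autocompletion, which is the accessibility gain over a bare `...` pass-through.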
Learning about Package Building, Testing, and Documentation
Although our package only really has one main function (umap()), we felt it was important to have good documentation and unit tests. We spent some time learning about roxygen for function documentation and testthat for unit testing, and setting up our package with Travis-CI for continuous integration testing. This included writing unit tests for each argument and examples that vary the essential parameters.
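As an illustration of this kind of test, here is a small testthat sketch; it is not taken from the package's test suite. It assumes umapr and the Python module umap are installed (the test is skipped otherwise), and the expectation that the result has one row per input row is an assumption made for this example.

```r
library(testthat)

test_that("umap() returns one row per observation", {
  # Skip rather than fail when the optional pieces are missing.
  skip_if_not_installed("umapr")
  skip_if_not(reticulate::py_module_available("umap"),
              "Python module 'umap' not available")

  input <- as.matrix(iris[, 1:4])
  embedding <- umapr::umap(input)

  # Assumed contract for this sketch: the embedding keeps every input row.
  expect_equal(nrow(embedding), nrow(input))
})
```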
We spent a lot of time learning more about the specifics of package building and vignette building in R. We were definitely excited by all of the available tools and built a vignette profiling the performance of the UMAP algorithm versus other dimensionality reduction techniques, such as t-SNE. Our vignette can be read here: https://github.com/ropenscilabs/umapr#basic-use
umapr using different datasets
Part of the appeal of UMAP is that it is faster than t-SNE. So we profiled the performance of UMAP on a number of different datasets: iris (of course!), the BreastCancer dataset from the mlbench package, a Soybean dataset from mlbench, and finally, a single cell RNA dataset. You can see our results in our readme file.
Thankfully, UMAP does run faster than t-SNE on these datasets, with a 66% reduction in run time compared to both versions of t-SNE for the Soybean dataset. It also used less memory on all of the datasets except the single cell RNA dataset (see above figure).
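A comparison of this kind can be sketched with base R's system.time(). This is not the profiling code behind the numbers reported here; the helper names time_elapsed and percent_reduction are hypothetical, and the UMAP and t-SNE calls only run if umapr (with its Python module) and Rtsne happen to be installed.

```r
# Hypothetical helper: elapsed wall-clock seconds for one call of f(data).
time_elapsed <- function(f, data) {
  system.time(f(data))[["elapsed"]]
}

# Percentage reduction in run time of `new` relative to `baseline`.
percent_reduction <- function(new, baseline) {
  100 * (1 - new / baseline)
}

# Rtsne rejects duplicate rows by default, so drop them first.
input <- unique(as.matrix(iris[, 1:4]))

have_tsne <- requireNamespace("Rtsne", quietly = TRUE)
have_umap <- requireNamespace("umapr", quietly = TRUE) &&
  reticulate::py_module_available("umap")

if (have_tsne && have_umap) {
  t_tsne <- time_elapsed(function(x) Rtsne::Rtsne(x), input)
  t_umap <- time_elapsed(function(x) umapr::umap(x), input)
  cat(sprintf("UMAP run time reduction vs t-SNE: %.0f%%\n",
              percent_reduction(t_umap, t_tsne)))
}
```

With made-up timings of 1 second against a 3 second baseline, percent_reduction(1, 3) comes out at roughly 67. A single run is noisy, so repeated timings would give a fairer picture.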
Exploring the Results with Shiny
We built a small Shiny app that lets people explore their embedding vectors (the dimensionally reduced vectors) and how they separate the data into different groupings in the 2D space. The app is simple, but allows users to immediately assess the results of the UMAP algorithm in differentiating groupings in the data by coloring the umap result by the different variables included in the analysis.
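The idea behind such an app can be sketched in a few lines of Shiny. This is an illustrative stand-in, not the unconf app itself: to stay self-contained it uses the first two principal components from prcomp() in place of a UMAP embedding, and the IDs color_by and embedding_plot are hypothetical names.

```r
library(shiny)

# Stand-in embedding: two principal components instead of UMAP output.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
embedded <- cbind(iris, X1 = pca$x[, 1], X2 = pca$x[, 2])

ui <- fluidPage(
  selectInput("color_by", "Color points by:", choices = names(iris)),
  plotOutput("embedding_plot")
)

server <- function(input, output, session) {
  output$embedding_plot <- renderPlot({
    values <- embedded[[input$color_by]]
    # Bin numeric variables so that every variable maps to a few colors.
    groups <- if (is.numeric(values)) cut(values, breaks = 5) else factor(values)
    plot(embedded$X1, embedded$X2, col = as.integer(groups), pch = 19,
         xlab = "Dimension 1", ylab = "Dimension 2")
    legend("topright", legend = levels(groups),
           col = seq_along(levels(groups)), pch = 19)
  })
}

app <- shinyApp(ui, server)

# Launch the app only in an interactive session.
if (interactive()) runApp(app)
```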
Final Results: Get umapr
umapr is currently available in the ropenscilabs organization on GitHub, and can be installed from there once the required Python modules are installed.
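One way to do this, assuming the devtools package is available and that the Python side is the umap-learn package installed separately (for example with pip):

```r
# Assumes the Python module is already installed, e.g. `pip install umap-learn`.
# install.packages("devtools")  # uncomment if devtools is not yet installed
devtools::install_github("ropenscilabs/umapr")
```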
As a group, we learned a lot by building the umapr package. More importantly, I think we’ll work together on future projects. It was great to work together, and we are talking about having a hackathon between our multiple groups to improve some current open source flow cytometry tools. This was a really fun project and we’re excited to do more!