The drake R package is a pipeline toolkit. It manages data science workflows, saves time, and adds more confidence to reproducibility. I hope it will impact the landscapes of reproducible research and high-performance computing, but I originally created it for different reasons. This post is the prequel to drake’s inception. There was struggle, and drake was the answer.
Dissertation frustration
My dissertation project was intense. The final computational challenge was to analyze multiple genomics datasets using an emerging method and its competitors. Even with GPU computing, which shrank days of runtime down to hours, the full battery of Markov chain Monte Carlo runs took several weeks from start to finish. I organized my workflow as an R package, and I worked in a loop:...
We’re very pleased to be introducing someone who needs no introduction in the R community. Join us in welcoming Maëlle Salmon to rOpenSci as a Research Software Engineer (part time, working from Nancy, France). We’d like to formally introduce her here and share a bit about the kinds of things she’ll be working on.
Maëlle did a B.Sc. in Biology with an emphasis on maths and quantitative work, two Masters degrees - one in Ecology and one in Public Health - and a Ph.D. in epidemiological statistics at the Ludwig-Maximilian University in Germany. Her thesis dealt with statistical algorithms for aberration detection in time series of counts of reported cases of infectious diseases. Most recently, Maëlle worked as a data manager and statistician for the CHAI project. Maëlle has contributed six packages to rOpenSci to date, and has written about two of them, ropenaq and rtimicropem, for our guest blog series about onboarded software.
DBI
What is DBI? DBI is an R package. It defines an interface to relational database management systems (RDBMS) that other R packages build upon to interact with a specific relational database, such as SQLite or PostgreSQL.
NoSQL
NoSQL databases are a very broad class of databases that includes document databases such as CouchDB and MongoDB, key-value stores such as Redis, and more. They are generally not row-column relational stores, though the class can include those. NoSQL is now often read as “not only SQL”....
The problem
Text-mining - the art of answering questions by extracting patterns, data, etc. out of the published literature - is not easy.
It’s made incredibly difficult by publishers. It is a fact that the vast majority of publicly funded research across the globe is published in paywalled journals. That is, taxpayers pay twice for research: once for the grant to fund the work, then again to be able to read it. These paywalls mean that every potential text-miner will have different access: some have access through their university, some through their company, and others only to whatever happens to be open access. On top of that, access to paywalled journals often depends on your IP address, something that is not top of mind for most people....
One of the best things about learning R is that no matter your skill level, there is always someone who can benefit from your experience. Topics in R ranging from complicated machine learning approaches to calculating a mean all find their relevant audiences. This is particularly true when writing R packages. With an ever-evolving R package development landscape (R, GitHub, external data, CRAN, continuous integration, users), there is a strong possibility that you will be taken into regions of the R world that you never knew existed. More experienced developers may not get stuck in these regions and therefore may not think to shine a light on them. The objective of this post is to explore some of the regions of the R world that were highlighted for me when the tidyhydat package was reviewed by rOpenSci....