Teaching collaborative software development
In the University of British Columbia’s Master of Data Science program, one of the courses we teach is Collaborative Software Development (DSCI 524). The course focuses on applying collaborative software development practices to data science workflows. This includes appropriate use of the software life cycle, unit testing and continuous integration, as well as packaging code for use by others.
...The free online book Open Forensic Science in R was created to foster open science practices in the forensic science community. It comprises eight chapters: an introduction and seven chapters covering different areas of forensic science: the validation of DNA interpretation systems, firearms analysis of bullets and casings, latent fingerprints, shoe outsole impressions, trace glass evidence, and decision-making in forensic identification tasks. Every chapter of Open Forensic Science in R has the same five sections: Introduction, Data, R Package(s), Drawing Conclusions, and Case Study. R code appears throughout each chapter to guide the reader through an analysis, and the case study walks the reader through solving a forensic science problem in R, from reading the data to answering a specific question such as, “Were these two bullets fired by the same gun?”...
rOpenSci HQ
Software Peer Review
5 community-contributed packages passed software peer review.
...Introduction
Large quantities of freely available data are revolutionizing ecological research. Open data maximizes the opportunities to perform comparative analyses and meta-analyses. Such synthesis efforts will increasingly exploit “population data”, which we define here as time series of population abundance. Population data plays a central role in testing ecological theory and guiding management decisions. One of the richest sources of open access population data is the US Long Term Ecological Research (LTER) Network. However, LTER data has the drawback common to all ecological time series: extreme heterogeneity arising from differences in sampling design. We experienced this heterogeneity firsthand when we embarked on our own comparative analysis of population data. Specifically, we noticed that differences in sampling design made datasets hard to compare, and therefore hard to search and analyze.
...Ambitious workflows in R, such as machine learning analyses, can be difficult to manage. A single round of computation can take several hours to complete, and routine updates to the code and data tend to invalidate hard-earned results. You can enhance the maintainability, hygiene, speed, scale, and reproducibility of such projects with the drake R package. drake resolves the dependency structure of your analysis pipeline, skips tasks that are already up to date, executes the rest with optional distributed computing, and organizes the output so you rarely have to think about data files. This talk demonstrates how to create and maintain a realistic machine learning project using drake-powered automation....
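The idea of skipping tasks that are already up to date can be illustrated with a small, language-neutral sketch. The Python below is not drake's API or implementation. The names `fingerprint`, `make`, `plan`, and `cache` are hypothetical, and the approach shown (hashing a task's compiled code together with its upstream values) is only one simple way to detect stale results.

```python
import hashlib


def fingerprint(func, inputs):
    """Hash a task's compiled code, its constants, and its input values."""
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, inputs))
    return hashlib.sha256(payload.encode()).hexdigest()


def make(plan, cache):
    """Build targets in order, skipping any whose fingerprint is unchanged.

    plan:  dict of target name -> (function, list of upstream target names),
           written in dependency order (dicts keep insertion order).
    cache: dict of target name -> (fingerprint, value), reused between runs.
    Returns the names of the targets that were actually rebuilt.
    """
    rebuilt = []
    for name, (func, deps) in plan.items():
        inputs = [cache[d][1] for d in deps]
        key = fingerprint(func, inputs)
        if name in cache and cache[name][0] == key:
            continue  # code and inputs are unchanged, so keep the stored value
        cache[name] = (key, func(*inputs))
        rebuilt.append(name)
    return rebuilt


# Hypothetical three-step analysis: load data, fit a "model", write a report.
def load_data():
    return [1.0, 2.0, 3.0, 4.0]


def fit_model(data):
    return sum(data) / len(data)


def report(model):
    return f"mean = {model:.2f}"


plan = {
    "data": (load_data, []),
    "model": (fit_model, ["data"]),
    "report": (report, ["model"]),
}
cache = {}

first_run = make(plan, cache)   # nothing is cached yet, so every target is built
second_run = make(plan, cache)  # nothing changed, so nothing is rebuilt


# Editing only the report step invalidates only the report target.
def report_v2(model):
    return f"estimated mean: {model:.3f}"


plan["report"] = (report_v2, ["model"])
third_run = make(plan, cache)
```

In this sketch the second call does no work, and the third call reruns only the edited report step while the data and model results are reused. A real pipeline tool also has to work out the dependency order itself and persist the cache between sessions, which this sketch leaves out.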