Posts with the "archiving" tag
Lessons Learned from rtika, a Digital Babel Fish
April 25, 2018
The Apache Tika parser is like the Babel fish in Douglas Adams’s book, “The Hitchhiker’s Guide to the Galaxy” 1. The Babel fish translates any natural language to any other. Although Tika does not yet translate natural language, it starts to tame the Tower of Babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows an analyst to extract text and objects from Microsoft Word.
The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database
June 3, 2015
Despite the hype around “big data”, a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small, independent, and heterogeneous fragments: the outputs of many isolated scientific studies conducted around the globe.
Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community.
Reproducible research is still a challenge
June 9, 2014
Science is reportedly in the middle of a reproducibility crisis. Reproducibility seems laudable and is frequently called for (e.g., in Nature and Science). In general, the argument is that research that can be independently reproduced is more reliable than research that cannot. It is also worth noting that reproducing research is not solely a checking process; it can also provide useful jumping-off points for future research questions. It is difficult to find a counter-argument to these claims, but arguing that reproducibility is laudable in general glosses over the fact that each research group must do a significant amount of work to make its research (easily) reproducible for independent scientists.
dvn - Sharing Reproducible Research from R
February 20, 2014
Reproducible research involves the careful, annotated preservation of data, analysis code, and associated files, such that statistical procedures, output, and published results can be directly and fully replicated. As the push for reproducible research has grown, the R community has responded with an increasingly large set of tools for engaging in reproducible research practices (see, for example, the ReproducibleResearch Task View on CRAN). Most of these tools focus on improving one’s own workflow through closer integration of data analysis and report generation.