Antarctic/Southern Ocean science and rOpenSci
Collaboration and reproducibility are fundamental to Antarctic and Southern Ocean science, and the value of data to Antarctic science has long been promoted. The Antarctic Treaty (which came into force in 1961) included the provision that scientific observations and results from Antarctica should be openly shared. The high cost and difficulty of acquisition mean that data tend to be re-used for different studies once collected. Further, many data requirements are common across activities (e.g. sea ice information is useful to a wide range of activities, from voyage planning through to ecosystem modelling). Support for Antarctic data management is well established. The SCAR-COMNAP Joint Committee on Antarctic Data Management was established in 1997 and remains active as a SCAR Standing Committee today.
...Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:
Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.
...
While many people groan at the thought of participating in a group ice breaker activity, we’ve gotten consistent feedback from people who have been to recent rOpenSci unconferences:
Best ice breaker ever!
We’ve had lots of requests for a detailed description of how we do it. This post shares our recipe, including a script you can adapt, a reflection on its success, examples of how others have used it, and some tips to remember. Let us know in the comments if you’ve used or adapted it!
...rOpenSci’s software engineer / postdoc Jeroen Ooms will explain what images are, under the hood, and showcase several rOpenSci packages that form a modern toolkit for working with images in R, including opencv, av, tesseract, magick and pdftools.
🕘 Thursday, November 15, 2018, 10-11AM PST; 7-8PM CET (find your timezone)
☎️ Find all details for joining the call on our Community Calls page. Everyone is welcome. No RSVP needed.
Agenda
Abstract
Images in various forms are used for numerous applications across scientific disciplines. Whether you are observing through satellite or microscope, looking at MRI scans or petri dishes, trying to find patterns or abnormalities, the data is in the image. Unfortunately the tools for working with images are traditionally highly fragmented by field, and often narrow in scope. At rOpenSci we are working on a suite of general purpose packages based on powerful C/C++ libraries. These provide an extensible and interoperable foundation for working with images in R, which can be used to implement domain-specific methods. This talk gives a taste of things we can currently do with images in R, and highlights some of the ongoing developments and challenges.
...pubchunks is a package grown out of the fulltext package. fulltext provides a single interface to many sources of full text scholarly articles. As part of the user flow in fulltext there is an extraction step where fulltext::chunks() pulls parts of articles out of XML format article files. As part of making fulltext more maintainable and focused on simply fetching articles, and realizing that pulling out bits of structured XML files is a more general problem, we broke out pubchunks into a separate package. fulltext::ft_chunks() and fulltext::ft_tabularize() will eventually be removed and we’ll point users to pubchunks.