Discover hydrological data using the hddtools R package
I’ve worked for over 12 years in hydrology and natural hazard modelling and one of the things that still fascinates me is the variety of factors that come into play in trying to predict phenomena such as river floods. From local observations of meteorological and hydrological variables and their spatio-temporal patterns to the type and condition of soils and vegetation/land use as well as the geometry and state of river channels and engineering structures affecting the flow.
This large number of well known predictors directly translates in the need to collect a wide variety of information such as in-situ sensor observations, river survey data, satellite imagery and, more recently, social media information (reporting in real-time the evolution of on-going events). In this field, data collection is challenging on many levels: data are available from different providers under a variety of licences, in various data types and formats that need to be handled and homogenised.
On the one side, not much can be done to overcome licensing issues and the learning curve to become an expert hydrologist is rather steep. On the other side, the R community is working hard to provide tools to make data standards more user-friendly and the convenience of data APIs available to everyone, not only web developers. Here is where
hddtools comes into play! This R package is a proof of concept that hydrological data can be made more accessible and consists of a collection of functions to retrieve and homogenise hydrological information. Let me walk you through the main functionalities!
The R package
hddtools stands for Hydrological Data Discovery Tools. This is an open source project designed to facilitate access to on-line data sources. This typically implies the download of a metadata catalogue, selection of information needed, formal request for dataset(s), de-compression, conversion, manual filtering and parsing. All those operation are made more efficient by re-usable functions.
Depending on the data license, functions can provide offline and/or on-line modes. When redistribution is allowed, for instance, a copy of the dataset is cached within the package and updated twice a year. This is the fastest option and also allows offline use of functions. When re-distribution is not allowed, only on-line mode is provided.
hddtools package and the examples in this blog depend on other CRAN packages. Before attempting to install
hddtools, solve any missing dependencies using the commands below:
packs <- c("zoo", "sp", "RCurl", "XML", "rnrfa", "Hmisc", "raster", "stringr", "devtools", "leaflet") new_packages <- packs[!(packs %in% installed.packages()[,"Package"])] if(length(new_packages)) install.packages(new_packages)
The package is available from the Comprehensive R Archive Network (CRAN):
However, the development of this package is very dynamic. Data sources become often unavailable or migrate to new systems. URLs change and, as a consequence, functions may stop working. For this reason, I always suggest to use the development version of this package, which is available from github using devtools:
Data sources and Functions
In my quest for hydrological data I found that there are tons of open datasets available but the problem is that you need to know where to look!
Here is the list of data sources available within the
- The Koppen Climate Classification map
- The Global Runoff Data Centre
- NASA’s Tropical Rainfall Measuring Mission
- The Top-Down modelling Working Group:
- MOPEX -SEPA river level data
For each source, functions are available to obtain and/or filter relevant data. Their usage is described below.
The Koppen Climate Classification map
The Koppen Climate Classification is the most widely used system for classifying the world’s climates. Its categories are based on the annual and monthly averages of temperature and precipitation. This climate classification is used, in hydrological studies, to explain the continental-scale variability in
annual runoff. The
hddtools package contains a function to identify the updated Koppen-Greiger climate zone, given a bounding box. In the example below I’m getting the climate class for my beautiful home town in Italy: Pompeii.
Country borders are retrieved using the
getData() function from the raster package, which retrieves global administrative areas from the following website: http://www.gadm.org/.
# Plot Pompeii on a map library(raster) Italy <- getData("GADM", country = "IT", level = 0) Pompeii <- SpatialPoints(coords = data.frame(x = 14.4870, y = 40.7510)) plot(Italy, col = NA, border = "darkgrey") plot(Pompeii, add = TRUE, col = "red") # Define and plot a bounding box centred in Pompeii (Italy) areaBox <- raster::extent(Pompeii@coords[] - 0.5, Pompeii@coords[] + 0.5, Pompeii@coords[] - 0.5, Pompeii@coords[] + 0.5) plot(areaBox, add = TRUE, col = "red")
# Extract climate zones from Kottek's map: KGClimateClass(areaBox = areaBox, updatedBy = "Kottek")
## ID Class Frequency ## 1 34 Csa 131
From the column
Class in the above table, it can be derived that the area falls in a warm temperate climate (C) with dry (s) and hot summers (a). A description of the retrieved class and related criterion can be printed setting the argument
verbose = TRUE in the function
The Global Runoff Data Centre
The Global Runoff Data Centre (GRDC) is an international archive hosted by the Federal Institute of Hydrology in Koblenz, Germany. The Centre operates under the auspices of the World Meteorological Organisation and retains services and datasets for all the major rivers in the world. The data catalogue, kml files and the Long-Term Mean Monthly Discharges are open data and accessible via
Information on all the GRDC stations can be retrieved using the function
catalogueGRDC with no input arguments, as in the example below:
# GRDC full catalogue GRDC_catalogue_all <- catalogueGRDC()
However, there are a number of options to filter the catalogue and return only a subset of stations. For instance, the catalogue can be filtered based on a geographical bounding box as in the function
KGClimateClass(). It might also be interesting to subset only stations in a given country, e.g. Italy. As the 8th column in the catalogue lists the country codes, the catalogue can be filtered passing two input arguments to the
columnName = "country_code" and
columnValue = "IT", as in the example below.
# Filter GRDC catalogue based on a country code GRDC_IT <- catalogueGRDC(columnName = "country_code", columnValue = "IT") # Convert the table to a SpatialPointsDataFrame hydro <- SpatialPointsDataFrame(coords = GRDC_IT[, c("long", "lat")], data = GRDC_IT) # Plot the stations on the map plot(Italy, col = NA, border = "darkgrey") plot(hydro, add = TRUE, col = "blue", pch = 1)
columnValue can be used to filter over other columns. For instance, the example below shows how to subset stations along the Tanaro River which source in the Ligurian Alps.
# Filter GRDC catalogue based on rivername, search is not case sensitive! Tanaro <- catalogueGRDC(columnName = "river", columnValue = "Tanaro") # Convert the table to a SpatialPointsDataFrame TanaroSP <- SpatialPointsDataFrame(coords = Tanaro[, c("long", "lat")], data = Tanaro) # Highight in red the stations on the Tanaro River plot(Italy, col = NA, border = "darkgrey") plot(hydro, add = TRUE, col = "blue", pch = 1) plot(TanaroSP, add = TRUE, col = "red", pch = 1)
columnName refers to a numeric field, in the
columnValue other terms of comparison can be specified. For example, it is straightforward to find out that the country with the most longstanding monitoring stations is Germany (with 11 out of 15 stations with more than 150 years of recordings).
# Filter GRDC stations with more than 150 years of recordings GRDC150 <- catalogueGRDC(columnName = "d_yrs", columnValue = ">150") # Which country has the most longstanding monitoring stations? table(GRDC150$country_code)
## ## DE FI LT RO SE ## 11 1 1 1 1
# Where is the oldest station? oldest <- GRDC150[which(GRDC150$d_yrs == max(GRDC150$d_yrs)), c("grdc_no", "river", "station", "d_yrs")] oldest
## # A tibble: 1 × 4 ## grdc_no river station d_yrs ## <chr> <chr> <chr> <dbl> ## 1 6340120 ELBE RIVER DRESDEN 208
The oldest stastion is in Dresden on the Elbe River. Let’s now find out whether monthly data is available. For this we need the GRDC identification number (this is stored in the column grdc_no of the catalogue) and the function
# Monthly data extraction DresdenStation <- tsGRDC(stationID = oldest$grdc_no, plotOption = TRUE)
The plot above, shows the time on the x-axis and the flow (in m3/s) on the y-axis. The river flow has had huge oscillations over time, is this an effect of anthropogenic changes and/or climate change? This is well out of the scope of this blog post, I leave the reader to look at trends from the mean monthly values
NASA’s Tropical Rainfall Measuring Mission
The Tropical Rainfall Measuring Mission (TRMM) is a joint mission between NASA and the Japan Aerospace Exploration Agency (JAXA) that uses a research satellite to measure precipitation within the tropics in order to improve our understanding of climate and its variability.
The TRMM satellite records global historical rainfall estimation in a gridded format since 1998 with a daily temporal resolution and a spatial resolution of 0.25 degrees (spatial extent goes from -50 to +50 degrees latitude). This information is openly available for educational purposes and downloadable from an FTP server. The
hddtools provides a function, called TRMM(), to download and convert a selected portion of the TRMM dataset into a raster-brick that can be opened in any GIS software.
As an example, I’m going to download precipitation maps for three months in 2016 using the
areabox defined previously to locate Pompeii and surrounding areas. But remember, values become less reliable moving away from the tropics!
# Define a temporal extent twindow <- seq(as.Date("2016-01-01"), as.Date("2016-03-31"), by = "months") # Retrieve mean monthly precipitations based on a bounding box and time extent TRMMfile <- TRMM(twindow = twindow, areaBox = areaBox) plot(TRMMfile)
Each of the above plots shows the average precipitation over Pompeii during January (top left), February (top right) and March (bottom left) 2016. Precipitation seems generally more abundant in the south (coastal areas) where the plots show prevalence of green, especially during January and February.
Top-Down modelling Working Group
The Top-Down modelling Working Group (TDWG) for the Prediction in Ungauged Basins (PUB) Decade (2003-2012) is an initiative of the International Association of Hydrological Sciences (IAHS) which collected datasets for hydrological modelling free-of-charge, available here. This package provides a common interface to retrieve, browse and filter two datasets: Data60UK and MOPEX.
The Data60UK dataset
The Data60UK initiative collated datasets of areal precipitation and streamflow discharge across 61 gauging sites in England and Wales (UK). The database was prepared from source databases for research purposes, with the intention to make it re-usable. This is now available in the public domain free of charge.
hddtools contain two functions to interact with this database: one to retrieve the catalogue and another to retrieve time series of areal precipitation and streamflow discharge.
UK <- getData("GADM", country = "GBR", level = 0) plot(UK, col = NA, border = "darkgrey") # Data60UK full catalogue allData60UK <- catalogueData60UK() hydroALL <- SpatialPointsDataFrame(coords = allData60UK[, c("Longitude", "Latitude")], data = allData60UK) plot(hydroALL, add = TRUE, col = "blue", pch = 1)
## stationID River Location gridReference Latitude ## 1 22001 Coquet Morwick NU234044 55.33310 ## 2 22006 Blyth Hartford Bridge NZ242800 55.11381 ## 3 23006 South Tyne Featherstone NY672610 54.94257 ## 4 24004 Bedburn Beck Bedburn NZ117321 54.68382 ## 5 25005 Leven Leven Bridge NZ444120 54.50139 ## 6 25006 Greta Rutherford Bridge NZ033122 54.50511 ## Longitude ## 1 -1.632691 ## 2 -1.622163 ## 3 -2.513539 ## 4 -1.820046 ## 5 -1.315911 ## 6 -1.950550
Given the station identification number (in the column stationID), time series of areal precipitation and streamflow discharge can be downloaded using the function
tsData60UK(). In the example below I show how to get the time series for the first station in the table, for a given temporal window.
# Extract time series for the first station stationID <- catalogueData60UK()$stationID # Extract time series for a specified temporal window twindow <- seq(as.Date("1988-01-01"), as.Date("1989-12-31"), by = "days") MorwickTSplot <- tsData60UK(stationID = stationID, plotOption = TRUE, twindow = twindow)
The above figure is divided into two plots: precipitation over time (top) and river flow over time (bottom). High flows and related precipitation events seem to be clearly identifiable. The baseflow is around 10 m3/s, but during the most important events the flow reached 120 m3/s. The shape of the hydrograph, however, suggests that the flow was contained by the embankments and did not cause floods.
This source contains historical hydrometeorological data and river basin characteristics for hundreds of river basins and from a range of climates in the US. As with the previous dataset,
hddtools contains functions to download the catalogue and time series. The example below shows how to download the MOPEX catalogue and the time series for the first station in the catalogue.
# MOPEX full catalogue allMOPEX <- catalogueMOPEX() # Extract time series BroadRiver <- tsMOPEX(stationID = allMOPEX$stationID, plotOption = TRUE)
In the plots above P stands for precipitation, E for potential evapotranspiration, Q for river flow, Tmin for minimum temperature and Tmax for maximum temperature.
SEPA river level data
The Scottish Environment Protection Agency (SEPA) manages river level data for hundreds of gauging stations in the UK. The catalogue of stations was derived from a list that’s no longer available, but there’s an equivalent dataset of river levels from the SEPA. The time series of the last few days is available from the SEPA website and can be downloaded using the following function
tsSEPA(), as in the example below. Plese note that this data is updated every 15 minutes and the code will always generate different plots.
# SEPA unofficial catalogue allSEPA <- catalogueSEPA() # Single time series extraction Kilphedir <- tsSEPA(stationID = catalogueSEPA()$stationId, plotOption = TRUE)
There are a number of possible research applications for the
package. Retrieved precipitation and streamflow data could, for
instance, be used to draw spatial trends, as done by (Vitolo et al.,
2016b) using the
rnrfa package but over larger areas. The package
could also be used to compare hydrological behaviours in different areas
of the world, to run and calibrate hydrological models such as fuse
(Clark et al., 2008, Vitolo et al. 2016a) as well as to undertake
I’m very grateful to Erin Le Dell and
Michael Sumner who reviewed, on behalf of
hddtools package and the related paper published in the
Journal of Open Source Software (Vitolo, 2017). Both reviewers provided
very constructive suggestions that grately improved this package. I’d
like to also thank Stefanie
Butland and Scott
Chamberlain for providing invaluable advice
when reviewing this post.
Clark, M. P., Slater, A. G., Rupp, D. E., Woods, R. A., Vrugt, J. A., Gupta, H. V., Wagener, T. and Hay, L. E.: Framework for understanding structural errors (fuse): A modular framework to diagnose differences between hydrological models, Water Resources Research, 44(12), n/a–n/a, doi:10.1029/2007WR006735, 2008.
Vitolo, C.: Hddtools: Hydrological data discovery tools, The Journal of Open Source Software, 2(9), doi:10.21105/joss.00056, 2017.
Vitolo, C., Wells, P., Dobias, M. and Buytaert, W.: fuse: An R package for ensemble Hydrological Modelling, The Journal of Open Source Software, 1(8), doi:10.21105/joss.00052, 2016a.
Vitolo, C., Fry, M. and Buytaert, W.: rnrfa: An R package to Retrieve, Filter and Visualize Data from the UK National River Flow Archive, The R Journal, 8(2) [online] Available from: https://journal.r-project.org/archive/2016-2/vitolo-fry-buytaert.pdf, 2016b.