Caption: "Thermal Map of North America. Delineating the Isothermal Zodiac, the Isothermal Axis of Identity, and its expansions up and down the 'Plateau'." From William Gilpin's Mission of the North American People (1873). Via the "Making Maps: DIY Cartography" blog.
How do I take over 100 NetCDF files, each containing thousands of layers of hourly raster data, and translate them into a single dataset?
Let me flesh out the problem a bit more:
NetCDF files. We have over 100 NetCDF datasets: 1850weather.nc, 1851weather.nc, ..., 1971weather.nc, and so on.
Raster layers. Each NetCDF file is a stack of hourly raster layers. For example, 1850weather.nc contains layer.hour1, layer.hour2, ..., layer.hour2000, and so on.
Shapefiles. We also have a shapefile containing country boundaries.
Our goal. Extract a giant panel dataset containing average hourly weather data for each country.
In other words, we need a way to cycle through each NetCDF file and, for every layer within a file, retrieve the average raster statistics corresponding to each country in our shapefile. This adds up fast.
This problem is interesting because normal approaches to processing piles of raster data (using nested loops) take forever. A more "functional" approach to the problem can be amazingly more efficient.
In this post I consider the ways in which we can tackle this problem (Part 1) and then motivate a solution (Part 2) for comfortably extracting statistics from gigs of GIS weather data.
Part 1. A Comparison of Two Approaches.
Consider two approaches to our problem, a conventional loop-based approach and an optimized approach.
Conventional Dead Ends with Loops.
Most of us would attack the problem using nested loops. The first loop reads and prepares the NetCDF file; a second deals with the raster layers within each file.
We're tempted to iterate over the raster layers (layer[l]) and use standard R GIS libraries to, say, extract average values from the rasters over a country-boundary shapefile (e.g. hourly weather readings over the borders of Finland). Above all, we're tempted to "harvest" the geographic statistics, taking the values we extract and collating them in a giant file.
Straight up, this is a bad idea.
Loops can be abysmal for these tasks. When iterating through objects with a for() loop, we're actually calling many tiny functions ... over and over again. Not only is for() a function, but so is ":", and so are the brackets "[ ]".
To make matters worse, when we manipulate a vector or data.frame with a for-loop, we're also making many internal copies of our objects. Unbeknownst to us, mundane data transformations can quickly fill our memory with repeated copies.
Moreover, embedded in our first attempt is the agony of loop-based "data-harvesting." Unless we're careful, using a loop to incrementally "grow" a dataset will bring our computer to its knees.
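To make the anti-pattern concrete, here is a sketch of what the naive nested-loop version tends to look like. The object names are hypothetical, and it assumes a countryshape object holding the country boundaries is already loaded:

```r
# ANTI-PATTERN sketch -- do not do this. Hypothetical names throughout.
library(raster)

results <- data.frame()                       # grows on every iteration
for (f in list.files(pattern = "weather\\.nc$")) {
  nc_brick <- brick(f)                        # read one NetCDF file
  for (l in 1:nlayers(nc_brick)) {            # loop over every hourly layer...
    layer_means <- extract(nc_brick[[l]], countryshape,
                           fun = mean, na.rm = TRUE)
    # ...and re-copy the entire results object on every append:
    results <- rbind(results,
                     data.frame(file = f, hour = l, avg = layer_means))
  }
}
```

With thousands of layers per file, that rbind() line alone copies the accumulated results object thousands of times.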
Consider an Alternative Approach.
Instead of loops, the following template - and specifically the full program in Part 2 - considers a solution more suited to R.
By combining a functional style with the raster library, we eliminate the need for nested loops and boil the problem down to a streamlined "apply + function" structure.
Instead of using a for-loop to iterate over NetCDF weather files, we take a list of files (list_of_netcdffiles) and "apply" a big function to each file in the list.
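The outer loop then collapses into a single apply call. A minimal sketch, assuming the big function is the generate_datatable_from_rasterbricks function defined later in the post:

```r
# Sketch: the entire outer loop becomes one lapply() call.
list_of_netcdffiles <- list.files(pattern = "weather\\.nc$")

# Apply the (not-yet-defined) main function to every NetCDF file:
lapply(list_of_netcdffiles, generate_datatable_from_rasterbricks)
```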
We eliminate the entire inner loop with the help of RasterBrick manipulations. That is, instead of looping over the individual raster layers within a NetCDF file, we transform the NetCDF file into a RasterBrick object and manipulate the collection of layers as a single object.
Manipulating a whole RasterBrick instead of cycling through individual layers feels a lot like the practice of vectorizing our R code.
Part 2. The Program.
Let's expand the alternative template above into a full program.
The first part of the program defines our functions: the main generate_datatable_from_rasterbricks function and a set of small sub-functions used within it. The generate_datatable_from_rasterbricks function eats a raw NetCDF file and, using a team of smaller functions, reads it as a RasterBrick, aligns it with our country shapefile, extracts the country-level weather statistics, then saves the output file.
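A condensed sketch of what such a function can look like. The attribute name NAME and the output naming scheme are assumptions, not the author's exact code:

```r
# Sketch of the main function; sub-steps are inlined for brevity.
library(raster)
library(data.table)

generate_datatable_from_rasterbricks <- function(netcdf_file, countryshape) {
  weather_brick <- brick(netcdf_file)              # NetCDF -> RasterBrick
  # extract() on a brick averages every layer at once:
  # one row per country, one column per hourly layer.
  country_means <- raster::extract(weather_brick, countryshape,
                                   fun = mean, na.rm = TRUE)
  out <- data.table(country = countryshape$NAME,   # assumed attribute name
                    country_means)
  fwrite(out, sub("\\.nc$", ".csv", netcdf_file))  # save one .csv per file
  invisible(out)
}
```

Note how the inner loop has vanished: the single extract() call handles every layer in the brick.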
The second part contains the core code. Here we define a list of NetCDF raster files (raster_file_list) and the country shapefile (countryshape) used in our calculations. An *apply function feeds the NetCDF files through the main generate_datatable_from_rasterbricks function.
Instead of using lapply(), I use mclapply(): the latter is a multiprocessor version of list-apply provided by the parallel package. Conveniently, mclapply() utilizes the power of our multi-core processor (if we have one).
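The core code can then be sketched as follows (file patterns and the core count are illustrative; note that mclapply() relies on forking, so mc.cores > 1 is not available on Windows):

```r
library(parallel)

raster_file_list <- list.files(pattern = "weather\\.nc$")

# Feed every NetCDF file through the main function, in parallel.
# Extra named arguments (countryshape) are passed through to the function.
mclapply(raster_file_list,
         generate_datatable_from_rasterbricks,
         countryshape = countryshape,
         mc.cores = detectCores() - 1)   # leave one core free
```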
The third part of the program takes our saved files and assembles them into a giant file via the amazingly useful functions provided by the data.table package. Via lapply(), our list of .csv files is opened with the speedy fread() function. The big list of opened .csv files is then fed through rbindlist(), which combines them into a single massive data.table.
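The assembly step is compact. A sketch, with the file pattern and output name as assumptions:

```r
library(data.table)

csv_files <- list.files(pattern = "weather\\.csv$")

# fread() each .csv into a list, then stack the list into one data.table:
big_panel <- rbindlist(lapply(csv_files, fread))

fwrite(big_panel, "country_hourly_weather_panel.csv")
```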
One thing to try.
Compiling functions with R's byte-compiler
Writing our own functions also allows us to easily use R's compiler package on chunks of our code. The cmpfun() function lets us generate byte-compiled versions of our own functions. Byte-compiled functions can sometimes give our programs another performance boost, often with minimal upfront cost, and can yield surprising gains in big-data projects without our having to turn to Rcpp/C++.
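Applying it to the main function is a one-liner (the compiler package ships with base R; note that recent versions of R JIT-compile functions by default, so the gain may be modest):

```r
library(compiler)

# Byte-compile the main function and use the result exactly as before:
generate_datatable_from_rasterbricks_cmp <-
  cmpfun(generate_datatable_from_rasterbricks)
```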