R can be nasty when it comes to reading and writing “large” datasets. As practitioners, we often resort to hacky practices, emergent libraries, and alternative methods to avoid crashing our systems during heavy jobs.
One alternative I had appealed to was the ff library, a great package for processing large data. Importantly, ff had an awesome function for quickly saving multi-gig .csv files (write.table.ffdf()), as write.table() and other go-to methods seemed to constantly crash large jobs on my Linux machine. I’d come back to my computer only to find the job had failed.
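For reference, the ff workflow I’m describing looked roughly like the sketch below; the example data and file name are made up for illustration, and I’m assuming the usual as.ffdf() conversion step.

```r
library(ff)

# Toy stand-in for a large data frame
df  <- data.frame(x = rnorm(1e6), y = runif(1e6))

# Move the data frame into ff's on-disk representation
fdf <- as.ffdf(df)

# Write the .csv in chunks rather than holding everything in memory at once
write.table.ffdf(fdf, file = "big.csv", sep = ",", row.names = FALSE)
```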
fwrite() to the rescue.
Enter the data.table package. Like ff, data.table is useful in its own right for processing big data. Though the library had great methods for quickly opening large files with its fread() command, it lacked a comparable fwrite() command. Until now…
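If you haven’t used fread(), reading a large .csv is a one-liner; a quick illustration (the file name here is just a placeholder):

```r
library(data.table)

# fread() auto-detects the separator, header, and column types,
# and reads large files far faster than read.csv()
dt <- fread("some_big_file.csv")
```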
The data.table author, Matt Dowle, is experimenting with adding (See his extensive blog post here) an fwrite() function (contributed by Otto Seiskari) to the package. Like fread(), fwrite() is also written in C and is surprisingly efficient. The fwrite() function has truly come through for me, allowing me to write terabytes of weather data without the complications I have run into with other packages.
Installing fwrite() functionality.
As of March 30, 2017, the fwrite() function hasn’t been added to the current, core CRAN release of the package. To install the development version of the data.table library with the fwrite() command, you’ll have to uninstall your current standard version of the library and install the development version from the GitHub repository:
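One way to do this is sketched below; it assumes you have the devtools package and a compiler toolchain available, since the package is built from source.

```r
# Remove the standard CRAN copy of data.table
remove.packages("data.table")

# Install the development version from the GitHub repository
# (building from source requires a C compiler)
install.packages("devtools")
devtools::install_github("Rdatatable/data.table")
```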
There won’t be any documentation; don’t be alarmed when nothing shows up when you type ?fwrite.
For my current project I have to repeatedly process a hundred files, each with over 40 million lines (around 4 gigs each). Writing these data.frames to a .csv takes under a minute on a pretty lowly desktop:
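The call itself is a one-liner, roughly like the sketch below (the object and file names are placeholders for my weather data):

```r
library(data.table)

# dt is a data.table/data.frame with ~40 million rows
system.time(
  fwrite(dt, "weather_chunk.csv")
)
```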
Much faster than other methods.
Also: fairly recently, the prolific Wes McKinney and Hadley Wickham developed the feather package (see http://github.com/wesm/feather) to address some of the memory issues that plague folks trying to edit and save large data sets in R (see a pretty good discussion here).
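For completeness, a minimal feather sketch (assuming the feather R package is installed; the object and path are placeholders):

```r
library(feather)

# Write and read a data frame in the binary feather format,
# which is also readable from Python via pandas
write_feather(df, "weather_chunk.feather")
df2 <- read_feather("weather_chunk.feather")
```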