Quick Note: Writing large .csv files in R with fwrite() and beyond

The problem.

R can be nasty when it comes to reading and writing “large” datasets. As practitioners, we often resort to hacky practices, emergent libraries, and alternative methods to avoid crashing our systems during heavy jobs.

One alternative I had appealed to was the ff library, a great package for processing large data. Importantly, ff has an awesome function for quickly saving multi-gig .csv files: write.table.ffdf().

Recently, however, write.table.ffdf() (and other go-to methods) seemed to constantly crash large jobs on my Linux machine. I’d come back to my computer only to find that the job had died.

fwrite() to the rescue.

(Re-)enter the data.table package. Like ff, data.table is useful in its own right for processing big data. Though the library has long had a great method for quickly opening large files, its fread() command, it lacked a comparable fwrite() command. Until now…
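For reference, fread() is the reading half of the story: it auto-detects separators and column types and is dramatically faster than read.csv() on big files. A minimal, self-contained sketch (the demo file here is a small stand-in, not real weather data):

```r
library(data.table)

# Write a small demo .csv so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = letters[1:5]), tmp, row.names = FALSE)

# fread() auto-detects the separator and column types, and is written in C
DT <- fread(tmp)
print(DT)
```

On a real multi-gig file you would just pass the path directly: `DT <- fread("huge_file.csv")`.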

data.table author Matt Dowle is experimenting with adding an fwrite() function (contributed by Otto Seiskari) to the package (see his extensive blog post here). Like fread(), fwrite() is written in C and is surprisingly efficient.

The fwrite() function has truly come through for me, allowing me to write terabytes of weather data without the complications I have run into with other packages.

Installing fwrite() functionality.

As of March 30, 2017, the fwrite() function hasn’t been added to the current CRAN release of the package. To install the development version of the data.table library with the fwrite() command, you’ll have to uninstall your current standard version of the library and install the development version from the GitHub repository:

remove.packages("data.table")  # remove the CRAN version first
install.packages("data.table",
	type = "source",
    repos = "http://Rdatatable.github.io/data.table" )

See the data.table wiki for more.

installing to /home/XXX/R/x86_64-pc-linux-gnu-library/3.3/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (data.table)

There won’t be any documentation; don’t be alarmed when nothing shows up when you type ?fwrite:

> ?fwrite
No documentation for fwrite in specified packages and libraries:
you could try ??fwrite

For my current project I have to repeatedly process a hundred files, each with over 40 million lines (around 4 gigs each). Writing these data.frames to .csv takes under a minute each on a pretty lowly desktop.

Much faster than other methods.

Also: fairly recently, the prolific Wes McKinney and Hadley Wickham developed the feather package (see http://github.com/wesm/feather) to address some of the memory issues that plague folks trying to edit and save large datasets in R (see a good discussion here).
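feather’s R interface is tiny: write_feather() and read_feather() round-trip a data frame through its binary columnar format. A minimal sketch (the toy data frame is illustrative only):

```r
library(feather)

# A toy data frame standing in for a larger dataset
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
path <- tempfile(fileext = ".feather")

# Binary columnar format: fast to write, fast to read,
# and readable from Python via the same feather library
write_feather(df, path)
df2 <- read_feather(path)
```

Unlike .csv, feather files preserve column types exactly, so there is no parsing step on read.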