For a current project, I have to repeatedly manipulate and save a hundred datasets, each with about 4 million observations. While R tools like fread(), part of the data.table library, make it trivial to load massive data sets into memory, writing these data sets, and doing so repeatedly, is another story.
When trying to save big data sets, many folks first recommend the write.table() function, which gives some people performance gains over the default write.csv() method. For my project, write.table() wasn’t cutting it and, instead, often crashed my system.
I then turned to the ff library, an incredibly handy tool for working with big data. In fact, ff nimbly deals with some of R’s serious memory issues (see a more technical guide here). Instead of doing anything fancy with ff, however, I realized one could harness the library to simply save problematic data sets. The performance gains were striking.
A simple alternative to write.table():
Below I take a data.table object I was manipulating (perhaps in a loop) and convert it into an “ff dataframe” (ffdf), which I can then save using ff’s speedy .csv file writing function.
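Here is a minimal sketch of that workaround. The small generated table is only a stand-in for mygiant_datatable, and the names mygiant_ffdf and out_file are placeholders I made up for illustration. Going through as.data.frame() before as.ffdf() is my assumption about the safest route, not necessarily the only one.

```r
library(data.table)
library(ff)

# Stand-in for the real data: a data.table with a few numeric columns.
# (Hypothetical example data; the real table had millions of rows.)
mygiant_datatable <- data.table(
  id = 1:1000,
  x  = rnorm(1000),
  y  = runif(1000)
)

# Coerce to an ffdf. as.ffdf() has a data.frame method, so the data.table
# is first converted to a plain data.frame.
mygiant_ffdf <- as.ffdf(as.data.frame(mygiant_datatable))

# Write the ffdf to disk with ff's CSV writer.
# out_file is a placeholder path; point it wherever the file should go.
out_file <- file.path(tempdir(), "mygiant_datatable.csv")
write.csv.ffdf(mygiant_ffdf, file = out_file)
```

Inside a loop over many datasets, the same two calls (as.ffdf() and then write.csv.ffdf()) would simply be repeated with a different table and file name on each pass.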
Note: the type of data table (mygiant_datatable) I was working with was quite simple, composed of only a few numeric columns. Thus, coercing the data into an ffdf object was no problemo.
Of course, in the cheap workaround above, the as.ffdf() call adds a costly step (time-wise). However, it was well worth the benefit of utilizing the write.csv.ffdf() function … and worth not crashing ad nauseam.