Note: This was originally some notes to RAs but I figured it may be useful for other people out there.
I've had some discussions with econ folks and RAs who are working with giant datasets in R for the first time. In particular, those having to "harvest" or "grow" unwieldy datasets. R is notoriously slow when it comes to expanding datasets, such as when you want to incrementally append rows to a file with results from a scraping API, or combine a giant stack of raw text files from a text mining project.
The usual "good" method for concatenation uses the
do.call function together with the rbind function. This method essentially takes a list of objects and passes them as arguments all at once to
rbind. In other words, you can take a list of data.frames and bind the rows together in one motion:
do.call("rbind", <<<A list of data.frame names>>>).
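A minimal sketch of this pattern, using three toy data.frames in place of whatever list you have built up:

```r
# Three small data.frames with identical columns (toy example)
df1 <- data.frame(id = 1:2, x = c("a", "b"))
df2 <- data.frame(id = 3:4, x = c("c", "d"))
df3 <- data.frame(id = 5:6, x = c("e", "f"))

# Pass the whole list to rbind in a single call,
# instead of rbind-ing pairs inside a loop
combined <- do.call("rbind", list(df1, df2, df3))
nrow(combined)  # 6
```

The key point is that rbind receives all the pieces at once, rather than being called repeatedly as the dataset grows.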
A common task I encounter is grabbing a chunk of files from a directory and combining them into a dataset. Such a task requires three steps. First, generating a list of files from a directory that match a pattern (e.g. all the .csv files in a directory) using the
list.files() function. Next, looping over this list of files and loading them into R with
lapply, applying the
read.csv() function to each file. Then, finally, using
rbind to stack all the loaded .csv files into a single dataset.
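The three steps above can be sketched as follows. The temporary directory and toy .csv files exist only to make the example self-contained; in practice you would point `list.files()` at your own data directory:

```r
# Write a few toy .csv files to a scratch directory (illustration only)
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file.path(demo_dir, paste0("part", i, ".csv")),
            row.names = FALSE)
}

# Step 1: list the .csv files in the directory
csv_files <- list.files(path = demo_dir, pattern = "\\.csv$",
                        full.names = TRUE)

# Step 2: load each file into a list of data.frames
df_list <- lapply(csv_files, read.csv)

# Step 3: stack the list into a single data.frame
full_data <- do.call("rbind", df_list)
nrow(full_data)  # one row per toy file here
```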
This is all great, but it can still take a ton of time. Note that I condense the
lapply call and the
rbind call into one line.
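The condensed one-liner looks like this (again with throwaway temp files standing in for real data):

```r
# Toy file list: two CSVs written to temp files for illustration
paths <- replicate(2, tempfile(fileext = ".csv"))
write.csv(data.frame(n = 1:3), paths[1], row.names = FALSE)
write.csv(data.frame(n = 4:6), paths[2], row.names = FALSE)

# lapply + rbind condensed into a single line
full_data <- do.call("rbind", lapply(paths, read.csv))
nrow(full_data)  # 6
```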
If you're doing more complicated tasks or working with large sets of data, processing time can balloon.
The data.table package can speed things along if we're trying to get big data into R efficiently (I highly recommend checking out the GitHub page for the project). The
rbindlist function included in the package is incredibly fast and written in C. In addition, the
fread function is built to efficiently read data into R.
Below I replace the normal
read.csv function with
fread(), and replace
do.call("rbind", ...) with rbindlist().
Both methods deliver identical datasets, but there are some real efficiency gains when using
rbindlist from the super useful
data.table package.
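A sketch of the data.table version, assuming the package is installed; the temp files are again just a stand-in for a real directory of CSVs:

```r
library(data.table)  # assumes data.table is installed

# Toy CSVs to make the sketch runnable (3 files x 5 rows each)
paths <- replicate(3, tempfile(fileext = ".csv"))
for (p in paths) write.csv(data.frame(x = rnorm(5)), p, row.names = FALSE)

# fread() replaces read.csv(), rbindlist() replaces do.call("rbind", ...)
full_dt <- rbindlist(lapply(paths, fread))
nrow(full_dt)  # 15
```

Both fread and rbindlist are written in C, which is where the speedup comes from; the shape of the code is otherwise the same as the base-R version.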
This can have pretty amazing payoffs when trying to load massive datasets into R for processing.
26 Jun 2015