Lately I have been looking for ways to decrease the amount of time it takes me to run multiple regressions over a very large data set. There are several options that I am investigating to do this, and certainly more that I don’t know of yet.
- Code more efficiently.
- Compute several operations in parallel over two or more CPU cores.
- Tap into a network of computers, and further expand the number of CPU cores to parallelize calculations.
Because many of my computing jobs are “embarrassingly parallel”, the options above would immediately cut the time it takes me to compute (and re-compute) them. This post walks through an example that uses the CRAN package snowfall to parallelize a computation over several CPU cores on the same computer (bullet #2 above).
Before beginning to use snowfall, do the following:
- Upgrade to the latest version of R (version 2.14.1 as of this post) or to the patched version of R-2.13.0 (available here). FYI: a bug in version 2.13.0 on MS Windows 7 prevents snowfall from operating smoothly.
- Install the latest version of the package: install.packages('snowfall', dependencies = TRUE)
- Find out how many cores you have on the CPU of the machine you will be using. In my example below, I am using a machine with 8 CPU cores and running Windows 7.
- Convert any ‘for’ loops into a function that you can call using apply(). See my previous post that outlines this process.
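The last two checklist items can be sketched in a few lines of base R. This is only an illustration under stated assumptions: fit_slope() is a hypothetical job invented for this sketch (a regression on a bootstrap resample of the built-in mtcars data), sapply() stands in for the apply family, and detectCores() comes from the base parallel package (shipped with R 2.14.0 and later), which this post does not otherwise use.

```r
# Hypothetical per-job function (not from the original post): fit a regression
# on one bootstrap resample of mtcars and return the slope on wt.
fit_slope <- function(i) {
  set.seed(i)  # seed per job so each job is reproducible on its own
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))[["wt"]]
}

# Loop version: the jobs run one after another.
slopes_loop <- numeric(10)
for (i in 1:10) {
  slopes_loop[i] <- fit_slope(i)
}

# Apply version: same jobs, same results, but now expressed as
# "call this function once per element", which is the shape a
# parallel apply can take over.
slopes_apply <- sapply(1:10, fit_slope)

# Assumption: detectCores() from the base 'parallel' package is one way
# to look up the core count; it can return NA on some platforms.
n_cores <- parallel::detectCores()
```

Because each call to fit_slope() depends only on its own argument, the jobs are independent of one another, which is what makes them safe to hand out to separate cores.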
snowfall: A simple example
I put this post together because I couldn’t easily find a ‘plug’n play’ code example in the existing online literature for the type of parallelization I wanted. Out of necessity I worked through the wrinkles and am now successfully using multiple CPU cores in R. Note: By default, R uses only one CPU core unless you explicitly code it to use multiple cores (as in this example).
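For orientation, the bare pattern looks roughly like the sketch below. It is a toy illustration, not the worked example from this post: slow_square() is a hypothetical stand-in for an expensive job, and the worker count of 2 is an assumption you would replace with your own core count.

```r
library(snowfall)

# Hypothetical stand-in for one expensive, independent job.
slow_square <- function(x) {
  Sys.sleep(0.1)  # pretend to do real work
  x^2
}

# Start worker processes on this machine.
# Assumption: 2 workers; set cpus to match your own core count.
sfInit(parallel = TRUE, cpus = 2)

# sfSapply() is snowfall's parallel counterpart of sapply():
# the elements of 1:8 are distributed across the workers.
squares <- sfSapply(1:8, slow_square)

# Shut the workers down when finished.
sfStop()

print(squares)
```

If the job function relied on other objects or packages, they would have to be sent to the workers first (snowfall provides sfExport() and sfLibrary() for that); this toy function needs neither.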