Posts tagged ‘heteroskedasticity’

June 11, 2011

Clustered Standard Errors in R

So, you want to calculate clustered standard errors in R (a.k.a. cluster-robust, huber-white, White’s) for the estimated coefficients of your OLS regression? This post shows how to do this in both Stata and R:


Let’s say that you want to relax the Gauss-Markov homoskedasticity assumption, and account for the fact that there may be several different covariance structures within your data sample that vary by a certain characteristic – a “cluster” – but are homoskedastic within each cluster.

For example, say you have a panel data set with a bunch of different test scores from different schools around the country. You may want to cluster your sample by state, by school district, or even by town. Economic theory and intuition will guide you in this decision.

So, similar to heteroskedasticity-robust standard errors, you want to allow more flexibility in your variance-covariance (VCV) matrix (Recall that the diagonal elements of the VCV matrix are the squared standard errors of your estimated coefficients). The way to accomplish this is by using clustered standard errors. The formulation is as follows:

Variance_{cluster} =

\bigl (\frac{M}{M-1}\frac{N-1}{N-K} \bigr )(X'X)^{-1}\sum_{j=1}^M \{X_M\hat{\varepsilon}_M\hat{\varepsilon}'_MX_M\}(X'X)^{-1}

where M = number of unique clusters (e.g. number of school districts) N = number of observations, and K = the number of regressors (including the intercept). (See pages 312-313 of Angrist and Pischke’s Mostly Harmless Econometrics (Princeton University Press, 2009) for a better explanation of the summation notation.) This returns a Variance-covariance (VCV) matrix where the diagonal elements are the estimated cluster-robust coefficient variances — the ones of interest. Estimated coefficient standard errors are the square root of these diagonal elements.

Stata makes it very easy to calculate, by simply adding ,cluster(state) to the end of your regression command.


use wr-nevermar.dta
reg nevermar impdum, cluster(state)


In R, you first must run a function here called cl() written by Mahmood Ara in Stockholm University – the backup can be found here and here.

   cl   <- function(dat,fm, cluster){
              require(sandwich, quietly = TRUE)
              require(lmtest, quietly = TRUE)
              M <- length(unique(cluster))
              N <- length(cluster)
              K <- fm$rank
              dfc <- (M/(M-1))*((N-1)/(N-K))
              uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
              vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
              coeftest(fm, vcovCL) }

After running the code above, you can run your regression with clustered standard errors as follows:

nmar = read.dta("")

# Run a plain linear regression
regt = lm(nevermar ~ impdum, data = nmar)

# apply the 'cl' function by choosing a variable to cluster on.
# here, we are clustering on state.
cl(nmar, regt, nmar$state)

Thoughts, comments, ideas?  Let me know; I’m always appreciative of feedback. You can contact me via e-mail at kevingoulding {at} gmail [dot] com.