Clustered Standard Errors in R

So, you want to calculate clustered standard errors in R (a.k.a. cluster-robust, huber-white, White’s) for the estimated coefficients of your OLS regression? This post shows how to do this in both Stata and R:

Overview

Let’s say that you want to relax the Gauss-Markov homoskedasticity assumption, and account for the fact that there may be several different covariance structures within your data sample that vary by a certain characteristic – a “cluster” – but are homoskedastic within each cluster.

For example, say you have a panel data set with a bunch of different test scores from different schools around the country. You may want to cluster your sample by state, by school district, or even by town. Economic theory and intuition will guide you in this decision.

So, similar to heteroskedasticity-robust standard errors, you want to allow more flexibility in your variance-covariance (VCV) matrix (Recall that the diagonal elements of the VCV matrix are the squared standard errors of your estimated coefficients). The way to accomplish this is by using clustered standard errors. The formulation is as follows:

Variance_{cluster} =

\bigl (\frac{M}{M-1}\frac{N-1}{N-K} \bigr )(X'X)^{-1}\sum_{j=1}^M \{X_M\hat{\varepsilon}_M\hat{\varepsilon}'_MX_M\}(X'X)^{-1}

where M = number of unique clusters (e.g. number of school districts) N = number of observations, and K = the number of regressors (including the intercept). (See pages 312-313 of Angrist and Pischke’s Mostly Harmless Econometrics (Princeton University Press, 2009) for a better explanation of the summation notation.) This returns a Variance-covariance (VCV) matrix where the diagonal elements are the estimated cluster-robust coefficient variances — the ones of interest. Estimated coefficient standard errors are the square root of these diagonal elements.

Stata makes it very easy to calculate, by simply adding ,cluster(state) to the end of your regression command.

STATA:

use wr-nevermar.dta
reg nevermar impdum, cluster(state)

R:

In R, you first must run a function here called cl() written by Mahmood Ara in Stockholm University – the backup can be found here and here.

   cl   <- function(dat,fm, cluster){
              require(sandwich, quietly = TRUE)
              require(lmtest, quietly = TRUE)
              M <- length(unique(cluster))
              N <- length(cluster)
              K <- fm$rank
              dfc <- (M/(M-1))*((N-1)/(N-K))
              uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
              vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
              coeftest(fm, vcovCL) }

After running the code above, you can run your regression with clustered standard errors as follows:

require(foreign)
nmar = read.dta("http://www.montana.edu/econ/cstoddard/562/wr-nevermar.dta")

# Run a plain linear regression
regt = lm(nevermar ~ impdum, data = nmar)

# apply the 'cl' function by choosing a variable to cluster on.
# here, we are clustering on state.
cl(nmar, regt, nmar$state)

Thoughts, comments, ideas?  Let me know; I’m always appreciative of feedback. You can contact me via e-mail at kevingoulding {at} gmail [dot] com.

Advertisements

33 Responses to “Clustered Standard Errors in R”

  1. Dear Kevin, thank you for your blog (especially this entry). one question: do you know a procedure to use multiple input variables and not only one like in the example given by Mahmood Ara?
    my goal is to use a formula like y ~ x1 + x2 + x1*x2 … any ideas will be highly motivating….

    • Hi Riccardo – the procedure outlined above should work fine for a regression with multiple independent variables. For example, if instead of “regt = lm(nevermar ~ impdum, data = nmar)” it was “regt = lm(nevermar ~ impdum + x2 + x3, data = nmar)”, the procedure to output clustered standard errors would be exactly the same: “cl(nmar, regt, nmar$state)” . If your question is how to cluster around more than one variable (e.g. multiple clusters), then I suggest you look at the function “mclx()” that is also in the Mahmoon Ara documentation. Hope this helps. -Kevin

  2. Hi Kevin, thanks for your very quick reply. I thought so as well but the sandwich estimator function throwed me following error: Fehler in bread. %*% meat. : nicht passende Argumente (Failure in bread. %*% meat, arguments not matching). this was given me the idea about to use this function only in a one-variable-regression way. examining the results of the used bread and meat functions just showed me a NA problem for the t-values, P values and so on in the linematching (lm) results for the multiple variable model…
    i hope it will work when the problem is solved.

  3. dear kevin,

    due to some singularity in my data i was receiving this error. now i have solved this and everything is working fine. Yet: is there any possibility to get an R-squared value?

    all the best, riccardo

    • Hi,

      I have the same problem (singularities within the data) and wonder how you solved it. Is there any other way than dropping some/those variables?

      best, sjard.

      • Hi STS — What is causing the singularities in your data? This could be due to overspecification (e.g., too many regressors for too little data), or a “dummy variable trap”, or all the variation in one regressor is captured in another. You will have to look closely at your data to identify what is causing the singularity. Another solution could be to add more data (increase sample size). Without understanding the nature of the singularity, it is difficult for me to help you. HTH, Kevin

  4. in the original script by Arai he is dividing the results of sandwich by dfcw. have you missed that? best regards, riccardo

  5. one more thing: you`re not using “dat” in your function.
    the original function by Arai uses a correction for the degrees of freedom, which you are not using. so the function only depends on clusters and the model….

  6. Just use some stepwise function to create a good linear model. Than there will be no singularities. And one more thing : excuse my question about the r-squared. Of course it is given by lm() itself, isn’t it?

  7. Hi Kevin,

    I was just wondering about a little thing on clustered standard errors. i was hoping to hear an opinion from you on this.
    IF we cluster the standard erros, do we have to cluster them at the unit of observations (say, villages). Or, if 5 to 6 villages are comprised in a county, would it also be possible to cluster the according to the number of counties? The latter makes more sense to me since we would allow for clustering within the cluster group (region) which seems reasonable to me (imagine the county decides to build a new railway station, all villages would be affected). So, provides that we think that the no of clusters is high enough, could we also cluster at a unit that is NOT the unit of the observations in one’s dataset?

    I am really curious to read your views on that!

    • Hi Sach, great question. The short answer is yes. We can cluster at any “grouping level” we want, and can cluster on multiple groupings. Nearly always it makes the most sense to group at a level that is not at the unit-of-observation level. For example, if you have individual test scores for students across the country, you might want to cluster the standard errors by state, school district, region, or some combination thereof. To take your question a bit further, the only time you might want to cluster standard errors at the observation level is if you have a time series component in your data (i.e. repeated measures). Although if this were the case, you might want to investigate exactly how observations for the same individual vary over time — are they autocorrelated, for example. This might lead you to an approach that explicitly models autocorrelation at the individual level, and while it is related to clustering standard errors, is not the same. To conclude, it is highly unlikely that you will want to cluster standard errors at the unit-of-observation level. Hope this helps.

  8. Hey, Kevin!

    This is great post. I do have a little problem applying the procedure, though. Eventhough I included the packages zoo and sandwich, when I run c1 I am told that the function “coeftest” couldn’t be found. Perhaps this is a stupid questions which has nothing to do with the thread but it would be very nice, if you answered anyway. Thx – me

  9. Hey, Kevin, great post. Thanks for doing this. I have seen similar posts on robust and clustered SEs, and there are often annoying small differences between results from that code that stata. This was pretty much spot on. I had a quick question about clustered SEs and other models, say negative binomial and poisson. Do you know if this code works for non lm() models as well?

  10. Hi, I get the error message Error in tapply(x, cluster, sum) : arguments must have same length
    is there anyway to get around this?

    • Yes — I’m getting the same error. Any luck finding a solution?

      • I was getting the same error, and realized this is due to NA values in one of the variables in the original lm() I ran, which made the length of the vector I was trying to cluster on (which had no NAs) a different length. After I limited the data in the lm() to only those rows that had no NAs in any of the variables I was inputting into the lm, the cl() function worked…although my Stata output is still not quite matching the R output yet…

      • I’m getting the “Error in tapply(x, cluster, sum) : arguments must have same length” I tried the code Aks suggests to deal with NA’s and got “Error in data$fipstate : object of type ‘closure’ is not subsettable” Any suggestions would be greatly appreciated.

  11. Kevin, I am having trouble loading the sandwich package. I receive the following error message: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
    there is no package called ‘zoo’
    Any suggestions?

  12. Hi Kevin,
    The function does not estimate when the residuals have NA in them.
    so when it computes estfun(fm), it has missing rows and hence the number of rows in estfun(fm) is less than the number of rows in data.
    this causes error in the next step of computing uj because rows of data are not the same as rows of estfun.

    I tried amending the code as follows
    fm<- lm(y~x, wt=wt, na.action=na.exclude)
    uj <- apply(estfun(fm),2, function(x) tapply(x, data$fipstate, sum, na.rm=TRUE))

    In this modified code, at least the code runs and computes standard errors, but the clustered se is not the same as obtained in stata.

  13. Hi Kevin,

    I’m having an issue using your function: http://stackoverflow.com/questions/23313907/clustering-standard-errors

    If you have any suggestions, would be very grateful!

    Best,

    Sam

  14. Thanks for this! Does the F-statistic change? If so, how do I get that? I am assuming the R-squared stays the same?

  15. Hey Kevin,
    Why don’t we use the bread part of the sandwich function i.e (X’X)^-1 parts around the meat in this code? The original paper states that under some assumed conditions S(teta) i.e the sandwich function reduces to the bread function which mean that we don’t have the (X’X)^-1 part you have written on the equation. Is this correct?

  16. Hi,
    after running the code i received an error message “Error: unexpected ‘}’ in “coeftest(fm,vcovCL)}”
    please kindly help me solve this problem

  17. Hello all

    If you want clustered standard errors in R, the best way is probably now to use the “multiwayvcov” package. Another alternative is the “robcov” function in Frank Harrell’s “rms” package.

  18. Hi,
    I am currently writing my bachelor thesis and I have to do a regression and a multiple regression, both with clustered standard errors. I have absolutely no experience with R myself and I am just using codes I find (and understand) on the internet. I use this code here and extended it so that I can have two clusters (http://stackoverflow.com/questions/8389843/double-clustered-standard-errors-for-panel-data). Everything works perfectly fine. The only thing that I would need is the Adjusted R-Squared value for the clustered regression. Unfortunately, I can’t figure it out. I know for the regression function I just have to type in summary(regtX) to get it. But for the regression with clusters that does not work, does it?
    Any (simple) help appreciated 😀

    • Clustering only affects standard errors, not the coefficient estimates or model fit! In short, you can use the R-squared estimate from your earlier (non-clustered) regression because it won’t change.

Trackbacks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: