Archive for June, 2011

June 29, 2011

TikZ diagrams with R: A Normal probability distribution function

You may have seen an earlier post where I went through some examples of how to create a normal distribution in LaTeX using TikZ. In this post, I will show a different way to accomplish a similar result using R and the tikzDevice package.

tikzDevice is an R package that exports any image created in R as TikZ code in a .tex file. In order to include the resulting .tex file in your LaTeX document, you need to do two things:

  • add \usepackage{tikz} in the preamble to your LaTeX document.
  • add \include{normal_pdf} where you’d like your image (after you’ve created and outputted normal_pdf.tex from R, as shown below).

R:

# load tikzDevice package
require(tikzDevice)

# Choose boundaries to be shaded in blue
a = 0.5
b = 1.8

# creates x & y boundaries based on a and b parameters
x.val <- c(a,seq(a,b,0.01),b) 
y.val <- c(0,dnorm(seq(a,b,0.01)),0)

# choose the name and location for your .tex file
# it should be the same directory as your latex document
tikz( '/Users/kevingoulding/latex_documents/thesis/normal_pdf.tex' )

# plots a normal distribution curve
curve(dnorm(x,0,1),xlim=c(-3,3),main='The Standard Normal PDF',
	xlab = '$x$', ylab = '$f(x)$',
	frame.plot = FALSE, axes = FALSE)

# shades in a polygon underneath curve
polygon(x.val,y.val,col='skyblue')

# creates blank axes
Axis(side=1, labels=FALSE)
Axis(side=2, labels=FALSE)

# must turn device off to complete .tex file
dev.off()
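
If the default picture size is too large for your document, tikz() also accepts width and height arguments (in inches) and a standAlone option that wraps the output in a minimal LaTeX document so it can be compiled on its own. A small sketch (the file name below is just a placeholder):

# Optional: set the size of the TikZ picture (in inches) and make it self-compiling
tikz('normal_pdf_small.tex', width = 5, height = 3.5, standAlone = TRUE)
curve(dnorm(x, 0, 1), xlim = c(-3, 3), xlab = '$x$', ylab = '$f(x)$')
dev.off()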
June 25, 2011

TikZ diagrams with R: tikzDevice

There are several options for integrating your R workspace with LaTeX. One of these is the R package tikzDevice, which allows you to export images created in R as TikZ code in a .tex file, for immediate use in a LaTeX document via the line \include{diagrams}.

A simpler way, the one we all start out with, is to export an image from R as a .pdf, then include it using the line \includegraphics{diagrams.pdf}. This is a pretty easy and straightforward workflow – so, why would I want to use tikzDevice?

There are several advantages to converting your images into TikZ code directly from R:

  1. TikZ diagrams consist of vectors coded directly into your LaTeX document: there’s no loss of image resolution.
  2. The labels on TikZ diagrams match the font of your LaTeX document.
  3. Wonderful LaTeX equations can be effortlessly used as labels in your diagrams.
  4. You can use a loop in R to create a single .tex file containing many images.
  5. The same loop can add \caption{} and \label{} lines to each image for immediate cross-referencing within LaTeX (a rough sketch follows this list).
  6. You can include all these features and output via one line in LaTeX: \include{diagrams}.
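
Here is a rough sketch of how points 4 and 5 might look in practice. It assumes a data frame df whose columns you want to plot; the column names and file names below are made up for illustration. The loop writes one .tex file per plot, plus a wrapper file diagrams.tex full of figure environments with \caption{} and \label{} lines, which you pull into LaTeX with a single \include{diagrams}:

require(tikzDevice)

plot.vars <- c("gdp", "unemployment", "inflation")   # hypothetical column names in df
wrapper <- file("diagrams.tex", "w")                 # the file you \include{} in LaTeX

for (v in plot.vars) {
  figfile <- paste(v, "_plot.tex", sep = "")
  tikz(figfile)
  plot(df[[v]], type = "l", xlab = "$t$", ylab = paste("$", v, "_t$", sep = ""))
  dev.off()

  # wrap each plot in a figure environment with a caption and label
  cat("\\begin{figure}[htbp]\n",
      "  \\input{", figfile, "}\n",
      "  \\caption{Time series of ", v, "}\n",
      "  \\label{fig:", v, "}\n",
      "\\end{figure}\n\n", sep = "", file = wrapper)
}
close(wrapper)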

A Simple Example

That being said, let’s export a TikZ scatterplot using the tikzDevice package. We will use data posted on Dr. Walter Enders’ web site.

Notice the fancy LaTeX equations as labels on the plot.

R:

# gdata helps read .xls files
require(gdata)
df = read.xls("http://cba.ua.edu/assets/docs/wenders/arch.xls", sheet = 1)

# tikzDevice will export the plots as a .tex file
require(tikzDevice)

# choose a name and path for the .tex file
# folder should be the same as where your latex document resides
tikz( '/Users/kevingoulding/latex_documents/thesis/plot_with_line.tex' )

plot(df, xlab = "$\\alpha_t + \\hat{\\beta}X_t$", ylab = "$Y_t$",
            main = "$Y_t = \\alpha_t + \\hat{\\beta}X_t$")
abline(h = mean(df[,2]), col = "red", lwd = 2)

dev.off()					# must turn device off to complete .tex file

To include this diagram in your LaTeX document, simply add the line \include{plot_with_line} and compile. Don’t forget to include \usepackage{tikz} in the preamble. If you zoom in, you can see that we’ve labeled the plot and axes using LaTeX math language (amsmath).

A few things to be careful with as you try to code LaTeX equations from within R:

  • All backslashes need to be doubled: \ becomes \\.
  • All equations still need to be bordered by $ on each side.

To be continued…

June 20, 2011

Differences-in-Differences estimation in R and Stata

{ a.k.a. Difference-in-Difference, Difference-in-Differences, DD, DID, D-I-D. }

DID estimation uses four data points to deduce the impact of a policy change or some other shock (a.k.a. treatment) on the treated population: the effect of the treatment on the treated.  The setup relies on the assumption that the treatment group and control group have similar characteristics and are trending in the same way over time.  The counterfactual (unobserved scenario) is therefore that, had the treated group not received the treatment, its mean would have remained the same distance from the control group's mean in the second period.  See the diagram below; the four data points are the observed mean (average) of each group, and they are the only data points needed to calculate the effect of the treatment on the treated.  The dotted lines represent the trend that is not observed by the researcher.  Notice that although the means differ, the two groups share the same time trend (i.e. slope).
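
For instance, with purely hypothetical numbers: if the treated group's mean moves from 0.40 to 0.46 while the control group's mean moves from 0.50 to 0.52, the DID estimate of the treatment effect is (0.46 - 0.40) - (0.52 - 0.50) = 0.04.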

For a more thorough walk-through of the effect of the Earned Income Tax Credit on female employment, see an earlier post of mine.

Calculate the D-I-D Estimate of the Treatment Effect

We will now use R and Stata to calculate the unconditional difference-in-difference estimates of the effect of the 1993 EITC expansion on employment of single women.

R:

# Load the foreign package to import .dta (Stata) files
require(foreign)

# Import the data
# update: first download the file eitc.dta from this link:
# https://docs.google.com/open?id=0B0iAUHM7ljQ1cUZvRWxjUmpfVXM
# Then import it from your hard drive:
eitc = read.dta("C:/link/to/my/download/folder/eitc.dta")

# Create two additional dummy variables to indicate before/after
# and treatment/control groups.

# the EITC went into effect in the year 1994
eitc$post93 = as.numeric(eitc$year >= 1994)

# The EITC only affects women with at least one child, so the
# treatment group will be all women with children.
eitc$anykids = as.numeric(eitc$children >= 1)

# Compute the four data points needed in the DID calculation:
a = sapply(subset(eitc, post93 == 0 & anykids == 0, select=work), mean)
b = sapply(subset(eitc, post93 == 0 & anykids == 1, select=work), mean)
c = sapply(subset(eitc, post93 == 1 & anykids == 0, select=work), mean)
d = sapply(subset(eitc, post93 == 1 & anykids == 1, select=work), mean)

# Compute the effect of the EITC on the employment of women with children:
(d-c)-(b-a)

The result is the width of the “shift” shown in the diagram above.

STATA:

cd "C:\DATA\Econ 562\homework"
use eitc, clear

gen anykids = (children >= 1)
gen post93 = (year >= 1994)

mean work if post93==0 & anykids==0     /* value 1 */
mean work if post93==0 & anykids==1     /* value 2 */
mean work if post93==1 & anykids==0     /* value 3 */
mean work if post93==1 & anykids==1     /* value 4 */

Then you must do the calculation by hand (shown on the last line of the R code).
(value 4 - value 3) - (value 2 - value 1)

Run a simple D-I-D Regression

Now we will run a regression to estimate the conditional difference-in-difference estimate of the effect of the Earned Income Tax Credit on “work”, using all women with children as the treatment group. This is exactly the same as what we did manually above, now using ordinary least squares. The regression equation is as follows:

work = \beta_0 + \delta_0 post93 + \beta_1 anykids + \delta_1 (anykids \times post93) + \varepsilon

Where \varepsilon is the white noise error term, and \delta_1 is the effect of the treatment on the treated — the shift shown in the diagram. To be clear, the coefficient on (anykids \times post93) is the value we are interested in (i.e., \delta_1).

R:

eitc$p93kids.interaction = eitc$post93*eitc$anykids
reg1 = lm(work ~ post93 + anykids + p93kids.interaction, data = eitc)
summary(reg1)

The coefficient estimate on p93kids.interaction should match the value calculated manually above.
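
As an aside, R will build the interaction term for you if you use the * operator in the formula; the following should reproduce the same estimates as reg1 above:

# post93*anykids expands to post93 + anykids + post93:anykids
reg2 = lm(work ~ post93*anykids, data = eitc)
summary(reg2)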

STATA:

gen interaction = post93*anykids
reg work post93 anykids interaction
June 19, 2011

TikZ for economists: A how-to guide with examples

Below is a guide I created to assist economists in quickly creating economics diagrams using the package TikZ in LaTeX.




Thoughts, comments, ideas? Let me know; I’m always appreciative of feedback. Correspondence via e-mail can be directed to kevingoulding [at] gmail {dot} com.

June 17, 2011

TikZ diagrams for economists: A normal pdf with shaded area.

I have been dabbling with the TikZ package to create some diagrams relevant to a first year microeconomics course. The following diagram of the probability density function (pdf) of a normal distribution may be useful to others wishing to integrate similar diagrams into their LaTeX documents or Beamer presentations. To use, insert the following code anywhere you like within a .tex document (you must include \usepackage{tikz} in your header):

The Cumulative Density of y

INSERT INTO .TEX DOCUMENT

\begin{tikzpicture}
% define normal distribution function 'normaltwo'
    \def\normaltwo{\x,{4*1/exp(((\x-3)^2)/2)}}

% input y parameter
    \def\y{4.4}

% this line calculates f(y)
    \def\fy{4*1/exp(((\y-3)^2)/2)}

% Shade orange area underneath curve.
    \fill [fill=orange!60] (2.6,0) -- plot[domain=0:4.4] (\normaltwo) -- ({\y},0) -- cycle;

% Draw and label normal distribution function
    \draw[color=blue,domain=0:6] plot (\normaltwo) node[right] {};

% Add dashed line dropping down from normal.
    \draw[dashed] ({\y},{\fy}) -- ({\y},0) node[below] {$y$};

% Optional: Add axis labels
    \draw (-.2,2.5) node[left] {$f_Y(u)$};
    \draw (3,-.5) node[below] {$u$};

% Optional: Add axes
    \draw[->] (0,0) -- (6.2,0) node[right] {};
    \draw[->] (0,0) -- (0,5) node[above] {};

\end{tikzpicture}

The Probability of u Falling Between x and y

INSERT INTO .TEX DOCUMENT

\begin{tikzpicture}
% define normal distribution function 'normaltwo'
    \def\normaltwo{\x,{4*1/exp(((\x-3)^2)/2)}}

% input x and y parameters
    \def\y{4.4}
    \def\x{3.4}

% this line calculates f(y)
    \def\fy{4*1/exp(((\y-3)^2)/2)}
    \def\fx{4*1/exp(((\x-3)^2)/2)}

% Shade orange area underneath curve.
    \fill [fill=orange!60] ({\x},0) -- plot[domain={\x}:{\y}] (\normaltwo) -- ({\y},0) -- cycle;

% Draw and label normal distribution function
    \draw[color=blue,domain=0:6] plot (\normaltwo) node[right] {};

% Add dashed line dropping down from normal.
    \draw[dashed] ({\y},{\fy}) -- ({\y},0) node[below] {$y$};
    \draw[dashed] ({\x},{\fx}) -- ({\x},0) node[below] {$x$};

% Optional: Add axis labels
    \draw (-.2,2.5) node[left] {$f_Y(u)$};
    \draw (3,-.5) node[below] {$u$};

% Optional: Add axes
    \draw[->] (0,0) -- (6.2,0) node[right] {};
    \draw[->] (0,0) -- (0,5) node[above] {};

\end{tikzpicture}

The TikZ code snippet above is meant to be dropped into a .tex document and work without any further “tinkering”. Please let me know if this is not the case!

June 17, 2011

TikZ diagrams for economists: An excise tax

I have been dabbling with the TikZ package to create some diagrams relevant to a first year microeconomics course. The following diagram of an excise tax may be useful to others wishing to integrate similar diagrams into their LaTeX documents or Beamer presentations. To use, insert the following code anywhere you like within a .tex document (you must include \usepackage{tikz} in your header):

This diagram was created using TikZ.

INSERT INTO .TEX DOCUMENT

%                         TikZ code: An excise tax
 
\begin{tikzpicture}[domain=0:5,scale=1,thick]
\usetikzlibrary{calc}	                             %allows coordinate calculations.
\usetikzlibrary{decorations.pathreplacing}           %allows drawing curly braces.

% Define linear parameters for supply and demand
\def\dint{4.5}		%Y-intercept for DEMAND.
\def\dslp{-0.5}		%Slope for DEMAND.
\def\sint{1.2}		%Y-intercept for SUPPLY.
\def\sslp{0.8}  	%Slope for SUPPLY.

\def\tax{1.5}		%Excise (per-unit) tax

% Define Supply and Demand Lines as equations of parameters defined above.
  \def\demand{\x,{\dslp*\x+\dint}}
  \def\supply{\x,{\sslp*\x+\sint}}
  \def\demandtwo{\x,{\dslp*\x+\dint+\dsh}}
  \def\supplytwo{\x,{\sslp*\x+\sint+\ssh}}


% Define coordinates.
    \coordinate (ints) at ({(\sint-\dint)/(\dslp-\sslp)},{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (ep) at  (0,{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (eq) at  ({(\sint-\dint)/(\dslp-\sslp)},0);
    \coordinate (dint) at (0,{\dint});
    \coordinate (sint) at (0,{\sint});

    \coordinate (teq) at  ({(\sint+\tax-\dint)/(\dslp-\sslp)},0); %quantity
    \coordinate (tep) at  (0,{(\sint+\tax-\dint)/(\dslp-\sslp)*\sslp+\sint+\tax}); %price
    \coordinate (tint) at  ({(\sint+\tax-\dint)/(\dslp-\sslp)},{(\sint+\tax-\dint)/(\dslp-\sslp)*\sslp+\sint+\tax}); %tax equilibrium
    
    \coordinate (sep) at (0,{\sslp*(\sint+\tax-\dint)/(\dslp-\sslp)+\sint});
    \coordinate (sen) at ({(\sint+\tax-\dint)/(\dslp-\sslp)},{\sslp*(\sint+\tax-\dint)/(\dslp-\sslp)+\sint});   

% DEMAND
    \draw[thick,color=blue] plot (\demand) node[right] {$P(q) = -\frac{1}{2}q+\frac{9}{2}$};
    
% SUPPLY
    \draw[thick,color=purple] plot (\supply) node[right] {Supply};

% Draw axes, and dotted equilibrium lines.
    \draw[->] (0,0) -- (6.2,0) node[right] {$Q$};
    \draw[->] (0,0) -- (0,6.2) node[above] {$P$};
    \draw[decorate,decoration={brace},thick]  ($(sep)+(-0.8,0)$) -- ($(tep)+(-0.8,0)$) node[midway,below=-8pt,xshift=-18pt] {tax};
    
    \draw[dashed] (tint) -- (teq) node[below] {$Q_T$};         			
    \draw[dashed] (tint) -- (tep) node[left] {$P_d$};          				
    \draw[dashed] (sen) -- (sep) node[left] {$P_s$};          				
        
\end{tikzpicture}

The TikZ code snippet above is meant to be dropped into a .tex document and work without any further “tinkering”. Please let me know if this is not the case!

June 16, 2011

The Chow test in R: A case study of Yellowstone’s Old Faithful Geyser

Recently I took a road trip south to Yellowstone National Park, where the fascinating phenomenon that is Old Faithful is still spewing piping-hot water 120 feet into the air every hour or so. From fellow spectators, I was told that the time between eruptions was approximately 60 minutes, but could be longer depending on the length of the previous eruption. I wondered, would it be possible to get a better estimate of our waiting time, conditional on the previous eruption’s length?

Fortunately, there is plenty of data available on the geyser. See the site: http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL

Let’s look at some data showing length of the eruption and waiting time between eruptions captured by some park rangers in the 1980s.

A description of the data can be found here.

The Chow Test for Structural Breaks

The Chow test is used to see if it makes sense to run two separate regressions on two mutually exclusive subsets of your data (divided by a break point) by comparing the results of the two “unrestricted” regressions versus the “restricted” regression that pools all the data together.

H_0: \beta_{ur_1} = \beta_{ur_2}

H_a: \beta_{ur_1} \neq \beta_{ur_2}

The procedure is as follows:

  1. Run a “restricted” regression on all your data (pooled).
  2. Divide your sample into two groups, determined by your breakpoint (e.g. a point in time, or a variable value).
  3. Run an “unrestricted” regression on each of your subsamples.  You will run two “unrestricted” regressions with a single breakpoint.
  4. Calculate the Chow F-statistic as follows:

\frac{(SSR_r - SSR_u)/k}{SSR_u/(n-2k)} \sim F_{k,n-2k}

where SSR_r = the sum of squared residuals of the restricted model, SSR_u = the combined sum of squared residuals from the two unrestricted models, k = the number of regressors (including the intercept), and n = the total number of observations.

For a much more thorough review of the topic, see page 113 of Dr. Anton Bekkerman’s class notes here.

Eruption Time versus Waiting Time

Following up on the relationship between eruption length and the subsequent waiting time that a fellow geyser visitor mentioned, let’s see whether the data show such a relationship:

# 'faithful' is a data set that comes built into R
of = faithful
plot(of, col="blue", main="Eruptions of Old Faithful")

A scatterplot.

There certainly does. One thing that does pop out is the existence of two groups – there’s a cluster of data points in the upper right, and another in the lower left. What could explain this?

Maybe there is an extra chamber within the geyser that, once it is breached, empties completely. Then, it takes longer for the next eruption to occur because that chamber has to fill back up with boiling hot water. NOTE: This non-scientific explanation is not backed by any scientific studies.

From the scatter plot, it appears that the split between the two groups occurs at about 3.25 minutes of eruption time. For this example, that will be our break point.

## Run three regressions (1 restricted, 2 unrestricted)
r.reg = lm(waiting ~ eruptions, data = of)
ur.reg1 = lm(waiting ~ eruptions, data = of[of$eruptions > 3.25,])
ur.reg2 = lm(waiting ~ eruptions, data = of[of$eruptions < 3.25,])

## review the regression results
summary(r.reg)
summary(ur.reg1)
summary(ur.reg2)

## Calculate sum of squared residuals for each regression
SSR = NULL
SSR$r = r.reg$residuals^2
SSR$ur1 = ur.reg1$residuals^2
SSR$ur2 = ur.reg2$residuals^2

## K is the number of regressors in our model
K = r.reg$rank

## Computing the Chow test statistic (F-test)
numerator = ( sum(SSR$r) - (sum(SSR$ur1) + sum(SSR$ur2)) ) / K
denominator = (sum(SSR$ur1) + sum(SSR$ur2)) / (nrow(of) - 2*K)
chow = numerator / denominator
chow

## Calculate P-value
1-pf(chow, K, (nrow(of) - 2*K))

Now, we can plot the results. In this figure, the red dotted line is the restricted regression line, and the blue lines are the two unrestricted regression lines.

## Plot the results
plot(of,main="Eruptions of Old Faithful")
# restricted model
abline(r.reg, col = "red",lwd = 2, lty = "dashed")
# unrestricted model 1
segments(0, ur.reg2$coefficients[1], 3.25,
ur.reg2$coefficients[1]+3.25*ur.reg2$coefficients[2], col= 'blue')
# unrestricted model 2
segments(3.25, ur.reg1$coefficients[1]+3.25*ur.reg1$coefficients[2],
5.2, ur.reg1$coefficients[1]+5.2*ur.reg1$coefficients[2], col= 'blue')

A Quicker Way

The CRAN package strucchange provides a more streamlined approach to calculating a Chow test statistic, but it requires that the data is ordered so that the breakpoint can be identified by a specific row number. In our example, we’ll have to order our data by eruptions, and then point out that the 98th data point is the last data point where eruptions is less than 3.25. The code is below:

## Sort the data and add an index column
sort.of = of[order(of$eruptions) , ]
sort.of = cbind(1:nrow(sort.of), sort.of)

## Identify the row number of our breakpoint
brk = max(sort.of[,1][sort.of$eruptions < 3.25])
brk

## Using the CRAN package 'strucchange'
require(strucchange)
sctest(waiting ~ eruptions, type = "Chow", point = brk, data = sort.of)

The results above should mirror exactly the results we achieved via manual calculation in the section above.

A Single Regression

If you wanted to run a single regression that allowed the marginal effect of eruption time on waiting time to vary before and after the breakpoint, you add a dummy variable and an interaction term. In R, it is relatively straightforward:

of$dummy = as.numeric(of$eruptions >= 3.25)
summary(lm(waiting ~ eruptions + I(dummy*eruptions) + dummy, data = of))

*Be careful as you interpret the coefficient estimates: to recover the intercept and slope for the post-breakpoint group, you must add some of them together, as sketched below.
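
Here is one way to do that (a minimal sketch, reusing the dummy variable defined above; the coefficient names assume the formula is written exactly as shown):

cf = coef(lm(waiting ~ eruptions + I(dummy*eruptions) + dummy, data = of))

# For eruptions < 3.25, the intercept and slope are the base coefficients:
cf["(Intercept)"]
cf["eruptions"]

# For eruptions >= 3.25, add the dummy shift and the interaction term:
cf["(Intercept)"] + cf["dummy"]
cf["eruptions"] + cf["I(dummy * eruptions)"]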

June 13, 2011

Calculate OLS regression manually using matrix algebra in R

The following code will attempt to replicate the results of the lm() function in R. For this exercise, we will be using a cross-sectional data set built into R called “women”, which has height and weight data for 15 individuals.

The OLS regression equation:

Y = X\beta + \varepsilon

where \varepsilon = a white noise error term. For this example, Y = weight and X = height. \beta = the marginal impact of a one-unit change in height on weight.

## This is the OLS regression we will manually calculate:
reg = lm(weight ~ height, data=women)
summary(reg)

Recall that the following matrix equation is used to calculate the vector of estimated coefficients \hat{\beta} of an OLS regression:

\hat{\beta} = (X'X)^{-1}X'Y

where X = the matrix of regressor data (the first column is all 1’s for the intercept), and Y = the vector of the dependent variable data.

Matrix operators in R

  • as.matrix() coerces an object into the matrix class.
  • t() transposes a matrix.
  • %*% is the operator for matrix multiplication.
  • solve() takes the inverse of a matrix. Note, the matrix must be invertible.

For a more complete introduction to doing matrix operations in R, check out this page.
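
As a quick illustration of these operators on a small made-up matrix:

A = matrix(c(2, 1, 1, 3), nrow = 2)   # a small invertible 2 x 2 matrix
t(A)              # transpose of A
A %*% solve(A)    # A times its inverse returns the identity matrix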

Back to OLS

The following code calculates the 2 x 1 matrix of coefficients, \hat{\beta}:

## Create X and Y matrices for this specific regression
X = as.matrix(cbind(1,women$height))
Y = as.matrix(women$weight)

## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Height"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat

Calculating Standard Errors

To calculate the standard errors, you must first calculate the variance-covariance (VCV) matrix, as follows:

Var(\hat{\beta}|X) = \frac{1}{n-k}\hat{\varepsilon}'\hat{\varepsilon}(X'X)^{-1}

The VCV matrix will be a square k x k matrix. Standard errors for the estimated coefficients \hat{\beta} are found by taking the square root of the diagonal elements of the VCV matrix.

## Calculate vector of residuals
res = as.matrix(women$weight-bh[1]-bh[2]*women$height)

## Define n and k parameters
n = nrow(women)
k = ncol(X)

## Calculate Variance-Covariance Matrix
VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients
StdErr = sqrt(diag(VCV))

## Calculate p-value for a t-test of coefficient significance
P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),
2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

## concatenate into a single data.frame
beta.hat = cbind(beta.hat,StdErr,P.Value)
beta.hat

A Scatterplot with OLS line

Women's height vs. weight using plot() and abline() functions in R.

## Plot results
plot(women$height,women$weight, xlab = "Height", ylab = "Weight",
				main = "OLS: Height and Weight")
abline(a = bh[1], b = bh[2], col = 'red', lwd = 2, lty="dashed")

Now you can check the results above using the canned lm() function:

summary(lm(weight ~ height, data = women))
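
You can also pull the variance-covariance matrix out of the canned fit with vcov() and confirm that the square roots of its diagonal match the StdErr vector computed above (small discrepancies are possible because bh was rounded to two digits):

# standard errors from the canned lm() fit
sqrt(diag(vcov(reg)))
# standard errors from the manual calculation
StdErr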
June 11, 2011

TikZ diagrams for economists: A price ceiling

I have been dabbling with the TikZ package to create some diagrams relevant to a first year microeconomics course. The following diagram of a price ceiling may be useful to others wishing to integrate similar diagrams into their LaTeX documents or Beamer presentations. To use, insert the following code anywhere you like within a .tex document (you must include \usepackage{tikz} in your header):

This diagram was created with the TikZ package in LaTeX.

INSERT INTO .TEX DOCUMENT

\begin{tikzpicture}[domain=0:5,scale=1,thick]
\usetikzlibrary{calc}	%allows coordinate calculations.

%Define linear parameters for supply and demand
\def\dint{4.5}			%Y-intercept for DEMAND.
\def\dslp{-0.5}			%Slope for DEMAND.
\def\sint{1.2}			%Y-intercept for SUPPLY.
\def\sslp{0.8}  		%Slope for SUPPLY.

\def\pfc{2.5}			%Price floor or ceiling

\def\demand{\x,{\dslp*\x+\dint}}
\def\supply{\x,{\sslp*\x+\sint}}

% Define coordinates.
    \coordinate (ints) at ({(\sint-\dint)/(\dslp-\sslp)},{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (ep) at  (0,{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (eq) at  ({(\sint-\dint)/(\dslp-\sslp)},0);
    \coordinate (dint) at (0,{\dint});
    \coordinate (sint) at (0,{\sint});
    \coordinate (pfq) at  ({(\pfc-\dint)/(\dslp)},0);
    \coordinate (pfp) at  ({(\pfc-\dint)/(\dslp)},{\pfc});
    \coordinate (sfq) at  ({(\pfc-\sint)/(\sslp)},0);
    \coordinate (sfp) at  ({(\pfc-\sint)/(\sslp)},{\pfc});

% DEMAND
    \draw[thick,color=blue] plot (\demand) node[right] {$P(q) = -\frac{1}{2}q+\frac{9}{2}$};
   
% SUPPLY
    \draw[thick,color=purple] plot (\supply) node[right] {Supply};

% Draw axes, and dotted equilibrium lines.
    \draw[->] (0,0) -- (6.2,0) node[right] {$Q$};
    \draw[->] (0,0) -- (0,6.2) node[above] {$P$};
       
    %Price floor and ceiling lines
    \draw[dashed,color=black] plot (\x,{\pfc}) node[right] {$P_c$};
    \draw[dashed] (pfp) -- (pfq) node[below] {$Q_d$};
    \draw[dashed] (sfp) -- (sfq) node[below] {$Q_s$};
    
\draw[->,baseline=5] ($(0,{\pfc})+(-1.5,0.7)$) node[label= left:Price Ceiling] {} -- ($(0,{\pfc})+(-.1,0.1)$);

\end{tikzpicture}
June 11, 2011

Clustered Standard Errors in R

So, you want to calculate clustered standard errors in R (a.k.a. cluster-robust, Huber-White, or White’s standard errors) for the estimated coefficients of your OLS regression? This post shows how to do this in both Stata and R.

Overview

Let’s say that you want to relax the Gauss-Markov homoskedasticity assumption, and account for the fact that there may be several different covariance structures within your data sample that vary by a certain characteristic – a “cluster” – but are homoskedastic within each cluster.

For example, say you have a panel data set with a bunch of different test scores from different schools around the country. You may want to cluster your sample by state, by school district, or even by town. Economic theory and intuition will guide you in this decision.

So, similar to heteroskedasticity-robust standard errors, you want to allow more flexibility in your variance-covariance (VCV) matrix (Recall that the diagonal elements of the VCV matrix are the squared standard errors of your estimated coefficients). The way to accomplish this is by using clustered standard errors. The formulation is as follows:

Variance_{cluster} =

\bigl (\frac{M}{M-1}\frac{N-1}{N-K} \bigr )(X'X)^{-1}\sum_{j=1}^M \{X'_j\hat{\varepsilon}_j\hat{\varepsilon}'_jX_j\}(X'X)^{-1}

where M = the number of unique clusters (e.g. the number of school districts), N = the number of observations, and K = the number of regressors (including the intercept). (See pages 312-313 of Angrist and Pischke’s Mostly Harmless Econometrics (Princeton University Press, 2009) for a better explanation of the summation notation.) This returns a variance-covariance (VCV) matrix whose diagonal elements are the estimated cluster-robust coefficient variances, which are the ones of interest. Estimated coefficient standard errors are the square roots of these diagonal elements.

Stata makes this very easy to calculate: simply add ,cluster(state) to the end of your regression command.

STATA:

use wr-nevermar.dta
reg nevermar impdum, cluster(state)

R:

In R, you must first run a function here called cl(), written by Mahmood Arai at Stockholm University – a backup can be found here and here.

cl <- function(dat, fm, cluster){
    require(sandwich, quietly = TRUE)
    require(lmtest, quietly = TRUE)
    M <- length(unique(cluster))       # number of clusters
    N <- length(cluster)               # number of observations
    K <- fm$rank                       # number of regressors (incl. intercept)
    dfc <- (M/(M-1))*((N-1)/(N-K))     # finite-sample degrees-of-freedom correction
    # sum the estimating-function contributions within each cluster
    uj <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum))
    # cluster-robust VCV matrix, then coefficient tests using it
    vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
    coeftest(fm, vcovCL) }

After running the code above, you can run your regression with clustered standard errors as follows:

require(foreign)
nmar = read.dta("http://www.montana.edu/econ/cstoddard/562/wr-nevermar.dta")

# Run a plain linear regression
regt = lm(nevermar ~ impdum, data = nmar)

# apply the 'cl' function by choosing a variable to cluster on.
# here, we are clustering on state.
cl(nmar, regt, nmar$state)
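
As an aside, newer releases of the sandwich package include a vcovCL() function that performs essentially the same calculation. If your installation has it, something along these lines should give very similar output (its default finite-sample correction may differ slightly from the cl() function above):

require(sandwich)
require(lmtest)
coeftest(regt, vcov = vcovCL(regt, cluster = nmar$state))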

Thoughts, comments, ideas?  Let me know; I’m always appreciative of feedback. You can contact me via e-mail at kevingoulding {at} gmail [dot] com.