June 17, 2011

TikZ diagrams for economists: An excise tax

I have been dabbling with the TikZ package to create some diagrams relevant to a first year microeconomics course. The following diagram of an excise tax may be useful to others wishing to integrate similar diagrams into their LaTeX documents or Beamer presentations. To use, insert the following code anywhere you like within a .tex document (you must include \usepackage{tikz} in your header):

This diagram was created using TikZ.

INSERT INTO .TEX DOCUMENT

%                         TikZ code: An excise tax
 
\begin{tikzpicture}[domain=0:5,scale=1,thick]
\usetikzlibrary{calc}	                             %allows coordinate calculations.
\usetikzlibrary{decorations.pathreplacing}           %allows drawing curly braces.

% Define linear parameters for supply and demand
\def\dint{4.5}		%Y-intercept for DEMAND.
\def\dslp{-0.5}		%Slope for DEMAND.
\def\sint{1.2}		%Y-intercept for SUPPLY.
\def\sslp{0.8}  	%Slope for SUPPLY.

\def\tax{1.5}		%Excise (per-unit) tax

% Define Supply and Demand Lines as equations of parameters defined above.
  \def\demand{\x,{\dslp*\x+\dint}}
  \def\supply{\x,{\sslp*\x+\sint}}
  \def\demandtwo{\x,{\dslp*\x+\dint+\dsh}}
  \def\supplytwo{\x,{\sslp*\x+\sint+\ssh}}


% Define coordinates.
    \coordinate (ints) at ({(\sint-\dint)/(\dslp-\sslp)},{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (ep) at  (0,{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (eq) at  ({(\sint-\dint)/(\dslp-\sslp)},0);
    \coordinate (dint) at (0,{\dint});
    \coordinate (sint) at (0,{\sint});

    \coordinate (teq) at  ({(\sint+\tax-\dint)/(\dslp-\sslp)},0); %quantity
    \coordinate (tep) at  (0,{(\sint+\tax-\dint)/(\dslp-\sslp)*\sslp+\sint+\tax}); %price
    \coordinate (tint) at  ({(\sint+\tax-\dint)/(\dslp-\sslp)},{(\sint+\tax-\dint)/(\dslp-\sslp)*\sslp+\sint+\tax}); %tax equilibrium
    
    \coordinate (sep) at (0,{\sslp*(\sint+\tax-\dint)/(\dslp-\sslp)+\sint});
    \coordinate (sen) at ({(\sint+\tax-\dint)/(\dslp-\sslp)},{\sslp*(\sint+\tax-\dint)/(\dslp-\sslp)+\sint});   

% DEMAND
    \draw[thick,color=blue] plot (\demand) node[right] {$P(q) = -\frac{1}{2}q+\frac{9}{2}$};
    
% SUPPLY
    \draw[thick,color=purple] plot (\supply) node[right] {Supply};

% Draw axes, and dotted equilibrium lines.
    \draw[->] (0,0) -- (6.2,0) node[right] {$Q$};
    \draw[->] (0,0) -- (0,6.2) node[above] {$P$};
    \draw[decorate,decoration={brace},thick]  ($(sep)+(-0.8,0)$) -- ($(tep)+(-0.8,0)$) node[midway,below=-8pt,xshift=-18pt] {tax};
    
    \draw[dashed] (tint) -- (teq) node[below] {$Q_T$};         			
    \draw[dashed] (tint) -- (tep) node[left] {$P_d$};          				
    \draw[dashed] (sen) -- (sep) node[left] {$P_s$};          				
        
\end{tikzpicture}

The TikZ code snippet above is meant to be dropped into a .tex document and work without any further “tinkering”. Please let me know if this is not the case!

June 16, 2011

The Chow test in R: A case study of Yellowstone’s Old Faithful Geyser

Recently I took a road trip south to Yellowstone National Park, where the fascinating phenomenon that is Old Faithful is still spewing piping-hot water 120 feet into the air every hour or so. From fellow spectators, I was told that the time between eruptions was approximately 60 minutes, but could be longer depending on the length of the previous eruption. I wondered, would it be possible to get a better estimate of our waiting time, conditional on the previous eruption’s length?

Fortunately, there is plenty of data available on the geyser. See the site: http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL

Let’s look at some data showing length of the eruption and waiting time between eruptions captured by some park rangers in the 1980s.

A description of the data can be found here.

The Chow Test for Structural Breaks

The Chow test is used to see if it makes sense to run two separate regressions on two mutually exclusive subsets of your data (divided by a break point) by comparing the results of the two “unrestricted” regressions versus the “restricted” regression that pools all the data together.

H_0: \beta_{ur_1} = \beta_{ur_2}

H_a: \beta_{ur_1} \neq \beta_{ur_2}

The procedure is as follows:

  1. Run a “restricted” regression on all your data (pooled).
  2. Divide your sample into two groups, determined by your breakpoint (e.g. a point in time, or a variable value).
  3. Run an “unrestricted” regression on each of your subsamples.  You will run two “unrestricted” regressions with a single breakpoint.
  4. Calculate the Chow F-statistic as follows:

\frac{(SSR_r - SSR_u)/k}{SSR_u/(n-2k)} \sim F_{k,n-2k}

where SSR_r = the sum of squared residuals of the restricted model, SSR_u = the sum of squared residuals from the unrestricted models, k = the number of regressors (including the intercept), and n= the number of total observations.

For a much more thorough review of the topic, see page 113 of Dr. Anton Bekkerman’s class notes here.

Eruption Time versus Waiting Time

Following up on the claim from a fellow geyser visitor that the length of an eruption is related to the subsequent waiting time, let's see if there appears to be a relationship:

of = faithful
plot(of, col="blue",main="Eruptions of Old Faithful")

A scatterplot.

There certainly does. One thing that does pop out is the existence of two groups – there’s a cluster of data points in the upper right, and another in the lower left. What could explain this?

Maybe there is an extra chamber within the geyser that, once it is breached, empties completely. Then, it takes longer for the next eruption to occur because that chamber has to fill back up with boiling hot water. NOTE: this explanation is pure speculation on my part and is not backed by any scientific study.

So, from the scatter plot, it appears that the break between the two groups occurs at an eruption length of about 3.25 minutes. For this example, this will be our break point.
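To double-check the break point visually, here is a quick sketch (my own addition, not part of the original analysis) that plots a histogram of eruption lengths and marks the candidate break:

## Eruption lengths are clearly bimodal
hist(of$eruptions, breaks = 20, main = "Eruption lengths", xlab = "Minutes")
abline(v = 3.25, col = "red", lwd = 2, lty = "dashed")   # candidate break point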

## Run three regressions (1 restricted, 2 unrestricted)
r.reg = lm(waiting ~ eruptions, data = of)
ur.reg1 = lm(waiting ~ eruptions, data = of[of$eruptions > 3.25,])
ur.reg2 = lm(waiting ~ eruptions, data = of[of$eruptions <= 3.25,])

## review the regression results
summary(r.reg)
summary(ur.reg1)
summary(ur.reg2)

## Calculate sum of squared residuals for each regression
SSR = NULL
SSR$r = r.reg$residuals^2
SSR$ur1 = ur.reg1$residuals^2
SSR$ur2 = ur.reg2$residuals^2

## K is the number of regressors in our model
K = r.reg$rank

## Computing the Chow test statistic (F-test)
numerator = ( sum(SSR$r) - (sum(SSR$ur1) + sum(SSR$ur2)) ) / K
denominator = (sum(SSR$ur1) + sum(SSR$ur2)) / (nrow(of) - 2*K)
chow = numerator / denominator
chow

## Calculate P-value
1-pf(chow, K, (nrow(of) - 2*K))

Now, we can plot the results. In this figure, the red dotted line is the restricted regression line, and the blue lines are the two unrestricted regression lines.

## Plot the results
plot(of,main="Eruptions of Old Faithful")
# restricted model
abline(r.reg, col = "red",lwd = 2, lty = "dashed")
# unrestricted model for eruptions <= 3.25 (ur.reg2)
segments(0, ur.reg2$coefficients[1], 3.25,
ur.reg2$coefficients[1]+3.25*ur.reg2$coefficients[2], col= 'blue')
# unrestricted model for eruptions > 3.25 (ur.reg1)
segments(3.25, ur.reg1$coefficients[1]+3.25*ur.reg1$coefficients[2],
5.2, ur.reg1$coefficients[1]+5.2*ur.reg1$coefficients[2], col= 'blue')

A Quicker Way

The CRAN package strucchange provides a more streamlined approach to calculating a Chow test statistic, but it requires that the data is ordered so that the breakpoint can be identified by a specific row number. In our example, we’ll have to order our data by eruptions, and then point out that the 98th data point is the last data point where eruptions is less than 3.25. The code is below:

## Sort the data
sort.of = of[order(of$eruptions) , ]
sort.of = cbind(index = 1:nrow(sort.of), sort.of)

## Identify the row number of our breakpoint
brk = max(sort.of[,1][sort.of$eruptions < 3.25])
brk

## Using the CRAN package 'strucchange'
require(strucchange)
sctest(waiting ~ eruptions, type = "Chow", point = brk, data = sort.of)

The results above should mirror exactly the results we achieved via manual calculation in the section above.

A Single Regression

If you wanted to run a single regression that allowed the marginal effect of eruption time on waiting time to vary before and after the breakpoint, you add a dummy variable and an interaction term. In R, it is relatively straightforward:

of$dummy = as.numeric(of$eruptions >= 3.25)
summary(lm(waiting ~ eruptions + I(dummy*eruptions) + dummy, data = of))

*Be careful when interpreting the coefficient estimates: the slope for the group with eruptions >= 3.25 is the sum of the eruptions coefficient and the interaction coefficient, and its intercept is the sum of the overall intercept and the dummy coefficient.
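As a quick illustration, here is a minimal sketch (storing the regression above in a hypothetical object int.reg) showing how the two group-specific slopes can be recovered from the coefficient estimates:

## Recover the group-specific slopes from the interacted regression
int.reg = lm(waiting ~ eruptions + I(dummy*eruptions) + dummy, data = of)

b = coef(int.reg)        # (Intercept), eruptions, I(dummy*eruptions), dummy
slope.low  = b[2]        # slope for eruptions < 3.25
slope.high = b[2] + b[3] # slope for eruptions >= 3.25
c(slope.low, slope.high)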

June 13, 2011

Calculate OLS regression manually using matrix algebra in R

The following code will attempt to replicate the results of the lm() function in R. For this exercise, we will be using a cross-sectional data set provided by R called "women", which contains 15 observations of height and weight.

The OLS regression equation:

Y = X\beta + \varepsilon

where \varepsilon = a white noise error term. For this example Y = weight, and X = height. \beta = the marginal impact a one unit change in height has on weight.

## This is the OLS regression we will manually calculate:
reg = lm(weight ~ height, data=women)
summary(reg)

Recall that the following matrix equation is used to calculate the vector of estimated coefficients \hat{\beta} of an OLS regression:

\hat{\beta} = (X'X)^{-1}X'Y

where X = the matrix of regressor data (the first column is all 1’s for the intercept), and Y = the vector of the dependent variable data.

Matrix operators in R

  • as.matrix() coerces an object into the matrix class.
  • t() transposes a matrix.
  • %*% is the operator for matrix multiplication.
  • solve() takes the inverse of a matrix. Note, the matrix must be invertible.

For a more complete introduction to doing matrix operations in R, check out this page.
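As a quick illustration of the operators listed above, here is a small toy example (my own addition, not tied to the regression below):

## Toy example of the matrix operators
A = as.matrix(cbind(c(2, 0), c(1, 3)))   # a 2 x 2 matrix
t(A)                                     # transpose of A
A %*% t(A)                               # matrix multiplication
solve(A)                                 # inverse of A (A must be invertible)
round(solve(A) %*% A, digits = 2)        # recovers the 2 x 2 identity matrix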

Back to OLS

The following code calculates the 2 x 1 matrix of coefficients, \hat{\beta}:

## Create X and Y matrices for this specific regression
X = as.matrix(cbind(1,women$height))
Y = as.matrix(women$weight)

## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Height"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat

Calculating Standard Errors

To calculate the standard errors, you must first calculate the variance-covariance (VCV) matrix, as follows:

Var(\hat{\beta}|X) = \frac{1}{n-k}\hat{\varepsilon}'\hat{\varepsilon}(X'X)^{-1}

The VCV matrix will be a square k x k matrix. Standard errors for the estimated coefficients \hat{\beta} are found by taking the square root of the diagonal elements of the VCV matrix.

## Calculate vector of residuals
res = as.matrix(women$weight-bh[1]-bh[2]*women$height)

## Define n and k parameters
n = nrow(women)
k = ncol(X)

## Calculate Variance-Covariance Matrix
VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients
StdErr = sqrt(diag(VCV))

## Calculate p-value for a t-test of coefficient significance
P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),
2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

## concatenate into a single data.frame
beta.hat = cbind(beta.hat,StdErr,P.Value)
beta.hat

A Scatterplot with OLS line

Women's height vs. weight using plot() and abline() functions in R.

## Plot results
plot(women$height,women$weight, xlab = "Height", ylab = "Weight",
				main = "OLS: Height and Weight")
abline(a = bh[1], b = bh[2], col = 'red', lwd = 2, lty="dashed")

Now you can check the results above using the canned lm() function:

summary(lm(weight ~ height, data = women))
June 11, 2011

TikZ diagrams for economists: A price ceiling

I have been dabbling with the TikZ package to create some diagrams relevant to a first year microeconomics course. The following diagram of a price ceiling may be useful to others wishing to integrate similar diagrams into their LaTeX documents or Beamer presentations. To use, insert the following code anywhere you like within a .tex document (you must include \usepackage{tikz} in your header):

This diagram was created with the TikZ package in LaTeX.

INSERT INTO .TEX DOCUMENT

\begin{tikzpicture}[domain=0:5,scale=1,thick]
\usetikzlibrary{calc}	%allows coordinate calculations.

%Define linear parameters for supply and demand
\def\dint{4.5}			%Y-intercept for DEMAND.
\def\dslp{-0.5}			%Slope for DEMAND.
\def\sint{1.2}			%Y-intercept for SUPPLY.
\def\sslp{0.8}  		%Slope for SUPPLY.

\def\pfc{2.5}			%Price floor or ceiling

\def\demand{\x,{\dslp*\x+\dint}}
\def\supply{\x,{\sslp*\x+\sint}}

% Define coordinates.
    \coordinate (ints) at ({(\sint-\dint)/(\dslp-\sslp)},{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (ep) at  (0,{(\sint-\dint)/(\dslp-\sslp)*\sslp+\sint});
    \coordinate (eq) at  ({(\sint-\dint)/(\dslp-\sslp)},0);
    \coordinate (dint) at (0,{\dint});
    \coordinate (sint) at (0,{\sint});
    \coordinate (pfq) at  ({(\pfc-\dint)/(\dslp)},0);
    \coordinate (pfp) at  ({(\pfc-\dint)/(\dslp)},{\pfc});
    \coordinate (sfq) at  ({(\pfc-\sint)/(\sslp)},0);
    \coordinate (sfp) at  ({(\pfc-\sint)/(\sslp)},{\pfc});

% DEMAND
    \draw[thick,color=blue] plot (\demand) node[right] {$P(q) = -\frac{1}{2}q+\frac{9}{2}$};
   
% SUPPLY
    \draw[thick,color=purple] plot (\supply) node[right] {Supply};

% Draw axes, and dotted equilibrium lines.
    \draw[->] (0,0) -- (6.2,0) node[right] {$Q$};
    \draw[->] (0,0) -- (0,6.2) node[above] {$P$};
       
    %Price floor and ceiling lines
    \draw[dashed,color=black] plot (\x,{\pfc}) node[right] {$P_c$};
    \draw[dashed] (pfp) -- (pfq) node[below] {$Q_d$};
    \draw[dashed] (sfp) -- (sfq) node[below] {$Q_s$};
    
\draw[->,baseline=5] ($(0,{\pfc})+(-1.5,0.7)$) node[label= left:Price Ceiling] {} -- ($(0,{\pfc})+(-.1,0.1)$);

\end{tikzpicture}
June 11, 2011

Clustered Standard Errors in R

So, you want to calculate clustered standard errors in R (a.k.a. cluster-robust, Huber–White, White's standard errors) for the estimated coefficients of your OLS regression? This post shows how to do this in both Stata and R:

Overview

Let’s say that you want to relax the Gauss-Markov homoskedasticity assumption, and account for the fact that there may be several different covariance structures within your data sample that vary by a certain characteristic – a “cluster” – but are homoskedastic within each cluster.

For example, say you have a panel data set with a bunch of different test scores from different schools around the country. You may want to cluster your sample by state, by school district, or even by town. Economic theory and intuition will guide you in this decision.

So, similar to heteroskedasticity-robust standard errors, you want to allow more flexibility in your variance-covariance (VCV) matrix (Recall that the diagonal elements of the VCV matrix are the squared standard errors of your estimated coefficients). The way to accomplish this is by using clustered standard errors. The formulation is as follows:

Variance_{cluster} =

\bigl (\frac{M}{M-1}\frac{N-1}{N-K} \bigr )(X'X)^{-1}\sum_{j=1}^M \{X_j'\hat{\varepsilon}_j\hat{\varepsilon}_j'X_j\}(X'X)^{-1}

where M = the number of unique clusters (e.g. number of school districts), N = the number of observations, K = the number of regressors (including the intercept), and X_j and \hat{\varepsilon}_j are the regressor matrix and residual vector for cluster j. (See pages 312-313 of Angrist and Pischke's Mostly Harmless Econometrics (Princeton University Press, 2009) for a better explanation of the summation notation.) This returns a variance-covariance (VCV) matrix where the diagonal elements are the estimated cluster-robust coefficient variances — the ones of interest. Estimated coefficient standard errors are the square roots of these diagonal elements.

Stata makes this very easy to calculate: simply add , cluster(state) to the end of your regression command.

STATA:

use wr-nevermar.dta
reg nevermar impdum, cluster(state)

R:

In R, you first must run a function here called cl() written by Mahmood Ara at Stockholm University – the backup can be found here and here.

   cl   <- function(dat,fm, cluster){
              require(sandwich, quietly = TRUE)
              require(lmtest, quietly = TRUE)
              M <- length(unique(cluster))
              N <- length(cluster)
              K <- fm$rank
              dfc <- (M/(M-1))*((N-1)/(N-K))
              uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
              vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
              coeftest(fm, vcovCL) }

After running the code above, you can run your regression with clustered standard errors as follows:

require(foreign)
nmar = read.dta("http://www.montana.edu/econ/cstoddard/562/wr-nevermar.dta")

# Run a plain linear regression
regt = lm(nevermar ~ impdum, data = nmar)

# apply the 'cl' function by choosing a variable to cluster on.
# here, we are clustering on state.
cl(nmar, regt, nmar$state)

Thoughts, comments, ideas?  Let me know; I’m always appreciative of feedback. You can contact me via e-mail at kevingoulding {at} gmail [dot] com.

May 28, 2011

Heteroskedasticity-Robust and Clustered Standard Errors in R

Recall that if heteroskedasticity is present in our data sample, the OLS estimator will still be unbiased and consistent, but it will not be efficient. Specifically, estimated standard errors will be biased, a problem we cannot solve with a larger sample size.

To correct for this bias, it may make sense to adjust your estimated standard errors.  Two popular ways to tackle this are to use:

  1. “Robust” standard errors (a.k.a. White’s Standard Errors, Huber–White standard errors, Eicker–White or Eicker–Huber–White)
  2. Clustered Standard Errors

In practice, heteroskedasticity-robust and clustered standard errors are usually larger than standard errors from regular OLS — however, this is not always the case. For further detail on when robust standard errors are smaller than OLS standard errors, see Jörn-Steffen Pischke's response on Mostly Harmless Econometrics' Q&A blog.

“Robust” standard errors

The following example uses the MURDER.dta data set (read in below).

Because one of this blog's main goals is to translate STATA results into R, first we will look at the robust option in STATA. For backup on the calculation of heteroskedasticity-robust standard errors, see the following link: http://www.stata.com/support/faqs/stat/cluster.html. The formulation is as follows:

Variance_{robust} = \bigl (\frac{N}{N-K} \bigr )(X'X)^{-1}\sum_{i=1}^N \{X_iX_i'\hat{\varepsilon}_i^2\}(X'X)^{-1}

where N = number of observations, and K = the number of regressors (including the intercept). This returns a Variance-covariance (VCV) matrix where the diagonal elements are the estimated heteroskedasticity-robust coefficient variances — the ones of interest. Estimated coefficient standard errors are the square root of these diagonal elements.

STATA:

reg cmrdrte cexec cunem if year==93, robust

R:

The following bit of code was written by Dr. Ott Toomet (mentioned in the Dataninja blog). I added a degrees of freedom adjustment so that the results mirror STATA’s robust command results.

## Heteroskedasticity-robust standard error calculation.
summaryw <- function(model) {
  s <- summary(model)
  X <- model.matrix(model)
  u2 <- residuals(model)^2
  XDX <- 0

  ## Here one needs to calculate X'DX. But due to the fact that
  ## D is huge (NxN), it is better to do it with a cycle.
  for(i in 1:nrow(X)) {
    XDX <- XDX + u2[i]*X[i,]%*%t(X[i,])
  }

  # inverse(X'X)
  XX1 <- solve(t(X)%*%X)

  # Variance calculation (Bread x meat x Bread)
  varcovar <- XX1 %*% XDX %*% XX1

  # degrees of freedom adjustment
  dfc <- sqrt(nrow(X))/sqrt(nrow(X)-ncol(X))

  # Standard errors of the coefficient estimates are the
  # square roots of the diagonal elements
  stdh <- dfc*sqrt(diag(varcovar))

  t <- model$coefficients/stdh
  p <- 2*pnorm(-abs(t))
  results <- cbind(model$coefficients, stdh, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}

To use the function written above, simply replace summary() with summaryw() to look at your regression results — like this:

require(foreign)
mrdr = read.dta(file="/Users/kevingoulding/data/MURDER.dta")

# run regression
reg4 = lm(cmrdrte ~ cexec + cunem, data = subset(mrdr,year == 93))

# see heteroskedasticity-robust standard errors
summaryw(reg4)

These results should match the STATA output exactly.

Heteroskedasticity-robust LM Test

It may also be important to test restrictions on your model (e.g. via an F-test) using heteroskedasticity-robust standard errors.
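The original post stops here; as one way to do this, here is a minimal sketch (my own suggestion, not part of the original post) using the sandwich, lmtest, and car packages, assuming reg4 from the code above:

## Heteroskedasticity-robust inference on reg4 (a sketch)
require(sandwich)
require(lmtest)
require(car)

# robust t-tests on the individual coefficients
coeftest(reg4, vcov = vcovHC(reg4, type = "HC1"))

# robust F-test of the joint restriction cexec = cunem = 0
linearHypothesis(reg4, c("cexec = 0", "cunem = 0"), white.adjust = "hc1")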

Clustered Standard Errors

Let’s say that you want to relax your homoskedasticity assumption, and account for the fact that there might be a bunch of covariance structures that vary by a certain characteristic – a “cluster” – but are homoskedastic within each cluster. Similar to heteroskedasticity-robust standard errors, you want to allow more flexibility in your variance-covariance (VCV) matrix. The result is clustered standard errors, a.k.a. cluster-robust.

STATA:

use wr-nevermar.dta
reg nevermar impdum, cluster(state)

R:

In R, you first must run a function here called cl() written by Mahmood Ara at Stockholm University – the backup can be found here.

   cl   <- function(dat,fm, cluster){
              require(sandwich, quietly = TRUE)
              require(lmtest, quietly = TRUE)
              M <- length(unique(cluster))
              N <- length(cluster)
              K <- fm$rank
              dfc <- (M/(M-1))*((N-1)/(N-K))
              uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
              vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
              coeftest(fm, vcovCL) }

After running the code above, you can run your regression with clustered standard errors as follows:

require(foreign)
nmar = read.dta("http://www.montana.edu/econ/cstoddard/562/wr-nevermar.dta")

# Run a plain linear regression
regt = lm(nevermar ~ impdum, data = nmar)

# apply the 'cl' function by choosing a variable to cluster on.
# here, we are clustering on state.
cl(nmar, regt, nmar$state)
May 27, 2011

Surviving Graduate Econometrics with R: Advanced Panel Data Methods — 4 of 8

Some questions may arise when contemplating what model to use to empirically answer a question of interest, such as:

  1. Is there unobserved-heterogeneity in my data sample? If so, is it time-invariant?
  2. What variation in my data sample do I need to identify my coefficient of interest?
  3. What is the data-generating process for my unobserved heterogeneity?

The questions above can be (loosely) translated into these more specific questions:

  1. Should I include fixed effects (first-differenced, time-demeaned transformations, etc.) when I run my regression? Should I account for the unobserved heterogeneity using time dummy variables or individual dummy variables?
  2. Is the variation I’m interested in between individuals or within individuals? This might conflict with your choice of time or individual dummy variables.
  3. Can I use a random effects model?

That said, choosing a model for your panel data can be tricky. In what follows, I will offer some tools to help you answer some of these questions.  The first part of this exercise will use the data panel_hw.dta (can be found here); the second part will use the data wr-nevermar.dta (can be found here).

A Pooled OLS Regression

To review, let’s load the data and run a model looking at voter participation rate as a function of a few explanatory variables and regional dummy variables (WNCentral, South, Border). panel_hw.dta is a panel data set where individual = “stcode” (state code) and time = “year”. We are, then, pooling the data in the following regression.

STATA:

use panel_hw.dta

reg vaprate gsp midterm regdead WNCentral South Border

And then run an F-test on the joint significance of the included dummy variables:

test WNCentral South Border

R:

require(foreign)
voter = read.dta("/Users/kevingoulding/DATA/panel_hw.dta")

reg1 <- lm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, data=voter)

Then run an F-test on the joint significance of the included regions:

require(car)
linearHypothesis(reg1, c("WNCentral", "South", "Border = 0"))

Similarly, this could be accomplished using the plm package (I recommend using this method).

require(plm)
reg1.pool <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, 
data=voter, index = c("state","year"), model = "pooling")
summary(reg1.pool)

# F-test
linearHypothesis(reg1.pool, c("WNCentral", "South", "Border = 0"), test="F")

A Fixed Effects Regression

Now let's estimate the same model, but include state fixed effects (the "within" estimator) rather than pooling the data.

STATA:

iis stcode
tis year
xtreg vaprate midterm gsp regdead WNCentral South Border, fe

In R, recall that we’ll have to transform the data into a panel data form.

R:

require(plm)

# model is specified using "within" estimator -> includes state fixed effects.
reg1.fe <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("state","year"), model = "within")	
summary(reg1.fe)

Well, should we use the fixed effects model or the pooled OLS model? In R, you can run a test between the two:

pFtest(reg1.fe,reg1.pool)

Or, we can test for individual fixed effects present in the pooled model, like this:

plmtest(reg1.pool, effect = "individual")

The Random Effects Estimator

It could be, however, that the unobserved heterogeneity is uncorrelated with all of the regressors in all time periods — so-called "random effects". This would mean that if we did not account for these effects, we would still consistently estimate our coefficients, but their standard errors would be biased. To correct for this, we can use the random effects model, a form of Generalized Least Squares that accounts for the serial correlation embedded in the error terms by the random effects.

STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, re

R:

reg1.re <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, 
data=voter, index = c("state","year"), model = "random")	
summary(reg1.re)

Pooled OLS versus Random Effects

The Breusch-Pagan LM test can be used to determine whether you should use a Random Effects model or pooled OLS. The null hypothesis is that the variance of the unobserved heterogeneity is zero, i.e.

H_0: \sigma_\alpha^2 = 0
H_a: \sigma_\alpha^2 \neq 0

Failure to reject the null hypothesis implies that you will have more efficient estimates using OLS.

STATA:

xttest0

R:

plmtest(reg1.pool, type="bp")

Fixed Effects versus Random Effects

The Hausman test can help to determine if you should use a Random Effects (RE) model or a Fixed Effects (FE) model. Recall that a RE model is appropriate when the unobserved heterogeneity is uncorrelated with the regressors. The logic behind the Hausman test is that under the scenario that truth is RE, both the RE estimator and the FE estimator will be consistent (so you should opt to use the RE estimator because it is efficient). However, under the scenario that truth is FE, the RE estimator will be inconsistent — so you must use the FE estimator. The null hypothesis, then, is that the unobserved heterogeneity \alpha_i and the regressors X_{it} are uncorrelated. Another way to think about it is that under the null hypothesis, the coefficient estimates of the two models are not statistically different. If you fail to reject the null hypothesis, this lends support for the use of the RE estimator. If the null is rejected, RE will produce biased coefficient estimates, so a FE model is preferred.

H_0: \text{Corr}[X_{it},\alpha_i] = 0
H_a: \text{Corr}[X_{it},\alpha_i] \neq 0

STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, fe
estimates store fe

xtreg vaprate midterm gsp regdead WNCentral South Border, re
estimates store re

hausman fe re

R:

phtest(reg1.fe,reg1.re)

Some plots

The following examples use the data wr-nevermar.dta

Say we are interested in plotting the mean of the variable “nevermar” over time.

STATA:

egen meannevermar = mean(nevermar), by(year)
twoway (line meannevermar year, sort), ytitle(Mean--nevermar)

R:

nmar <- read.dta(file="/Users/kevingoulding/DATA/wr-nevermar.dta")

b1 <- as.matrix(tapply(nmar$nevermar, nmar$year , mean))

plot(as.numeric(row.names(b1)), b1, type="l", main="NEVERMAR Mean", xlab = "Year", ylab = "Mean(nevermar)", col="red", lwd=2)

May 25, 2011

Surviving Graduate Econometrics with R: Fixed Effects Estimation — 3 of 8

The following exercise uses the CRIME3.dta and MURDER.dta panel data sets from Jeffrey Wooldridge’s econometrics textbook,

Wooldridge, Jeffrey. 2002. Introductory Econometrics: A Modern Approach. South-Western College Pub. 2nd Edition.

If you own the textbook, you can access the data files here.

Load and summarize the data

STATA:

use "C:\Users\CRIME3.dta"
des
sum

R:

require(foreign)
crime = read.dta(file="/Users/CRIME3.dta")
sumstats(crime)
as.matrix(sapply(crime,class))

If you haven’t yet loaded in the sumstats function, I suggest you do – you can find the code here.

A hypothesis test

See Part 2 of this series for a primer on hypothesis testing. Here, we will do one more example of testing a hypothesis of a linear restriction. Namely, from the regression equation:
\text{log}(crime_{it}) = \beta_0 + \delta_0 d78_t + \beta_1 clrprc_{i,t-1} + \beta_2 clrprc_{i,t-2} + \alpha_i + \varepsilon_{it}
where \alpha_i are “district” fixed effects, and \varepsilon_{it} is a white noise error term.
We would like to test the following hypothesis:
H_0: \beta_1 = \beta_2
H_a: \beta_1 \neq \beta_2
This can be re-written in matrix form:
H_0: R \beta = q
H_a: R \beta \neq q
Where:
R =  \begin{bmatrix}  0 & 0 & 1 & -1\\  \end{bmatrix}

\beta =  \begin{bmatrix}  \beta_0 \\  \delta_0 \\  \beta_1 \\  \beta_2 \\  \end{bmatrix}

q =  \begin{bmatrix}  0 \\  \end{bmatrix}

STATA:

reg clcrime cclrprc1 cclrprc2
test cclrprc1 = cclrprc2

R:

# Run the regression
reg1a = lm(lcrime ~ d78 + clrprc1 + clrprc2, data=crime)

# Create R and q matrices
R = rbind(c(0,0,1,-1))
q = rbind(0)

# Test the linear hypothesis beta_1  = beta_2
require(car)
linearHypothesis(reg1a,R,q)

# Equivalently, we can skip creating the R and q matrices
# and use this streamlined approach:
linearHypothesis(reg1a,"clrprc1 = clrprc2")

# Or, we can use the glh.test function in the gmodels package
require(gmodels)
glh.test(reg1a, R, q)

First-Differenced model

As a review, let’s go over two very similar models that take out individual-specific time-invariant heterogeneity in panel data analysis. Our example regression is:

Y_{it} = X_{it} \beta + \varepsilon_{it}

where individual and time period are denoted by the i and t subscripts, respectively.

The within estimator — a.k.a. the "fixed effects" model, wherein individual dummy variables (intercept shifters) are included in the regression. All of the variation driving the coefficients on the other regressors comes from deviations from individual-specific means (= the individual dummy estimates). The new model is:

Y_{it} = X_{it} \beta + \alpha_i + \varepsilon_{it}

where \alpha_i represents the individual dummy variables.

The first-differenced model — The first-differenced model creates new variables reflecting the one-period change in values. The regression then becomes \Delta Y_{i} = \Delta X_{i} \beta + \Delta \varepsilon_{i} where \Delta Y_{i} = Y_{it} - Y_{i,t-1}.

Note: These two models are very similar because they “strip out” / “eliminate” / “control for”  the variation “between” individuals in your panel data.  To do this, they use slightly different methods.  The variation left over, and therefore identifying the coefficients on the other regressors, is the “within” variation — or the variation “within” individuals.
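To make the equivalence concrete, here is a minimal sketch (my own addition, assuming "district" and "year" identify the panel in CRIME3.dta, as used later in this post) comparing the within estimator from the plm package with the equivalent dummy-variable regression:

## Within ("fixed effects") estimator vs. a dummy-variable regression
require(plm)

# the within estimator demeans each variable by its district-specific mean
fe.within = plm(lcrime ~ d78 + clrprc1 + clrprc2, data = crime,
                index = c("district", "year"), model = "within")

# the dummy-variable ("least squares dummy variable") version
fe.dummies = lm(lcrime ~ d78 + clrprc1 + clrprc2 + factor(district), data = crime)

summary(fe.within)
summary(fe.dummies)   # the slope coefficients match the within estimator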

STATA:

reg clcrime cavgclr
outreg2 using H3_1312, word replace

There are two ways we can estimate the first-differenced model, given the variables included in CRIME3.dta. Since the data set includes differenced variables with a "c" prefix (e.g. "clcrime" = change in "lcrime"; "cavgclr" = change in "avgclr"), we can do a simple OLS regression on the differenced variables:

R:

reg2 =  lm(clcrime ~ cavgclr, data=crime)
summary(reg2)

Or, we can take a more formal approach using the plm package for panel data. This approach will prepare us for more advanced panel data methods.

require(plm)		# load panel data package

# convert the data set into a pdata.frame by identifying the
# individual ("district") and time ("year") variables in our data
crime.pd = pdata.frame(crime, index = c("district", "year"),
			 drop.index = TRUE, row.names = TRUE)

# Now, we can run a regression choosing the
# first-differenced model ("fd")
reg.fd = plm(lcrime ~ avgclr, data = crime.pd, model = "fd")
summary(reg.fd)

Back to Pooled OLS

Let’s switch over to the MURDER.dta data set to do some further regressions and analysis. First, we’ll compute a pooled OLS model for the years 1990 and 1993:

mrdrte_{it} = \delta_0 + \delta_1 d93_t + \beta_1 exec_{it} + \beta_2 unem_{it} + \alpha_i + \varepsilon_{it}

By using pooled OLS, we are disregarding the term \alpha_i in the regression equation above.

STATA:

reg mrdrte d93 exec unem if year==90|year==93

R:

mrdr = read.dta(file="/Users/MURDER.dta")
sumstats(mrdr)

mrdrYR = subset(mrdr, year == 90 | year == 93)

reg3 = lm(mrdrte ~ d93 + exec + unem, data=mrdrYR)
summary(reg3)

# convert the data set into a pdata.frame (panel format) by identifying the
# individual ("state") and time ("year") variables in our data
require(plm)
mrdr.pd = pdata.frame(mrdrYR, index = c("state", "year"),
			 drop.index = TRUE, row.names = TRUE)

# Run a pooled OLS regression - results are the same as reg3
reg3.po = plm(mrdrte ~ d93 + exec + unem, data = mrdr.pd, model = "pooling")
summary(reg3.po)

Another First-Differenced Model

STATA:

reg cmrdrte cexec cunem if year==93

R:

# We can run the regression using the variables
# provided in the data set:
reg4 = lm(cmrdrte ~ cexec + cunem, data = subset(mrdrYR,year == 93))
summary(reg4)

# Or, we can run a regression using the plm package by choosing the
# first-differenced model ("fd")
reg4.fd = plm(mrdrte ~ d93 + exec + unem, data = mrdr.pd, model = "fd")
summary(reg4.fd)

# Note: we don't need the d93 dummy anymore, so it's equivalent
# to running the regression without it:
summary(plm(mrdrte ~ exec + unem, data = mrdr.pd, model = "fd"))

The Fixed Effects model

Another way to account for individual-specific unobserved heterogeneity is to include a dummy variable for each individual in your sample – this is the fixed effects model. Following from the regression in the previous section, our individuals in MURDER.dta are states (e.g. Alabama, Louisiana, California, Montana…). So, we will need to add one dummy variable for each state in our sample but exclude one to avoid perfect collinearity — the "dummy variable trap".

In STATA, if your data is set up correctly (e.g. you have declared the individual and time variables, as with iis and tis above), it is accomplished by adding ,fe to the end of your xtreg command.

STATA:

xtreg mrdrte exec unem, fe

In R, we can add dummy variables for each state in the following way:

R:

reg5 = lm(mrdrte ~ exec + unem + factor(state), data=mrdr)
summary(reg5)

See Part 4 of this series for more attention to fixed effects models, inference testing, and comparison to random effects models.

The Breusch-Pagan test for Heteroskedasticity

The Breusch-Pagan (BP) test can be done via a Lagrange multiplier (LM) test or an F-test. We will do the LM test version; this means that only one restricted model is run.
Var(\varepsilon_{it}|X_{it}) = \sigma^2 \Omega
H_0: \Omega = I \Rightarrow Var(\varepsilon_{it}|X_{it}) = \sigma^2 \Rightarrow homoskedasticity
H_a: \Omega \neq I, i.e. heteroskedasticity

First, we will run the test manually in three stages:

  1. Square the residuals from the original regression \rightarrow \hat{\varepsilon}^2.
  2. Run an auxiliary regression of \hat{\varepsilon}^2 on the original regressors.
  3. Calculate the BP LM test statistic = nR^2, where R^2 is the R-squared from the auxiliary regression, and n is the number of observations used in the regression.

STATA:

reg cmrdrte cexec cunem if year==93
predict resid , resid
gen resid2 = resid^2
reg resid2 cexec cunem if year==93

R:

# Breusch-Pagan test for heteroskedasticity

# Square the residuals
res4 = residuals(reg4)
sqres4 = res4^2

m4 = subset(mrdr,year == 93)
m4$sqres = sqres4

# Run auxiliary regression
BP = lm(sqres ~ cexec + cunem, data = m4)
BPs = summary(BP)

# Calculation of LM test statistic:
BPts = BPs$r.squared*length(BP$residuals)

# Calculate p-value from Chi-square distribution
# with 2 degrees of freedom
BPpv = 1-pchisq(BPts,df=BP$rank-1)

# The following code uses a 5% significance level
if (BPpv < 0.05) {
    cat("We reject the null hypothesis of homoskedasticity.\n",
    "BP = ",BPts,"\n","p-value = ",BPpv)
} else {
    cat("We fail to reject the null hypothesis; implying homoskedasticity.\n",
    "BP = ",BPts,"\n","p-value = ",BPpv)
}

Now, let’s compare the results obtained above to the function bptest() provided in the R lmtest package:

require(lmtest)
bptest(reg4)

I hope your results are exactly the same as when you did the Breusch-Pagan test manually — they should be!

White’s Test for Heteroskedasticity

White's test for heteroskedasticity is similar to the Breusch-Pagan (BP) test; however, the auxiliary regression includes all multiplicative combinations of the regressors (squares and cross-products). Because of this it can become quite bulky, and finding heteroskedasticity may simply indicate model misspecification. The null hypothesis is homoskedasticity (same as BP).
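For the full version of the test, one option (my own sketch, not from the original post) is to pass the regressors, their squares, and their cross-product as the auxiliary variance formula to bptest() from the lmtest package, assuming reg4 and mrdr from the code above:

## Full White test (a sketch) via bptest() with an explicit auxiliary formula
require(lmtest)
m93 = subset(mrdr, year == 93)
bptest(reg4, ~ cexec*cunem + I(cexec^2) + I(cunem^2), data = m93)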

So, here we will run a special case of the White test using the fitted values of the original regression:

\hat{\varepsilon}_{it}^2 = \delta_0 + \delta_1 \hat{Y}_{it} + \delta_2 \hat{Y}_{it}^2 + v_{it}

STATA:

reg cmrdrte cexec cunem if year==93
gen resid2 = resid^2
predict yhat
gen yhat2 = yhat^2
reg resid2 yhat yhat2 if year==93

R:

# White's test for heteroskedasticity: A Special Case

# Collect fitted values and squared f.v. from your regression
yhat = reg4$fitted.values
yhat2 = yhat^2
m4 = NULL 			# clears data previously in m4

# create a new data frame with the three variables of interest
m4 = data.frame(cbind(sqres4,yhat,yhat2))

# Run auxiliary regression
WH = lm(sqres4 ~ yhat + yhat2, data = m4)
WHs = summary(WH)

# Calculation of LM test statistic:
WHts = WHs$r.squared*length(WH$residuals)

# Calculate p-value from Chi-square distribution
# with 2 degrees of freedom
WHpv = 1-pchisq(WHts,df=WH$rank-1)

# The following code uses a 5% significance level
if (WHpv < 0.05) {
    cat("We reject the null hypothesis of homoskedasticity.\n",
    "BP = ",WHts,"\n","p-value = ",WHpv)
} else {
    cat("We fail to reject the null hypothesis; implying homoskedasticity.\n",
    "BP = ",WHts,"\n","p-value = ",WHpv)
}

Heteroskedasticity-Robust Standard Errors

If heteroskedasticity is present in our data sample, using OLS will be inefficient. See this post for details behind calculating heteroskedasticity-robust and cluster-robust standard errors.

To continue on to Part 4 of our series, Advanced Panel Data Methods, click here.

May 24, 2011

Surviving Graduate Econometrics with R: Difference-in-Differences Estimation — 2 of 8

The following replication exercise closely follows the homework assignment #2 in ECNS 562. The data for this exercise can be found here.

The data is about the expansion of the Earned Income Tax Credit. This is a legislation aimed at providing a tax break for low income individuals.  For some background on the subject, see

Eissa, Nada, and Jeffrey B. Liebman. 1996. Labor Supply Responses to the Earned Income Tax Credit. Quarterly Journal of Economics. 111(2): 605-637.


May 24, 2011

Surviving Graduate Econometrics with R: The Basics — 1 of 8

Introduction

The following is an introduction to statistical computing with R and STATA. In the future, I would like to include SAS. It is meant for the graduate or undergraduate student in econometrics who wants to use one statistical software package while his or her teacher, adviser, or friends are using a different one. I encountered this issue when I wanted to learn and use R, while both of my econometrics courses were taught using SAS and STATA. I will be following the course homework assignments for ECNS 562: Econometrics II, taught by Dr. Christiana Stoddard in the Spring of 2011, so you may see references to STATA in the actual questions. Read further for the R code.

ACKNOWLEDGMENTS

Special thanks to Dr. Christiana Stoddard for letting me use her homework assignments and class notes to structure this blog series. In a subject that is prone to dry class experiences, her econometrics course was incredibly engaging, useful, and challenging — a true pleasure. Also, thank you to Dr. Joe Atwood for his help in getting me started using R and providing insightful guidance on my code and supporting me in myriad ways. Roger Avalos, a fellow graduate student, provided his STATA code for this series — as well as encouragement in writing this blog. Thank you, Roger.

Let’s Get Started

For this assignment, we will be using the data available at www.montana.edu/stock/ecns403/rawcpsdata.dta – raw Consumer Pricing Index data.

