Surviving Graduate Econometrics with R: Advanced Panel Data Methods — 4 of 8

Some questions may arise when you're contemplating which model to use to answer an empirical question of interest, such as:

  1. Is there unobserved heterogeneity in my data sample? If so, is it time-invariant?
  2. What variation in my data sample do I need to identify my coefficient of interest?
  3. What is the data-generating process for my unobserved heterogeneity?

The questions above can be (loosely) translated into these more specific questions:

  1. Should I include fixed effects (first-differenced or time-demeaned transformations, etc.) when I run my regression? Should I account for the unobserved heterogeneity using time dummy variables or individual dummy variables?
  2. Is the variation I’m interested in between individuals or within individuals? The answer might conflict with your choice of time or individual dummy variables.
  3. Can I use a random effects model?

That said, choosing a model for your panel data can be tricky. In what follows, I will offer some tools to help you answer some of these questions. The first part of this exercise will use the data set panel_hw.dta (can be found here); the second part will use the data set wr-nevermar.dta (can be found here).

A Pooled OLS Regression

To review, let’s load the data and run a model looking at voter participation rate as a function of a few explanatory variables and regional dummy variables (WNCentral, South, Border). panel_hw.dta is a panel data set where individual = “stcode” (state code) and time = “year”. We are, then, pooling the data in the following regression.

STATA:

use panel_hw.dta

reg vaprate gsp midterm regdead WNCentral South Border

And then run an F-test on the joint significance of the included dummy variables:

test WNCentral South Border

R:

require(foreign)
voter = read.dta("/Users/kevingoulding/DATA/panel_hw.dta")

reg1 <- lm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, data=voter)

Then run an F-test on the joint significance of the included regions:

require(car)
linearHypothesis(reg1, c("WNCentral = 0", "South = 0", "Border = 0"))

Similarly, this could be accomplished using the plm package (I recommend this method, because the resulting plm object can be reused directly in the specification tests below).

require(plm)

reg1.pool <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("stcode","year"), model = "pooling")
summary(reg1.pool)

# F-test
linearHypothesis(reg1.pool, c("WNCentral = 0", "South = 0", "Border = 0"), test="F")

A Fixed Effects Regression

Now let’s run the same model, this time accounting for time-invariant, state-level unobserved heterogeneity using the fixed effects (within) estimator. As before, the individual identifier is “stcode” (state code) and the time identifier is “year”.

STATA:

iis stcode
tis year
* (equivalently, in newer versions of Stata: xtset stcode year)
xtreg vaprate midterm gsp regdead WNCentral South Border, fe

In R, recall that we have to tell plm about the panel structure of the data via the index argument (an alternative, converting the data frame explicitly, is sketched after the code block).

R:

require(plm)

# model is specified using the "within" estimator -> includes state fixed effects.
# note: the within transformation demeans each variable by state, so the
# time-invariant regional dummies (WNCentral, South, Border) drop out of this fit.
reg1.fe <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("stcode","year"), model = "within")
summary(reg1.fe)
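
If you would rather not pass index = c("stcode","year") to every plm() call, you can instead convert the data frame once with plm’s pdata.frame() and let subsequent calls pick up the panel structure automatically. This is a minimal sketch; the object names voter.p and reg2.fe are mine, and the results should be identical to reg1.fe above.

# convert once to a panel data frame; plm() then infers the structure
voter.p <- pdata.frame(voter, index = c("stcode", "year"))

reg2.fe <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data = voter.p, model = "within")
summary(reg2.fe)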

Well, should we use the fixed effects model or the pooled OLS model? In R, you can run an F-test between the two; a small p-value rejects the null of no fixed effects, favoring the within estimator:

pFtest(reg1.fe,reg1.pool)

Or, we can test for individual fixed effects present in the pooled model, like this:

plmtest(reg1.pool, effect = "individual")

The Random Effects Estimator

It could be, however, that the unobserved heterogeneity is uncorrelated with all of the regressors in all time periods (so-called “random effects”). This would mean that if we did not account for these effects, we would still estimate our coefficients consistently, but their standard errors would be biased. To correct for this, we can use the random effects model, a form of Generalized Least Squares that accounts for the serial correlation in the error terms induced by the random effects.
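
To see where this serial correlation comes from, write the composite error term as the sum of the individual effect \alpha_i and an idiosyncratic shock \epsilon_{it}. Under the standard random effects assumptions, any two error terms for the same individual share \alpha_i and are therefore correlated (the variance notation matches the hypotheses below):

v_{it} = \alpha_i + \epsilon_{it}

\text{Corr}[v_{it}, v_{is}] = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\epsilon^2}, \qquad t \neq s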

STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, re

R:

reg1.re <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("stcode","year"), model = "random")
summary(reg1.re)

Pooled OLS versus Random Effects

The Breusch-Pagan LM test can be used to determine whether you should use the random effects model or pooled OLS. The null hypothesis is that the variance of the unobserved heterogeneity is zero, i.e.

H_0: \sigma_\alpha^2 = 0
H_a: \sigma_\alpha^2 \neq 0

Failure to reject the null hypothesis implies that you will have more efficient estimates using pooled OLS.
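
For reference, the statistic computed for a balanced panel of N individuals over T periods, using the pooled OLS residuals \hat{e}_{it}, is the textbook Breusch-Pagan LM statistic (I am stating the standard formula here; plmtest with type="bp" implements this two-sided version):

LM = \frac{NT}{2(T-1)} \left[ \frac{\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \hat{e}_{it} \right)^2}{\sum_{i=1}^{N} \sum_{t=1}^{T} \hat{e}_{it}^2} - 1 \right]^2 \sim \chi^2(1) \text{ under } H_0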

STATA:

xttest0

R:

plmtest(reg1.pool, type="bp")

Fixed Effects versus Random Effects

The Hausman test can help you determine whether you should use the random effects (RE) model or the fixed effects (FE) model. Recall that an RE model is appropriate when the unobserved heterogeneity is uncorrelated with the regressors. The logic behind the Hausman test is that if the truth is RE, both the RE estimator and the FE estimator are consistent (so you should opt for the RE estimator, because it is efficient). However, if the truth is FE, the RE estimator is inconsistent, so you must use the FE estimator.

The null hypothesis, then, is that the unobserved heterogeneity \alpha_i and the regressors X_{it} are uncorrelated. Another way to think about it is that under the null hypothesis, the coefficient estimates of the two models are not statistically different. If you fail to reject the null hypothesis, this lends support to the use of the RE estimator. If the null is rejected, RE will produce biased coefficient estimates, so an FE model is preferred.

H_0: \text{Corr}[X_{it},\alpha_i] = 0
H_a: \text{Corr}[X_{it},\alpha_i] \neq 0
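
The test statistic compares the two coefficient vectors, weighting their difference by the difference of the estimated covariance matrices. This is the classical form of the statistic (hausman in STATA and phtest in R implement versions of it), where k is the number of time-varying regressors:

H = \left( \hat{\beta}_{FE} - \hat{\beta}_{RE} \right)' \left[ \widehat{\text{Var}}(\hat{\beta}_{FE}) - \widehat{\text{Var}}(\hat{\beta}_{RE}) \right]^{-1} \left( \hat{\beta}_{FE} - \hat{\beta}_{RE} \right) \sim \chi^2(k) \text{ under } H_0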

STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, fe
estimates store fe

xtreg vaprate midterm gsp regdead WNCentral South Border, re
estimates store re

hausman fe re

R:

phtest(reg1.fe,reg1.re)

Some plots

The following examples use the data set wr-nevermar.dta.

Say we are interested in plotting the mean of the variable “nevermar” over time.

STATA:

egen meannevermar = mean(nevermar), by(year)
twoway (line meannevermar year, sort), ytitle("Mean nevermar")

R:

nmar <- read.dta(file="/Users/kevingoulding/DATA/wr-nevermar.dta")

b1 <- as.matrix(tapply(nmar$nevermar, nmar$year, mean))

# row names of b1 are the years; convert them to numeric for the x-axis
plot(as.numeric(row.names(b1)), b1, type="l", main="NEVERMAR Mean", xlab = "Year", ylab = "Mean(nevermar)", col="red", lwd=2)
