## Surviving Graduate Econometrics with R: Advanced Panel Data Methods — 4 of 8

Several questions arise when deciding which model to use to answer an empirical question of interest, such as:

1. Is there unobserved heterogeneity in my data sample? If so, is it time-invariant?
2. What variation in my data sample do I need to identify my coefficient of interest?
3. What is the data-generating process for my unobserved heterogeneity?

The questions above can be (loosely) translated into these more specific questions:

1. Should I include fixed effects (first-differenced or time-demeaned transformations, etc.) when I run my regression? Should I account for the unobserved heterogeneity using time dummy variables or individual dummy variables?
2. Is the variation I’m interested in between individuals or within individuals? This might conflict with my choice of time or individual dummy variables.
3. Can I use a random effects model?

That said, choosing a model for your panel data can be tricky. In what follows, I will offer some tools to help you answer some of these questions.  The first part of this exercise will use the data  panel_hw.dta  (can be found here); the second part will use the data  wr-nevermar.dta  (can be found here).

## A Pooled OLS Regression

To review, let’s load the data and run a model looking at voter participation rate as a function of a few explanatory variables and regional dummy variables (WNCentral, South, Border).  panel_hw.dta  is a panel data set where individual = “stcode” (state code) and time = “year”. We are, then, pooling the data in the following regression.

#### STATA:

use panel_hw.dta

reg vaprate gsp midterm regdead WNCentral South Border


And then run an F-test on the joint significance of the included dummy variables:

test WNCentral South Border


#### R:

require(foreign)

voter <- read.dta("panel_hw.dta")

reg1 <- lm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, data=voter)


Then run an F-test on the joint significance of the included regions:

require(car)
linearHypothesis(reg1, c("WNCentral = 0", "South = 0", "Border = 0"))


Similarly, this could be accomplished using the  plm  package (I recommend using this method).

require(plm)

reg1.pool <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("state","year"), model = "pooling")
summary(reg1.pool)

# F-test
linearHypothesis(reg1.pool, c("WNCentral = 0", "South = 0", "Border = 0"), test="F")


## A Fixed Effects Regression

Next, let’s run the same model with state fixed effects. The fixed effects (“within”) estimator time-demeans the data within each state, sweeping out any time-invariant unobserved heterogeneity at the state level. Note that the regional dummies (WNCentral, South, Border) are themselves time-invariant within a state, so they cannot be separately identified once state fixed effects are included.

#### STATA:

iis stcode
tis year
xtreg vaprate midterm gsp regdead WNCentral South Border, fe


In R, recall that we need to tell  plm  how the data are indexed as a panel (individual and time).

#### R:

require(plm)

# model is specified using the "within" estimator -> includes state fixed effects.
reg1.fe <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("state","year"), model = "within")
summary(reg1.fe)


Well, should we use the fixed effects model or the pooled OLS model? In R, you can run a test between the two:

pFtest(reg1.fe,reg1.pool)


Or, we can test for individual fixed effects present in the pooled model, like this:

plmtest(reg1.pool, effect = "individual")


## The Random Effects Estimator

It could be, however, that the unobserved heterogeneity is uncorrelated with all of the regressors in all time periods: so-called “random effects”. This would mean that even if we did not account for these effects, we would still consistently estimate our coefficients, but their standard errors would be biased. To correct for this, we can use the random effects model, a form of Generalized Least Squares that accounts for the serial correlation that the random effects induce in the composite error term.
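To see where that serial correlation comes from, note that the composite error is $u_{it} = \alpha_i + \epsilon_{it}$, so any two errors for the same individual share $\alpha_i$, giving $\text{Corr}(u_{it}, u_{is}) = \sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\epsilon^2)$ for $t \neq s$. A quick base-R simulation illustrates this (the variances below are purely illustrative, not taken from the voter data):

```r
# Composite errors u_it = alpha_i + eps_it share the individual effect alpha_i.
set.seed(42)
N <- 2000                        # number of individuals (illustrative)
sigma_a <- 1                     # sd of individual effect (illustrative)
sigma_e <- 1                     # sd of idiosyncratic error (illustrative)
alpha <- rnorm(N, sd = sigma_a)  # one draw of alpha_i per individual
u1 <- alpha + rnorm(N, sd = sigma_e)  # composite error, period 1
u2 <- alpha + rnorm(N, sd = sigma_e)  # composite error, period 2
cor(u1, u2)  # close to sigma_a^2 / (sigma_a^2 + sigma_e^2) = 0.5
```

With equal variances the within-individual error correlation is about 0.5; pooled OLS ignores this, which is why its standard errors are wrong when random effects are present.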

#### STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, re


#### R:

reg1.re <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
data=voter, index = c("state","year"), model = "random")
summary(reg1.re)


## Pooled OLS versus Random Effects

The Breusch-Pagan LM test can be used to determine whether you should use a random effects model or pooled OLS. The null hypothesis is that the variance of the unobserved heterogeneity is zero, i.e.

$H_0: \sigma_\alpha^2 = 0$
$H_a: \sigma_\alpha^2 \neq 0$

Failure to reject the null hypothesis implies that you will have more efficient estimates using OLS.
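For intuition, the LM statistic is simple to compute from the pooled OLS residuals: $LM = \frac{NT}{2(T-1)}\left[\frac{\sum_i \left(\sum_t \hat{u}_{it}\right)^2}{\sum_i \sum_t \hat{u}_{it}^2} - 1\right]^2$, which is $\chi^2_1$ under $H_0$. Here is a base-R sketch on simulated residuals (not the voter data; the effect size is made up for illustration):

```r
# Breusch-Pagan LM statistic computed "by hand" from simulated pooled residuals.
set.seed(1)
N <- 50; T <- 10                             # balanced panel dimensions (illustrative)
id <- rep(1:N, each = T)                     # individual index
alpha <- rep(rnorm(N, sd = 0.5), each = T)   # true individual effects (illustrative)
u <- alpha + rnorm(N * T)                    # stand-in for pooled OLS residuals
s_i <- tapply(u, id, sum)                    # residual sums within each individual
LM <- (N * T) / (2 * (T - 1)) * (sum(s_i^2) / sum(u^2) - 1)^2
p_value <- pchisq(LM, df = 1, lower.tail = FALSE)
```

Because the simulation builds in nonzero individual effects, the residual sums within individuals are inflated relative to the no-effects case, LM is large, and the null is rejected.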

#### STATA:

xttest0


#### R:

plmtest(reg1.pool, type="bp")


## Fixed Effects versus Random Effects

The Hausman test can help you decide between the Random Effects (RE) and Fixed Effects (FE) models. Recall that RE is appropriate when the unobserved heterogeneity is uncorrelated with the regressors. The logic of the Hausman test is this: if the truth is RE, both the RE and FE estimators are consistent, so you should prefer RE because it is efficient. If the truth is FE, however, the RE estimator is inconsistent, so you must use FE. The null hypothesis, then, is that the unobserved heterogeneity $\alpha_i$ and the regressors $X_{it}$ are uncorrelated; equivalently, under the null the coefficient estimates of the two models are not statistically different. Failing to reject the null lends support to the RE estimator. If the null is rejected, RE will produce biased coefficient estimates, so a FE model is preferred.

$H_0: \text{Corr}[X_{it},\alpha_i] = 0$
$H_a: \text{Corr}[X_{it},\alpha_i] \neq 0$
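Under the hood, the Hausman statistic contrasts the two coefficient vectors, weighted by the difference of their covariance matrices: $H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'[\hat{V}_{FE} - \hat{V}_{RE}]^{-1}(\hat{\beta}_{FE} - \hat{\beta}_{RE})$, which is $\chi^2_k$ under the null. A base-R sketch with made-up numbers (the coefficients and covariance matrices below are purely illustrative, not estimates from the voter data):

```r
# Hausman statistic computed by hand from hypothetical FE and RE estimates.
b_fe <- c(gsp = 0.050, midterm = -0.090)  # illustrative FE coefficients
b_re <- c(gsp = 0.040, midterm = -0.085)  # illustrative RE coefficients
V_fe <- diag(c(0.0100, 0.0040))           # illustrative FE covariance matrix
V_re <- diag(c(0.0080, 0.0030))           # illustrative RE covariance (smaller: RE is efficient)
d <- b_fe - b_re
H <- as.numeric(t(d) %*% solve(V_fe - V_re) %*% d)
p_value <- pchisq(H, df = length(d), lower.tail = FALSE)
H        # 0.075
p_value  # ~0.96: fail to reject, so RE would be preferred in this made-up case
```

This is exactly what  phtest  (and Stata's  hausman ) computes from the two fitted models, with  df  equal to the number of coefficients common to both.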

#### STATA:

xtreg vaprate midterm gsp regdead WNCentral South Border, fe
estimates store fe

xtreg vaprate midterm gsp regdead WNCentral South Border, re
estimates store re

hausman fe re


#### R:

phtest(reg1.fe,reg1.re)


## Some plots

The following examples use the data  wr-nevermar.dta .

Say we are interested in plotting the mean of the variable “nevermar” over time.

#### STATA:

egen meannevermar = mean(nevermar), by(year)
twoway (line meannevermar year, sort), ytitle(Mean--nevermar)


#### R:

nmar <- read.dta(file="/Users/kevingoulding/DATA/wr-nevermar.dta")

b1 <- as.matrix(tapply(nmar$nevermar, nmar$year , mean))

plot(as.numeric(row.names(b1)), b1, type="l", main="NEVERMAR Mean", xlab = "Year", ylab = "Mean(nevermar)", col="red", lwd=2)