Some questions may arise when contemplating what model to use to empirically answer a question of interest, such as:

- Is there unobserved heterogeneity in my data sample? If so, is it time-invariant?
- What variation in my data sample do I need to identify my coefficient of interest?
- What is the data-generating process for my unobserved heterogeneity?

The questions above can be (loosely) translated into these more specific questions:

- Should I include fixed effects (first-differenced, time-demeaned transformations, etc.) when I run my regression? Should I account for the unobserved heterogeneity using time dummy variables or individual dummy variables?
- Is the variation I’m interested in between individuals or within individuals? This may conflict with my choice of time or individual dummy variables.
- Can I use a random effects model?
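To make the between/within distinction concrete, here is a small numerical sketch. It is in Python/NumPy with made-up numbers (the post itself uses R and Stata), and simply shows that total variation splits exactly into a "between" part (across individual means) and a "within" part (around each individual's own mean):

```python
import numpy as np

# Toy panel: 3 individuals (rows) observed over 4 periods (columns).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [2.0, 2.0, 2.0, 2.0]])

grand_mean = x.mean()
indiv_means = x.mean(axis=1, keepdims=True)

# "Between" variation: spread of the individual means around the grand mean.
# "Within" variation: spread of each observation around its own individual mean.
between = x.shape[1] * ((indiv_means - grand_mean) ** 2).sum()
within = ((x - indiv_means) ** 2).sum()
total = ((x - grand_mean) ** 2).sum()

print(between, within, total)   # between + within == total
# Individual 3 never changes over time, so it contributes no within
# variation: a fixed effects model can only use the within component.
```

A fixed effects regression identifies coefficients from within variation only, which is why time-invariant regressors drop out of it.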

That said, choosing a model for your panel data can be tricky. In what follows, I will offer some tools to help you answer some of these questions. The first part of this exercise will use the data `panel_hw.dta` (can be found here); the second part will use the data `wr-nevermar.dta` (can be found here).

## A Pooled OLS Regression

To review, let’s load the data and run a model looking at voter participation rate as a function of a few explanatory variables and regional dummy variables (WNCentral, South, Border). `panel_hw.dta` is a panel data set where individual = “stcode” (state code) and time = “year”. We are, then, pooling the data in the following regression.
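Pooling simply means stacking every state-year observation and fitting one OLS regression to the whole stack, ignoring the panel structure. A rough illustration of the idea (a Python/NumPy sketch on simulated data, not the post's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5 * 4                              # e.g. 5 states observed over 4 years
x = rng.normal(size=n)                 # one regressor, all rows stacked
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)

# Pooled OLS fits a single regression to every state-year row at once,
# as if the data were an ordinary cross section.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # roughly [1.0, 2.0]
```

This is only valid if the pooled error term is well behaved; the tests later in the post check exactly that.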

#### STATA:

```stata
use panel_hw.dta
reg vaprate gsp midterm regdead WNCentral South Border
```

And then run an F-test on the joint significance of the included dummy variables:

```stata
test WNCentral South Border
```

#### R:

```r
require(foreign)
voter <- read.dta("/Users/kevingoulding/DATA/panel_hw.dta")  # this section uses panel_hw.dta
reg1 <- lm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border, data = voter)
```

Then run an F-test on the joint significance of the included regions:

```r
require(car)
linearHypothesis(reg1, c("WNCentral = 0", "South = 0", "Border = 0"))
```

Similarly, this could be accomplished using the `plm` package (I recommend using this method).

```r
reg1.pool <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
                 data = voter, index = c("state", "year"), model = "pooling")
summary(reg1.pool)

# F-test
linearHypothesis(reg1.pool, c("WNCentral = 0", "South = 0", "Border = 0"), test = "F")
```

## A Fixed Effects Regression

Now let’s run the same model with state fixed effects. Recall that `panel_hw.dta` is a panel data set where individual = “stcode” (state code) and time = “year”. This time, rather than pooling the data, we exploit only the within-state variation.

#### STATA:

```stata
iis stcode
tis year
xtreg vaprate midterm gsp regdead WNCentral South Border, fe
```

In R, recall that we’ll have to transform the data into a panel data form.

#### R:

```r
require(plm)
# The "within" estimator includes state fixed effects.
reg1.fe <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
               data = voter, index = c("state", "year"), model = "within")
summary(reg1.fe)
```
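Under the hood, the "within" estimator subtracts each individual's time average from every variable, which wipes out any time-invariant individual effect. A simulated sketch of why that matters (Python/NumPy with made-up data; it shows pooled OLS going badly wrong when the individual effects are correlated with the regressor, while the within estimator recovers the truth):

```python
import numpy as np

rng = np.random.default_rng(1)
n_id, n_t, beta = 50, 10, 2.0
ids = np.repeat(np.arange(n_id), n_t)
alpha = rng.normal(scale=5.0, size=n_id)       # individual ("state") effects
x = rng.normal(size=n_id * n_t) + alpha[ids]   # regressor correlated with them
y = alpha[ids] + beta * x + rng.normal(scale=0.5, size=n_id * n_t)

def demean(v, ids):
    """Subtract each individual's time average: the "within" transform."""
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]

x_w, y_w = demean(x, ids), demean(y, ids)
beta_fe = (x_w @ y_w) / (x_w @ x_w)    # within (fixed effects) slope
beta_pool = np.polyfit(x, y, 1)[0]     # pooled OLS slope: biased upward here
print(beta_fe, beta_pool)              # beta_fe near 2.0, beta_pool well above it
```

If the effects were instead uncorrelated with x, both estimators would be consistent, which is the scenario the random effects model exploits below.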

Well, should we use the fixed effects model or the pooled OLS model? In R, you can run a test between the two:

```r
pFtest(reg1.fe, reg1.pool)
```

Or, we can test for individual fixed effects present in the pooled model, like this:

```r
plmtest(reg1.pool, effect = "individual")
```

## The Random Effects Estimator

It could be, however, that the unobserved heterogeneity is uncorrelated with all of the regressors in all time periods, the so-called “random effects” case. This would mean that even if we did not account for these effects, we would still estimate our coefficients consistently, but their standard errors would be biased. To correct for this, we can use the random effects model, a form of Generalized Least Squares (GLS) that accounts for the serial correlation the random effects embed in the error terms.
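Concretely, the GLS estimator can be computed by "quasi-demeaning": subtracting only a fraction theta of each individual's mean, where theta depends on the variance components. A sketch of that formula (Python; the variance components are assumed known here, whereas `plm` and `xtreg` estimate them from the data):

```python
import numpy as np

def re_theta(sigma_u2, sigma_c2, T):
    """Quasi-demeaning fraction for the random effects (GLS) transform:
    theta = 1 - sqrt(sigma_u^2 / (sigma_u^2 + T * sigma_c^2)),
    where sigma_u^2 is the idiosyncratic error variance and
    sigma_c^2 is the variance of the individual effect."""
    return 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_c2))

# theta -> 0 when the random effect vanishes: RE collapses to pooled OLS.
# theta -> 1 when the effect dominates: RE approaches the within (FE) transform.
print(re_theta(1.0, 0.0, 10))     # 0.0
print(re_theta(1.0, 100.0, 10))   # close to 1
```

So RE sits between the pooled and fixed effects estimators, with the data deciding how far to demean.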

#### STATA:

```stata
xtreg vaprate midterm gsp regdead WNCentral South Border, re
```

#### R:

```r
reg1.re <- plm(vaprate ~ gsp + midterm + regdead + WNCentral + South + Border,
               data = voter, index = c("state", "year"), model = "random")
summary(reg1.re)
```

## Pooled OLS versus Random Effects

The Breusch-Pagan LM test can be used to determine whether you should use the random effects model or pooled OLS. The null hypothesis is that the variance of the unobserved heterogeneity is zero, i.e. Var(c_i) = 0.

Failure to reject the null hypothesis implies that you will have more efficient estimates using pooled OLS.
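For reference, the LM statistic is built directly from the pooled-OLS residuals. A sketch of the formula on simulated residuals (Python/NumPy, illustrative only; in practice `xttest0` and `plmtest` do this for you):

```python
import numpy as np

def bp_lm(resid):
    """Breusch-Pagan LM statistic for random effects, from pooled-OLS
    residuals shaped (N individuals, T periods). Under H0 (no random
    effect) it is chi-square with 1 degree of freedom."""
    N, T = resid.shape
    ratio = (resid.sum(axis=1) ** 2).sum() / (resid ** 2).sum()
    return (N * T / (2.0 * (T - 1))) * (ratio - 1.0) ** 2

rng = np.random.default_rng(2)
e = rng.normal(size=(100, 5))            # H0 true: no individual effect
e_re = e + rng.normal(size=(100, 1))     # add a random individual effect
print(bp_lm(e), bp_lm(e_re))
# The second statistic blows past the 5% chi-square(1) cutoff of 3.84,
# so with the effect present we reject H0 and prefer RE over pooled OLS.
```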

#### STATA:

```stata
xttest0
```

#### R:

```r
plmtest(reg1.pool, type = "bp")
```

## Fixed Effects versus Random Effects

The Hausman test can help determine whether you should use the Random Effects (RE) model or the Fixed Effects (FE) model. Recall that an RE model is appropriate when the unobserved heterogeneity is uncorrelated with the regressors. The logic behind the Hausman test is that if the truth is RE, both the RE estimator and the FE estimator are consistent (so you should opt for the RE estimator because it is efficient). However, if the truth is FE, the RE estimator is inconsistent, so you must use the FE estimator. The null hypothesis, then, is that the unobserved heterogeneity and the regressors are uncorrelated. Another way to think about it is that under the null hypothesis the coefficient estimates of the two models are not statistically different. If you fail to reject the null hypothesis, this lends support to the use of the RE estimator. If the null is rejected, RE will produce biased coefficient estimates, so an FE model is preferred.
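The statistic itself compares the two coefficient vectors, weighted by the difference of their covariance matrices. A sketch with hypothetical numbers (Python/NumPy; the coefficient values are invented purely for illustration):

```python
import numpy as np

def hausman(b_fe, b_re, v_fe, v_re):
    """Hausman statistic H = (b_FE - b_RE)' (V_FE - V_RE)^{-1} (b_FE - b_RE).
    Under H0 (RE is consistent) H is chi-square with k = len(b_fe) df."""
    d = b_fe - b_re
    return float(d @ np.linalg.solve(v_fe - v_re, d))

# Hypothetical estimates for two coefficients (illustrative numbers only):
b_fe = np.array([0.50, -1.20])
b_re = np.array([0.48, -1.15])
v_fe = np.diag([0.010, 0.020])   # FE is less efficient: larger variances
v_re = np.diag([0.004, 0.008])
H = hausman(b_fe, b_re, v_fe, v_re)
print(H)  # 0.275, well below the chi-square(2) 5% cutoff of 5.99
# Here the FE and RE estimates are close, so we fail to reject and keep RE.
```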

#### STATA:

```stata
xtreg vaprate midterm gsp regdead WNCentral South Border, fe
estimates store fe
xtreg vaprate midterm gsp regdead WNCentral South Border, re
estimates store re
hausman fe re
```

#### R:

```r
phtest(reg1.fe, reg1.re)
```

## Some plots

The following examples use the data `wr-nevermar.dta`.

Say we are interested in plotting the mean of the variable “nevermar” over time.

#### STATA:

```stata
egen meannevermar = mean(nevermar), by(year)
twoway (line meannevermar year, sort), ytitle(Mean--nevermar)
```

#### R:

```r
nmar <- read.dta(file = "/Users/kevingoulding/DATA/wr-nevermar.dta")
b1 <- as.matrix(tapply(nmar$nevermar, nmar$year, mean))
plot(row.names(b1), b1, type = "l", main = "NEVERMAR Mean",
     xlab = "Year", ylab = "Mean(nevermar)", col = "red", lwd = 2)
```