Archive for June 16th, 2011

June 16, 2011

The Chow test in R: A case study of Yellowstone’s Old Faithful Geyser

Recently I took a road trip south to Yellowstone National Park, where the fascinating phenomenon that is Old Faithful is still spewing piping-hot water 120 feet into the air every hour or so. From fellow spectators, I was told that the time between eruptions was approximately 60 minutes, but could be longer depending on the length of the previous eruption. I wondered, would it be possible to get a better estimate of our waiting time, conditional on the previous eruption’s length?

Fortunately, there is plenty of data available on the geyser. See the site:

Let’s look at some data showing length of the eruption and waiting time between eruptions captured by some park rangers in the 1980s.

A description of the data can be found here.

The Chow Test for Structural Breaks

The Chow test is used to see if it makes sense to run two separate regressions on two mutually exclusive subsets of your data (divided by a break point) by comparing the results of the two “unrestricted” regressions versus the “restricted” regression that pools all the data together.

H_0: \beta_{ur_1} = \beta_{ur_2}

H_a: \beta_{ur_1} \neq \beta_{ur_2}

The procedure is as follows:

  1. Run a “restricted” regression on all your data (pooled).
  2. Divide your sample into to groups, determined by your breakpoint (e.g. a point in time, or a variable value).
  3. Run an “unrestricted” regression on each of your subsamples. ¬†You will run two “unrestricted” regressions with a single breakpoint.
  4. Calculate the Chow F-statistic as follows:

\frac{(SSR_r - SSR_u)/k}{SSR_u/(n-2k)} \sim F_{k,n-2k}

where SSR_r = the sum of squared residuals of the restricted model, SSR_u = the sum of squared residuals from the unrestricted models, k = the number of regressors (including the intercept), and n= the number of total observations.

For a much more thorough review of the topic, see page 113 of Dr. Anton Bekkerman’s class notes here.

Eruption Time versus Waiting Time

Following the relationship mentioned to me from a fellow geyser visitor between the length of the eruption and the subsequent waiting time between eruptions, let’s see if there appears to be a relationship:

of = faithful
plot(of, col="blue",main="Eruptions of Old Faithful")

A scatterplot.

There certainly does. One thing that does pop out is the existence of two groups – there’s a cluster of data points in the upper right, and another in the lower left. What could explain this?

Maybe there is an extra chamber within the geyser that, once it is breached, empties completely. Then, it takes longer for the next eruption to occur because that chamber has to fill back up with boiling hot water. NOTE: This non-scientific explanation is not backed by any scientific studies.

So, from the scatter plot, it appears that the breakpoint between the two groups occurs at the value of 3.25 minutes of eruption. For this example, this will be our break point.

## Run three regressions (1 restricted, 2 unrestricted)
r.reg = lm(waiting ~ eruptions, data = of)
ur.reg1 = lm(waiting ~ eruptions, data = of[of$eruptions > 3.25,])
ur.reg2 = lm(waiting ~ eruptions, data = of[of$eruptions
## review the regression results

## Calculate sum of squared residuals for each regression
SSR$r = r.reg$residuals^2
SSR$ur1 = ur.reg1$residuals^2
SSR$ur2 = ur.reg2$residuals^2

## K is the number of regressors in our model
K = r.reg$rank

## Computing the Chow test statistic (F-test)
numerator = ( sum(SSR$r) - (sum(SSR$ur1) + sum(SSR$ur2)) ) / K
denominator = (sum(SSR$ur1) + sum(SSR$ur2)) / (nrow(of) - 2*K)
chow = numerator / denominator

## Calculate P-value
1-pf(chow, K, (nrow(of) - 2*K))

Now, we can plot the results. In this figure, the red dotted line is the restricted regression line, and the blue lines are the two unrestricted regression lines.

## Plot the results
plot(of,main="Eruptions of Old Faithful")
# restricted model
abline(r.reg, col = "red",lwd = 2, lty = "dashed")
# unrestricted model 1
segments(0, ur.reg2$coefficients[1], 3.25,
ur.reg2$coefficients[1]+3.25*ur.reg2$coefficients[2], col= 'blue')
# unrestricted model 2
segments(3.25, ur.reg1$coefficients[1]+3.25*ur.reg1$coefficients[2],
5.2, ur.reg1$coefficients[1]+5.2*ur.reg1$coefficients[2], col= 'blue')

A Quicker Way

The CRAN package strucchange provides a more streamlined approach to calculating a Chow test statistic, but it requires that the data is ordered so that the breakpoint can be identified by a specific row number. In our example, we’ll have to order our data by eruptions, and then point out that the 98th data point is the last data point where eruptions is less than 3.25. The code is below:

## Sort the data
sort.of = of[order(of$eruptions) , ]
sort.of = cbind(index(sort.of),sort.of)

## Identify the row number of our breakpoint
brk = max(sort.of[,1][sort.of$eruptions brk

## Using the CRAN package 'strucchange'
sctest(waiting ~ eruptions, type = "Chow", point = brk, data = sort.of)

The results above should mirror exactly the results we achieved via manual calculation in the section above.

A Single Regression

If you wanted to run a single regression that allowed the marginal effect of eruption time on waiting time to vary before and after the breakpoint, you add a dummy variable and an interaction term. In R, it is relatively straightforward:

of$dummy = as.numeric(of$eruptions >= 3.25)
summary(lm(eruptions ~ waiting + I(dummy*waiting) + dummy, data = of))

*Be careful as you interpret the coefficient estimates because you must add a few of them together.