October 27, 2012

## Calculate an OLS regression with matrices in Python using NumPy

The following code replicates the results of NumPy's numpy.linalg.lstsq() function. For this exercise, we will use a cross-sectional data set in .csv format called “cdd.ny.csv”, which I have provided and which contains monthly cooling degree day data for New York state. The data is available here (File –> Download).

The OLS regression equation:

$Y = X\beta + \varepsilon$

where $\varepsilon =$ a white noise error term. For this example $Y =$ the population-weighted Cooling Degree Days (CDD) (CDD.pop.weighted), and $X =$ CDD measured at La Guardia airport (CDD.LGA). Note: this is a meaningless regression used solely for illustrative purposes.

Recall that the following matrix equation is used to calculate the vector of estimated coefficients $\hat{\beta}$ of an OLS regression:

$\hat{\beta} = (X'X)^{-1}X'Y$

where $X =$ the matrix of regressor data (the first column is all 1’s for the intercept), and $Y =$ the vector of the dependent variable data.
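As a rough illustration of the formula above, here is a minimal sketch that computes $\hat{\beta}$ directly and compares it with numpy.linalg.lstsq(). It does not use cdd.ny.csv: the data are synthetic, and the names `x`, `y`, the sample size, and the "true" coefficients are all assumptions made up for the example. It uses plain arrays and the `@` operator (Python 3.5+) rather than the matrix class.

```python
import numpy as np

# Synthetic stand-in data (assumption: NOT the values from cdd.ny.csv).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0.0, 400.0, size=n)            # hypothetical regressor
y = 5.0 + 0.9 * x + rng.normal(0.0, 10.0, n)   # hypothetical dependent variable

# Design matrix X: the first column is all 1's for the intercept.
X = np.column_stack([np.ones(n), x])

# beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference result from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("matrix formula:", beta_hat)
print("lstsq:         ", beta_lstsq)
```

The two results should agree to within floating-point error. Forming the explicit inverse mirrors the textbook equation; for ill-conditioned data, lstsq() is generally the more numerically stable choice.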

## Matrix operators in Numpy

• matrix() coerces an object into the matrix class.
• .T transposes a matrix.
• * or dot(X,Y) performs matrix multiplication on matrix objects (dot() does so when its arguments are 2-dimensional; see here). Note: on plain arrays, * multiplies element by element instead.
• .I takes the inverse of a matrix. Note: the matrix must be invertible.
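
The operators listed above can be tried on a small example. This is only a sketch: the matrices `A` and `B` are made-up values chosen for illustration, not taken from the data set. Note also that current NumPy documentation no longer recommends the matrix class for new code, although it still works.

```python
import numpy as np

# Hypothetical 2x2 and 2x1 matrices, chosen only for illustration.
A = np.matrix([[2.0, 1.0],
               [1.0, 3.0]])
B = np.matrix([[1.0],
               [2.0]])

At = A.T                # transpose
AB = A * B              # matrix multiplication with the matrix class
AB_dot = np.dot(A, B)   # same product via dot()
A_inv = A.I             # inverse (A must be invertible)

print(AB)               # 2x1 product
print(A_inv * A)        # approximately the 2x2 identity
```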