June 13, 2011

## Calculate OLS regression manually using matrix algebra in R

The following code will attempt to replicate the results of the  lm() function in R. For this exercise, we will be using a cross sectional data set provided by R called “women”, that has height and weight data for 15 individuals.

The OLS regression equation: $Y = X\beta + \varepsilon$

where $\varepsilon =$ a white noise error term. For this example $Y =$ weight, and $X =$ height. $\beta =$ the marginal impact a one unit change in height has on weight.

## This is the OLS regression we will manually calculate:
reg = lm(weight ~ height, data=women)
summary(reg)


Recall that the following matrix equation is used to calculate the vector of estimated coefficients $\hat{\beta}$ of an OLS regression: $\hat{\beta} = (X'X)^{-1}X'Y$

where $X =$ the matrix of regressor data (the first column is all 1’s for the intercept), and $Y =$ the vector of the dependent variable data.

## Matrix operators in R

• as.matrix()  coerces an object into the matrix class.
• t()  transposes a matrix.
• %*%  is the operator for matrix multiplication.
• solve()  takes the inverse of a matrix. Note, the matrix must be invertible.

For a more complete introduction to doing matrix operations in R, check out this page.

## Back to OLS

The following code calculates the 2 x 1 matrix of coefficients, $\hat{\beta}$:

## Create X and Y matrices for this specific regression
X = as.matrix(cbind(1,women$height)) Y = as.matrix(women$weight)

## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Height"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat


## Calculating Standard Errors

To calculate the standard errors, you must first calculate the variance-covariance (VCV) matrix, as follows: $Var(\hat{\beta}|X) = \frac{1}{n-k}\hat{\varepsilon}'\hat{\varepsilon}(X'X)^{-1}$

The VCV matrix will be a square k x k matrix. Standard errors for the estimated coefficients $\hat{\beta}$ are found by taking the square root of the diagonal elements of the VCV matrix.

## Calculate vector of residuals
res = as.matrix(women$weight-bh-bh*women$height)

## Define n and k parameters
n = nrow(women)
k = ncol(X)

## Calculate Variance-Covariance Matrix
VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients
StdErr = sqrt(diag(VCV))

## Calculate p-value for a t-test of coefficient significance
P.Value = rbind(2*pt(abs(bh/StdErr), df=n-k,lower.tail= FALSE),
2*pt(abs(bh/StdErr), df=n-k,lower.tail= FALSE))

## concatenate into a single data.frame
beta.hat = cbind(beta.hat,StdErr,P.Value)
beta.hat


## A Scatterplot with OLS line

## Plot results
plot(women$height,women$weight, xlab = "Height", ylab = "Weight",
main = "OLS: Height and Weight")
abline(a = bh, b = bh, col = 'red', lwd = 2, lty="dashed")


Now you can check the results above using the canned lm() function:

summary(lm(weight ~ height, data = women))

Tags: