Home > Teaching > Tutorials > R Tutorial Index
5 ::: Linear Models
This will only cover linear models for the purposes of Stat M12.
basic linear models
fitting a model in R
finding predictors and residuals
regression diagnostics
confidence intervals
If there is a variable x that is believed to hold a linear relationship with another variable y, then a linear model may be useful. The model will take the form
yi = axi + b + ei
It is easier to think of the model without the ei, which is just the residual of data point i (some may call this the 'error'). So, the goal is the find the best fitting line to the data.
As discussed in lab, this best linear model (by many standards) and the most commonly used
method is called the 'least squares regression line' and it has some
special properties:
- it minimizes the sum of the squared residuals,
- the sum of the residuals is zero, and
- the point (mean(x), mean(y)) falls on the line.
Given a vector x that may be an explanatory variable of y (through a linear relationship), a model may be fit in R using the function lm():
> x # the values of x [1] 1 2 3 4 5 6 7 8 9 10 11 22 13 14 15 16 17 18 19 20 21 12 23 [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 [47] 47 48 49 50 > y # the values of y [1] 17.904874 16.869444 8.577667 11.514831 37.501027 36.153710 [7] 31.095892 -3.168944 21.165378 43.404508 24.303967 43.302560 [13] 46.234474 59.766088 27.568532 23.638366 53.436572 65.486651 [19] 90.248055 48.700811 43.273224 38.537765 64.810771 45.214556 [25] 36.170230 96.215944 50.387662 54.798343 134.467379 75.629519 [31] 33.460263 56.375976 57.751208 52.897848 77.047355 62.359089 [37] 38.304368 57.394667 82.131770 62.388716 92.809843 108.512499 [43] 84.240950 82.989272 100.224232 82.375033 121.487718 97.679322 [49] 90.685929 111.203797 > plot(x, y) # look at the scatter plot > fit <- lm(y ~ x) # y 'as a linear function of' x > abline(fit) # add the least squares line onto the scatter plot
The output is saved in fit and may be viewed by typing in fit and hitting enter or by looking at summary(fit):
> fit Call: lm(formula = y ~ x) Coefficients: (Intercept) x 15.610 1.659
The number corresponding to the y-intercept is the number below
(Intercept). The second value that falls below x
corresponds to the coefficient of x (the slope in the equation).
To put this into the form of an equation, use the form 'y=mx+b':
y = 1.659*x + 15.610
To get more information from fit, the summary is useful:
> summary(fit) # the blue highlighting was done by me, not R Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -38.683 -11.689 -2.379 10.182 70.751 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 15.6101 5.6999 2.739 0.00863 ** x 1.6588 0.1945 8.527 3.56e-11 *** --- Signif. codes: 0 Ô***Õ 0.001 Ô**Õ 0.01 Ô*Õ 0.05 Ô.Õ 0.1 Ô Õ 1 Residual standard error: 19.85 on 48 degrees of freedom Multiple R-Squared: 0.6024, Adjusted R-squared: 0.5941 F-statistic: 72.71 on 1 and 48 DF, p-value: 3.558e-11
Notice that the two values from the output of fit fall under the column Estimate. The r2 of the fit is found by looking at Multiple R-Squared (in this case, it is 0.6024). For Stat M12, we will not need the rest of the information in the summary.
Continuing on with the example above, it's time to put the model fit to use. Given the model,
if we want to look at what the predicted value of one of our x's would be, say at 12, what do
we do? To find this, we just use our least squares fit and plug 12 into the equation:
y-predicted = 1.659*12 + 15.610 = 35.518
So when x=12, we expect y to be about 35.518. This could be computed in R rather than
using a calculator.
The way I constructed x in R, the position in x
corresponding to the value 12 is the 22nd position, so I could the following in R:
> 1.659*x[22] + 15.610 [1] 35.518
To check my answer in R, I could also use fit, which stores extra information including predicted values:
> fit$fitted[22] 12 35.51623
The reason why the answers are slightly different is due to round off error in the intercept and slope of the values in the R output (R rounded to 4 decimal places in the output of fit and summary(fit)). The output from fit$fitted[12] is more precise.
If we are interested in the predicted value, odds are we also have interest in the residual
at the given point. To compute the residual of a point, we use
residual = observed - predicted
So, in the case for when x=12 with a predicted value of 35.52 (and an observed value of 38.54, which is
shown by the red dot in the plot),
the residual would be
38.54 - 35.52 = 3.02
As with the pedicted, R will save the residuals in fit, and the residuals
may be extracted using
> fit$resid[22] 12 3.021535
As before, the manually computed and R-output may differ slightly, which is again due to round-off error in the values of the model fit.
One last remark. If a residual is positive, then we know the observed value was larger
than the predicted. If a residual is negative, then we know the observed value was
smaller than the predicted. This follows directly from the equation
residual = observed - predicted
[[ Stat M12 students: be certain to understand how to do the manual computations of the predicted value and residuals from the output of fit and summary(fit). Use fit$fitted and fit$resid only as a method to check your answers. ]]
The plots below are the first two diagnostic plots available in R by plotting fit:
> plot(fit)
To view the plots, hit 'return'. The first plot gives an idea of whether there is any curvature in the data. If the red line is strongly curved, a quadratic or other model may be better. In this case, the curvature is not strong (so a quadratic component in the model is not necessary). The second plot is to check whether the residuals are normally distributed. (Stat M12 students, don't worry about this.)
The third plot is used to check if the variance is contant (ie, if the standard deviation among the residuals appears to be about constant). If the red line is strongly tilted up/down, that is a red flag. There are no issues with that in this example - the variance appears constant. (The red line will always move up/down a little because of inherent randomness.) The last plot is used to check to see if there were any overly influential points (Stat M12 students, don't worry about this plot at all).
Given a linear model, a confidence interval may be easily obtained by using the function confint():
> confint(fit) 2.5 % 97.5 % (Intercept) 4.149683 27.070573 x 1.267703 2.049981
To change the level of confidence, add in the level argument:
> confint(fit, level=0.9) 5 % 95 % (Intercept) 6.050094 25.170162 x 1.332563 1.985121 > confint(fit, level=0.99) 0.5 % 99.5 % (Intercept) 0.3217958 30.898460 x 1.1370588 2.180625