mtcarsWe load the dataset
data(mtcars)
head(mtcars)Suppose we consider a linear regression of mpg against
disp. We first check the scatter plot.
plot(mtcars$disp, mtcars$mpg)It shows a nonlinear trend. So we create two new columns by taking the log.
mtcars['logmpg'] = log(mtcars$mpg)
mtcars['logdisp'] = log(mtcars$disp)
plot(mtcars$logdisp, mtcars$logmpg)Now the trend is linear. We run a linear regression model.
fit = lm(logmpg ~ logdisp, data=mtcars)
print(summary(fit))## 
## Call:
## lm(formula = logmpg ~ logdisp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22758 -0.08874 -0.00791  0.07970  0.32143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.38097    0.20803   25.87  < 2e-16 ***
## logdisp     -0.45857    0.03913  -11.72 1.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1282 on 30 degrees of freedom
## Multiple R-squared:  0.8207, Adjusted R-squared:  0.8148 
## F-statistic: 137.3 on 1 and 30 DF,  p-value: 1.006e-12Both intercept and the slope are significant. \(R^2=0.8148\). The fitted model is \[logmpg = 5.38 - 0.459\times logdisp + \epsilon\quad\text{with}\ \epsilon\sim N(0, 0.1282^2).\] The residual plot is checked.
plot(fit$fitted.values, fit$residuals)
The residual plot is not too bad, though some unequal variances.
Then we check the QQ-plot.
qqnorm(fit$residuals)
qqline(fit$residuals)The QQ-plot indicates the residuals are normal.
Or we can call plot directly.
plot(fit)Suppose there is a new car with disp=300. The prediction
is
pred = predict(fit, newdata = data.frame(logdisp=log(300)))
print(pred)##        1 
## 2.765399The predicted value for the logmpg is 2.765.
The confidence interval for the mean response is
ci = predict(fit, newdata = data.frame(logdisp=log(300)), interval="confidence")
print(ci)##        fit      lwr      upr
## 1 2.765399 2.708287 2.822511The CI for the mean response of logmpg is (2.708,
2.823).
The prediction interval is
pi = predict(fit, newdata = data.frame(logdisp=log(300)), interval="prediction")
print(pi)##        fit      lwr      upr
## 1 2.765399 2.497496 3.033301The PI for logmpg is (2.497, 3.033).
We can print the CI band and PI band.
newx = seq(4, 6.5, 0.1)
ci.band = predict(fit, newdata = data.frame(logdisp=newx), interval="confidence")
pi.band = predict(fit, newdata = data.frame(logdisp=newx), interval="prediction")
plot(mtcars$logdisp, mtcars$logmpg)
lines(newx, ci.band[,'fit'], type='l')
lines(newx, ci.band[,'upr'], type='l', lty='dashed')
lines(newx, ci.band[,'lwr'], type='l', lty='dashed')
lines(newx, pi.band[,'upr'], type='l', lty='dotted')
lines(newx, pi.band[,'lwr'], type='l', lty='dotted')
legend(5.6, 3.5,lty=c("solid", "dashed", "dotted"), legend=c("fitted", "CI", "PI"))We can also compare the pointwise CI v.s. simultaneous CI.
n = dim(mtcars)[1]
Sxx = var(mtcars$logdisp) * (n-1)
xbar = mean(mtcars$logdisp)
sci.lwr = ci.band[,'fit'] - sqrt(2*qf(0.95,2,fit$df.residual)) * summary(fit)$sigma * sqrt(1/n + (newx - xbar) ** 2 /Sxx)
sci.upr = ci.band[,'fit'] + sqrt(2*qf(0.95,2,fit$df.residual)) * summary(fit)$sigma * sqrt(1/n + (newx - xbar) ** 2 /Sxx)
plot(mtcars$logdisp, mtcars$logmpg)
lines(newx, ci.band[,'fit'], type='l')
lines(newx, ci.band[,'upr'], type='l', lty='dashed')
lines(newx, ci.band[,'lwr'], type='l', lty='dashed')
lines(newx, sci.upr, type='l', lty='dotted')
lines(newx, sci.lwr, type='l', lty='dotted')
legend(5.6, 3.5,lty=c("solid", "dashed", "dotted"), legend=c("fitted", "Pointwise CI", "Simutaneous CI"))