Analysis for Built-in Dataset mtcars

We load the dataset

data(mtcars)
head(mtcars)

Suppose we consider a linear regression of mpg against disp. We first check the scatter plot.

plot(mtcars$disp, mtcars$mpg)

It shows a nonlinear trend. So we create two new columns by taking the log.

mtcars['logmpg'] = log(mtcars$mpg)
mtcars['logdisp'] = log(mtcars$disp)
plot(mtcars$logdisp, mtcars$logmpg)

Now the trend is linear. We run a linear regression model.

fit = lm(logmpg ~ logdisp, data=mtcars)
print(summary(fit))
## 
## Call:
## lm(formula = logmpg ~ logdisp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22758 -0.08874 -0.00791  0.07970  0.32143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.38097    0.20803   25.87  < 2e-16 ***
## logdisp     -0.45857    0.03913  -11.72 1.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1282 on 30 degrees of freedom
## Multiple R-squared:  0.8207, Adjusted R-squared:  0.8148 
## F-statistic: 137.3 on 1 and 30 DF,  p-value: 1.006e-12

Both intercept and the slope are significant. \(R^2=0.8148\). The fitted model is \[logmpg = 5.38 - 0.459\times logdisp + \epsilon\quad\text{with}\ \epsilon\sim N(0, 0.1282^2).\] The residual plot is checked.

plot(fit$fitted.values, fit$residuals)

The residual plot is not too bad, though some unequal variances.

Then we check the QQ-plot.

qqnorm(fit$residuals)
qqline(fit$residuals)

The QQ-plot indicates the residuals are normal.

Or we can call plot directly.

plot(fit)

Prediction

Suppose there is a new car with disp=300. The prediction is

pred = predict(fit, newdata = data.frame(logdisp=log(300)))
print(pred)
##        1 
## 2.765399

The predicted value for the logmpg is 2.765.

The confidence interval for the mean response is

ci = predict(fit, newdata = data.frame(logdisp=log(300)), interval="confidence")
print(ci)
##        fit      lwr      upr
## 1 2.765399 2.708287 2.822511

The CI for the mean response of logmpg is (2.708, 2.823).

The prediction interval is

pi = predict(fit, newdata = data.frame(logdisp=log(300)), interval="prediction")
print(pi)
##        fit      lwr      upr
## 1 2.765399 2.497496 3.033301

The PI for logmpg is (2.497, 3.033).

We can print the CI band and PI band.

newx = seq(4, 6.5, 0.1)
ci.band = predict(fit, newdata = data.frame(logdisp=newx), interval="confidence")
pi.band = predict(fit, newdata = data.frame(logdisp=newx), interval="prediction")
plot(mtcars$logdisp, mtcars$logmpg)
lines(newx, ci.band[,'fit'], type='l')
lines(newx, ci.band[,'upr'], type='l', lty='dashed')
lines(newx, ci.band[,'lwr'], type='l', lty='dashed')
lines(newx, pi.band[,'upr'], type='l', lty='dotted')
lines(newx, pi.band[,'lwr'], type='l', lty='dotted')
legend(5.6, 3.5,lty=c("solid", "dashed", "dotted"), legend=c("fitted", "CI", "PI"))

We can also compare the pointwise CI v.s. simultaneous CI.

n = dim(mtcars)[1]
Sxx = var(mtcars$logdisp) * (n-1)
xbar = mean(mtcars$logdisp)
sci.lwr = ci.band[,'fit'] - sqrt(2*qf(0.95,2,fit$df.residual)) * summary(fit)$sigma * sqrt(1/n + (newx - xbar) ** 2 /Sxx)
sci.upr = ci.band[,'fit'] + sqrt(2*qf(0.95,2,fit$df.residual)) * summary(fit)$sigma * sqrt(1/n + (newx - xbar) ** 2 /Sxx)

plot(mtcars$logdisp, mtcars$logmpg)
lines(newx, ci.band[,'fit'], type='l')
lines(newx, ci.band[,'upr'], type='l', lty='dashed')
lines(newx, ci.band[,'lwr'], type='l', lty='dashed')
lines(newx, sci.upr, type='l', lty='dotted')
lines(newx, sci.lwr, type='l', lty='dotted')
legend(5.6, 3.5,lty=c("solid", "dashed", "dotted"), legend=c("fitted", "Pointwise CI", "Simutaneous CI"))