Problem 1.

Open the "Disease" dataset. The variables \(Y=\) progress of a particular disease, \(X_{1}=\) measurement of one physiological characteristic, and \(X_{2}=\) measurement of another physiological characteristic were originally obtained from 50 individuals with the goal of investigating the dependence of \(Y\) on \(X_{1}\) and \(X_{2}\). Data from a further 50 individuals were obtained at a later date (variables \(Y_{\text{new}}\), \(X_{1\text{new}}\), and \(X_{2\text{new}}\) contain the original 50 observations plus these 50 new ones).

  1. Construct a Matrix Plot of \(Y, X_{1}\), and \(X_{2}\). Briefly describe what this plot tells us about potential problems if we fit an MLR model with both \(X_{1}\) and \(X_{2}\) as predictors.
  2. Fit an SLR model of \(Y\) on \(X_{1}\). Obtain a Residuals vs Fits plot and comment on the "Linearity" and "Equal variance" conditions.
  3. Obtain a Normal Probability Plot of the residuals for the SLR model of \(Y\) on \(X_{1}\). Comment on the "Normality" condition.
  4. You should have found some evidence that the linearity and equal variance assumptions are questionable for the model in part (b). In an attempt to improve the model, we'll next try adding the \(X_{2}\) variable to the model. Fit an MLR model of \(Y\) on \(X_{1}\) and \(X_{2}\). Obtain a Residuals vs Fits plot and a Normal Probability Plot of the residuals. Briefly comment on your findings.
  5. What are the Variance Inflation Factors for \(X_{1}\) and \(X_{2}\) in the model from part (d)? What regression pitfall does this suggest and what can we do to mitigate this type of problem?
  6. Construct a Matrix Plot of \(Y_{\text{new}}\), \(X_{1\text{new}}\), and \(X_{2\text{new}}\). Briefly compare this plot with the Matrix Plot in part (a).
  7. Fit an MLR model of \(Y_{\text{new}}\) on \(X_{1\text{new}}\) and \(X_{2\text{new}}\). Obtain a Residuals vs Fits plot and a Normal Probability Plot of the residuals. Briefly comment on your findings.
  8. What are the Variance Inflation Factors for \(X_{1\text{new}}\) and \(X_{2\text{new}}\) in the model from part (g)? Has the regression pitfall from part (e) been mitigated?
  9. You should have found some evidence that the linearity and equal variance assumptions remain questionable for the model in part (g). In an attempt to improve the model, we'll next try adding an \(X_{1\text{new}}^{2}\) variable to the model. Fit an MLR model of \(Y_{\text{new}}\) on \(X_{1\text{new}}\), \(X_{2\text{new}}\), and \(X_{1\text{new}}^{2}\). [An easy way to do this in Minitab v17 is to click the Model button in the Regression Dialog, highlight \(X_{1\text{new}}\) in the top-left Predictors box, then click "Add" to the right of "Terms through order: 2" so that \(X_{1\text{new}}*X_{1\text{new}}\) appears in the "Terms in the model" list.] Obtain a Residuals vs Fits plot and a Normal Probability Plot of the residuals. Briefly comment on your findings.
  10. What are the Variance Inflation Factors for \(X_{1\text{new}}\) and \(X_{1\text{new}}^{2}\) in the model from part (i)? What regression pitfall does this suggest and what can we do to mitigate this type of problem?
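Outside Minitab, the VIF calculations in parts 5, 8, and 10 take only a few lines of code. The sketch below is a minimal illustration in plain NumPy; since the "Disease" data are not included here, a synthetic pair of highly correlated predictors stands in for \(X_{1}\) and \(X_{2}\). It regresses each predictor on the others and applies \(\mathrm{VIF}_j = 1/(1 - R_j^2)\):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of a design matrix X
    (predictors only, no intercept column). VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the other columns."""
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        factors.append(1.0 / (1.0 - r2))
    return factors

# Hypothetical stand-ins for X1 and X2: strongly correlated, mimicking
# the data-based multicollinearity in the first 50 observations.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=50)
print(vif(np.column_stack([x1, x2])))  # both VIFs well above 10
```

A common rule of thumb flags VIFs above 10 (some texts use 5) as evidence of problematic multicollinearity.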

Problem 2.

Use the "Disease" dataset used in the previous problem. Fit the following six models:

- \(E(Y)=\beta_{0}+\beta_{1} X_{1}\)

- \(E(Y)=\beta_{0}+\beta_{2} X_{2}\)

- \(E(Y)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}\)

- \(E\left(Y_{\text{new}}\right)=\beta_{0}+\beta_{1} X_{1\text{new}}\)

- \(E\left(Y_{\text{new}}\right)=\beta_{0}+\beta_{2} X_{2\text{new}}\)

- \(E\left(Y_{\text{new}}\right)=\beta_{0}+\beta_{1} X_{1\text{new}}+\beta_{2} X_{2\text{new}}\)

  1. Use the model results to complete the following table (the first row has been completed for you):
  2. The first three models use data in which the two predictors are sufficiently highly correlated to create data-based multicollinearity (as explored in the previous problem). Briefly describe how the results in the first three rows of the table illustrate how:
     (i) the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model;
     (ii) the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.

  3. The last three models use additional data such that the correlation between the two predictors has been reduced. The hope is that the data-based multicollinearity has been mitigated. Do the results in the last three rows of the table support the following assertions?
     (i) The estimated regression coefficient of any one variable no longer depends on which other predictor variables are included in the model;
     (ii) The precision of the estimated regression coefficients remains approximately the same as more predictor variables are added to the model.

[Note: You should find that while the results support assertion (ii) to some extent, they don't support assertion (i) at all. The take-home message here is that in most observational studies there will be varying degrees of correlation between predictors that we can't do anything about. This means that it is almost always the case that the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model. The only time this is not the case is in datasets where the correlation between predictors is close to zero (which generally only occurs in designed experiments). The rest of the time we need to be careful to include all relevant predictor variables and to interpret regression coefficients correctly (that is, with reference to all the other predictor variables in the model).]
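The behaviour described in the note can be reproduced numerically. The sketch below uses hypothetical data (the "Disease" dataset is not included here) generated with correlated predictors and a known mean function, then fits \(Y\) on \(X_{1}\) alone and on \(X_{1}\) and \(X_{2}\) together. The \(X_{1}\) coefficient shifts between the two fits, and its standard error grows once the correlated \(X_{2}\) is added:

```python
import numpy as np

def ols(y, *cols):
    """Least-squares fit with intercept; returns coefficients
    (intercept first) and their standard errors."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

# Hypothetical data in the spirit of the first three models: correlated
# predictors with true mean function E(Y) = 1 + 2*X1 + 3*X2.
rng = np.random.default_rng(7)
x1 = rng.normal(size=50)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=50)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=50)

b_slr, se_slr = ols(y, x1)       # Y on X1 alone
b_mlr, se_mlr = ols(y, x1, x2)   # Y on X1 and X2
# In the SLR the X1 coefficient absorbs part of X2's effect; in the MLR
# it moves back toward 2 but with an inflated standard error.
print(b_slr[1], se_slr[1])
print(b_mlr[1], se_mlr[1])
```

This mirrors assertions (i) and (ii): with correlated predictors, the coefficient estimate depends on which other predictors are in the model, and its precision deteriorates as correlated predictors are added.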

Problem 3.

Open the "Health" dataset. The variables Health = index of overall health of a patient, \(B P=\) systolic blood pressure, and \(B M I=\) body mass index were obtained from 100 individuals with the goal of investigating the dependence of Health on BP and BMI.

  1. Fit an MLR model for \(\mathrm{Y}=\) Health and \(\mathrm{X}_{1}=B P, \mathrm{X}_{2}=B M I\). Report the fitted regression equation.
  2. Create a residuals vs BMI graph for the model in part (a). Describe why the pattern in this plot suggests that adding \(BMI^{2}\) to the model might be worthwhile.
  3. Fit an MLR model for \(\mathrm{Y}=\) Health and \(\mathrm{X}_{1}=B P\), \(\mathrm{X}_{2}=B M I\), and \(\mathrm{X}_{3}=B M I^{2}\). [An easy way to do this in Minitab v17 is to click the Model button in the Regression Dialog, highlight BMI in the top-left Predictors box, then click "Add" to the right of "Terms through order: 2" so that BMI*BMI appears in the "Terms in the model" list.] Report the fitted regression equation.
  4. Conduct a hypothesis test for \(\beta_{3}\) in the model in part (c) using a significance level of 0.05. What is your conclusion for this test with respect to the predictor \(BMI^{2}\) ?
  5. What are the Variance Inflation Factors for BMI and \(BMI^{2}\) in the model from part (c)? What regression pitfall does this suggest and what can we do to mitigate this type of problem?
  6. Use the Minitab calculator to create a centered BMI variable stored in a variable called BMIc and defined as "BMI - mean(BMI)." Then fit an MLR model for Y= Health and \(\mathrm{X}_{1}=BP\), \(\mathrm{X}_{2}=BMIc\), and \(\mathrm{X}_{3}=BMIc^{2}\). Report the fitted regression equation.
  7. What are the Variance Inflation Factors for BMIc and \(BMIc^{2}\) in the model from part (f)? Has the regression pitfall identified in part (e) been mitigated?
  8. Use the model from part (c) to predict Health for an individual with \(BP=120\) and \(BMI= 20\) (a point prediction is sufficient, no need for an interval).
  9. Use the model from part (f) to predict Health for this same individual, i.e., with \(BP=120\) and \(BMIc=-2.554\) because mean \((BMI)=22.554\). Compare your result to part \((\mathrm{h})\).
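The effect of centering in parts (e) through (g) is easy to demonstrate outside Minitab. In the sketch below (hypothetical BMI-like values; the real "Health" data are not included here), the raw pair \((BMI, BMI^{2})\) is almost perfectly correlated, while centering first drives the VIF back toward 1:

```python
import numpy as np

def vif_pair(a, b):
    """VIF for a pair of predictors: 1 / (1 - r^2), r = sample correlation."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r * r)

# Hypothetical BMI-like values (all positive, mean around 22.5).
rng = np.random.default_rng(3)
bmi = rng.normal(22.5, 2.5, size=100)

raw = vif_pair(bmi, bmi**2)         # BMI vs BMI^2: nearly collinear
bmic = bmi - bmi.mean()             # centered BMI, as in part (f)
centered = vif_pair(bmic, bmic**2)  # BMIc vs BMIc^2: close to 1
print(raw, centered)
```

Because every BMI value is positive and far from zero, \(BMI\) and \(BMI^{2}\) move almost in lockstep (structural multicollinearity); subtracting the mean breaks that lockstep without changing the fitted values of the model.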

Problem 4.

Use the "WorkerSupervisor" dataset. A study of \(n=28\) industrial establishments of varying sizes investigated the linear relationship between the number of supervised workers, \(X\), and the number of supervisors, \(Y\), recorded at the establishments. The dataset also contains the square of the number of workers (WorkSq).

  1. Graph \(y=\) Supervisor versus \(x=\) Worker. Comment on the important features of the relationship.
  2. Fit a multiple linear regression model with \(\mathbf{y}=\) Supervisor and \(x\) -variables Worker and WorkSq. Store the Fits (predicted values) and Residuals (use the Storage button in the Regression Dialog). Plot the Residuals versus Fits (use the Graphs button). Describe the difficulty that is indicated by this residual plot.
  3. Refer to the multiple regression results from part (b). Fill in the values for the coefficients and standard errors in the table below.
  4. Follow these steps to determine appropriate weights for a weighted least squares model:
    - Use Minitab's calculator to create a new variable named absres defined by the expression abs(RES1), where RES1 represents the residuals that you stored in part (b).
    - Fit a simple linear regression model with \(\mathbf{y}=\) absres and \(\mathbf{x}\)-variable FITS1, where FITS1 represents the fitted values that you stored in part (b). Store the Fits (use the Storage button in the Regression Dialog).
    - Use Minitab's calculator to create a new variable named weights defined by the expression 1/FITS2^2, where FITS2 represents the fitted values that you just stored. These will be possible weights for a weighted regression. The idea is that the correct weights are \(1/\mathrm{sd}^{2}\), where the standard deviation function was estimated in the second step.
    Now, refit the multiple linear regression model from part (b) using weighted least squares (use the Options button in the Regression Dialog and put the new weights variable in the Weights box). For the weighted regression, fill in the values for the coefficients and standard errors in the table below.
  5. Briefly compare the results in parts (c) and (d).
  6. For the weighted regression that you did in part (d), graph the studentized residuals versus fits. (Remember that Minitab calls studentized residuals "standardized" residuals.) Briefly discuss whether the plot looks about as it should (a horizontal random band with constant variance). [In Minitab, use the Graphs button of the regression dialog and at the top of the next dialog box select Standardized residuals.]
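The three-step weighting recipe in part (d) can be sketched in code. The example below is a minimal illustration in plain NumPy, using synthetic data with a megaphone-shaped variance standing in for the "WorkerSupervisor" data (which is not included here): run OLS, model the absolute residuals against the fits to estimate the standard-deviation function, then refit with weights \(1/\widehat{\mathrm{sd}}^{2}\):

```python
import numpy as np

# Hypothetical worker/supervisor-style data: the spread of Y grows with
# X, producing the megaphone-shaped residual plot described in part (b).
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(100, 1000, size=28))
y = 5 + 0.1 * x + 0.01 * x * rng.normal(size=28)  # sd proportional to x

X = np.column_stack([np.ones_like(x), x, x**2])   # Worker and WorkSq

# Step 1: ordinary least squares; keep the fits and residuals.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
fits = X @ b_ols
resid = y - fits

# Step 2: regress |residuals| on the fits to estimate the standard-
# deviation function, then take weights = 1 / sd_hat^2.
A = np.column_stack([np.ones_like(fits), fits])
c, *_ = np.linalg.lstsq(A, np.abs(resid), rcond=None)
sd_hat = np.maximum(A @ c, 1e-6)   # guard against non-positive estimates
weights = 1.0 / sd_hat**2

# Step 3: refit by weighted least squares, i.e. minimize
# sum_i w_i * (y_i - x_i' beta)^2 via rescaled ordinary least squares.
sw = np.sqrt(weights)
b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print("OLS:", b_ols)
print("WLS:", b_wls)
```

Multiplying each row of \(X\) and \(y\) by \(\sqrt{w_i}\) and running ordinary least squares is algebraically identical to putting the weights variable in Minitab's Weights box.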