Multiple linear regression: Select an outcome (dependent) variable and at least 3 predictor (independent)

1. Multiple linear regression:

Select an outcome (dependent) variable and at least 3 predictor (independent) variables appropriate for a multiple linear regression. Ensure that at least one of your predictor/independent variables is a variable that you would include in the model as dummy (indicator) variable(s).

Run descriptive statistics to explore the variables, and recode your dummy variable(s). Interpret your results. (3 points)
Run a multiple linear regression including appropriate diagnostic tests to examine for assumptions. Interpret your results. (2 points analysis; 5 points discussion of results and diagnostics)

2. Nonparametric tests:

Pick 2 of the following tests, run on appropriate data, and interpret your results (2 points each):

Spearman's rho,
Mann-Whitney U,
Kruskal-Wallis analysis of variance,
Pearson's chi-square.

Hand in your output with annotations (copied to a Word document preferred) and include the syntax. Generally, more description of your logic/work and results is better than less.

Solution:

For this model, we’ll use Educ, Size of Place, and Marital Status to predict the variable Respondent’s Income. We’ll use Marital Status as the dummy variable by recoding it in the following way:
maritcat = 1 if the respondent is married, 0 if not.
Hence, we are going to estimate the following regression equation:
$Income={{\beta }_{0}}+{{\beta }_{1}}Educ+{{\beta }_{2}}Size+{{\beta }_{3}}maritcat+\varepsilon$
Using SPSS we get the following results:

The descriptive statistics above summarize the main characteristics of the data. Since the three variables we are analyzing are measured at the interval or ratio level, we use the mean as a measure of central tendency and the standard deviation as a measure of dispersion.
For the first variable, Educ, the mean is 13.78 years, and the standard deviation is 2.889.
For the second variable, Income, the mean is 13.10, and the standard deviation is 5.754.
Finally, for the third variable, Size of Place, the mean is 416,050, and the standard deviation is 1,309,339.
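As a rough illustration of how the mean and the sample standard deviation reported above are computed, here is a small Python sketch. The values in `educ` are made up for the example and are not the survey data; `describe` is a hypothetical helper name.

```python
import statistics


def describe(values):
    """Return n, mean, and sample standard deviation (n - 1 denominator)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


# Made-up years of education, for illustration only.
educ = [12, 14, 16]
summary = describe(educ)
print(summary)
```

The sample standard deviation (dividing by n - 1) is used here because that is the usual default in statistical packages for descriptive output.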
Below we show the distribution of the categorical variable used.

Now, we perform a regression analysis:

The model explains approximately 13.1% of the variance in income. This is fairly low, but the regression is significant overall, with F = 25.998 and p < 0.001 (reported by SPSS as 0.000).
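The overall F statistic and the share of variance explained are linked by F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the sample size. The sketch below shows that relationship in Python; the function name `overall_f` and the example inputs are illustrative assumptions, and the sample size of this analysis is not used because it is not reported in the text.

```python
def overall_f(r_squared: float, n: int, k: int) -> float:
    """Overall F statistic of a linear regression with k predictors.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


# Illustrative inputs only (not the values from this analysis).
f_example = overall_f(r_squared=0.5, n=13, k=2)
print(f_example)
```

For a fixed R², F grows with the sample size, which is why a model with a modest R² can still be highly significant in a large sample.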
We have the following table with regression coefficients:

The model is therefore:
$Income=2.965+0.652Educ+0.000139Size+1.829maritcat$
Notice that all the predictors are significant except for Size, which is not significant (p = 0.457).
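To make the fitted equation concrete, the sketch below plugs hypothetical predictor values into it. The coefficients are the ones reported above; the inputs (16 years of education, a Size value of 100) are invented for illustration, and Size is assumed to be expressed in the same units used when the model was fitted. The result is on the scale of the Income variable as coded in the data set.

```python
# Coefficients taken from the fitted model reported above.
COEF = {"intercept": 2.965, "educ": 0.652, "size": 0.000139, "maritcat": 1.829}


def predict_income(educ: float, size: float, maritcat: int) -> float:
    """Predicted Income from the fitted regression equation."""
    return (
        COEF["intercept"]
        + COEF["educ"] * educ
        + COEF["size"] * size
        + COEF["maritcat"] * maritcat
    )


# Hypothetical respondents who differ only in marital status.
married = predict_income(educ=16, size=100, maritcat=1)
unmarried = predict_income(educ=16, size=100, maritcat=0)
gap = married - unmarried
print(round(married, 4), round(unmarried, 4), round(gap, 3))
```

Holding Educ and Size fixed, the difference between the two predictions equals the maritcat coefficient, which is how a dummy-variable coefficient is interpreted.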
We have the following histogram of residuals:

Now we test for normality:

The p-value for the Shapiro-Wilk test is p < 0.001 (reported by SPSS as 0.000), which means that we have enough evidence to reject the null hypothesis that the residuals are normally distributed.
We have the plot of residuals by predicted values:

There is some kind of pattern, which indicates that the homoskedasticity (constant variance) assumption may be violated.
We are interested in testing whether or not Father’s Highest Degree is independent of Race. We obtain the following crosstabulation:

The chi-square statistic is 21.797, and the corresponding p-value is p = 0.005, which means that we reject the null hypothesis of independence. Hence, we have enough evidence to claim that the two variables are related.
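For reference, the chi-square statistic compares each observed cell count with the count expected under independence (row total × column total / grand total). The following Python sketch computes the statistic and its degrees of freedom for a small made-up table; it is not the crosstabulation from this analysis, and `chi_square_independence` is a hypothetical helper name.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up 2 x 2 table, for illustration only.
stat, df = chi_square_independence([[10, 20], [20, 10]])
print(round(stat, 4), df)
```

The p-value is then obtained from the chi-square distribution with the computed degrees of freedom, which is the step SPSS performs when it reports the significance.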

We are going to test whether or not the median income of male respondents differs from the median income of female respondents. For that purpose we’ll use a Mann-Whitney test. Using SPSS we get:

The p-value for the test is p < 0.001 (reported by SPSS as 0.000), which means that we reject the null hypothesis that male and female respondents have the same median income.
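The Mann-Whitney U statistic itself is based on ranking all observations together and summing the ranks in one group. A minimal Python sketch of that calculation follows, using average ranks for ties; the two small samples are invented for illustration, and the function name `mann_whitney_u` is an assumption rather than anything produced by SPSS.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, using average ranks for ties."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)

    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; assign their average.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Made-up samples, for illustration only.
u_x, u_y = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u_x, u_y)
```

A U value near 0 for one group means almost all of its observations rank below those of the other group; the significance reported by SPSS comes from comparing U with its distribution under the null hypothesis.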
