PROJECT TWO

Self-reported health (SRH) is considered an important indicator of an individual’s health status. In this project, you will use the General Social Survey 2002 to analyze the variation in SRH status and the risk factors that affect SRH. The variables in the dataset are described below:

Year: year of General Social Survey

Id: respondent ID number

Age: age of respondent

Sex: gender of the respondent (1=male, 2=female)

Martial3: respondent’s marital status (1=married, 2=divorced/separated/widowed, 3=never married)

Born: Was the respondent born in this country? (1=yes, 2=no)

Educ: Highest year of school completed

Race4: respondent’s race (1=White, 2=Black, 3=Asian, 4=Other)

Income3: respondent’s family income (1=<$25000, 2=$25000--$50000, 3=>$50000)

Class: respondent self-perceived class (1=lower class, 2=working class, 3=middle class, 4=upper class)

Sz_place: size of place where respondent resides (1=city, 2=suburb, 3=rural)

Hrsrelax: number of hours per day respondent has to relax

Attend2: how often does respondent attend church (1=frequently (weekly or more often), 0=infrequently or never)

SRH: self-reported health (1=excellent, 2=good, 3=fair, 4=poor)

Conduct a short data analysis following the questions below. The project should be no longer than 5 pages (including Table 3). It must be done independently. You can email the instructor and TAs if you wish to clarify any questions about the project, but to what extent an answer can be given will depend on the instructor’s discretion.

Part 0: Data description

Missing data: This dataset is a randomly drawn subset of the actual General Social Survey. As with any other survey, respondents filling out the questionnaire have the options "don’t know" and "refuse to answer". Therefore there are many missing values in this dataset. Before you proceed to the analysis, it is helpful to examine whether the missing data are appropriately coded (as missing) and whether the patterns of missingness reflect selection bias. Follow the instructions below to conduct some basic missing data analysis.

Step 1: Explore the data using the "Analyze > Descriptive Statistics > Explore" command in SPSS. To save time, you can move all the variables (except "year" and "id") into the variable list so that SPSS runs descriptive statistics for all of them at once. Then click on "Options" and check "Exclude cases pairwise" in the missing values box. Based on the "Case Processing Summary" table in the SPSS output, which three variables have the most missing data, and what percentage of each is missing?
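If you want to cross-check the percentages in the SPSS table by hand, the calculation behind them can be sketched as follows. This is only an illustration on made-up records, not the GSS data, and it assumes missing answers are stored as None after import; the function name missing_percentages is hypothetical.

```python
# Toy records, NOT the GSS data. None marks a missing answer (an assumption
# about how missing values are represented once the file is loaded).
rows = [
    {"age": 34, "sex": 1, "srh": 2},
    {"age": None, "sex": 2, "srh": None},
    {"age": 51, "sex": 2, "srh": 1},
    {"age": 27, "sex": 1, "srh": None},
]


def missing_percentages(records, variables):
    """Return {variable: percent of records whose value is None}."""
    n = len(records)
    return {v: 100.0 * sum(r[v] is None for r in records) / n for v in variables}


pct = missing_percentages(rows, ["age", "sex", "srh"])
# Sort from most to least missing, as you would when reading the summary table.
ranked = sorted(pct.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```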

Step 2: Since self-reported health is the outcome variable of interest, let’s examine the pattern of missingness in self-reported health. In particular, we would like to know whether some subset of the population is more likely to refuse to report SRH and hence bias the results of our analysis. To check the pattern of missingness, first create a new variable "SRHmissing". It takes the value 0 if SRH is not missing (i.e., SRH = 1, 2, 3, or 4) and the value 1 otherwise. Then run a frequency table for "SRHmissing" to check whether the number of 1s equals the number of missing values reported in Step 1.
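The recoding rule in Step 2 can be written out as a small sketch. The values below are made up, not taken from the GSS file, and the code 9 is only a hypothetical stand-in for a "don’t know" or "refused" code; check how missing answers are actually coded in your data.

```python
from collections import Counter


def srh_missing(srh):
    """Return 0 if SRH is a valid code (1-4), otherwise 1."""
    return 0 if srh in (1, 2, 3, 4) else 1


# Toy SRH values, NOT the GSS data. None and 9 are hypothetical missing codes.
srh_values = [1, 2, None, 4, 9, 3, None]
flags = [srh_missing(v) for v in srh_values]

# Frequency table of the new indicator.
freq = Counter(flags)
print(freq)
```

The count of 1s in such a frequency table is the number you would compare with the missing count from Step 1.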

Step 3: Now we can test whether the distribution of "SRHmissing" differs across subpopulations defined by the independent variables. There are many independent variables in this dataset; to keep it simple, in this project we only look at "sex" and "age".

First, run a crosstabulation of "SRHmissing" and "sex". What percentage of males have SRH missing? And what percentage of females have SRH missing? Is one gender more likely to have missing SRH than the other one? Test this statement using a Chi-square test.

Second, conduct an independent-samples t test to test whether the mean age is the same for respondents who report SRH and those who don’t. Report the results of the test and conclude whether age is related to SRH missingness.
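For reference, the two test statistics in Step 3 can be sketched by hand as below. This is a minimal illustration on made-up numbers, not the GSS data. It computes the Pearson chi-square without a continuity correction and the equal-variance (pooled) t statistic; SPSS output may also show corrected or unequal-variance versions. The function names are hypothetical.

```python
import math
import statistics


def pearson_chi2_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    table[i][j] is the count in row i, column j. Returns (chi2, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def pooled_t(group_a, group_b):
    """Equal-variance two-sample t statistic. Returns (t, degrees of freedom)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2


# Made-up counts. Rows: male, female; columns: SRH reported, SRH missing.
chi2, p = pearson_chi2_2x2([[40, 10], [30, 20]])

# Made-up ages for respondents who reported SRH and those who did not.
t_stat, df = pooled_t([30, 40, 50], [50, 60, 70])
print(round(chi2, 3), round(p, 4), round(t_stat, 3), df)
```

The p-value for the t statistic is left out here because it requires the t distribution, which the standard library does not provide; SPSS reports it directly.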

Part II: Dichotomized SRH and its risk factors

In this part, you will conduct a series of logistic regressions to study which factors affect self-reported health.

First, fill out the table below. What is the median value for self-reported health in this dataset?

Table 1. Distribution of self-reported health

SRH	N	Valid percentage*	Cumulative percentage*
Excellent
Good
Fair
Poor

*valid percentage and cumulative percentage are calculated based on number of valid (nonmissing) cases.

Now let’s look at which factors affect individuals’ self-reported health status. Since we have only learned to deal with either an interval-valued or a binary dependent variable, we need to dichotomize SRH (i.e., recode it into a binary variable). Create a new variable "SRH2" that takes the value 1 if SRH is good or better and 0 if SRH is fair or worse; code all other values as system missing. (Before you proceed, verify that in this new variable the number of 1s is 1005, the number of 0s is 289, and the number of missing values is 206.)
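The dichotomization rule above can be sketched as follows. The values are made up and are not the GSS data, None plays the role of system missing, and the code 8 is a hypothetical example of a value outside 1-4.

```python
def dichotomize_srh(srh):
    """1 for excellent/good (codes 1-2), 0 for fair/poor (codes 3-4), else None."""
    if srh in (1, 2):
        return 1
    if srh in (3, 4):
        return 0
    return None  # stands in for system missing


toy_srh = [1, 2, 3, 4, None, 8]  # made-up values, NOT the GSS data
srh2 = [dichotomize_srh(v) for v in toy_srh]

# Tally 1s, 0s, and missing, mirroring the verification step above.
counts = {k: srh2.count(k) for k in (1, 0, None)}
print(counts)
```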

In the literature on health disparities, demographic variables (such as age, gender, and marital status) and socioeconomic status (such as education, income, and race) are often found to be associated with health outcomes. Recently, studies have linked stress with health disparities. On the one hand, self-perceived social status is reported to be significantly related to differences in health outcomes (for example, in results based on the famous Whitehall study). On the other hand, behaviors that relieve stress (for example, relaxation and meditation such as praying) are found to improve health status. In this project, you will analyze the effects of these factors on one specific health outcome, the dichotomized SRH, using logistic regression.

You are asked to run three logistic regressions, adding the following blocks of variables sequentially, and to report the results in Table 2 (provided in a separate document due to its size). Notice that in Table 2, categorical variables are presented in their dummy variable representations. This means you will need to recode the categorical independent variables into dummy variables and identify the reference group used for each categorical variable.
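As a reminder of what dummy coding does, here is a minimal sketch. The helper dummy_code is hypothetical, and choosing code 1 (<$25000) as the reference group for Income3 is only an illustrative assumption; use whichever reference groups Table 2 implies.

```python
def dummy_code(value, categories, reference):
    """Return {category: 0/1} indicators for every category except the reference.

    A missing value (None) yields None for every indicator.
    """
    others = [c for c in categories if c != reference]
    if value is None:
        return {c: None for c in others}
    return {c: int(value == c) for c in others}


# Income3 has codes 1, 2, 3; the reference group (1) gets no indicator of its own.
high_income = dummy_code(3, categories=[1, 2, 3], reference=1)
low_income = dummy_code(1, categories=[1, 2, 3], reference=1)
print(high_income, low_income)
```

A variable with k categories produces k-1 dummy variables, and a respondent in the reference group has 0 on all of them.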

Now run a logistic regression for each of the three models using appropriate dummy variables whenever applicable. Please use the "block" option to enter the following blocks of variables sequentially.

Model 1: include the first block (demographic variables): age, sex, marital status.

Model 2: further include the second block (variables that indicate socioeconomic status): race, income groups, years of education and whether the respondent was born outside the US.

Model 3: further include the third block (variables that are linked to stress): church attendance, number of hours spent relaxing and self-perceived social class.
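As a reminder of the form these models take, Model 1 can be written as below. The dummy variable names (female, divsepwid, nevermarried) are hypothetical, and the reference groups they imply (male, married) are an assumption made only for illustration; Models 2 and 3 add further terms in the same way.

```latex
\[
\log\frac{P(\mathrm{SRH2}=1)}{1-P(\mathrm{SRH2}=1)}
  = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{female}
  + \beta_3\,\mathrm{divsepwid} + \beta_4\,\mathrm{nevermarried}
\]
\[
\mathrm{OR}_j = e^{\beta_j}
\]
```

For an interval-valued predictor such as age, the odds ratio is the multiplicative change in the odds of good or better health per one-unit (one-year) increase. For a dummy variable, it compares the odds in that group with the odds in the reference group. SPSS reports these odds ratios in the Exp(B) column.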

Note you will also need to invoke certain options for logistic regression diagnostics in order to complete Table 2.

Based on Table 2, answer the following questions. (Note 1: Interpret the results using the language of odds ratios. Note 2: Distinguish between the effects of dummy variables and the effects of interval-valued independent variables. Note 3: When discussing the significance of the variables, distinguish among highly significant (at the 1% level), significant (at the 5% level), and marginally significant (at the 10% level).)

In model 1, are any of the demographic variables highly significant, significant or marginally significant? Does their significance change in model 2 and model 3?
Are any of the socioeconomic status variables significant in model 2? How do their significance and the size of their effects change in model 3 (after controlling for stress-related variables)?
Are any of the variables related to stress highly significant, significant or marginally significant?
It is often suggested that insignificant variables be excluded to improve the efficiency of the model. Perform a fourth logistic regression that excludes all demographic variables and retains the remaining variables, then answer the following questions. (Note that some of the independent variables retained in the model may still be insignificant, because the purpose of this exercise is not to find the best, smallest model.)
Compared to model 3, are there any changes in the effects and significance of socio-economic variables and stress-related variables in model 4?
Based on model 4, interpret the effects of the variables that have at least marginally significant effects on self-reported health. Please use the language of odds ratios and be as accurate as you can.
In terms of the SRH, is there a significant difference between the lowest class and the highest class after controlling for other variables in model 4? Which self-perceived social class has the best self-reported health?
Overall, are there big differences in self-reported health status across different racial groups after controlling for other variables in model 4?
What percentages of individuals with excellent/good health are correctly predicted in models 1-4, respectively? And what percentages of individuals with fair/poor health are correctly predicted in models 1-4, respectively? In general, do any of the models do a good job of predicting fair/poor self-reported health?
Based on the Hosmer and Lemeshow test, which of models 1-4 are considered to fit the data well?
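The percentages asked for in the classification question come from comparing each respondent's predicted probability with a cutoff. The sketch below illustrates the calculation on made-up outcomes and probabilities, not on output from any model in this project. The 0.5 cutoff and the rule that a probability equal to the cutoff counts as a predicted 1 are assumptions, and the function name is hypothetical.

```python
def classification_percentages(observed, predicted_prob, cutoff=0.5):
    """Percent correctly predicted within each observed group (1 and 0)."""
    correct = {1: 0, 0: 0}
    total = {1: 0, 0: 0}
    for y, p in zip(observed, predicted_prob):
        predicted = 1 if p >= cutoff else 0  # tie-handling is an assumption
        total[y] += 1
        correct[y] += int(predicted == y)
    return {y: 100.0 * correct[y] / total[y] for y in (1, 0)}


# Made-up outcomes (1 = excellent/good, 0 = fair/poor) and fitted probabilities.
y_obs = [1, 1, 1, 1, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3]
result = classification_percentages(y_obs, p_hat)
print(result)
```

When one outcome is much more common than the other, a model can reach a high percentage correct for the common group while predicting the rare group poorly, which is why the two percentages are examined separately.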