Problem 1
\(Z\) is a random variable with a standard normal distribution, i.e. \(Z \sim N(0,1)\). This notation means its mean is \(\mu=0\) and its standard deviation is \(\sigma=1\).
- Using built-in normal CDF functions, calculate the following probabilities in R or Stata: \(\operatorname{Pr}(Z \leq 0)\); \(\operatorname{Pr}(-1.96 \leq Z \leq 1.96)\); and \(\operatorname{Pr}(-\sigma \leq Z \leq \sigma)\).
- Now use simulation to calculate the same probabilities in R or Stata: \(\operatorname{Pr}(Z \leq 0)\); \(\operatorname{Pr}(-1.96 \leq Z \leq 1.96)\); and \(\operatorname{Pr}(-\sigma \leq Z \leq \sigma)\). You should do this by generating a large number of realizations from the appropriate distribution and then calculating the approximate probabilities from those realizations.
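One possible way to set up both parts in R, offered as a sketch rather than a model answer: `pnorm` is the standard normal CDF and `rnorm` draws realizations. The seed and the number of draws (`n_draws`) are arbitrary choices that the problem does not specify.

```r
# Exact probabilities from the standard normal CDF.
p_le_0  <- pnorm(0)                      # Pr(Z <= 0)
p_196   <- pnorm(1.96) - pnorm(-1.96)    # Pr(-1.96 <= Z <= 1.96)
p_sigma <- pnorm(1) - pnorm(-1)          # Pr(-sigma <= Z <= sigma), sigma = 1

# Simulated approximations: the share of draws that fall in each region.
set.seed(1)          # arbitrary seed (assumption), for reproducibility only
n_draws <- 1e6       # arbitrary "large number" of realizations (assumption)
z <- rnorm(n_draws)

sim_le_0  <- mean(z <= 0)
sim_196   <- mean(z >= -1.96 & z <= 1.96)
sim_sigma <- mean(abs(z) <= 1)

print(round(c(p_le_0, p_196, p_sigma), 4))
print(round(c(sim_le_0, sim_196, sim_sigma), 4))
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean(z <= 0)` estimates the probability directly.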
Problem 2
\(Z\) is a random variable with a Poisson distribution with rate parameter \(\lambda=1\), i.e. \(Z \sim \operatorname{Pois}(1)\) (take a look at Wikipedia to see what this distribution looks like if you're not familiar with it). Use the rpois function to take one sample of 100,000 observations from \(Z\). Plot a histogram of these observations. Because \(Z\) is discrete, this histogram approximates the probability mass function (PMF) of \(Z\).
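A minimal sketch of the histogram step in R follows. The seed is an arbitrary assumption, and the overlay of exact probabilities from `dpois` is an optional extra that the problem does not ask for. Plotting is wrapped in `if (interactive())` so the script also runs in a non-interactive session.

```r
set.seed(1)                      # arbitrary seed (assumption)
z <- rpois(100000, lambda = 1)   # one sample of 100,000 observations

# Z is discrete, so centre one bar on each integer value.
breaks <- seq(-0.5, max(z) + 0.5, by = 1)
if (interactive()) {
  hist(z, breaks = breaks, freq = FALSE,
       main = "100,000 draws from Pois(1)", xlab = "z")
  # Overlay the exact Poisson probabilities for comparison.
  points(0:max(z), dpois(0:max(z), lambda = 1), pch = 19)
}

# Empirical proportions next to the exact probabilities from dpois().
emp <- table(factor(z, levels = 0:max(z))) / length(z)
print(round(rbind(empirical = as.numeric(emp),
                  exact = dpois(0:max(z), lambda = 1)), 4))
```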
A) Recall that the Central Limit Theorem states that as the sample size increases, the sampling distribution of the sample mean approximates a Normal distribution with \(\mu=\mu_{Z}\) and \(\sigma=\sqrt{\sigma_{Z}^{2} / n}\). Given that a Poisson random variable with rate parameter \(\lambda=1\) has a mean of 1 and a variance of 1, what distribution does the sample mean, \(\bar{Z}\), approximate for a sample size of \(n\) (where \(n\) is large)?
i. Again using rpois, take 10,000 samples of size 2 from \(Z\), record the mean of each sample, and plot the sampling distribution of these sample means (use a density plot, not a histogram). On the same plot, show the PDF of the distribution from your answer to Part A.
ii. Repeat step i for 10,000 samples of size 5, 10, 20 and 50. Each time, plot the sampling distribution of the sample means and the corresponding distribution from Part A (given by the Central Limit Theorem).
iii. In 1 to 2 sentences, comment on how the sampling distributions of the sample means change as the sample size increases (while holding the number of samples constant). Does this conform to what we expect given the Central Limit Theorem?
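The steps for varying the sample size could be sketched in R as below. Substituting mean 1 and variance 1 into the Central Limit Theorem statement gives a Normal reference curve with mean 1 and standard deviation \(\sqrt{1/n}\). The helper name `sample_means` and the seed are assumptions made for this illustration.

```r
set.seed(1)            # arbitrary seed (assumption)
n_samples <- 10000
sizes <- c(2, 5, 10, 20, 50)

# Draw n_samples samples of size n from Pois(lambda) and return their means.
# sample_means() is a helper name invented for this sketch.
sample_means <- function(n_samples, n, lambda = 1) {
  draws <- matrix(rpois(n_samples * n, lambda), nrow = n_samples)
  rowMeans(draws)   # one row per sample
}

means_by_size <- lapply(sizes, function(n) sample_means(n_samples, n))
names(means_by_size) <- sizes

if (interactive()) {
  for (n in sizes) {
    m <- means_by_size[[as.character(n)]]
    plot(density(m), main = paste("Sample size n =", n), xlab = "sample mean")
    # CLT approximation: Normal with mean 1 and sd sqrt(1 / n).
    curve(dnorm(x, mean = 1, sd = sqrt(1 / n)), add = TRUE, lty = 2)
  }
}

# Standard deviation of the simulated means next to the CLT value.
print(cbind(n = sizes,
            simulated_sd = sapply(means_by_size, sd),
            clt_sd = sqrt(1 / sizes)))
```

Storing each sample as a matrix row keeps the simulation vectorised, so no explicit loop over the 10,000 samples is needed.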
B) Instead of changing the sample size, we might increase the number of samples we take.
i. Again using rpois, take 5 samples of size 1,000 from \(Z\), record the mean of each sample, and plot the sampling distribution of the sample means. On the same plot, show the corresponding distribution from Part A (given by the Central Limit Theorem).
ii. Repeat step i for 10, 20, 50 and 100 samples (all of size 1,000). Each time, plot the sampling distribution of the sample means, and on the same plot, show the corresponding distribution from Part A (given by the Central Limit Theorem).
iii. In 1 to 2 sentences, comment on how the sampling distributions of the sample means change as the number of samples increases (while holding the sample size constant).
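A comparable sketch for varying the number of samples while holding the sample size at 1,000 is given below. The seed, the object names, and the plotting range (`xlim`) are assumptions chosen for illustration only.

```r
set.seed(1)                          # arbitrary seed (assumption)
n <- 1000                            # size of every sample
n_samples_grid <- c(5, 10, 20, 50, 100)

means_by_count <- lapply(n_samples_grid, function(k) {
  # replicate() repeats the draw k times and collects the k sample means.
  replicate(k, mean(rpois(n, lambda = 1)))
})
names(means_by_count) <- n_samples_grid

if (interactive()) {
  clt_sd <- sqrt(1 / n)              # CLT standard deviation for n = 1,000
  for (k in n_samples_grid) {
    m <- means_by_count[[as.character(k)]]
    plot(density(m), main = paste(k, "samples of size", n),
         xlab = "sample mean", xlim = 1 + c(-4, 4) * clt_sd)
    curve(dnorm(x, mean = 1, sd = clt_sd), add = TRUE, lty = 2)
  }
}

# Number of sample means collected for each setting.
print(sapply(means_by_count, length))
```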
Problem 3
You are a policy researcher trying to unpack what happened in the recent U.S. foreclosure crisis. You're particularly interested in why certain people got subprime loans, which are directly linked to a higher risk of foreclosure. You have narrowed your research to the Cape Coral-Fort Myers (Florida) area, the area of the United States most devastated by the foreclosure crisis.
To begin this problem, first download subprime.csv and load it into R or Stata. These are data collected by the U.S. government on all home lending transactions in Cape Coral and Fort Myers. They contain information on each loan applicant, including whether that applicant received a subprime loan (high.rate) and the amount of the loan (loan.amount). They also contain basic demographic information such as race, gender, and income. For the remainder of the problem, we will treat this dataset as the entire, true population.
Problem 3 for Gov 2000/E-2000:
A) Compute the population (i.e., true) mean and variance of high.rate (note that the mean in this case is the proportion of people awarded subprime loans in Cape Coral-Fort Myers).
B) Now suppose that you, a researcher new to this area, are unaware that the federal government has collected this data for the entire population of interest. Instead, you go out and randomly survey 250 people in Cape Coral-Fort Myers who receive mortgage loans. To mimic this in R, set the seed to 02138. Now, take a random sample of 250 observations from the subprime data and find the sample mean and variance of high.rate (the subprime lending rate).
C) Your preliminary results have generated some interest in your research, bringing in enough grant money for you to expand it. Flush with cash, you send 10,000 research assistants to survey 250 loan recipients each and calculate their own means and variances. Simulate this in R by setting the seed to 02138. Repeat Part B 10,000 times and save the sample mean and variance of high.rate in each iteration. Provide a density plot of the sampling distribution of the average subprime lending rate.
D) One of your colleagues gives you even more money, so you send 10,000 research assistants to survey 1,000 respondents each. Re-do Part C with the following changes: use sample sizes of 1,000 (rather than 250) and do the simulation 10,000 times. As before, calculate the mean and variance of each sample and create a plot showing the sampling distribution of \(\bar{X}\).
E) Comment on the differences between Parts C and D. What effect does sample size have on the sampling distribution of \(\bar{X}\)?
F) Recall that you collected the sample variance of subprime loan rates in each sample of Part C (the variance of high.rate). Create a plot showing the sampling distribution of the sample variances. Then calculate the variance of the sampling distribution of the sample means from Part C. Both are variances; explain what each represents and how they differ from each other.
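The repeated-survey simulation might be organised in R roughly as follows. Since subprime.csv is not reproduced here, the sketch uses a synthetic 0/1 vector as a stand-in population; its size and subprime rate are invented, and the helper name `simulate_surveys` is hypothetical. Dividing by \(N\) for the population variance is one reasonable reading of "true" variance, not something the problem states.

```r
# Stand-in population (assumption). The real analysis would instead use
#   subprime <- read.csv("subprime.csv"); population <- subprime$high.rate
set.seed(1)
population <- rbinom(20000, size = 1, prob = 0.1)

pop_mean <- mean(population)
# Divide by N because the data are treated as the whole population;
# var() would divide by N - 1.
pop_var <- mean((population - pop_mean)^2)

# Draw n_sims samples of size n; return each sample's mean and variance.
# simulate_surveys() is a helper name invented for this sketch.
simulate_surveys <- function(x, n, n_sims = 10000) {
  out <- replicate(n_sims, {
    s <- sample(x, n)
    c(mean = mean(s), var = var(s))
  })
  as.data.frame(t(out))   # one row per simulated survey
}

set.seed(02138)   # R reads this numeric literal as 2138
sims_250 <- simulate_surveys(population, 250)
set.seed(02138)
sims_1000 <- simulate_surveys(population, 1000)

if (interactive()) {
  plot(density(sims_1000$mean), xlab = "sample mean of high.rate",
       main = "Sampling distribution of the sample mean")
  lines(density(sims_250$mean), lty = 2)
  legend("topright", legend = c("n = 1,000", "n = 250"), lty = c(1, 2))
}

# Variance of the sample means (spread of the estimator across samples)
# versus the average sample variance (spread of high.rate within a sample).
print(c(var_of_means_250  = var(sims_250$mean),
        mean_of_vars_250  = mean(sims_250$var),
        var_of_means_1000 = var(sims_1000$mean)))
```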
Problem 3 for Gov 2000e/1000:
A) Compute the population (i.e., true) mean and variance of high.rate (note that the mean in this case is the proportion of people awarded subprime loans in Cape Coral-Fort Myers).
B) Now suppose that you, a researcher new to this area, are unaware that the federal government has collected this data for the entire population of interest. Instead, you go out and randomly survey 250 people in Cape Coral-Fort Myers who receive mortgage loans. To mimic this in Stata, set the seed to 02138. Now, take a random sample of 250 observations from the subprime data and find the sample mean and variance of high.rate (the subprime lending rate).
C) Get the intuition of this situation by repeating Part B 20 times (making sure to draw a different sample of 250 each time). Each time, record the sample mean and variance of high.rate. Enter
Problem 4
For this question, consider the variable loan.amount in the subprime data. You would like to calculate the standard deviation of the loan amounts because you are interested in exploring inequality in the sizes of loans that are given to various minority groups. The usual estimator for the population standard deviation is \(S=\left[\frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}\right]^{\frac{1}{2}}\). However, one of your co-authors, based on a cursory reading of a few pages in Wikipedia, proposes an alternative estimator: \(S_{\mathrm{alt}}=\left[\frac{1}{n} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}\right]^{\frac{1}{2}}\). The co-author argues that \(S_{\mathrm{alt}}\) in some ways performs better than \(S\), although he is vague about the details.
To test his contention, you decide to run a simulation study to compare the performance of the two proposed estimators on the subprime data. As in Problem 3, the full subprime dataset is your complete population of interest.
- Write a function in R which implements your co-author's proposed estimator, \(S_{\mathrm{alt}}\). Name your function sd.alt, and then check that sd.alt(1:100) returns the answer 28.86607. Note that the \(S\) estimator is already implemented in R as the sd() function.
- Set the seed to 111, and then draw 5000 random samples of size \(n=15\) from the population. For each sample, calculate and store each of the two estimators, \(S\) and \(S_{\mathrm{alt}}\). Then, for the two simulated sampling distributions, report the following three quantities in a table with two columns, one for each estimator:
- the average estimate of the population standard deviation
- the difference between the average estimate of the population standard deviation and the true population standard deviation
- the variance of your estimates
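A rough sketch of how the function and the simulation study could fit together in R is shown below. The log-normal stand-in population is an assumption made so the code runs without subprime.csv, and computing the true standard deviation with the \(1/N\) formula is a choice the problem leaves open.

```r
# Co-author's estimator: divide by n rather than n - 1.
sd.alt <- function(x) {
  sqrt(mean((x - mean(x))^2))
}
print(sd.alt(1:100))   # the problem states this should be 28.86607

# Stand-in population (assumption); replace with subprime$loan.amount.
set.seed(1)
population <- rlnorm(20000, meanlog = 5, sdlog = 0.6)
true_sd <- sd.alt(population)   # treats the data as the whole population

set.seed(111)
est <- replicate(5000, {
  s <- sample(population, 15)
  c(S = sd(s), S_alt = sd.alt(s))
})

# One column per estimator, one row per requested quantity.
results <- rbind(
  average_estimate = rowMeans(est),
  bias             = rowMeans(est) - true_sd,
  variance         = apply(est, 1, var)
)
print(round(results, 3))
```

Because `replicate` returns a matrix with one row per estimator, `rowMeans` and `apply(est, 1, var)` summarise each estimator's simulated sampling distribution in a single call.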
Deliverable: Word Document
