Preliminaries

To begin this homework, let's make sure we have the aliens data set loaded in our session, and that we have the sampling function loaded. Run the code shown below. As in HW1, make sure that it has been copied exactly, with correct line breaks.

Let's also draw a sample of 100, once again making sure to use your student ID number in place of 00000001.

my_sample <- suppressWarnings(make.my.sample(00000001, 100, aliens))

Question 1.

The first thing you'll do is to make a simple table showing the distribution of a categorical variable, using the table command. To make a table showing the distribution of the college variable in your sample of aliens, do this:

college.table <- table(my_sample$college)

college.table

Recall that whenever you make reference to a variable that's within a data frame, you need to include the name of the data frame, then the dollar sign, then the variable name.

Summarize the output (which will be different for your personal sample) in words. Now make similar tables for two other categorical variables in your sample. Again, summarize the output in words.
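As a minimal sketch of the table command, the example below uses a small made-up data frame called toy in place of my_sample. The name toy and its college values are invented for illustration and are not the aliens data. It also shows prop.table, a base R function that turns counts into proportions, which can make the summary in words easier to write.

```r
# Hypothetical stand-in for my_sample; the values are invented for illustration.
toy <- data.frame(college = c("yes", "no", "no", "yes", "no", "no"))

# Count how many rows fall into each category of the college variable.
toy.table <- table(toy$college)
toy.table

# Convert the counts to proportions of the whole sample.
toy.props <- prop.table(toy.table)
toy.props
```

The same pattern should carry over to your own sample by replacing toy with my_sample.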

Question 2.

You can make a bar graph of the results by passing your table as the argument to the barplot function, like this:

barplot(college.table)

Do this, and also do it for the two other variables that you used in Question 1.

Question 3.

You can also make a contingency table showing the joint distribution of two categorical variables, with the same table function that you used in Question 1. You simply have to give it the two variables as separate arguments, separating them with a comma.

Make two separate contingency tables, for two distinct pairs of variables.
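A minimal sketch of a two-variable table, again on made-up data: the data frame toy, the island labels A, B, and C, and all values are assumptions for illustration only. The addmargins function from base R appends row and column totals, which can help when checking the counts.

```r
# Hypothetical stand-in for my_sample; the values are invented for illustration.
toy <- data.frame(
  college = c("yes", "no", "no", "yes", "no", "no"),
  island  = c("A", "A", "B", "B", "C", "C")
)

# Two variables, separated by a comma: rows are college, columns are island.
toy.ctable <- table(toy$college, toy$island)
toy.ctable

# Add row and column totals to the contingency table.
addmargins(toy.ctable)
```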

Question 4.

You can also make a bar graph that shows the joint distribution of two variables by using a contingency table (like the one you made in Question 3) as the argument to the barplot function.

Do this for the two contingency tables you made in Question 3. Does this look weird? Try including an extra argument in the barplot function: beside = T. You might also want to include yet another argument: legend = T. Explain what these two arguments do. Do you like these graphs better with, or without, these arguments?
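The sketch below shows only the syntax for the two calls, using a made-up contingency table (the values and the island labels A, B, and C are invented). It writes TRUE rather than T: T is a shorthand that can be reassigned, so the full word is the safer habit. Run both calls and compare the pictures yourself.

```r
# Hypothetical contingency table built from invented values.
toy.ctable <- table(
  c("yes", "no", "no", "yes", "no", "no"),
  c("A", "A", "B", "B", "C", "C")
)

# Default call: the contingency table is the only argument.
mids.default <- barplot(toy.ctable)

# Same table, with the two extra arguments switched on.
mids.beside <- barplot(toy.ctable, beside = TRUE, legend = TRUE)
```

barplot also returns the horizontal positions of the bars, which is why the results are stored here; they can be useful later for adding text to a graph.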

Question 5.

Based on the graphs you made in Question 4, do you conclude anything about how the categorical variables that you're looking at might be related to each other? If so, what do you conclude?

General note: You might find that these bar graphs do not look quite as good as you would like. For example, they might lack x- and y-axis labels or a main title, or the legend might be in the wrong place. Your book has lots of information about how to make your graphs look better with some extra arguments to graphing functions such as barplot. You can also look these things up in the help menu, or online. I encourage you to explore. You might find that making nice-looking graphs helps you to understand your data better!
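As one possible starting point, the sketch below uses the standard main, xlab, ylab, and col arguments, plus args.legend to move the legend. The data and all label text are invented for illustration.

```r
# Hypothetical one-variable and two-variable tables from invented values.
college <- c("yes", "no", "no", "yes", "no", "no")
island  <- c("A", "A", "B", "B", "C", "C")

# Title, axis labels, and a bar color for a single-variable bar graph.
mids.one <- barplot(table(college),
                    main = "College attendance (toy data)",
                    xlab = "Attended college",
                    ylab = "Number of aliens",
                    col  = "steelblue")

# args.legend takes a list of settings for the legend; x sets its position.
mids.two <- barplot(table(college, island),
                    beside = TRUE,
                    legend = TRUE,
                    args.legend = list(x = "topleft", title = "College"),
                    xlab = "Island",
                    ylab = "Number of aliens")
```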

Okay, now you know how to look at the distribution of a categorical variable, or the joint distribution of two categorical variables. Let's move on to quantitative variables.

Here are some basic functions for describing the distribution of a quantitative variable. In each case you just provide the variable name as an argument to the function (i.e., the thing inside the parentheses). Remember that you always have to use the dataframe$variable format when you're referring to a variable that's part of a data frame.

Here are the basic functions to make a histogram and a boxplot.

Again, you just provide the name of your variable as the one argument to the function.

hist()

boxplot()
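A minimal sketch of numerical and graphical summaries on made-up data. The functions mean, median, sd, and summary are standard base R choices assumed here for illustration, not necessarily the exact set intended above, and the income values are invented.

```r
# Hypothetical stand-in for my_sample; the values are invented for illustration.
toy <- data.frame(income = c(12, 15, 18, 20, 22, 25, 30, 95))

# Numerical summaries: each function takes the variable as its one argument.
toy.mean   <- mean(toy$income)
toy.median <- median(toy$income)
toy.sd     <- sd(toy$income)
summary(toy$income)

# Graphical summaries of the same variable.
hist(toy$income)
boxplot(toy$income)
```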

Question 6.

Use all of these functions to give both numerical and graphical summaries of the anxiety, income, and intelligence variables. Compare the median and the mean, and explain the patterns you find.

Question 7.

Are there outliers in the distributions you've looked at, based on the 1.5 × IQR rule? If so, do you think these points should be excluded from the data set for the purpose of descriptive statistics, or should they be kept? Justify your conclusion.
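One way to apply the 1.5 × IQR rule by hand is sketched below on invented values. It assumes the default quartile definition used by quantile and IQR. The boxplot function uses a slightly different definition (hinges), so its whisker limits can differ a little from these fences; boxplot.stats reports the points that boxplot itself would flag.

```r
# Hypothetical income values, invented for illustration.
income <- c(12, 15, 18, 20, 22, 25, 30, 95)

# First and third quartiles, and the interquartile range.
q   <- quantile(income, c(0.25, 0.75))
iqr <- IQR(income)

# Fences: points beyond 1.5 * IQR from the quartiles count as outliers.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values falling outside the fences.
outliers <- income[income < lower | income > upper]
outliers

# The points that boxplot() would draw individually.
boxplot.stats(income)$out
```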

Question 8.

Sometimes the histogram that R gives you doesn't have suitable bin sizes, and you need to tell the hist function how many bins to make. One way to do this is to use the breaks argument (for example, breaks = 50).

Try different numbers of bins for the histogram of the income variable. What do you think is the best value for the breaks argument in this case? Why?
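A minimal sketch of the breaks argument on simulated values (the income numbers come from rexp with an arbitrary rate and are not the aliens data). Note that hist treats a single number as a suggestion and picks round breakpoints near it, so the number of bins you get may not match the number you asked for. Storing the result lets you check how many bins were actually drawn.

```r
# Hypothetical right-skewed values, simulated for illustration only.
set.seed(1)
income <- rexp(100, rate = 1 / 30000)

# Ask for roughly 5 bins, then roughly 50.
h.few  <- hist(income, breaks = 5)
h.many <- hist(income, breaks = 50)

# Number of bins actually used in each histogram.
length(h.few$counts)
length(h.many$counts)
```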

Question 9.

The boxplot function also allows you to make separate boxplots, based on the values of a categorical variable, like this:

boxplot(my_sample$anxiety ~ my_sample$island, ylim = c(30, 70))

Note that the little squiggly line can be read as 'depends on' - in this case, you're making a boxplot of anxiety, depending on island.

Run this command. Why does it include the ylim argument? What happens if you leave this out? What do these boxplots tell you about the distribution of anxiety for aliens from each of the three islands?
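The same kind of call can be sketched on made-up data as follows. The data frame toy, the island labels A, B, and C, and the anxiety values are all invented. This version uses the data argument so that the data frame name is written once instead of before each variable; both forms should draw the same boxes.

```r
# Hypothetical stand-in for my_sample; the values are invented for illustration.
toy <- data.frame(
  anxiety = c(45, 50, 55, 48, 52, 60, 40, 44, 47),
  island  = rep(c("A", "B", "C"), each = 3)
)

# anxiety depends on island: one box per island.
b <- boxplot(anxiety ~ island, data = toy, ylim = c(30, 70))

# The stored result holds the group names and the five-number summaries.
b$names
b$stats[3, ]  # third row is the median of each group
```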

Question 10.

Make two more side-by-side boxplots, in each case exploring the distribution of one of the quantitative variables, depending on the values of one of the categorical variables. Explain what you learn from these plots.

Question 11.

You probably realize, at this point, that the answers you get for these problems depend, to some extent, on the specific sample of 100 aliens that you happened to draw at the beginning. To explore this a bit, please draw a different sample of 100 aliens by adding 1 to your student ID number, and re-running the line of code where you drew your sample. To keep things straight, call your new sample my_sample_2, or something like that.

Redo exactly what you did for Question 10, using the same quantitative and categorical variables, but this time using your new sample. Describe in words how the patterns differ from Question 10, if they do differ.

In this last part of the homework, you're going to try to make distributions that have certain properties. You can make your own quantitative variable like this:

newvar <- c(17, 20, 31)

This is saying to take the values 17, 20, and 31, and make them the values of a variable called newvar. Note that this variable is not part of a data frame. These next problems require you to make a variable in this way. They'll make you think a little bit.
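A short sketch of working with a variable made this way. The names longer, repeated, and stepped are hypothetical, and rep and seq are base R helpers shown only as optional shortcuts for building longer vectors without typing every value.

```r
# The variable from the text, plus two quick checks on it.
newvar <- c(17, 20, 31)
length(newvar)  # how many values it holds
mean(newvar)

# c() can also append values to an existing variable.
longer <- c(newvar, 40)

# rep() repeats values; seq() makes evenly spaced values.
repeated <- rep(c(1, 2), times = 3)
stepped  <- seq(10, 50, by = 10)
```

Because newvar is not part of a data frame, you refer to it by name alone, with no dollar sign.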

Question 12.

Make a new variable, with at least 40 values, that has (a) a mean of about 100, and (b) a fairly uniform distribution. Make a histogram and a boxplot.

Question 13.

Make another new variable, again with at least 40 values and a mean of about 100, that has a fairly symmetrical, unimodal distribution. Again, make a histogram and boxplot.

Question 14.

Now make another one, again with at least 40 values and a mean of about 100, that has a bimodal distribution. Again, make a histogram and boxplot.

Question 15.

Based on 12-14, do you think histograms or boxplots are better for showing the detailed shape of distributions? Why?

Question 16.

Add a single extremely high value to one of your variables from problems 12-14. What happens to the mean? What happens to the median?
