Exercise List 4 - Interactive
Sampling distributions and confidence intervals
Hey you :)
How this list works:
- one small task per code cell
- one final output per code cell
- hints + solutions available for each task
- native R errors are shown when code fails
Quick guide: which formula do I need?
Sampling distribution of sample mean (x̄)
- Mean of x̄: μ
- If population σ is known: SE = σ / sqrt(n)
- probability: pnorm(cutoff, mean = μ, sd = σ / sqrt(n))
Sampling distribution of sample proportion (p̂)
- E(p̂) = p
- SE(p̂) = sqrt(p * (1 - p) / n)
- probability: pnorm(cutoff, mean = p, sd = sqrt(p * (1 - p) / n))
Confidence interval for mean
- σ known: x̄ +/- z* * σ / sqrt(n)
- σ unknown: x̄ +/- t* * s / sqrt(n) (or use t.test(...))
Confidence interval for proportion
- p̂ +/- z* * sqrt(p̂ * (1 - p̂) / n)
Required sample size
- Mean (σ known): n = (z* * σ / E)^2
- Proportion (no prior estimate): n = z*^2 * 0.25 / E^2
- Always round up
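The z* and t* in the guide above are critical values. As a minimal sketch, they can be looked up in R with qnorm() and qt(). The names conf, alpha, z_star, and t_star, and the sample size n = 40, are chosen here only for illustration.

```r
# Critical values for a two-sided confidence interval.
conf  <- 0.95          # confidence level
alpha <- 1 - conf      # total tail probability, split over both tails

# z*: used when sigma is known, or for proportions
z_star <- qnorm(1 - alpha / 2)

# t*: used when sigma is unknown; depends on the sample size
n      <- 40           # illustrative sample size (an assumption)
t_star <- qt(1 - alpha / 2, df = n - 1)

c(z_star = z_star, t_star = t_star)
```

For the same confidence level, t* is always a bit larger than z*, so the t interval is slightly wider.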
7.2 The Sampling Distribution of the Sample Mean
Exercise 10
According to a survey, high school girls average 100 text messages daily. Assume that the population standard deviation is 20 text messages. Suppose a random sample of 50 high school girls is taken.
Exercise 10a
What is the probability that the sample mean is more than 105?
This is about the sampling distribution of x̄. First compute the standard error with σ / sqrt(n), then use the upper tail because the question says “more than 105”.
The sample mean x̄ is approximately normal with center 100 and standard error 20 / sqrt(50). Because the question asks for “more than 105”, you need the upper tail to the right of 105.
1 - pnorm(105, mean = 100, sd = 20 / sqrt(50))
Exercise 10b
What is the probability that the sample mean is less than 95?
Again use the sampling distribution of x̄, so the spread is σ / sqrt(n). This time the question says “less than 95”, so use the lower tail.
This is again the sampling distribution of x̄, so the mean stays 100 but the spread becomes 20 / sqrt(50). Because the wording says “less than 95”, you use the lower tail at 95.
pnorm(95, mean = 100, sd = 20 / sqrt(50))
Exercise 10c
What is the probability that the sample mean is between 95 and 105?
Find the probability below 105 and subtract the probability below 95. That gives the area between the two cutoffs.
“Between 95 and 105” means the area between two cutoffs. So first find the probability below 105, then subtract the probability below 95.
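One way to sanity-check the three parts of Exercise 10 is that the lower tail, the middle area, and the upper tail together must cover the whole distribution. The sketch below stores the standard error once and confirms that the three pieces add up to 1. The variable names are chosen for illustration.

```r
# Standard error of the sample mean: sigma / sqrt(n)
se <- 20 / sqrt(50)

lower  <- pnorm(95, mean = 100, sd = se)                 # P(xbar < 95)
upper  <- 1 - pnorm(105, mean = 100, sd = se)            # P(xbar > 105)
middle <- pnorm(105, mean = 100, sd = se) -
  pnorm(95, mean = 100, sd = se)                         # P(95 < xbar < 105)

c(lower = lower, middle = middle, upper = upper)
lower + middle + upper                                   # total area
```

Because 95 and 105 are equally far from the mean of 100, the two tails are equal by symmetry.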
pnorm(105, mean = 100, sd = 20 / sqrt(50)) -
pnorm(95, mean = 100, sd = 20 / sqrt(50))
Exercise 20
Suppose that IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.
Exercise 20a
What is the probability that a randomly selected person will have an IQ score of less than 90?
This is about one randomly selected person, not a sample mean. So use the original normal distribution with μ = 100 and σ = 16, then take the lower tail at 90.
This question is about one randomly selected person, not about an average. So you stay with the original normal distribution with mean 100 and standard deviation 16, and take the lower tail at 90.
pnorm(90, mean = 100, sd = 16)
Exercise 20b
What is the probability that the average IQ score of four randomly selected people is less than 90?
Now the question is about the average of 4 people, so switch to the sampling distribution of x̄. Keep the same mean, but replace the spread with σ / sqrt(n).
Now the question is about the average of 4 people, so the center stays 100 but the spread becomes smaller: 16 / sqrt(4). Then you again take the lower tail at 90.
pnorm(90, mean = 100, sd = 16 / sqrt(4))
Exercise 20c
If four people are randomly selected, what is the probability that all of them have an IQ score of less than 90?
First find the probability that one person is below 90. Since the 4 people are independent, combine that single-person probability four times.
First find the probability that one person scores below 90. The 4 people are independent, so the probability that all four score below 90 is that single-person probability raised to the fourth power.
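A small simulation can make the difference between 20b (the average of four is below 90) and 20c (all four are below 90) concrete. This is only an illustrative check, not part of the original task. The seed, the number of repetitions, and all variable names are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
reps <- 100000                   # number of simulated groups of four

# Each row is one group of four IQ scores from N(100, 16)
iq <- matrix(rnorm(4 * reps, mean = 100, sd = 16), ncol = 4)

# 20b: share of groups whose average is below 90
sim_mean_below <- mean(rowMeans(iq) < 90)

# 20c: share of groups where even the highest score is below 90
sim_all_below <- mean(pmax(iq[, 1], iq[, 2], iq[, 3], iq[, 4]) < 90)

# Exact values from the normal distribution for comparison
exact_mean_below <- pnorm(90, mean = 100, sd = 16 / sqrt(4))
exact_all_below  <- pnorm(90, mean = 100, sd = 16)^4

c(simulated = sim_mean_below, exact = exact_mean_below)
c(simulated = sim_all_below, exact = exact_all_below)
```

The event in 20c is much stricter than the event in 20b, so its probability is far smaller.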
p <- pnorm(90, mean = 100, sd = 16)
p^4
7.3 The Sampling Distribution of the Sample Proportion
Exercise 25
A recent survey found that 82% of college graduates believe that their degree was a good investment (cnbc.com, February 27, 2020). Suppose a random sample of 100 college graduates is taken.
Exercise 25a-1
What is the expected value for the sampling distribution of the sample proportion?
For the sampling distribution of p̂, the center is the population proportion p. So you only need the given proportion, written on proportion scale.
For the sampling distribution of p̂, the expected value is simply the population proportion p. So here the center is 0.82, not 82.
0.82
Exercise 25a-2
What is the standard error for the sampling distribution of the sample proportion?
Use the standard error formula for p̂: combine the given population proportion with the sample size n = 100.
The standard error of p̂ uses the proportion formula sqrt(p * (1 - p) / n). Here you plug in p = 0.82 and n = 100.
sqrt(0.82 * (1 - 0.82) / 100)
Exercise 25b
What is the probability that the sample proportion is less than 0.80?
Treat p̂ like an approximately normal distribution. Use mean p = 0.82 and the standard error from part 25a-2, then take the lower tail at 0.80.
You treat p̂ as approximately normal with mean 0.82 and standard error from 25a-2. Since the question asks for “less than 0.80”, you use the lower tail.
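Before treating p̂ as approximately normal, it is common to check a rule of thumb such as n * p >= 5 and n * (1 - p) >= 5. This threshold is an assumption here, since the exercise does not state one, and some textbooks use 10 instead. A minimal sketch of that check:

```r
p <- 0.82   # population proportion from the exercise
n <- 100    # sample size from the exercise

expected_yes <- n * p         # expected number who say "good investment"
expected_no  <- n * (1 - p)   # expected number who do not

# Rule-of-thumb threshold of 5 (an assumption, see the text above)
c(expected_yes = expected_yes, expected_no = expected_no)
expected_yes >= 5 && expected_no >= 5
```

Both expected counts are well above the threshold, which supports using pnorm() for p̂ in this exercise.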
pnorm(0.80, mean = 0.82, sd = sqrt(0.82 * 0.18 / 100))
Exercise 25c
What is the probability that the sample proportion is within +/- 0.02 of the population proportion?
“Within +/- 0.02 of the population proportion” means from 0.82 - 0.02 to 0.82 + 0.02. Then find the probability between those two bounds.
“Within +/- 0.02” means from 0.80 to 0.84. So the probability you want is the area between those two bounds: probability below 0.84 minus probability below 0.80.
pnorm(0.84, mean = 0.82, sd = sqrt(0.82 * 0.18 / 100)) -
pnorm(0.80, mean = 0.82, sd = sqrt(0.82 * 0.18 / 100))
Exercise 28
At an exhibit in the Museum of Science, people are asked to choose between 50 and 100 random draws from a machine. The machine is known to have 60 green balls and 40 red balls. After each draw, the color of the ball is noted, and the ball is put back for the next draw. You win a prize if more than 70% of the draws result in a green ball. Would you choose 50 or 100 draws for the game? Explain.
Choose the better option.
Translate “more than 70%” into a minimum number of green balls for each choice. Then compare the two binomial win probabilities and pick the choice with the larger win chance.
Correct choice: 50 draws.
The cutoff is more than 70% green. That means more than 35 green balls out of 50, or more than 70 green balls out of 100. When you compare the two binomial win probabilities, the 50-draw game gives the larger chance of winning. So the better choice is 50, not 100.
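The comparison described above can be sketched with pbinom(). Because the ball is put back after each draw, the number of green balls is binomial with success probability 0.6. The names p_win_50 and p_win_100 are chosen for illustration.

```r
# P(X > 35) with 50 draws: more than 70% of 50 means at least 36 green
p_win_50 <- pbinom(35, size = 50, prob = 0.6, lower.tail = FALSE)

# P(X > 70) with 100 draws: more than 70% of 100 means at least 71 green
p_win_100 <- pbinom(70, size = 100, prob = 0.6, lower.tail = FALSE)

c(draws_50 = p_win_50, draws_100 = p_win_100)
p_win_50 > p_win_100   # TRUE means 50 draws is the better choice
```

With lower.tail = FALSE, pbinom(q, ...) returns P(X > q), which matches the strict "more than" in the task. The intuition is that the sample proportion varies more with fewer draws, so it is more likely to land far above the true 60%.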
8.1 Confidence interval for the population mean when σ is known
Exercise 15
(Highway_Speeds) A safety office is concerned about speeds on a certain section of the New Jersey Turnpike. The accompanying file contains the speeds of 40 cars on a Saturday afternoon. Assume that the population standard deviation is 5 mph. Construct the 95% confidence interval for the mean speed of all cars on that section of the turnpike. Are the safety officer’s concerns valid if the speed limit is 55 mph? Explain.
Quick dataset note: in the code cells below, the file Highway_Speeds.xlsx is loaded into df. It contains one column called Highway Speeds, which stores the observed car speeds.
Exercise 15a
Return the lower bound of the 95% confidence interval.
Start with the sample mean from the dataset. Then compute the 95% margin of error with known σ = 5 and subtract it for the lower bound.
Because σ is known, this is a z-based confidence interval for the mean. First compute the sample mean, then subtract the margin of error z* × 5 / sqrt(n) to get the lower endpoint.
mean(df[[1]]) - qnorm(0.975) * 5 / sqrt(nrow(df))
Exercise 15b
Return the upper bound of the 95% confidence interval.
Use the same sample mean and the same 95% margin of error as in 15a, but now add the margin of error for the upper bound.
This is the same confidence interval as in 15a. The only difference is that for the upper endpoint you add the margin of error instead of subtracting it.
mean(df[[1]]) + qnorm(0.975) * 5 / sqrt(nrow(df))
Exercise 15c
Are the safety officer’s concerns valid if the speed limit is 55 mph?
Choose one answer.
Check whether 55 lies inside the 95% confidence interval. If 55 is outside the interval, the concern is supported.
Correct choice: Yes, the concern is valid.
You first build the 95% confidence interval for the mean speed. Then you check whether 55 lies inside that interval. Here, 55 is outside the interval, which means a mean of 55 mph is not plausible given the sample. That supports the safety officer’s concern.
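The check described above can be wrapped in a small helper. The function z_interval() and the vector speeds_demo are hypothetical: the demo values are made up for illustration and are not the contents of Highway_Speeds.xlsx. In the exercise, df[[1]] would take the place of speeds_demo.

```r
# z-based confidence interval for a mean when sigma is known
z_interval <- function(x, sigma, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)            # critical value z*
  margin <- z_star * sigma / sqrt(length(x))     # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up speeds, only to show how the helper is called
speeds_demo <- c(62, 58, 65, 61, 59, 66, 63, 60)

ci <- z_interval(speeds_demo, sigma = 5)
ci

# Is the speed limit of 55 mph a plausible value for the mean speed?
limit_inside <- ci[["lower"]] <= 55 && 55 <= ci[["upper"]]
limit_inside
```

If the whole interval lies above 55, the sample suggests that the mean speed exceeds the limit.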
8.2 Confidence interval for the population mean when σ is unknown
Exercise 36
(Economics) An associate dean of a university wishes to compare the means on the standardized final exams in microeconomics and macroeconomics. He has access to a random sample of 40 scores from each of these two courses. A portion of the data is shown in the accompanying table.
Quick dataset note: in the code cells below, the file Economics.xlsx is loaded into df. It has two score columns: Micro for microeconomics and Macro for macroeconomics.
Exercise 36a
Construct the 95% confidence interval lower bound for the mean score in microeconomics.
Because σ is unknown, use a one-sample t interval for the Micro scores. Then return the first element of the confidence interval.
Here σ is unknown, so you use a one-sample t interval. The function t.test(df$Micro) returns the full interval in conf.int, and the first element is the lower endpoint.
t.test(df$Micro)$conf.int[1]
Exercise 36b
Construct the 95% confidence interval upper bound for the mean score in microeconomics.
This is the same one-sample t interval for Micro as in 36a, but now return the second element of the confidence interval.
This is the same one-sample t interval for the Micro scores. This time you want the second value in conf.int, which is the upper endpoint.
t.test(df$Micro)$conf.int[2]
Exercise 36c
Construct the 95% confidence interval lower bound for the mean score in macroeconomics.
Use a one-sample t interval for the Macro scores and return the lower endpoint.
The setup is the same as before, but now for the Macro scores. The first element of conf.int is the lower endpoint of the interval.
t.test(df$Macro)$conf.int[1]
Exercise 36d
Construct the 95% confidence interval upper bound for the mean score in macroeconomics.
Use the same one-sample t interval for Macro as in 36c, but return the upper endpoint.
Again use the one-sample t interval for Macro. The second element of conf.int is the upper endpoint.
t.test(df$Macro)$conf.int[2]
Exercise 36e
Explain why the widths of the two intervals are different.
Choose the statement that fits best.
Both intervals use the same confidence level and the same sample size, so the main remaining difference is the sample variability.
Correct choice: Agree.
The confidence level is the same for both intervals, and the sample sizes are the same as well. So the main thing that can change the width is the sample standard deviation. The group with the larger sample standard deviation has a larger standard error, and that leads to a wider confidence interval.
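To see how the width depends on the sample standard deviation, the sketch below computes the width of a t interval, 2 * t* * s / sqrt(n), for two values of s. The helper t_width() and the values 8 and 12 are hypothetical. With the real data, sd(df$Micro) and sd(df$Macro) would be used instead.

```r
# Width of a two-sided t interval for a mean
t_width <- function(s, n, conf = 0.95) {
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)   # critical value t*
  2 * t_star * s / sqrt(n)                       # upper bound minus lower bound
}

# Same n and confidence level, different (made-up) standard deviations
width_small_s <- t_width(s = 8, n = 40)
width_large_s <- t_width(s = 12, n = 40)

c(s_8 = width_small_s, s_12 = width_large_s)
width_large_s / width_small_s   # equals the ratio of the standard deviations
```

With n and the confidence level fixed, the width is directly proportional to s.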
8.3 Confidence interval for the population proportion
Exercise 54
One in five 18-year-old Americans has not graduated from high school. A mayor of a Northeastern city comments that its residents do not have the same graduation rate as the rest of the country. An analyst from the Department of Education decides to test the mayor’s claim. In particular, she draws a random sample of 80 18-year-olds in the city and finds that 20 of them have not graduated from high school.
Exercise 54a
Compute the point estimate for the proportion of 18-year-olds who have not graduated from high school in this city.
The point estimate for a population proportion is sample successes divided by sample size. Here, use the numbers given in the task.
The point estimate of a population proportion is always successes divided by sample size. Here that is 20 / 80.
20 / 80
Exercise 54b
Use this point estimate to derive the 95% confidence interval lower bound for the population proportion.
First compute p̂ from 54a. Then use the 95% confidence interval formula for a proportion and subtract the margin of error.
Start with p̂ = 20 / 80. Then use the confidence-interval formula for a proportion and subtract the margin of error to get the lower endpoint.
phat <- 20 / 80
phat - qnorm(0.975) * sqrt(phat * (1 - phat) / 80)
Exercise 54c
Use this point estimate to derive the 95% confidence interval upper bound for the population proportion.
Use the same p̂ and the same 95% margin of error as in 54b, but now add the margin of error.
This is the same interval as in 54b. For the upper endpoint, add the margin of error instead of subtracting it.
phat <- 20 / 80
phat + qnorm(0.975) * sqrt(phat * (1 - phat) / 80)
Exercise 54d
Can the mayor’s comment be justified at 95% confidence?
Choose one answer.
Check whether 0.20 lies inside the confidence interval. If it does, the claim that the city is different from the country is not justified at this confidence level.
Correct choice: No, the comment is not justified.
The mayor is claiming that the city is different from the national proportion 0.20. To support that claim, 0.20 would need to fall outside the 95% confidence interval. Here it lies inside the interval, so the data do not give enough evidence that the city is different at the 95% level.
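The reasoning above can be checked directly by building both endpoints and testing whether 0.20 falls between them. This is a minimal sketch using the numbers from the task. The names margin, ci, and national_inside are chosen for illustration.

```r
phat <- 20 / 80   # point estimate from 54a
n    <- 80        # sample size

# 95% margin of error for a proportion
margin <- qnorm(0.975) * sqrt(phat * (1 - phat) / n)

ci <- c(lower = phat - margin, upper = phat + margin)
ci

# Does the national proportion 0.20 lie inside the interval?
national_inside <- ci[["lower"]] <= 0.20 && 0.20 <= ci[["upper"]]
national_inside
```

A result of TRUE means 0.20 is a plausible value for the city, so the sample does not support the mayor’s comment at the 95% level.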
8.4 Selecting the required sample size
Exercise 64
An analyst would like to construct 95% confidence intervals for the mean stock returns in two industries. Industry A is a high-risk industry with a known population standard deviation of 20.6%, whereas Industry B is a low-risk industry with a known population standard deviation of 12.8%.
Exercise 64a
What is the minimum sample size required by the analyst if she wants to restrict the margin of error to 4% for Industry A?
Use the required sample size formula for a mean with known σ: n = (z* × σ / E)^2. Here the margin of error is E = 4, on the same percent scale as σ = 20.6, and you round up at the end.
Use the sample-size formula for a mean with known σ: n = (z* × σ / E)^2. Then round up, because sample size must be a whole number and you need at least that many observations.
ceiling((qnorm(0.975) * 20.6 / 4)^2)
Exercise 64b
What is the minimum sample size required by the analyst if she wants to restrict the margin of error to 4% for Industry B?
This is the same sample size formula as in 64a, but now use the σ for Industry B instead of Industry A. Round up at the end again.
This is the same formula as in 64a, but now you plug in the standard deviation for Industry B instead of Industry A. You still round up at the end.
ceiling((qnorm(0.975) * 12.8 / 4)^2)
Exercise 64c
Why do the results differ if they use the same margin of error?
Choose the statement that fits best.
Look at the sample-size formula from 64a and 64b. The only thing that changes is σ, so ask what a larger σ does to the required n.
Correct choice: Agree.
In the sample-size formula for a confidence interval with known σ, the required n increases when σ increases. Intuitively, more spread in the population means you need more observations to get the same precision. That is why the two industries need different sample sizes even though the margin of error is the same.
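The effect of σ on the required n can be made explicit with a small helper. The function n_required() is hypothetical and simply wraps the formula n = (z* × σ / E)^2 with rounding up.

```r
# Minimum sample size for a mean with known sigma
n_required <- function(sigma, E, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)   # critical value z*
  ceiling((z_star * sigma / E)^2)       # always round up
}

n_a <- n_required(sigma = 20.6, E = 4)  # Industry A (high risk)
n_b <- n_required(sigma = 12.8, E = 4)  # Industry B (low risk)

c(industry_A = n_a, industry_B = n_b)

# Before rounding, the two sample sizes differ by the factor (sigma_A / sigma_B)^2
(20.6 / 12.8)^2
```

Because σ enters the formula squared, a standard deviation that is about 1.6 times larger needs roughly 2.6 times as many observations.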
Exercise 71
A business student is interested in estimating the 99% confidence interval for the proportion of students who bring laptops to campus. He wants a precise estimate and is willing to draw a large sample that will keep the sample proportion within five percentage points of the population proportion. What is the minimum sample size required by this student, given that no prior estimate of the population proportion is available?
For sample size planning with no prior proportion, use the conservative choice p = 0.5. Then plug that into the proportion sample size formula with 99% confidence and margin of error 0.05, and round up.
Because there is no prior estimate of the population proportion, you use the conservative choice p = 0.5. That gives the largest required sample size, so it is the safe planning choice.
ceiling((qnorm(0.995)^2 * 0.5 * 0.5) / 0.05^2)
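To see why p = 0.5 is the conservative choice, the sketch below evaluates the same sample-size formula for several candidate values of p. The helper n_for_p() and the grid of p values are hypothetical and used only for illustration.

```r
# Minimum sample size for a proportion, given a planning value p
n_for_p <- function(p, E = 0.05, conf = 0.99) {
  z_star <- qnorm(1 - (1 - conf) / 2)     # critical value z*
  ceiling(z_star^2 * p * (1 - p) / E^2)   # always round up
}

# Illustrative planning values for p
p_grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)

sizes <- sapply(p_grid, n_for_p)
names(sizes) <- p_grid
sizes
```

The product p * (1 - p) is largest at p = 0.5, so that choice yields the largest n and guarantees the desired margin of error whatever the true proportion turns out to be.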