Exercise List 5 - Interactive

Hey you :)

This list is longer, so take it step by step:

one small task per code cell
one final output per cell
read the task wording carefully (tail direction matters)
use hints if stuck, then retry

Quick guide: hypothesis testing flow

Step 1: write hypotheses correctly

Null includes equality (=, <=, >=)
Alternative is strict (<, >, !=)
Use population parameter symbols (μ, p), not sample statistics

Step 2: pick the correct tail

Ha: μ < value -> left tail
Ha: μ > value -> right tail
Ha: μ != value -> two-sided

Step 3: test + p-value

Mean, σ known: z test
Mean, σ unknown: t test
Proportion: z test
Difference in means: z or t depending on assumptions

Step 4: decision

If p-value < alpha: reject H0
If p-value >= alpha: do not reject H0
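
The steps above can be sketched in R as two small helpers. This is only an illustration: the function names p_value_from_z and decide, and the example z value, are made up for this sketch and do not come from any exercise.

```r
# Hypothetical helper: turn a z statistic into a p-value for a given tail.
p_value_from_z <- function(z, tail = c("left", "right", "two-sided")) {
  tail <- match.arg(tail)
  switch(tail,
    "left"      = pnorm(z, lower.tail = TRUE),            # Ha: parameter < value
    "right"     = pnorm(z, lower.tail = FALSE),           # Ha: parameter > value
    "two-sided" = 2 * pnorm(abs(z), lower.tail = FALSE)   # Ha: parameter != value
  )
}

# Hypothetical helper: the decision rule from Step 4.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "do not reject H0"
}

z_example <- -1.8   # made-up z statistic, not from any exercise
p_left <- p_value_from_z(z_example, "left")
p_two  <- p_value_from_z(z_example, "two-sided")

decide(p_left)   # left-tailed decision
decide(p_two)    # two-sided decision for the same z
```

With this made-up z, the left-tailed test rejects H0 at alpha = 0.05 while the two-sided test does not, which is why the tail direction in the task wording matters.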

9.1 Introduction to Hypothesis Testing

Exercise 1

Explain why the following hypotheses are not constructed correctly.

Exercise 1a

H0: μ <= 10; Ha: μ >= 10

Choose the best explanation.

Exercise 1b

H0: μ != 500; Ha: μ = 500

Choose the best explanation.

Exercise 1c

H0: p <= 0.40; Ha: p > 0.42

Choose the best explanation.

Exercise 1d

H0: X <= 128; Ha: X > 128

Choose the best explanation.

Exercise 2

Which of the following statements are valid null and alternative hypotheses? If they are invalid hypotheses, explain why.

For each item below, choose whether the hypotheses are valid or invalid.

Exercise 2a

H0: X <= 210; Ha: X > 210

Choose one answer.

Exercise 2b

H0: μ = 120; Ha: μ != 120

Choose one answer.

Exercise 2c

H0: p <= 0.24; Ha: p > 0.24

Choose one answer.

Exercise 2d

H0: μ < 252; Ha: μ > 252

Choose one answer.

Exercise 7

Construct the null and alternative hypotheses for the following claims:

Exercise 7a

“I am going to get the majority of the votes to win this election”

Choose the best hypothesis pair.

Exercise 7b

“I suspect that your 10-inch pizzas are, on average, less than 10 inches in size”

Choose the best hypothesis pair.

Exercise 7c

“I will have to fine the company since its tablets do not contain an average of 250 mg of ibuprofen as advertised”

Choose the best hypothesis pair.

Exercise 11

The screening process for detecting a rare disease is not perfect. Researchers have developed a blood test that is considered fairly reliable. It gives a positive reaction in 98% of the people who have that disease. However, it erroneously gives a positive reaction in 3% of the people who do not have the disease. Consider the null hypothesis “the individual does not have the disease” to answer the following questions.

Exercise 11a

What is the probability of a Type I error?

A Type I error means rejecting H0 even though H0 is true. Here H0 says the person does not have the disease, so a Type I error is a false positive. The test gives a positive result to healthy people 3% of the time.

0.03

Exercise 11b

What is the probability of a Type II error?

A Type II error means failing to reject H0 even though it is false. Here that means the person really has the disease, but the test misses it. Since the true-positive rate is 0.98, the false-negative rate is 1 - 0.98 = 0.02.

0.02
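
Both error probabilities come straight from the two rates given in the exercise text. A minimal sketch, with variable names chosen only for illustration:

```r
# Rates given in the exercise text.
p_pos_given_disease    <- 0.98  # positive reaction when the disease is present
p_pos_given_no_disease <- 0.03  # positive reaction when the disease is absent

# H0: "the individual does not have the disease".
# Type I error: reject H0 (positive test) although H0 is true.
type1 <- p_pos_given_no_disease

# Type II error: fail to reject H0 (negative test) although H0 is false.
type2 <- 1 - p_pos_given_disease

c(type1 = type1, type2 = type2)
```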

Exercise 11c

Choose whether this summary is correct: “Type I: healthy person tests positive. Type II: diseased person tests negative.”

Exercise 11d

What is wrong with the nurse’s analysis, “The blood test result has proved that the individual is free of disease”?

Choose the best explanation.

9.2 Hypothesis test for the population mean when σ is known

Exercise 29

(Hourly_Wage) The data accompanying this exercise shows hourly wages (in $) for 50 employees. An economist wants to test if the average hourly wage is less than $22. Assume that the population standard deviation is $6.

Quick dataset note: in the code cells below, the file Hourly_Wage.xlsx is loaded into df. It contains the columns Wage, EDUC, EXPER, AGE, and Male. For this exercise, you only need the Wage column.

Exercise 29a

State the null and alternative hypotheses.

Choose the best hypothesis pair.

Exercise 29b

Find the value of the test statistic.

For a one-sample z test with known σ, the test statistic is

sample mean minus hypothesized mean
divided by σ / sqrt(n)

So here you compare the sample mean wage with 22 and scale that difference by the known standard error.

(mean(df$Wage) - 22) / (6 / sqrt(nrow(df)))

Exercise 29c

Find the p-value.

The alternative is “less than 22”, so this is a left-tailed test. Once you have the z statistic, the p-value is the probability of getting a value that small or smaller under the null.

z <- (mean(df$Wage) - 22) / (6 / sqrt(nrow(df)))
pnorm(z, lower.tail = TRUE)
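
To try the statistic and the left-tail p-value together without the Excel file, here is a sketch on simulated wages. The vector wage_sim, the seed, and the distribution used to generate it are assumptions for illustration only, so the numbers it prints are not the answers to this exercise.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
wage_sim <- rnorm(50, mean = 21, sd = 6)   # simulated stand-in for df$Wage

mu0   <- 22   # hypothesized mean from the exercise
sigma <- 6    # known population standard deviation from the exercise
n     <- length(wage_sim)

z <- (mean(wage_sim) - mu0) / (sigma / sqrt(n))   # one-sample z statistic
p_value <- pnorm(z, lower.tail = TRUE)            # left tail, because Ha: mu < 22

c(z = z, p_value = p_value)
```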

Exercise 29d

At alpha = 0.05, what is the conclusion? Is the average hourly wage less than $22?

Choose one answer.

9.3 Hypothesis test for the population mean when σ is unknown

Exercise 50

(MPG) The data accompanying this exercise shows miles per gallon (MPG) for 25 “supergreen” cars.

Quick dataset note: in the code cells below, the file MPG.xlsx is loaded into df. It has one column called MPG, which stores the miles per gallon values.

Exercise 50a

State the null and the alternative hypotheses in order to test whether the average MPG differs from 95.

Choose the best hypothesis pair.

Exercise 50b

Run the full one-sample t test for whether the average MPG differs from 95.

Return the full t.test(...) output.

Because the question asks whether the mean MPG differs from 95, this is a one-sample two-sided t test. You supply the sample data and the hypothesized mean, and t.test(...) returns the full output. The default alternative is two-sided, so no alternative argument is needed.

t.test(df$MPG, mu = 95)
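
If you want to see where the numbers in the t.test(...) output come from, the sketch below rebuilds the t statistic and the two-sided p-value by hand. It uses simulated values in place of df$MPG; the vector mpg_sim, the seed, and the generating distribution are assumptions for illustration, not the exercise data.

```r
set.seed(2)                               # arbitrary seed
mpg_sim <- rnorm(25, mean = 96, sd = 4)   # simulated stand-in for df$MPG

res <- t.test(mpg_sim, mu = 95)           # two-sided by default

# Same quantities computed by hand.
n      <- length(mpg_sim)
t_stat <- (mean(mpg_sim) - 95) / (sd(mpg_sim) / sqrt(n))
p_val  <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

c(t_from_test = unname(res$statistic), t_by_hand = t_stat)
c(p_from_test = res$p.value, p_by_hand = p_val)
```

The p-value is stored in res$p.value, which is handy when you only need that one number from the output.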

Exercise 50c

Based on the output from 50b, which statement about the p-value is correct?

Exercise 50d

At alpha = 0.05, can you conclude that the average MPG differs from 95?

Choose one answer.

9.4 Hypothesis test for the population proportion

Exercise 64

An economist is concerned that more than 20% of American households have raided their retirement accounts to endure financial hardships such as unemployment and medical emergencies. He randomly surveys 190 households with retirement accounts and finds that 50 are borrowing against them.

Exercise 64a

Set up the null and alternative hypotheses to test the economist’s concern.

Choose the best hypothesis pair.

Exercise 64b

Calculate the value of the test statistic.

A one-sample proportion z statistic compares the sample proportion p̂ with the null value p0. The denominator uses the null standard error sqrt(p0(1-p0)/n), not the sample standard deviation.

phat <- 50 / 190
(phat - 0.20) / sqrt(0.20 * 0.80 / 190)

Exercise 64c

Calculate the p-value.

The economist’s claim is right-tailed, so once you have the z statistic you take the upper-tail probability. That gives the chance of seeing a result at least this large if the true proportion were really 0.20.

phat <- 50 / 190
z <- (phat - 0.20) / sqrt(0.20 * 0.80 / 190)
pnorm(z, lower.tail = FALSE)
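
As a cross-check, the same test can be run with the built-in prop.test(...). This is a sketch, not the required answer format: with correct = FALSE (no continuity correction), prop.test reports X-squared, which equals z squared, and the same upper-tail p-value. The variable names x, n, and p0 are chosen here for readability.

```r
x  <- 50     # households borrowing against their accounts
n  <- 190    # households surveyed
p0 <- 0.20   # null value of the proportion

phat <- x / n
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)   # uses the null standard error
p_manual <- pnorm(z, lower.tail = FALSE)     # right tail, because Ha: p > 0.20

# Built-in version without continuity correction.
res <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)

c(z_squared = z^2, x_squared = unname(res$statistic))
c(p_manual = p_manual, p_prop_test = res$p.value)
```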

Exercise 64d

Determine if the economist’s concern is justifiable at alpha = 0.05.

Choose one answer.

10.1 Inference concerning the difference between two means

Exercise 17

(Longevity) A consumer advocate researches the length of life between two brands of refrigerators, Brand A and Brand B. He collects data (measured in years) on the longevity of 40 refrigerators for Brand A and repeats the sampling for Brand B. A portion of the data is shown in the accompanying table.

Quick dataset note: in the code cells below, the file Longevity.xlsx is loaded into df. It has two columns: Brand A and Brand B, each containing the observed lifetimes in years.

Exercise 17a

Specify the competing hypotheses to test whether the average length of life differs between the two brands.

Choose the best hypothesis pair.

Exercise 17b

Calculate the value of the test statistic. Assume that σ²A = 4.4 and σ²B = 5.2.

Here you are standardizing the observed difference in sample means.

The top of the formula is:

sample mean of Brand A
minus sample mean of Brand B

The bottom of the formula is the standard error for the difference in two means when the population variances are known:

sqrt(σ²A / nA + σ²B / nB)

That is why the numbers 4.4 and 5.2 appear inside the square root, each divided by its sample size. Both brands have 40 observations, so nrow(df) serves as both nA and nB. One possible R answer is:

(mean(df$`Brand A`) - mean(df$`Brand B`)) / sqrt(4.4 / nrow(df) + 5.2 / nrow(df))
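
The same formula can be wrapped in a small function that keeps the two sample sizes separate, which also works when the groups are not the same length. This is a sketch only: the function name z_two_means, the simulated vectors, the seed, and the mean of 15 used to generate them are all made up for illustration.

```r
# Hypothetical helper: z statistic for a difference in means with known variances.
z_two_means <- function(a, b, var_a, var_b) {
  (mean(a) - mean(b)) / sqrt(var_a / length(a) + var_b / length(b))
}

set.seed(3)                                           # arbitrary seed
brand_a_sim <- rnorm(40, mean = 15, sd = sqrt(4.4))   # simulated stand-in for Brand A
brand_b_sim <- rnorm(40, mean = 15, sd = sqrt(5.2))   # simulated stand-in for Brand B

z <- z_two_means(brand_a_sim, brand_b_sim, var_a = 4.4, var_b = 5.2)
z
```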

Exercise 17c

Calculate the p-value.

Exercise 17d

At the 5% significance level, what is the conclusion?

Choose one answer.

Exercise 20

(Tractor_Times) The production department at Greenside Corporation, a manufacturer of lawn equipment, has devised a new manual assembly method for its lawn tractors. Now it wishes to determine if it is reasonable to conclude that the mean assembly time of the new method is less than the old method. Accordingly, they have randomly sampled assembly time (in minutes) from the 40 tractors using the old method and 32 tractors using the new method. A portion of the data is shown in the accompanying table.

Quick dataset note: in the code cells below, the file Tractor_Times.xlsx is loaded into df. It has two columns: Old for the old assembly method and New for the new assembly method.

Exercise 20a

Set up the hypotheses.

Choose the best hypothesis pair.

Exercise 20b

Run the full unequal-variance t test for whether the new method has a lower mean assembly time than the old method.

Return the full t.test(...) output.

Because the claim is that the new method is faster, the new-method times should be the first group and the old-method times the second group. The test is one-sided ("less") and uses unequal variances, so this is the Welch version of the two-sample t test.

t.test(df$New, df$Old, alternative = "less", var.equal = FALSE)
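
The group order and the alternative have to match each other. The sketch below shows this on simulated assembly times; old_sim, new_sim, the seed, and the generating means and standard deviations are assumptions for illustration and are not the Tractor_Times data.

```r
set.seed(4)                               # arbitrary seed
old_sim <- rnorm(40, mean = 42, sd = 5)   # simulated stand-in for df$Old
new_sim <- rnorm(32, mean = 40, sd = 4)   # simulated stand-in for df$New

# Ha: mean(new) - mean(old) < 0
res_new_first <- t.test(new_sim, old_sim, alternative = "less", var.equal = FALSE)

# Same claim with the groups swapped: mean(old) - mean(new) > 0
res_old_first <- t.test(old_sim, new_sim, alternative = "greater", var.equal = FALSE)

# Swapping the groups but keeping "less" tests the opposite claim.
res_wrong <- t.test(old_sim, new_sim, alternative = "less", var.equal = FALSE)

c(new_first = res_new_first$p.value,
  old_first = res_old_first$p.value,
  wrong     = res_wrong$p.value)
```

The first two p-values are identical, while the third is one minus the first, so a mismatched order silently answers a different question.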

Exercise 20c

Based on the output from 20b, which statement about the p-value is correct?

Exercise 20d

At the 5% significance level, what is the conclusion?

Choose one answer.

Exercise 20e

What if the significance level is 10%?

Choose one answer.

Exercise 21

(Nicknames) Baseball has always been a favorite pastime in America and is rife with statistics and theories. One study found that major league players who have nicknames live an average of 2 1/2 years longer than those without them. You do not believe in this result and decide to collect data on the lifespan of 30 baseball players along with a nickname variable that equals 1 if the player had a nickname and 0 otherwise. A portion of the data is shown in the accompanying table.

Quick dataset note: in the code cells below, the file Nicknames.xlsx is loaded into df. It contains Years for lifespan and Nickname, where 1 means the player had a nickname and 0 means the player did not.

Exercise 21a

Create two subsamples and return the average longevity for players with nicknames (Nickname == 1).

Exercise 21b

Return the average longevity for players without nicknames (Nickname == 0).

Exercise 21c

Specify hypotheses to contradict the original claim.

Choose the best hypothesis pair.

Exercise 21d

Run the full equal-variance two-sample t test for the claim about a 2.5-year difference.

Return the full t.test(...) output.

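The original claim is about a difference of 2.5 years, so that value stays in mu. Because the exercise says to assume equal variances, you include var.equal = TRUE and return the full t.test(...) output. Here with_nick and without_nick are the two subsamples of Years from 21a and 21b (Nickname == 1 and Nickname == 0).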

t.test(with_nick, without_nick, mu = 2.5, var.equal = TRUE)
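
To see what mu = 2.5 does inside the test, the sketch below builds the two subsamples and recomputes the pooled t statistic by hand. The data frame df_sim, the seed, the generating distribution, and the 15/15 split between groups are made up for illustration and are not the Nicknames data.

```r
set.seed(5)                                # arbitrary seed
df_sim <- data.frame(                      # simulated stand-in for df
  Years    = round(rnorm(30, mean = 70, sd = 8)),
  Nickname = rep(c(1, 0), each = 15)
)

with_nick    <- df_sim$Years[df_sim$Nickname == 1]
without_nick <- df_sim$Years[df_sim$Nickname == 0]

res <- t.test(with_nick, without_nick, mu = 2.5, var.equal = TRUE)

# By hand: pooled variance, then (observed difference - 2.5) / standard error.
n1 <- length(with_nick)
n2 <- length(without_nick)
sp2 <- ((n1 - 1) * var(with_nick) + (n2 - 1) * var(without_nick)) / (n1 + n2 - 2)
t_stat <- (mean(with_nick) - mean(without_nick) - 2.5) / sqrt(sp2 * (1 / n1 + 1 / n2))

c(t_from_test = unname(res$statistic), t_by_hand = t_stat, df = unname(res$parameter))
```

The hypothesized difference is subtracted in the numerator, and the degrees of freedom are n1 + n2 - 2 because the variances are pooled.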

Exercise 21e

Based on the output from 21d, which statement about the p-value is correct?

Exercise 21f

What is the conclusion of the test at the 5% level of significance?

Choose one answer.
