Exercise List 6 - Interactive

Paired differences, two proportions, and variance tests


Hey you :)

This list covers paired differences, two proportions, one variance, and two variances. Take it one small step at a time:

  • one code cell or one choice at a time
  • one final result per task
  • read the tail direction carefully
  • if you see a full test output, use it to answer the question below

Packages used on this page: readxl and EnvStats.

Quick guide: which method do I need?

Paired before/after data

  • Same people measured twice -> use a paired t test
  • In R: t.test(df$After, df$Before, paired = TRUE)
  • The confidence interval is about the mean of the differences

Difference between two proportions

  • Start with p̂1 and p̂2
  • If the null difference is 0, use the pooled proportion in the standard error
  • If the null difference is a nonzero value such as 0.10, keep that value in the numerator and use the unpooled standard error

One population variance

  • Test statistic: χ² = (n - 1)s² / σ0²
  • Use pchisq(...) for the p-value
  • Use qchisq(...) for a confidence interval for the variance

Ratio of two variances

  • Test statistic: F = s1² / s2²
  • In R, var.test(...) gives the full test output
  • Use qf(...) for a confidence interval for σ1² / σ2²

10.2 Inference concerning mean differences

Exercise 39 (Smoking)

It is fairly common for people to put on weight when they quit smoking. While a small weight gain is normal, excessive weight gain can create new health concerns that erode the benefits of not smoking. The accompanying table shows a portion of the weight data for 50 women before quitting and six months after quitting.

Woman  Before  After
1      140     155
2      144     142
3      138     153
4      145     146
5      118     129
6      150     149

Quick dataset note: in the code cells below, the file Smoking.xlsx is loaded into df. It has the columns Woman, Before, and After.

Exercise 39a

Construct the 95% confidence interval for the mean gain in weight.

Return the full paired t.test(...) output.

This is paired data because the same women are measured before and after. Use After first, Before second, and set paired = TRUE.

Because the same women appear in both columns, this is a paired t test. That output gives you the estimated mean difference and the 95% confidence interval for that mean difference.

t.test(df$After, df$Before, paired = TRUE)

Exercise 39b

Which interval matches the 95% confidence interval for the mean gain in weight?

Read the two confidence interval numbers directly from the t.test(...) output in 39a.

Correct choice: the first option.

The confidence interval from the paired test is approximately 4.87 to 9.13. Because the whole interval is positive, it suggests an average weight gain after quitting.
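As a cross-check on what t.test(...) reports, the same interval can be rebuilt by hand from the differences. The sketch below is an illustration only: it uses just the six rows printed in the table for Exercise 39, not the full 50-row Smoking.xlsx file, so its numbers will not match 4.87 to 9.13. The names before, after, d, and manual_ci are made up for this example.

```r
# Six rows from the Exercise 39 table (illustration only, not the full data set)
before <- c(140, 144, 138, 145, 118, 150)
after  <- c(155, 142, 153, 146, 129, 149)

# A paired analysis works on the differences After - Before
d <- after - before
n <- length(d)

# Manual 95% interval: mean(d) +/- t critical value * standard error
margin    <- qt(0.975, df = n - 1) * sd(d) / sqrt(n)
manual_ci <- c(mean(d) - margin, mean(d) + margin)

# The paired t test reports the same interval
test_ci <- t.test(after, before, paired = TRUE)$conf.int

manual_ci
test_ci
```

Both printed intervals should agree, which shows that the paired t test is a one-sample t procedure applied to the differences.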

Exercise 39c

Use the confidence interval to decide whether the mean gain in weight differs from 5 pounds.

Check whether 5 lies inside the confidence interval from 39a.

Correct choice: the second option.

Because 5 lies inside the 95% confidence interval, the data do not show that the mean gain is different from 5 pounds at the 5% level.
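The interval check itself can be written in one line. A small sketch, using the rounded endpoints quoted in 39b (the names ci, mu0, and inside are illustrative):

```r
ci  <- c(4.87, 9.13)   # rounded 95% interval quoted in 39b
mu0 <- 5               # hypothesized mean gain in pounds

# TRUE means mu0 is inside the interval, so H0: mean gain = 5 is not rejected
inside <- ci[1] <= mu0 && mu0 <= ci[2]
inside
```

With the full data loaded in df, t.test(df$After, df$Before, paired = TRUE, mu = 5) should lead to the same decision, because a two-sided test at the 5% level and a 95% confidence interval are two views of the same calculation.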

10.3 Inference concerning the difference between two proportions

Exercise 57

A report suggests that business majors spend less time on course work than all other college students. A provost of a university decides to conduct a survey in which students are asked if they study hard, defined as spending at least 20 hours per week on course work. Of 120 business majors included in the survey, 20 said they studied hard, as compared to 48 out of 150 nonbusiness majors. At the 5% significance level, can we conclude that the proportion of business majors who study hard is less than that of nonbusiness majors? Provide the details.

Exercise 57a

Choose the correct hypotheses.

Let p_business be the proportion of business majors who study hard and p_nonbusiness the proportion of nonbusiness majors who study hard. The claim says business majors have the smaller proportion.

Correct choice: the first option.

The claim is that the business-major proportion is lower. That means the alternative should be p_business - p_nonbusiness < 0, and the null keeps the equality case.

Exercise 57b

Calculate the value of the z test statistic.

First compute p̂1 = 20 / 120 and p̂2 = 48 / 150. Because the null difference is 0, use the pooled proportion in the standard error.

A difference-in-proportions z test compares p̂1 - p̂2 with the null value 0. Because the null difference is zero, the standard error uses the pooled proportion from the two samples.

p1_hat <- 20 / 120
p2_hat <- 48 / 150
p_pool <- (20 + 48) / (120 + 150)
(p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / 120 + 1 / 150))

Exercise 57c

Find the p-value.

This is a left-tailed test, so after you get z, use the left tail in pnorm(...).

Because the alternative is left-tailed, the p-value is the left-tail probability for the z statistic. That means you use pnorm(z).

z <- -2.884195
pnorm(z)
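If you want to double-check the hand calculation, base R's prop.test(...) can reproduce it. This is a sketch of a cross-check, not the solution-list method: prop.test reports a chi-square statistic rather than z, so the sign has to be restored by hand. The names x, n, res, z_check, and p_check are illustrative.

```r
# Counts from the exercise: students who study hard, and sample sizes
x <- c(20, 48)     # business, nonbusiness
n <- c(120, 150)

# correct = FALSE switches off the continuity correction so the result
# lines up with the hand-computed pooled z statistic
res <- prop.test(x, n, alternative = "less", correct = FALSE)

# prop.test reports X-squared, which equals z^2 here; restore the sign by hand
z_check <- sign(x[1] / n[1] - x[2] / n[2]) * sqrt(unname(res$statistic))
p_check <- res$p.value

z_check
p_check
```

The value of z_check should match the statistic from 57b, and p_check should match pnorm(z).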

Exercise 57d

At the 5% significance level, what is the correct conclusion?

Compare the p-value from 57c with 0.05.

Correct choice: the first option.

The p-value is much smaller than 0.05, so you reject the null hypothesis. The data support the claim that the proportion of business majors who study hard is lower.

Exercise 58

Many believe that it is not feasible for men and women to be just friends, while others argue that this belief may no longer hold, since gone are the days when men worked, women stayed at home, and the only way they could get together was for romance. In a recent survey, 200 heterosexual college students were asked if it was feasible for male and female students to be just friends. Thirty-two percent of females and 57% of males reported that it was not feasible for men and women to be just friends. Suppose the study consisted of 100 female and 100 male students. At the 5% significance level, can we conclude that there is a greater than 10 percentage point difference between the proportion of male and female students with this view? Provide the details.

Exercise 58a

Choose the correct hypotheses.

Let p_male be the male proportion and p_female the female proportion. The claim is about a difference greater than 0.10.

Correct choice: the first option.

The claim is that the male-minus-female difference is greater than 0.10, so the alternative must be p_male - p_female > 0.10. The null covers the complementary case, p_male - p_female ≤ 0.10.

Exercise 58b

Calculate the value of the z test statistic.

Use p̂_male = 0.57, p̂_female = 0.32, and subtract the claimed difference 0.10 in the numerator.

Because the null value for the difference is 0.10, the numerator is (p̂_male - p̂_female) - 0.10. With a nonzero null difference there is no common proportion to pool, so the standard error uses each sample's own proportion and sample size.

((0.57 - 0.32) - 0.10) / sqrt((0.57 * (1 - 0.57) / 100) + (0.32 * (1 - 0.32) / 100))

Exercise 58c

Find the p-value.

This is a right-tailed test, so use the upper tail after you calculate the z statistic.

Because the alternative is right-tailed, the p-value is the upper-tail probability for the z statistic.

z <- 2.205167
pnorm(z, lower.tail = FALSE)
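The two steps in 58b and 58c can be bundled so that z is not retyped by hand. The helper below is a hypothetical convenience function written for this page, not something from the solution list or from a package. It assumes a right-tailed alternative and the unpooled standard error.

```r
# Hypothetical helper: z test for HA: p1 - p2 > d0 (right-tailed)
two_prop_z_upper <- function(p1_hat, p2_hat, n1, n2, d0) {
  # Unpooled standard error: each sample keeps its own proportion
  se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
  z  <- ((p1_hat - p2_hat) - d0) / se
  list(z = z, p_value = pnorm(z, lower.tail = FALSE))
}

# Males first, females second, null difference 0.10
res <- two_prop_z_upper(0.57, 0.32, 100, 100, d0 = 0.10)
res$z
res$p_value
```

Keeping z inside the function avoids the small rounding error that can creep in when the statistic is copied from one cell to the next.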

Exercise 58d

At the 5% significance level, what is the correct conclusion?

Compare the p-value from 58c with 0.05.

Correct choice: the first option.

The p-value is below 0.05, so you reject the null hypothesis. The data support a male-female difference greater than 10 percentage points.


11.1 Inference concerning the population variance

Exercise 17 (MPG)

The data accompanying this exercise show miles per gallon (mpg) for 25 cars.

Quick dataset note: in the code cells below, the file MPG.xlsx is loaded into df. It has one column called MPG.

Exercise 17a

State the null and the alternative hypotheses in order to test whether the variance differs from 62 mpg².

The phrase “differs from” means a two-sided test, and this question is about the population variance.

Correct choice: the first option.

Because the question asks whether the variance differs from 62, the null uses equality and the alternative is two-sided.

Exercise 17b

Assuming that MPG is normally distributed, calculate the value of the test statistic.

You can do this in two ways. The solution-list function is varTest(...) from EnvStats, and the test statistic is inside that output. You can also compute the same chi-square statistic manually with (n - 1)s² / 62.

For one population variance, the solution list uses varTest(...) from EnvStats. The test statistic is the chi-square value inside that output. The manual formula gives the same result.

# Method 1: use the EnvStats test function
varTest(df$MPG, alternative = "two.sided", sigma.squared = 62, conf.level = 0.99)$statistic

# Method 2: calculate the same statistic directly
x <- df$MPG
(length(x) - 1) * var(x) / 62

Exercise 17c

Find the p-value.

You can do this in two ways. The solution-list function is varTest(...) from EnvStats, and the p-value is inside that output. You can also compute the same p-value manually from the chi-square distribution.

The one-sample variance test output already contains the p-value, and the manual chi-square route gives the same value.

# Method 1: use the EnvStats test function
varTest(df$MPG, alternative = "two.sided", sigma.squared = 62, conf.level = 0.99)$p.value

# Method 2: compute the same p-value manually
x <- df$MPG
stat <- (length(x) - 1) * var(x) / 62
2 * min(pchisq(stat, df = length(x) - 1), 1 - pchisq(stat, df = length(x) - 1))
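Since the tail direction changes how the p-value is computed, it can help to see all three cases side by side. The function below is a hypothetical base-R helper (it does not use EnvStats), and x_sim is simulated data standing in for df$MPG, so its output says nothing about the real MPG file.

```r
# Hypothetical helper: chi-square test for one variance, base R only
chisq_var_test <- function(x, sigma0_sq,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  df   <- length(x) - 1
  stat <- df * var(x) / sigma0_sq
  lower <- pchisq(stat, df = df)            # P(chi-square <= stat)
  p <- switch(alternative,
              less      = lower,
              greater   = 1 - lower,
              two.sided = 2 * min(lower, 1 - lower))
  list(statistic = stat, df = df, p.value = p)
}

set.seed(1)
x_sim <- rnorm(25, mean = 30, sd = 8)   # simulated stand-in, NOT the MPG data
res <- chisq_var_test(x_sim, sigma0_sq = 62)
res$statistic
res$p.value
```

Calling chisq_var_test(df$MPG, 62) on the real data should reproduce the manual Method 2 values from 17b and 17c.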

Exercise 17d

Make a conclusion at α = 0.01.

Compare the p-value from 17c with 0.01.

Correct choice: the second option.

The p-value is about 0.0141, which is larger than 0.01. So you do not reject the null hypothesis at the 1% level.

Exercise 17e1

Calculate the lower bound of the 95% confidence interval for the population variance.

For the lower bound, divide (n - 1)s² by the upper chi-square critical value qchisq(0.975, df = n - 1).

The confidence interval for a variance comes from the chi-square distribution. The lower bound uses the larger chi-square cutoff in the denominator, which makes the lower endpoint smaller.

x <- df$MPG
(length(x) - 1) * var(x) / qchisq(0.975, df = length(x) - 1)

Exercise 17e2

Calculate the upper bound of the 95% confidence interval for the population variance.

For the upper bound, divide (n - 1)s² by the lower chi-square critical value qchisq(0.025, df = n - 1).

The upper endpoint uses the smaller chi-square cutoff in the denominator. That makes the fraction larger and gives the upper bound of the interval.

x <- df$MPG
(length(x) - 1) * var(x) / qchisq(0.025, df = length(x) - 1)
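Both endpoints from 17e1 and 17e2 can be returned together. The sketch below wraps them in a hypothetical helper called var_ci and runs it on simulated data (x_sim is made up and is not the MPG file).

```r
# Hypothetical helper: chi-square confidence interval for a variance
var_ci <- function(x, conf_level = 0.95) {
  alpha <- 1 - conf_level
  df    <- length(x) - 1
  ss    <- df * var(x)                        # (n - 1) * s^2
  c(lower = ss / qchisq(1 - alpha / 2, df),   # larger cutoff -> lower bound
    upper = ss / qchisq(alpha / 2, df))       # smaller cutoff -> upper bound
}

set.seed(1)
x_sim <- rnorm(25, mean = 30, sd = 8)   # simulated stand-in, NOT the MPG data
ci <- var_ci(x_sim)
ci
```

Note that the interval is not symmetric around the sample variance, because the chi-square distribution is skewed to the right.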

11.2 Inference concerning the ratio of two population variances

Exercise 26

Consider the following measures based on independently drawn samples from normally distributed populations:

Sample 1: s1² = 220 and n1 = 20

Sample 2: s2² = 196 and n2 = 15

Exercise 26a1

Construct the 95% interval estimate for the ratio of the population variances.

Return the lower bound.

Start with the sample ratio 220 / 196. For the lower bound, divide that ratio by qf(0.975, 19, 14).

The interval is built around the sample ratio s1² / s2². For the lower bound, you divide that ratio by the F critical value with df1 = 19 and df2 = 14.

(220 / 196) / qf(0.975, 19, 14)

Exercise 26a2

Construct the 95% interval estimate for the ratio of the population variances.

Return the upper bound.

Use the same sample ratio 220 / 196. For the upper bound, multiply by qf(0.975, 14, 19).

For the upper endpoint, the same sample ratio is multiplied by the F critical value with the degrees of freedom reversed.

(220 / 196) * qf(0.975, 14, 19)
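The two bounds can be computed in one place, which also makes the reversed degrees of freedom easier to see. This is a sketch using the summary numbers from the exercise; the variable names are illustrative. The last line shows an equivalent way to get the upper bound, dividing by the lower-tail F quantile instead of multiplying by the reversed one.

```r
s1_sq <- 220; n1 <- 20
s2_sq <- 196; n2 <- 15
ratio <- s1_sq / s2_sq

lower <- ratio / qf(0.975, n1 - 1, n2 - 1)   # df order: (19, 14)
upper <- ratio * qf(0.975, n2 - 1, n1 - 1)   # df order reversed: (14, 19)

# Equivalent form of the upper bound, using the lower-tail F quantile
upper_alt <- ratio / qf(0.025, n1 - 1, n2 - 1)

c(lower = lower, upper = upper, upper_alt = upper_alt)
```

The values upper and upper_alt agree because the 0.025 quantile of F(19, 14) is the reciprocal of the 0.975 quantile of F(14, 19).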

Exercise 26b

Using the confidence interval from part (a), test if the ratio of the population variances differs from 1 at the 5% significance level.

Check whether 1 lies inside the confidence interval from 26a.

Correct choice: the second option.

The interval runs from about 0.3924 to 2.9710, so it includes 1. That means the data do not show a significant difference from 1 at the 5% level.
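The same decision can be reached through a p-value instead of the interval. A minimal sketch, assuming the two-sided p-value is taken as twice the smaller tail of the F distribution (the names F_stat, lower_tail, and p_value are illustrative):

```r
F_stat <- 220 / 196
df1 <- 19; df2 <- 14

lower_tail <- pf(F_stat, df1, df2)               # P(F <= F_stat)
p_value <- 2 * min(lower_tail, 1 - lower_tail)   # two-sided p-value

p_value
p_value > 0.05   # TRUE is consistent with 1 lying inside the 95% interval
```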

Exercise 38 (Rentals)

The data accompanying this exercise include monthly rents for a two-bedroom apartment in two campus towns. At the 5% significance level, test if the variance of rent in campus town 1 is less than the variance of rent in campus town 2. State your assumptions clearly.

Quick dataset note: in the code cells below, the file Rentals.xlsx is loaded into df. It has two columns called Town1 and Town2.

Exercise 38a

Which statement gives the right setup and assumptions?

The claim is that the variance in town 1 is smaller. You also need the usual F-test assumptions.

Correct choice: the first option.

You can test this as σ²_town2 / σ²_town1 > 1, which matches the claim that town 1 has the smaller variance. The usual assumptions are independent samples and normal populations.

Exercise 38b

Run the full variance-ratio test in R.

Return the full var.test(...) output.

One simple way is to put Town2 first and Town1 second, then use alternative = "greater".

The claim is that town 1 has the smaller variance, so an equivalent way to code the test is to place Town2 first and Town1 second and test whether the ratio is greater than 1. Returning the full var.test(...) output lets you read both the F statistic and the p-value directly.

var.test(df$Town2, df$Town1, alternative = "greater")
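To see why swapping the order is harmless, the sketch below runs both codings on simulated data. The vectors town1 and town2 are made-up stand-ins (the sample sizes, means, and standard deviations are assumptions), so the printed numbers are unrelated to Rentals.xlsx.

```r
set.seed(2)
# Simulated stand-ins for the two rent columns, NOT the Rentals.xlsx data
town1 <- rnorm(40, mean = 1200, sd = 150)
town2 <- rnorm(40, mean = 1250, sd = 170)

# Coding A: Town2 / Town1 ratio, upper tail
test_a <- var.test(town2, town1, alternative = "greater")

# Coding B: Town1 / Town2 ratio, lower tail
test_b <- var.test(town1, town2, alternative = "less")

c(F_a = unname(test_a$statistic), F_b = unname(test_b$statistic))
c(p_a = test_a$p.value, p_b = test_b$p.value)
```

The two F statistics are reciprocals of each other and the two p-values are identical, so either coding leads to the same conclusion.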

Exercise 38c

Based on the output from 38b, which statement about the p-value is correct?

Read the p-value from the var.test(...) output and compare it with both 0.05 and 0.10.

Correct choice: the third option.

The p-value is about 0.3644, so it is well above both 0.05 and 0.10.

Exercise 38d

At the 5% significance level, what is the correct conclusion?

Compare the p-value from 38b with 0.05.

Correct choice: the second option.

Because the p-value is much larger than 0.05, you do not reject the null hypothesis. The data do not give enough evidence that town 1 has the smaller variance.