Inference Practice Set

One exam-style dataset for comparing two group means

Readme

Hey :)

This page is an inference practice set built around one simple business-style dataset. The goal is to practise the full flow calmly: import the data, compare two groups, choose the right test direction, run the test, and interpret the output.

  • In Exercise 1, you import the Excel file yourself with readxl.
  • From Exercise 2 onward, the dataset is already loaded as df for you.
  • In every later exercise, df is freshly reloaded behind the scenes, so one mistake will not break the next task.
  • In the later part of the page, you will sometimes run the full output first and then answer a multiple-choice question based on that output.

Dataset description

The workbook contains data from a campus cafe. Each row is one customer order.

Variable     | Meaning                                      | Values / categories
-------------|----------------------------------------------|--------------------
order_id     | Order ID                                     | 1, 2, ..., 50
items        | Number of items in the order                 | whole numbers
express_line | Whether the order used the express line      | "yes" or "no"
mobile_order | Whether the order was placed on mobile first | "yes" or "no"
wait_time    | Total wait time in minutes                   | numeric values

Functions you may need

You do not need all of these in every exercise, but these are the main functions and operators used across the whole page.

  • library(readxl)
  • read_excel()
  • head()
  • mean()
  • sum()
  • boxplot()
  • length()
  • qt()
  • t.test()

Useful operators you will probably use:

  • $
  • ==
  • -
  • [ ]
  • <-

Quick guide

  • If the claim is that the express-line group has a lower mean wait_time, pass the express-line group as the first sample and use a one-sided test with alternative = "less".
  • If the claim is only that the two groups are different, use alternative = "two.sided".
  • If equal variances are assumed, the degrees of freedom are n1 + n2 - 2.
  • A p-value smaller than 0.05 means you reject H0 at the 5% level.
  • A p-value between 0.05 and 0.10 is not significant at 5%, but it is significant at 10%.
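
The last two rules can be sketched as a tiny helper. The function name decide and the example p-value 0.07 are made up for illustration; they are not part of the practice set.

```r
# Hypothetical helper: compare a p-value with the significance level alpha.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "do not reject H0"
}

p_example <- 0.07          # made-up p-value between 0.05 and 0.10
decide(p_example, 0.05)    # not significant at the 5% level
decide(p_example, 0.10)    # significant at the 10% level
```

The same p-value can lead to different decisions, because only the threshold alpha changes.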

Exercise 1 - Import the workbook and show the data

Import data.xlsx into an object called df, then display the dataset.

Typing either df or head(df) is completely fine here.

Load the Excel package first, import data.xlsx into an object called df, and then print df or head(df) so you can see that the import worked.

In the exam, the very first step is often just getting the data in correctly. Once df exists, printing it lets you check that the file was imported the way you expected.

library(readxl)
df <- read_excel("data.xlsx")
head(df)

Exercise 2 - Count orders in the express line

df is already loaded for you in this exercise.

How many orders used the express line?

Use the express_line column, check which rows are equal to "yes", and then count how many TRUE values you get.

df$express_line == "yes" creates a logical vector with one TRUE or FALSE per row. sum() treats TRUE as 1 and FALSE as 0, so it returns the number of orders that used the express line.

sum(df$express_line == "yes")
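
A minimal sketch of what happens step by step, using a small made-up data frame called toy in place of df (the values are invented, not the cafe data). table() is not in the function list above; it is shown only as an optional cross-check.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

is_express <- toy$express_line == "yes"  # logical vector, one entry per row
n_express  <- sum(is_express)            # TRUE counts as 1, FALSE as 0
counts     <- table(toy$express_line)    # counts for both groups at once

n_express
counts
```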

Exercise 3 - Mean wait time for one group

Find the mean wait time for the orders that used the express line.

First keep only the orders with express_line == "yes". Then take the mean of their wait_time values.

The mean should only be calculated for the express-line group. So first keep the wait_time values where express_line is "yes", and then average those values.

mean(df$wait_time[df$express_line == "yes"])

Exercise 4 - Mean wait time for the other group

Find the mean wait time for the orders that did not use the express line.

This is the same idea as before, but now keep the orders with express_line == "no" and then take the mean of wait_time.

The comparison needs one mean for each group. Here you keep only the orders with express_line == "no" and average their wait times.

mean(df$wait_time[df$express_line == "no"])

Exercise 5 - Difference between the two means

Use the two means from Exercises 3 and 4 and return their difference.

Take the mean from the express-line orders and subtract the mean from the orders that did not use the express line.

The difference here is mean_yes - mean_no, where mean_yes is the mean wait time for orders with express_line == "yes" and mean_no is the mean wait time for orders with express_line == "no". Writing the difference in this order makes the sign meaningful: a negative value means the express-line orders were faster in this dataset.

mean(df$wait_time[df$express_line == "yes"]) - mean(df$wait_time[df$express_line == "no"])
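
If the one-line version is hard to read, the same calculation can be split into named steps. This is a sketch on a made-up data frame toy (invented values, not the cafe data); mean_yes and mean_no follow the names used in the explanation above, and tapply() is an optional extra that is not in the function list.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

mean_yes   <- mean(toy$wait_time[toy$express_line == "yes"])
mean_no    <- mean(toy$wait_time[toy$express_line == "no"])
diff_means <- mean_yes - mean_no   # negative: express-line orders were faster here

# Optional cross-check: tapply() returns both group means in one call.
group_means <- tapply(toy$wait_time, toy$express_line, mean)

diff_means
group_means
```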

Exercise 6 - Boxplot of wait time by line

Create a boxplot of wait_time by express_line.

Use wait_time ~ express_line inside boxplot(...) so R knows which variable should be compared across the two line groups.

wait_time ~ express_line tells R to draw one box for each line group, with each box summarising that group's wait_time distribution (median, quartiles, and any outliers). That is the quickest visual comparison of the two groups.

boxplot(wait_time ~ express_line, data = df)
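
A sketch of the same plot with axis labels added, run on a made-up data frame toy (invented values, not the cafe data). The label texts are assumptions chosen for illustration. boxplot() also returns its summary invisibly, which is a handy way to check which box belongs to which group.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

bp <- boxplot(
  wait_time ~ express_line,
  data = toy,
  xlab = "Express line",
  ylab = "Wait time (minutes)",
  main = "Wait time by line"
)

bp$names  # group labels in plotting order (alphabetical: "no" before "yes")
bp$n      # number of observations behind each box
```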

Exercise 7 - Choose the right alternative

We want to test whether orders in the express line have a lower mean wait_time than the other orders.

Choose the correct alternative setting for t.test(...).


Exercise 8 - Degrees of freedom

Assume population variances are unknown but equal. What are the degrees of freedom for this two-sample t test?

With equal variances, the degrees of freedom are n1 + n2 - 2.

For the equal-variance two-sample t test, you add the two group sizes and subtract 2, because one degree of freedom is used up for each of the two estimated group means.

sum(df$express_line == "yes") + sum(df$express_line == "no") - 2
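
An equivalent route uses length() on the two wait_time vectors. The sketch below runs on a made-up data frame toy (invented values, not the cafe data); the names wait_yes, wait_no, n1, n2, and dof are chosen for illustration.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

wait_yes <- toy$wait_time[toy$express_line == "yes"]
wait_no  <- toy$wait_time[toy$express_line == "no"]

n1  <- length(wait_yes)   # size of the express-line group
n2  <- length(wait_no)    # size of the other group
dof <- n1 + n2 - 2        # degrees of freedom for the equal-variance t test

dof
```

When every row is either "yes" or "no", this is the same as the number of rows minus 2.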

Exercise 9 - Critical t-value

At alpha = 0.05, what is the critical t value for the left-tailed one-sided test?

Use qt(...) with the one-sided 5th percentile and the degrees of freedom from the previous exercise.

A left-tailed one-sided test with alpha = 0.05 uses the cutoff that leaves 5% to the left and 95% to the right. The cutoff is therefore negative, and you reject H0 when the t statistic falls below it.

qt(0.05, df = sum(df$express_line == "yes") + sum(df$express_line == "no") - 2)
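
A short sketch of how the critical value is used. The degrees of freedom (7) and the test statistic (-2.4) are made-up numbers for illustration, and pt() is not in the function list above; it appears only as a sanity check.

```r
alpha  <- 0.05
dof    <- 7                      # made-up degrees of freedom for illustration
t_crit <- qt(alpha, df = dof)    # left-tail cutoff, a negative number

# Sanity check: the area to the left of the cutoff is alpha again.
pt(t_crit, df = dof)

# By symmetry, the right-tail cutoff is the same number with the sign flipped.
qt(1 - alpha, df = dof)

t_obs <- -2.4                    # made-up test statistic
t_obs < t_crit                   # TRUE means reject H0 in the left-tailed test
```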

Exercise 10 - Run the one-sided hypothesis test

Run the equal-variance two-sample t test to check whether orders in the express line have a lower mean wait_time than orders not in the express line.

Return the full t.test(...) output.

Use the express-line wait times as the first input, the regular-line wait times as the second input, set alternative = "less", and include var.equal = TRUE.

The first vector should be the express-line group because alternative = "less" tests whether the mean of the first vector is lower than the mean of the second, and the claim is that the express-line group has the lower mean wait time. var.equal = TRUE matches the equal-variance assumption.

t.test(
  df$wait_time[df$express_line == "yes"],
  df$wait_time[df$express_line == "no"],
  alternative = "less",
  var.equal = TRUE
)
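
To see where the numbers in the output come from, the pooled t statistic can be rebuilt by hand and compared with t.test(). This is a sketch on a made-up data frame toy (invented values, not the cafe data); var() and pt() go beyond the function list above, and all variable names are chosen for illustration.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

x  <- toy$wait_time[toy$express_line == "yes"]
y  <- toy$wait_time[toy$express_line == "no"]
n1 <- length(x)
n2 <- length(y)

# Pooled variance and standard error of the difference in means.
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

t_manual <- (mean(x) - mean(y)) / se
p_manual <- pt(t_manual, df = n1 + n2 - 2)  # left-tail area for alternative = "less"

res <- t.test(x, y, alternative = "less", var.equal = TRUE)
res$statistic   # t statistic, should match t_manual
res$parameter   # degrees of freedom
res$p.value     # one-sided p-value, should match p_manual
```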

Exercise 11 - Read the one-sided output

Based on the output from Exercise 10, which statement is correct at the 5% level?


Exercise 12 - Change the significance level

If you keep the same one-sided test output from Exercise 10 but use alpha = 0.10, what changes?


Exercise 13 - Run the two-sided version

Now run the equal-variance two-sample t test for the question whether the two mean wait times are different.

Return the full t.test(...) output.

Use the same two wait_time vectors as before and keep var.equal = TRUE, but change the alternative to "two.sided".

The groups stay the same, and the equal-variance assumption stays the same. Only the research question changes: now you test whether the mean wait times differ in either direction.

t.test(
  df$wait_time[df$express_line == "yes"],
  df$wait_time[df$express_line == "no"],
  alternative = "two.sided",
  var.equal = TRUE
)
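
The one-sided and two-sided versions share the same t statistic; only the p-value changes. The sketch below checks this on a made-up data frame toy (invented values, not the cafe data), with variable names chosen for illustration.

```r
# Made-up stand-in for df, only for illustration (not the cafe data).
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 4, 6, 7, 5, 8, 9)
)

x <- toy$wait_time[toy$express_line == "yes"]
y <- toy$wait_time[toy$express_line == "no"]

one_sided <- t.test(x, y, alternative = "less", var.equal = TRUE)
two_sided <- t.test(x, y, alternative = "two.sided", var.equal = TRUE)

one_sided$statistic   # same t statistic in both tests
two_sided$statistic

# The two-sided p-value is twice the smaller tail area.
two_sided$p.value
2 * min(one_sided$p.value, 1 - one_sided$p.value)
```

This is why a result can be significant in the one-sided test but not in the two-sided one at the same alpha.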

Exercise 14 - Read the two-sided output

Based on the output from Exercise 13, which statement is correct at the 5% level?