Inference Practice Set
One exam-style dataset for comparing two group means
Readme
Hey :)
This page is an inference practice set built around one simple business-style dataset. The goal is to practise the full flow calmly: import the data, compare two groups, choose the right test direction, run the test, and interpret the output.
- In Exercise 1, you import the Excel file yourself with readxl.
- From Exercise 2 onward, the dataset is already loaded as df for you.
- In every later exercise, df is freshly reloaded behind the scenes, so one mistake will not break the next task.
- In the later part of the page, you will sometimes run the full output first and then answer a multiple-choice question based on that output.
Dataset description
The workbook contains data from a campus cafe. Each row is one customer order.
| Variable | Meaning | Values / categories |
|---|---|---|
| order_id | Order ID | 1, 2, ..., 50 |
| items | Number of items in the order | whole numbers |
| express_line | Whether the order used the express line | "yes" or "no" |
| mobile_order | Whether the order was placed on mobile first | "yes" or "no" |
| wait_time | Total wait time in minutes | numeric values |
Functions you may need
You do not need all of these in every exercise, but these are the main functions and operators used across the whole page.
- library(readxl)
- read_excel()
- head()
- mean()
- sum()
- boxplot()
- length()
- qt()
- t.test()
Useful operators you will probably use:
- $
- ==
- -
- []
- <-
Quick guide
- If the claim is that the express-line group has a lower mean wait_time, use a one-sided test with alternative = "less".
- If the claim is only that the two groups are different, use alternative = "two.sided".
- If equal variances are assumed, the degrees of freedom are n1 + n2 - 2.
- A p-value smaller than 0.05 means you reject H0 at the 5% level.
- A p-value between 0.05 and 0.10 is not significant at 5%, but it is significant at 10%.
Exercise 1 - Import the workbook and show the data
Import data.xlsx into an object called df, then display the dataset.
Typing either df or head(df) is completely fine here.
Load the Excel package first, import data.xlsx into an object called df, and then print df or head(df) so you can see that the import worked.
In the exam, the very first step is often just getting the data in correctly. Once df exists, printing it lets you check that the file was imported the way you expected.
library(readxl)
df <- read_excel("data.xlsx")
head(df)
Exercise 2 - Count orders in the express line
df is already loaded for you in this exercise.
How many orders used the express line?
Use the express_line column, check which rows are equal to "yes", and then count how many TRUE values you get.
df$express_line == "yes" creates a TRUE/FALSE check for every row. sum() then counts the TRUE values, so it gives the number of orders that used the express line.
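The counting idea can be tried on a tiny made-up vector first. This is only a sketch: the values below are hypothetical and are not taken from the cafe workbook, and the names express_line, is_express, and n_express are chosen just for this illustration.

```r
# Hypothetical toy column (made-up values, not the cafe data)
express_line <- c("yes", "no", "yes", "yes", "no")

# Comparing with == gives one TRUE/FALSE per element
is_express <- express_line == "yes"
print(is_express)   # TRUE FALSE TRUE TRUE FALSE

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values
n_express <- sum(is_express)
print(n_express)    # 3
```

The same two steps are what happen inside the one-line answer for this exercise, just applied to the real column in df.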
sum(df$express_line == "yes")
Exercise 3 - Mean wait time for one group
Find the mean wait time for the orders that used the express line.
First keep only the orders with express_line == "yes". Then take the mean of their wait_time values.
The mean should only be calculated for the express-line group. So first keep the wait_time values where express_line is "yes", and then average those values.
mean(df$wait_time[df$express_line == "yes"])
Exercise 4 - Mean wait time for the other group
Find the mean wait time for the orders that did not use the express line.
This is the same idea as before, but now keep the orders with express_line == "no" and then take the mean of wait_time.
The comparison needs one mean for each group. Here you keep only the orders with express_line == "no" and average their wait times.
mean(df$wait_time[df$express_line == "no"])
Exercise 5 - Difference between the two means
Use the two means from Exercises 3 and 4 and return their difference.
Take the mean from the express-line orders and subtract the mean from the orders that did not use the express line.
The difference here is mean_yes - mean_no, where mean_yes is the mean wait time for orders with express_line == "yes" and mean_no is the mean wait time for orders with express_line == "no". Writing the difference in this order makes the sign meaningful: a negative value means the express-line orders were faster in this dataset.
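To see how the sign behaves, here is a small sketch on a made-up data frame. The object toy and its numbers are hypothetical and exist only for this illustration; the column names mirror the ones in the dataset description.

```r
# Hypothetical toy data frame (made-up numbers, not the cafe workbook)
toy <- data.frame(
  express_line = c("yes", "yes", "yes", "no", "no", "no"),
  wait_time    = c(3, 4, 5, 6, 7, 8)
)

# Keep only the wait times of one group with [], then average them
mean_yes <- mean(toy$wait_time[toy$express_line == "yes"])  # (3 + 4 + 5) / 3 = 4
mean_no  <- mean(toy$wait_time[toy$express_line == "no"])   # (6 + 7 + 8) / 3 = 7

# Express-line mean minus regular-line mean
diff_means <- mean_yes - mean_no
print(diff_means)   # -3: negative, so the express group waited less in this toy data
```

Storing the two means in named objects first is optional. It only makes the subtraction easier to read than the one-line version.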
mean(df$wait_time[df$express_line == "yes"]) - mean(df$wait_time[df$express_line == "no"])
Exercise 6 - Boxplot of wait time by line
Create a boxplot of wait_time by express_line.
Use wait_time ~ express_line inside boxplot(...) so R knows which variable should be compared across the two line groups.
wait_time ~ express_line tells R to draw one box for each line group and place the wait time distribution inside each box. That is the quickest visual comparison of the two groups.
boxplot(wait_time ~ express_line, data = df)
Exercise 7 - Choose the right alternative
We want to test whether orders in the express line have a lower mean wait_time than the other orders.
Choose the correct alternative setting for t.test(...).
Exercise 8 - Degrees of freedom
Assume population variances are unknown but equal. What are the degrees of freedom for this two-sample t test?
With equal variances, the degrees of freedom are n1 + n2 - 2.
For the equal-variance two-sample t test, you add the two group sizes and subtract 2 because two sample means are being estimated.
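For reference, the textbook form of the equal-variance (pooled) two-sample t statistic shows where n1 + n2 - 2 comes from. The symbols are notation introduced here, not taken from the page: the sample means are written with bars, s_1^2 and s_2^2 are the two sample variances, and s_p^2 is the pooled variance.

```latex
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
\text{df} = n_1 + n_2 - 2
```

Each group contributes its size minus 1 to the denominator of the pooled variance, and those two contributions add up to the degrees of freedom.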
sum(df$express_line == "yes") + sum(df$express_line == "no") - 2
Exercise 9 - Critical t-value
At alpha = 0.05, what is the critical t value for the left-tailed one-sided test?
Use qt(...) with the one-sided 5th percentile and the degrees of freedom from the previous exercise.
A left-tailed one-sided test with alpha = 0.05 uses the cutoff that leaves 5% to the left and 95% to the right.
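A minimal sketch of the cutoff, under one assumption: all 50 orders carry either "yes" or "no" in express_line, so the degrees of freedom are 50 - 2 = 48. The names dof and t_crit are chosen only for this illustration.

```r
# Assumption: 50 orders in total and no missing group labels, so df = 50 - 2 = 48
dof <- 50 - 2

# qt(p, df) returns the value with probability p to its left
t_crit <- qt(0.05, df = dof)
print(t_crit)               # about -1.677

# The t distribution is symmetric, so the right-tailed cutoff is the same number with a plus sign
print(qt(0.95, df = dof))   # about 1.677
```

Note that df inside qt(...) is the name of the degrees-of-freedom argument. It is unrelated to the data frame that is also called df on this page, which is why both can appear in the same line of the solution.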
qt(0.05, df = sum(df$express_line == "yes") + sum(df$express_line == "no") - 2)
Exercise 10 - Run the one-sided hypothesis test
Run the equal-variance two-sample t test to check whether orders in the express line have a lower mean wait_time than orders not in the express line.
Return the full t.test(...) output.
Use the express-line wait times as the first input, the regular-line wait times as the second input, set alternative = "less", and include var.equal = TRUE.
The first vector should be the express-line group because the claim is that this group has the lower mean wait time. var.equal = TRUE matches the equal-variance assumption, and alternative = "less" matches the direction of the claim.
t.test(
df$wait_time[df$express_line == "yes"],
df$wait_time[df$express_line == "no"],
alternative = "less",
var.equal = TRUE
)
Exercise 11 - Read the one-sided output
Based on the output from Exercise 10, which statement is correct at the 5% level?
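If reading the printed output feels crowded, the key numbers can also be pulled out of the result object. The sketch below uses two made-up samples, not the cafe data, and the names express, regular, res, and alpha are hypothetical.

```r
# Hypothetical toy samples (made-up numbers, not the cafe data)
express <- c(3, 4, 5, 4, 3, 5)
regular <- c(6, 7, 8, 7, 6, 8)

# Same settings as the one-sided test: first group claimed to have the lower mean
res <- t.test(express, regular, alternative = "less", var.equal = TRUE)

print(res$statistic)   # the t value (negative here, because the first mean is smaller)
print(res$parameter)   # degrees of freedom: 6 + 6 - 2 = 10
print(res$p.value)     # the p-value shown in the printed output

# Decision rule: reject H0 when the p-value is below the chosen level
alpha <- 0.05
print(res$p.value < alpha)   # TRUE for these toy numbers
```

Changing alpha to 0.10 only changes the threshold in the last line. The test output itself stays exactly the same.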
Exercise 12 - Change the significance level
If you keep the same one-sided test output from Exercise 10 but use alpha = 0.10, what changes?
Exercise 13 - Run the two-sided version
Now run the equal-variance two-sample t test for the question whether the two mean wait times are different.
Return the full t.test(...) output.
Use the same two wait_time vectors as before and keep var.equal = TRUE, but change the alternative to "two.sided".
The groups stay the same, and the equal-variance assumption stays the same. Only the research question changes: now you test whether the mean wait times differ in either direction.
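One way to see how the two versions relate is to compare their p-values on the same data. This sketch again uses made-up samples, not the cafe data. It relies on the first sample mean being below the second, so the t value is negative.

```r
# Hypothetical toy samples (made-up numbers, not the cafe data)
express <- c(3, 4, 5, 4, 3, 5)
regular <- c(6, 7, 8, 7, 6, 8)

# Same data and same equal-variance assumption, only the alternative differs
p_less <- t.test(express, regular, alternative = "less", var.equal = TRUE)$p.value
p_two  <- t.test(express, regular, alternative = "two.sided", var.equal = TRUE)$p.value

print(c(one_sided = p_less, two_sided = p_two))

# With a negative t value, the two-sided p-value is twice the "less" p-value
print(isTRUE(all.equal(p_two, 2 * p_less)))   # TRUE
```

Because the two-sided p-value is larger, a result that is significant in the one-sided test can stop being significant in the two-sided test at the same level.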
t.test(
df$wait_time[df$express_line == "yes"],
df$wait_time[df$express_line == "no"],
alternative = "two.sided",
var.equal = TRUE
)
Exercise 14 - Read the two-sided output
Based on the output from Exercise 13, which statement is correct at the 5% level?