Exam R Cheatsheet
Pick the problem type, then use the matching function
Use this page when you are stuck on one simple question: “Which R function fits this task?”
Before You Start
To be on the safe side, you can load all of these packages at the start of a fresh Quarto file or a fresh R session:
library(readxl)
library(EnvStats)
library(PerformanceAnalytics)
library(tseries)

If you use the course Quarto template, you do not need these library(...) lines because the template already loads them for you.
If a package is not loaded, you can still call a function with package::function(). Example:
tseries::jarque.bera.test(df$grade)

How To Read The Examples
- df$col means "use the column you need from your dataset".
- In boxplot(wait_time ~ express_line, data = df), wait_time is the numeric variable and express_line is the group variable.
- Names such as salary_model or wage_model are just names for a fitted regression model. You can choose another clear name, but then keep using that same name in summary(), predict(), confint(), and resid().
- Symbols such as x1, n1, x2, n2 are counts: x means number of successes and n means sample size.
Start Every Data Question This Way
If the exam gives you the file data.xlsx, a good safe choice is to load that file and store it in df.
library(readxl)
df <- read_excel("data.xlsx")
head(df)
names(df)
summary(df)

If the question uses another file name, load that file instead.
This cheatsheet uses df in the examples because it is short and easy to read, but you can choose another object name if you prefer. If you do, stay consistent and keep using that same name in the rest of your code.
Quick Tail Guide
Use this part only for hypothesis tests.
First ask: what kind of result would support the claim?
- If smaller values support the claim, use a left-tailed test.
- If larger values support the claim, use a right-tailed test.
- If both smaller and larger values would support the claim, use a two-sided test.
In R:
- Left-tailed test: alternative = "less"
- Right-tailed test: alternative = "greater"
- Two-sided test: alternative = "two.sided"
Common wording:
- “less than”, “lower than”, “smaller than” -> left-tailed
- “greater than”, “higher than”, “larger than” -> right-tailed
- “different from”, “differs from”, “not equal to” -> two-sided
Examples:
- “Do students in the express line have lower wait times?” -> alternative = "less"
- “Do students with tutoring get higher grades?” -> alternative = "greater"
- “Does the mean MPG differ from 95?” -> alternative = "two.sided"
If you calculate the p-value by hand:
- Left tail: pnorm(z)
- Right tail: pnorm(z, lower.tail = FALSE)
- Two-sided: 2 * pnorm(abs(z), lower.tail = FALSE)
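As a quick check, here is the same idea with a hypothetical test statistic of z = 1.96 (the number is made up for illustration):

```r
z <- 1.96                               # hypothetical z statistic
pnorm(z)                                # left tail, about 0.975
pnorm(z, lower.tail = FALSE)            # right tail, about 0.025
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided, about 0.05
```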
Descriptive Statistics And Basic Plots
Use these when the question asks you to describe one variable, count cases, or make a simple plot.
- Mean or average: mean(df$col)
- Median: median(df$col)
- Interquartile range: IQR(df$col)
- Range: range(df$col)
- Minimum: min(df$col)
- Maximum: max(df$col)
- Standard deviation: sd(df$col)
- Count how many rows meet a condition: sum(df$weekend == "yes")
- Proportion that meet a condition: mean(df$weekend == "yes")
- Percentage that meet a condition: 100 * mean(df$weekend == "yes")
- Frequency table: table(df$breakfast)
- Proportion table: prop.table(table(df$breakfast))
- Histogram: hist(df$col)
- One-variable boxplot: boxplot(df$col, horizontal = TRUE)
- Boxplot for one numeric variable split into groups: boxplot(wait_time ~ express_line, data = df)
If you want a quick full summary of one numeric column, use:
summary(df$col)

In the grouped boxplot example above:
- the left side of ~ is the numeric variable you want to compare
- the right side of ~ is the variable that splits the data into groups
If you need a pie chart for category shares, use:
pct <- round(100 * prop.table(table(df$breakfast)), 1)
pie(pct, labels = paste0(pct, "%"))

Distribution Questions
Binomial
Use binomial functions when you have:
- a fixed number of trials
- only two outcomes
- the same success probability each time
- Exactly x: dbinom(x, n, p)
- At most x: pbinom(x, n, p)
- At least x: 1 - pbinom(x - 1, n, p)
- Cutoff value: qbinom(prob, n, p)
dbinom(10, 20, 0.4)
pbinom(10, 20, 0.4)
1 - pbinom(14, 20, 0.4)
qbinom(0.85, 20, 0.4)

Uniform
Use punif() when every value between min and max is equally likely.
- Less than x: punif(x, min = a, max = b)
- Greater than x: punif(x, min = a, max = b, lower.tail = FALSE)
punif(15.5, min = 12, max = 20)
punif(14, min = 12, max = 20, lower.tail = FALSE)

Normal
- Less than x: pnorm(x, mean = mu, sd = sigma)
- Greater than x: pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
- Between a and b: pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
- Cutoff or percentile: qnorm(prob, mean = mu, sd = sigma)
pnorm(90, mean = 100, sd = 16)
pnorm(80, mean = 60, sd = 20) - pnorm(50, mean = 60, sd = 20)
qnorm(0.85, mean = 60, sd = 20)

Sampling Distributions, Confidence Intervals, And Sample Size
Sampling Distribution Of The Sample Mean
Use this when the question is about the average of a sample, not one single person.
pnorm(cutoff, mean = mu, sd = sigma / sqrt(n))

Examples:
1 - pnorm(105, mean = 100, sd = 20 / sqrt(50))
pnorm(95, mean = 100, sd = 20 / sqrt(50))

Sampling Distribution Of The Sample Proportion
Use this when the question is about a sample proportion p-hat.
se <- sqrt(p * (1 - p) / n)
pnorm(cutoff, mean = p, sd = se)

Example:
se <- sqrt(0.82 * 0.18 / 100)
pnorm(0.80, mean = 0.82, sd = se)

Confidence Interval For A Mean When Sigma Is Known
Use qnorm() for the critical value.
xbar <- mean(df$col)
me <- qnorm(0.975) * sigma / sqrt(nrow(df))
c(xbar - me, xbar + me)

Confidence Interval For A Mean When Sigma Is Unknown
The easiest method is usually t.test().
t.test(df$col)$conf.int

If you only need one endpoint:
t.test(df$col)$conf.int[1]
t.test(df$col)$conf.int[2]

Confidence Interval For A Proportion
phat <- x / n
me <- qnorm(0.975) * sqrt(phat * (1 - phat) / n)
c(phat - me, phat + me)

Example:
phat <- 20 / 80
me <- qnorm(0.975) * sqrt(phat * (1 - phat) / 80)
c(phat - me, phat + me)

Required Sample Size
Mean with known sigma:
ceiling((qnorm(0.975) * sigma / E)^2)

Proportion with no prior estimate:
ceiling((qnorm(0.995)^2 * 0.25) / E^2)

Always round up.
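A worked sketch with made-up numbers: sigma = 15 and margin of error E = 2 for the mean case, and E = 0.03 at the 99% level for the proportion case.

```r
ceiling((qnorm(0.975) * 15 / 2)^2)           # mean case, gives 217
ceiling((qnorm(0.995)^2 * 0.25) / 0.03^2)    # proportion case, gives 1844
```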
Hypothesis Tests For Means And Proportions
If the question naturally gives a full test output, return the full output first. That is usually the safest starting point.
One Mean, Sigma Known
This is a z test. You usually compute the z statistic by hand.
z <- (mean(df$col) - mu0) / (sigma / sqrt(nrow(df)))

Then turn z into a p-value:
pnorm(z)
pnorm(z, lower.tail = FALSE)
2 * pnorm(abs(z), lower.tail = FALSE)

One Mean, Sigma Unknown
Use a one-sample t test.
t.test(df$col, mu = mu0, alternative = "two.sided")

Change the alternative setting if the question is one-sided.
Two Independent Means
Start by splitting the numeric variable into two groups.
with_tutoring <- df$grade[df$tut == 1]
without_tutoring <- df$grade[df$tut == 0]

If the question says equal variances:
t.test(with_tutoring, without_tutoring, var.equal = TRUE, alternative = "greater")

If the question does not assume equal variances:
t.test(with_tutoring, without_tutoring, var.equal = FALSE, alternative = "less")

If the null difference is not 0, put that value in mu.
t.test(with_tutoring, without_tutoring, mu = 2.5, var.equal = TRUE)

If the question asks for the critical t value, use:
qt(0.05, df = degrees_of_freedom, lower.tail = FALSE)

Paired Before/After Data
Use a paired t test when the same people, cars, or units are measured twice.
t.test(df$After, df$Before, paired = TRUE)

One Proportion
This is usually a z test done by hand.
phat <- mean(df$tut == 1)
z <- (phat - p0) / sqrt(p0 * (1 - p0) / nrow(df))

Then use pnorm() for the p-value.
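A worked sketch with hypothetical numbers: 110 successes out of 200, null value p0 = 0.5, right-tailed claim.

```r
phat <- 110 / 200
z <- (phat - 0.5) / sqrt(0.5 * (1 - 0.5) / 200)   # z is about 1.41
pnorm(z, lower.tail = FALSE)                       # right-tailed p-value
```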
Difference Between Two Proportions
If the null difference is 0, use the pooled proportion.
phat1 <- x1 / n1
phat2 <- x2 / n2
ppool <- (x1 + x2) / (n1 + n2)
z <- (phat1 - phat2) / sqrt(ppool * (1 - ppool) * (1 / n1 + 1 / n2))

If the null difference is a nonzero value such as 0.10, subtract that value in the numerator and use the unpooled standard error.
phat1 <- x1 / n1
phat2 <- x2 / n2
z <- ((phat1 - phat2) - 0.10) / sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)

Then use pnorm() for the p-value.
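A worked sketch of the pooled version with hypothetical counts (40 of 100 in group 1, 30 of 100 in group 2):

```r
x1 <- 40; n1 <- 100   # hypothetical counts
x2 <- 30; n2 <- 100
phat1 <- x1 / n1
phat2 <- x2 / n2
ppool <- (x1 + x2) / (n1 + n2)
z <- (phat1 - phat2) / sqrt(ppool * (1 - ppool) * (1 / n1 + 1 / n2))  # about 1.48
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value
```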
Variance Questions
One Population Variance
Use var() to get the sample variance.
x <- df$col
stat <- (length(x) - 1) * var(x) / sigma0_sq

For a two-sided p-value:
2 * min(
pchisq(stat, df = length(x) - 1),
1 - pchisq(stat, df = length(x) - 1)
)

For a confidence interval for the variance:
lower <- (length(x) - 1) * var(x) / qchisq(0.975, df = length(x) - 1)
upper <- (length(x) - 1) * var(x) / qchisq(0.025, df = length(x) - 1)
c(lower, upper)

Ratio Of Two Variances
If the question asks for the full test output, use:
var.test(df$Town2, df$Town1, alternative = "greater")

If the question asks for a confidence interval for the variance ratio:
ratio <- s1_sq / s2_sq
lower <- ratio / qf(0.975, df1, df2)
upper <- ratio * qf(0.975, df2, df1)
c(lower, upper)

Chi-Square Tests And Normality
Goodness-Of-Fit
Use this when one sample is compared with a claimed distribution.
observed <- c(153, 163, 139, 20, 20, 5)
expected_prop <- c(0.357, 0.314, 0.229, 0.043, 0.043, 0.014)
chisq.test(observed, p = expected_prop)

Independence In A Contingency Table
Build the table first, then test it.
tab <- table(df$passed_exam, df$book)
chisq.test(tab)

Jarque-Bera Normality Check
This is for checking normality.
If you loaded library(tseries) at the top, you can write jarque.bera.test(...). If not, use tseries::jarque.bera.test(...).
For one variable:
tseries::jarque.bera.test(df$col)

For regression residuals:
grade_model <- lm(grade ~ prep + tut + book, data = df)
tseries::jarque.bera.test(resid(grade_model))

Correlation And Regression
Correlation
Use cor() when you only need the sample correlation coefficient.
cor(df$Age, df$Happiness)

Use cor.test() when the question asks whether the correlation is statistically significant.
cor.test(df$Age, df$Happiness, conf.level = 0.99)

If you want a quick correlation matrix for several numeric variables, use:
cor(df[, c("grade", "tut", "book", "prep")])

If you want a quick correlation plot matrix and you loaded PerformanceAnalytics, you can also use:
PerformanceAnalytics::chart.Correlation(df[, c("grade", "tut", "book", "prep")])

Simple Regression
Use lm() to fit the model. Wrap it in summary() when the question asks for coefficients, p-values, the F-test, or R-squared. Pick one clear name for the fitted model and keep reusing it.
salary_model <- lm(Salary ~ Education, data = df)
summary(salary_model)For a prediction:
predict(salary_model, data.frame(Education = 7))

For coefficient confidence intervals:
confint(salary_model)

Multiple Regression
score_model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
summary(score_model)Use this setup when several explanatory variables are used at the same time.
Dummy Variables
A dummy variable is a 0/1 group variable.
wage_model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
summary(wage_model)
predict(wage_model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))

Interactions
Use * when the effect of one variable may change across groups.
consumption_model <- lm(Consumption ~ Income * Urban, data = df)
summary(consumption_model)
predict(consumption_model, data.frame(Income = 75000, Urban = 1))

Residual Checks
If the question asks about heteroskedasticity or changing variability, use residual plots.
healthy_model <- lm(Healthy ~ FV + Exercise + Smoke, data = df)
r <- resid(healthy_model)
plot(r ~ df$FV, xlab = "FV", ylab = "Residuals")
abline(h = 0)

Then repeat the same idea for the other explanatory variables in the model.
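If you want the same residual plot for each explanatory variable without copying the code, a short loop works. This sketch assumes the healthy_model example, with r holding its residuals:

```r
for (v in c("FV", "Exercise", "Smoke")) {
  plot(r ~ df[[v]], xlab = v, ylab = "Residuals")
  abline(h = 0)
}
```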
Short Memory Rule
- One variable to describe -> mean(), median(), sd(), IQR(), range(), hist(), boxplot()
- Probability question -> d..., p..., or q...
- Mean CI or mean test -> usually t.test()
- Proportion test -> build z, then use pnorm()
- Goodness-of-fit or independence -> chisq.test()
- Correlation -> cor() or cor.test()
- Regression -> summary(lm(...))
- Regression prediction -> predict(...)
- Regression interval -> confint(...)
- Residual normality -> tseries::jarque.bera.test(resid(my_model))