Exam R Cheatsheet

Pick the problem type, then use the matching function.

<- Back to main page

Download this Quarto file

Download combined R script

Download data folder (.zip)

The combined script uses a single flat data folder containing one workbook per file name. For the List 2 Prime exercise, the script uses the first 100 rows of Prime.xlsx.

Use this page when you are stuck on one simple question: “Which R function fits this task?”

Before You Start

To be on the safe side, you can load all of these packages at the start of a fresh Quarto file or a fresh R session:

library(readxl)
library(EnvStats)
library(PerformanceAnalytics)
library(tseries)

If you use the course Quarto template, you do not need these library(...) lines because the template already loads them for you.

If a package is not loaded, you can still call a function with package::function(). Example:

tseries::jarque.bera.test(df$grade)

How To Read The Examples

  • df$col means “use the column you need from your dataset”.
  • In boxplot(wait_time ~ express_line, data = df), wait_time is the numeric variable and express_line is the group variable.
  • Names such as salary_model or wage_model are just names for a fitted regression model. You can choose another clear name, but then keep using that same name in summary(), predict(), confint(), and resid().
  • Symbols such as x1, n1, x2, n2 are counts: x means number of successes and n means sample size.

Start Every Data Question This Way

If the exam gives you the file data.xlsx, a good safe choice is to load that file and store it in df.

library(readxl)
df <- read_excel("data.xlsx")
head(df)
names(df)
summary(df)

If the question uses another file name, load that file instead.

This cheatsheet uses df in the examples because it is short and easy to read, but you can choose another object name if you prefer. If you do, stay consistent and keep using that same name in the rest of your code.

Quick Tail Guide

Use this part only for hypothesis tests.

First ask: what kind of result would support the claim?

  • If smaller values support the claim, use a left-tailed test.
  • If larger values support the claim, use a right-tailed test.
  • If both smaller and larger values would support the claim, use a two-sided test.

In R:

  • Left-tailed test: alternative = "less"
  • Right-tailed test: alternative = "greater"
  • Two-sided test: alternative = "two.sided"

Common wording:

  • “less than”, “lower than”, “smaller than” -> left-tailed
  • “greater than”, “higher than”, “larger than” -> right-tailed
  • “different from”, “differs from”, “not equal to” -> two-sided

Examples:

  • “Do students in the express line have lower wait times?” -> alternative = "less"
  • “Do students with tutoring get higher grades?” -> alternative = "greater"
  • “Does the mean MPG differ from 95?” -> alternative = "two.sided"

If you calculate the p-value by hand:

  • Left tail: pnorm(z)
  • Right tail: pnorm(z, lower.tail = FALSE)
  • Two-sided: 2 * pnorm(abs(z), lower.tail = FALSE)
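For example, with an assumed test statistic of z = -1.85 (a made-up value for illustration):

```r
z <- -1.85                              # example z statistic (made up)
pnorm(z)                                # left tail, about 0.032
pnorm(z, lower.tail = FALSE)            # right tail, about 0.968
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided, about 0.064
```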

Descriptive Statistics And Basic Plots

Use these when the question asks you to describe one variable, count cases, or make a simple plot.

  • Mean or average: mean(df$col)
  • Median: median(df$col)
  • Interquartile range: IQR(df$col)
  • Range: range(df$col)
  • Minimum: min(df$col)
  • Maximum: max(df$col)
  • Standard deviation: sd(df$col)
  • Count how many rows meet a condition: sum(df$weekend == "yes")
  • Proportion that meet a condition: mean(df$weekend == "yes")
  • Percentage that meet a condition: 100 * mean(df$weekend == "yes")
  • Frequency table: table(df$breakfast)
  • Proportion table: prop.table(table(df$breakfast))
  • Histogram: hist(df$col)
  • One-variable boxplot: boxplot(df$col, horizontal = TRUE)
  • Boxplot for one numeric variable split into groups: boxplot(wait_time ~ express_line, data = df)

If you want a quick full summary of one numeric column, use:

summary(df$col)

In the grouped boxplot example above:

  • the left side of ~ is the numeric variable you want to compare
  • the right side of ~ is the variable that splits the data into groups

If you need a pie chart for category shares, use:

pct <- round(100 * prop.table(table(df$breakfast)), 1)
pie(pct, labels = paste0(names(pct), ": ", pct, "%"))

Using names(pct) keeps the category names in the slice labels next to the percentages.

Distribution Questions

Binomial

Use binomial functions when you have:

  • a fixed number of trials
  • only two outcomes
  • the same success probability each time

Then match the wording to the function:

  • Exactly x: dbinom(x, n, p)
  • At most x: pbinom(x, n, p)
  • At least x: 1 - pbinom(x - 1, n, p)
  • Cutoff value: qbinom(prob, n, p)

dbinom(10, 20, 0.4)
pbinom(10, 20, 0.4)
1 - pbinom(14, 20, 0.4)
qbinom(0.85, 20, 0.4)

Uniform

Use punif() when every value between min and max is equally likely.

  • Less than x: punif(x, min = a, max = b)
  • Greater than x: punif(x, min = a, max = b, lower.tail = FALSE)

punif(15.5, min = 12, max = 20)
punif(14, min = 12, max = 20, lower.tail = FALSE)

Normal

  • Less than x: pnorm(x, mean = mu, sd = sigma)
  • Greater than x: pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
  • Between a and b: pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
  • Cutoff or percentile: qnorm(prob, mean = mu, sd = sigma)

pnorm(90, mean = 100, sd = 16)
pnorm(80, mean = 60, sd = 20) - pnorm(50, mean = 60, sd = 20)
qnorm(0.85, mean = 60, sd = 20)

Sampling Distributions, Confidence Intervals, And Sample Size

Sampling Distribution Of The Sample Mean

Use this when the question is about the average of a sample, not one single person.

pnorm(cutoff, mean = mu, sd = sigma / sqrt(n))

Examples:

1 - pnorm(105, mean = 100, sd = 20 / sqrt(50))
pnorm(95, mean = 100, sd = 20 / sqrt(50))

Sampling Distribution Of The Sample Proportion

Use this when the question is about a sample proportion p-hat.

se <- sqrt(p * (1 - p) / n)
pnorm(cutoff, mean = p, sd = se)

Example:

se <- sqrt(0.82 * 0.18 / 100)
pnorm(0.80, mean = 0.82, sd = se)

Confidence Interval For A Mean When Sigma Is Known

Use qnorm() for the critical value. Here sigma is the known population standard deviation given in the question, and qnorm(0.975) is the critical value for a 95% interval.

xbar <- mean(df$col)
me <- qnorm(0.975) * sigma / sqrt(nrow(df))
c(xbar - me, xbar + me)
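As a worked sketch with made-up summary numbers (sample mean 70, known sigma 12, n = 36):

```r
xbar <- 70                      # sample mean (made up)
sigma <- 12                     # known population sd (made up)
n <- 36
me <- qnorm(0.975) * sigma / sqrt(n)
c(xbar - me, xbar + me)         # about (66.08, 73.92)
```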

Confidence Interval For A Mean When Sigma Is Unknown

The easiest method is usually t.test().

t.test(df$col)$conf.int

If you only need one endpoint, index the interval:

t.test(df$col)$conf.int[1]   # lower bound
t.test(df$col)$conf.int[2]   # upper bound
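If the question asks for a level other than 95%, pass conf.level (df$col again stands for whichever column you need):

```r
# 99% confidence interval for the mean of a column
t.test(df$col, conf.level = 0.99)$conf.int
```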

Confidence Interval For A Proportion

phat <- x / n
me <- qnorm(0.975) * sqrt(phat * (1 - phat) / n)
c(phat - me, phat + me)

Example:

phat <- 20 / 80
me <- qnorm(0.975) * sqrt(phat * (1 - phat) / 80)
c(phat - me, phat + me)

Required Sample Size

Here E is the required margin of error. Match the quantile to the confidence level the question states.

Mean with known sigma (95% confidence, hence qnorm(0.975)):

ceiling((qnorm(0.975) * sigma / E)^2)

Proportion with no prior estimate (99% confidence, hence qnorm(0.995)):

ceiling((qnorm(0.995)^2 * 0.25) / E^2)

Always round up.
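Worked examples with made-up inputs (sigma = 15 and E = 2 at 95% confidence for the mean; E = 0.04 at 99% confidence for the proportion):

```r
ceiling((qnorm(0.975) * 15 / 2)^2)          # 217
ceiling((qnorm(0.995)^2 * 0.25) / 0.04^2)   # 1037
```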

Hypothesis Tests For Means And Proportions

If the question can be answered from a full built-in test output, print the full output first. That is usually the safest starting point.

One Mean, Sigma Known

This is a z test. You usually compute the z statistic by hand.

z <- (mean(df$col) - mu0) / (sigma / sqrt(nrow(df)))

Then turn z into a p-value, picking the line that matches the tail:

pnorm(z)                                # left-tailed
pnorm(z, lower.tail = FALSE)            # right-tailed
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided
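A worked sketch with made-up numbers (sample mean 202, mu0 = 200, sigma = 10, n = 64, right-tailed):

```r
z <- (202 - 200) / (10 / sqrt(64))   # z = 1.6
pnorm(z, lower.tail = FALSE)         # about 0.055
```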

One Mean, Sigma Unknown

Use a one-sample t test.

t.test(df$col, mu = mu0, alternative = "two.sided")

Change the alternative setting if the question is one-sided.

Two Independent Means

Start by splitting the numeric variable into two groups.

with_tutoring <- df$grade[df$tut == 1]
without_tutoring <- df$grade[df$tut == 0]

If the question says equal variances:

t.test(with_tutoring, without_tutoring, var.equal = TRUE, alternative = "greater")

If the question does not assume equal variances:

t.test(with_tutoring, without_tutoring, var.equal = FALSE, alternative = "less")

In both calls, set alternative to match the claim, as in the tail guide above.

If the null difference is not 0, put that value in mu.

t.test(with_tutoring, without_tutoring, mu = 2.5, var.equal = TRUE)

If the question asks for the critical t value for a right-tailed test at the 5% level, use:

qt(0.05, df = degrees_of_freedom, lower.tail = FALSE)

For a two-sided test at the 5% level, use 0.025 in place of 0.05.

Paired Before/After Data

Use a paired t test when the same people, cars, or units are measured twice.

t.test(df$After, df$Before, paired = TRUE)

One Proportion

This is usually a z test done by hand.

phat <- mean(df$tut == 1)
z <- (phat - p0) / sqrt(p0 * (1 - p0) / nrow(df))

Then use pnorm() for the p-value.
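A worked sketch with made-up counts (60 successes out of n = 200, null p0 = 0.25, right-tailed):

```r
phat <- 60 / 200                               # sample proportion (made up)
z <- (phat - 0.25) / sqrt(0.25 * 0.75 / 200)   # about 1.63
pnorm(z, lower.tail = FALSE)                   # about 0.051
```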

Difference Between Two Proportions

If the null difference is 0, use the pooled proportion.

phat1 <- x1 / n1
phat2 <- x2 / n2
ppool <- (x1 + x2) / (n1 + n2)
z <- (phat1 - phat2) / sqrt(ppool * (1 - ppool) * (1 / n1 + 1 / n2))

If the null difference is a nonzero value such as 0.10, keep that value in the numerator.

phat1 <- x1 / n1
phat2 <- x2 / n2
z <- ((phat1 - phat2) - 0.10) / sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)

Then use pnorm() for the p-value.
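A worked pooled sketch with made-up counts (x1 = 45, n1 = 100, x2 = 30, n2 = 100, two-sided):

```r
phat1 <- 45 / 100
phat2 <- 30 / 100
ppool <- (45 + 30) / (100 + 100)                                     # 0.375
z <- (phat1 - phat2) / sqrt(ppool * (1 - ppool) * (1/100 + 1/100))   # about 2.19
2 * pnorm(abs(z), lower.tail = FALSE)                                # about 0.028
```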

Variance Questions

One Population Variance

Use var() to get the sample variance. Here sigma0_sq is the variance claimed by the null hypothesis.

x <- df$col
stat <- (length(x) - 1) * var(x) / sigma0_sq

For a two-sided p-value:

2 * min(
  pchisq(stat, df = length(x) - 1),
  1 - pchisq(stat, df = length(x) - 1)
)

For a confidence interval for the variance:

lower <- (length(x) - 1) * var(x) / qchisq(0.975, df = length(x) - 1)
upper <- (length(x) - 1) * var(x) / qchisq(0.025, df = length(x) - 1)
c(lower, upper)
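A worked sketch of the variance interval with made-up summary numbers (n = 25, sample variance 4.8, 95% level):

```r
n <- 25
s2 <- 4.8                                          # sample variance (made up)
lower <- (n - 1) * s2 / qchisq(0.975, df = n - 1)
upper <- (n - 1) * s2 / qchisq(0.025, df = n - 1)
c(lower, upper)                                    # roughly (2.93, 9.29)
```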

Ratio Of Two Variances

If the question asks for the full test output, use:

var.test(df$Town2, df$Town1, alternative = "greater")

If the question asks for a confidence interval for the variance ratio, where s1_sq and s2_sq are the two sample variances and df1 = n1 - 1, df2 = n2 - 1:

ratio <- s1_sq / s2_sq
lower <- ratio / qf(0.975, df1, df2)
upper <- ratio * qf(0.975, df2, df1)
c(lower, upper)

Chi-Square Tests And Normality

Goodness-Of-Fit

Use this when one sample is compared with a claimed distribution.

observed <- c(153, 163, 139, 20, 20, 5)
expected_prop <- c(0.357, 0.314, 0.229, 0.043, 0.043, 0.014)
chisq.test(observed, p = expected_prop)

Independence In A Contingency Table

Build the table first, then test it.

tab <- table(df$passed_exam, df$book)
chisq.test(tab)

Jarque-Bera Normality Check

Use it to check whether a variable, or a set of regression residuals, looks approximately normal; a small p-value is evidence against normality.

If you loaded library(tseries) at the top, you can write jarque.bera.test(...). If not, use tseries::jarque.bera.test(...).

For one variable:

tseries::jarque.bera.test(df$col)

For regression residuals:

grade_model <- lm(grade ~ prep + tut + book, data = df)
tseries::jarque.bera.test(resid(grade_model))

Correlation And Regression

Correlation

Use cor() when you only need the sample correlation coefficient.

cor(df$Age, df$Happiness)

Use cor.test() when the question asks whether the correlation is statistically significant.

cor.test(df$Age, df$Happiness, conf.level = 0.99)

If you want a quick correlation matrix for several numeric variables, use:

cor(df[, c("grade", "tut", "book", "prep")])

If you want a quick correlation plot matrix and you loaded PerformanceAnalytics, you can also use:

PerformanceAnalytics::chart.Correlation(df[, c("grade", "tut", "book", "prep")])

Simple Regression

Use lm() to fit the model. Wrap it in summary() when the question asks for coefficients, p-values, the F-test, or R-squared. Pick one clear name for the fitted model and keep reusing it.

salary_model <- lm(Salary ~ Education, data = df)
summary(salary_model)

For a prediction:

predict(salary_model, data.frame(Education = 7))

For coefficient confidence intervals:

confint(salary_model)

Multiple Regression

score_model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
summary(score_model)

Use this setup when several explanatory variables are used at the same time.

Dummy Variables

A dummy variable is a 0/1 group variable.

wage_model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
summary(wage_model)
predict(wage_model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))

Interactions

Use * when the effect of one variable may change across groups. In R, Income * Urban expands to Income + Urban + Income:Urban, so both main effects and the interaction term are included.

consumption_model <- lm(Consumption ~ Income * Urban, data = df)
summary(consumption_model)
predict(consumption_model, data.frame(Income = 75000, Urban = 1))

Residual Checks

If the question asks about heteroskedasticity or changing variability, use residual plots.

healthy_model <- lm(Healthy ~ FV + Exercise + Smoke, data = df)
r <- resid(healthy_model)
plot(r ~ df$FV, xlab = "FV", ylab = "Residuals")
abline(h = 0)

Then repeat the same idea for the other explanatory variables in the model.

Short Memory Rule

  • One variable to describe -> mean(), median(), sd(), IQR(), range(), hist(), boxplot()
  • Probability question -> d..., p..., or q...
  • Mean CI or mean test -> usually t.test()
  • Proportion test -> build z, then use pnorm()
  • Goodness-of-fit or independence -> chisq.test()
  • Correlation -> cor() or cor.test()
  • Regression -> summary(lm(...))
  • Regression prediction -> predict(...)
  • Regression interval -> confint(...)
  • Residual normality -> tseries::jarque.bera.test(resid(my_model))

<- Back to main page