Exercise List 8 - Interactive

Correlation and regression

Hey you :)

This list covers correlation and regression. Take it one step at a time:

  • return the full model output when a task asks for more than one result
  • read carefully whether a task asks for an equation, an interpretation, or a prediction
  • keep variable names exactly as they appear in the dataset

Packages used on this page: readxl.
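Every dataset note below assumes the relevant .xlsx file has already been loaded into df with readxl. A minimal loading sketch, using the first exercise's file as the example (the path is an assumption; adjust it to wherever your copy of the file lives):

```r
library(readxl)  # read_excel() reads .xlsx files into a data frame (tibble)

# Assumed local path -- change this to the actual location of the file.
df <- read_excel("Happiness_Age.xlsx")
head(df)  # quick sanity check of the column names and first rows
```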

Quick guide: which method do I need?

Correlation

  • Use cor(x, y) for the sample correlation coefficient
  • Use cor.test(x, y) when the task asks about statistical significance
  • A scatterplot helps you check whether the relationship looks roughly linear

Linear regression

  • Use lm(y ~ x, data = df) for simple regression
  • Use lm(y ~ x1 + x2 + ..., data = df) for multiple regression
  • Use summary(...) to see the coefficients, standard errors, p-values, and goodness-of-fit measures
  • Use predict(...) for fitted values at specific inputs

14.1 Hypothesis Test for the Correlation Coefficient

Exercise 10 (Happiness_Age)

Many attempts have been made to relate happiness with various factors. One such study relates happiness with age and finds that, holding everything else constant, people are least happy when they are in their mid-40s. The accompanying table shows a portion of data on a respondent’s age and his/her perception of well-being on a scale from 0 to 100.

  • a. Calculate and interpret the sample correlation coefficient between age and happiness.
  • b. Is the correlation coefficient statistically significant at the 1% level?
  • c. Construct a scatterplot to point out a flaw with this correlation analysis.

Quick dataset note: in the code cells below, the file Happiness_Age.xlsx is loaded into df. It has the columns Respondent, Happiness, and Age.

Exercise 10a

Calculate the sample correlation coefficient between age and happiness.

Use cor(...) on the Age and Happiness columns.

The sample correlation coefficient measures the strength and direction of the linear relationship.

cor(df$Age, df$Happiness)

Exercise 10b

Test whether the correlation is statistically significant at the 1% level and return the full output.

Use cor.test(...). Set conf.level = 0.99 so the output matches the 1% significance level context.

A significance test produces several useful results at once (test statistic, p-value, confidence interval), so return the full cor.test(...) output.

cor.test(df$Age, df$Happiness, conf.level = 0.99)

Exercise 10c

What is the correct conclusion at the 1% level?

Compare the p-value from 10b with 0.01.

Correct choice: the first option.

The p-value is about 0.0035, which is below 0.01. So the correlation is statistically significant at the 1% level.
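The comparison with 0.01 can also be done in code by pulling the p-value out of the cor.test(...) object. A minimal sketch on simulated data (an assumption; with the real data, x and y would be df$Age and df$Happiness):

```r
# Simulated stand-in for the Age/Happiness columns -- the real exercise
# would use df$Age and df$Happiness instead.
set.seed(1)
x <- 1:30
y <- -0.5 * x + rnorm(30, sd = 3)

ct <- cor.test(x, y, conf.level = 0.99)
ct$p.value          # the p-value used in the decision
ct$p.value < 0.01   # TRUE means: significant at the 1% level
```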

Exercise 10d

Construct the scatterplot.

Use plot(...) first, then add a smooth curve with lines(lowess(...)).

A scatterplot helps you check whether the relationship actually looks linear.

plot(df$Age, df$Happiness,
     xlab = "Age", ylab = "Happiness",
     pch = 20, xlim = c(0, 100), ylim = c(0, 100))
lines(lowess(df$Age, df$Happiness), col = "blue")

Exercise 10e

What flaw does the scatterplot suggest?

Compare the straight-line idea from the correlation coefficient with the shape suggested by the smooth curve.

Correct choice: the second option.

The scatterplot suggests the relationship is not well described by one straight line. That is the main flaw in using only the simple correlation here.
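To see why this matters, here is a sketch on simulated U-shaped data (an assumption, mimicking the "least happy in the mid-40s" pattern the exercise describes): the simple correlation comes out close to zero even though the relationship is strong, while a model with a squared term captures it.

```r
# Simulated U-shaped relationship with a minimum near age 45 -- an
# assumption standing in for the real Happiness_Age data.
set.seed(5)
Age <- runif(100, 20, 70)
Happiness <- 60 + 0.05 * (Age - 45)^2 + rnorm(100, sd = 2)

cor(Age, Happiness)                      # close to zero despite a clear pattern
summary(lm(Happiness ~ Age + I(Age^2)))  # the quadratic term captures the shape
```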

14.2 The Linear Regression Model

Exercise 27 (Education)

A social scientist would like to analyze the relationship between educational attainment (in years of higher education) and annual salary (in $1,000s). He collects data on 20 individuals. A portion of the data is as follows.

  • a. Find the sample regression equation for the model: Salary = β0 + β1 Education + ε
  • b. Interpret the coefficient for Education
  • c. What is the predicted salary for an individual who completed 7 years of higher education?

Quick dataset note: in the code cells below, the file Education.xlsx is loaded into df. It has the columns Salary and Education.

Exercise 27a

Estimate the model and return the full regression output.

Use lm(Salary ~ Education, data = df) and wrap it in summary(...).

A simple linear regression output gives you the coefficients you need for the sample equation, plus the other regression details.

summary(lm(Salary ~ Education, data = df))

Exercise 27b

Which sample regression equation matches the output?

Read the intercept and the coefficient on Education from the regression output in 27a.

Correct choice: the first option.

The intercept is about 21.9508 and the slope on Education is about 10.8516.

Exercise 27c

Choose the best interpretation of the coefficient for Education.

The slope tells you the predicted change in salary for one extra unit of Education, holding the rest of the model fixed.

Correct choice: the first option.

A one-year increase in higher education is associated with an increase of about 10.85 thousand dollars in predicted annual salary.
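In a linear model this interpretation is exact: the slope equals the difference between two predictions one unit apart. A sketch on simulated data (an assumption; the real model would be lm(Salary ~ Education, data = df)):

```r
# Simulated stand-in for the Education/Salary data.
set.seed(2)
Education <- rep(0:9, 2)
Salary <- 22 + 10.85 * Education + rnorm(20, sd = 5)
model <- lm(Salary ~ Education)

# The slope is exactly the gap between predictions one unit apart.
slope <- coef(model)["Education"]
p7 <- predict(model, data.frame(Education = 7))
p8 <- predict(model, data.frame(Education = 8))
all.equal(unname(p8 - p7), unname(slope))  # TRUE
```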

Exercise 27d

What is the predicted salary for an individual who completed 7 years of higher education?

Fit the same model as in 27a, then use predict(...) with a small data frame where Education = 7.

After fitting the simple regression, use predict(...) for the new Education value.

model <- lm(Salary ~ Education, data = df)
predict(model, data.frame(Education = 7))

Exercise 37 (MCAS)

Education reform is one of the most hotly debated subjects on both state and national policy makers’ list of socioeconomic topics. Consider a linear regression model that relates school expenditures and family background to student performance in Massachusetts using 224 school districts. The response variable is the mean score on the MCAS exam given to 10th graders. Four explanatory variables are used: (1) STR is the student-to-teacher ratio, (2) TSAL is the average teacher’s salary in $1,000s, (3) INC is the median household income in $1,000s, and (4) SGL is the percentage of single-parent households.

  • a. For each explanatory variable, discuss whether it is likely to have a positive or negative influence on Score.
  • b. Find the sample equation. Are the signs of the slope coefficients as expected?
  • c. What is the predicted score if STR = 18, TSAL = 50, INC = 60, and SGL = 5?
  • d. What is the predicted score if everything else is the same as in part (c) except INC = 80?

Quick dataset note: in the code cells below, the file MCAS.xlsx is loaded into df. It has the columns SCORE, STR, TSAL, INC, and SGL.

Exercise 37a

Which sign pattern matches the economic intuition before fitting the model?

Think through each variable one by one: larger class size, higher teacher salary, higher household income, and more single-parent households.

Correct choice: the first option.

A larger student-to-teacher ratio and more single-parent households are expected to lower scores. Higher teacher salary and higher household income are expected to raise scores.

Exercise 37b

Estimate the model and return the full regression output.

Use lm(SCORE ~ STR + TSAL + INC + SGL, data = df) and wrap it in summary(...).

This multiple regression output gives the sample equation and lets you compare the signs with your expectations.

summary(lm(SCORE ~ STR + TSAL + INC + SGL, data = df))

Exercise 37c

Which fitted coefficient sign is not as expected?

Compare the signs you expected in 37a with the signs in the regression output from 37b.

Correct choice: the second option.

The TSAL coefficient is negative in the fitted model, even though the economic intuition suggested a positive sign.

Exercise 37d

What is the predicted score if STR = 18, TSAL = 50, INC = 60, and SGL = 5?

Fit the model from 37b, then use predict(...) with the four specified input values.

Use the fitted regression model and then plug the new values into predict(...).

model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
predict(model, data.frame(STR = 18, TSAL = 50, INC = 60, SGL = 5))

Exercise 37e

What is the predicted score if everything else is the same as in 37d except INC = 80?

Use the same model and the same values as in 37d, but change only INC from 60 to 80.

Only the INC value changes, so you can reuse the same fitted model from 37d.

model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
predict(model, data.frame(STR = 18, TSAL = 50, INC = 80, SGL = 5))
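Because the model is linear, the gap between the 37e and 37d predictions is exactly (80 − 60) times the INC coefficient. A sketch on simulated data (an assumption; the real model would be fit on df):

```r
# Simulated stand-in for the MCAS data -- coefficients and ranges are
# assumptions, chosen only to make the linearity point.
set.seed(3)
n <- 224
sim <- data.frame(STR = runif(n, 12, 25), TSAL = runif(n, 35, 65),
                  INC = runif(n, 30, 90), SGL = runif(n, 0, 30))
sim$SCORE <- 260 - 0.6 * sim$STR - 0.2 * sim$TSAL +
             0.5 * sim$INC - 0.4 * sim$SGL + rnorm(n, sd = 5)
model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = sim)

p60 <- predict(model, data.frame(STR = 18, TSAL = 50, INC = 60, SGL = 5))
p80 <- predict(model, data.frame(STR = 18, TSAL = 50, INC = 80, SGL = 5))

# The prediction gap equals 20 times the INC coefficient.
all.equal(unname(p80 - p60), unname(20 * coef(model)["INC"]))  # TRUE
```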

14.3 Goodness-of-Fit Measures

Exercise 54 (Car_Prices)

The accompanying data file shows the selling price of a used sedan, its age, and its mileage.

Estimate two models:

  • Model 1: Price = β0 + β1 Age + ε
  • Model 2: Price = β0 + β1 Age + β2 Mileage + ε

Which model provides a better fit for y? Justify your response with two goodness-of-fit measures.

Quick dataset note: in the code cells below, the file Car_Prices.xlsx is loaded into df. It has the columns Price, Age, and Mileage.

Exercise 54a

Estimate Model 1 and return the full regression output.

Model 1 uses Age only.

Return the full regression output first so you can compare the fit measures afterward.

summary(lm(Price ~ Age, data = df))

Exercise 54b

Estimate Model 2 and return the full regression output.

Model 2 uses both Age and Mileage.

Model 2 adds Mileage, so return its full summary too before comparing fit.

summary(lm(Price ~ Age + Mileage, data = df))

Exercise 54c

Which model provides the better fit?

Compare the residual standard error and the adjusted R^2 from the two summaries.

Correct choice: the second option.

Model 2 has the smaller residual standard error and the larger adjusted R^2, so it fits better.

Exercise 54d

Which two measures best justify that choice?

Look for one measure where lower is better and one where higher is better.

Correct choice: the first option.

For this comparison, Model 2 has the smaller residual standard error and the larger adjusted R^2.
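Both measures can be pulled directly out of the summary objects instead of read off the printed output. A sketch on simulated data (an assumption; the real comparison would use Price ~ Age and Price ~ Age + Mileage on df):

```r
# Simulated stand-in for the Car_Prices data, where Mileage genuinely
# adds information beyond Age.
set.seed(4)
n <- 40
Age <- runif(n, 1, 10)
Mileage <- 12 * Age + rnorm(n, sd = 8)
Price <- 30 - 1.5 * Age - 0.1 * Mileage + rnorm(n, sd = 1)

m1 <- summary(lm(Price ~ Age))            # Model 1
m2 <- summary(lm(Price ~ Age + Mileage))  # Model 2

# The two goodness-of-fit measures, side by side.
c(se1 = m1$sigma, se2 = m2$sigma)                      # lower is better
c(adjR1 = m1$adj.r.squared, adjR2 = m2$adj.r.squared)  # higher is better
```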