Exercise List 8 - Interactive

Correlation and regression

Hey you :)

This list covers correlation and regression. Take it one step at a time:

  • return the full model output when a task asks for more than one result
  • read carefully whether a task asks for an equation, an interpretation, or a prediction
  • keep variable names exactly as they appear in the dataset

Packages used on this page: readxl.
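Every dataset note below assumes the relevant .xlsx file has already been loaded into df with readxl. A minimal loading sketch, using the first exercise's file as the example (the path is an assumption; adjust it to wherever your copy of the file lives):

```r
library(readxl)  # read_excel() reads .xlsx files into a data frame (tibble)

# Assumed local path -- change this to the actual location of the file.
df <- read_excel("Happiness_Age.xlsx")
head(df)  # quick sanity check of the column names and first rows
```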

Quick guide: which method do I need?

Correlation

  • Use cor(x, y) for the sample correlation coefficient
  • Use cor.test(x, y) when the task asks about statistical significance
  • A scatterplot helps you check whether the relationship looks roughly linear

Linear regression

  • Use lm(y ~ x, data = df) for simple regression
  • Use lm(y ~ x1 + x2 + ..., data = df) for multiple regression
  • Use summary(...) to see the coefficients, standard errors, p-values, and goodness-of-fit measures
  • Use predict(...) for fitted values at specific inputs

14.1 Hypothesis Test for the Correlation Coefficient

Exercise 10 (Happiness_Age)

Many attempts have been made to relate happiness with various factors. One such study relates happiness with age and finds that, holding everything else constant, people are least happy when they are in their mid-40s. The accompanying table shows a portion of data on a respondent’s age and his/her perception of well-being on a scale from 0 to 100.

  • a. Calculate and interpret the sample correlation coefficient between age and happiness.
  • b. Is the correlation coefficient statistically significant at the 1% level?
  • c. Construct a scatterplot to point out a flaw with this correlation analysis.

Quick dataset note: in the code cells below, the file Happiness_Age.xlsx is loaded into df. It has the columns Respondent, Happiness, and Age.

Exercise 10a

Calculate the sample correlation coefficient between age and happiness.

Use cor(...) on the Age and Happiness columns.

The sample correlation coefficient measures the strength and direction of the linear relationship.

cor(df$Age, df$Happiness)

Exercise 10b

Test whether the correlation is statistically significant at the 1% level and return the full output.

Use cor.test(...). Set conf.level = 0.99 so the output matches the 1% significance level context.

A significance test produces several useful results at once (test statistic, p-value, confidence interval), so return the full cor.test(...) output.

cor.test(df$Age, df$Happiness, conf.level = 0.99)

Exercise 10c

What is the correct conclusion at the 1% level?

Compare the p-value from 10b with 0.01.

Correct choice: the first option.

The p-value is about 0.0035, which is below 0.01. So the correlation is statistically significant at the 1% level.
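The comparison with 0.01 can also be done in code by pulling the p-value out of the cor.test(...) object. A minimal sketch on simulated data (an assumption; with the real data, x and y would be df$Age and df$Happiness):

```r
# Simulated stand-in for the Age/Happiness columns -- the real exercise
# would use df$Age and df$Happiness instead.
set.seed(1)
x <- 1:30
y <- -0.5 * x + rnorm(30, sd = 3)

ct <- cor.test(x, y, conf.level = 0.99)
ct$p.value          # the p-value used in the decision
ct$p.value < 0.01   # TRUE means: significant at the 1% level
```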

Exercise 10d

Construct the scatterplot.

Use plot(...) first, then add a smooth curve with lines(lowess(...)).

A scatterplot helps you check whether the relationship actually looks linear.

plot(df$Age, df$Happiness,
     xlab = "Age", ylab = "Happiness",
     pch = 20, xlim = c(0, 100), ylim = c(0, 100))
lines(lowess(df$Age, df$Happiness), col = "blue")

Exercise 10e

What flaw does the scatterplot suggest?

Compare the straight-line idea from the correlation coefficient with the shape suggested by the smooth curve.

Correct choice: the second option.

The scatterplot suggests the relationship is not well described by one straight line. That is the main flaw in using only the simple correlation here.
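To see why this matters, here is a sketch on simulated U-shaped data (an assumption, mimicking the "least happy in the mid-40s" pattern the exercise describes): the simple correlation comes out close to zero even though the relationship is strong, while a model with a squared term captures it.

```r
# Simulated U-shaped relationship with a minimum near age 45 -- an
# assumption standing in for the real Happiness_Age data.
set.seed(5)
Age <- runif(100, 20, 70)
Happiness <- 60 + 0.05 * (Age - 45)^2 + rnorm(100, sd = 2)

cor(Age, Happiness)                      # close to zero despite a clear pattern
summary(lm(Happiness ~ Age + I(Age^2)))  # the quadratic term captures the shape
```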

14.2 The Linear Regression Model

Exercise 27 (Education)

A social scientist would like to analyze the relationship between educational attainment (in years of higher education) and annual salary (in $1,000s). He collects data on 20 individuals. A portion of the data is as follows.

  • a. Find the sample regression equation for the model: Salary = β0 + β1 Education + ε
  • b. Interpret the coefficient for Education
  • c. What is the predicted salary for an individual who completed 7 years of higher education?

Quick dataset note: in the code cells below, the file Education.xlsx is loaded into df. It has the columns Salary and Education.

Exercise 27a

Estimate the model and return the full regression output.

Use lm(Salary ~ Education, data = df) and wrap it in summary(...).

A simple linear regression output gives you the coefficients you need for the sample equation, plus the other regression details.

summary(lm(Salary ~ Education, data = df))

Exercise 27b

Which sample regression equation matches the output?

Read the intercept and the coefficient on Education from the regression output in 27a.

Correct choice: the first option.

The intercept is about 21.9508 and the slope on Education is about 10.8516.

Exercise 27c

Choose the best interpretation of the coefficient for Education.

The slope tells you the predicted change in salary for one extra unit of Education, holding the rest of the model fixed.

Correct choice: the first option.

A one-year increase in higher education is associated with an increase of about 10.85 thousand dollars in predicted annual salary.
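In a linear model this interpretation is exact: the slope equals the difference between two predictions one unit apart. A sketch on simulated data (an assumption; the real model would be lm(Salary ~ Education, data = df)):

```r
# Simulated stand-in for the Education/Salary data.
set.seed(2)
Education <- rep(0:9, 2)
Salary <- 22 + 10.85 * Education + rnorm(20, sd = 5)
model <- lm(Salary ~ Education)

# The slope is exactly the gap between predictions one unit apart.
slope <- coef(model)["Education"]
p7 <- predict(model, data.frame(Education = 7))
p8 <- predict(model, data.frame(Education = 8))
all.equal(unname(p8 - p7), unname(slope))  # TRUE
```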

Exercise 27d

What is the predicted salary for an individual who completed 7 years of higher education?

Fit the same model as in 27a, then use predict(...) with a small data frame where Education = 7.

After fitting the simple regression, use predict(...) for the new Education value.

model <- lm(Salary ~ Education, data = df)
predict(model, data.frame(Education = 7))

Exercise 37 (MCAS)

Education reform is one of the most hotly debated subjects on both state and national policy makers’ list of socioeconomic topics. Consider a linear regression model that relates school expenditures and family background to student performance in Massachusetts using 224 school districts. The response variable is the mean score on the MCAS exam given to 10th graders. Four explanatory variables are used: (1) STR is the student-to-teacher ratio, (2) TSAL is the average teacher’s salary in $1,000s, (3) INC is the median household income in $1,000s, and (4) SGL is the percentage of single-parent households.

  • a. For each explanatory variable, discuss whether it is likely to have a positive or negative influence on Score.
  • b. Find the sample equation. Are the signs of the slope coefficients as expected?
  • c. What is the predicted score if STR = 18, TSAL = 50, INC = 60, and SGL = 5?
  • d. What is the predicted score if everything else is the same as in part (c) except INC = 80?

Quick dataset note: in the code cells below, the file MCAS.xlsx is loaded into df. It has the columns SCORE, STR, TSAL, INC, and SGL.

Exercise 37a

Which sign pattern matches the economic intuition before fitting the model?

Think through each variable one by one: larger class size, higher teacher salary, higher household income, and more single-parent households.

Correct choice: the first option.

A larger student-to-teacher ratio and more single-parent households are expected to lower scores. Higher teacher salary and higher household income are expected to raise scores.

Exercise 37b

Estimate the model and return the full regression output.

Use lm(SCORE ~ STR + TSAL + INC + SGL, data = df) and wrap it in summary(...).

This multiple regression output gives the sample equation and lets you compare the signs with your expectations.

summary(lm(SCORE ~ STR + TSAL + INC + SGL, data = df))

Exercise 37c

Which fitted coefficient sign is not as expected?

Compare the signs you expected in 37a with the signs in the regression output from 37b.

Correct choice: the second option.

The TSAL coefficient is negative in the fitted model, even though the economic intuition suggested a positive sign.

Exercise 37d

What is the predicted score if STR = 18, TSAL = 50, INC = 60, and SGL = 5?

Fit the model from 37b, then use predict(...) with the four specified input values.

Use the fitted regression model and then plug the new values into predict(...).

model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
predict(model, data.frame(STR = 18, TSAL = 50, INC = 60, SGL = 5))

Exercise 37e

What is the predicted score if everything else is the same as in 37d except INC = 80?

Use the same model and the same values as in 37d, but change only INC from 60 to 80.

Only the INC value changes, so you can reuse the same fitted model from 37d.

model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = df)
predict(model, data.frame(STR = 18, TSAL = 50, INC = 80, SGL = 5))
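Because the model is linear, the gap between the 37e and 37d predictions is exactly (80 − 60) times the INC coefficient. A sketch on simulated data (an assumption; the real model would be fit on df):

```r
# Simulated stand-in for the MCAS data -- coefficients and ranges are
# assumptions, chosen only to make the linearity point.
set.seed(3)
n <- 224
sim <- data.frame(STR = runif(n, 12, 25), TSAL = runif(n, 35, 65),
                  INC = runif(n, 30, 90), SGL = runif(n, 0, 30))
sim$SCORE <- 260 - 0.6 * sim$STR - 0.2 * sim$TSAL +
             0.5 * sim$INC - 0.4 * sim$SGL + rnorm(n, sd = 5)
model <- lm(SCORE ~ STR + TSAL + INC + SGL, data = sim)

p60 <- predict(model, data.frame(STR = 18, TSAL = 50, INC = 60, SGL = 5))
p80 <- predict(model, data.frame(STR = 18, TSAL = 50, INC = 80, SGL = 5))

# The prediction gap equals 20 times the INC coefficient.
all.equal(unname(p80 - p60), unname(20 * coef(model)["INC"]))  # TRUE
```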

14.3 Goodness-of-Fit Measures

Exercise 54 (Car_Prices)

The accompanying data file shows the selling price of a used sedan, its age, and its mileage.

Estimate two models:

  • Model 1: Price = β0 + β1 Age + ε
  • Model 2: Price = β0 + β1 Age + β2 Mileage + ε

Which model provides a better fit for y? Justify your response with two goodness-of-fit measures.

Quick dataset note: in the code cells below, the file Car_Prices.xlsx is loaded into df. It has the columns Price, Age, and Mileage.

Exercise 54a

Estimate Model 1 and return the full regression output.

Model 1 uses Age only.

Return the full regression output first so you can compare the fit measures afterward.

summary(lm(Price ~ Age, data = df))

Exercise 54b

Estimate Model 2 and return the full regression output.

Model 2 uses both Age and Mileage.

Model 2 adds Mileage, so return its full summary too before comparing fit.

summary(lm(Price ~ Age + Mileage, data = df))

Exercise 54c

Which model provides the better fit?

Compare the residual standard error and the adjusted R^2 from the two summaries.

Correct choice: the second option.

Model 2 has the smaller residual standard error and the larger adjusted R^2, so it fits better.

Exercise 54d

Which two measures best justify that choice?

Look for one measure where lower is better and one where higher is better.

Correct choice: the first option.

For this comparison, Model 2 has the smaller residual standard error and the larger adjusted R^2.
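Both measures can be pulled directly out of the summary objects instead of read off the printed output. A sketch on simulated data (an assumption; the real comparison would use Price ~ Age and Price ~ Age + Mileage on df):

```r
# Simulated stand-in for the Car_Prices data, where Mileage genuinely
# adds information beyond Age.
set.seed(4)
n <- 40
Age <- runif(n, 1, 10)
Mileage <- 12 * Age + rnorm(n, sd = 8)
Price <- 30 - 1.5 * Age - 0.1 * Mileage + rnorm(n, sd = 1)

m1 <- summary(lm(Price ~ Age))            # Model 1
m2 <- summary(lm(Price ~ Age + Mileage))  # Model 2

# The two goodness-of-fit measures, side by side.
c(se1 = m1$sigma, se2 = m2$sigma)                      # lower is better
c(adjR1 = m1$adj.r.squared, adjR2 = m2$adj.r.squared)  # higher is better
```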