Exercise List 9 - Interactive
Regression significance, assumptions, and dummy variables
Hey you :)
This list covers regression significance, model assumptions, dummy variables, and interactions. Take it one step at a time:
- return the full regression output when that is the most useful starting point
- separate joint significance, individual significance, and interpretation
- for prediction tasks, plug in exactly the values from the question
- for dummy variables and interactions, keep track of what changes between groups
Packages used on this page: readxl.
Quick guide: which method do I need?
Regression significance
- Use summary(lm(...))
- The overall F-test answers whether the explanatory variables are jointly significant
- The coefficient p-values answer whether each variable is individually significant
Residual checks
- Residual plots help you look for changing variability
- A fanning-out pattern suggests heteroskedasticity
Dummy variables and interactions
- A dummy variable shifts the intercept between groups
- An interaction can change the slope between groups too
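The two bullet points above can be sketched with a tiny simulated example (the names x, group, and y are made up for illustration and do not belong to any exercise on this page):

```r
# Toy illustration with simulated data (x, group, y are invented names)
set.seed(1)
x <- runif(50, 0, 10)
group <- rbinom(50, 1, 0.5)                       # dummy variable: 0 or 1
y <- 2 + 1.5 * x + 3 * group + 0.8 * x * group + rnorm(50)

# Dummy only: the intercept shifts between groups, the slope is shared
summary(lm(y ~ x + group))

# Dummy plus interaction: both the intercept and the slope differ
summary(lm(y ~ x * group))
```

In the second model, the coefficient on x is the slope for the group coded 0, and the interaction coefficient is the extra slope for the group coded 1.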
15.1 Tests of Significance
Exercise 20 (Electricity_Cost)
The facility manager at a pharmaceutical company wants to build a regression model to forecast monthly electricity cost. Three main variables are thought to dictate electricity cost: (1) average outdoor temperature (Temp in °F), (2) working days per month (Days), and (3) tons of product produced (Tons).
a. Estimate the regression model.
b. At the 10% significance level, are the explanatory variables jointly significant? Show the relevant steps of the test.
c. Are the explanatory variables individually significant at the 10% significance level? Show the relevant steps of the test.
Quick dataset note: in the code cells below, the file Electricity_Cost.xlsx is loaded into df. It has the columns Cost, Temp, Days, and Tons.
Exercise 20a
Estimate the model and return the full regression output.
Use lm(Cost ~ Temp + Days + Tons, data = df) and wrap it in summary(...).
Return the full regression output first, because the later questions use both the overall F-test and the individual coefficient table.
summary(lm(Cost ~ Temp + Days + Tons, data = df))
Exercise 20b
At the 10% level, are the explanatory variables jointly significant?
Use the overall F-test p-value from the regression output in 20a.
Correct choice: the first option.
The overall p-value is about 0.0262, which is below 0.10, so the variables are jointly significant.
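If you want that joint-test p-value as a single number rather than reading it off the printout, one way (a sketch, assuming df is loaded as described in the dataset note) is to recompute it from the F statistic stored in the summary object:

```r
# Sketch: extract the overall F-test p-value from the summary object.
# Assumes df holds the Electricity_Cost data as loaded above.
model <- lm(Cost ~ Temp + Days + Tons, data = df)
fstat <- summary(model)$fstatistic     # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```

Compare the resulting p-value with 0.10 to finish the test.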
Exercise 20c
Which explanatory variable is individually significant at the 10% level?
Look at the coefficient p-values in the regression output. Compare each one with 0.10.
Correct choice: the first option.
Only Temp has a p-value below 0.10. Days and Tons do not.
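To check this without scanning the printout, you can pull the coefficient p-values directly out of the coefficient table (a sketch, assuming df is loaded as described in the dataset note):

```r
# Sketch: coefficient p-values from the regression table.
# Assumes df holds the Electricity_Cost data as loaded above.
model <- lm(Cost ~ Temp + Days + Tons, data = df)
coef(summary(model))[, "Pr(>|t|)"]     # compare each value with 0.10
```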
Exercise 22 (Houses)
A realtor examines the factors that influence the price of a house in a suburb outside of Boston, Massachusetts. He collects data on 36 recent house sales (Price) and notes each house’s square footage (Sqft) as well as its number of bedrooms (Beds) and number of bathrooms (Baths).
a. Estimate: Price = β0 + β1 Sqft + β2 Beds + β3 Baths + ε. Show the regression results in a well-formatted table.
b. At the 5% significance level, are the explanatory variables jointly significant in explaining Price?
c. At the 5% significance level, are all explanatory variables individually significant in explaining Price?
d. Estimate the 95% confidence interval for the coefficients of Sqft, Beds, and Baths.
Quick dataset note: in the code cells below, the file Houses.xlsx is loaded into df. It has the columns Price, Sqft, Beds, Baths, and an extra column Col that is not used in the model.
Exercise 22a
Estimate the model and return the full regression output.
Use lm(Price ~ Sqft + Beds + Baths, data = df) and wrap it in summary(...).
Return the full regression output first so you can use it for the joint and individual significance questions.
summary(lm(Price ~ Sqft + Beds + Baths, data = df))
Exercise 22b
At the 5% level, are the explanatory variables jointly significant in explaining Price?
Use the overall F-test p-value from the regression output in 22a.
Correct choice: the first option.
The overall p-value is extremely small, so the variables are jointly significant.
Exercise 22c
Which statement about individual significance is correct at the 5% level?
Check the coefficient p-values for Sqft, Beds, and Baths in the output from 22a.
Correct choice: the second option.
Sqft and Baths are significant at the 5% level, but Beds is not.
Exercise 22d
Return the full 95% confidence interval output.
Fit the same model as in 22a, then use confint(...).
The confidence interval question naturally returns several values, so return the full confint(...) output.
model <- lm(Price ~ Sqft + Beds + Baths, data = df)
confint(model)
Exercise 22e
Which coefficient has a 95% confidence interval that includes 0?
Look at the interval output from 22d and see which interval crosses zero.
Correct choice: the second option.
Only the interval for Beds includes 0.
15.4 Model Assumptions and Common Violations
Exercise 51 (Rental)
Consider the monthly rent (Rent in $) of an apartment as a function of the number of bedrooms (Bed), the number of bathrooms (Bath), and square footage (Sqft).
a. Estimate: Rent = β0 + β1 Bed + β2 Bath + β3 Sqft + ε.
b. Which of the explanatory variables might cause changing variability? Explain.
c. Use residual plots to verify your economic intuition.
Quick dataset note: in the code cells below, the file Rental.xlsx is loaded into df. It has the columns Rent, Bed, Bath, and Sqft.
Exercise 51a
Estimate the model and return the full regression output.
Use lm(Rent ~ Bed + Bath + Sqft, data = df) and wrap it in summary(...).
Return the full regression output first so you can use it for the later residual-plot question.
summary(lm(Rent ~ Bed + Bath + Sqft, data = df))
Exercise 51b
Which explanatory variable is the main candidate for changing variability?
Think about which variable is most naturally linked to the spread of rent values as it gets larger.
Correct choice: the third option.
Larger apartments can vary much more in rent, so Sqft is the most natural candidate for heteroskedasticity.
Exercise 51c
Use residual plots to check that idea.
Make one residual plot against each explanatory variable and then compare the spread.
A residual plot is the standard visual check here. The spread fans out most clearly against Sqft.
model <- lm(Rent ~ Bed + Bath + Sqft, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$Bed, xlab = "Bed", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Bath, xlab = "Bath", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Sqft, xlab = "Sqft", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))
Exercise 51d
Which residual plot shows the clearest fanning-out pattern?
Compare the spread of the residuals as each x-variable increases.
Correct choice: the third option.
The residuals fan out most clearly against Sqft.
Exercise 53 (Healthy_Living)
Healthy living has always been an important goal for any society. Consider a regression model that conjectures that consumption of fruits and vegetables and regular exercise have a positive effect on health, while smoking has a negative effect. The sample consists of the percentages of these variables observed in various states in the United States.
a. Estimate the model Healthy = β0 + β1 FV + β2 Exercise + β3 Smoke + ε.
b. Analyze the data to determine if multicollinearity and changing variability are present.
Quick dataset note: in the code cells below, the file Healthy_Living.xlsx is loaded into df. It has the columns State, Healthy, FV, Exercise, and Smoke.
Exercise 53a
Estimate the model and return the full regression output.
Use lm(Healthy ~ FV + Exercise + Smoke, data = df) and wrap it in summary(...).
Return the full regression output first. You will use it together with the correlation matrix and the residual plots.
summary(lm(Healthy ~ FV + Exercise + Smoke, data = df))
Exercise 53b
Return the correlation matrix for Healthy, FV, Exercise, and Smoke.
Select the four numeric variables and pass them to cor(...).
The correlation matrix helps you judge whether the explanatory variables are so strongly related that multicollinearity looks serious.
cor(df[, c("Healthy", "FV", "Exercise", "Smoke")])
Exercise 53c
Which statement about multicollinearity fits the data best?
Look at the explanatory-variable correlations. Ask whether they are high enough to make multicollinearity look severe.
Correct choice: the second option.
Some correlations are moderate, but none are so extreme that multicollinearity looks severe from this matrix alone.
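If you want a number to back up that judgment, variance inflation factors are the usual follow-up check. Here is a sketch that computes them from auxiliary regressions using only base R (the helper name vif_by_hand is made up for this page; car::vif from the car package would give the same values):

```r
# Sketch: variance inflation factors via auxiliary regressions.
# vif_by_hand is an invented helper name; assumes df holds the
# Healthy_Living data as loaded above.
vif_by_hand <- function(data, vars) {
  sapply(vars, function(v) {
    # Regress each explanatory variable on the others
    aux <- lm(reformulate(setdiff(vars, v), response = v), data = data)
    1 / (1 - summary(aux)$r.squared)   # VIF = 1 / (1 - R^2 of auxiliary fit)
  })
}
vif_by_hand(df, c("FV", "Exercise", "Smoke"))
```

Values near 1 point to little multicollinearity; values around 10 or above are the common warning sign.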
Exercise 53d
Use residual plots to look for changing variability.
Plot the residuals against each explanatory variable and look for a clear fanning-out pattern.
Residual plots are the visual tool for checking changing variability. In this dataset, no one plot gives a very strong fanning-out pattern.
model <- lm(Healthy ~ FV + Exercise + Smoke, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$FV, xlab = "FV", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Exercise, xlab = "Exercise", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Smoke, xlab = "Smoke", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))
Exercise 53e
Which statement about changing variability fits the residual plots best?
Look for a strong increase or decrease in residual spread as the x-values change.
Correct choice: the second option.
The residual plots do not show a strong, clean fanning-out pattern, so changing variability is not clearly established here.
17.1 Dummy Variables
Exercise 9 (Wage)
A researcher wonders whether males get paid more, on average, than females at a large firm. She interviews 50 employees and collects data on each employee’s hourly wage (Wage in $), years of higher education (EDUC), years of experience (EXPER), age (Age), and a Male dummy variable that equals 1 if male, 0 otherwise.
a. Estimate: Wage = β0 + β1 EDUC + β2 EXPER + β3 Age + β4 Male + ε.
b. Predict the hourly wage of a 40-year-old male employee with 10 years of higher education and 5 years of experience. Predict the hourly wage of a 40-year-old female employee with the same qualifications.
c. Interpret the estimated coefficient for Male. Is the Male variable significant at the 5% level? Do the data suggest that sex discrimination exists at this firm?
d. Calculate the 95% confidence interval for the coefficients of EDUC, EXPER, Age, and Male and interpret the results.
Quick dataset note: in the code cells below, the file Wage.xlsx is loaded into df. It has the columns Wage, EDUC, EXPER, Age, and Male.
Exercise 9a
Estimate the model and return the full regression output.
Use lm(Wage ~ EDUC + EXPER + Age + Male, data = df) and wrap it in summary(...).
Return the full regression output first, because later parts need the predictions, the Male coefficient, and the confidence intervals.
summary(lm(Wage ~ EDUC + EXPER + Age + Male, data = df))
Exercise 9b1
Predict the hourly wage of the male employee described in the question.
Use predict(...) with EDUC = 10, EXPER = 5, Age = 40, and Male = 1.
Fit the model first, then predict for the male employee profile.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))
Exercise 9b2
Predict the hourly wage of the female employee with the same qualifications.
Use the same values as in 9b1, but set Male = 0.
Use the same fitted model and change only the dummy variable.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 0))
Exercise 9c
Which statement about the Male coefficient fits the results best?
Use the coefficient estimate and its p-value from the regression output in 9a.
Correct choice: the second option.
The fitted Male coefficient is positive, but its p-value is above 0.05, so it is not statistically significant at the 5% level.
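To isolate just the two numbers this answer relies on (a sketch, assuming df holds the Wage data as loaded above), you can pull the Male row out of the coefficient table:

```r
# Sketch: estimate and p-value for the Male dummy only.
# Assumes df holds the Wage data as loaded above.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
coef(summary(model))["Male", c("Estimate", "Pr(>|t|)")]
```

The estimate is the average wage difference between male and female employees with identical EDUC, EXPER, and Age; the p-value tells you whether that difference is statistically significant at the 5% level.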
Exercise 9d
Return the full 95% confidence interval output.
Fit the same model as in 9a, then use confint(...).
The confidence interval output helps you see which coefficients are clearly different from zero.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
confint(model)
Exercise 9e
Which coefficient 95% confidence interval includes 0?
Check the interval output from 9d and look for intervals that cross zero.
Correct choice: the third option.
The intervals for both Male and Age include zero.
17.2 Interactions with Dummy Variables
Exercise 22 (Urban)
The accompanying data file shows consumption expenditures of families in the United States (Consumption in $), family income (Income in $), and whether or not the family lives in an urban or rural community (Urban = 1 if urban, 0 otherwise).
a. Estimate: Consumption = β0 + β1 Income + ε. Compute the predicted consumption expenditures of a family with income of $75,000.
b. Include the dummy variable Urban to predict consumption for a family with income of $75,000 in urban and rural communities.
c. Include the dummy variable Urban and an interaction variable (Income × Urban) to predict consumption for a family with income of $75,000 in urban and rural communities.
d. Which of the preceding models is most suitable for the data? Explain.
e. Calculate the 95% confidence interval for the coefficients of Income and Urban and interpret the results.
Quick dataset note: in the code cells below, the file Urban.xlsx is loaded into df. It has the columns Consumption, Income, and Urban.
Exercise 22a1
Estimate Model 1 and return the full regression output.
Model 1 uses Income only.
Return the full regression output first, then use it for the prediction in the next step.
summary(lm(Consumption ~ Income, data = df))
Exercise 22a2
Predict consumption for a family with income of $75,000 using Model 1.
Fit Model 1 first, then use predict(...) with Income = 75000.
Use the fitted simple regression for the new income value.
model1 <- lm(Consumption ~ Income, data = df)
predict(model1, data.frame(Income = 75000))
Exercise 22b1
Estimate Model 2 and return the full regression output.
Model 2 adds the dummy variable Urban but no interaction.
Return the full output first, then use the model for the two predictions in the next steps.
summary(lm(Consumption ~ Income + Urban, data = df))
Exercise 22b2
Using Model 2, predict consumption for a rural family with income of $75,000.
For a rural family, set Urban = 0.
Use the fitted Model 2 and set the dummy variable to 0 for rural.
model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 0))
Exercise 22b3
Using Model 2, predict consumption for an urban family with income of $75,000.
For an urban family, set Urban = 1.
Use the same fitted Model 2 and change only the dummy variable.
model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 1))
Exercise 22c1
Estimate Model 3 with the interaction and return the full regression output.
The easiest way is lm(Consumption ~ Income * Urban, data = df).
The interaction model lets both the intercept and the slope differ between urban and rural families.
summary(lm(Consumption ~ Income * Urban, data = df))
Exercise 22c2
Using Model 3, predict consumption for a rural family with income of $75,000.
For a rural family, set Urban = 0. Then the interaction term is handled automatically by predict(...).
Use the fitted interaction model and set Urban = 0.
model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 0))
Exercise 22c3
Using Model 3, predict consumption for an urban family with income of $75,000.
For an urban family, set Urban = 1.
Use the same fitted interaction model and change the dummy variable to 1.
model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 1))
Exercise 22d
Which model is the most suitable for the data?
Compare the adjusted R^2 values and think about whether the interaction gives the model extra flexibility.
Correct choice: the third option.
Model 3 has the highest adjusted R^2, and it allows the slope to differ between urban and rural families.
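One way to back this up in a single step (a sketch, assuming df holds the Urban data as loaded above) is to line up the three adjusted R^2 values side by side:

```r
# Sketch: compare adjusted R^2 across the three models.
# Assumes df holds the Urban data as loaded above.
model1 <- lm(Consumption ~ Income, data = df)
model2 <- lm(Consumption ~ Income + Urban, data = df)
model3 <- lm(Consumption ~ Income * Urban, data = df)
c(model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared,
  model3 = summary(model3)$adj.r.squared)
```

Adjusted R^2 penalizes the extra terms, so if Model 3 still comes out on top, the interaction is earning its place.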
Exercise 22e
Return the full 95% confidence interval output for Model 3.
Fit Model 3 first, then use confint(...).
The full interval output lets you inspect the coefficients of Income, Urban, and the interaction together.
model3 <- lm(Consumption ~ Income * Urban, data = df)
confint(model3)