Exercise List 9 - Interactive

Regression significance, assumptions, and dummy variables


Hey you :)

This list covers regression significance, model assumptions, dummy variables, and interactions. Take it one step at a time:

  • return the full regression output when that is the most useful starting point
  • separate joint significance, individual significance, and interpretation
  • for prediction tasks, plug in exactly the values from the question
  • for dummy variables and interactions, keep track of what changes between groups

Packages used on this page: readxl.
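Every exercise below assumes its data file has already been loaded into df. If you are working locally instead, a minimal loading sketch with readxl looks like this (the file path is an assumption — point it at wherever the .xlsx file lives on your machine):

library(readxl)

# Load one exercise file into df; swap in the file for the exercise you are on.
df <- read_excel("Electricity_Cost.xlsx")
head(df)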

Quick guide: which method do I need?

Regression significance

  • Use summary(lm(...))
  • The overall F-test answers whether the explanatory variables are jointly significant
  • The coefficient p-values answer whether each variable is individually significant

Residual checks

  • Residual plots help you look for changing variability
  • A fanning-out pattern suggests heteroskedasticity

Dummy variables and interactions

  • A dummy variable shifts the intercept between groups
  • An interaction can change the slope between groups too
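A tiny worked sketch makes the intercept/slope point concrete. Suppose the fitted model is y = b0 + b1·x + b2·D + b3·(x × D), where D is a 0/1 dummy (the coefficient names here are generic placeholders, not from any exercise below):

  • For D = 0 the fitted line is y = b0 + b1·x
  • For D = 1 it becomes y = (b0 + b2) + (b1 + b3)·x

So b2 shifts the intercept between groups, and b3 changes the slope.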

15.1 Tests of Significance

Exercise 20 (Electricity_Cost)

The facility manager at a pharmaceutical company wants to build a regression model to forecast monthly electricity cost. Three main variables are thought to dictate electricity cost: (1) average outdoor temperature (Temp in °F), (2) working days per month (Days), and (3) tons of product produced (Tons).

  • a. Estimate the regression model.
  • b. At the 10% significance level, are the explanatory variables jointly significant? Show the relevant steps of the test.
  • c. Are the explanatory variables individually significant at the 10% significance level? Show the relevant steps of the test.

Quick dataset note: in the code cells below, the file Electricity_Cost.xlsx is loaded into df. It has the columns Cost, Temp, Days, and Tons.

Exercise 20a

Estimate the model and return the full regression output.

Use lm(Cost ~ Temp + Days + Tons, data = df) and wrap it in summary(...).

Return the full regression output first, because the later questions use both the overall F-test and the individual coefficient table.

summary(lm(Cost ~ Temp + Days + Tons, data = df))

Exercise 20b

At the 10% level, are the explanatory variables jointly significant?

Use the overall F-test p-value from the regression output in 20a.

Correct choice: the first option.

The overall p-value is about 0.0262, which is below 0.10, so the variables are jointly significant.
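If you would rather pull the F-test p-value out as a number than read it off the printout, one sketch (assuming df holds the Electricity_Cost data as above) is:

model <- lm(Cost ~ Temp + Days + Tons, data = df)
f <- summary(model)$fstatistic            # F value plus numerator and denominator df
pf(f[1], f[2], f[3], lower.tail = FALSE)  # overall p-value, about 0.0262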

Exercise 20c

Which explanatory variable is individually significant at the 10% level?

Look at the coefficient p-values in the regression output. Compare each one with 0.10.

Correct choice: the first option.

Only Temp has a p-value below 0.10. Days and Tons do not.
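The same comparison can be done in code instead of by eye. A sketch, again assuming df holds the Electricity_Cost data:

model <- lm(Cost ~ Temp + Days + Tons, data = df)
pvals <- summary(model)$coefficients[, "Pr(>|t|)"]
pvals < 0.10  # TRUE marks the individually significant terms at the 10% level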

Exercise 22 (Houses)

A realtor examines the factors that influence the price of a house in a suburb outside of Boston, Massachusetts. He collects data on 36 recent house sales (Price) and notes each house’s square footage (Sqft) as well as its number of bedrooms (Beds) and number of bathrooms (Baths).

  • a. Estimate: Price = β0 + β1 Sqft + β2 Beds + β3 Baths + ε. Show the regression results in a well-formatted table.
  • b. At the 5% significance level, are the explanatory variables jointly significant in explaining Price?
  • c. At the 5% significance level, are all explanatory variables individually significant in explaining Price?
  • d. Estimate the 95% confidence interval for the coefficients of Sqft, Beds, and Baths.

Quick dataset note: in the code cells below, the file Houses.xlsx is loaded into df. It has the columns Price, Sqft, Beds, Baths, and an extra column Col that is not used in the model.

Exercise 22a

Estimate the model and return the full regression output.

Use lm(Price ~ Sqft + Beds + Baths, data = df) and wrap it in summary(...).

Return the full regression output first so you can use it for the joint and individual significance questions.

summary(lm(Price ~ Sqft + Beds + Baths, data = df))

Exercise 22b

At the 5% level, are the explanatory variables jointly significant in explaining Price?

Use the overall F-test p-value from the regression output in 22a.

Correct choice: the first option.

The overall p-value is extremely small, so the variables are jointly significant.

Exercise 22c

Which statement about individual significance is correct at the 5% level?

Check the coefficient p-values for Sqft, Beds, and Baths in the output from 22a.

Correct choice: the second option.

Sqft and Baths are significant at the 5% level, but Beds is not.

Exercise 22d

Return the full 95% confidence interval output.

Fit the same model as in 22a, then use confint(...).

The confidence interval question naturally returns several values, so return the full confint(...) output.

model <- lm(Price ~ Sqft + Beds + Baths, data = df)
confint(model)

Exercise 22e

Which coefficient has a 95% confidence interval that includes 0?

Look at the interval output from 22d and see which interval crosses zero.

Correct choice: the second option.

Only the interval for Beds includes 0.

15.4 Model Assumptions and Common Violations

Exercise 51 (Rental)

Consider the monthly rent (Rent in $) of an apartment as a function of the number of bedrooms (Bed), the number of bathrooms (Bath), and square footage (Sqft).

  • a. Estimate: Rent = β0 + β1 Bed + β2 Bath + β3 Sqft + ε.
  • b. Which of the explanatory variables might cause changing variability? Explain.
  • c. Use residual plots to verify your economic intuition.

Quick dataset note: in the code cells below, the file Rental.xlsx is loaded into df. It has the columns Rent, Bed, Bath, and Sqft.

Exercise 51a

Estimate the model and return the full regression output.

Use lm(Rent ~ Bed + Bath + Sqft, data = df) and wrap it in summary(...).

Return the full regression output first so you can use it for the later residual-plot question.

summary(lm(Rent ~ Bed + Bath + Sqft, data = df))

Exercise 51b

Which explanatory variable is the main candidate for changing variability?

Think about which variable is most naturally linked to the spread of rent values as it gets larger.

Correct choice: the third option.

Larger apartments can vary much more in rent, so Sqft is the most natural candidate for heteroskedasticity.

Exercise 51c

Use residual plots to check that idea.

Make one residual plot against each explanatory variable and then compare the spread.

A residual plot is the standard visual check here. The spread fans out most clearly against Sqft.

model <- lm(Rent ~ Bed + Bath + Sqft, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$Bed, xlab = "Bed", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Bath, xlab = "Bath", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Sqft, xlab = "Sqft", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))

Exercise 51d

Which residual plot shows the clearest fanning-out pattern?

Compare the spread of the residuals as each x-variable increases.

Correct choice: the third option.

The residuals fan out most clearly against Sqft.

Exercise 53 (Healthy_Living)

Healthy living has always been an important goal for any society. Consider a regression model that conjectures that fruit and vegetable consumption and regular exercise have a positive effect on health, while smoking has a negative effect on health. The sample consists of the percentage of these variables observed in various states in the United States.

  • a. Estimate the model Healthy = β0 + β1 FV + β2 Exercise + β3 Smoke + ε.
  • b. Analyze the data to determine if multicollinearity and changing variability are present.

Quick dataset note: in the code cells below, the file Healthy_Living.xlsx is loaded into df. It has the columns State, Healthy, FV, Exercise, and Smoke.

Exercise 53a

Estimate the model and return the full regression output.

Use lm(Healthy ~ FV + Exercise + Smoke, data = df) and wrap it in summary(...).

Return the full regression output first. You will use it together with the correlation matrix and the residual plots.

summary(lm(Healthy ~ FV + Exercise + Smoke, data = df))

Exercise 53b

Return the correlation matrix for Healthy, FV, Exercise, and Smoke.

Select the four numeric variables and pass them to cor(...).

The correlation matrix helps you judge whether the explanatory variables are so strongly related that multicollinearity looks serious.

cor(df[, c("Healthy", "FV", "Exercise", "Smoke")])

Exercise 53c

Which statement about multicollinearity fits the data best?

Look at the explanatory-variable correlations. Ask whether they are high enough to make multicollinearity look severe.

Correct choice: the second option.

Some correlations are moderate, but none are so extreme that multicollinearity looks severe from this matrix alone.
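Correlations only look at pairs of variables. For a check that accounts for all explanatory variables at once, you can compute variance inflation factors by hand (car::vif does this too, but the sketch below needs no extra packages; it assumes df holds the Healthy_Living data):

# VIF for each explanatory variable: 1 / (1 - R^2) from regressing it
# on the other explanatory variables.
vif_fv       <- 1 / (1 - summary(lm(FV ~ Exercise + Smoke, data = df))$r.squared)
vif_exercise <- 1 / (1 - summary(lm(Exercise ~ FV + Smoke, data = df))$r.squared)
vif_smoke    <- 1 / (1 - summary(lm(Smoke ~ FV + Exercise, data = df))$r.squared)
c(FV = vif_fv, Exercise = vif_exercise, Smoke = vif_smoke)

As a rough rule of thumb, VIF values far above 10 would signal severe multicollinearity.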

Exercise 53d

Use residual plots to look for changing variability.

Plot the residuals against each explanatory variable and look for a clear fanning-out pattern.

Residual plots are the visual tool for checking changing variability. In this dataset, no one plot gives a very strong fanning-out pattern.

model <- lm(Healthy ~ FV + Exercise + Smoke, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$FV, xlab = "FV", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Exercise, xlab = "Exercise", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Smoke, xlab = "Smoke", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))

Exercise 53e

Which statement about changing variability fits the residual plots best?

Look for a strong increase or decrease in residual spread as the x-values change.

Correct choice: the second option.

The residual plots do not show a strong, clean fanning-out pattern, so changing variability is not clearly established here.

17.1 Dummy Variables

Exercise 9 (Wage)

A researcher wonders whether males get paid more, on average, than females at a large firm. She interviews 50 employees and collects data on each employee’s hourly wage (Wage in $), years of higher education (EDUC), years of experience (EXPER), age (Age), and a Male dummy variable that equals 1 if male, 0 otherwise.

  • a. Estimate: Wage = β0 + β1 EDUC + β2 EXPER + β3 Age + β4 Male + ε.
  • b. Predict the hourly wage of a 40-year-old male employee with 10 years of higher education and 5 years of experience. Predict the hourly wage of a 40-year-old female employee with the same qualifications.
  • c. Interpret the estimated coefficient for Male. Is the Male variable significant at the 5% level? Do the data suggest that sex discrimination exists at this firm?
  • d. Calculate the 95% confidence interval for the coefficients of EDUC, EXPER, Age, and Male and interpret the results.

Quick dataset note: in the code cells below, the file Wage.xlsx is loaded into df. It has the columns Wage, EDUC, EXPER, Age, and Male.

Exercise 9a

Estimate the model and return the full regression output.

Use lm(Wage ~ EDUC + EXPER + Age + Male, data = df) and wrap it in summary(...).

Return the full regression output first, because later parts need the predictions, the Male coefficient, and the confidence intervals.

summary(lm(Wage ~ EDUC + EXPER + Age + Male, data = df))

Exercise 9b1

Predict the hourly wage of the male employee described in the question.

Use predict(...) with EDUC = 10, EXPER = 5, Age = 40, and Male = 1.

Fit the model first, then predict for the male employee profile.

model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))

Exercise 9b2

Predict the hourly wage of the female employee with the same qualifications.

Use the same values as in 9b1, but set Male = 0.

Use the same fitted model and change only the dummy variable.

model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 0))
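The two predictions differ only through the dummy, so their gap equals the fitted Male coefficient exactly. A quick check of that (assuming the same df):

model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
male   <- predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))
female <- predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 0))
male - female        # same number as...
coef(model)["Male"]  # ...the fitted Male coefficient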

Exercise 9c

Which statement about the Male coefficient fits the results best?

Use the coefficient estimate and its p-value from the regression output in 9a.

Correct choice: the second option.

The fitted Male coefficient is positive, but its p-value is above 0.05, so it is not statistically significant at the 5% level.

Exercise 9d

Return the full 95% confidence interval output.

Fit the same model as in 9a, then use confint(...).

The confidence interval output helps you see which coefficients are clearly different from zero.

model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
confint(model)

Exercise 9e

Which coefficient's 95% confidence interval includes 0?

Check the interval output from 9d and look for intervals that cross zero.

Correct choice: the third option.

The interval for Male includes zero, and the interval for Age does too.

17.2 Interactions with Dummy Variables

Exercise 22 (Urban)

The accompanying data file shows consumption expenditures of families in the United States (Consumption in $), family income (Income in $), and whether or not the family lives in an urban or rural community (Urban = 1 if urban, 0 otherwise).

  • a. Estimate: Consumption = β0 + β1 Income + ε. Compute the predicted consumption expenditures of a family with income of $75,000.
  • b. Include the dummy variable Urban to predict consumption for a family with income of $75,000 in urban and rural communities.
  • c. Include the dummy variable Urban and an interaction variable (Income × Urban) to predict consumption for a family with income of $75,000 in urban and rural communities.
  • d. Which of the preceding models is most suitable for the data? Explain.
  • e. Calculate the 95% confidence interval for the coefficients of Income and Urban and interpret the results.

Quick dataset note: in the code cells below, the file Urban.xlsx is loaded into df. It has the columns Consumption, Income, and Urban.

Exercise 22a1

Estimate Model 1 and return the full regression output.

Model 1 uses Income only.

Return the full regression output first, then use it for the prediction in the next step.

summary(lm(Consumption ~ Income, data = df))

Exercise 22a2

Predict consumption for a family with income of $75,000 using Model 1.

Fit Model 1 first, then use predict(...) with Income = 75000.

Use the fitted simple regression for the new income value.

model1 <- lm(Consumption ~ Income, data = df)
predict(model1, data.frame(Income = 75000))

Exercise 22b1

Estimate Model 2 and return the full regression output.

Model 2 adds the dummy variable Urban but no interaction.

Return the full output first, then use the model for the two predictions in the next steps.

summary(lm(Consumption ~ Income + Urban, data = df))

Exercise 22b2

Using Model 2, predict consumption for a rural family with income of $75,000.

For a rural family, set Urban = 0.

Use the fitted Model 2 and set the dummy variable to 0 for rural.

model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 0))

Exercise 22b3

Using Model 2, predict consumption for an urban family with income of $75,000.

For an urban family, set Urban = 1.

Use the same fitted Model 2 and change only the dummy variable.

model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 1))

Exercise 22c1

Estimate Model 3 with the interaction and return the full regression output.

The easiest way is lm(Consumption ~ Income * Urban, data = df).

The interaction model lets both the intercept and the slope differ between urban and rural families.

summary(lm(Consumption ~ Income * Urban, data = df))

Exercise 22c2

Using Model 3, predict consumption for a rural family with income of $75,000.

For a rural family, set Urban = 0. Then the interaction term is handled automatically by predict(...).

Use the fitted interaction model and set Urban = 0.

model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 0))

Exercise 22c3

Using Model 3, predict consumption for an urban family with income of $75,000.

For an urban family, set Urban = 1.

Use the same fitted interaction model and change the dummy variable to 1.

model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 1))

Exercise 22d

Which model is the most suitable for the data?

Compare the adjusted R^2 values and think about whether the interaction gives the model extra flexibility.

Correct choice: the third option.

Model 3 has the highest adjusted R^2, and it allows the slope to differ between urban and rural families.
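To put the three adjusted R^2 values side by side instead of scrolling through three summaries, one sketch (assuming df holds the Urban data) is:

model1 <- lm(Consumption ~ Income, data = df)
model2 <- lm(Consumption ~ Income + Urban, data = df)
model3 <- lm(Consumption ~ Income * Urban, data = df)
c(model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared,
  model3 = summary(model3)$adj.r.squared)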

Exercise 22e

Return the full 95% confidence interval output for Model 3.

Fit Model 3 first, then use confint(...).

The full interval output lets you inspect the coefficients of Income, Urban, and the interaction together.

model3 <- lm(Consumption ~ Income * Urban, data = df)
confint(model3)