Exercise List 9 - Interactive
Regression significance, assumptions, and dummy variables
Hey you :)
This list covers regression significance, model assumptions, dummy variables, and interactions. Take it one step at a time:
- return the full regression output when that is the most useful starting point
- separate joint significance, individual significance, and interpretation
- for prediction tasks, plug in exactly the values from the question
- for dummy variables and interactions, keep track of what changes between groups
Packages used on this page: readxl.
Quick guide: which method do I need?
Regression significance
- Use summary(lm(...))
- The overall F-test answers whether the explanatory variables are jointly significant
- The coefficient p-values answer whether each variable is individually significant
Residual checks
- Residual plots help you look for changing variability
- A fanning-out pattern suggests heteroskedasticity
Dummy variables and interactions
- A dummy variable shifts the intercept between groups
- An interaction can change the slope between groups too
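The two bullet points above can be sketched with a tiny simulated example (the names x, group, and y are made up for illustration and do not belong to any exercise on this page):

```r
# Toy illustration with simulated data (x, group, y are invented names)
set.seed(1)
x <- runif(50, 0, 10)
group <- rbinom(50, 1, 0.5)                       # dummy variable: 0 or 1
y <- 2 + 1.5 * x + 3 * group + 0.8 * x * group + rnorm(50)

# Dummy only: the intercept shifts between groups, the slope is shared
summary(lm(y ~ x + group))

# Dummy plus interaction: both the intercept and the slope differ
summary(lm(y ~ x * group))
```

In the second model, the coefficient on x is the slope for the group coded 0, and the interaction coefficient is the extra slope for the group coded 1.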
15.1 Tests of Significance
Exercise 20 (Electricity_Cost)
The facility manager at a pharmaceutical company wants to build a regression model to forecast monthly electricity cost. Three main variables are thought to dictate electricity cost: (1) average outdoor temperature (Temp in °F), (2) working days per month (Days), and (3) tons of product produced (Tons).
a. Estimate the regression model.
b. At the 10% significance level, are the explanatory variables jointly significant? Show the relevant steps of the test.
c. Are the explanatory variables individually significant at the 10% significance level? Show the relevant steps of the test.
Quick dataset note: in the code cells below, the file Electricity_Cost.xlsx is loaded into df. It has the columns Cost, Temp, Days, and Tons.
Exercise 20a
Estimate the model and return the full regression output.
Use lm(Cost ~ Temp + Days + Tons, data = df) and wrap it in summary(...).
Return the full regression output first, because the later questions use both the overall F-test and the individual coefficient table.
summary(lm(Cost ~ Temp + Days + Tons, data = df))
Exercise 20b
At the 10% level, are the explanatory variables jointly significant?
Use the overall F-test p-value from the regression output in 20a.
Correct choice: the first option.
The overall p-value is about 0.0262, which is below 0.10, so the variables are jointly significant.
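If you want that joint-test p-value as a single number rather than reading it off the printout, one way (a sketch, assuming df is loaded as described in the dataset note) is to recompute it from the F statistic stored in the summary object:

```r
# Sketch: extract the overall F-test p-value from the summary object.
# Assumes df holds the Electricity_Cost data as loaded above.
model <- lm(Cost ~ Temp + Days + Tons, data = df)
fstat <- summary(model)$fstatistic     # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```

Compare the resulting p-value with 0.10 to finish the test.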
Exercise 20c
Which explanatory variable is individually significant at the 10% level?
Look at the coefficient p-values in the regression output. Compare each one with 0.10.
Correct choice: the first option.
Only Temp has a p-value below 0.10. Days and Tons do not.
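To check this without scanning the printout, you can pull the coefficient p-values directly out of the coefficient table (a sketch, assuming df is loaded as described in the dataset note):

```r
# Sketch: coefficient p-values from the regression table.
# Assumes df holds the Electricity_Cost data as loaded above.
model <- lm(Cost ~ Temp + Days + Tons, data = df)
coef(summary(model))[, "Pr(>|t|)"]     # compare each value with 0.10
```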
Exercise 22 (Houses)
A realtor examines the factors that influence the price of a house in a suburb outside of Boston, Massachusetts. He collects data on 36 recent house sales (Price) and notes each house’s square footage (Sqft) as well as its number of bedrooms (Beds) and number of bathrooms (Baths).
a. Estimate: Price = β0 + β1 Sqft + β2 Beds + β3 Baths + ε. Show the regression results in a well-formatted table.
b. At the 5% significance level, are the explanatory variables jointly significant in explaining Price?
c. At the 5% significance level, are all explanatory variables individually significant in explaining Price?
d. Estimate the 95% confidence interval for the coefficients of Sqft, Beds, and Baths.
Quick dataset note: in the code cells below, the file Houses.xlsx is loaded into df. It has the columns Price, Sqft, Beds, Baths, and an extra column Col that is not used in the model.
Exercise 22a
Estimate the model and return the full regression output.
Use lm(Price ~ Sqft + Beds + Baths, data = df) and wrap it in summary(...).
Return the full regression output first so you can use it for the joint and individual significance questions.
summary(lm(Price ~ Sqft + Beds + Baths, data = df))
Exercise 22b
At the 5% level, are the explanatory variables jointly significant in explaining Price?
Use the overall F-test p-value from the regression output in 22a.
Correct choice: the first option.
The overall p-value is extremely small, so the variables are jointly significant.
Exercise 22c
Which statement about individual significance is correct at the 5% level?
Check the coefficient p-values for Sqft, Beds, and Baths in the output from 22a.
Correct choice: the second option.
Sqft and Baths are significant at the 5% level, but Beds is not.
Exercise 22d
Return the full 95% confidence interval output.
Fit the same model as in 22a, then use confint(...).
The confidence interval question naturally returns several values, so return the full confint(...) output.
model <- lm(Price ~ Sqft + Beds + Baths, data = df)
confint(model)
Exercise 22e
Which coefficient has a 95% confidence interval that includes 0?
Look at the interval output from 22d and see which interval crosses zero.
Correct choice: the second option.
Only the interval for Beds includes 0.
15.4 Model Assumptions and Common Violations
Exercise 51 (Rental)
Consider the monthly rent (Rent in $) of an apartment as a function of the number of bedrooms (Bed), the number of bathrooms (Bath), and square footage (Sqft).
a. Estimate: Rent = β0 + β1 Bed + β2 Bath + β3 Sqft + ε.
b. Which of the explanatory variables might cause changing variability? Explain.
c. Use residual plots to verify your economic intuition.
Quick dataset note: in the code cells below, the file Rental.xlsx is loaded into df. It has the columns Rent, Bed, Bath, and Sqft.
Exercise 51a
Estimate the model and return the full regression output.
Use lm(Rent ~ Bed + Bath + Sqft, data = df) and wrap it in summary(...).
Return the full regression output first so you can use it for the later residual-plot question.
summary(lm(Rent ~ Bed + Bath + Sqft, data = df))
Exercise 51b
Which explanatory variable is the main candidate for changing variability?
Think about which variable is most naturally linked to the spread of rent values as it gets larger.
Correct choice: the third option.
Larger apartments can vary much more in rent, so Sqft is the most natural candidate for heteroskedasticity.
Exercise 51c
Use residual plots to check that idea.
Make one residual plot against each explanatory variable and then compare the spread.
A residual plot is the standard visual check here. The spread fans out most clearly against Sqft.
model <- lm(Rent ~ Bed + Bath + Sqft, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$Bed, xlab = "Bed", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Bath, xlab = "Bath", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Sqft, xlab = "Sqft", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))
Exercise 51d
Which residual plot shows the clearest fanning-out pattern?
Compare the spread of the residuals as each x-variable increases.
Correct choice: the third option.
The residuals fan out most clearly against Sqft.
Exercise 53 (Healthy_Living)
Healthy living has always been an important goal for any society. Consider a regression model that conjectures that consumption of fruits and vegetables and regular exercise have a positive effect on health, while smoking has a negative effect. The sample consists of the percentages of these variables observed in various states in the United States.
a. Estimate the model Healthy = β0 + β1 FV + β2 Exercise + β3 Smoke + ε.
b. Analyze the data to determine if multicollinearity and changing variability are present.
Quick dataset note: in the code cells below, the file Healthy_Living.xlsx is loaded into df. It has the columns State, Healthy, FV, Exercise, and Smoke.
Exercise 53a
Estimate the model and return the full regression output.
Use lm(Healthy ~ FV + Exercise + Smoke, data = df) and wrap it in summary(...).
Return the full regression output first. You will use it together with the correlation matrix and the residual plots.
summary(lm(Healthy ~ FV + Exercise + Smoke, data = df))
Exercise 53b
Return the correlation matrix for Healthy, FV, Exercise, and Smoke.
Select the four numeric variables and pass them to cor(...).
The correlation matrix helps you judge whether the explanatory variables are so strongly related that multicollinearity looks serious.
cor(df[, c("Healthy", "FV", "Exercise", "Smoke")])
Exercise 53c
Which statement about multicollinearity fits the data best?
Look at the explanatory-variable correlations. Ask whether they are high enough to make multicollinearity look severe.
Correct choice: the second option.
Some correlations are moderate, but none are so extreme that multicollinearity looks severe from this matrix alone.
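If you want a number to back up that judgment, variance inflation factors are the usual follow-up check. Here is a sketch that computes them from auxiliary regressions using only base R (the helper name vif_by_hand is made up for this page; car::vif from the car package would give the same values):

```r
# Sketch: variance inflation factors via auxiliary regressions.
# vif_by_hand is an invented helper name; assumes df holds the
# Healthy_Living data as loaded above.
vif_by_hand <- function(data, vars) {
  sapply(vars, function(v) {
    # Regress each explanatory variable on the others
    aux <- lm(reformulate(setdiff(vars, v), response = v), data = data)
    1 / (1 - summary(aux)$r.squared)   # VIF = 1 / (1 - R^2 of auxiliary fit)
  })
}
vif_by_hand(df, c("FV", "Exercise", "Smoke"))
```

Values near 1 point to little multicollinearity; values around 10 or above are the common warning sign.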
Exercise 53d
Use residual plots to look for changing variability.
Plot the residuals against each explanatory variable and look for a clear fanning-out pattern.
Residual plots are the visual tool for checking changing variability. In this dataset, no one plot gives a very strong fanning-out pattern.
model <- lm(Healthy ~ FV + Exercise + Smoke, data = df)
residuals_model <- resid(model)
par(mfrow = c(1, 3))
plot(residuals_model ~ df$FV, xlab = "FV", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Exercise, xlab = "Exercise", ylab = "Residuals")
abline(h = 0)
plot(residuals_model ~ df$Smoke, xlab = "Smoke", ylab = "Residuals")
abline(h = 0)
par(mfrow = c(1, 1))
Exercise 53e
Which statement about changing variability fits the residual plots best?
Look for a strong increase or decrease in residual spread as the x-values change.
Correct choice: the second option.
The residual plots do not show a strong, clean fanning-out pattern, so changing variability is not clearly established here.
17.1 Dummy Variables
Exercise 9 (Wage)
A researcher wonders whether males get paid more, on average, than females at a large firm. She interviews 50 employees and collects data on each employee’s hourly wage (Wage in $), years of higher education (EDUC), years of experience (EXPER), age (Age), and a Male dummy variable that equals 1 if male, 0 otherwise.
a. Estimate: Wage = β0 + β1 EDUC + β2 EXPER + β3 Age + β4 Male + ε.
b. Predict the hourly wage of a 40-year-old male employee with 10 years of higher education and 5 years of experience. Predict the hourly wage of a 40-year-old female employee with the same qualifications.
c. Interpret the estimated coefficient for Male. Is the Male variable significant at the 5% level? Do the data suggest that sex discrimination exists at this firm?
d. Calculate the 95% confidence interval for the coefficients of EDUC, EXPER, Age, and Male and interpret the results.
Quick dataset note: in the code cells below, the file Wage.xlsx is loaded into df. It has the columns Wage, EDUC, EXPER, Age, and Male.
Exercise 9a
Estimate the model and return the full regression output.
Use lm(Wage ~ EDUC + EXPER + Age + Male, data = df) and wrap it in summary(...).
Return the full regression output first, because later parts need the predictions, the Male coefficient, and the confidence intervals.
summary(lm(Wage ~ EDUC + EXPER + Age + Male, data = df))
Exercise 9b1
Predict the hourly wage of the male employee described in the question.
Use predict(...) with EDUC = 10, EXPER = 5, Age = 40, and Male = 1.
Fit the model first, then predict for the male employee profile.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 1))
Exercise 9b2
Predict the hourly wage of the female employee with the same qualifications.
Use the same values as in 9b1, but set Male = 0.
Use the same fitted model and change only the dummy variable.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
predict(model, data.frame(EDUC = 10, EXPER = 5, Age = 40, Male = 0))
Exercise 9c
Which statement about the Male coefficient fits the results best?
Use the coefficient estimate and its p-value from the regression output in 9a.
Correct choice: the second option.
The fitted Male coefficient is positive, but its p-value is above 0.05, so it is not statistically significant at the 5% level.
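To isolate just the two numbers this answer relies on (a sketch, assuming df holds the Wage data as loaded above), you can pull the Male row out of the coefficient table:

```r
# Sketch: estimate and p-value for the Male dummy only.
# Assumes df holds the Wage data as loaded above.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
coef(summary(model))["Male", c("Estimate", "Pr(>|t|)")]
```

The estimate is the average wage difference between male and female employees with identical EDUC, EXPER, and Age; the p-value tells you whether that difference is statistically significant at the 5% level.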
Exercise 9d
Return the full 95% confidence interval output.
Fit the same model as in 9a, then use confint(...).
The confidence interval output helps you see which coefficients are clearly different from zero.
model <- lm(Wage ~ EDUC + EXPER + Age + Male, data = df)
confint(model)
Exercise 9e
Which coefficient 95% confidence interval includes 0?
Check the interval output from 9d and look for intervals that cross zero.
Correct choice: the third option.
The intervals for both Male and Age include zero.
17.2 Interactions with Dummy Variables
Exercise 22 (Urban)
The accompanying data file shows consumption expenditures of families in the United States (Consumption in $), family income (Income in $), and whether or not the family lives in an urban or rural community (Urban = 1 if urban, 0 otherwise).
a. Estimate: Consumption = β0 + β1 Income + ε. Compute the predicted consumption expenditures of a family with income of $75,000.
b. Include the dummy variable Urban to predict consumption for a family with income of $75,000 in urban and rural communities.
c. Include the dummy variable Urban and an interaction variable (Income × Urban) to predict consumption for a family with income of $75,000 in urban and rural communities.
d. Which of the preceding models is most suitable for the data? Explain.
e. Calculate the 95% confidence interval for the coefficients of Income and Urban and interpret the results.
Quick dataset note: in the code cells below, the file Urban.xlsx is loaded into df. It has the columns Consumption, Income, and Urban.
Exercise 22a1
Estimate Model 1 and return the full regression output.
Model 1 uses Income only.
Return the full regression output first, then use it for the prediction in the next step.
summary(lm(Consumption ~ Income, data = df))
Exercise 22a2
Predict consumption for a family with income of $75,000 using Model 1.
Fit Model 1 first, then use predict(...) with Income = 75000.
Use the fitted simple regression for the new income value.
model1 <- lm(Consumption ~ Income, data = df)
predict(model1, data.frame(Income = 75000))
Exercise 22b1
Estimate Model 2 and return the full regression output.
Model 2 adds the dummy variable Urban but no interaction.
Return the full output first, then use the model for the two predictions in the next steps.
summary(lm(Consumption ~ Income + Urban, data = df))
Exercise 22b2
Using Model 2, predict consumption for a rural family with income of $75,000.
For a rural family, set Urban = 0.
Use the fitted Model 2 and set the dummy variable to 0 for rural.
model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 0))
Exercise 22b3
Using Model 2, predict consumption for an urban family with income of $75,000.
For an urban family, set Urban = 1.
Use the same fitted Model 2 and change only the dummy variable.
model2 <- lm(Consumption ~ Income + Urban, data = df)
predict(model2, data.frame(Income = 75000, Urban = 1))
Exercise 22c1
Estimate Model 3 with the interaction and return the full regression output.
The easiest way is lm(Consumption ~ Income * Urban, data = df).
The interaction model lets both the intercept and the slope differ between urban and rural families.
summary(lm(Consumption ~ Income * Urban, data = df))
Exercise 22c2
Using Model 3, predict consumption for a rural family with income of $75,000.
For a rural family, set Urban = 0. Then the interaction term is handled automatically by predict(...).
Use the fitted interaction model and set Urban = 0.
model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 0))
Exercise 22c3
Using Model 3, predict consumption for an urban family with income of $75,000.
For an urban family, set Urban = 1.
Use the same fitted interaction model and change the dummy variable to 1.
model3 <- lm(Consumption ~ Income * Urban, data = df)
predict(model3, data.frame(Income = 75000, Urban = 1))
Exercise 22d
Which model is the most suitable for the data?
Compare the adjusted R^2 values and think about whether the interaction gives the model extra flexibility.
Correct choice: the third option.
Model 3 has the highest adjusted R^2, and it allows the slope to differ between urban and rural families.
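One way to back this up in a single step (a sketch, assuming df holds the Urban data as loaded above) is to line up the three adjusted R^2 values side by side:

```r
# Sketch: compare adjusted R^2 across the three models.
# Assumes df holds the Urban data as loaded above.
model1 <- lm(Consumption ~ Income, data = df)
model2 <- lm(Consumption ~ Income + Urban, data = df)
model3 <- lm(Consumption ~ Income * Urban, data = df)
c(model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared,
  model3 = summary(model3)$adj.r.squared)
```

Adjusted R^2 penalizes the extra terms, so if Model 3 still comes out on top, the interaction is earning its place.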
Exercise 22e
Return the full 95% confidence interval output for Model 3.
Fit Model 3 first, then use confint(...).
The full interval output lets you inspect the coefficients of Income, Urban, and the interaction together.
model3 <- lm(Consumption ~ Income * Urban, data = df)
confint(model3)