Exercise List 7 - Interactive
Chi-square tests and Jarque-Bera normality
Hey you :)
This list covers chi-square tests and one normality test. Take it one step at a time:
- use the full test output when the task asks for it
- read the tail direction carefully
- for chi-square questions, keep track of the null distribution
- for normality, focus on what the p-value says about the data
Packages used on this page: readxl, plus tseries for jarque.bera.test.
Quick guide: which method do I need?
Goodness-of-fit for one multinomial distribution
- Use chisq.test(observed, p = expected_proportions)
- The null says the category proportions match the claimed distribution
- The degrees of freedom are k - 1
Chi-square test for independence
- Build a contingency table first
- Then use chisq.test(table(...))
- The null says the two variables are independent
Jarque-Bera test for normality
- The null says the variable is normally distributed
- The test statistic uses skewness and kurtosis
- A large p-value means the data do not contradict normality
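The three calls in the quick guide can be sketched on toy data. The numbers below are made up purely to show the shape of each call (and jarque.bera.test assumes the tseries package is available):

```r
# Goodness-of-fit: observed counts vs. claimed proportions (made-up numbers)
obs <- c(50, 30, 20)
chisq.test(obs, p = c(0.5, 0.3, 0.2))   # df = k - 1 = 2

# Independence: two categorical vectors -> contingency table (made-up data)
colour <- c("red", "red", "blue", "blue", "red", "blue")
size   <- c("S",   "L",   "S",    "L",    "L",   "S")
chisq.test(table(colour, size))         # df = (rows - 1) * (cols - 1)

# Normality: Jarque-Bera on a numeric vector (needs the tseries package)
# library(tseries)
# jarque.bera.test(rnorm(100))
```

Note that chisq.test warns when expected counts are small, as they are in this tiny toy table; with real sample sizes like the exercises below, that is not an issue.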
12.1 Goodness-of-Fit test for a Multinomial Experiment
Exercise 9
In 2003, the distribution of the world’s people worth $1 million or more was as follows:
- Europe: 35.7%
- North America: 31.4%
- Asia Pacific: 22.9%
- Latin America: 4.3%
- Middle East: 4.3%
- Africa: 1.4%
A recent sample of 500 global millionaires produces the following results:
- Europe: 153
- North America: 163
- Asia Pacific: 139
- Latin America: 20
- Middle East: 20
- Africa: 5

a. Test whether the distribution of millionaires today is different from the distribution in 2003 at α = 0.05.
b. Would the conclusion change if we tested it at α = 0.10?
Exercise 9a
Choose the correct hypotheses.
This is a goodness-of-fit test. The null says the current category proportions still match the 2003 distribution.
Correct choice: the first option.
In a goodness-of-fit test, the null states the full claimed distribution. The alternative says at least one category proportion is different.
Exercise 9b
Run the chi-square goodness-of-fit test and return the full output.
Put the six observed counts into one vector and the six 2003 proportions into another vector. Then use chisq.test(...) with the p = argument.
This is a chi-square goodness-of-fit test because you compare one observed categorical sample with a claimed distribution.
observed <- c(153, 163, 139, 20, 20, 5)
proportion <- c(0.357, 0.314, 0.229, 0.043, 0.043, 0.014)
chisq.test(observed, p = proportion)

Exercise 9c
What is the correct conclusion at α = 0.05?
Compare the p-value from 9b with 0.05.
Correct choice: the second option.
The p-value is about 0.0783, which is larger than 0.05. So at the 5% level you do not reject the null hypothesis.
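If you want to see where that 0.0783 comes from, you can rebuild the statistic by hand. This uses the same numbers as 9b and the df = k - 1 rule from the quick guide, so nothing new is assumed:

```r
# Rebuild the goodness-of-fit statistic from 9b by hand
observed <- c(153, 163, 139, 20, 20, 5)
proportion <- c(0.357, 0.314, 0.229, 0.043, 0.043, 0.014)
expected <- sum(observed) * proportion           # 500 times each 2003 share
stat <- sum((observed - expected)^2 / expected)  # chi-square statistic, about 9.90
pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)  # df = k - 1 = 5, about 0.0783
```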
Exercise 9d
Would the conclusion change at α = 0.10?
Use the same p-value from 9b, but now compare it with 0.10 instead.
Correct choice: the first option.
Now the p-value 0.0783 is below 0.10, so at the 10% level you reject the null hypothesis.
12.2 Chi-Square test for independence
Exercise 24 (Happiness)
There have been numerous attempts to relate happiness to income. In a recent survey, 290 individuals were asked to evaluate their state of happiness (Happy or Not Happy) and income (Low, Medium, or High). The accompanying table shows a portion of the data.
a. Use the data to construct a contingency table.
b. Specify the competing hypotheses to determine whether happiness is related to income.
c. Conduct the test at the 5% significance level and make a conclusion.
Quick dataset note: in the code cells below, the file Happiness.xlsx is loaded into df. It has the columns Individual, Income, and Happy?.
Exercise 24a
Construct the contingency table.
Use table(...) with happiness status in one margin and income in the other.
A contingency table counts how many observations fall in each combination of the two categorical variables.
table(df$`Happy?`, df$Income)

Exercise 24b
Choose the correct hypotheses.
A chi-square test for independence asks whether the two categorical variables are independent or related.
Correct choice: the first option.
The null says income and happiness are independent. The alternative says they are dependent.
Exercise 24c
Run the chi-square test for independence and return the full output.
First build the contingency table. Then pass that table into chisq.test(...).
This is a chi-square test for independence because you want to know whether two categorical variables are related.
chisq.test(table(df$`Happy?`, df$Income))

Exercise 24d
What is the correct conclusion at the 5% level?
Compare the p-value from 24c with 0.05.
Correct choice: the second option.
The p-value is about 0.1915, which is above 0.05. So you do not reject the null hypothesis.
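The degrees of freedom for an independence test are (rows - 1) × (cols - 1). A toy 2×3 table (made-up counts, not the Happiness data) shows the mechanics:

```r
# Made-up 2x3 table, only to illustrate df = (2 - 1) * (3 - 1) = 2
tab <- matrix(c(30, 45, 25,
                40, 35, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(Happy = c("Yes", "No"),
                              Income = c("Low", "Medium", "High")))
res <- chisq.test(tab)
res$parameter  # degrees of freedom: 2
res$expected   # expected counts under independence
```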
12.3 Chi-Square tests for normality
Exercise 35 (MPG)
The accompanying data file shows miles per gallon (MPG) for a sample of 25 cars.
a. Using the Jarque-Bera test, state the competing hypotheses in order to determine whether or not MPG follows the normal distribution.
b. Calculate the value of the Jarque-Bera test statistic and the p-value.
c. At α = 0.05, can you conclude that MPG is not normally distributed?
Quick dataset note: in the code cells below, the file MPG.xlsx is loaded into df. It has one column called MPG.
Exercise 35a
Choose the correct hypotheses.
For a normality test, the null says the data come from a normal distribution.
Correct choice: the first option.
The null says MPG is normally distributed. The alternative says it is not normally distributed.
Exercise 35b
Calculate the Jarque-Bera test statistic.
Use the formulas for skewness S and excess kurtosis K, then plug them into (n / 6) * (S^2 + K^2 / 4).
Jarque-Bera uses the sample skewness and excess kurtosis. Once you compute those, plug them into the formula.
x <- df$MPG
n <- length(x)
m <- mean(x)
m2 <- mean((x - m)^2)
S <- mean((x - m)^3) / (m2^(3/2))
K <- mean((x - m)^4) / (m2^2) - 3
(n / 6) * (S^2 + K^2 / 4)

Exercise 35c
Calculate the p-value.
You can get the same p-value in two ways: either run jarque.bera.test(df$MPG) directly on this page, or use the Jarque-Bera statistic with the chi-square distribution with 2 degrees of freedom.
Both methods give the same p-value.
# Method 1: directly from the Jarque-Bera test
jarque.bera.test(df$MPG)$p.value
# Method 2: from the Jarque-Bera statistic
x <- df$MPG
n <- length(x)
m <- mean(x)
m2 <- mean((x - m)^2)
S <- mean((x - m)^3) / (m2^(3/2))
K <- mean((x - m)^4) / (m2^2) - 3
jb <- (n / 6) * (S^2 + K^2 / 4)
pchisq(jb, df = 2, lower.tail = FALSE)

Exercise 35d
At α = 0.05, what is the correct conclusion?
Compare the p-value from 35c with 0.05.
Correct choice: the second option.
The p-value is about 0.7755, which is much larger than 0.05. So you do not reject the null hypothesis.
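As a closing sanity check of the 35c claim that both methods agree: the sketch below runs the statistic-plus-pchisq route end to end on simulated normal data (not the MPG file). With the tseries package installed, jarque.bera.test on the same vector should report the same statistic and p-value.

```r
# Simulated normal data, just to exercise the statistic/pchisq route
set.seed(1)
x <- rnorm(200)
n <- length(x)
m <- mean(x)
m2 <- mean((x - m)^2)
S <- mean((x - m)^3) / m2^(3/2)   # sample skewness
K <- mean((x - m)^4) / m2^2 - 3   # excess kurtosis
jb <- (n / 6) * (S^2 + K^2 / 4)   # Jarque-Bera statistic
p <- pchisq(jb, df = 2, lower.tail = FALSE)
c(statistic = jb, p.value = p)
# With tseries installed, this should match:
# library(tseries); jarque.bera.test(x)
```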