
EDUC 847 Winter 25

Week 3 - Data for Linear Modeling

Eric Brewe
Professor of Physics at Drexel University

22 January 2025, last update: 2025-01-21

1 / 29

Let's start where we ended last class...Scatter Plots

Start by getting the data into R

# Load packages, then read the csv and save it as a dataframe titled week_1_data
library(tidyverse)  # read_csv(), dplyr, ggplot2
library(here)       # project-relative file paths
week_1_data <- read_csv(here("static/slides/EDUC_847/data",
                             "EDUC_847_Survey.csv"))
2 / 29

Let's start cleaning up.

week_1_data %>%
  select(ID, Q2:Q6) %>%
  filter(row_number() > 3) %>%
  mutate(across(c(Q2:Q3, Q5:Q6), as.numeric)) -> w1df
3 / 29

Let's remember what we did last week

Is there a relationship between length of book and estimates on weight of cow?

Making a scatter plot to explore

w1df %>%
  select(Q5:Q6) %>%
  ggplot(aes(x = Q5, y = Q6)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages",
       y = "Weight of Cow",
       title = "Kick Ass Scatterplot!")

A residual is the difference between the observed Y value for each measurement and the predicted Y value.
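To see what that looks like here (a sketch, assuming the w1df data frame from above; the model itself is fit as Model1 on the next slide), we can draw each residual as a vertical segment from the fitted line to the observed point:

# Sketch: draw residuals as segments from the fitted line to each point
fit <- lm(Q6 ~ Q5, data = w1df)   # same model we fit as Model1 on the next slide
fit_df <- broom::augment(fit)     # adds .fitted and .resid columns
fit_df %>%
  ggplot(aes(x = Q5, y = Q6)) +
  geom_smooth(method = "lm", se = FALSE) +         # the fitted line
  geom_segment(aes(xend = Q5, yend = .fitted)) +   # the residuals
  geom_point() +
  theme_minimal()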

4 / 29

Let's make a linear regression model

We want the estimated weight of the cow as predicted by the length of the book closest to the survey respondent.

Model1 <- lm(Q6 ~ Q5, data = w1df)
Model1
##
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
##
## Coefficients:
## (Intercept) Q5
## 1668.7235 0.3001
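Reading off the coefficients, the fitted line is

$$\widehat{Q6} = 1668.72 + 0.30 \times Q5$$

i.e. the model predicts roughly 0.3 extra units of estimated cow weight for every additional page.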
5 / 29

Let's dig into our model

summary(Model1)
##
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
##
## Residuals:
## 1 2 3 4 5 6 7
## 2638.4 -478.3 -933.6 -274.3 -229.5 -285.5 -437.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1668.7235 723.6088 2.306 0.0692 .
## Q5 0.3001 1.8163 0.165 0.8753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1301 on 5 degrees of freedom
## Multiple R-squared: 0.005429, Adjusted R-squared: -0.1935
## F-statistic: 0.02729 on 1 and 5 DF, p-value: 0.8753
6 / 29

Let's look at hypothesis testing for this model

Statistical significance tells us whether or not we can reject the null hypothesis

  1. What is the null model?
  2. What is our tolerance for type 1 error? (p-value)
  3. Is this model different from the null model? (see the sketch below)
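One way to answer question 3 directly (a minimal sketch, using Model1 and w1df from the earlier slides) is to fit the intercept-only null model and compare the two with anova():

# Sketch: compare Model1 against the intercept-only (null) model
null_model <- lm(Q6 ~ 1, data = w1df)   # predicts the mean of Q6 for everyone
anova(null_model, Model1)               # F-test; matches the F and p from summary(Model1)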
7 / 29

Let's think about this model...

R2 = 0.005429

We saw this last week

F-Statistic = 0.02729

This is the ratio of the Model Mean Square to the Residual Mean Square (each Sum of Squares divided by its degrees of freedom)
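In symbols:

$$F = \frac{SS_{\text{model}} / df_{\text{model}}}{SS_{\text{residual}} / df_{\text{residual}}} = \frac{MS_{\text{model}}}{MS_{\text{residual}}}$$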

p = 0.8753

This is the chance of a type 1 error (false positive) you would have to be willing to accept in order to reject the null model here.

This model doesn't fit very well.

8 / 29

Let's think about the Intercept

Intercept = 1668.7

This is the predicted weight of the cow when the predictor (Q5, pages) is zero.

Standard Error = 723.6

$SE = \frac{\sigma}{\sqrt{N}}$

t value = 2.306

$t = \frac{\text{Intercept}}{\text{Standard Error}} = \frac{1668.7}{723.6} = 2.306$

p = 0.0692

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.

9 / 29

Let's think about the Q5 coeff.

Q5 = 0.3001

The average additional weight added for each additional page in the closest book.

Standard Error = 1.8163

$SE = \frac{\sigma}{\sqrt{N}}$

t value = 0.165

$t = \frac{\text{Coefficient}}{\text{Standard Error}} = \frac{0.3001}{1.8163} = 0.165$

p = 0.8753

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.

10 / 29

Let's think about why this model is a failure

Maybe it is because we violated the assumptions?

And what are those assumptions? From https://learningstatisticswithr.com/book/regression.html#regressionassumptions

11 / 29

Let's think about why this model is a failure

Normality of Residuals

Residuals are distributed normally.

Linearity

The outcome and the predictor are linearly related

Homogeneity of variance

The standard deviation of the residuals is the same at all values of the fitted outcome

Uncorrelated predictors

The correlations between predictor variables are small (We'll come back to this)

No bad outliers

There aren't any individual data points that unduly influence our model
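The next few slides check these one at a time. As a quick overview (a sketch using base R's built-in diagnostics for lm objects), the four standard diagnostic plots can also be drawn in one go:

# Sketch: base R's standard diagnostic plots for Model1
par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(Model1)           # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting grid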

12 / 29

Let's check out those residuals...

library(broom)  # augment() comes from the broom package
augment(Model1) %>%
  select(Q5:.resid)
## # A tibble: 7 × 3
## Q5 .fitted .resid
## <dbl> <dbl> <dbl>
## 1 273 1751. 2638.
## 2 32 1678. -478.
## 3 383 1784. -934.
## 4 352 1774. -274.
## 5 869 1929. -229.
## 6 56 1686. -286.
## 7 81 1693. -437.
13 / 29

Let's check the normality of residuals...

We can do a histogram of the residuals

mod_df <- augment(Model1)
mod_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()

Doesn't look normal! Could confirm with a Shapiro-Wilk test...
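For example (a minimal sketch using base R's shapiro.test() on the model residuals):

# Sketch: Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(Model1))
# A small p-value (e.g. < 0.05) is evidence against normally distributed residuals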

14 / 29

Let's check the normality of residuals...

We can do a Q-Q plot

plot(x = Model1, which = 2 )

Should be a straight line

Should have a slope = 1

Not so much

15 / 29

Let's check the linearity assumption

We can do a scatter plot of the residuals vs. the fitted values (the result of the eqn.)

plot(x = Model1, which = 1 )

Should be a flat, horizontal line

Not so much

16 / 29

Let's check the homogeneity of variance assumption

We can do a scatter plot of the square root of the absolute standardized residuals vs. the fitted values (the result of the eqn.)

plot(x = Model1, which = 3 )

Should be a straight horizontal line

Not so much

17 / 29

Let's check the no bad outliers assumption

We can do a boxplot to get a sense

w1df %>%
  ggplot(aes(y = Q6)) +
  geom_boxplot() +
  theme_minimal()

Note that the one point way up above might be an outlier

We should confirm.

18 / 29

Let's check the no bad outliers assumption

We can check the leverage of the model

plot(Model1, which = 5)

Looks like observation 5 might have much larger leverage.

We should confirm.
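One way to confirm (a sketch using base R's influence measures for lm objects):

# Sketch: confirm influence with Cook's distance
plot(Model1, which = 4)   # Cook's distance for each observation
cooks.distance(Model1)    # an observation with a much larger value than the rest is suspect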

19 / 29

Let's revisit our model

Maybe, just maybe this model is junk because...

There is no good reason to think that the number of pages in the book closest to you when you answered this survey has any bearing on how much you would estimate a cow to weigh!

20 / 29

Let's try this with a different model

w3df <- read_csv(here("static/slides/EDUC_847/data",
                      "science_scores.csv"))
sci_model <- lm(sci ~ math_sc, data = w3df)
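Before digging into the summary, we can eyeball the relationship (a sketch, assuming the sci and math_sc columns read in above):

# Sketch: scatter plot of science scores vs. math scores with the fitted line
w3df %>%
  ggplot(aes(x = math_sc, y = sci)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal() +
  labs(x = "Math score", y = "Science score")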
21 / 29

Let's try this with a different model

summary(sci_model)
##
## Call:
## lm(formula = sci ~ math_sc, data = w3df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.3 -362.3 -187.9 182.5 4431.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.631 164.487 0.265 0.791
## math_sc 11.622 2.475 4.696 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 653.1 on 498 degrees of freedom
## Multiple R-squared: 0.04241, Adjusted R-squared: 0.04049
## F-statistic: 22.06 on 1 and 498 DF, p-value: 3.428e-06
22 / 29

Let's look at hypothesis testing for the model

Statistical significance tells us whether or not we can reject the null hypothesis

  1. What is the null model?
  2. What is our tolerance for type 1 error? (p-value)
  3. Is this model different from the null model?
23 / 29

Let's look at hypothesis testing for the model

Statistical significance tells us whether or not we can reject the null hypothesis

  1. What is the null model? The intercept only model.
  2. What is our tolerance for type 1 error? (p-value) Typically 5% (α = 0.05).
  3. Is this model different from the null model? Yes. We know because the F-statistic is very large, and the p-value is less than 0.05 (5%).
24 / 29

Let's look at hypothesis testing for the coef

Statistical significance tells us whether or not we can reject the null hypothesis

  1. What is the null model?
  2. What is our tolerance for type 1 error? (p-value)
  3. Is this model different from the null model?
25 / 29

Let's look at hypothesis testing for the coef

Statistical significance tells us whether or not we can reject the null hypothesis

  1. What is the null model? the coefficient = 0
  2. What is our tolerance for type 1 error? (p-value) Typically 5% (α = 0.05).
  3. Is this model different from the null model? Yes! The t-value for math_sc is 4.696, p = 3.43e-06 << 0.05
26 / 29

Let's talk about what we can say...

  1. Can we reject the null model?

  2. Does a p << 0.05 mean that the model is meaningful?

27 / 29

Let's talk about what we can say...

  1. Can we reject the null model?

Yes!

  2. Does a p << 0.05 mean that the model is meaningful?

No!

28 / 29

Let's look at confidence intervals

Confidence intervals give us a sense of the range of plausible values for the true coefficient.

For a coefficient, we first establish a tolerance for false positives (5%, i.e. a 95% confidence level), then...

$C.I. = b \pm t_{\text{crit}} \times S.E.(b)$

confint(sci_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -279.543039 366.80601
## math_sc 6.760086 16.48408
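As a check (a sketch reproducing the math_sc interval by hand from the formula above, using the estimate and standard error from summary(sci_model)):

# Sketch: 95% confidence interval for math_sc computed by hand
b      <- coef(summary(sci_model))["math_sc", "Estimate"]     # 11.622
se     <- coef(summary(sci_model))["math_sc", "Std. Error"]   # 2.475
t_crit <- qt(0.975, df = df.residual(sci_model))              # two-tailed critical t
b + c(-1, 1) * t_crit * se                                    # matches the confint() row above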
29 / 29
