EDUC 847 Winter 25
Week 3 - Data for Linear Modeling
Eric Brewe 
 Professor of Physics at Drexel University 

22 January 2025, last update: 2025-01-21
1 / 29

Let's start where we ended last class...Scatter Plots

Start by getting the data into R

#This loads the csv and saves it as a dataframe titled week_1_data
week_1_data <- read_csv(here("static/slides/EDUC_847/data",
                             "EDUC_847_Survey.csv"))

2 / 29

Let's start cleaning up.

week_1_data %>%
  select(ID,Q2:Q6) %>%
  filter(row_number() >3 ) %>%
  mutate(across(c(Q2:Q3,Q5:Q6), as.numeric)) -> w1df

3 / 29

Let's remember what we did last week

Is there a relationship between length of book and estimates on weight of cow?

Making a scatter plot to explore

w1df %>%
  select(Q5:Q6) %>%
  ggplot(aes(x = Q5, y = Q6)) +
  geom_point() + 
  theme_minimal() + 
  labs(x = "Pages", 
       y = "Weight of Cow",
       title = "Kick Ass Scatterplot!")

A residual is the distance between the Y value for each measurement, and the predicted Y value.

4 / 29

Let's make a linear regression model

We want the estimated weight of the cow as predicted by the length of the book closest to the survey respondent.

Model1 <- lm(Q6 ~ Q5, data = w1df)
Model1

## 
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
## 
## Coefficients:
## (Intercept)           Q5  
##   1668.7235       0.3001

5 / 29

Let's dig into our model

summary(Model1)

## 
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
## 
## Residuals:
##      1      2      3      4      5      6      7 
## 2638.4 -478.3 -933.6 -274.3 -229.5 -285.5 -437.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1668.7235   723.6088   2.306   0.0692 .
## Q5             0.3001     1.8163   0.165   0.8753  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1301 on 5 degrees of freedom
## Multiple R-squared:  0.005429,    Adjusted R-squared:  -0.1935 
## F-statistic: 0.02729 on 1 and 5 DF,  p-value: 0.8753

6 / 29

Let's look at hypothesis testing for this modelStatistical significance tells us whether or not we can reject the null hypothesisWhat is the null model?  
What is our tolerance for type 1 error? (p-value) 
Is this model different from the null model? 
7 / 29

Let's think about this model...

R² = 0.005429

We saw this last week

F-Statistic = 0.02729

This is a ratio of the Model Sum of Squares vs. the Residual Sum of Squares

p = 0.8753

This is the percent of time you would have to be willing to have a type one error (False Positive)

This model doesn't fit very well.

8 / 29

Let's think about the Intercept

Intercept = 1668

This tells us the average predicted weight of the cow.

Standard Error = 723.6

$S E = \frac{σ}{\sqrt{N}}$

t value = 2.3006

$t = \frac{I n t e r c e p t}{S t a n d a r d E r r o r} = \frac{1668.7}{723.6} = 2.03$

p = 0.0692

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant".

9 / 29

Let's think about the Intercept

Intercept = 1668

This tells us the average predicted weight of the cow.

Standard Error = 723.6

$S E = \frac{σ}{\sqrt{N}}$

t value = 2.3006

$t = \frac{I n t e r c e p t}{S t a n d a r d E r r o r} = \frac{1668.7}{723.6} = 2.03$

p = 0.0692

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.

9 / 29

Let's think about the Q5 coeff.

Q5 = 0.3001

The average additional weight added for each additional page in the closet book.

Standard Error = 1.8163

$S E = \frac{σ}{\sqrt{N}}$

t value = 0.165

$t = \frac{I n t e r c e p t}{S t a n d a r d E r r o r} = \frac{.3001}{1.8163} = 0.165$

p = 0.8753

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant".

10 / 29

Let's think about the Q5 coeff.

Q5 = 0.3001

The average additional weight added for each additional page in the closet book.

Standard Error = 1.8163

$S E = \frac{σ}{\sqrt{N}}$

t value = 0.165

$t = \frac{I n t e r c e p t}{S t a n d a r d E r r o r} = \frac{.3001}{1.8163} = 0.165$

p = 0.8753

Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.

10 / 29

Let's think about why this model is a failure

Maybe it is because we violated the assumptions?

And what are those assumptions? from https://learningstatisticswithr.com/book/regression.html#regressionassumptions

11 / 29

Let's think about why this model is a failure

Normality of Residuals

Residuals are distributed normally.

Linearity

Predictor and Fitted are linearly related

Homogeneity of varicance

The standard deviation of the residual is the same at all values of the Outcome

Uncorrelated predictors

The correlations between predictor variables is small (We'll come back to this)

No bad outliers

There aren't any individual data points that are really impacting our model

12 / 29

Let's check out those residuals...

augment(Model1) %>%
  select(Q5:.resid)

## # A tibble: 7 × 3
##      Q5 .fitted .resid
##   <dbl>   <dbl>  <dbl>
## 1   273   1751.  2638.
## 2    32   1678.  -478.
## 3   383   1784.  -934.
## 4   352   1774.  -274.
## 5   869   1929.  -229.
## 6    56   1686.  -286.
## 7    81   1693.  -437.

13 / 29

Let's check the normality residuals...

We can do a histogram of the residuals

mod_df <- augment(Model1)
mod_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() + 
  theme_minimal() + 
  coord_flip()

Doesn't look normal! Could confirm with a Shapiro-Wilk test...

14 / 29

Let's check the normality residuals...

We can do a Q-Q plot

plot(x = Model1, which = 2 )

Should be a straight line

Should have a slope = 1

Not so much

15 / 29

Let's check the linearity assumption

We can do a scatter plot of the Outcome (Q6) vs. the Fitted (the result of the eqn. )

plot(x = Model1, which = 1 )

Should be a straight line

Not so much

16 / 29

Let's check the homogeneity of variance assumption

We can do a scatter plot of the standardized residuals vs. the Fitted values (the result of the eqn. )

plot(x = Model1, which = 3 )

Should be a straight horizontal line

Not so much

17 / 29

Let's check the no bad outliers assumption

We can do a boxplot to get a sense

w1df %>%
  ggplot(aes(y = Q6)) +
  geom_boxplot() +
  theme_minimal()

Note the one way up above might be an outlier

We should confirm.

18 / 29

Let's check the no bad outliers assumption

We can check the leverage of the model

plot(Model1, which = 5)

Looks like observations 5 might have much larger leverage?

We should confirm.

19 / 29

Let's revisit our model

Maybe, just maybe this model is junk because...

There is no good reason to think that the number of pages in the book closest to you when you answered this survey has any bearing on how much you would estimate a cow to weigh!

20 / 29

Let's try this with a different model

w3df <- read_csv(here("static/slides/EDUC_847/data",
                             "science_scores.csv"))
sci_model <- lm(sci ~ math_sc, data = w3df)

21 / 29

Let's try this with a different model

summary(sci_model)

## 
## Call:
## lm(formula = sci ~ math_sc, data = w3df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -927.3 -362.3 -187.9  182.5 4431.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   43.631    164.487   0.265    0.791    
## math_sc       11.622      2.475   4.696 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653.1 on 498 degrees of freedom
## Multiple R-squared:  0.04241,    Adjusted R-squared:  0.04049 
## F-statistic: 22.06 on 1 and 498 DF,  p-value: 3.428e-06

22 / 29

Let's look at hypothesis testing for the modelStatistical significance tells us whether or not we can reject the null hypothesisWhat is the null model?
What is our tolerance for type 1 error? (p-value)
Is this model different from the null model? 
23 / 29

Let's look at hypothesis testing for the modelStatistical significance tells us whether or not we can reject the null hypothesisWhat is the null model?  The intercept only model. 
What is our tolerance for type 1 error? (p-value) Typically 95%
Is this model different from the null model? Yes.  We know because the F-statistic is very large, and the p- value is less than 0.05 (5%).
24 / 29

Let's look at hypothesis testing for the coefStatistical significance tells us whether or not we can reject the null hypothesisWhat is the null model?  
What is our tolerance for type 1 error? (p-value) 
Is this model different from the null model? 
25 / 29

Let's look at hypothesis testing for the coefStatistical significance tells us whether or not we can reject the null hypothesisWhat is the null model?  the coefficient = 0
What is our tolerance for type 1 error? (p-value) typically 95%
Is this model different from the null model? Yes!  The t-value for the math_sc is 49.27, p << 0.05
26 / 29

Let's talk about what we can say...

Can we reject the null model?
Does a p << 0.05 mean that the model is meaningful?

27 / 29

Let's talk about what we can say...

Can we reject the null model?

Yes!

Does a p << 0.05 mean that the model is meaningful?

No!

28 / 29

Let's look at confidence intervals

Confidence intervals give us a sense about the range of likely true positives.

For a coefficient, we first establish a tolerance for false positives (95%), then...

$C . I . = b \pm t_{c r i t} * S . E . (b)$

confint(sci_model, level = 0.95)

##                   2.5 %    97.5 %
## (Intercept) -279.543039 366.80601
## math_sc        6.760086  16.48408

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

EDUC 847 Winter 25

Week 3 - Data for Linear Modeling

Eric Brewe Professor of Physics at Drexel University

22 January 2025, last update: 2025-01-21

Let's start where we ended last class...Scatter Plots

Let's start cleaning up.

Let's remember what we did last week

Is there a relationship between length of book and estimates on weight of cow?

Making a scatter plot to explore

Let's make a linear regression model

We want the estimated weight of the cow as predicted by the length of the book closest to the survey respondent.

Let's dig into our model

Let's look at hypothesis testing for this model

Statistical significance tells us whether or not we can reject the null hypothesis

Let's think about this model...

R2 = 0.005429

F-Statistic = 0.02729

p = 0.8753

This model doesn't fit very well.

Let's think about the Intercept

Intercept = 1668

Standard Error = 723.6

t value = 2.3006

p = 0.0692

Let's think about the Intercept

Intercept = 1668

Standard Error = 723.6

t value = 2.3006

p = 0.0692

Let's think about the Q5 coeff.

Q5 = 0.3001

Standard Error = 1.8163

t value = 0.165

p = 0.8753

Let's think about the Q5 coeff.

Q5 = 0.3001

Standard Error = 1.8163

t value = 0.165

p = 0.8753

Let's think about why this model is a failure

Maybe it is because we violated the assumptions?

Let's think about why this model is a failure

Normality of Residuals

Linearity

Homogeneity of varicance

Uncorrelated predictors

No bad outliers

Let's check out those residuals...

Let's check the normality residuals...

Let's check the normality residuals...

Should be a straight line

Should have a slope = 1

Let's check the linearity assumption

Should be a straight line

Let's check the homogeneity of variance assumption

Should be a straight horizontal line

Let's check the no bad outliers assumption

Note the one way up above might be an outlier

Let's check the no bad outliers assumption

Looks like observations 5 might have much larger leverage?

Let's revisit our model

Maybe, just maybe this model is junk because...

Let's try this with a different model

Let's try this with a different model

Let's look at hypothesis testing for the model

Statistical significance tells us whether or not we can reject the null hypothesis

Let's look at hypothesis testing for the model

Statistical significance tells us whether or not we can reject the null hypothesis

Let's look at hypothesis testing for the coef

Statistical significance tells us whether or not we can reject the null hypothesis

Let's look at hypothesis testing for the coef

Statistical significance tells us whether or not we can reject the null hypothesis

Let's talk about what we can say...

Let's talk about what we can say...

Let's look at confidence intervals

Let's start where we ended last class...Scatter Plots

Help

Eric Brewe
Professor of Physics at Drexel University

R² = 0.005429