
EDUC 847 Winter 25

Week 4 - Data for Linear Modeling

Eric Brewe
Professor of Physics at Drexel University

29 January 2025, last update: 2025-02-05

1 / 24

Let's set the stage

  • Last week we talked about the assumptions that go into linear modeling.
  • We didn't talk about what happens when these assumptions are violated.

This week we'll discuss

  • Data transformations
  • Missing data

Both of these are real issues that arise all the time in doing regression analyses.

2 / 24

Let's start where we ended last week

Start by getting the data into R

# Load the csv and save it as a dataframe called sci_df
library(tidyverse)  # read_csv(), ggplot2, dplyr
library(here)       # project-relative file paths

sci_df <- read_csv(here("data",
                        "science_scores2.csv"))
3 / 24

Let's remember last week

We want to predict a science score from a math score.

Mod1 <- lm(sci ~ math_sc, data = sci_df)
summary(Mod1)
##
## Call:
## lm(formula = sci ~ math_sc, data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1046.2 -362.0 -140.4 144.3 4657.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -279.474 167.378 -1.670 0.0957 .
## math_sc 15.967 2.478 6.445 3.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 612.7 on 424 degrees of freedom
## (74 observations deleted due to missingness)
## Multiple R-squared: 0.08922, Adjusted R-squared: 0.08707
## F-statistic: 41.53 on 1 and 424 DF, p-value: 3.154e-10

Note - we have a new dataset!

4 / 24

Let's check some other variables

What about the pre score (pre_sc)?

Mod2 <- lm(sci ~ pre_sc, data = sci_df)
summary(Mod2)
##
## Call:
## lm(formula = sci ~ pre_sc, data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -649.40 -50.95 8.69 50.72 1061.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.935 9.786 18.29 <2e-16 ***
## pre_sc 429.168 4.872 88.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.2 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.9452, Adjusted R-squared: 0.9451
## F-statistic: 7759 on 1 and 450 DF, p-value: < 2.2e-16
5 / 24

Let's check on the assumptions

Normality of Residuals

Residuals are distributed normally.

Linearity

The predictor and the fitted values are linearly related

Homogeneity of variance

The standard deviation of the residuals is the same at all fitted values

Uncorrelated predictors

The correlations between predictor variables are small (we'll come back to this)

No bad outliers

There aren't any individual data points that unduly influence our model
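
Base R gives a quick visual check of most of these at once (a sketch: calling plot() on an lm object draws the four standard diagnostic panels):

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(Mod2)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the layout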

6 / 24

Let's start with a scatter plot

That doesn't look good; it doesn't look like the residuals are normally distributed.
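
A sketch of how that plot might be drawn (assuming a scatter of sci against pre_sc with the Mod2 fit overlaid):

sci_df %>%
  ggplot(aes(x = pre_sc, y = sci)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # overlay the fitted line
  theme_minimal()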

7 / 24

Let's check the normality

We can do a histogram of residuals

# Mod2_df isn't created on the slides; assuming it holds the model data plus
# residuals via broom::augment(), which adds a .resid column:
library(broom)
Mod2_df <- augment(Mod2)

Mod2_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()

Definitely not a normal distribution...

8 / 24

Let's check the normality

It also doesn't hurt to check the pre score (pre_sc) itself

Mod2_df %>%
  ggplot(aes(y = pre_sc)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()

Definitely not a normal distribution... now what?

9 / 24

Let's transform our data

When your data are not normally distributed, you can apply a transformation.

The most common transformation is a log transformation

sci_df %>%
  ggplot(aes(y = log(pre_sc))) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
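
One caveat worth stating: log() is only defined for positive values, so check before transforming (a quick sketch; here pre_sc is safe, since its minimum is about 0.05):

# Should be TRUE before applying a log transform
all(sci_df$pre_sc > 0, na.rm = TRUE)
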
10 / 24

Let's use the log transform in a model

Mod3 <- lm(sci ~ log(pre_sc), data = sci_df)
summary(Mod3)
##
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571.75 -212.78 -102.83 95.64 2887.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 826.35 17.44 47.40 <2e-16 ***
## log(pre_sc) 533.34 18.28 29.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.6 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.6542, Adjusted R-squared: 0.6534
## F-statistic: 851.2 on 1 and 450 DF, p-value: < 2.2e-16
11 / 24

Let's interpret this new model

summary(Mod3)
##
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571.75 -212.78 -102.83 95.64 2887.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 826.35 17.44 47.40 <2e-16 ***
## log(pre_sc) 533.34 18.28 29.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.6 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.6542, Adjusted R-squared: 0.6534
## F-statistic: 851.2 on 1 and 450 DF, p-value: < 2.2e-16
12 / 24

Let's check the normality

We can do a histogram of residuals for transformed data

# Again assuming the residuals come from broom::augment()
log_Mod3_df <- augment(Mod3)

log_Mod3_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
13 / 24

Let's evaluate this new model

Is this model with the transformed data better?

  • The R² is lower in the new model (0.65, vs. 0.95 for Mod2).
  • The residuals still aren't normally distributed.

So maybe this data transformation was not the right way to go. There are other approaches to data transformation (one is sketched below).
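
For instance, a square-root transform is a gentler alternative to the log, and MASS::boxcox() can suggest a power transformation of the outcome from the data itself. A sketch of both (Mod3b is a hypothetical name, not from the slides):

# A gentler transform of the predictor
Mod3b <- lm(sci ~ sqrt(pre_sc), data = sci_df)

# Box-Cox profiles power transformations of the outcome (valid here: sci is strictly positive)
MASS::boxcox(lm(sci ~ pre_sc, data = sci_df))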

14 / 24

Let's consider another transform

Another common transformation is to use Z-scores

This makes it easy to compare coefficients across models and between different variables.

Mod4 <- lm(scale(sci) ~ scale(math_sc), data = sci_df)
summary(Mod4)
##
## Call:
## lm(formula = scale(sci) ~ scale(math_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5219 -0.5266 -0.2043 0.2099 6.7761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03709 0.04319 -0.859 0.391
## scale(math_sc) 0.27701 0.04298 6.445 3.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8913 on 424 degrees of freedom
## (74 observations deleted due to missingness)
## Multiple R-squared: 0.08922, Adjusted R-squared: 0.08707
## F-statistic: 41.53 on 1 and 424 DF, p-value: 3.154e-10

Hey, notice that the t value and the p value both stay the same as in the first model! Standardizing rescales the coefficients into standard-deviation units (a one-SD increase in math_sc predicts about a 0.28-SD increase in sci) without changing the strength of the relationship.

15 / 24

Let's move on to missing data.

16 / 24

Let's inspect our data

summary(sci_df)
## id gr motiv pre_sc
## Min. :10789 Min. :1.000 Min. : 37.39 Min. : 0.05074
## 1st Qu.:34052 1st Qu.:2.000 1st Qu.: 65.01 1st Qu.: 0.50900
## Median :54614 Median :3.000 Median : 73.25 Median : 0.98523
## Mean :54561 Mean :2.988 Mean : 73.20 Mean : 1.44099
## 3rd Qu.:75198 3rd Qu.:4.000 3rd Qu.: 81.33 3rd Qu.: 1.75119
## Max. :99772 Max. :5.000 Max. :106.14 Max. :10.89618
## NA's :115 NA's :14
## math_sc sci
## Min. : 30.00 Min. : 184.7
## 1st Qu.: 58.79 1st Qu.: 402.8
## Median : 66.61 Median : 595.4
## Mean : 66.74 Mean : 811.7
## 3rd Qu.: 74.59 3rd Qu.: 972.6
## Max. :101.33 Max. :5755.3
## NA's :40 NA's :34

What do you notice?
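
To put numbers on it, one quick check (a sketch) is to count the NAs in each column:

colSums(is.na(sci_df))  # motiv: 115, math_sc: 40, sci: 34, pre_sc: 14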

17 / 24

Let's deal with these missing data!

Here are some options

  • Hot Deck Replacement
  • Mean/Median Replacement (a quick sketch follows this list)
  • Regression Replacement
  • Multiple Imputation
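
As an example, mean replacement is the simplest of these: each missing value is filled in with the column mean. A sketch, assuming dplyr (note that this shrinks the variance and tends to understate standard errors):

library(dplyr)

sci_df_mean <- sci_df %>%
  mutate(math_sc = if_else(is.na(math_sc),
                           mean(math_sc, na.rm = TRUE),  # mean of the observed values
                           math_sc))
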
18 / 24

Let's look at Multiple Imputation

  • Impute the missing data (do this several times, producing several complete datasets)
  • Analyze all the new data sets
  • Pool the results

Sounds difficult, no?

19 / 24

Let's try multiple imputation

First, it works best with data that are MCAR (missing completely at random) or MAR (missing at random).

The md.pattern() function from the mice package shows the patterns of missingness:

library(mice)
md.pattern(sci_df)

## id gr pre_sc sci math_sc motiv
## 297 1 1 1 1 1 1 0
## 115 1 1 1 1 1 0 1
## 40 1 1 1 1 0 1 1
## 34 1 1 1 0 1 1 1
## 14 1 1 0 1 1 1 1
## 0 0 14 34 40 115 203

Each row is a missingness pattern (1 = observed, 0 = missing): the left-hand count is the number of cases with that pattern, and the bottom row totals the missing values per variable.
20 / 24

Let's go ahead and impute

First, we will create 5 imputed datasets (m = 5), each built with 5 iterations of the chained equations (maxit = 5); setting a seed makes the imputation reproducible.

sci_df_imp <- mice(sci_df, maxit = 5, m = 5, seed = 327)
##
## iter imp variable
## 1 1 motiv pre_sc math_sc sci
## 1 2 motiv pre_sc math_sc sci
## 1 3 motiv pre_sc math_sc sci
## 1 4 motiv pre_sc math_sc sci
## 1 5 motiv pre_sc math_sc sci
## 2 1 motiv pre_sc math_sc sci
## 2 2 motiv pre_sc math_sc sci
## 2 3 motiv pre_sc math_sc sci
## 2 4 motiv pre_sc math_sc sci
## 2 5 motiv pre_sc math_sc sci
## 3 1 motiv pre_sc math_sc sci
## 3 2 motiv pre_sc math_sc sci
## 3 3 motiv pre_sc math_sc sci
## 3 4 motiv pre_sc math_sc sci
## 3 5 motiv pre_sc math_sc sci
## 4 1 motiv pre_sc math_sc sci
## 4 2 motiv pre_sc math_sc sci
## 4 3 motiv pre_sc math_sc sci
## 4 4 motiv pre_sc math_sc sci
## 4 5 motiv pre_sc math_sc sci
## 5 1 motiv pre_sc math_sc sci
## 5 2 motiv pre_sc math_sc sci
## 5 3 motiv pre_sc math_sc sci
## 5 4 motiv pre_sc math_sc sci
## 5 5 motiv pre_sc math_sc sci
21 / 24

Let's plot our imputed data

We should look at how the imputed values are distributed, using a stripplot.

stripplot(sci_df_imp, math_sc, xlab = "imputation number")
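
A density plot is another useful check (a sketch; mice also provides a densityplot() method for imputed objects, overlaying imputed distributions on the observed one):

densityplot(sci_df_imp)  # observed vs. imputed densities for each imputed variable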

22 / 24

Let's use this in a linear model

sci_mod_imp <- with(sci_df_imp, lm(sci ~ math_sc))
summary(pool(sci_mod_imp))
## term estimate std.error statistic df p.value
## 1 (Intercept) -178.89599 182.166993 -0.9820439 132.2844 3.278709e-01
## 2 math_sc 14.98221 2.706412 5.5358212 121.6803 1.813591e-07

pool() combines the estimates across the five imputed datasets using Rubin's rules. The pooled slope (14.98) is close to Mod1's (15.97), but no cases had to be dropped for missingness.
23 / 24

Let's review...

We spent more time dealing with the underlying data for linear modeling

We can...

  • transform data when they are not normally distributed
    • Log-Transform
    • Z-scores
  • visualize missingness in our dataset
  • use multiple imputation with chained equations.
24 / 24
