
EDUC 847 Winter 25

Week 4 - Data for Linear Modeling

Eric Brewe
Professor of Physics at Drexel University

29 January 2025, last update: 2025-02-05

1 / 24

Let's set the stage

  • Last week we talked about the assumptions that go into linear modeling.
  • We didn't talk about what happens when these assumptions are violated.

This week we'll discuss

  • Data transformations
  • Missing data

Both of these are real issues that arise all the time in doing regression analyses.

2 / 24

Let's start where we ended last week

Start by getting the data into R

# Load the csv and save it as a dataframe called sci_df
library(tidyverse)  # read_csv(), ggplot2, dplyr
library(here)       # project-relative file paths

sci_df <- read_csv(here("data",
                        "science_scores2.csv"))
3 / 24

Let's remember last week

We want to predict a science score from a math score.

Mod1 <- lm(sci ~ math_sc, data = sci_df)
summary(Mod1)
##
## Call:
## lm(formula = sci ~ math_sc, data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1046.2 -362.0 -140.4 144.3 4657.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -279.474 167.378 -1.670 0.0957 .
## math_sc 15.967 2.478 6.445 3.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 612.7 on 424 degrees of freedom
## (74 observations deleted due to missingness)
## Multiple R-squared: 0.08922, Adjusted R-squared: 0.08707
## F-statistic: 41.53 on 1 and 424 DF, p-value: 3.154e-10

Note - we have a new dataset!

4 / 24

Let's check some other variables

What about the pre score (pre_sc)?

Mod2 <- lm(sci ~ pre_sc, data = sci_df)
summary(Mod2)
##
## Call:
## lm(formula = sci ~ pre_sc, data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -649.40 -50.95 8.69 50.72 1061.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.935 9.786 18.29 <2e-16 ***
## pre_sc 429.168 4.872 88.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.2 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.9452, Adjusted R-squared: 0.9451
## F-statistic: 7759 on 1 and 450 DF, p-value: < 2.2e-16
5 / 24

Let's check on the assumptions

Normality of Residuals

Residuals are distributed normally.

Linearity

The predictor and the fitted values are linearly related

Homogeneity of variance

The standard deviation of the residuals is the same at all fitted values

Uncorrelated predictors

The correlations between predictor variables are small (we'll come back to this)

No bad outliers

There aren't any individual data points that unduly influence our model
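
Base R gives a quick visual check of most of these at once (a sketch: calling plot() on an lm object draws the four standard diagnostic panels):

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(Mod2)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the layout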

6 / 24

Let's start with a scatter plot

That doesn't look good; it doesn't look like the residuals are normally distributed.
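
A sketch of how that plot might be drawn (assuming a scatter of sci against pre_sc with the Mod2 fit overlaid):

sci_df %>%
  ggplot(aes(x = pre_sc, y = sci)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # overlay the fitted line
  theme_minimal()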

7 / 24

Let's check the normality

We can do a histogram of residuals

# Mod2_df isn't created on the slides; assuming it holds the model data plus
# residuals via broom::augment(), which adds a .resid column:
library(broom)
Mod2_df <- augment(Mod2)

Mod2_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()

Definitely not a normal distribution...

8 / 24

Let's check the normality

It also doesn't hurt to check the pre score (pre_sc) itself

Mod2_df %>%
  ggplot(aes(y = pre_sc)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()

Definitely not a normal distribution... now what?

9 / 24

Let's transform our data

When your data are not normally distributed, you can apply a transformation.

The most common transformation is a log transformation

sci_df %>%
  ggplot(aes(y = log(pre_sc))) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
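
One caveat worth stating: log() is only defined for positive values, so check before transforming (a quick sketch; here pre_sc is safe, since its minimum is about 0.05):

# Should be TRUE before applying a log transform
all(sci_df$pre_sc > 0, na.rm = TRUE)
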
10 / 24

Let's use the log transform in a model

Mod3 <- lm(sci ~ log(pre_sc), data = sci_df)
summary(Mod3)
##
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571.75 -212.78 -102.83 95.64 2887.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 826.35 17.44 47.40 <2e-16 ***
## log(pre_sc) 533.34 18.28 29.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.6 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.6542, Adjusted R-squared: 0.6534
## F-statistic: 851.2 on 1 and 450 DF, p-value: < 2.2e-16
11 / 24

Let's interpret this new model

summary(Mod3)
##
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571.75 -212.78 -102.83 95.64 2887.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 826.35 17.44 47.40 <2e-16 ***
## log(pre_sc) 533.34 18.28 29.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.6 on 450 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.6542, Adjusted R-squared: 0.6534
## F-statistic: 851.2 on 1 and 450 DF, p-value: < 2.2e-16
12 / 24

Let's check the normality

We can do a histogram of residuals for transformed data

# Again assuming the residuals come from broom::augment()
log_Mod3_df <- augment(Mod3)

log_Mod3_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
13 / 24

Let's evaluate this new model

Is this model with the transformed data better?

  • The R² is lower in the new model (0.65, vs. 0.95 for Mod2).
  • The residuals still aren't normally distributed.

So maybe this data transformation was not the right way to go. There are other approaches to data transformation (one is sketched below).
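
For instance, a square-root transform is a gentler alternative to the log, and MASS::boxcox() can suggest a power transformation of the outcome from the data itself. A sketch of both (Mod3b is a hypothetical name, not from the slides):

# A gentler transform of the predictor
Mod3b <- lm(sci ~ sqrt(pre_sc), data = sci_df)

# Box-Cox profiles power transformations of the outcome (valid here: sci is strictly positive)
MASS::boxcox(lm(sci ~ pre_sc, data = sci_df))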

14 / 24

Let's consider another transform

Another common transformation is to use Z-scores

This makes it easy to compare coefficients across models and between different variables.

Mod4 <- lm(scale(sci) ~ scale(math_sc), data = sci_df)
summary(Mod4)
##
## Call:
## lm(formula = scale(sci) ~ scale(math_sc), data = sci_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5219 -0.5266 -0.2043 0.2099 6.7761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03709 0.04319 -0.859 0.391
## scale(math_sc) 0.27701 0.04298 6.445 3.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8913 on 424 degrees of freedom
## (74 observations deleted due to missingness)
## Multiple R-squared: 0.08922, Adjusted R-squared: 0.08707
## F-statistic: 41.53 on 1 and 424 DF, p-value: 3.154e-10

Hey, notice that the t value and the p value both stay the same as in the first model! Standardizing rescales the coefficients into standard-deviation units (a one-SD increase in math_sc predicts about a 0.28-SD increase in sci) without changing the strength of the relationship.

15 / 24

Let's move on to missing data.

16 / 24

Let's inspect our data

summary(sci_df)
## id gr motiv pre_sc
## Min. :10789 Min. :1.000 Min. : 37.39 Min. : 0.05074
## 1st Qu.:34052 1st Qu.:2.000 1st Qu.: 65.01 1st Qu.: 0.50900
## Median :54614 Median :3.000 Median : 73.25 Median : 0.98523
## Mean :54561 Mean :2.988 Mean : 73.20 Mean : 1.44099
## 3rd Qu.:75198 3rd Qu.:4.000 3rd Qu.: 81.33 3rd Qu.: 1.75119
## Max. :99772 Max. :5.000 Max. :106.14 Max. :10.89618
## NA's :115 NA's :14
## math_sc sci
## Min. : 30.00 Min. : 184.7
## 1st Qu.: 58.79 1st Qu.: 402.8
## Median : 66.61 Median : 595.4
## Mean : 66.74 Mean : 811.7
## 3rd Qu.: 74.59 3rd Qu.: 972.6
## Max. :101.33 Max. :5755.3
## NA's :40 NA's :34

What do you notice?
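
To put numbers on it, one quick check (a sketch) is to count the NAs in each column:

colSums(is.na(sci_df))  # motiv: 115, math_sc: 40, sci: 34, pre_sc: 14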

17 / 24

Let's deal with these missing data!

Here are some options

  • Hot Deck Replacement
  • Mean/Median Replacement (a quick sketch follows this list)
  • Regression Replacement
  • Multiple Imputation
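
As an example, mean replacement is the simplest of these: each missing value is filled in with the column mean. A sketch, assuming dplyr (note that this shrinks the variance and tends to understate standard errors):

library(dplyr)

sci_df_mean <- sci_df %>%
  mutate(math_sc = if_else(is.na(math_sc),
                           mean(math_sc, na.rm = TRUE),  # mean of the observed values
                           math_sc))
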
18 / 24

Let's look at Multiple Imputation

  • Impute the missing data (do this several times, producing several complete datasets)
  • Analyze all the new data sets
  • Pool the results

Sounds difficult, no?

19 / 24

Let's try multiple imputation

First, it works best with data that are MCAR (missing completely at random) or MAR (missing at random).

The md.pattern() function from the mice package shows the patterns of missingness:

library(mice)
md.pattern(sci_df)

## id gr pre_sc sci math_sc motiv
## 297 1 1 1 1 1 1 0
## 115 1 1 1 1 1 0 1
## 40 1 1 1 1 0 1 1
## 34 1 1 1 0 1 1 1
## 14 1 1 0 1 1 1 1
## 0 0 14 34 40 115 203

Each row is a missingness pattern (1 = observed, 0 = missing): the left-hand count is the number of cases with that pattern, and the bottom row totals the missing values per variable.
20 / 24

Let's go ahead and impute

First, we will create 5 imputed datasets (m = 5), each built with 5 iterations of the chained equations (maxit = 5); setting a seed makes the imputation reproducible.

sci_df_imp <- mice(sci_df, maxit = 5, m = 5, seed = 327)
##
## iter imp variable
## 1 1 motiv pre_sc math_sc sci
## 1 2 motiv pre_sc math_sc sci
## 1 3 motiv pre_sc math_sc sci
## 1 4 motiv pre_sc math_sc sci
## 1 5 motiv pre_sc math_sc sci
## 2 1 motiv pre_sc math_sc sci
## 2 2 motiv pre_sc math_sc sci
## 2 3 motiv pre_sc math_sc sci
## 2 4 motiv pre_sc math_sc sci
## 2 5 motiv pre_sc math_sc sci
## 3 1 motiv pre_sc math_sc sci
## 3 2 motiv pre_sc math_sc sci
## 3 3 motiv pre_sc math_sc sci
## 3 4 motiv pre_sc math_sc sci
## 3 5 motiv pre_sc math_sc sci
## 4 1 motiv pre_sc math_sc sci
## 4 2 motiv pre_sc math_sc sci
## 4 3 motiv pre_sc math_sc sci
## 4 4 motiv pre_sc math_sc sci
## 4 5 motiv pre_sc math_sc sci
## 5 1 motiv pre_sc math_sc sci
## 5 2 motiv pre_sc math_sc sci
## 5 3 motiv pre_sc math_sc sci
## 5 4 motiv pre_sc math_sc sci
## 5 5 motiv pre_sc math_sc sci
21 / 24

Let's plot our imputed data

We should look at how the imputed values are distributed, using a stripplot.

stripplot(sci_df_imp, math_sc, xlab = "imputation number")
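
A density plot is another useful check (a sketch; mice also provides a densityplot() method for imputed objects, overlaying imputed distributions on the observed one):

densityplot(sci_df_imp)  # observed vs. imputed densities for each imputed variable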

22 / 24

Let's use this in a linear model

sci_mod_imp <- with(sci_df_imp, lm(sci ~ math_sc))
summary(pool(sci_mod_imp))
## term estimate std.error statistic df p.value
## 1 (Intercept) -178.89599 182.166993 -0.9820439 132.2844 3.278709e-01
## 2 math_sc 14.98221 2.706412 5.5358212 121.6803 1.813591e-07

pool() combines the estimates across the five imputed datasets using Rubin's rules. The pooled slope (14.98) is close to Mod1's (15.97), but no cases had to be dropped for missingness.
23 / 24

Let's review...

We spent more time dealing with the underlying data for linear modeling

We can...

  • transform data when they are not normally distributed
    • Log-Transform
    • Z-scores
  • visualize missingness in our dataset
  • use multiple imputation with chained equations.
24 / 24
