This week we'll discuss:

- Transforming variables when the regression assumptions don't hold
- Handling missing data

Both of these are real issues that arise all the time in doing regression analyses.
Start by getting the data into R
# Load the packages used this week
library(tidyverse)   # read_csv() and ggplot2
library(here)        # project-relative file paths
library(mice)        # multiple imputation (used later)

# This loads the csv and saves it as a dataframe titled sci_df
sci_df <- read_csv(here("data", "science_scores2.csv"))
Mod1 <- lm(sci ~ math_sc, data = sci_df)
summary(Mod1)
## 
## Call:
## lm(formula = sci ~ math_sc, data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1046.2  -362.0  -140.4   144.3  4657.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -279.474    167.378  -1.670   0.0957 .  
## math_sc       15.967      2.478   6.445 3.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 612.7 on 424 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.08922, Adjusted R-squared:  0.08707 
## F-statistic: 41.53 on 1 and 424 DF,  p-value: 3.154e-10
Note: we're working with a new dataset this week!
Mod2 <- lm(sci ~ pre_sc, data = sci_df)
summary(Mod2)
## 
## Call:
## lm(formula = sci ~ pre_sc, data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -649.40  -50.95    8.69   50.72 1061.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  178.935      9.786   18.29   <2e-16 ***
## pre_sc       429.168      4.872   88.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.2 on 450 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.9452, Adjusted R-squared:  0.9451 
## F-statistic:  7759 on 1 and 450 DF,  p-value: < 2.2e-16
The assumptions behind the linear model:

- Residuals are distributed normally.
- Predictor and fitted values are linearly related.
- The standard deviation of the residuals is the same at all values of the outcome.
- The correlations between predictor variables are small (we'll come back to this).
- There aren't any individual data points that heavily influence our model.

Base R's diagnostic plots give a quick check on most of these; see the sketch below.
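A minimal sketch (the slides' own plotting code isn't shown): the plot() method for lm objects produces the four standard diagnostic plots, roughly one per assumption above.

# Residuals vs. fitted (linearity), normal Q-Q (normality of residuals),
# scale-location (constant variance), residuals vs. leverage (influence)
par(mfrow = c(2, 2))
plot(Mod2)
par(mfrow = c(1, 1))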
That doesn't look good; it doesn't look like the residuals are normally distributed.
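The next plots use a data frame called Mod2_df, whose construction isn't shown in the slides; presumably it comes from broom::augment(), which attaches residuals and fitted values to the data the model was fit on.

library(broom)

# Assumed setup: augment() adds .resid and .fitted columns for plotting
Mod2_df <- augment(Mod2)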
We can do a histogram of residuals
Mod2_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
Definitely not a normal distribution...
It also doesn't hurt to check pre_sc itself.
Mod2_df %>%
  ggplot(aes(y = pre_sc)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
Definitely not a normal distribution... Now what?
The most common transformation is a log transformation, which pulls in the long right tail of a positively skewed variable like this one.
sci_df %>%
  ggplot(aes(y = log(pre_sc))) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
Mod3 <- lm(sci ~ log(pre_sc), data = sci_df)
summary(Mod3)
## 
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571.75 -212.78 -102.83   95.64 2887.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   826.35      17.44   47.40   <2e-16 ***
## log(pre_sc)   533.34      18.28   29.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.6 on 450 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.6542, Adjusted R-squared:  0.6534 
## F-statistic: 851.2 on 1 and 450 DF,  p-value: < 2.2e-16
We can do a histogram of residuals for the transformed model.
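Again, log_Mod3_df isn't built in the slides; presumably it's Mod3 run through broom::augment(), the same as before.

# Assumed setup: augmented version of the log-transformed model
log_Mod3_df <- augment(Mod3)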
log_Mod3_df %>%
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
Is this model with the transformed data better? Comparing the summaries, the R-squared actually dropped from 0.945 (Mod2) to 0.654 (Mod3), and the residuals still look right-skewed. So maybe this data transformation was not the right way to go. There are other approaches to data transformation; one is sketched below.
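For example (an illustration, not from the original slides), MASS::boxcox() searches over power transformations of the outcome and plots the profile log-likelihood.

library(MASS)  # note: MASS masks dplyr::select()

# Profile log-likelihood across candidate power transformations of sci;
# the peak suggests which lambda (e.g., 0 = log, 0.5 = square root) fits best
boxcox(lm(sci ~ pre_sc, data = sci_df))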
Standardizing variables (z-scoring them with scale()) makes it easy to compare coefficients across models and between different variables.
Mod4 <- lm(scale(sci) ~ scale(math_sc), data = sci_df)
summary(Mod4)
## 
## Call:
## lm(formula = scale(sci) ~ scale(math_sc), data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5219 -0.5266 -0.2043  0.2099  6.7761 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.03709    0.04319  -0.859    0.391    
## scale(math_sc)  0.27701    0.04298   6.445 3.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8913 on 424 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.08922, Adjusted R-squared:  0.08707 
## F-statistic: 41.53 on 1 and 424 DF,  p-value: 3.154e-10
Notice that the t-value and the p-value are exactly the same as in the first model! Standardizing is just a linear rescaling, so it changes the units of the coefficients without changing the test statistics.
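As a quick reminder of what scale() does (a toy illustration, not from the slides): it just z-scores a variable.

# scale() subtracts the mean and divides by the standard deviation
x <- c(2, 4, 6, 8)
as.numeric(scale(x))   # -1.16 -0.39  0.39  1.16
(x - mean(x)) / sd(x)  # same values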
Now for our second issue: missing data. Let's take another look at the dataset.

summary(sci_df)
##        id              gr            motiv            pre_sc        
##  Min.   :10789   Min.   :1.000   Min.   : 37.39   Min.   : 0.05074  
##  1st Qu.:34052   1st Qu.:2.000   1st Qu.: 65.01   1st Qu.: 0.50900  
##  Median :54614   Median :3.000   Median : 73.25   Median : 0.98523  
##  Mean   :54561   Mean   :2.988   Mean   : 73.20   Mean   : 1.44099  
##  3rd Qu.:75198   3rd Qu.:4.000   3rd Qu.: 81.33   3rd Qu.: 1.75119  
##  Max.   :99772   Max.   :5.000   Max.   :106.14   Max.   :10.89618  
##                                  NA's   :115      NA's   :14        
##     math_sc            sci        
##  Min.   : 30.00   Min.   : 184.7  
##  1st Qu.: 58.79   1st Qu.: 402.8  
##  Median : 66.61   Median : 595.4  
##  Mean   : 66.74   Mean   : 811.7  
##  3rd Qu.: 74.59   3rd Qu.: 972.6  
##  Max.   :101.33   Max.   :5755.3  
##  NA's   :40       NA's   :34      
What do you notice? Four variables have missing values (look at the NA's counts). Rather than throwing away incomplete rows, we can use multiple imputation: fill in each missing value several times, fit the model to each completed dataset, and pool the results. Sounds difficult, no? The mice package does the heavy lifting. First, let's look at the pattern of missingness.
md.pattern(sci_df)
##     id gr pre_sc sci math_sc motiv    
## 297  1  1      1   1       1     1   0
## 115  1  1      1   1       1     0   1
##  40  1  1      1   1       0     1   1
##  34  1  1      1   0       1     1   1
##  14  1  1      0   1       1     1   1
##      0  0     14  34      40   115 203

Each row is a missingness pattern (1 = observed, 0 = missing): the left margin counts cases with that pattern, the right column counts variables missing in it, and the bottom row totals the missing values per variable.
First, we will create 5 imputed datasets (m = 5), running the imputation algorithm for 5 iterations each (maxit = 5); the seed makes the results reproducible.
sci_df_imp <- mice(sci_df, maxit = 5, m = 5, seed = 327)
## 
##  iter imp variable
##   1   1  motiv  pre_sc  math_sc  sci
##   1   2  motiv  pre_sc  math_sc  sci
##   1   3  motiv  pre_sc  math_sc  sci
##   1   4  motiv  pre_sc  math_sc  sci
##   1   5  motiv  pre_sc  math_sc  sci
##   2   1  motiv  pre_sc  math_sc  sci
##   2   2  motiv  pre_sc  math_sc  sci
##   2   3  motiv  pre_sc  math_sc  sci
##   2   4  motiv  pre_sc  math_sc  sci
##   2   5  motiv  pre_sc  math_sc  sci
##   3   1  motiv  pre_sc  math_sc  sci
##   3   2  motiv  pre_sc  math_sc  sci
##   3   3  motiv  pre_sc  math_sc  sci
##   3   4  motiv  pre_sc  math_sc  sci
##   3   5  motiv  pre_sc  math_sc  sci
##   4   1  motiv  pre_sc  math_sc  sci
##   4   2  motiv  pre_sc  math_sc  sci
##   4   3  motiv  pre_sc  math_sc  sci
##   4   4  motiv  pre_sc  math_sc  sci
##   4   5  motiv  pre_sc  math_sc  sci
##   5   1  motiv  pre_sc  math_sc  sci
##   5   2  motiv  pre_sc  math_sc  sci
##   5   3  motiv  pre_sc  math_sc  sci
##   5   4  motiv  pre_sc  math_sc  sci
##   5   5  motiv  pre_sc  math_sc  sci
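If you want to inspect the filled-in values directly (a side note, not in the original slides), mice::complete() extracts a completed dataset.

# Pull out the first of the five completed datasets
sci_complete1 <- complete(sci_df_imp, action = 1)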
We should look at how the new data are distributed using a stripplot.
stripplot(sci_df_imp, math_sc, xlab = "imputation number")
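A related check (an aside, not in the original slides): mice's densityplot() overlays the distributions of observed and imputed values for each imputed variable.

# Observed values are drawn in blue, imputed values in red
densityplot(sci_df_imp)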
# Fit the model within each imputed dataset, then pool with Rubin's rules
sci_mod_imp <- with(sci_df_imp, lm(sci ~ math_sc))
summary(pool(sci_mod_imp))
##          term   estimate  std.error  statistic       df      p.value
## 1 (Intercept) -178.89599 182.166993 -0.9820439 132.2844 3.278709e-01
## 2     math_sc   14.98221   2.706412  5.5358212 121.6803 1.813591e-07
Note that the pooled math_sc estimate (14.98) is close to the complete-case estimate from Mod1 (15.97), but it uses the information in every row rather than dropping the incomplete cases.

This week we spent more time dealing with the underlying data for linear modeling. We can...

- Check the assumptions behind a linear model
- Transform a skewed variable (e.g., with a log) and judge whether the transformation helped
- Standardize variables with scale() to get comparable coefficients
- Handle missing data with multiple imputation via mice, pooling estimates across the imputed datasets