class: center, middle, inverse, title-slide

.title[
# EDUC 847 Winter 24
]
.subtitle[
## Week 4 - Data for Linear Modeling
]
.author[
### Eric Brewe
Professor of Physics at Drexel University
]
.date[
### 5 February 2024, last update: 2024-02-26
]

---
class: center, middle

# Let's start with the data

Start by getting the data into R

```r
# This loads the csv and saves it as a dataframe titled sci_df
sci_df <- read_csv(here("data", "science_scores2.csv"))
```

---

# Let's remember last week

### We want to predict a science score from a math score.

```r
*sci_mod <- lm(sci ~ math_sc, data = sci_df)
summary(sci_mod)
```

```
## 
## Call:
## lm(formula = sci ~ math_sc, data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1046.2  -362.0  -140.4   144.3  4657.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -279.474    167.378  -1.670   0.0957 .  
## math_sc        15.967      2.478   6.445 3.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 612.7 on 424 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.08922,	Adjusted R-squared:  0.08707 
## F-statistic: 41.53 on 1 and 424 DF,  p-value: 3.154e-10
```

Note - we have a new dataset!

---

# Let's check some other variables

## What about the pre score?

```r
Mod2 <- lm(sci ~ pre_sc, data = sci_df)
summary(Mod2)
```

```
## 
## Call:
## lm(formula = sci ~ pre_sc, data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -649.40  -50.95    8.69   50.72 1061.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  178.935      9.786   18.29   <2e-16 ***
## pre_sc       429.168      4.872   88.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.2 on 450 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.9452,	Adjusted R-squared:  0.9451 
## F-statistic:  7759 on 1 and 450 DF,  p-value: < 2.2e-16
```

---

# Let's check on the assumptions

### Normality of residuals
The residuals are normally distributed.

### Linearity
The predictor and the outcome are linearly related.

### Homogeneity of variance
The standard deviation of the residuals is the same at all values of the predictor.

### Uncorrelated predictors
The correlations between predictor variables are small. (We'll come back to this.)

### No bad outliers
There aren't any individual data points that strongly influence our model.

---

# Let's start with a scatter plot

![](EDUC_847_Week_4_files/figure-html/ScatterPlotWithResiduals-1.png)<!-- -->

That doesn't look good; the residuals don't look normal.

---

# Let's check the normality

We can do a histogram of the residuals

.pull-left[
![](EDUC_847_Week_4_files/figure-html/ResidualHist-1.png)<!-- -->
]

.pull-right[

```r
sci_mod_df %>% 
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
```
]

Definitely not a normal distribution...

---

# Let's check the normality

It also doesn't hurt to check pre_sc itself

.pull-left[
![](EDUC_847_Week_4_files/figure-html/pre_sc_Hist-1.png)<!-- -->
]

.pull-right[

```r
sci_mod_df %>% 
  ggplot(aes(y = pre_sc)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
```
]

Definitely not a normal distribution...

Now what????
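
---

# Let's build the residuals data frame

The residual histograms above use a data frame called `sci_mod_df` that we never built on a slide. Here is a minimal sketch of one way to create it, assuming the broom package and assuming it comes from the pre-score model `Mod2` (swap in `sci_mod` if you want the math-score residuals instead):

```r
library(broom)

# augment() returns the model's data plus diagnostic columns,
# including the .fitted and .resid columns the histograms use
sci_mod_df <- augment(Mod2)
```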

---

# Let's transform our data

## When your data are not normally distributed, you can apply a transformation.

The most common transformation is a log transformation

.pull-left[
![](EDUC_847_Week_4_files/figure-html/LogTrans-1.png)<!-- -->
]

.pull-right[

```r
sci_mod_df %>% 
  ggplot(aes(y = log(pre_sc))) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
```
]

---

# Let's use the log transform in a model

```r
Mod3 <- lm(sci ~ log(pre_sc), data = sci_df)
summary(Mod3)
```

```
## 
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571.75 -212.78 -102.83   95.64 2887.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   826.35      17.44   47.40   <2e-16 ***
## log(pre_sc)   533.34      18.28   29.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.6 on 450 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.6542,	Adjusted R-squared:  0.6534 
## F-statistic: 851.2 on 1 and 450 DF,  p-value: < 2.2e-16
```

---

# Let's interpret this new model

```r
summary(Mod3)
```

```
## 
## Call:
## lm(formula = sci ~ log(pre_sc), data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571.75 -212.78 -102.83   95.64 2887.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   826.35      17.44   47.40   <2e-16 ***
## log(pre_sc)   533.34      18.28   29.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.6 on 450 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.6542,	Adjusted R-squared:  0.6534 
## F-statistic: 851.2 on 1 and 450 DF,  p-value: < 2.2e-16
```

---

# Let's check the normality

We can do a histogram of the residuals for the transformed data

.pull-left[
![](EDUC_847_Week_4_files/figure-html/LogResidualHist-1.png)<!-- -->
]

.pull-right[

```r
log_sci_mod_df %>% 
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
```
]

---

# Let's consider some other transforms

## Another common transformation is to use Z-scores

This makes it easy to compare coefficients

```r
Mod4 <- lm(scale(sci) ~ scale(math_sc), data = sci_df)
summary(Mod4)
```

```
## 
## Call:
## lm(formula = scale(sci) ~ scale(math_sc), data = sci_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5219 -0.5266 -0.2043  0.2099  6.7761 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.03709    0.04319  -0.859    0.391    
## scale(math_sc)  0.27701    0.04298   6.445 3.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8913 on 424 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.08922,	Adjusted R-squared:  0.08707 
## F-statistic: 41.53 on 1 and 424 DF,  p-value: 3.154e-10
```

Hey, notice that the t-value and the p-value are both the same as in the first model!

---

# Let's deal with these missing data!

```r
summary(sci_df)
```

```
##        id              gr            motiv            pre_sc         
##  Min.   :10789   Min.   :1.000   Min.   : 37.39   Min.   : 0.05074  
##  1st Qu.:34052   1st Qu.:2.000   1st Qu.: 65.01   1st Qu.: 0.50900  
##  Median :54614   Median :3.000   Median : 73.25   Median : 0.98523  
##  Mean   :54561   Mean   :2.988   Mean   : 73.20   Mean   : 1.44099  
##  3rd Qu.:75198   3rd Qu.:4.000   3rd Qu.: 81.33   3rd Qu.: 1.75119  
##  Max.   :99772   Max.   :5.000   Max.   :106.14   Max.   :10.89618  
##                                  NA's   :115      NA's   :14        
##     math_sc            sci        
##  Min.   : 30.00   Min.   : 184.7  
##  1st Qu.: 58.79   1st Qu.: 402.8  
##  Median : 66.61   Median : 595.4  
##  Mean   : 66.74   Mean   : 811.7  
##  3rd Qu.: 74.59   3rd Qu.: 972.6  
##  Max.   :101.33   Max.   :5755.3  
##  NA's   :40       NA's   :34      
```

Any thoughts?
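
---

# Let's count the missing data

Before choosing a fix, it helps to tally the missing values directly. A quick sketch using base R (the `md.pattern()` call a few slides ahead gives the same counts plus the pattern of missingness):

```r
# Number of NAs in each column of the data frame
colSums(is.na(sci_df))
```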

---

# Let's deal with these missing data!

## Here are some options

- Hot Deck Replacement
- Mean/Median Replacement
- Regression Replacement
- Multiple Imputation

---

# Let's look at the basics of Multiple Imputation

- Impute the missing data (do this several times)
- Analyze all the new data sets
- Pool the results

Sounds difficult, no?

---

# Let's try multiple imputation

## First, it works best with data that are MCAR or MAR (missing completely at random or missing at random)

```r
md.pattern(sci_df)
```

![](EDUC_847_Week_4_files/figure-html/visualizeMD-1.png)<!-- -->

```
##     id gr pre_sc sci math_sc motiv    
## 297  1  1      1   1       1     1   0
## 115  1  1      1   1       1     0   1
## 40   1  1      1   1       0     1   1
## 34   1  1      1   0       1     1   1
## 14   1  1      0   1       1     1   1
##      0  0     14  34      40   115 203
```

---

# Let's go ahead and impute

First, we will create 5 imputed datasets, running 5 iterations of the chained-equations algorithm for each.

```r
sci_df_imp <- mice(sci_df, maxit = 5, m = 5, seed = 327)
```

```
## 
##  iter imp variable
##   1   1  motiv  pre_sc  math_sc  sci
##   1   2  motiv  pre_sc  math_sc  sci
##   1   3  motiv  pre_sc  math_sc  sci
##   1   4  motiv  pre_sc  math_sc  sci
##   1   5  motiv  pre_sc  math_sc  sci
##   2   1  motiv  pre_sc  math_sc  sci
##   2   2  motiv  pre_sc  math_sc  sci
##   2   3  motiv  pre_sc  math_sc  sci
##   2   4  motiv  pre_sc  math_sc  sci
##   2   5  motiv  pre_sc  math_sc  sci
##   3   1  motiv  pre_sc  math_sc  sci
##   3   2  motiv  pre_sc  math_sc  sci
##   3   3  motiv  pre_sc  math_sc  sci
##   3   4  motiv  pre_sc  math_sc  sci
##   3   5  motiv  pre_sc  math_sc  sci
##   4   1  motiv  pre_sc  math_sc  sci
##   4   2  motiv  pre_sc  math_sc  sci
##   4   3  motiv  pre_sc  math_sc  sci
##   4   4  motiv  pre_sc  math_sc  sci
##   4   5  motiv  pre_sc  math_sc  sci
##   5   1  motiv  pre_sc  math_sc  sci
##   5   2  motiv  pre_sc  math_sc  sci
##   5   3  motiv  pre_sc  math_sc  sci
##   5   4  motiv  pre_sc  math_sc  sci
##   5   5  motiv  pre_sc  math_sc  sci
```

---

# Let's plot our imputed data

We should look at how the imputed values are distributed using a stripplot.

```r
stripplot(sci_df_imp, math_sc, xlab = "imputation number")
```

![](EDUC_847_Week_4_files/figure-html/plotImputedMathScore-1.png)<!-- -->

---

# Let's use this in a linear model

```r
sci_mod_imp <- with(sci_df_imp, lm(sci ~ math_sc))
summary(pool(sci_mod_imp))
```

```
##          term   estimate  std.error  statistic       df      p.value
## 1 (Intercept) -178.89599 182.166993 -0.9820439 132.2844 3.278709e-01
## 2     math_sc   14.98221   2.706412  5.5358212 121.6803 1.813591e-07
```

---

# Let's review...

We spent more time dealing with the underlying data for linear modeling.

We can...

- transform data when it is not normally distributed
  - log-transform
  - z-scores
- visualize missingness in our dataset
- use multiple imputation with chained equations
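
---

# Let's keep the imputation recipe handy

For reference, the whole workflow from today fits in a few lines. This is just a sketch collecting the calls used above (mice package, same arguments):

```r
library(mice)

# 1. Impute: 5 imputed datasets, 5 iterations of chained equations each
sci_df_imp <- mice(sci_df, m = 5, maxit = 5, seed = 327)

# 2. Analyze: fit the same linear model to every imputed dataset
sci_mod_imp <- with(sci_df_imp, lm(sci ~ math_sc))

# 3. Pool: combine the estimates across datasets (Rubin's rules)
summary(pool(sci_mod_imp))
```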