class: center, middle, inverse, title-slide .title[ # EDUC 847 Winter 25 ] .subtitle[ ## Week 3 - Data for Linear Modeling ] .author[ ### Eric Brewe
Professor of Physics at Drexel University
] .date[ ### 22 January 2025, last update: 2025-01-21 ] --- class: center, middle # Let's start where we ended last class...Scatter Plots Start by getting the data into R ``` r #This loads the csv and saves it as a dataframe titled week_1_data week_1_data <- read_csv(here("static/slides/EDUC_847/data", "EDUC_847_Survey.csv")) ``` --- # Let's start cleaning up. ``` r week_1_data %>% select(ID,Q2:Q6) %>% filter(row_number() >3 ) %>% * mutate(across(c(Q2:Q3,Q5:Q6), as.numeric)) -> w1df ``` --- # Let's remember what we did last week ### Is there a relationship between length of book and estimates on weight of cow? .pull-left[ ### Making a scatter plot to explore ``` r w1df %>% select(Q5:Q6) %>% ggplot(aes(x = Q5, y = Q6)) + geom_point() + theme_minimal() + labs(x = "Pages", y = "Weight of Cow", title = "Kick Ass Scatterplot!") ``` ] .pull-right[ ![](EDUC_847_Week_3_files/figure-html/unnamed-chunk-1-1.png)<!-- --> ] A **residual** is the distance between the Y value for each measurement, and the predicted Y value. --- # Let's make a linear regression model ### We want the estimated weight of the cow as predicted by the length of the book closest to the survey respondent. ``` r *Model1 <- lm(Q6 ~ Q5, data = w1df) Model1 ``` ``` ## ## Call: ## lm(formula = Q6 ~ Q5, data = w1df) ## ## Coefficients: ## (Intercept) Q5 ## 1668.7235 0.3001 ``` --- # Let's dig into our model ``` r summary(Model1) ``` ``` ## ## Call: ## lm(formula = Q6 ~ Q5, data = w1df) ## ## Residuals: ## 1 2 3 4 5 6 7 ## 2638.4 -478.3 -933.6 -274.3 -229.5 -285.5 -437.0 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1668.7235 723.6088 2.306 0.0692 . ## Q5 0.3001 1.8163 0.165 0.8753 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1301 on 5 degrees of freedom ## Multiple R-squared: 0.005429, Adjusted R-squared: -0.1935 ## F-statistic: 0.02729 on 1 and 5 DF, p-value: 0.8753 ``` --- # Let's look at hypothesis testing for this model ## Statistical significance tells us whether or not we can reject the null hypothesis 1. What is the null model? 2. What is our tolerance for type 1 error? (p-value) 3. Is this model different from the null model? --- # Let's think about this model... ## R<sup>2</sup> = 0.005429 We saw this last week ## F-Statistic = 0.02729 This is a ratio of the Model Sum of Squares vs. the Residual Sum of Squares ## p = 0.8753 This is the percent of time you would have to be willing to have a type one error (False Positive) ## This model doesn't fit very well. --- # Let's think about the Intercept ## Intercept = 1668 This tells us the average predicted weight of the cow. ## Standard Error = 723.6 `\(SE = \frac{\sigma}{\sqrt{N}}\)` ## t value = 2.3006 `\(t = \frac{Intercept}{Standard Error} = \frac{1668.7}{723.6} = 2.03\)` ## p = 0.0692 Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". -- It is not. --- # Let's think about the Q5 coeff. ## Q5 = 0.3001 The average additional weight added for each additional page in the closet book. ## Standard Error = 1.8163 `\(SE = \frac{\sigma}{\sqrt{N}}\)` ## t value = 0.165 `\(t = \frac{Intercept}{Standard Error} = \frac{.3001}{1.8163} = 0.165\)` ## p = 0.8753 Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". -- It is not. --- # Let's think about why this model is a failure ## Maybe it is because we violated the assumptions? And what are those assumptions? from https://learningstatisticswithr.com/book/regression.html#regressionassumptions --- # Let's think about why this model is a failure ### Normality of Residuals Residuals are distributed normally. ### Linearity Predictor and Fitted are linearly related ### Homogeneity of varicance The standard deviation of the residual is the same at all values of the Outcome ### Uncorrelated predictors The correlations between predictor variables is small (We'll come back to this) ### No bad outliers There aren't any individual data points that are really impacting our model --- # Let's check out those residuals... .pull-left[ ![](EDUC_847_Week_3_files/figure-html/ScatterPlotWithResiduals-1.png)<!-- --> ] .pull-right[ ``` r augment(Model1) %>% select(Q5:.resid) ``` ``` ## # A tibble: 7 × 3 ## Q5 .fitted .resid ## <dbl> <dbl> <dbl> ## 1 273 1751. 2638. ## 2 32 1678. -478. ## 3 383 1784. -934. ## 4 352 1774. -274. ## 5 869 1929. -229. ## 6 56 1686. -286. ## 7 81 1693. -437. ``` ] --- # Let's check the normality residuals... We can do a histogram of the residuals .pull-left[ ![](EDUC_847_Week_3_files/figure-html/ResidualHist-1.png)<!-- --> ] .pull-right[ ``` r mod_df <- augment(Model1) mod_df %>% ggplot(aes(y = .resid)) + geom_histogram() + theme_minimal() + coord_flip() ``` ] Doesn't look normal! Could confirm with a Shapiro-Wilk test... --- # Let's check the normality residuals... We can do a Q-Q plot .pull-left[ ![](EDUC_847_Week_3_files/figure-html/QQPlot-1.png)<!-- --> ] .pull-right[ ``` r plot(x = Model1, which = 2 ) ``` ## Should be a straight line ## Should have a slope = 1 Not so much ] --- # Let's check the linearity assumption We can do a scatter plot of the Outcome (Q6) vs. the Fitted (the result of the eqn. ) .pull-left[ ![](EDUC_847_Week_3_files/figure-html/OutcomeVsFitted-1.png)<!-- --> ] .pull-right[ ``` r plot(x = Model1, which = 1 ) ``` ## Should be a straight line Not so much ] --- # Let's check the homogeneity of variance assumption We can do a scatter plot of the standardized residuals vs. the Fitted values (the result of the eqn. ) .pull-left[ ![](EDUC_847_Week_3_files/figure-html/StdResidualsVsFitted-1.png)<!-- --> ] .pull-right[ ``` r plot(x = Model1, which = 3 ) ``` ## Should be a straight horizontal line Not so much ] --- # Let's check the no bad outliers assumption We can do a boxplot to get a sense .pull-left[ ![](EDUC_847_Week_3_files/figure-html/BoxPlot-1.png)<!-- --> ] .pull-right[ ``` r w1df %>% ggplot(aes(y = Q6)) + geom_boxplot() + theme_minimal() ``` ## Note the one way up above might be an outlier We should confirm. ] --- # Let's check the no bad outliers assumption We can check the leverage of the model .pull-left[ ![](EDUC_847_Week_3_files/figure-html/Leverage-1.png)<!-- --> ] .pull-right[ ``` r plot(Model1, which = 5) ``` ## Looks like observations 5 might have much larger leverage? We should confirm. ] --- # Let's revisit our model ## Maybe, just maybe this model is junk because... There is no good reason to think that the number of pages in the book closest to you when you answered this survey has any bearing on how much you would estimate a cow to weigh! --- # Let's try this with a different model ``` r w3df <- read_csv(here("static/slides/EDUC_847/data", "science_scores.csv")) sci_model <- lm(sci ~ math_sc, data = w3df) ``` --- # Let's try this with a different model ``` r summary(sci_model) ``` ``` ## ## Call: ## lm(formula = sci ~ math_sc, data = w3df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -927.3 -362.3 -187.9 182.5 4431.6 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 43.631 164.487 0.265 0.791 ## math_sc 11.622 2.475 4.696 3.43e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 653.1 on 498 degrees of freedom ## Multiple R-squared: 0.04241, Adjusted R-squared: 0.04049 ## F-statistic: 22.06 on 1 and 498 DF, p-value: 3.428e-06 ``` --- # Let's look at hypothesis testing for the model ## Statistical significance tells us whether or not we can reject the null hypothesis 1. What is the null model? 2. What is our tolerance for type 1 error? (p-value) 3. Is this model different from the null model? --- # Let's look at hypothesis testing for the model ## Statistical significance tells us whether or not we can reject the null hypothesis 1. What is the null model? The intercept only model. 2. What is our tolerance for type 1 error? (p-value) Typically 95% 3. Is this model different from the null model? Yes. We know because the F-statistic is very large, and the p- value is less than 0.05 (5%). --- # Let's look at hypothesis testing for the coef ## Statistical significance tells us whether or not we can reject the null hypothesis 1. What is the null model? 2. What is our tolerance for type 1 error? (p-value) 3. Is this model different from the null model? --- # Let's look at hypothesis testing for the coef ## Statistical significance tells us whether or not we can reject the null hypothesis 1. What is the null model? the coefficient = 0 2. What is our tolerance for type 1 error? (p-value) typically 95% 3. Is this model different from the null model? Yes! The t-value for the math_sc is 49.27, p << 0.05 --- # Let's talk about what we can say... 1. Can we reject the null model? 2. Does a p << 0.05 mean that the model is meaningful? --- # Let's talk about what we can say... 1. Can we reject the null model? Yes! 2. Does a p << 0.05 mean that the model is meaningful? No! --- # Let's look at confidence intervals Confidence intervals give us a sense about the range of likely true positives. For a coefficient, we first establish a tolerance for false positives (95%), then... `\(C.I. = b \pm t_{crit} * S.E.(b)\)` ``` r confint(sci_model, level = 0.95) ``` ``` ## 2.5 % 97.5 % ## (Intercept) -279.543039 366.80601 ## math_sc 6.760086 16.48408 ```