Start by getting the data into R
# This loads the csv and saves it as a dataframe titled week_1_data
week_1_data <- read_csv(here("static/slides/EDUC_847/data", "EDUC_847_Survey.csv"))
week_1_data %>% 
  select(ID, Q2:Q6) %>% 
  filter(row_number() > 3) %>% 
  mutate(across(c(Q2:Q3, Q5:Q6), as.numeric)) -> w1df
w1df %>% 
  select(Q5:Q6) %>% 
  ggplot(aes(x = Q5, y = Q6)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages", y = "Weight of Cow", title = "Kick Ass Scatterplot!")
A residual is the distance between the observed Y value for each measurement and the predicted Y value.
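As a quick sketch of that idea (using R's built-in `mtcars` data as a stand-in, since any fitted model works), a residual is just observed minus predicted:

```r
# A minimal sketch: residual = observed Y - predicted Y
fit <- lm(mpg ~ wt, data = mtcars)        # example model on a built-in dataset
predicted <- fitted(fit)                   # predicted Y for each observation
resid_by_hand <- mtcars$mpg - predicted    # observed minus predicted
all.equal(unname(resid_by_hand), unname(residuals(fit)))  # matches residuals(fit)
```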
Model1 <- lm(Q6 ~ Q5, data = w1df)
Model1
## 
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
## 
## Coefficients:
## (Intercept)           Q5  
##   1668.7235       0.3001
summary(Model1)
## 
## Call:
## lm(formula = Q6 ~ Q5, data = w1df)
## 
## Residuals:
##      1      2      3      4      5      6      7 
## 2638.4 -478.3 -933.6 -274.3 -229.5 -285.5 -437.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1668.7235   723.6088   2.306   0.0692 .
## Q5             0.3001     1.8163   0.165   0.8753  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1301 on 5 degrees of freedom
## Multiple R-squared:  0.005429,	Adjusted R-squared:  -0.1935 
## F-statistic: 0.02729 on 1 and 5 DF,  p-value: 0.8753
We saw this last week
This is a ratio of the Model Sum of Squares to the Residual Sum of Squares (each divided by its degrees of freedom).
This is the percentage of the time you would have to be willing to accept a Type I error (false positive).
This tells us the average predicted weight of the cow.
SE = σ / √N
t = Intercept / Standard Error = 1668.7 / 723.6 = 2.31
Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.
The average additional weight added for each additional page in the closest book.
SE = σ / √N
t = Coefficient / Standard Error = 0.3001 / 1.8163 = 0.165
Depending on our predetermined willingness to accept false positives, this tells us whether a result is "statistically significant". It is not.
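Those hand calculations can be recovered directly from the coefficient table; a sketch, assuming the `Model1` object fit earlier:

```r
# Estimate / Std. Error reproduces the "t value" column of summary(Model1)
coefs <- coef(summary(Model1))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] / coefs[, "Std. Error"]
# compare against coefs[, "t value"]
```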
And what are those assumptions? (From https://learningstatisticswithr.com/book/regression.html#regressionassumptions)
Residuals are distributed normally.
Predictor and Fitted are linearly related
The standard deviation of the residual is the same at all values of the Outcome
The correlations between predictor variables are small (we'll come back to this)
There aren't any individual data points having an outsized impact on our model
augment(Model1) %>% select(Q5:.resid)
## # A tibble: 7 × 3
##      Q5 .fitted .resid
##   <dbl>   <dbl>  <dbl>
## 1   273   1751.  2638.
## 2    32   1678.  -478.
## 3   383   1784.  -934.
## 4   352   1774.  -274.
## 5   869   1929.  -229.
## 6    56   1686.  -286.
## 7    81   1693.  -437.
We can do a histogram of the residuals
mod_df <- augment(Model1)
mod_df %>% 
  ggplot(aes(y = .resid)) +
  geom_histogram() +
  theme_minimal() +
  coord_flip()
Doesn't look normal! Could confirm with a Shapiro-Wilk test...
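That confirmation is a one-liner in base R (assuming the `Model1` object from earlier); a small p-value would be evidence against normally distributed residuals:

```r
# Shapiro-Wilk test of normality, applied to the model residuals
shapiro.test(residuals(Model1))
```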
We can do a Q-Q plot
plot(x = Model1, which = 2 )
Not so much
We can do a scatter plot of the Residuals vs. the Fitted values (the predictions from the regression equation)
plot(x = Model1, which = 1 )
Not so much
We can do a scatter plot of the standardized residuals vs. the Fitted values (the predictions from the regression equation)
plot(x = Model1, which = 3 )
Not so much
We can do a boxplot to get a sense
w1df %>% 
  ggplot(aes(y = Q6)) +
  geom_boxplot() +
  theme_minimal()
We should confirm.
We can check the leverage of the model
plot(Model1, which = 5)
We should confirm.
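One way to confirm is to look at Cook's distance, which summarizes each observation's influence on the fit; a sketch, assuming `Model1` from earlier (the 4/N cutoff is just one common rule of thumb, not a hard rule):

```r
# Cook's distance flags observations with outsized influence on the model
cd <- cooks.distance(Model1)
which(cd > 4 / nobs(Model1))   # rough rule-of-thumb cutoff for "influential"
```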
There is no good reason to think that the number of pages in the book closest to you when you answered this survey has any bearing on how much you would estimate a cow to weigh!
w3df <- read_csv(here("static/slides/EDUC_847/data", "science_scores.csv"))
sci_model <- lm(sci ~ math_sc, data = w3df)
summary(sci_model)
## 
## Call:
## lm(formula = sci ~ math_sc, data = w3df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -927.3 -362.3 -187.9  182.5 4431.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   43.631    164.487   0.265    0.791    
## math_sc       11.622      2.475   4.696 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653.1 on 498 degrees of freedom
## Multiple R-squared:  0.04241,	Adjusted R-squared:  0.04049 
## F-statistic: 22.06 on 1 and 498 DF,  p-value: 3.428e-06
Can we reject the null model? Yes!
Does a p << 0.05 mean that the model is meaningful? No!
Confidence intervals give us a sense of the range of plausible values for the true coefficient.
For a coefficient, we first establish a confidence level (95%, i.e., a 5% tolerance for false positives), then...
C.I. = b ± t_crit × S.E.(b)
confint(sci_model, level = 0.95)
##                   2.5 %    97.5 %
## (Intercept) -279.543039 366.80601
## math_sc        6.760086  16.48408
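Those `confint()` values can be reproduced by hand from the formula above; a sketch, assuming the `sci_model` object fit earlier:

```r
# b ± t_crit * SE(b) for the math_sc coefficient, at the 95% level
b      <- coef(sci_model)["math_sc"]
se     <- coef(summary(sci_model))["math_sc", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(sci_model))  # two-sided 95% critical value
b + c(-1, 1) * t_crit * se
```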