Today we will use a subset of a dataset that has been peer reviewed.
It is about Portugese student performance, and can be found here:
https://archive.ics.uci.edu/dataset/320/student+performance
#This loads the csv and saves it as a dataframe titled sci_dfpsd <- read_csv(here("data", "PortugeseStudentData.csv"))
glimpse(psd)
## Rows: 649## Columns: 33## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…## $ failures <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…## $ absences <dbl> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…## $ G1 <dbl> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…## $ G2 <dbl> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…## $ G3 <dbl> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…
school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
sex - student's sex (binary: "F" - female or "M" - male)
age - student's age (numeric: from 15 to 22)
address - student's home address type (binary: "U" - urban or "R" - rural)
famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
guardian - student's guardian (nominal: "mother", "father" or "other")
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
psd %>% select(-G1) %>% mutate_if(is_character, as.factor) %>% mutate(passing = if_else(G3>14, 1,0 )) %>% mutate(passing = as.factor(passing) )-> psd
To do this, we will use the tidymodels package
It has a number of the things we'll need to do built into the package.
Training data = a portion of the data used to first build the model Test data = the remainder of the data that is used to evaluate the model.
A standard is to do a 75/25 or 80/20 split.
set.seed(304)splits <- initial_split(psd, prop = 0.8)psd_train <- training(splits)psd_test <- testing(splits)
glimpse(psd_train)
## Rows: 519## Columns: 33## $ school <fct> GP, GP, GP, GP, GP, MS, MS, MS, MS, GP, GP, GP, GP, MS, MS,…## $ sex <fct> M, M, F, M, M, M, F, F, F, M, F, M, M, F, F, F, F, F, F, F,…## $ age <dbl> 16, 16, 18, 15, 16, 16, 18, 16, 16, 16, 17, 17, 18, 18, 17,…## $ address <fct> U, U, R, U, U, U, U, U, U, U, U, R, R, R, R, R, U, U, U, R,…## $ famsize <fct> GT3, LE3, GT3, GT3, GT3, GT3, GT3, GT3, GT3, GT3, LE3, LE3,…## $ Pstatus <fct> T, T, T, A, T, A, A, T, T, T, T, T, T, T, A, T, T, T, T, T,…## $ Medu <dbl> 2, 2, 2, 3, 4, 1, 2, 3, 1, 2, 2, 1, 3, 2, 2, 1, 4, 2, 2, 3,…## $ Fedu <dbl> 2, 1, 2, 4, 4, 2, 4, 1, 2, 3, 4, 1, 3, 2, 1, 1, 4, 3, 2, 1,…## $ Mjob <fct> other, other, at_home, services, health, other, other, othe…## $ Fjob <fct> other, other, other, other, other, other, services, other, …## $ reason <fct> course, course, course, course, course, other, reputation, …## $ guardian <fct> father, mother, mother, mother, mother, mother, father, mot…## $ traveltime <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 4, 1, 2, 2, 2, 1, 1, 1, 2,…## $ studytime <dbl> 2, 2, 4, 2, 1, 3, 2, 1, 3, 1, 2, 2, 2, 1, 2, 1, 2, 4, 2, 2,…## $ failures <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3,…## $ schoolsup <fct> no, no, no, no, no, yes, no, no, no, no, no, no, no, no, no…## $ famsup <fct> no, no, no, yes, yes, no, yes, no, yes, no, no, no, yes, no…## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…## $ activities <fct> no, yes, yes, yes, yes, no, no, yes, no, no, yes, yes, no, …## $ nursery <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…## $ higher <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, y…## $ internet <fct> yes, yes, no, yes, yes, yes, yes, yes, no, yes, yes, no, ye…## $ romantic <fct> no, yes, no, no, no, no, no, no, no, no, yes, yes, yes, yes…## $ famrel <dbl> 4, 4, 4, 5, 3, 4, 2, 3, 1, 5, 4, 5, 4, 5, 5, 3, 4, 4, 5, 5,…## $ freetime <dbl> 3, 2, 4, 4, 4, 4, 3, 1, 3, 3, 3, 3, 3, 5, 3, 5, 4, 5, 4, 4,…## $ goout <dbl> 5, 3, 4, 4, 4, 3, 2, 3, 2, 3, 2, 5, 3, 5, 3, 5, 4, 5, 5, 4,…## $ Dalc <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1,…## $ Walc <dbl> 4, 2, 1, 1, 4, 1, 3, 3, 2, 1, 1, 5, 3, 1, 2, 2, 3, 3, 2, 1,…## $ health <dbl> 4, 5, 4, 1, 5, 5, 1, 1, 4, 3, 5, 5, 5, 3, 2, 4, 5, 2, 5, 5,…## $ absences <dbl> 0, 0, 6, 0, 4, 0, 8, 0, 0, 0, 8, 0, 8, 0, 5, 3, 0, 10, 12, …## $ G2 <dbl> 10, 14, 13, 16, 13, 11, 5, 6, 8, 12, 15, 8, 9, 6, 11, 11, 1…## $ G3 <dbl> 11, 16, 14, 16, 13, 11, 8, 8, 8, 12, 16, 8, 10, 0, 12, 10, …## $ passing <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
glimpse(psd_test)
## Rows: 130## Columns: 33## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…## $ sex <fct> F, F, M, F, M, M, F, M, M, F, M, M, F, F, M, F, M, M, F, M,…## $ age <dbl> 15, 17, 15, 16, 16, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15,…## $ address <fct> U, U, U, U, U, U, R, U, U, R, U, U, U, U, U, R, U, U, U, U,…## $ famsize <fct> GT3, GT3, GT3, GT3, LE3, LE3, GT3, GT3, LE3, GT3, LE3, GT3,…## $ Pstatus <fct> T, A, T, T, T, T, T, T, T, T, T, T, A, T, A, T, T, T, T, T,…## $ Medu <dbl> 4, 4, 4, 3, 4, 4, 2, 2, 3, 2, 4, 2, 4, 1, 4, 2, 4, 4, 2, 2,…## $ Fedu <dbl> 2, 4, 3, 3, 3, 2, 4, 2, 3, 2, 4, 2, 3, 1, 4, 2, 2, 3, 2, 3,…## $ Mjob <fct> health, other, teacher, other, health, teacher, services, o…## $ Fjob <fct> services, teacher, other, other, other, other, health, othe…## $ reason <fct> home, home, course, reputation, home, course, course, home,…## $ guardian <fct> mother, mother, mother, mother, father, mother, mother, mot…## $ traveltime <dbl> 1, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1, 1, 1,…## $ studytime <dbl> 3, 2, 2, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 1, 4, 2, 4, 2, 4, 3,…## $ failures <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ schoolsup <fct> no, yes, no, yes, no, no, yes, no, no, yes, no, yes, no, ye…## $ famsup <fct> yes, yes, yes, yes, no, no, yes, yes, no, yes, yes, yes, ye…## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…## $ activities <fct> yes, no, no, yes, yes, yes, yes, no, yes, yes, no, no, yes,…## $ nursery <fct> yes, yes, yes, yes, yes, yes, yes, yes, no, yes, no, yes, y…## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…## $ internet <fct> yes, no, yes, no, yes, yes, yes, yes, yes, no, yes, yes, ye…## $ romantic <fct> yes, no, no, no, no, no, no, no, no, no, yes, no, no, yes, …## $ famrel <dbl> 3, 4, 5, 5, 3, 4, 4, 4, 5, 4, 5, 5, 4, 5, 1, 4, 3, 4, 5, 5,…## $ freetime <dbl> 2, 1, 4, 3, 1, 5, 3, 2, 3, 3, 4, 4, 3, 5, 3, 1, 3, 3, 2, 3,…## $ goout <dbl> 2, 4, 3, 2, 3, 1, 2, 2, 2, 1, 3, 1, 2, 5, 3, 3, 3, 3, 3, 2,…## $ Dalc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 5, 5, 1, 1, 2, 1, 1,…## $ Walc <dbl> 1, 1, 2, 1, 3, 3, 1, 2, 1, 1, 4, 1, 1, 5, 5, 3, 1, 3, 3, 2,…## $ health <dbl> 5, 1, 3, 4, 5, 5, 5, 5, 2, 2, 5, 1, 1, 5, 3, 4, 3, 5, 3, 5,…## $ absences <dbl> 0, 2, 0, 2, 6, 0, 2, 8, 0, 8, 8, 0, 0, 0, 0, 0, 0, 0, 1, 2,…## $ G2 <dbl> 14, 13, 12, 14, 12, 13, 11, 12, 12, 13, 11, 10, 14, 10, 12,…## $ G3 <dbl> 14, 13, 13, 14, 12, 14, 10, 12, 12, 12, 11, 10, 15, 16, 12,…## $ passing <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
We want to tell it to do logistic regression
logistic_reg()
## Logistic Regression Model Specification (classification)## ## Computational engine: glm
We want to tell it to do logistic regression
lr_mod <- logistic_reg()
lr_fit <- lr_mod %>% fit(passing ~., data = psd_train)lr_fit
## parsnip model object## ## ## Call: stats::glm(formula = passing ~ ., family = stats::binomial, data = data)## ## Coefficients:## (Intercept) schoolMS sexM age ## -569.48467 -1.98699 1.50399 -0.35734 ## addressU famsizeLE3 PstatusT Medu ## -0.31187 0.85947 0.84345 0.28050 ## Fedu Mjobhealth Mjobother Mjobservices ## -0.11303 1.94199 1.50404 2.29007 ## Mjobteacher Fjobhealth Fjobother Fjobservices ## -2.44609 0.42817 0.80783 0.47909 ## Fjobteacher reasonhome reasonother reasonreputation ## -1.49010 -2.40686 -38.77144 -2.54370 ## guardianmother guardianother traveltime studytime ## 0.74722 -1.39980 0.16337 0.02436 ## failures schoolsupyes famsupyes paidyes ## 42.52078 0.47125 -0.48106 35.26516 ## activitiesyes nurseryyes higheryes internetyes ## -0.63605 0.43905 -39.61192 -0.90956 ## romanticyes famrel freetime goout ## 2.28660 -0.06424 0.34633 -0.04004 ## Dalc Walc health absences ## -0.80768 0.12025 -0.30210 -0.01185 ## G2 G3 ## 1.08331 41.31342 ## ## Degrees of Freedom: 518 Total (i.e. Null); 477 Residual## Null Deviance: 528.2 ## Residual Deviance: 6.635e-08 AIC: 84
lr_res <- predict(lr_fit, psd_test) %>% bind_cols(predict(lr_fit, psd_test, type = "prob")) %>% bind_cols(psd_test %>% select(passing))lr_res
## # A tibble: 130 × 4## .pred_class .pred_0 .pred_1 passing## <fct> <dbl> <dbl> <fct> ## 1 0 1.00 3.21e- 9 0 ## 2 0 1 2.22e-16 0 ## 3 0 1 2.22e-16 0 ## 4 0 1.00 1.07e- 9 0 ## 5 0 1 2.22e-16 0 ## 6 0 1.00 8.72e-10 0 ## 7 0 1 2.22e-16 0 ## 8 0 1 2.22e-16 0 ## 9 0 1 2.22e-16 0 ## 10 0 1 2.22e-16 0 ## # ℹ 120 more rows
lr_res %>% accuracy(truth = passing, .pred_class)
## # A tibble: 1 × 3## .metric .estimator .estimate## <chr> <chr> <dbl>## 1 accuracy binary 0.969
confusionMatrix(lr_res$.pred_class, lr_res$passing, dnn = c("Predicted", "Actual"), positive = "1")
## Confusion Matrix and Statistics## ## Actual## Predicted 0 1## 0 104 2## 1 2 22## ## Accuracy : 0.9692 ## 95% CI : (0.9231, 0.9916)## No Information Rate : 0.8154 ## P-Value [Acc > NIR] : 1.034e-07 ## ## Kappa : 0.8978 ## ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.9167 ## Specificity : 0.9811 ## Pos Pred Value : 0.9167 ## Neg Pred Value : 0.9811 ## Prevalence : 0.1846 ## Detection Rate : 0.1692 ## Detection Prevalence : 0.1846 ## Balanced Accuracy : 0.9489 ## ## 'Positive' Class : 1 ##
We spent today looking at how to use machine learning to do logistic regression
We can...
Today we will use a subset of a dataset that has been peer reviewed.
It is about Portugese student performance, and can be found here:
https://archive.ics.uci.edu/dataset/320/student+performance
#This loads the csv and saves it as a dataframe titled sci_dfpsd <- read_csv(here("data", "PortugeseStudentData.csv"))
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |