+ - 0:00:00
Notes for current slide
Notes for next slide

EDUC 847

Week 7 - Machine Learning

Eric Brewe
Professor of Physics at Drexel University

26 February 2025, last update: 2025-02-25

1 / 22

Let's start with a complete dataset

Today we will use a subset of a dataset that has been peer reviewed.

It is about Portugese student performance, and can be found here:

https://archive.ics.uci.edu/dataset/320/student+performance

#This loads the csv and saves it as a dataframe titled sci_df
psd <- read_csv(here("data",
"PortugeseStudentData.csv"))
2 / 22

Let's have a look at the data

glimpse(psd)
## Rows: 649
## Columns: 33
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <dbl> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…
## $ G1 <dbl> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…
## $ G2 <dbl> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…
## $ G3 <dbl> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…
3 / 22

Let's figure out some of these variables.

school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)

sex - student's sex (binary: "F" - female or "M" - male)

age - student's age (numeric: from 15 to 22)

address - student's home address type (binary: "U" - urban or "R" - rural)

famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)

Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)

Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

4 / 22

Let's figure out some of these variables.

Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")

Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")

reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")

guardian - student's guardian (nominal: "mother", "father" or "other")

traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

failures - number of past class failures (numeric: n if 1<=n<3, else 4)

5 / 22

Let's figure out some of these variables.

schoolsup - extra educational support (binary: yes or no)

famsup - family educational support (binary: yes or no)

paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

activities - extra-curricular activities (binary: yes or no)

nursery - attended nursery school (binary: yes or no)

higher - wants to take higher education (binary: yes or no)

internet - Internet access at home (binary: yes or no)

romantic - with a romantic relationship (binary: yes or no)

6 / 22

Let's figure out some of these variables.

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

freetime - free time after school (numeric: from 1 - very low to 5 - very high)

goout - going out with friends (numeric: from 1 - very low to 5 - very high)

Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

health - current health status (numeric: from 1 - very bad to 5 - very good)

absences - number of school absences (numeric: from 0 to 93)

7 / 22

these grades are related with the course subject, Math or Portuguese:

G1 - first period grade (numeric: from 0 to 20)

G2 - second period grade (numeric: from 0 to 20)

G3 - final grade (numeric: from 0 to 20, output target)

8 / 22

Let's make characters factors

Machine Learning works best with factors, so we should convert character variables to factors

psd %>%
select(-G1) %>%
mutate_if(is_character, as.factor) %>%
mutate(passing = if_else(G3>14, 1,0 )) %>%
mutate(passing = as.factor(passing) )-> psd
9 / 22

Let's talk about Machine Learning

In many ways it accomplishes the same things as typical regression

  • Prediction of whether a student passes (G3) using a mix of numeric, categorical, and logistic variables.
  • Regression typically uses all of the data to generate the model, and then uses the same data to evaluate the model.
    • This is not ideal.

Additional things ML does:

  • Separating data into test and train data.
  • Variable selection
  • Model evaluation
  • lots of additional things.
10 / 22

Let's define our task

We would like to predict final grade (G3)

To do this, we will use the tidymodels package

https://www.tidymodels.org/

It has a number of the things we'll need to do built into the package.

11 / 22

Let's deal with our dataset first

Since we don't want to use all the data to train our model, we need to split the data

Training data = a portion of the data used to first build the model Test data = the remainder of the data that is used to evaluate the model.

A standard is to do a 75/25 or 80/20 split.

set.seed(304)
splits <- initial_split(psd, prop = 0.8)
psd_train <- training(splits)
psd_test <- testing(splits)
12 / 22

Let's look at our splits

Training Data

glimpse(psd_train)
## Rows: 519
## Columns: 33
## $ school <fct> GP, GP, GP, GP, GP, MS, MS, MS, MS, GP, GP, GP, GP, MS, MS,…
## $ sex <fct> M, M, F, M, M, M, F, F, F, M, F, M, M, F, F, F, F, F, F, F,…
## $ age <dbl> 16, 16, 18, 15, 16, 16, 18, 16, 16, 16, 17, 17, 18, 18, 17,…
## $ address <fct> U, U, R, U, U, U, U, U, U, U, U, R, R, R, R, R, U, U, U, R,…
## $ famsize <fct> GT3, LE3, GT3, GT3, GT3, GT3, GT3, GT3, GT3, GT3, LE3, LE3,…
## $ Pstatus <fct> T, T, T, A, T, A, A, T, T, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu <dbl> 2, 2, 2, 3, 4, 1, 2, 3, 1, 2, 2, 1, 3, 2, 2, 1, 4, 2, 2, 3,…
## $ Fedu <dbl> 2, 1, 2, 4, 4, 2, 4, 1, 2, 3, 4, 1, 3, 2, 1, 1, 4, 3, 2, 1,…
## $ Mjob <fct> other, other, at_home, services, health, other, other, othe…
## $ Fjob <fct> other, other, other, other, other, other, services, other, …
## $ reason <fct> course, course, course, course, course, other, reputation, …
## $ guardian <fct> father, mother, mother, mother, mother, mother, father, mot…
## $ traveltime <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 4, 1, 2, 2, 2, 1, 1, 1, 2,…
## $ studytime <dbl> 2, 2, 4, 2, 1, 3, 2, 1, 3, 1, 2, 2, 2, 1, 2, 1, 2, 4, 2, 2,…
## $ failures <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3,…
## $ schoolsup <fct> no, no, no, no, no, yes, no, no, no, no, no, no, no, no, no…
## $ famsup <fct> no, no, no, yes, yes, no, yes, no, yes, no, no, no, yes, no…
## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ activities <fct> no, yes, yes, yes, yes, no, no, yes, no, no, yes, yes, no, …
## $ nursery <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ higher <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, y…
## $ internet <fct> yes, yes, no, yes, yes, yes, yes, yes, no, yes, yes, no, ye…
## $ romantic <fct> no, yes, no, no, no, no, no, no, no, no, yes, yes, yes, yes…
## $ famrel <dbl> 4, 4, 4, 5, 3, 4, 2, 3, 1, 5, 4, 5, 4, 5, 5, 3, 4, 4, 5, 5,…
## $ freetime <dbl> 3, 2, 4, 4, 4, 4, 3, 1, 3, 3, 3, 3, 3, 5, 3, 5, 4, 5, 4, 4,…
## $ goout <dbl> 5, 3, 4, 4, 4, 3, 2, 3, 2, 3, 2, 5, 3, 5, 3, 5, 4, 5, 5, 4,…
## $ Dalc <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1,…
## $ Walc <dbl> 4, 2, 1, 1, 4, 1, 3, 3, 2, 1, 1, 5, 3, 1, 2, 2, 3, 3, 2, 1,…
## $ health <dbl> 4, 5, 4, 1, 5, 5, 1, 1, 4, 3, 5, 5, 5, 3, 2, 4, 5, 2, 5, 5,…
## $ absences <dbl> 0, 0, 6, 0, 4, 0, 8, 0, 0, 0, 8, 0, 8, 0, 5, 3, 0, 10, 12, …
## $ G2 <dbl> 10, 14, 13, 16, 13, 11, 5, 6, 8, 12, 15, 8, 9, 6, 11, 11, 1…
## $ G3 <dbl> 11, 16, 14, 16, 13, 11, 8, 8, 8, 12, 16, 8, 10, 0, 12, 10, …
## $ passing <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
13 / 22

Let's look at our splits

Testing Data

glimpse(psd_test)
## Rows: 130
## Columns: 33
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex <fct> F, F, M, F, M, M, F, M, M, F, M, M, F, F, M, F, M, M, F, M,…
## $ age <dbl> 15, 17, 15, 16, 16, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15,…
## $ address <fct> U, U, U, U, U, U, R, U, U, R, U, U, U, U, U, R, U, U, U, U,…
## $ famsize <fct> GT3, GT3, GT3, GT3, LE3, LE3, GT3, GT3, LE3, GT3, LE3, GT3,…
## $ Pstatus <fct> T, A, T, T, T, T, T, T, T, T, T, T, A, T, A, T, T, T, T, T,…
## $ Medu <dbl> 4, 4, 4, 3, 4, 4, 2, 2, 3, 2, 4, 2, 4, 1, 4, 2, 4, 4, 2, 2,…
## $ Fedu <dbl> 2, 4, 3, 3, 3, 2, 4, 2, 3, 2, 4, 2, 3, 1, 4, 2, 2, 3, 2, 3,…
## $ Mjob <fct> health, other, teacher, other, health, teacher, services, o…
## $ Fjob <fct> services, teacher, other, other, other, other, health, othe…
## $ reason <fct> home, home, course, reputation, home, course, course, home,…
## $ guardian <fct> mother, mother, mother, mother, father, mother, mother, mot…
## $ traveltime <dbl> 1, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1, 1, 1,…
## $ studytime <dbl> 3, 2, 2, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 1, 4, 2, 4, 2, 4, 3,…
## $ failures <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> no, yes, no, yes, no, no, yes, no, no, yes, no, yes, no, ye…
## $ famsup <fct> yes, yes, yes, yes, no, no, yes, yes, no, yes, yes, yes, ye…
## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ activities <fct> yes, no, no, yes, yes, yes, yes, no, yes, yes, no, no, yes,…
## $ nursery <fct> yes, yes, yes, yes, yes, yes, yes, yes, no, yes, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> yes, no, yes, no, yes, yes, yes, yes, yes, no, yes, yes, ye…
## $ romantic <fct> yes, no, no, no, no, no, no, no, no, no, yes, no, no, yes, …
## $ famrel <dbl> 3, 4, 5, 5, 3, 4, 4, 4, 5, 4, 5, 5, 4, 5, 1, 4, 3, 4, 5, 5,…
## $ freetime <dbl> 2, 1, 4, 3, 1, 5, 3, 2, 3, 3, 4, 4, 3, 5, 3, 1, 3, 3, 2, 3,…
## $ goout <dbl> 2, 4, 3, 2, 3, 1, 2, 2, 2, 1, 3, 1, 2, 5, 3, 3, 3, 3, 3, 2,…
## $ Dalc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 5, 5, 1, 1, 2, 1, 1,…
## $ Walc <dbl> 1, 1, 2, 1, 3, 3, 1, 2, 1, 1, 4, 1, 1, 5, 5, 3, 1, 3, 3, 2,…
## $ health <dbl> 5, 1, 3, 4, 5, 5, 5, 5, 2, 2, 5, 1, 1, 5, 3, 4, 3, 5, 3, 5,…
## $ absences <dbl> 0, 2, 0, 2, 6, 0, 2, 8, 0, 8, 8, 0, 0, 0, 0, 0, 0, 0, 1, 2,…
## $ G2 <dbl> 14, 13, 12, 14, 12, 13, 11, 12, 12, 13, 11, 10, 14, 10, 12,…
## $ G3 <dbl> 14, 13, 13, 14, 12, 14, 10, 12, 12, 12, 11, 10, 15, 16, 12,…
## $ passing <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
14 / 22

Let's talk about options at this point.

Tidymodels splits the ML process into three main chunks, each chunk has options

  • Model Specification
    • Regression, Random Forest, Neural Network...
    • What engine to use (STAN, Keras, OLS)
  • Processing
    • Data splitting
    • Cleaning
  • Modeling
    • Creating a model
    • Tuning model parameters
    • Prediction
    • Evaluation
15 / 22

Let's Specify a Model.

Tidymodels has a distinct approach to specifying a model.

We want to tell it to do logistic regression

logistic_reg()
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
16 / 22

Let's tell it we want to use logistic regression

Tidymodels allows you to choose different approaches

We want to tell it to do logistic regression

lr_mod <-
logistic_reg()
17 / 22

Let's run this...

lr_fit <- lr_mod %>% fit(passing ~., data = psd_train)
lr_fit
## parsnip model object
##
##
## Call: stats::glm(formula = passing ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) schoolMS sexM age
## -569.48467 -1.98699 1.50399 -0.35734
## addressU famsizeLE3 PstatusT Medu
## -0.31187 0.85947 0.84345 0.28050
## Fedu Mjobhealth Mjobother Mjobservices
## -0.11303 1.94199 1.50404 2.29007
## Mjobteacher Fjobhealth Fjobother Fjobservices
## -2.44609 0.42817 0.80783 0.47909
## Fjobteacher reasonhome reasonother reasonreputation
## -1.49010 -2.40686 -38.77144 -2.54370
## guardianmother guardianother traveltime studytime
## 0.74722 -1.39980 0.16337 0.02436
## failures schoolsupyes famsupyes paidyes
## 42.52078 0.47125 -0.48106 35.26516
## activitiesyes nurseryyes higheryes internetyes
## -0.63605 0.43905 -39.61192 -0.90956
## romanticyes famrel freetime goout
## 2.28660 -0.06424 0.34633 -0.04004
## Dalc Walc health absences
## -0.80768 0.12025 -0.30210 -0.01185
## G2 G3
## 1.08331 41.31342
##
## Degrees of Freedom: 518 Total (i.e. Null); 477 Residual
## Null Deviance: 528.2
## Residual Deviance: 6.635e-08 AIC: 84
18 / 22

Let's see how it did.

lr_res <- predict(lr_fit, psd_test) %>%
bind_cols(predict(lr_fit, psd_test, type = "prob")) %>%
bind_cols(psd_test %>% select(passing))
lr_res
## # A tibble: 130 × 4
## .pred_class .pred_0 .pred_1 passing
## <fct> <dbl> <dbl> <fct>
## 1 0 1.00 3.21e- 9 0
## 2 0 1 2.22e-16 0
## 3 0 1 2.22e-16 0
## 4 0 1.00 1.07e- 9 0
## 5 0 1 2.22e-16 0
## 6 0 1.00 8.72e-10 0
## 7 0 1 2.22e-16 0
## 8 0 1 2.22e-16 0
## 9 0 1 2.22e-16 0
## 10 0 1 2.22e-16 0
## # ℹ 120 more rows
19 / 22

Let's check accuracuy

lr_res %>%
accuracy(truth = passing, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.969
20 / 22

Let's check the Confusion

confusionMatrix(lr_res$.pred_class,
lr_res$passing,
dnn = c("Predicted", "Actual"),
positive = "1")
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 104 2
## 1 2 22
##
## Accuracy : 0.9692
## 95% CI : (0.9231, 0.9916)
## No Information Rate : 0.8154
## P-Value [Acc > NIR] : 1.034e-07
##
## Kappa : 0.8978
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9167
## Specificity : 0.9811
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9811
## Prevalence : 0.1846
## Detection Rate : 0.1692
## Detection Prevalence : 0.1846
## Balanced Accuracy : 0.9489
##
## 'Positive' Class : 1
##
21 / 22

Let's review

We spent today looking at how to use machine learning to do logistic regression

We can...

  • create a test/train split
  • use a confusion matrix to evaluate the predictive power of the model
22 / 22

Let's start with a complete dataset

Today we will use a subset of a dataset that has been peer reviewed.

It is about Portugese student performance, and can be found here:

https://archive.ics.uci.edu/dataset/320/student+performance

#This loads the csv and saves it as a dataframe titled sci_df
psd <- read_csv(here("data",
"PortugeseStudentData.csv"))
2 / 22
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow