EDUC 847 Winter 25

Day 1 - Theory, Data, and Plotting

Eric Brewe
Professor of Physics at Drexel University

8 January 2025, last update: 2025-01-17

Introduction to this course, R and plotting recipes

Relationships

The basic idea of this course is to understand relationships.

Specifically, relationships between two (or more) variables.

For Example:

  • Height and weight
  • Temperature and ice cream consumption
  • Average rainfall and happiness

Human relationships

  • Humans are weird and complicated!
  • A relationship that we can find is not absolute.
    • Deterministic relationships: If A then B (ALWAYS)
    • Probabilistic relationships: If A then a tendency to B
    • Rainfall and happiness?

Prediction vs Causation

Two ways to think about relationships...

  1. Knowing about A allows you to know more about B (predictive)
  2. More of A caused more of B (causal)

Why are we learning coding?

Research Reproducibility

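  • One core idea of science is falsifiability: others can only test (and possibly falsify) a result if they can rerun the analysis, and code makes that possible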

What do you need to know about R?

  • R is the programming language.
  • RStudio is the Integrated Development Environment (IDE)
  • Posit.cloud is an online version of the IDE
  • Packages: groups of functions that are developed as open source
  • Base R
    • The group of packages preloaded into R
  • Tidyverse
    • Family of packages that are designed with the theory that programming should be readable by humans.

Let's do this!

How do we start learning R?

  • We need to know something about the different types of data and the ways in which they are stored.

Data types (at least some of them)

  • Logical (T/F, 1/0)
  • Integers (whole numbers)
  • Numeric (numbers with decimal places)
  • Complex (I never use these)
  • Text (called "character" in R; exactly what it sounds like)

Data storage (at least some of them)

  • Vectors = long columns of data (can be any type, but only one type of data)
  • Dataframes = like Excel pages (columns can hold different types of data)
  • Matrices = like dataframes, but every cell must hold the same type of data (rows and columns can be named)
    • Adjacency Matrices are of type matrix.
  • Lists = the junkdrawer, can hold any type of data (including dataframes or matrices)
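
A quick way to get a feel for these types and containers is to build one of each in the console. The sketch below uses made-up values (all object names are illustrative and not part of the course data) and `class()` to ask R what it thinks each object is.

```r
# --- Data types (values are made up for illustration) ---
is_morning <- TRUE         # logical
n_pages    <- 273L         # integer (the L suffix asks for a whole number)
cow_weight <- 1256.5       # numeric
dessert    <- "ice cream"  # character (text)

class(is_morning)
class(n_pages)
class(cow_weight)
class(dessert)

# --- Data storage ---
pages <- c(273, 32, 383)                 # vector: one type only
mixed <- c(1, "a")                       # mixing types forces everything to text
class(mixed)

df <- data.frame(id    = c(166, 318, 626),
                 pages = pages,
                 group = c("a", "b", "a"))  # dataframe: columns may differ in type

m <- matrix(1:4, nrow = 2)               # matrix: every cell has the same type

junk <- list(pages, df, m)               # list: can hold anything
length(junk)
```

Note how `mixed` ends up as character: a vector can only hold one type, so R quietly converts the number to text.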

Let's Take a Tour of the RStudio IDE

Let's Install Some Packages

You'll only need to do this once.

In Console

To install the tidyverse package...

install.packages("tidyverse")

Repeat this with the following packages:

glmnet janitor haven here
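
If you are not sure which of these are already installed, a small sketch like the following can tell you. It only uses base R; the object names `pkgs` and `to_install` are illustrative, and "here" is included because it is loaded later in these slides. The actual install line is left commented out so nothing is downloaded until you choose to run it.

```r
# Packages used in this course
pkgs <- c("tidyverse", "glmnet", "janitor", "haven", "here")

# Compare against what is already installed on this machine
to_install <- setdiff(pkgs, rownames(installed.packages()))
to_install

# Uncomment to install whatever is missing (install.packages accepts a vector of names):
# install.packages(to_install)
```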

Let's Load Some Packages

You'll need to do this every time you restart R.

In Console

To load the tidyverse and the other packages...

library(tidyverse) #tools for cleaning and visualizing data
library(janitor) #package for cleaning data
library(glmnet) #tools for doing regression analyses
library(here) #tools for project-based workflow
library(haven) #tools for reading in data from spss

Once you have done this, you will want to include a code chunk with all of your libraries in your R Markdown document so that you don't have to type this every time.

Let's get data into R.

In this workshop, you already have the data from the survey, to make things as easy as possible. This is not typical. Usually you will have to get your data into R yourself, and how you do that depends on where you are running R (posit.cloud or a local machine).

If you have loaded the package "here" this should just work. If you have not loaded the "here" package you will need to set the working directory.

Again, you will want to include this as a code chunk in your RMD file.

#This loads the csv and saves it as a dataframe titled week_1_data
week_1_data <- read_csv(here("static/slides/EDUC_847/data",
"EDUC_847_Survey.csv"))

Let's have a look at the data

glimpse(week_1_data)
## Rows: 10
## Columns: 24
## $ StartDate <chr> "Start Date", "{\"ImportId\":\"startDate\",\"t…
## $ EndDate <chr> "End Date", "{\"ImportId\":\"endDate\",\"timeZ…
## $ Status <chr> "Response Type", "{\"ImportId\":\"status\"}", …
## $ IPAddress <chr> "IP Address", "{\"ImportId\":\"ipAddress\"}", …
## $ Progress <chr> "Progress", "{\"ImportId\":\"progress\"}", "10…
## $ `Duration (in seconds)` <chr> "Duration (in seconds)", "{\"ImportId\":\"dura…
## $ Finished <chr> "Finished", "{\"ImportId\":\"finished\"}", "1"…
## $ RecordedDate <chr> "Recorded Date", "{\"ImportId\":\"recordedDate…
## $ ResponseId <chr> "Response ID", "{\"ImportId\":\"_recordId\"}",…
## $ RecipientLastName <chr> "Recipient Last Name", "{\"ImportId\":\"recipi…
## $ RecipientFirstName <chr> "Recipient First Name", "{\"ImportId\":\"recip…
## $ RecipientEmail <chr> "Recipient Email", "{\"ImportId\":\"recipientE…
## $ ExternalReference <chr> "External Data Reference", "{\"ImportId\":\"ex…
## $ LocationLatitude <chr> "Location Latitude", "{\"ImportId\":\"location…
## $ LocationLongitude <chr> "Location Longitude", "{\"ImportId\":\"locatio…
## $ DistributionChannel <chr> "Distribution Channel", "{\"ImportId\":\"distr…
## $ UserLanguage <chr> "User Language", "{\"ImportId\":\"userLanguage…
## $ Q2 <chr> "I am more of a", "{\"ImportId\":\"QID4\"}", N…
## $ Q3 <chr> "My favorite dessert is", "{\"ImportId\":\"QID…
## $ Q4 <chr> "Which of the following beverages have you had…
## $ Q5 <chr> "How many pages were in the book that is physi…
## $ Q6 <chr> "Estimate the weight of this cow. Please enter…
## $ Q10 <chr> "I have experience with the following statisti…
## $ ID <dbl> 523, 876, 490, 166, 318, 626, 655, 182, 219, 3…

Let's start cleaning up.

First, we don't need most of that data

There is a ton of data there that doesn't make sense for us to keep around.

We will use the '%>%' (pipe) operator and the verb select

week_1_data %>%
select(ID,Q2:Q10) -> w1df
glimpse(w1df)
## Rows: 10
## Columns: 7
## $ ID <dbl> 523, 876, 490, 166, 318, 626, 655, 182, 219, 305
## $ Q2 <chr> "I am more of a", "{\"ImportId\":\"QID4\"}", NA, "1", "2", "1", "1…
## $ Q3 <chr> "My favorite dessert is", "{\"ImportId\":\"QID2\"}", NA, "1", "1",…
## $ Q4 <chr> "Which of the following beverages have you had this week (select a…
## $ Q5 <chr> "How many pages were in the book that is physically closest to you…
## $ Q6 <chr> "Estimate the weight of this cow. Please enter a number.", "{\"Imp…
## $ Q10 <chr> "I have experience with the following statistical tools (choose al…

Note that the data are stored as text (<chr>), not numbers.

Let's keep cleaning up.

Notice that the first two rows are just the question text and Qualtrics import metadata, and the third row looks like an empty response (NA). Let's get rid of all three.

w1df %>%
  filter(row_number() > 3) -> w1df
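
To see why those rows have to go before any conversion to numbers, here is a small base R sketch. It uses the first seven Q2 values as they appear in the glimpse output above; the object names are illustrative.

```r
# First seven Q2 values as they appear in the raw export
raw_q2 <- c("I am more of a", "{\"ImportId\":\"QID4\"}", NA, "1", "2", "1", "1")

# Converting everything at once turns the non-numeric rows into NA
# (R also raises a warning, silenced here to keep the output tidy)
all_rows <- suppressWarnings(as.numeric(raw_q2))
all_rows

# Dropping the first three entries, like filter(row_number() > 3), leaves clean numbers
kept <- as.numeric(raw_q2[-(1:3)])
kept
```

Removing the header rows first means the later `as.numeric()` calls run without warnings and without stray NA values.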

Let's start visualizing the data

For categorical data, you might want to get some counts.

Here is code to do this for the question about morning or night person.

w1df %>%
select(Q2) %>%
group_by(Q2) %>%
tally()
## # A tibble: 2 × 2
## Q2 n
## <chr> <int>
## 1 1 4
## 2 2 3
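
As a possible shortcut, dplyr's `count()` combines the `group_by()` and `tally()` steps. The sketch below uses a small stand-in tibble called `toy` (an illustrative name) holding the seven Q2 codes from the survey, so it runs without the course data file.

```r
library(dplyr)

# Stand-in for the Q2 column (codes stored as text, as in the raw export)
toy <- tibble(Q2 = c("1", "2", "1", "1", "1", "2", "2"))

# group_by() + tally(), as on the slide
long_way <- toy %>%
  group_by(Q2) %>%
  tally()

# count() does both steps in one call
short_way <- toy %>%
  count(Q2)

short_way
```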

Let's start visualizing the data

You might want to see the distribution of our data.

Histograms are very important for regression!

Here is code to do this for the data from above.

w1df %>%
mutate(Q2 = as.numeric(Q2)) -> w1df
hist(w1df$Q2)

That's kind of ugly! Let's make it look better.

Let's make the histogram look better

w1df %>%
select(Q2) %>%
group_by(Q2) %>%
tally()
## # A tibble: 2 × 2
## Q2 n
## <dbl> <int>
## 1 1 4
## 2 2 3
w1df %>%
select(Q2) %>%
ggplot(aes(y = Q2)) +
geom_bar()

That's a little better?

Let's make the histogram even better

w1df %>%
select(Q2) %>%
ggplot(aes(y = Q2)) +
geom_bar() +
theme_minimal() +
ggtitle("Histogram of morning/night")

Good enough for now...you are always welcome to continue adding layers to your plot to make it look even better.
See here for more: https://ggplot2.tidyverse.org/

Let's look at some quantitative data

First, let's summarize the book data.

w1df %>%
select(Q5) %>%
mutate(Q5 = as.numeric(Q5)) %>%
summarize(Ave = mean(Q5, na.rm = TRUE),
SD = sd(Q5, na.rm = TRUE))
## # A tibble: 1 × 2
## Ave SD
## <dbl> <dbl>
## 1 292. 292.
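
It is a coincidence that the average and the standard deviation both round to 292 here. A minimal base R sketch, using the seven Q5 responses from this survey typed in by hand (the object names are illustrative), shows where each number comes from.

```r
# The seven Q5 responses (pages in the closest book)
pages <- c(273, 32, 383, 352, 869, 56, 81)
n <- length(pages)

# Mean: add everything up and divide by how many there are
ave <- sum(pages) / n

# Standard deviation: square root of the summed squared distances
# from the mean, divided by n - 1
sd_by_hand <- sqrt(sum((pages - ave)^2) / (n - 1))

ave
sd_by_hand

# Same answers as the built-in functions
mean(pages)
sd(pages)
```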

Let's tidy up a bit

I find it annoying to have to convert the data to numeric all the time.

w1df %>%
mutate(Q3 = as.numeric(Q3)) %>%
mutate(Q5 = as.numeric(Q5)) %>%
mutate(Q6 = as.numeric(Q6)) -> w1df
glimpse(w1df)
## Rows: 7
## Columns: 7
## $ ID <dbl> 166, 318, 626, 655, 182, 219, 305
## $ Q2 <dbl> 1, 2, 1, 1, 1, 2, 2
## $ Q3 <dbl> 1, 1, 3, 1, 1, 3, 2
## $ Q4 <chr> "2,4", "2,4,7", "1,2,4", "2,3,4,5", "3,4,5", "2,3,4,5,8", "4"
## $ Q5 <dbl> 273, 32, 383, 352, 869, 56, 81
## $ Q6 <dbl> 4389, 1200, 850, 1500, 1700, 1400, 1256
## $ Q10 <chr> NA, NA, NA, NA, NA, NA, NA
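
If the repeated `mutate()` calls feel long, one alternative is dplyr's `across()`, which applies the same function to several columns at once. This is only a sketch on a small stand-in tibble called `toy` (an illustrative name) built from the first three survey rows, not a change to the workflow above.

```r
library(dplyr)

# Stand-in data: everything starts out as text, as in the raw export
toy <- tibble(
  Q3 = c("1", "1", "3"),
  Q4 = c("2,4", "2,4,7", "1,2,4"),
  Q5 = c("273", "32", "383"),
  Q6 = c("4389", "1200", "850")
)

# Convert Q3, Q5, and Q6 in one step; Q4 is left alone
toy_numeric <- toy %>%
  mutate(across(c(Q3, Q5, Q6), as.numeric))

glimpse(toy_numeric)
```

Q4 stays as text because it holds several choices in one cell ("2,4"), which cannot be turned into a single number.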

Let's investigate groups

Do morning people or night owls have longer books nearby?

w1df %>%
select(Q2, Q5) %>%
group_by(Q2) %>%
summarize(Ave = mean(Q5),
SD = sd(Q5))
## # A tibble: 2 × 3
## Q2 Ave SD
## <dbl> <dbl> <dbl>
## 1 1 469. 270.
## 2 2 56.3 24.5

We might want to use a boxplot to display these data

w1df %>%
select(Q2, Q5) %>%
mutate(Q2 = as.character(Q2)) %>% #Just this once!
ggplot(., aes(x = Q2, y = Q5)) +
geom_boxplot()
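
A boxplot is built around the median and quartiles of each group rather than the mean and SD. As a rough check on what the plot should show, the base R sketch below computes those values from the seven Q2 and Q5 responses typed in by hand (the object names are illustrative).

```r
pages <- c(273, 32, 383, 352, 869, 56, 81)  # Q5: pages in the closest book
group <- c(1, 2, 1, 1, 1, 2, 2)             # Q2: the two response codes

# Minimum, quartiles, and maximum of pages within each group
tapply(pages, group, quantile)

# Just the medians (the thick line in the middle of each box)
group_medians <- tapply(pages, group, median)
group_medians
```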

Let's look at books vs. weight of cow

Is there a relationship between the length of the closest book and the estimated weight of the cow?

Could do a scatter plot

w1df %>%
select(Q5:Q6) %>%
ggplot(aes(x = Q5, y = Q6)) +
geom_point()

Let's make our scatter plot look better

Could do a scatter plot

w1df %>%
select(Q5:Q6) %>%
ggplot(aes(x = Q5, y = Q6)) +
geom_point() +
theme_minimal() +
labs(x = "Pages",
y = "Weight of Cow",
title = "Kick Ass Scatterplot!")

Let's use some colors

What if morning people show a different relationship between the estimated cow weight and the length of the closest book?

Could do a scatter plot

w1df %>%
  select(Q2, Q5:Q6) %>%
  mutate(Q2 = as.character(Q2)) %>% #Treat Q2 as a category so each group gets its own color
  ggplot(aes(x = Q5, y = Q6, color = Q2)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages",
       y = "Weight of Cow",
       title = "Scatterplot!")

Let's recap

  • We know how to get data into R
  • We know how to modify the data
  • We know how to plot the data
  • We have recipes for plotting
    • histograms
    • boxplots
    • scatterplots

We are ready to go!
