The basic idea of this course is to understand relationships.
Specifically, relationships between two (or more) variables.
For example:
Two ways to think about relationships...
Let's do this!
You'll only need to do this once.
To install the tidyverse package...
install.packages("tidyverse")
Repeat this with the following packages (here is included because it is loaded below):
glmnet janitor haven here
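If you are not sure which of these you already have, you can check before installing. This is a minimal sketch and not part of the course materials: `missing_packages` is a hypothetical helper name, and the package list simply collects the packages named on these slides.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Packages named on these slides ("here" is loaded later, so it is listed too).
course_pkgs <- c("tidyverse", "glmnet", "janitor", "haven", "here")

# Character vector of whatever still needs installing (possibly empty).
to_install <- missing_packages(course_pkgs)
print(to_install)
```

Running `install.packages(to_install)` afterwards installs only what is missing. If `to_install` is empty, there is nothing left to do.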
You'll need to do this every time you restart R.
To load the tidyverse package and the others we will use...
library(tidyverse) #tools for cleaning and visualizing data
library(janitor) #package for cleaning data
library(glmnet) #tools for doing regression analyses
library(here) #tools for project-based workflow
library(haven) #tools for reading in data from SPSS
Once you have done this, you will want to include a code chunk with all of your libraries in your markdown document so that you don't have to type this every time.
In this workshop, you have been given the survey data to make things as easy as possible. However, this is not typical. Usually you will have to get your data into R yourself, and how you do that depends on where you are running R (posit.cloud or a local machine).
If you have loaded the package "here" this should just work. If you have not loaded the "here" package you will need to set the working directory.
Again, you will want to include this as a code chunk in your RMD file.
#This loads the csv and saves it as a dataframe titled week_1_data
week_1_data <- read_csv(here("static/slides/EDUC_847/data", "EDUC_847_Survey.csv"))
glimpse(week_1_data)
## Rows: 10
## Columns: 24
## $ StartDate               <chr> "Start Date", "{\"ImportId\":\"startDate\",\"t…
## $ EndDate                 <chr> "End Date", "{\"ImportId\":\"endDate\",\"timeZ…
## $ Status                  <chr> "Response Type", "{\"ImportId\":\"status\"}", …
## $ IPAddress               <chr> "IP Address", "{\"ImportId\":\"ipAddress\"}", …
## $ Progress                <chr> "Progress", "{\"ImportId\":\"progress\"}", "10…
## $ `Duration (in seconds)` <chr> "Duration (in seconds)", "{\"ImportId\":\"dura…
## $ Finished                <chr> "Finished", "{\"ImportId\":\"finished\"}", "1"…
## $ RecordedDate            <chr> "Recorded Date", "{\"ImportId\":\"recordedDate…
## $ ResponseId              <chr> "Response ID", "{\"ImportId\":\"_recordId\"}",…
## $ RecipientLastName       <chr> "Recipient Last Name", "{\"ImportId\":\"recipi…
## $ RecipientFirstName      <chr> "Recipient First Name", "{\"ImportId\":\"recip…
## $ RecipientEmail          <chr> "Recipient Email", "{\"ImportId\":\"recipientE…
## $ ExternalReference       <chr> "External Data Reference", "{\"ImportId\":\"ex…
## $ LocationLatitude        <chr> "Location Latitude", "{\"ImportId\":\"location…
## $ LocationLongitude       <chr> "Location Longitude", "{\"ImportId\":\"locatio…
## $ DistributionChannel     <chr> "Distribution Channel", "{\"ImportId\":\"distr…
## $ UserLanguage            <chr> "User Language", "{\"ImportId\":\"userLanguage…
## $ Q2                      <chr> "I am more of a", "{\"ImportId\":\"QID4\"}", N…
## $ Q3                      <chr> "My favorite dessert is", "{\"ImportId\":\"QID…
## $ Q4                      <chr> "Which of the following beverages have you had…
## $ Q5                      <chr> "How many pages were in the book that is physi…
## $ Q6                      <chr> "Estimate the weight of this cow. Please enter…
## $ Q10                     <chr> "I have experience with the following statisti…
## $ ID                      <dbl> 523, 876, 490, 166, 318, 626, 655, 182, 219, 3…
There is a ton of data there that doesn't make sense for us to keep around.
We will use the '%>%' (pipe) operator and the verb select to keep only ID and the question columns.
week_1_data %>% select(ID,Q2:Q10) -> w1df
glimpse(w1df)
## Rows: 10
## Columns: 7
## $ ID  <dbl> 523, 876, 490, 166, 318, 626, 655, 182, 219, 305
## $ Q2  <chr> "I am more of a", "{\"ImportId\":\"QID4\"}", NA, "1", "2", "1", "1…
## $ Q3  <chr> "My favorite dessert is", "{\"ImportId\":\"QID2\"}", NA, "1", "1",…
## $ Q4  <chr> "Which of the following beverages have you had this week (select a…
## $ Q5  <chr> "How many pages were in the book that is physically closest to you…
## $ Q6  <chr> "Estimate the weight of this cow. Please enter a number.", "{\"Imp…
## $ Q10 <chr> "I have experience with the following statistical tools (choose al…
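The `->` at the end of the pipeline above is R's right-assignment operator: the result of the pipe flows into the name on the right. The sketch below uses a small made-up tibble (`toy`, with invented values and a hypothetical `Extra` column) to show that it gives the same result as the more common `<-` form.

```r
library(dplyr)

# Toy data standing in for the survey (values are made up for illustration).
toy <- tibble(
  ID    = c(1, 2),
  Q2    = c("1", "2"),
  Q3    = c("3", "1"),
  Extra = c("a", "b")
)

# Right-assignment, as on the slide: the result flows into `kept_right`.
toy %>% select(ID, Q2:Q3) -> kept_right

# Left-assignment: same pipeline, with the name written first.
kept_left <- toy %>% select(ID, Q2:Q3)

# Compare the two results and list the columns that were kept.
identical(kept_right, kept_left)
names(kept_right)
```

Which form you use is a matter of taste. Right-assignment reads in the same direction as the pipe, while left-assignment makes the new object's name easier to spot.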
Note that the answers are stored as text (<chr>), not as numbers.
Notice that the first two rows are not responses: one holds the question text and the other holds import metadata from Qualtrics. The third row has no answers (NA). Let's get rid of all three.
w1df %>% filter(row_number() >3 ) -> w1df
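To see what the row filter does, here is a small sketch on a toy tibble that mimics the first five rows of ID and Q2 shown in the output above. It also shows `slice()` with negative positions as an alternative way to drop rows by position; that alternative is a suggestion, not something the slides use.

```r
library(dplyr)

# Toy stand-in for the export: rows 1-2 are header text, row 3 has no answer.
toy <- tibble(
  ID = c(523, 876, 490, 166, 318),
  Q2 = c("I am more of a", "{\"ImportId\":\"QID4\"}", NA, "1", "2")
)

# Keep everything after the third row, as on the slide.
by_filter <- toy %>% filter(row_number() > 3)

# slice() with negative positions drops the same three rows.
by_slice <- toy %>% slice(-(1:3))

by_filter
by_slice
```

Both versions depend on the junk rows being in positions 1 to 3, so it is worth running `glimpse()` on the raw data first to confirm that.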
Here is code to count the responses to the question about being a morning or night person (Q2).
w1df %>% select(Q2) %>% group_by(Q2) %>% tally()
## # A tibble: 2 × 2
##   Q2        n
##   <chr> <int>
## 1 1         4
## 2 2         3
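The `group_by()` and `tally()` pair above can also be written with dplyr's `count()`, which does both steps at once. The sketch below uses a toy tibble holding the seven Q2 responses that remain after the header rows are removed.

```r
library(dplyr)

# The seven Q2 responses, stored as text as in the survey data.
toy <- tibble(Q2 = c("1", "2", "1", "1", "1", "2", "2"))

# group_by() + tally(), as on the slide.
long_form <- toy %>% group_by(Q2) %>% tally()

# count() does the grouping and the tallying in one step.
short_form <- toy %>% count(Q2)

short_form
```

Both forms return one row per response option with the count in a column named `n`.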
Histograms are very important for regression!
Here is code to do this for the data from above.
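A first attempt fails, because hist() expects a numeric vector rather than a data frame and Q2 is still stored as text:
w1df %>% hist(Q2) #this throws an error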
w1df %>% mutate(Q2 = as.numeric(Q2)) -> w1df
hist(w1df$Q2)
That's kind of ugly! Let's make it look better.
w1df %>% select(Q2) %>% group_by(Q2) %>% tally()
## # A tibble: 2 × 2
##      Q2     n
##   <dbl> <int>
## 1     1     4
## 2     2     3
w1df %>% select(Q2) %>% ggplot(aes(y = Q2)) + geom_bar()
That's a little better?
w1df %>%
  select(Q2) %>%
  ggplot(aes(y = Q2)) +
  geom_bar() +
  theme_minimal() +
  ggtitle("Histogram of morning/night")
Good enough for now. You are always welcome to keep adding layers to your plot to make it look even better.
See here for more: https://ggplot2.tidyverse.org/
w1df %>%
  select(Q5) %>%
  mutate(Q5 = as.numeric(Q5)) %>%
  summarize(Ave = mean(Q5, na.rm = TRUE), SD = sd(Q5, na.rm = TRUE))
## # A tibble: 1 × 2
##     Ave    SD
##   <dbl> <dbl>
## 1  292.  292.
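The mean and the standard deviation both round to 292 here. That is a coincidence in this small sample, not a typo. The base R sketch below recomputes both from the seven Q5 values as they appear in the glimpse() output on these slides, writing out the sample standard deviation formula by hand. The variable names are only for illustration.

```r
# Q5 responses (pages in the nearest book), as displayed by glimpse().
pages <- c(273, 32, 383, 352, 869, 56, 81)

# The mean is the sum divided by the number of responses.
ave <- mean(pages)

# sd() uses the sample formula: summed squared deviations divided by n - 1.
sd_by_hand <- sqrt(sum((pages - ave)^2) / (length(pages) - 1))

round(ave)
round(sd_by_hand)

# The hand calculation agrees with the built-in function.
all.equal(sd_by_hand, sd(pages))
```

A standard deviation as large as the mean is a hint that the responses are very spread out, which the plots later in the slides make visible.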
I find it annoying to have to convert the data to numeric all the time, so let's convert the columns once and save the result.
w1df %>%
  mutate(Q3 = as.numeric(Q3)) %>%
  mutate(Q5 = as.numeric(Q5)) %>%
  mutate(Q6 = as.numeric(Q6)) -> w1df
glimpse(w1df)
## Rows: 7
## Columns: 7
## $ ID  <dbl> 166, 318, 626, 655, 182, 219, 305
## $ Q2  <dbl> 1, 2, 1, 1, 1, 2, 2
## $ Q3  <dbl> 1, 1, 3, 1, 1, 3, 2
## $ Q4  <chr> "2,4", "2,4,7", "1,2,4", "2,3,4,5", "3,4,5", "2,3,4,5,8", "4"
## $ Q5  <dbl> 273, 32, 383, 352, 869, 56, 81
## $ Q6  <dbl> 4389, 1200, 850, 1500, 1700, 1400, 1256
## $ Q10 <chr> NA, NA, NA, NA, NA, NA, NA
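The three separate mutate() calls above can also be collapsed into one with dplyr's `across()`, which applies the same function to several columns. This is a sketch on a three-row toy tibble in the same shape as the survey data, not the code used on the slides.

```r
library(dplyr)

# Toy rows in the same shape as the survey: numbers stored as text.
toy <- tibble(
  Q3 = c("1", "1", "3"),
  Q5 = c("273", "32", "383"),
  Q6 = c("4389", "1200", "850")
)

# across() applies as.numeric to each listed column in a single mutate() call.
toy_num <- toy %>% mutate(across(c(Q3, Q5, Q6), as.numeric))

glimpse(toy_num)
```

Columns that hold several answers in one cell, such as Q4 with values like "2,4", should stay as text, because `as.numeric()` would turn them into NA.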
w1df %>% select(Q2, Q5) %>% group_by(Q2) %>% summarize(Ave = mean(Q5), SD = sd(Q5))
## # A tibble: 2 × 3
##      Q2   Ave    SD
##   <dbl> <dbl> <dbl>
## 1     1 469.  270. 
## 2     2  56.3  24.5
w1df %>%
  select(Q2, Q5) %>%
  mutate(Q2 = as.character(Q2)) %>% #Just this once!
  ggplot(., aes(x = Q2, y = Q5)) +
  geom_boxplot()
w1df %>% select(Q5:Q6) %>% ggplot(aes(x = Q5, y = Q6)) + geom_point()
w1df %>%
  select(Q5:Q6) %>%
  ggplot(aes(x = Q5, y = Q6)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages", y = "Weight of Cow", title = "Kick Ass Scatterplot!")
What if morning people show a different relationship between their estimates of the cow's weight and the length of the closest book?
w1df %>%
  select(Q2,Q5:Q6) %>%
  ggplot(aes(x = Q5, y = Q6, color = Q2)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages", y = "Weight of Cow", title = "Scatterplot!")
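Because Q2 is stored as a number at this point, ggplot colors the points along a continuous gradient. Wrapping it in `factor()` tells ggplot to treat the two codes as separate groups. The sketch below rebuilds the three columns from the values shown in the glimpse() output so that it runs on its own. The legend is labeled "Q2" because the slides do not say which code means morning and which means night.

```r
library(dplyr)
library(ggplot2)

# Values copied from the glimpse() output on these slides.
toy <- tibble(
  Q2 = c(1, 2, 1, 1, 1, 2, 2),
  Q5 = c(273, 32, 383, 352, 869, 56, 81),
  Q6 = c(4389, 1200, 850, 1500, 1700, 1400, 1256)
)

# factor(Q2) makes ggplot treat the codes as categories, not a number line.
p <- toy %>%
  ggplot(aes(x = Q5, y = Q6, color = factor(Q2))) +
  geom_point() +
  theme_minimal() +
  labs(x = "Pages", y = "Weight of Cow", color = "Q2")

p
```

With only seven responses, any difference between the groups should be read as a prompt for more data, not as a finding.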
We are ready to go!