EDUC 847
Welcome to EDUC 847
The Plan
The main idea is to spend time learning about different aspects of regression analysis while focusing on actually doing the analyses.
Topics | Date |
---|---|
Intro to R and Plotting | 8 January |
Bivariate Linear Regression | 15 January |
Assumption Checking | 22 January |
Multiple Regression | 29 January |
Model Selection and Evaluation | 5 February |
Written Assignment #1 Reviews | 12 February |
Logistic Regression | 19 February |
Bayesian Regression | 26 February |
Machine Learning | 5 March |
Review and Workshop | 12 March |
No Class - Final Projects Due | 19 March |
What might you learn?
There are lots of things that I hope you will learn: to use R tools to look at regression in a variety of forms, to read code, to write code, and to find that you are not afraid to analyze data with R (and actually like it). Here are the things I think you should be able to do after participating in each week's session:
Week One
- Import Data into R
- Clean data and prepare data for Regression
- Organize data for Regression
- Visualize data
Week Two
- Bivariate Regression
- Residuals
- Ordinary Least Squares
- R²
- Interpreting Regression Output
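As a minimal sketch of how the Week Two ideas fit together, here is a bivariate regression in base R. It assumes R's built-in `mtcars` data, with `mpg` and `wt` as purely illustrative variables; none of this is course data.

```r
# Bivariate linear regression on R's built-in mtcars data
# (mtcars, mpg, and wt are illustrative choices, not course data).
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals: observed values minus fitted values.
res <- residuals(fit)

# Ordinary least squares picks the intercept and slope that
# minimize the residual sum of squares.
rss <- sum(res^2)

# R-squared: the share of the variance in mpg explained by wt.
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared <- 1 - rss / tss

# Coefficients, standard errors, t-tests, and R-squared in one printout.
summary(fit)
```

The hand-computed `r_squared` should agree with the "Multiple R-squared" value that `summary(fit)` reports, which is a handy check when learning to read regression output.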
Week Three
- Model Assumption Checking
- Normality of residuals, linearity, homogeneity of variance, no bad outliers
- Hypothesis Testing for the Linear Model
- Hypothesis Testing for the coefficients
- Effect Sizes for the coefficients
Week Four
- transform data when it is not normally distributed
- Log-Transform
- Z-scores
- visualize missingness in our dataset
- use multiple imputation with chained equations.
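The transformations in Week Four can be sketched in a few lines of base R. This assumes the built-in `airquality` data as a stand-in for course data; multiple imputation is left out because it typically relies on an add-on package.

```r
# airquality ships with R and contains missing values;
# it stands in for course data here.
ozone <- airquality$Ozone

# Count missing values per column, a first look at missingness.
missing_counts <- colSums(is.na(airquality))

# Log-transform a right-skewed, strictly positive variable.
log_ozone <- log(ozone)

# Z-scores: center on the mean and scale by the standard deviation,
# ignoring missing values when computing both.
z_ozone <- (ozone - mean(ozone, na.rm = TRUE)) / sd(ozone, na.rm = TRUE)

missing_counts
summary(z_ozone)
```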
Assignment One
Week Five
- run a multiple regression
- interpret a multiple regression
- consider categorical variables
- dummy code for categorical variables
- visualize regression coefficients
- handle interaction terms
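To make the Week Five items concrete, here is a rough sketch of a multiple regression with a categorical predictor and an interaction term. It again assumes the built-in `mtcars` data; treating `cyl` as a factor is an illustrative choice, and `cars` is just a hypothetical name for the modified copy.

```r
# Treat the number of cylinders as a categorical predictor.
cars <- transform(mtcars, cyl = factor(cyl))

# lm() dummy codes factors automatically; the first level
# (4 cylinders) becomes the reference category.
fit_main <- lm(mpg ~ wt + cyl, data = cars)

# Inspect the dummy columns R builds behind the scenes.
head(model.matrix(fit_main))

# wt * cyl expands to wt + cyl + wt:cyl, so the slope for wt
# is allowed to differ between cylinder groups.
fit_int <- lm(mpg ~ wt * cyl, data = cars)

# F-test: does adding the interaction improve on the main-effects model?
anova(fit_main, fit_int)
```

Each `cyl6` and `cyl8` coefficient is read as a difference from the 4-cylinder reference group, holding weight constant.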
Week Six
- create a logistic regression model
- interpret logistic regression coefficients
- use a confusion matrix to evaluate the predictive power of the model
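A minimal sketch of the Week Six workflow, assuming the built-in `mtcars` data with transmission type (`am`) as an illustrative binary outcome and 0.5 as an arbitrary classification cutoff:

```r
# Predict transmission type (am: 0 = automatic, 1 = manual) from weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
odds_ratios <- exp(coef(fit))

# Predicted probabilities, turned into class labels at a 0.5 cutoff.
pred_prob <- predict(fit, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix: predicted classes against observed classes.
actual <- factor(mtcars$am, levels = c(0, 1))
conf_mat <- table(predicted = pred_class, actual = actual)

# Accuracy: the share of cases on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

odds_ratios
conf_mat
accuracy
```

Because the model is evaluated on the same rows it was fit to, this accuracy is likely to be optimistic.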
Week Seven
- create a logistic regression model using machine learning
- create a train/test split
- use a confusion matrix to evaluate the predictive power of the model
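The train/test idea can be sketched in base R as follows. This assumes the built-in `mtcars` data, a 70/30 split, and an arbitrary seed; with so few rows the result is only illustrative.

```r
# Fix the random seed so the split is reproducible (847 is arbitrary).
set.seed(847)

# Randomly assign about 70% of rows to training; the rest are held out.
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Fit on the training rows only.
fit <- glm(am ~ wt, data = train, family = binomial)

# Predict on rows the model has never seen.
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- factor(ifelse(test_prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix on the held-out test set.
conf_mat <- table(predicted = test_pred,
                  actual = factor(test$am, levels = c(0, 1)))
conf_mat
```

Evaluating on held-out rows gives a more honest picture of predictive power than scoring the rows used for fitting.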
Week Eight
- calculate Bayes factors
- use Bayesian regression to compare models
- interpret Bayes factors
- use Bayes factors to determine which variables to include (and which to drop)
What is beyond the scope?
While I would love to say you’d be able to master regression in R in the short span of one quarter, you will not. I’ve spent a decade learning it, and although I am proficient, I am still constantly learning.
Similarly, R is a tool. You are the expert in your research; R can help you uncover elements of your data. I don’t know how to analyze your data, but learning R will empower you to analyze it yourself.
Why R?
I began learning R because I wanted something that worked across different platforms. Then I realized that the Open Source movement was particularly interesting to me. But those aren’t the main reasons. I have come to see R as crucial for reproducibility. If I write code to analyze a dataset, I can re-run that code and get the same results. I can share the code with my publications, so someone else can check my work. This is critical to doing science. R allows me to do this in a way that Excel doesn’t.
Certainly other languages (Python, Julia, etc.) also allow for this, but I have stuck with R because it has a rich community that is supportive of learning and growing.
Why Tidyverse?
Open source things tend to be a bit tribal - within R one debate is Base R vs. Tidyverse. This is a debate about how to teach R, and I have decided to use the Tidyverse. It might not ever matter to you, but I want you to know that I have thought about this, and my answer is to use Tidyverse for the following reasons:
- Readability - Code should be readable by humans
- Coherence with my practice - I generally try to use the Tidyverse approach in my work, so I feel like I can best help by using these tools
- Coherence with themselves - These tools are organized in a way that helps them to be internally coherent
- Support - There are loads of books and websites that are published open-source, and so you can typically find the help you need.
Resources for Learning R
Learning R is not easy. As I have learned, I have tried to maintain a list of the different resources that I have drawn on.
The Basics
- Install R and RStudio
- Introduction to R (using Base R) - R Programming for Data Science
- My all-time favorite book for learning statistics and R is Learning Statistics with R
- Danielle Navarro also has a course, Robust Tools, that gives a sense of the breadth of the tools that R provides.
- These two guides, The R Guide and R for Beginners
A bit more advanced
- Hadley Wickham’s book, R For Data Science
- Project Oriented Workflows
- WTF R - Jenny Bryan is great at explaining things about R, so I frequently find myself looking at her books.
- Cheat Sheets - I can never remember all the commands.
- Happy Git With R - Eventually you’ll need to use Git, and this is essential reading for making that transition.
- RMarkdown - This allows you to write reproducible code and integrate R into reporting and writing. This is how I do all my work.
- DataCarpentry runs workshops about data analysis that are often oriented to different research specialties.
- Social Science Data Carpentry has a social science oriented curriculum that looks great.
- The Effect - A really great book that approaches data analysis from a causal inference perspective, with code in R, Python, and Stata.
- Statistics and Data Visualization Using R: The Art and Practice of Data Analysis. A book with data sets and R scripts embedded.