+ - 0:00:00
Notes for current slide
Notes for next slide

PEER Advanced Field School 2023

Day 1 - Theory, Data, and Plotting

Eric Brewe
Professor of Physics at Drexel University

15 June 2023, last update: 2023-06-19

1 / 48

Foundations of Network Analysis

2 / 48

What is a network?

  • Collection of object-like things that are connected.
    • Nodes/actors = Object-like things (Nouns)
      • Students in a class, Words in a novel, Banks...
      • Nodes can have attributes
        • Gender,
        • Word-type,
        • Market capitialization

NetworkImage

3 / 48

What is a network?

  • Collection of object-like things that are connected.
    • Ties/Links/Edges = Connections between nodes (Verbs)
      • Talked to each other, Are neigbors, Lend money, Sent text message...
        • Directional
        • Multiplex
        • Weighted

NetworkImage

4 / 48

Network Analysis is for the analysis of relational data

There are four basic assumptions:

  1. Nodes and interactions are interdependent*
  2. Edges allow flow between nodes
  3. Network models on indiviuals both constrain and provide opportunity for action
  4. Network models conceptualize structure as representation of lasting patterns of relations between actors

* Violates basic assumption of inferential statistics

Wasserman, S., Faust, K. (1994). Social network analysis: Methods and applications (Vol. 8). Cambridge university press.

5 / 48

What can we do with it?

Ego-Level Analyses

  • What can we know about the network of one person?
    • Ego density
    • Number of neighbors
    • Number of connected neighbors

6 / 48

What can we do with it?

Node-Level Analyses

  • What can we know about the position of people in a network?
    • Degree (In/Out/Total)
    • Geodesic Distance (Kevin Bacon)
    • PageRank
    • Target Entropy

7 / 48

What can we do with it?

Whole Network Analyses

  • What can we say about a whole network?
    • Density, Average path length, Giant component
    • Clustering
    • Homophily
    • Modeling
      • Block models
      • Small worldness

8 / 48

Historical Foundations

Joseph Moreno & Helen Hall Jennings (1932)

  • Established foundations of SNA

Quantitative Sociology/Anthropology

  • Davis Southern Women's Club (1941)
  • Small World Problem (1967)
  • Zachary's Karate Club (1977)

Seminal Articles

  • Milgram, Stanley "The small world problem" Psychology Today 2:1, (1967)
  • Grannovetter, Mark S. "The strength of weak ties" American Journal of Sociology (1973)

9 / 48

Modern Foundations of Network Analysis

Sociophysics (1990s)

  • Graph theory
  • Information theory
  • Computing
  • Used to study
    • Internet
    • Power grid
    • Transportation networks

Seminal Articles

  • Watts & Strogratz "Collective dynamics of small world networks" Nature (1998)
  • Page, Brin, Motwani, & Winograd "The PageRank citation ranking: Bringing order to the web" Stanford InfoLab (1999)
10 / 48

Important Takeaways from History

Two main camps

11 / 48

Important Takeaways from History

Two main camps

Statistical -> hypothesis testing

11 / 48

Important Takeaways from History

Two main camps

Statistical -> hypothesis testing

Graph theoretic -> network models and simulation

11 / 48

Important Takeaways from History

Two main camps

Statistical -> hypothesis testing

Graph theoretic -> network models and simulation

They often don't agree.

There is often distain.

They have different language, journals, conferences

11 / 48

Network Data in R

Sociomatrix/Adjacency Matrix

## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . 1 1 1 1
## [2,] 1 . . . .
## [3,] 1 . . . .
## [4,] 1 . . . 1
## [5,] 1 . . 1 .
12 / 48

Network Data in R

Edgelist

## [[1]]
## + 4/5 edges from ddd0978:
## [1] 1--2 1--3 1--4 1--5
##
## [[2]]
## + 1/5 edge from ddd0978:
## [1] 1--2
##
## [[3]]
## + 1/5 edge from ddd0978:
## [1] 1--3
##
## [[4]]
## + 2/5 edges from ddd0978:
## [1] 1--4 4--5
##
## [[5]]
## + 2/5 edges from ddd0978:
## [1] 1--5 4--5
13 / 48

What does this mean in terms of learning R?

  • We need to know something about the different types of data!

Data types (at least some of them)

  • Logical (T/F, 1/0)
  • Integers (whole numbers)
  • Numeric (numbers with decimal places)
  • Complex (I never use these)

Data storage (at least some of them)

  • Vectors = long columns of data (can be any type, but only one type of data)
  • Dataframes = like Excel pages (columns can hold different types of data)
  • Matrices = like dataframes, but they have named rows/columns
    • Adjacency Matrices are of type matrix.
  • Lists = the junkdrawer, can hold any type of data (including dataframes or matrices)
    • in igraph, networks are stored as lists.
14 / 48

What you need to know about R?

  • R is the programming language.
  • RStudio is the Integrated Development Environment (IDE)
  • Posit.cloud is an online version of the IDE
  • Packages: groups of functions that are developed as open source
  • Base R
    • The group of packages preloaded into R
  • Tidyverse
    • Family of packages that are designed with the theory that programming should be readable by humans.
  • igraph
    • Package that is very useful for doing network analyses

Lets do this!

15 / 48

Lets Take a Tour of RStudio IDE

16 / 48

Let's Install Some Packages

You'll only need to do this once.

In Console

To install tidyverse package...

install.packages("tidyverse")

Repeat this with the following packages:

igraph tidygraph here ggraph

17 / 48

Let's Load Some Packages

You'll need to do this every time you restart R.

In Console

To load tidyverse package...

library(tidyverse) #tools for cleaning data
library(igraph) #package for doing network analysis
library(tidygraph) #tools for doing tidy networks
library(here) #tools for project-based workflow
library(ggraph) #plotting tools for networks
18 / 48

Let's Load Some Packages

You'll need to do this every time you restart R.

In Console

To load tidyverse package...

library(tidyverse) #tools for cleaning data
library(igraph) #package for doing network analysis
library(tidygraph) #tools for doing tidy networks
library(here) #tools for project-based workflow
library(ggraph) #plotting tools for networks

Once you have done this, you will want to put include a code chunk with all of your libraries into your markdown document so that you don't have to type this every time.

18 / 48

Let's get data into R.

In this workshop, you have the data from the survey to make things as easy as possible. However, this is not typical. You will have to find ways to get your data into R but that is sort of different depending on where you are running R (posit.cloud or a local machine)

If you have loaded the package "here" this should just work. If you have not loaded the "here" package you will need to set the working directory.

Again, you will want to include this as a code chunk in your RMD file.

#This loads the csv and saves it as a dataframe titled WorkshopData
WorkshopData <- read_csv(here("data", "AnonSurveyData.csv"))
19 / 48

Let's have a look at the data

glimpse(WorkshopData)
## Rows: 34
## Columns: 18
## $ ID <dbl> 5106, 6633, 7599, 4425, 2495, 6355, 8810, 3877…
## $ StartDate <dttm> 2020-07-30 12:15:20, 2020-07-30 12:18:50, 202…
## $ EndDate <dttm> 2020-07-30 12:18:59, 2020-07-30 12:20:21, 202…
## $ Status <chr> "IP Address", "IP Address", "IP Address", "IP …
## $ Progress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ `Duration (in seconds)` <dbl> 219, 91, 137, 97, 127, 233, 97, 144, 103, 231,…
## $ Finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ RecordedDate <dttm> 2020-07-30 12:18:59, 2020-07-30 12:20:22, 202…
## $ SurveyID <chr> "R_ssTGHZwy5EpQaNX", "R_2fv9VCk0tjdrDOr", "R_1…
## $ ExternalReference <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ DistributionChannel <chr> "email", "email", "email", "email", "email", "…
## $ UserLanguage <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
## $ Q2 <chr> "morning person.", "morning person.", "night o…
## $ Q3 <chr> "Brownies", "I don't care for dessert.", "Ice …
## $ Q4 <chr> "Coffee,Water,The tears of my enemies", "Coffe…
## $ Q5 <dbl> 350, 12, 300, 0, 264, 4, 289, 550, 349, 300, 4…
## $ Q6 <dbl> 350, 56, 5000, 50, 286, 250, 76, 185, 108, 220…
## $ `as.character(Q5)` <dbl> 350, 12, 300, 0, 264, 4, 289, 550, 349, 300, 4…
20 / 48

Let's start cleaning up.

First, we don't need most of that data

There is a ton of data there that doesn't make sense for us to keep around.

We will use the '%>%' (pipe) operator and the verb select

WorkshopData %>%
select(ID,Q2:Q6) -> WorkshopData
glimpse(WorkshopData)
## Rows: 34
## Columns: 6
## $ ID <dbl> 5106, 6633, 7599, 4425, 2495, 6355, 8810, 3877, 1554, 7743, 8353, 2…
## $ Q2 <chr> "morning person.", "morning person.", "night owl.", "morning person…
## $ Q3 <chr> "Brownies", "I don't care for dessert.", "Ice Cream", "I don't care…
## $ Q4 <chr> "Coffee,Water,The tears of my enemies", "Coffee,Tea,Water,Milk", "F…
## $ Q5 <dbl> 350, 12, 300, 0, 264, 4, 289, 550, 349, 300, 424, 426, 231, 290, 10…
## $ Q6 <dbl> 350, 56, 5000, 50, 286, 250, 76, 185, 108, 220, 500, 350, 412, 1135…
21 / 48

Note the data are not numbers.

Let's start cleaning up.

Now we should actually take a look at the data

WorkshopData %>%
head()
## # A tibble: 6 × 6
## ID Q2 Q3 Q4 Q5 Q6
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 5106 morning person. Brownies Coffee,Water,The … 350 350
## 2 6633 morning person. I don't care for dessert. Coffee,Tea,Water,… 12 56
## 3 7599 night owl. Ice Cream Fruit Juice,Tea,W… 300 5000
## 4 4425 morning person. I don't care for dessert. Fruit Juice,Coffe… 0 50
## 5 2495 morning person. Brownies Coffee,Water 264 286
## 6 6355 night owl. Ice Cream Fruit Juice,Tea,W… 4 250
22 / 48

Let's check out the power of R

I am going to blast through these next slides, to show you some of the things that you might want to do with R

23 / 48

Let's start visualizing the data

For categorical data, you might want to get some counts.

Here is code to do this for the question about morning or night person.

WorkshopData %>%
select(Q2) %>%
group_by(Q2) %>%
tally()
## # A tibble: 2 × 2
## Q2 n
## <chr> <int>
## 1 morning person. 19
## 2 night owl. 15
24 / 48

Let's start visualizing the data

For categorical data, you might want to get a histogram.

Here is code to do this for the favorite dessert type.

WorkshopData %>%
select(Q3) %>%
group_by(Q3) %>%
tally()
## # A tibble: 5 × 2
## Q3 n
## <chr> <int>
## 1 Brown Butter Chocolate Chip Cookies 2
## 2 Brownies 9
## 3 Cheese 3
## 4 I don't care for dessert. 3
## 5 Ice Cream 17
WorkshopData %>%
select(Q3) %>%
ggplot(aes(y = Q3)) +
geom_bar()

25 / 48

Let's explore beverage choices

WorkshopData %>%
select(ID, Q4) %>%
head()
## # A tibble: 6 × 2
## ID Q4
## <dbl> <chr>
## 1 5106 Coffee,Water,The tears of my enemies
## 2 6633 Coffee,Tea,Water,Milk
## 3 7599 Fruit Juice,Tea,Water,Fizzy Water
## 4 4425 Fruit Juice,Coffee,Water
## 5 2495 Coffee,Water
## 6 6355 Fruit Juice,Tea,Water

Notice, these are not tidy data, more than one variable per line

26 / 48

Let's dummy code the beverage data

For tidy data we want one value per row

WorkshopData %>%
select(ID, Q4) %>%
separate_rows(Q4, sep = ",") %>%
head(10)
## # A tibble: 10 × 2
## ID Q4
## <dbl> <chr>
## 1 5106 Coffee
## 2 5106 Water
## 3 5106 The tears of my enemies
## 4 6633 Coffee
## 5 6633 Tea
## 6 6633 Water
## 7 6633 Milk
## 8 7599 Fruit Juice
## 9 7599 Tea
## 10 7599 Water

We can dummy code these

WorkshopData %>%
select(ID, Q4) %>%
separate_rows(Q4, sep = ",") %>%
mutate(Checked = 1) %>%
pivot_wider(names_from = Q4,
values_from = Checked,
values_fill = 0)
## # A tibble: 34 × 11
## ID Coffee Water `The tears of my enemies` Tea Milk `Fruit Juice`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5106 1 1 1 0 0 0
## 2 6633 1 1 0 1 1 0
## 3 7599 0 1 0 1 0 1
## 4 4425 1 1 0 0 0 1
## 5 2495 1 1 0 0 0 0
## 6 6355 0 1 0 1 0 1
## 7 8810 1 0 1 0 0 0
## 8 3877 0 1 1 1 1 0
## 9 1554 0 1 1 0 0 1
## 10 7743 0 1 0 1 1 1
## # ℹ 24 more rows
## # ℹ 4 more variables: `Fizzy Water` <dbl>,
## # `A delicious 12 year single malt scotch from the Scottish lowlands with notes of apple` <dbl>,
## # ` cinnamon` <dbl>, ` and dried fruit served with a single ice cube` <dbl>
27 / 48

Let's look at some quantitative data

First, let's summarize the reading data.

WorkshopData %>%
select(Q5) %>%
summarize(Ave = mean(Q5, na.rm = TRUE),
SD = sd(Q5, na.rm = TRUE))
## # A tibble: 1 × 2
## Ave SD
## <dbl> <dbl>
## 1 327. 247.
28 / 48

Let's investigate groups

Are morning people or night owls reading longer books?

WorkshopData %>%
select(Q2, Q5) %>%
group_by(Q2) %>%
summarize(Ave = mean(Q5),
SD = sd(Q5))
## # A tibble: 2 × 3
## Q2 Ave SD
## <chr> <dbl> <dbl>
## 1 morning person. 321 215.
## 2 night owl. 335. 289.

We might want to use a boxplot to display these data

WorkshopData %>%
select(Q2, Q5) %>%
ggplot(., aes(x = Q2, y = Q5)) +
geom_boxplot()

29 / 48

Let's look at readers vs. blueberries

Is there a relationship between length of book and estimates on number of blueberries?

Could do a scatter plot

WorkshopData %>%
select(Q5:Q6) %>%
mutate(Q5 = as.numeric(Q5), Q6 = as.numeric(Q6)) %>%
ggplot(aes(x = Q5, y = Q6)) +
geom_point()

30 / 48

Let's look at readers vs. blueberries

Is there a relationship between length of book and estimates on number of blueberries?

Or you could do a linear model

summary(lm(Q6 ~ Q5, data = WorkshopData))
##
## Call:
## lm(formula = Q6 ~ Q5, data = WorkshopData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -365.6 -313.8 -179.1 -98.8 4563.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 382.9414 249.7136 1.534 0.135
## Q5 0.1785 0.6126 0.291 0.773
##
## Residual standard error: 867.6 on 32 degrees of freedom
## Multiple R-squared: 0.002647, Adjusted R-squared: -0.02852
## F-statistic: 0.08492 on 1 and 32 DF, p-value: 0.7726
31 / 48

Let's prep our data for SNA

We will need to prep two separate files...

  1. An edgelist
  2. A file of attributes of the nodes.
32 / 48

Let's make an edgelist

There is an issue here.

I wanted to make these data anonymous (so we don't know who likes scotch)

But to do that I had to make the edgelist for you.

So I sent you an edgelist as a csv. Sorry.

Check out the edgelist

EL = read_csv(here("data", "AnonEL.csv"))
head(EL)
## # A tibble: 6 × 2
## ID Connections
## <dbl> <dbl>
## 1 5106 5106
## 2 6633 6196
## 3 7599 5462
## 4 4425 7743
## 5 2495 3940
## 6 6355 6355
33 / 48

Let's assemble our node attributes

Before we can convert our Edgelist to a network, we should add in the attributes.

We have several candidate attributes:

  • Morning vs. Night
  • Dessert Type
  • Pages in book

We will develop a separate dataframe for the attributes.

WorkshopData %>%
select(ID, Q2, Q3, Q5) -> AttributeDf
34 / 48

Let's assemble our node attributes

Experience tells me that when you try to add attributes that you often make a mistake where the number attributes don't match up well to the number of nodes...but lets see.

gr <- graph_from_data_frame(EL, directed = TRUE)
plot(gr)

gr = as_tbl_graph(gr)
35 / 48

Let's add some attributes

So actually the easiest way to add the attributes is to add them while you make the graph.

But that isn't as easy as it seems

gr %>%
activate(nodes) %>%
mutate(AMPM = AttributeDf$Q2)

But that isn't as easy as it seems...


Let's sort out these attributes.

The warning was: "Input AMPM must be size 60 or 1, not 34."

What this means is we need to take our attributes dataframe and make sure all the nodes are listed.

To do this we need to:

  • Compile a dataframe of all nodes listed in the graph
  • Use a join to add attributes to this dataframe
36 / 48

Let's sort out these attributes (pt 2)

#This will get a vector of all nodes
gr %>%
activate(nodes) %>%
as_tibble() %>%
transmute(ID = name) %>%
mutate(ID = as.numeric(ID))-> GrNodes
#Now we pull in the attributes using a left_join
NodeAttributes = left_join(GrNodes, AttributeDf, by = "ID")

You should inspect Node Attributes

37 / 48

Let's sort out these attributes (pt 3)

Inspect the head

head(NodeAttributes)
## # A tibble: 6 × 4
## ID Q2 Q3 Q5
## <dbl> <chr> <chr> <dbl>
## 1 5106 morning person. Brownies 350
## 2 6633 morning person. I don't care for dessert. 12
## 3 7599 night owl. Ice Cream 300
## 4 4425 morning person. I don't care for dessert. 0
## 5 2495 morning person. Brownies 264
## 6 6355 night owl. Ice Cream 4
38 / 48

Let's sort out these attributes (pt 4)

Inspect the tail

tail(NodeAttributes)
## # A tibble: 6 × 4
## ID Q2 Q3 Q5
## <dbl> <chr> <chr> <dbl>
## 1 7128 <NA> <NA> NA
## 2 1050 <NA> <NA> NA
## 3 3799 <NA> <NA> NA
## 4 1651 <NA> <NA> NA
## 5 8984 <NA> <NA> NA
## 6 1958 <NA> <NA> NA
39 / 48

Let's finally assemble this graph.

gr %>%
as_tbl_graph() %>%
activate(nodes) %>%
mutate(AMPM = NodeAttributes$Q2) %>%
mutate(Dessert = NodeAttributes$Q3) %>%
mutate(Pages = NodeAttributes$Q5) -> gr
summary(gr)
## IGRAPH 0971b7b DN-- 60 101 --
## + attr: name (v/c), AMPM (v/c), Dessert (v/c), Pages (v/n)
40 / 48

Let's finally assemble this graph.

Take a look at your graph

## # A tbl_graph: 60 nodes and 101 edges
## #
## # A directed multigraph with 8 components
## #
## # A tibble: 60 × 4
## name AMPM Dessert Pages
## <chr> <chr> <chr> <dbl>
## 1 5106 morning person. Brownies 350
## 2 6633 morning person. I don't care for dessert. 12
## 3 7599 night owl. Ice Cream 300
## 4 4425 morning person. I don't care for dessert. 0
## 5 2495 morning person. Brownies 264
## 6 6355 night owl. Ice Cream 4
## # ℹ 54 more rows
## #
## # A tibble: 101 × 2
## from to
## <int> <int>
## 1 1 1
## 2 2 35
## 3 3 14
## # ℹ 98 more rows
41 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Color (good for grouping)
  • Shape (good for grouping)
  • Linewidth (good for numeric)
  • Text (good for labels)
  • Layout

This is the most basic plot

```r
ggraph(gr) +
geom_edge_link() +
geom_node_point()
```

Not super pretty

42 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Layout, there are lots of options -Try circle?
```r
ggraph(gr, layout = 'circle') +
geom_edge_link() +
geom_node_point()
```

43 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Shape, we can make night owls one shape and morning people a differnt shape...
```r
ggraph(gr, layout = 'circle') +
geom_edge_link() +
geom_node_point(aes(shape = AMPM))
```

Not great

44 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Color, lets use the dessert type to define a color
```r
ggraph(gr, layout = 'circle') +
geom_edge_link() +
geom_node_point(aes(color = Dessert))
```

45 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Size, lets make the nodes different sizes based on the number of pages in the last book read.
```r
ggraph(gr, layout = 'circle') +
geom_edge_link() +
geom_node_point(aes(size = Pages))
```

46 / 48

Let's make our plots look a bit better

We can add various elements to our plot

  • Let's jam them all together!
```r
ggraph(gr, layout = 'circle') +
geom_edge_link() +
geom_node_point(aes(shape = AMPM,
color = Dessert,
size = Pages))
```

47 / 48

Let's save our data so that we can access it easily

We want to save as an RDS file

  • This way we avoid having to reconstruct our graph
  • We get our nodes, our edges, and our attributes.

You don't actually need to run this (I've already saved the file as an RDS so we can start with that on Day 2!)

```r
write_rds(gr, here("data", "WorkshopNetwork.rds"))
```
48 / 48

Foundations of Network Analysis

2 / 48
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow