lab4.rmd


title: "DSCI 531 - LAB 4"
output: github_document

knitr::opts_chunk$set(echo = TRUE)

Exercise 1 - Encoding network data as a node-link diagram with force-directed placement

1a

#rstats

library(tidyverse)
library(sna)
library(network)

rt_useR2016 <- read_csv("data/rstats_2016-12-06.csv")

# adjust retweets to create an edgelist for network
el_useR2016 <-
  as.data.frame(cbind(
    sender = tolower(rt_useR2016$sender),
    receiver = tolower(rt_useR2016$screenName)
  ))
el_useR2016 <- count(el_useR2016, sender, receiver)

rtnet_useR2016 <-
  network(
    el_useR2016,
    matrix.type = 'edgelist',
    directed = TRUE,
    ignore.eval = FALSE,
    names.eval = 'num'
  )

# Get names of only those who were retweeted to keep labeling reasonable
vlabs_useR2016 <- rtnet_useR2016 %v% 'vertex.names'
vlabs_useR2016[sna::degree(rtnet_useR2016) <= 12] <- NA

# plot network diagram
par(mar = c(0, 0, 3, 0))
plot(
  rtnet_useR2016,
  label = vlabs_useR2016,
  label.pos = 5,
  label.cex = 1,
  vertex.cex = log(sna::degree(rtnet_useR2016)) + .5,
  # scaling size to match degree
  vertex.col = ifelse(sna::degree(rtnet_useR2016) > 12, 'lightpink', 'lightblue'),
  edge.lwd = el_useR2016$n,
  edge.col = ifelse(el_useR2016$n >= 2, 'red', 'gray70'),
  main = '#rstats 2016-12-06 Retweet Network'
)

#python

set.seed(2)

rt_python <- read_csv("data/python_2016-12-06.csv")
el_python <- as.data.frame(cbind(
  sender = tolower(rt_python$sender),
  receiver = tolower(rt_python$screenName)
))
major_users <- el_python %>%
  group_by(sender) %>%
  count() %>%
  filter(n > 20)

el_python <- el_python %>%
  filter(sender %in% major_users$sender)

el_python <- count(el_python, sender, receiver)

rtnet_python <-
  network(
    el_python,
    matrix.type = 'edgelist',
    directed = TRUE,
    ignore.eval = FALSE,
    names.eval = 'num'
  )

vlabs_python <- data.frame(vnames = rtnet_python %v% 'vertex.names')

vlabs_python$degree_centrality <- sna::degree(rtnet_python)

vlabs_python[vlabs_python$degree_centrality <= 3, ] <- NA

plot.network(
  rtnet_python,
  label = vlabs_python$vnames,
  label.pos = 5,
  label.cex = 0.8,
  label.col = "black",
  boxed.labels = TRUE,
  arrowhead.cex = 0.6,
  vertex.cex = log(sna::degree(rtnet_python)) + 0.2,
  vertex.col = 'lightgreen',
  vertex.border = "black",
  vertex.lwd = 0.2,
  edge.lwd = el_python$n,
  edge.col = hsv(0.95, 1, 1, (0.5 + (el_python$n) / 28)),
  main = '#python 2016-12-06 Retweet Network'
  )

1b

The core individuals in the #rstats retweet network are randal_olson, rbloggers, hadleywickham, thosjleeper, rlangtip, and ipfconline1. The core individuals in the #python retweet network are datacamp, enca, sirajology, kirkdborne, jms_dot_py, fullstackpython, and ipfconline1. Interestingly, the major players in the #rstats network are more connected to one another than those in the #python network. Every major player in the #rstats network is connected to at least one other major player, and often to more than one. In contrast, the #python network is much more fractured, with people retweeting certain individuals but not others. Another major difference between the two networks is that #rstats has a single person who dwarfs all the others: randal_olson. The vertices are sized on a log scale, so the extent to which randal_olson dominates is not entirely clear, but if the scale were linear, the rest of the network would become invisible.

Rather than changing the name labels to numbers and adding a legend, I decided to put boxes around the labels and then increase the font size. I think this makes it much easier to identify the major players. I also filtered both networks so that they only included interactions that involved people with a minimum threshold of retweets. This reduces the size of the network, but it makes it much easier to identify the major players, and it also cuts out the many minimal interactions between minor players, which aren't even connected to the main network. I chose what I thought were nice-looking, high-contrast color schemes for each network, which emphasize the difference between the networks and the differences between edges and vertices. I redundantly encoded the strength of the retweet relationship between pairs in the line marks (the edges), using both the width channel and the transparency channel.
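The redundant width-plus-transparency encoding can be sketched on its own. This is only an illustration: the helper `edge_alpha` and the sample counts are hypothetical and not part of the lab code. The idea is that the alpha value handed to `hsv()` must stay within [0, 1], so it is rescaled from the observed counts instead of being divided by a hand-picked constant.

```r
# Hypothetical helper: map retweet counts to alpha values in [lo, 1].
edge_alpha <- function(n, lo = 0.5) {
  stopifnot(lo >= 0, lo <= 1)
  if (max(n) == min(n)) {
    # All edges are equally strong, so draw them all fully opaque
    return(rep(1, length(n)))
  }
  lo + (1 - lo) * (n - min(n)) / (max(n) - min(n))
}

counts <- c(1, 2, 5, 14)               # made-up edge counts
alpha <- edge_alpha(counts)            # weakest edge gets lo, strongest gets 1
edge_colors <- hsv(0.95, 1, 1, alpha)  # transparency channel
edge_widths <- counts                  # width channel, same variable
```

Passing `edge_widths` as `edge.lwd` and `edge_colors` as `edge.col` would then encode the same count twice, as described above.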

Exercise 2 - Encoding network data as an arc diagram

2a

library(stringr)
library(janeaustenr)
library(tidytext)
library(widyr)
library(arcdiagram)
library(forcats)
library(viridis)

pride_prejudice_characters <-
  read_lines("data/pride_prejudice_characters.txt")
pride_prejudice_corpus <- prideprejudice

pride_prejudice_character_string <-
  paste(pride_prejudice_characters, collapse = "|")
pride_prejudice_character_string <-
  gsub("[.]", "", pride_prejudice_character_string)
pride_prejudice_characters2 <-
  str_split(str_to_lower(pride_prejudice_character_string), "\\|")
(new_pride_prejudice <- pride_prejudice_characters2[[1]])

pride_tibble <- tibble(pride_prejudice_corpus)

pride_chapters <- pride_tibble %>%
  rename(text = pride_prejudice_corpus) %>%
  mutate(chapter = cumsum(str_detect(
    text,
    regex("^chapter [\\divxlc]", ignore_case = TRUE)
  )))

pride1grams <- pride_chapters %>%
  unnest_tokens(characters, text) %>%
  filter(characters %in% new_pride_prejudice)

pride2grams <- pride_chapters %>%
  unnest_tokens(characters, text, token = "ngrams", n = 2) %>%
  filter(characters %in% new_pride_prejudice)

pride3grams <- pride_chapters %>%
  unnest_tokens(characters, text, token = "ngrams", n = 3) %>%
  filter(characters %in% new_pride_prejudice)

pride4grams <- pride_chapters %>%
  unnest_tokens(characters, text, token = "ngrams", n = 4) %>%
  filter(characters %in% new_pride_prejudice)

pride_characters <-
  bind_rows(pride1grams, pride2grams, pride3grams, pride4grams)

unique_pride_characters <- pride_characters %>%
  mutate(characters = replace(characters, characters == "elizabeth bennet", "elizabeth")) %>%
  mutate(characters = replace(characters, characters == "elizabeth darcy", "elizabeth")) %>%
  mutate(characters = replace(characters, characters == "mr fitzwilliam darcy", "mr darcy")) %>%
  mutate(characters = replace(characters, characters == "jane bennet", "jane bingley")) %>%
  mutate(characters = replace(characters, characters == "catherine bennet", "kitty bennet")) %>%
  mutate(characters = replace(characters, characters == "lydia bennet", "lydia wickham")) %>%
  mutate(characters = replace(characters, characters == "william collins", "mr collins")) %>%
  mutate(characters = replace(characters, characters == "mr gardiner", "edward gardiner")) %>%
  unique() %>%
  arrange(chapter)

(
  character_cooccurence <- unique_pride_characters %>%
    pairwise_count(characters, chapter, sort = TRUE) %>%
    mutate(
      item1 = fct_reorder(as.factor(item1), n),
      item2 = fct_reorder(as.factor(item2), n)
    ) %>%
    rename(Character1 = item1, Character2 = item2)
)

arcplot(
  as.matrix(character_cooccurence %>% select(Character1, Character2)),
  lwd.arcs = 0.5 * character_cooccurence$n,
  line = -1,
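  # viridis() takes a single palette size, so build the palette once
  # and pick each arc's color by its co-occurrence count
  col.arcs = viridis(max(character_cooccurence$n), alpha = 0.25)[character_cooccurence$n],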
  show.nodes = FALSE,
  family = "serif",
  col.labels = "Black",
  font = 20,
  cex.lab = 1.15
) 

2b

From this visualization, I believe that the three most important relationships are those in the triad formed by Mr. Darcy, Elizabeth, and Mrs. Bennet. In general, Mr. Darcy and Elizabeth seem to be the most central, since they are also connected to the most other characters. This dataset seems to follow a power law, where a few characters are responsible for the majority of the connections.

Personally, I don't find this diagram particularly useful. This dataset is rather dense, so there are many overlapping connections, which makes it difficult for me to parse how often characters co-occur. I think this may be better represented in an adjacency matrix, which works especially well for undirected graphs.
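As a rough illustration of that alternative, a pair-count table shaped like `character_cooccurence` could be turned into a square matrix before plotting. This is a sketch with made-up counts: the `pairs` data frame and its values are hypothetical stand-ins, not results from the novel.

```r
# Made-up co-occurrence counts, listed in both directions
pairs <- data.frame(
  Character1 = c("elizabeth", "mr darcy", "elizabeth", "jane bingley"),
  Character2 = c("mr darcy", "elizabeth", "jane bingley", "elizabeth"),
  n = c(5, 5, 3, 3),
  stringsAsFactors = FALSE
)

# One shared set of names so rows and columns line up
chars <- sort(unique(c(pairs$Character1, pairs$Character2)))

# Start from an all-zero matrix and fill in the observed pairs by name
adj <- matrix(0, nrow = length(chars), ncol = length(chars),
              dimnames = list(chars, chars))
adj[cbind(pairs$Character1, pairs$Character2)] <- pairs$n

# Co-occurrence is undirected, so the matrix should equal its transpose
symmetric <- all(adj == t(adj))
```

Each cell then holds one pair's count with no overlap between pairs, which is the property the arc diagram lacks for a dense dataset.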

  • Marks: Lines (for arcs), points (for nodes)
  • Derived Data: None
  • Channels: Area (number of co-occurring chapters), position (lowest to highest co-occurring), Color (relative number of co-occurrences (for dominant partner))
  • Scale: Dozens of categories, ~100 connections
  • Tasks: Examine relationships

Exercise 3 - Encoding network data as an adjacency matrix view

3a

I am using the datascience tweets dataset.



dsci <- read_csv("data/datascience_2016-12-01_2016-12-08.csv")

# adjust retweets to create an edgelist for network

el <- as.data.frame(cbind(
  sender = tolower(dsci$sender),
  receiver = tolower(dsci$screenName)
))

el <- count(el, sender, receiver)


senders <- el %>%
  group_by(sender) %>%
  summarize(n = n()) %>%
  filter(n > 10)

receivers <- el %>%
  group_by(receiver) %>%
  summarize(n = n()) %>%
  filter(n > 10)

el <- el %>%
  filter(sender %in% senders$sender) %>%
  filter(receiver %in% receivers$receiver) %>%
  droplevels()


ggplot(el, aes(
  x = fct_reorder(sender, n, max),
  y = fct_reorder(receiver, n, max),
  fill = n
)) +
  geom_raster() +
  theme_bw() +
  # Because we need the x and y axis to display every node,
  # not just the nodes that have connections to each other,
  # make sure that ggplot does not drop unused factor levels
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  theme(# Rotate the x-axis labels so they are legible
    axis.text.x = element_text(angle = 270, hjust = 0),
    # Force the plot into a square aspect ratio
    aspect.ratio = 1) +
  ggtitle("#datasci tweets 8-DEC-16") +
  # Title the colorbar legend through the scale's name argument
  scale_fill_viridis(name = "count") +
  xlab("sender") +
  ylab("receiver")

3b

The dataset I chose contains 6425 retweets from the #datasci Twitter hashtag on December 8th, 2016.

The graph has been ordered to show high-volume connections in the upper right, with the biggest connection between innova_scape and mannitan, which both turn out to be retweeting bots. Thus, it's not surprising that the cells for these two tend to encompass most of the other active users: these bots are just retweeting the highly influential people. Investigating the high-volume senders on the right shows that the majority of them are likely bots with very little original content that just retweet everyone else, though there are some people with high volumes of original content (see @kdnuggets).

Most of the graph shows low connectivity (purple).

Mark:

  • area (cell)

Channels:

  • Spatial position (x,y) - connection volume (re-ordered senders from min to max, and re-ordered receivers from min to max), so highly connected senders/receivers are right and top
  • color - connection volume (viridis scale used here, with monotonically increasing luminance to show order and multiple hues to maximize colorfulness)

This diagram is good for showing connections between people. For a larger network, it works much better than the force-directed layout because the labels remain legible. The layout is more rigid, but patterns in the connections are visible, and the reordering gives some control over the layout. Color draws the eye to the highly connected users. This could easily be doubled or tripled in scale, possibly up to a 100x100 matrix if you were willing to drop the labels.

Most of the challenge on this problem came from the poor documentation in twitteR explaining what the package is doing. It renames variables from the official Twitter API (which itself notes that it is in constant flux) in a way that makes it challenging to figure out who is retweeting whom: who is the sender and who is the receiver. I'm not 100% certain that we don't have it mixed up (that sender is actually the retweeter, for example). This was compounded by difficulties getting twitteR working on R 3.3.2 and by the limited time for digging through the twitteR source. I did look through it, but it wasn't that informative: the Twitter API calls are hard to get at directly because they are very heavily wrapped in functions.
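One way the sender/receiver doubt could be checked is sketched below. It is hypothetical on several counts: it assumes the data also carries a raw tweet `text` column, that retweets begin with "RT @handle:", and it runs on made-up rows rather than on the lab's CSV, so it says nothing about which way the real columns point.

```r
# Made-up rows; `text` is an assumed column holding the raw tweet text
tweets <- data.frame(
  sender     = c("alice", "bob"),
  screenName = c("carol", "dave"),
  text       = c("RT @alice: new post on ggplot2", "RT @bob: tidy data tips"),
  stringsAsFactors = FALSE
)

# The handle that follows "RT @" is the author of the original tweet
is_rt <- grepl("^RT @[A-Za-z0-9_]+:", tweets$text)
original_author <- ifelse(
  is_rt,
  tolower(sub("^RT @([A-Za-z0-9_]+):.*$", "\\1", tweets$text)),
  NA_character_
)

# Share of retweets whose original author matches each candidate column
match_sender     <- mean(original_author == tolower(tweets$sender), na.rm = TRUE)
match_screenName <- mean(original_author == tolower(tweets$screenName), na.rm = TRUE)
```

Whichever column matches the extracted handle almost every time would be the original author, and the other column the retweeter.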

Exercise 4 - Comparing network visualizations

4a

rt_datascience <-
  read_csv("data/datascience_2016-12-01_2016-12-08.csv")
rt_datascience <- rt_datascience[1:3000, ]

el_datascience <-
  as.data.frame(cbind(
    sender = tolower(rt_datascience$sender),
    receiver = tolower(rt_datascience$screenName)
  ))

major_senders <- el_datascience %>%
  group_by(sender) %>%
  count() %>%
  filter(n > 5)

major_receivers <- el_datascience %>%
  group_by(receiver) %>%
  count() %>%
  filter(n > 5)

el_datascience <- el_datascience %>%
  filter(sender %in% major_senders$sender)

el_datascience <- el_datascience %>%
  filter(receiver %in% major_receivers$receiver)

el_datascience <- count(el_datascience, sender, receiver)

vnames <- data.frame(names = unique(el_datascience$receiver))
other_vnames <- data.frame(names = unique(el_datascience$sender))

vnames <- rbind(vnames, other_vnames)

vnames <- unique(vnames)

# Create igraph object (igraph provides graph.data.frame(), V(), and the
# centrality functions used below)
library(igraph)
graph <-
  graph.data.frame(el_datascience, directed = TRUE, vertices = vnames)

# Calculate various network properties, adding them as attributes
# to each node/vertex
# (igraph:: is explicit because sna also exports degree())
V(graph)$degree <- igraph::degree(graph)
V(graph)$closeness <- centralization.closeness(graph)$res
V(graph)$betweenness <- centralization.betweenness(graph)$res
V(graph)$eigen <- centralization.evcent(graph)$vector

# Re-generate dataframes for both nodes and edges, now containing
# calculated network attributes
node_list <- get.data.frame(graph, what = "vertices")

edge_list <- get.data.frame(graph, what = "edges")

# Create a character vector containing every node name
all_nodes <- sort(node_list$name)

# Order the node names by degree; this ordering supplies the shared
# factor levels for both axes of the matrix
name_order <- (node_list %>% arrange(degree))$name

# Reorder edge_list "from" and "to" factor levels based on
# this new name_order
plot_data <- edge_list %>% mutate(
  original_author = factor(from, levels = name_order),
  retweeter = factor(to, levels = name_order)
)

# Create the adjacency matrix plot
ggplot(plot_data, aes(x = original_author, y = retweeter, fill = n)) +
  geom_raster() +
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  scale_fill_viridis() +
  theme(axis.text.x = element_text(angle = 270, hjust = 0))

set.seed(1)

rtnet_datascience <-
  network(
    el_datascience,
    matrix.type = 'edgelist',
    directed = TRUE,
    ignore.eval = FALSE,
    names.eval = 'num'
  )

vlabs_datascience <-
  data.frame(vnames = rtnet_datascience %v% 'vertex.names')

vlabs_datascience$degree_centrality <-
  sna::degree(rtnet_datascience)

vlabs_datascience <- NA

plot.network(
  rtnet_datascience,
  arrowhead.cex = 0.6,
  vertex.cex = log(sna::degree(rtnet_datascience)) + 0.2,
  vertex.col = 'lightblue',
  vertex.border = "black",
  vertex.lwd = 0.2,
  edge.lwd = el_datascience$n,
  edge.col = hsv(0.08, 1, 1, (0.5 + (el_datascience$n) / 88)),
  main = '#datascience Retweet Network'
)

arcplot(
  as.matrix(plot_data %>% select(from, to)),
  lwd.arcs = plot_data$n * 0.3,
  show.labels = TRUE,
  cex.labels = 0.4,
  show.nodes = TRUE,
  horizontal = FALSE,
  pch.nodes = 21,
  col.nodes = "black",
  bg.nodes = "black",
  cex.nodes = 0.5,
  line = 0.5,
  col.arcs = hsv(0.1, 0.5 + plot_data$n / 98, 1, 0.5 + plot_data$n / 98),
  axes = FALSE,
  font = 1,
  las = 2,
  adj = 1,
  family = "serif",
  main = "#datascience Twitter network"
)

4b

For this network data, I think the network plot was the least informative. One feature that accounts for this is that I filtered the data at the outset so that it was roughly symmetrical — that is, I only included people who were retweeted a minimum number of times and retweeted others a minimum number of times. Also, this particular dataset didn't include any individuals who were extreme outliers (unlike the case of the network from exercise 1), which means that there are no major players, and any attempt to label members of the network just made the visualization unnecessarily cluttered.

The arcplot is the best for identifying relationships between individuals, and it has the added benefit that it can be searched by the labels if the viewer wants to know about specific users. The arcplot also does a better job at identifying the strength of the pairing between users than does the adjacency matrix plot. The reason for this is that the width channel (for the line marks, i.e. the arcs in the arcplot) is better for making comparisons than the color hue channel (for the point marks, i.e. the tiles in the adjacency matrix plot).

However, the adjacency matrix plot still does a good job of displaying the overall symmetry of the network, and it allows for much more precise descriptions of the relationships between the nodes than the arcplot. The reason for this is that the tiles in the adjacency matrix plot are displayed with no interference (even though they're very small), while the arcs in the arcplot are all overlapping.

Clearly, there are advantages to each approach to visualizing the network data. Based on the experience I gleaned from this lab, these might be some reasonable heuristics:

  • Network diagrams are best for identifying outliers, but they don't work as well if there aren't any outliers.
  • Arcplots are best for emphasizing the strength of an edge relationship, but they sacrifice some of the viewer's ability to look up specific relationships.
  • Adjacency matrices are especially good for displaying the symmetry of the network, and they make it easy to look up specific relationships, but they don't do a great job of identifying the strength of those relationships.
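
The symmetry that the adjacency matrix shows visually can also be summarized as a single number. A minimal sketch, assuming a small made-up edge list (the handles below are hypothetical): it computes the share of directed edges whose reverse edge is also present.

```r
# Made-up directed edge list: sender -> receiver
el <- data.frame(
  sender   = c("a", "b", "a", "c"),
  receiver = c("b", "a", "c", "d"),
  stringsAsFactors = FALSE
)

# Encode each edge and its reverse as a string key
# (Twitter handles contain no spaces, so a space is a safe separator)
forward  <- paste(el$sender, el$receiver)
backward <- paste(el$receiver, el$sender)

# Fraction of edges that are reciprocated
reciprocity <- mean(forward %in% backward)
```

A value near 1 would correspond to a matrix plot that mirrors across its diagonal, while a value near 0 would mean retweets mostly flow one way.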