Lab 4 - Visualizing network data and graphs in R

General lab instructions

rubric={mechanics:3}

Ensure your lab meets these basic lab requirements: https://github.ubc.ca/ubc-mds-2016/general/blob/master/general_lab_instructions.md
For this lab, you must submit both the source code (.py, .ipynb, .R etc) AND a final report in .md format that contains the visualizations and your reflection/discussion on them.

Exercise 1 - Encoding network data as a node-link diagram with force-directed placement

1a

rubric={viz:2,code:1}

Choose one of the retweet datasets from the twitter directory in the datasets repo. These datasets were extracted from Twitter by accessing the Twitter API with the twitteR package (you will learn how to do this next block). They contain recent tweets (from 2016-12-06) from one of the following hashtags: #rstats or #python. Of course, if you are keen, you could create both visualizations for spark marks! And hey, the results may be of interest to you!
I suggest you use the network package (see tutorials in Resources section below) to create a force-directed placement network visualization for the dataset you chose. The task for this visualization is to identify who are the major players in this social network when using the respective hashtag (i.e., who are the people with the most retweets, and thus greatest influence?).
Size each node by its total degree (retweeting plus being retweeted), and make edge line width proportional to the number of retweets between that pair. Be creative in how you label the nodes so that the key data is comprehensible.
Below is an example of code to create a force-directed placement network visualization using the network package and plotting in base R. I have also added a .md file with more help about using the network package: network.md

library(tidyverse)
library(sna)
library(network)

rt_useR2016 <- read_csv("data/rt_useR2016.csv")

# adjust retweets to create an edgelist for network
el_useR2016 <- as.data.frame(cbind(sender = tolower(rt_useR2016$sender),
                         receiver = tolower(rt_useR2016$screenName)))
el_useR2016 <- count(el_useR2016, sender, receiver)

rtnet_useR2016 <- network(el_useR2016, matrix.type = 'edgelist', directed = TRUE,
                ignore.eval = FALSE, names.eval = 'num')

# Get names of only those who were retweeted to keep labeling reasonable
vlabs_useR2016 <- rtnet_useR2016 %v% 'vertex.names'
vlabs_useR2016[degree(rtnet_useR2016, cmode = 'outdegree') == 0] <- NA

# plot network diagram
par(mar = c(0, 0, 3, 0))
plot(rtnet_useR2016, label = vlabs_useR2016, label.pos = 5, label.cex = .8,
     vertex.cex = log(degree(rtnet_useR2016)) + .5, vertex.col = 'lightblue',
     edge.lwd = 'num', edge.col = 'gray70', main = '#useR 2016 Retweet Network')
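
To make the edgelist step in the example above more concrete, here is a minimal base R sketch on a made-up set of retweet records (the `rt_toy` data and all names in it are hypothetical, not taken from the lab datasets). It collapses repeated sender/receiver pairs into one row with a count, then derives a total degree per user. Note the assumption that degree is weighted by retweet count; counting distinct partners instead is an equally reasonable reading of the task.

```r
# Hypothetical toy retweet records (not from the lab datasets)
rt_toy <- data.frame(
  sender   = c("Alice", "alice", "bob", "alice", "carol"),
  receiver = c("bob", "bob", "carol", "carol", "alice"),
  stringsAsFactors = FALSE
)

# Normalise case so "Alice" and "alice" are treated as the same user
rt_toy$sender   <- tolower(rt_toy$sender)
rt_toy$receiver <- tolower(rt_toy$receiver)

# One row per (sender, receiver) pair, with the number of retweets in `num`
el_toy <- aggregate(num ~ sender + receiver,
                    data = cbind(rt_toy, num = 1), FUN = sum)

# Weighted out-degree and in-degree per user
out_deg <- tapply(el_toy$num, el_toy$sender, sum)
in_deg  <- tapply(el_toy$num, el_toy$receiver, sum)

# Total degree = out-degree + in-degree (assumption: weighted by count)
users <- sort(unique(c(el_toy$sender, el_toy$receiver)))
total_deg <- setNames(numeric(length(users)), users)
total_deg[names(out_deg)] <- total_deg[names(out_deg)] + as.numeric(out_deg)
total_deg[names(in_deg)]  <- total_deg[names(in_deg)] + as.numeric(in_deg)

print(el_toy)
print(total_deg)
```

A vector like `total_deg` could then be rescaled (for example with `log()` or `sqrt()`) before being used as a node size, so that a few very active users do not swamp the plot.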

Other resources:

Two great tutorials for network visualizations in R:

1b

rubric={reasoning:2}

Identify the core of individuals engaging in a conversation at the center of the graph.
Highlight anything else interesting that the visualization revealed about the data.
Reflect and discuss how the data is represented visually and why or why not you think it is effective. Explicitly state and comment on the marks and channels used in your visual encoding, the tasks that are well supported by it, any choices you made to derive additional attributes beyond the input dataset, and the scale of the data in terms of number of observations, and the number of levels of each categorical or ordered attribute (if categorical or ordered attributes are present). Provide a rationale for your use of each mark and channel.

Exercise 2 - Encoding network data as an arc diagram

2a

rubric={viz:2,code:1}

Use the janeaustenr and tidytext libraries to come up with a dataset of co-occurring characters (by chapter) from Jane Austen's novel titled "Pride and Prejudice" (or choose any other book that you can find as an R package, or get the data into R somehow). A tutorial for tidytext can be found here.
- Note: we uncovered a couple of issues with the tidytext tutorial, specifically:
  - regex("^chapter [\divxlc]" should be regex("^chapter [\\divxlc]"
  - the pair_count function is now deprecated; try using widyr::pairwise_count instead, and note that its arguments are different from those of pair_count (read the docs!).
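
The escaping issue noted above comes from how R parses string literals: a backslash has to be doubled to reach the regex engine as `\d`. The sketch below shows the idea in base R on a few made-up lines (the `book_lines` vector is hypothetical). It uses `grepl()` with `perl = TRUE` as a stand-in for the tutorial's approach, not a copy of it, and numbers the chapters with a running sum over the detected headings.

```r
# Hypothetical toy lines standing in for the text of a book
book_lines <- c(
  "A TITLE PAGE",
  "Chapter 1",
  "some text in the first chapter",
  "Chapter 2",
  "some text in the second chapter",
  "CHAPTER III",
  "some text in the third chapter"
)

# In an R string, "\\d" is two characters: a backslash and the letter d
pattern <- "^chapter [\\divxlc]"

# Flag heading lines; perl = TRUE so that \d is understood inside [...]
is_heading <- grepl(pattern, book_lines, ignore.case = TRUE, perl = TRUE)

# A running count of headings gives each line its chapter number
# (0 means the line comes before the first chapter heading)
chapter <- cumsum(is_heading)

print(data.frame(chapter = chapter, text = book_lines))
```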
Use the arcdiagram package (tutorial here) to create an arc diagram as a network representation of character co-occurrence in the chapters of the book.
I have created a list of main characters from the "Pride and Prejudice" novel, which you can access here: pride_prejudice_characters.txt. A few notes:
- I have added some regex (or symbols for the multiple titles that characters may be referred to by), but this may not be complete, and you may want to edit/alter it.
- You could also hard code this in as a vector in R, but I think that is more work...
For a more direct, but unfinished, example (you should aim to do much better than this!) see our hints for this exercise: exercise2_hints.R
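
As a rough illustration of what "co-occurrence by chapter" means, the base R sketch below builds a chapter-by-character incidence matrix from made-up text and multiplies it with itself to count shared chapters. Everything in it is an assumption for illustration: the `chapter_text` data, the `patterns` vector, and the use of `crossprod()` as a base R substitute for a pairwise counting function. It is not the approach taken in exercise2_hints.R.

```r
# Hypothetical toy data: one row per line of text, tagged with its chapter
chapter_text <- data.frame(
  chapter = c(1, 1, 2, 2, 3),
  text = c("Elizabeth spoke to Darcy.",
           "Jane smiled.",
           "Mr. Darcy and Lizzy danced.",
           "Bingley arrived.",
           "Jane wrote to Bingley."),
  stringsAsFactors = FALSE
)

# Hypothetical character patterns (alternatives separated by |)
patterns <- c(Elizabeth = "Elizabeth|Lizzy", Darcy = "Darcy",
              Jane = "Jane", Bingley = "Bingley")

# Collapse the text of each chapter into a single string
by_chapter <- tapply(chapter_text$text, chapter_text$chapter,
                     paste, collapse = " ")

# Incidence matrix: rows are chapters, columns are characters (1 = mentioned)
incidence <- sapply(patterns, function(p) as.integer(grepl(p, by_chapter)))
rownames(incidence) <- names(by_chapter)

# Co-occurrence counts: entry [i, j] is the number of chapters
# in which characters i and j both appear
cooc <- crossprod(incidence)

# Turn the upper triangle into an edge list of pairs with counts
idx <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
pair_counts <- data.frame(
  item1 = rownames(cooc)[idx[, "row"]],
  item2 = colnames(cooc)[idx[, "col"]],
  n     = cooc[idx],
  stringsAsFactors = FALSE
)

print(pair_counts)
```

The first two columns of an edge list like `pair_counts` are the kind of input an arc diagram needs, and the `n` column is a natural candidate for arc width.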

2b

rubric={reasoning:2}

From your visualization, identify what you think are key co-occurring character pairs.
Highlight anything else interesting that the visualization revealed about the data.
Reflect and discuss how the data is represented visually and why or why not you think it is effective. Explicitly state and comment on the marks and channels used in your visual encoding, the tasks that are well supported by it, any choices you made to derive additional attributes beyond the input dataset, and the scale of the data in terms of number of observations, and the number of levels of each categorical or ordered attribute (if categorical or ordered attributes are present). Provide a rationale for your use of each mark and channel.

Exercise 3 - Encoding network data as an adjacency matrix view

3a

rubric={viz:2,code:1}

Find a large-ish network dataset (i.e., it should have over 2000 observations/items). Do not re-use a dataset from Exercises 1 or 2 of this lab; choose something different. Possible sources for data include:
- #datascience twitter dataset for tweets from 2016-12-01 to 2016-12-08
- Stanford Large Network Dataset Collection
- UCI Network Data Repository
- Network Data Repository
- The Koblenz Network Collection
- The Nexus Network Repository
Check out these tutorials on adjacency matrix visualizations in R:
- Graphs and Networks - there are a couple really great examples in this one!
- Static and dynamic network visualization with R - see the example about 3/4 through the tutorial (search "heatmap of the network matrix")
- [using igraph and ggplot2](http://matthewlincoln.net/2014/12/20/adjacency-matrix-plots-with-r-and-ggplot2.html)
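
For orientation, here is a minimal base R sketch of an adjacency matrix view on a tiny made-up edge list (the `edges` data frame is hypothetical, and a real dataset for this exercise would be far larger). It fills a square matrix from the edge list, reorders rows and columns by total degree (one of many possible orderings, chosen here only as an assumption), and draws the matrix with `image()`. The tutorials listed above use other packages and are likely a better starting point for a large network.

```r
# Hypothetical toy directed edge list
edges <- data.frame(
  from = c("a", "a", "b", "c", "d"),
  to   = c("b", "c", "c", "d", "a"),
  stringsAsFactors = FALSE
)

# Square matrix with one row and one column per node
nodes <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))

# Mark each (from, to) pair; a two-column character matrix indexes by name
adj[cbind(edges$from, edges$to)] <- 1

# Reorder by total degree so that well-connected nodes sit together
ord <- order(rowSums(adj) + colSums(adj), decreasing = TRUE)
adj_ord <- adj[ord, ord]

# image() draws z[i, j] at (x[i], y[j]), so flip and transpose to show
# row 1 at the top and column 1 at the left, as in a printed matrix
n <- nrow(adj_ord)
image(seq_len(n), seq_len(n), t(adj_ord[n:1, ]),
      col = c("white", "steelblue"), axes = FALSE,
      xlab = "to", ylab = "from", main = "Adjacency matrix (toy data)")
axis(1, at = seq_len(n), labels = colnames(adj_ord))
axis(2, at = seq_len(n), labels = rev(rownames(adj_ord)))
box()
```

The ordering of rows and columns is worth experimenting with, since it largely determines which blocks and clusters become visible in this kind of view.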

3b

rubric={reasoning:2}

Briefly describe the dataset you chose.
From your visualization, identify what you think are key relationships in the data.
Highlight anything else interesting that the visualization revealed about the data.
Reflect and discuss how the data is represented visually and why or why not you think it is effective. Explicitly state and comment on the marks and channels used in your visual encoding, the tasks that are well supported by it, any choices you made to derive additional attributes beyond the input dataset, and the scale of the data in terms of number of observations, and the number of levels of each categorical or ordered attribute (if categorical or ordered attributes are present). Provide a rationale for your use of each mark and channel.

Exercise 4 - Comparing network visualizations

4a (optional)

rubric={viz:1}

Choose one of the datasets from exercises 1-3 from this lab, and create all possible network visualizations that we explored in the exercises.

4b (optional)

rubric={reasoning:1}

In 2-3 paragraphs, compare and contrast the results from the different visualizations. Which was most informative? Why? Which was least informative? Why? Is there an optimal way to represent all network data? If not, how do you choose the best network visualization for your data?