Jason Hartford
Feb 2015
What we'll look at…
R was originally developed as an open-source clone of S.
… so it's very popular in statistics departments (where I first learnt it)
Now incredibly popular in data science because of the abundance of good libraries
(Though Python may be taking over)
It's full of inconsistencies
Can be a pain to learn - especially if you think of it as a programming language (!)
But… (personal opinion) it's still the fastest way to go from raw data to useful information…
R is the language
RStudio is the IDE
Just about everyone uses RStudio… unless you're a Vim ninja or have some other good reason not to, I'd advise you to do the same…
Four main datatypes
Vectors
Data Frames
Lists
Matrices and arrays
x <- 10.5
a <- c(1,3,5,7)
b <- c(1,2,4,8)
d <- c(a,b)
d
[1] 1 3 5 7 1 2 4 8
y <- 3L # Integer (the L suffix marks an integer literal)
z <- 1 + 2i # Complex
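A quick sketch of what vectors give you in practice: arithmetic works element by element, and every vector holds a single type. The values are the same ones used above; the names sums, doubled and mixed are just for illustration.

```r
# Same example values as the vectors defined above
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)

# Arithmetic is applied element by element
sums <- a + b        # 2 5 9 15
doubled <- a * 2     # 2 6 10 14

# Each atomic vector has exactly one type
class(10.5)          # "numeric"
class(3L)            # "integer"
class(1 + 2i)        # "complex"

# Mixing types coerces everything to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)         # "character"
```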
a <- c(101,115,83,120)
b <- c("Tom","Bob","Jason","Jennifer")
c <- c(TRUE,FALSE,FALSE,TRUE)
dat <- data.frame(iq = a, name = b, google_job = c)
dat
   iq     name google_job
1 101      Tom       TRUE
2 115      Bob      FALSE
3  83    Jason      FALSE
4 120 Jennifer       TRUE
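Once you have a data frame, most of the work is pulling out columns and filtering rows. A minimal sketch using the same toy data as above (mean_iq and hired are illustrative names):

```r
# Rebuild the toy data frame from the example above
dat <- data.frame(
  iq = c(101, 115, 83, 120),
  name = c("Tom", "Bob", "Jason", "Jennifer"),
  google_job = c(TRUE, FALSE, FALSE, TRUE)
)

dat$iq                           # a single column, returned as a vector
mean_iq <- mean(dat$iq)          # 104.75

# Logical indexing: keep the rows where google_job is TRUE, all columns
hired <- dat[dat$google_job, ]
nrow(hired)                      # 2

dim(dat)                         # 4 rows, 3 columns
```

Note the trailing comma in dat[dat$google_job, ]: the part before the comma selects rows, the part after it selects columns.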
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x) # One of the most useful commands you'll learn
List of 4
$ : int [1:3] 1 2 3
$ : chr "a"
$ : logi [1:3] TRUE FALSE TRUE
$ : num [1:2] 2.3 5.9
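The thing that trips most people up with lists is the difference between single and double brackets. A small sketch, reusing the list from above (the person list is a made-up example to show named elements):

```r
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

first_as_list <- x[1]      # single brackets return a list of length 1
first_element <- x[[1]]    # double brackets return the element itself

class(first_as_list)       # "list"
class(first_element)       # "integer"

# Lists can also have named elements, which can be accessed with $
person <- list(name = "Tom", iq = 101)
person$iq                  # 101
```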
a <- matrix(1:6, ncol = 3, nrow = 2)
b <- matrix(c(2,2,4), nrow = 3, ncol = 1)
a %*% b # note * does elementwise multiplication
     [,1]
[1,]   28
[2,]   36
t(b) # transpose and all the other usual operations are available
     [,1] [,2] [,3]
[1,]    2    2    4
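To make the difference between * and %*% concrete, here is a short sketch on the same 2 x 3 matrix (elementwise and gram are illustrative names):

```r
a <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(a)                                 # 2 3

# * multiplies matching entries, so both operands need the same shape
elementwise <- a * a                   # each entry squared, still 2 x 3

# %*% is the matrix product, so inner dimensions must agree: (2 x 3) %*% (3 x 2)
gram <- a %*% t(a)
gram
#      [,1] [,2]
# [1,]   35   44
# [2,]   44   56
```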
R can read from just about anywhere…
CSV and flat files:
dat <- read.csv('file.csv', sep = ",")
or more generally, read.table()
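A self-contained sketch of the two calls side by side. The data and the temporary file are made up for the example; in practice you would point path at your own file.

```r
# Write some hypothetical data to a temporary file so the example runs anywhere
path <- tempfile(fileext = ".csv")
write.csv(data.frame(iq = c(101, 115), name = c("Tom", "Bob")),
          path, row.names = FALSE)

# read.csv assumes a header row and comma separators by default
dat <- read.csv(path)

# read.table is the general version: say explicitly what the file looks like
dat2 <- read.table(path, header = TRUE, sep = ",")

str(dat)        # check that the column types came through as expected
unlink(path)    # remove the temporary file
```

Running str() straight after reading a file is a cheap way to catch columns that were read in as the wrong type.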
Databases: "RMySQL", "RODBC", "ROracle", "RPostgreSQL", etc.
Excel, JSON, URLs, Twitter, APIs, etc., etc.
Presentations
Markdown (Shiny)
LaTeX
Advanced R by Hadley Wickham
An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Statistical Learning by Trevor Hastie and Robert Tibshirani
Coursera Data Science Specialization - A bunch of R courses