R, Data Science and other buzzwords

Jason Hartford
Feb 2015

A brief introduction to R

What we'll look at…

  • A brief introduction to R and RStudio
  • Loading and cleaning data
  • Machine learning / Stats essentials…
  • Competitive data science - Kaggle
  • Publishing your work (if we get there…)

What's the deal with R?

Originally developed as an open source clone of S.

… so it's very popular in statistics departments (where I first learnt it)

Now incredibly popular in data science because of the abundance of good libraries

(Though Python may be taking over)

Downsides

It's full of inconsistencies

Can be a pain to learn - especially if you think of it as a programming language (!)

But… (personal opinion) it's still the fastest way to go from raw data to useful information…

R and RStudio

R is the language

RStudio is the IDE

Just about everyone uses RStudio… unless you're a Vim ninja or have some other good reason not to, I'd advise you to do the same…

Data Types

Four main data structures

  • Vectors

  • Data Frames

  • Lists

  • Matrices and arrays

Vectors

x <- 10.5
a <- c(1,3,5,7)
b <- c(1,2,4,8)
d <- c(a,b)
d
[1] 1 3 5 7 1 2 4 8
y <- 3L #integers
z <- 1 + 2i # Complex
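As a rough sketch of what you can do with vectors once you have them: arithmetic works element by element, indexing starts at 1, and `typeof()` shows the storage type. The values reuse the ones on this slide; the mixed-type vector at the end is just an illustration.

```r
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)

# Arithmetic is applied element by element
a + b        # 2 5 9 15
a * 2        # 2 6 10 14

# Indexing is 1-based; a logical condition selects matching elements
a[2]         # 3
a[a > 3]     # 5 7

# typeof() reports the underlying storage type
typeof(10.5)    # "double"
typeof(3L)      # "integer"
typeof(1 + 2i)  # "complex"

# A vector holds one type only, so mixing types coerces to the most general one
typeof(c(1, "a", TRUE))  # "character"
```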

Data Frames

a <- c(101,115,83,120)
b <- c("Tom","Bob","Jason","Jennifer")
c <- c(TRUE,FALSE,FALSE,TRUE)

dat <- data.frame(iq = a, name = b, google_job = c)
dat
   iq     name google_job
1 101      Tom       TRUE
2 115      Bob      FALSE
3  83    Jason      FALSE
4 120 Jennifer       TRUE
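A minimal sketch of pulling things back out of that data frame, using base R only. The `stringsAsFactors = FALSE` argument is an addition here (not on the slide) so that names stay plain strings regardless of R version.

```r
dat <- data.frame(
  iq = c(101, 115, 83, 120),
  name = c("Tom", "Bob", "Jason", "Jennifer"),
  google_job = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep names as character strings, not factors
)

dat$iq                      # one column, returned as a vector
dat[dat$google_job, ]       # rows where google_job is TRUE
dat[dat$iq > 100, "name"]   # "Tom" "Bob" "Jennifer"
mean(dat$iq)                # 104.75
str(dat)                    # compact summary of columns and types
```

The pattern is always `dat[rows, columns]`; leaving one side empty means "all of them".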

Lists

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x) # One of the most useful commands you'll learn
List of 4
 $ : int [1:3] 1 2 3
 $ : chr "a"
 $ : logi [1:3] TRUE FALSE TRUE
 $ : num [1:2] 2.3 5.9
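One thing that trips people up with lists is the difference between `[ ]` and `[[ ]]`. A small sketch, reusing the list above; the `person` list is a made-up example to show named elements.

```r
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

first_as_list <- x[1]     # single brackets: still a list, of length 1
first_element <- x[[1]]   # double brackets: the integer vector itself

class(first_as_list)      # "list"
class(first_element)      # "integer"

# Named elements can also be reached with $
person <- list(name = "Tom", iq = 101)  # illustrative values
person$name               # "Tom"

# Apply a function to every element and collect the results
sizes <- sapply(x, length)  # 3 1 3 2
```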

Matrices and arrays

a <- matrix(1:6, ncol = 3, nrow = 2)
b <- matrix(c(2,2,4), nrow = 3, ncol = 1)
a %*% b # note * does elementwise multiplication
     [,1]
[1,]   28
[2,]   36
t(b) #transpose and all the other usual operations are available
     [,1] [,2] [,3]
[1,]    2    2    4
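A short sketch contrasting the two kinds of multiplication and showing matrix indexing, using the same `a` as above. Note that `matrix()` fills column by column unless told otherwise.

```r
a <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
a
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

dim(a)          # 2 3

a * a           # elementwise: each entry squared
a %*% t(a)      # matrix product: a 2 x 2 result

a[2, 3]         # row 2, column 3: 6
a[, 1]          # whole first column: 1 2
```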

Reading Data

R can read from just about anywhere…

  • CSV and flat files:

    dat <- read.csv('file.csv', sep = ",")

    or more generally, read.table()

  • Databases: packages such as "RMySQL", "RODBC", "ROracle", "RPostgreSQL", etc.

  • Excel, JSON, URLs, Twitter, APIs, etc.
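A self-contained sketch of the CSV case. Since there is no real 'file.csv' to hand, it first writes a tiny data frame to a temporary file (the contents are made up for illustration) and then reads it back with both `read.csv()` and the more general `read.table()`.

```r
# Write a small data frame to a temporary file so there is something to read
path <- tempfile(fileext = ".csv")
write.csv(data.frame(iq = c(101, 115), name = c("Tom", "Bob")),
          path, row.names = FALSE)

# read.csv() assumes a header row and comma separators
dat <- read.csv(path, stringsAsFactors = FALSE)
str(dat)

# read.table() is the general form: header and separator are spelled out
dat2 <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

unlink(path)  # remove the temporary file
```

Always run `str()` on freshly loaded data; it is the quickest way to spot a column that was read as the wrong type.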

This is boring... show me a demo instead!

Now that we have amazing results... how do we share them with the world?

Presentations

R Markdown (and Shiny)

LaTeX

Learn more

Advanced R by Hadley Wickham

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Statistical Learning by Trevor Hastie and Robert Tibshirani

Coursera Data Science Specialization - A bunch of R courses