Data at Scale

CPSC 538W // Winter 2016
Andrew Warfield (andy@cs.ubc.ca)
Monday/Wednesday 9:00-10:30am, ICCS 206

One of the most significant changes in computing over the past decade has been the incorporation of large quantities of data into the applications that we use every day, and into the applications that we, as computer scientists, build and maintain over our careers. Working with data at scale is often a subtle and implicit aspect of systems that we use without a second thought. For example, rendering most web pages on the internet triggers one or more complex, low-latency ad auctions, in which multiple advertisers bid real dollars to place content as part of the rendered page. Similarly, using a mobile device to get driving directions through Waze or to hail a ride through Lyft depends critically on complex, near-real-time geoinformatics systems that must consider geography, traffic, and other factors, and then project these properties over the travel time of the trip to be taken.

In this course, we will survey applications of large-scale data analysis, and the systems-level challenges and opportunities that accompany them. As a systems course, we will focus in particular on the practical and mechanistic properties of real-world systems, with a mind to understanding what makes them hard to build, whether they are efficient, and what the challenges are in achieving scale.

Our selection and coverage of material will be based on three broad questions:

Course Organization

The course is organized as 11 core weeks, each with a specific topic. We will read three papers per week, two on Monday and one on Wednesday. Each paper will be presented by one of the students in the class and then discussed as a group. We will spend roughly the second half of Wednesday’s class discussing the three papers as a set, and generalizing out to the week’s topic area.

You will be expected to write a program committee-style review of each paper in advance of the class in which it is discussed.

You will also be expected to complete a self-directed project for the course. This may be done individually or as a group, and will account for 50% of your course grade. The remaining 50% will be divided evenly between the paper presentations that you give (25%) and your reviews and participation in class discussion (25%).

Things to read and watch for context. This is a bit of a random list, mostly intended as interesting motivation and background.

Classes:

Wednesday January 4th: Course overview and logistics.

Part 1: frameworks and approaches to working with data.

The popularization of "Big Data"
Background reading: Condor, MPI, searchbenchmark.org
Jan 9th MapReduce: Simplified Data Processing on Large Clusters OSDI 2004
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing NSDI 2012
Jan 11th Dryad: Distributed Data-parallel Programs from Sequential Building Blocks EuroSys 2007
Stream processing and Dataflow (1)
Background reading: Datalog
Jan 16th Naiad: A Timely Dataflow System SOSP 2013
Discretized Streams: Fault-Tolerant Streaming Computation at Scale SOSP 2013
Jan 18th Making Sense of Performance in Data Analytics Frameworks NSDI 2015
Scalability! But at what COST? HotOS 2015
Stream processing and Dataflow (2)
Background reading: Aurora
Jan 23rd Twitter Heron: Stream Processing at Scale SIGMOD 2015
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams NSDI 2016
Jan 25th TelegraphCQ: Continuous Dataflow Processing SIGMOD 2003
Graph processing
Background reading:
Jan 30th Pregel: A System for Large-Scale Graph Processing SIGMOD 2010
One Trillion Edges: Graph Processing at Facebook-Scale VLDB 2015
Feb 1 GraphChi: Large-Scale Graph Computation on Just a PC OSDI 2012
Machine Learning
Background reading:
Feb 6 GraphLab: A New Framework For Parallel Machine Learning UAI 2010
Scaling Distributed Machine Learning with the Parameter Server OSDI 2014
Feb 8 TensorFlow: A System for Large-Scale Machine Learning OSDI 2016

Part 2: Infrastructure support.

Silicon
Background reading:
Feb 13 BC Family Day -- no class.
Feb 15 Google supercharges machine learning tasks with TPU custom chip (Google blog article)
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services ISCA 2014
Reading week
Feb 20 No class
Feb 22 No class
Storage
Background reading: FFS, soft updates, Facebook @Scale presentation
Feb 27 The Google File System SOSP 2003
f4: Facebook’s Warm BLOB Storage System OSDI 2014
Mar 1 Flat Datacenter Storage OSDI 2012
Key/Value Stores
Background reading: Chord, Bigtable, the Gribble DHT
Mar 6 Dynamo: Amazon’s Highly Available Key-value Store SOSP 2007
HyperDex: A Distributed, Searchable Key-Value Store SIGCOMM 2012
Mar 8 FaRM: Fast Remote Memory NSDI 2014
RAM and RDMA
Background reading:
Mar 13 Using RDMA Efficiently for Key-Value Services SIGCOMM 2014
Balancing CPU and Network in the Cell Distributed B-Tree Store USENIX ATC 2016
Mar 15 RDMA over Commodity Ethernet at Scale SIGCOMM 2016
The Network
Background reading:
Mar 20 Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network SIGCOMM 2015
P4: Programming Protocol-Independent Packet Processors ACM SIGCOMM CCR 2014
Mar 22 Designing Distributed Systems Using Approximate Synchrony in Datacenter Networks NSDI 2015
Automatic Scalability
Background reading:
Mar 27 Slicer: Auto-Sharding for Datacenter Applications OSDI 2016
E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems VLDB 2014
Mar 29 ASC: Automatically Scalable Computation ASPLOS 2014
Demo week
Background reading:
Apr 3 Project Demos
Apr 5 Project Demos