Data at Scale

CPSC 538W // Winter 2016
Andrew Warfield (andy@cs.ubc.ca)
Monday/Wednesday 9:00-10:30am, ICCS 206

One of the most significant changes in computing over the past decade has been the incorporation of large quantities of data into the applications that we use every day, and into the applications that we, as computer scientists, build and maintain over our careers. Working with data at scale is often a subtle and implicit aspect of systems that we use without a second thought. For example, rendering most web pages on the internet triggers one or more complex, low-latency ad auctions, in which multiple advertisers bid real dollars to place content as part of the rendered page. Similarly, using a mobile device to get driving directions through Waze or to hail a ride through Lyft depends critically on complex, near-real-time geoinformatics systems that must consider geography, traffic, and other factors, and then project these properties over the travel time of the trip to be taken.

In this course, we will survey applications of large-scale data analysis, and the systems-level challenges and opportunities that accompany them. As a systems course, we will focus in particular on the practical and mechanistic properties of real-world systems, with a mind to understanding what makes them hard to build, whether they are efficient, and what the challenges are in achieving scale.

Our selection and coverage of material will be based on three broad questions:

Course Organization

The course is organized as 11 core weeks, each with a specific topic. We will read three papers per week, two on Monday and one on Wednesday. Each paper will be presented by one of the students in the class and then discussed as a group. We will spend roughly the second half of Wednesday’s class discussing the three papers as a set, and generalizing out to the week’s topic area.

You will be expected to write a program committee-style review of each paper in advance of the class in which it is discussed.

You will also be expected to complete a self-directed project for the course. This may be done individually or as a group, and will account for 50% of your course grade. The remaining 50% will be divided evenly between the paper presentations that you give (25%) and your reviews and participation in class discussion (25%).

Things to read and watch for context. This is a bit of a random list, mostly intended as interesting motivation and background.

Classes:

Wednesday January 4th: Course overview and logistics.

Part 1: frameworks and approaches to working with data.

The popularization of "Big Data"
Background reading: Condor, MPI, searchbenchmark.org
Jan 9th MapReduce: Simplified Data Processing on Large Clusters OSDI 2004
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing NSDI 2012
Jan 11th Dryad: Distributed Data-parallel Programs from Sequential Building Blocks EuroSys 2007
Stream processing and Dataflow (1)
Background reading: Datalog
Jan 16th Naiad: A Timely Dataflow System SOSP 2013
Discretized Streams: Fault-Tolerant Streaming Computation at Scale SOSP 2013
Jan 18th Making Sense of Performance in Data Analytics Frameworks NSDI 2015
Scalability! But at what COST? HotOS 2015
Stream processing and Dataflow (2)
Background reading: Aurora
Jan 23rd Twitter Heron: Stream Processing at Scale SIGMOD 2015
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams NSDI 2016
Jan 25th TelegraphCQ: Continuous Dataflow Processing SIGMOD 2003
Graph processing
Background reading:
Jan 30th Pregel: A System for Large-Scale Graph Processing SIGMOD 2010
One Trillion Edges: Graph Processing at Facebook-Scale VLDB 2015
Feb 1 GraphChi: Large-Scale Graph Computation on Just a PC OSDI 2012
Machine Learning
Background reading:
Feb 6 GraphLab: A New Framework For Parallel Machine Learning UAI 2010
Scaling Distributed Machine Learning with the Parameter Server OSDI 2014
Feb 8 TensorFlow: A System for Large-Scale Machine Learning OSDI 2016

Part 2: Infrastructure support.

Silicon
Background reading:
Feb 13 BC Family Day -- no class.
Feb 15 Google supercharges machine learning tasks with TPU custom chip (Google blog article)
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services ISCA 2014
Reading week
Feb 20 No class
Feb 22 No class
Storage
Background reading: FFS, soft updates, Facebook @Scale presentation
Feb 27 The Google File System SOSP 2003
f4: Facebook’s Warm BLOB Storage System OSDI 2014
Mar 1 Flat Datacenter Storage OSDI 2012
Key/Value Stores
Background reading: Chord, Bigtable, the Gribble DHT
Mar 6 Dynamo: Amazon’s Highly Available Key-value Store SOSP 2007
HyperDex: A Distributed, Searchable Key-Value Store SIGCOMM 2012
Mar 8 FaRM: Fast Remote Memory NSDI 2014
RAM and RDMA
Background reading:
Mar 13 Using RDMA Efficiently for Key-Value Services SIGCOMM 2014
Balancing CPU and Network in the Cell Distributed B-Tree Store USENIX ATC 2016
Mar 15 RDMA over Commodity Ethernet at Scale SIGCOMM 2016
The Network
Background reading:
Mar 20 Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network SIGCOMM 2015
P4: Programming Protocol-Independent Packet Processors ACM SIGCOMM CCR 2014
Mar 22 Designing Distributed Systems Using Approximate Synchrony in Datacenter Networks NSDI 2015
Automatic Scalability
Background reading:
Mar 27 Slicer: Auto-Sharding for Datacenter Applications OSDI 2016
E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems VLDB 2014
Mar 29 ASC: Automatically Scalable Computation ASPLOS 2014
Demo week
Background reading:
Apr 3 Project Demos
Apr 5 Project Demos