Topics in Artificial Intelligence (CPSC 532S):
Multimodal Learning with Vision, Language and Sound
Winter Term 2, 2018
Course Information
Multimodal machine learning is a multi-disciplinary research field that addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic). This course will teach fundamental concepts of multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are broadly applicable, the focus will be on studying them in the context of joint reasoning about, and understanding of, images/videos and language (text).
In addition to fundamentals, we will study the rich body of recent research at the intersection of vision and language, including the problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding, and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.
- Instructor: Leonid Sigal (lsigal@cs.ubc.ca)
- TAs: Mohit Bajaj (mbajaj01@cs.ubc.ca), Siddhesh Khandelwal (skhandel@cs.ubc.ca)
- Office hours: TBD and by appointment (all communication to go through Piazza)
- Class meets: Tuesday, Thursday 9:30 - 11:00 am, DMP 101
- Piazza: piazza.com/ubc.ca/winterterm22018/cpsc532s
Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In short, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course, please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and with Python.
Computational Requirements: Due to the size of the data, most of the assignments in the class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed for the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We will also provide credits for the use of the Microsoft Azure cloud service for all students in the class. Note that the amount of credits will be limited and not replenishable, which means you will have to be judicious about their use and about execution times. An optional (but extremely useful) tutorial on using Microsoft Azure will be given during the first two weeks of classes, outside of the regular course meeting time.
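For reference, here is a minimal sketch (in Python, using the PyTorch library listed under Resources below) for checking that your own machine or Azure VM meets these requirements; the 4GB threshold in the code simply mirrors the figure above:

    import torch

    # Verify that a CUDA-capable GPU is visible and report its memory.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        mem_gb = props.total_memory / 1024**3
        print(f"GPU: {props.name}, {mem_gb:.1f} GB")
        if mem_gb < 4:
            print("Warning: assignments assume at least 4GB of GPU RAM.")
    else:
        print("No CUDA-capable GPU found; course code will be impractical on CPU.")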
Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Unregistered auditors are not expected, or required, to do any assignments or readings; they are welcome, but will only be accommodated to the extent there is physical room in the class.
Grading
Assignments (four assignments in total) | 30%
   Assignment #1: Neural Networks Introduction | 5%
   Assignment #2: Convolutional Neural Networks | 5%
   Assignment #3: Recurrent Neural Network Language Models | 10%
   Assignment #4: Neural Model for Image Captioning / Retrieval | 10%
Research papers | 20%
   Readings and reviews: two papers a week after the break | 10%
   Presentations and discussion: one paper per semester | 10%
Group project (proposal, final presentation and web report) | 50%
Assignments (30% of the grade)
Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language, or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, skipping one will likely have a significant effect on your grade as well as on your overall understanding of the material.
Research papers (20% of the grade)
In the second half of the course, every week we will read two papers as a class (additional papers will be presented in class, but will not be required reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to participate in paper presentation and debate. In other words, each student will need to present (defend) a paper and argue against (attack) a paper in class (depending on the enrollment, this is likely to be done in small groups). Note that all students are expected to attend all classes, including those where your peers present. While I will not be taking attendance, if you miss too many classes for unspecified reasons, I reserve the right to deduct up to 10% from your final grade at my discretion.
Reviews: Reviews should be succinct and to the point; bulleted lists are welcome when applicable. When you present, you do not need to hand in a review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).
Structure of the reviews:
  - Short summary of the paper (3-4 sentences)
  - Main contributions (2-3 bullet points)
  - Positive and negative points (2-3 bullet points each)
  - What did you not understand or find unclear about the paper? (2-3 bullet points)
Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Piazza as private notes to the instructor.
Paper Presentation and Debate: Each student will need to present a paper in class (either individually or as a group, depending on enrollment). Students will be assigned to papers based on their preferences: a list of papers will be given out, and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, and be clear and practiced. The student(s) should read the assigned paper and related work in enough detail to be able to lead a discussion and answer questions. A presentation should be roughly 30 minutes long (although that may be adjusted based on enrollment). You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed); however, you need to make the material your own and present it in the context of the class. In addition, another student (or group of students) will be assigned to attack the paper and the shortcomings it may have. The goal of this structure (which is different from past offerings of this course) is to generate a healthy scientific debate about the papers.
Structure of the paper presentation:
  - High-level overview of the problem and motivation
  - Clear statement of the problem
  - Overview of the technical details of the method, including necessary background
  - Relationship of the approach and method to others discussed in class
  - Discussion of strengths and weaknesses of the approach
  - Discussion of strengths and weaknesses of the evaluation
  - Discussion of extensions (published or potential)
Deadline: Each student (or student group) is required to have slides ready and to meet with the instructor at least 2 days before the presentation, to obtain and incorporate feedback. Students are responsible for reaching out to the instructor to schedule these meetings in advance.
Project (50% of the grade)
A major component of the course is a student-led research project. Due to the size of the class, these are encouraged to be group projects with approximately 2-3 students (3 highly encouraged); individual projects are possible under certain circumstances, with instructor approval. The scope of the project will need to be scaled appropriately to the group size. The projects will be research oriented, and each student in the group needs to contribute significantly to the algorithmic components and implementation. Please start thinking about the project as early as possible. The expectation is that you will have a well-formed idea by spring break and will present it right after.
The project can be on any interesting topic related to the course that students come up with on their own or with the help of the instructor. Some project ideas will be suggested in class. Note that re-implementing an existing paper is not sufficient; the project needs to attempt to go beyond an existing publication. The grade will depend on the project definition, how well you present it in the report, how well you position your work in the related literature, how thorough your experiments are, and how thoughtful your conclusions are.
When thinking about the project, and for the proposal, you should think about:
  - The overall problem you want to solve
  - What dataset you are going to use (see sample list below)
  - What model you will use and/or from what paper you will start
  - What related literature you should look at
  - Who on the team will work on what, specifically
  - How you will evaluate performance
Deadline: In the middle of the semester you will need to hand in a project proposal and give a quick (5 minutes or less) presentation of what you intend to do. Prior to this, you need to discuss the project idea with the instructor (in person or via e-mail). Final project presentations will be given during finals week at the scheduled time, where each group will present their findings in a roughly 10-15 minute presentation. The final writeup will take the form of a (password-protected) project webpage and should contain links to the GitHub repository with the code. This writeup will be due on the day of the final.
Schedule
Date | Topic | Reading
---|---|---
W1: Jan 3 | Introduction to the Course (slides): What is multi-modal learning?; Challenges in multi-modal learning; Course expectations and grading | (optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser
W2: Jan 8 | Introduction to Deep Learning [Part 1] (slides): Multi-layer Perceptron (MLP); Stochastic Gradient Descent; Computational graphs; NNs as Universal Approximators. Assignment 1 out (download) | Deep Learning in Nature by LeCun et al.; Automatic Differentiation in Machine Learning: a Survey by Baydin et al.
W2: Jan 10 | Introduction to Deep Learning [Part 2] (slides): Regularization (L1, L2, batch norm, dropout); Terminology and practical advice on optimization; Simple loss functions; Structure, parameters and hyper-parameters. Introduction to Computer Vision (slides): History; Basic operations and problems; Image filtering and features |
W3: Jan 14 | Assignment 1 due |
W3: Jan 15 | Convolutional Neural Networks [Part I] (slides): CNN basics; CNNs as a feature representation. Assignment 2 out (download, data [~10GB]) | Chapter 9 (9.1-9.3) of the Deep Learning Book
W3: Jan 17 | Convolutional Neural Networks [Part II] (slides): Regularization, data augmentation; Pre-training and transferability; AlexNet, VGG, GoogLeNet, ResNet | CNNs for Computer Vision by Srinivas et al.
W4: Jan 22 | Convolutional Neural Networks [Part III] (slides): Fully convolutional networks (transpose convolutions); CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, YOLO) | R-CNN by Girshick et al.
W4: Jan 23 | Assignment 2 due |
W4: Jan 24 | Visualizing CNNs (slides): Guided backprop; Gradient ascent; Adversarial examples. Introduction to Natural Language Processing (slides): Tasks in NLP; Why NLP is difficult; Representing words and text. Assignment 3 out (download) | Word Representations in Vector Space by Mikolov et al.; Chapter 10 of the Deep Learning Book
W5: Jan 29 | Recurrent Neural Networks [Part I] (slides): Recurrent Neural Networks; Long Short-Term Memory networks (LSTMs); Gated Recurrent Units (GRUs) |
W5: Jan 31 | Recurrent Neural Networks [Part II] (slides): Encoder-decoder RNNs; Translation models; Attention models |
W6: Feb 5 | Recurrent Neural Networks [Part III] (slides): Applications: image captioning, question answering, activity recognition. Unsupervised Representation Learning [Part I] (slides): Autoencoders, denoising autoencoders. Assignment 3 due. Assignment 4 out (download) | Unified Visual-Semantic Embeddings by Kiros et al.
W6: Feb 7 | Unsupervised Representation Learning [Part II] (slides): Stacked autoencoders, context encoders; Bottleneck theory. Multimodal Learning [Part I] (slides): Intro to multimodal learning; Multimodal joint representations; Canonical Correlation Analysis (CCA) |
W7: Feb 12 | Snow Day |
W7: Feb 14 | Multimodal Learning [Part II] (slides): Joint embedding models; Applications. Generative Models (slides): PixelRNN, VAEs (intro) |
W7: Feb 15 | Assignment 4 due |
W8: Feb 19 | Spring Break (no class) |
W8: Feb 21 | Spring Break (no class) |
W9: Feb 26 | Final Project Pitches |
W9: Feb 28 | Final Project Pitches |
W10: Mar 5 | Final Project Pitches |
W10: Mar 7 | Generative Models (slides): VAEs; GANs; Applications |
W11: Mar 12 | Graph Neural Networks (slides): Convolutions on a graph; GNNs as message passing networks; Variants of GNNs. Deep Reinforcement Learning (slides): Introduction; Value-based RL, policy-based RL, REINFORCE; Applications | Relational Inductive Biases, Deep Learning, and Graph Networks by Battaglia et al.; Gated Graph Sequence Neural Networks by Li et al.
W11: Mar 14 | Paper Readings 1: Image Captioning and Understanding | Graph R-CNN for Scene Graph Generation by Yang et al.
W12: Mar 19 | Paper Readings 2: Visual Question Answering | Neural Module Networks by Andreas et al.
W12: Mar 21 | Paper Readings 3: Architecture Design Choices | FiLM: Visual Reasoning with a General Conditioning Layer by Perez et al.
W13: Mar 26 | Guest Lecture |
W13: Mar 28 | Paper Readings 4: Neural Architecture Search / AutoML | Learning Transferable Architectures for Scalable Image Recognition by Zoph et al. -- OR -- Distilling the Knowledge in a Neural Network by Hinton et al.
W14: April 2 | Paper Readings 5: Having Fun with Modalities | Diverse Image-to-Image Translation via Disentangled Representations by Lee et al.
W14: April 4 | Paper Readings 6: Learning with Little Data | Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks by Finn et al.
TBD | Final Project Presentations; Final Project Writeups due |
Resources
Related Classes
This course was heavily inspired by courses at other institutions. Most notably:
- Vision & Language (CS 6501-004) course at Virginia Tech
- Advanced Multimodal Machine Learning (11-777) course at CMU
- Visual Recognition with Text (CSC2539) course at University of Toronto
- Deep Learning (CS 7648) course at Georgia Tech
as well as:
- Convolutional Neural Networks for Visual Recognition (CS231n) at Stanford University
- Deep Learning for Natural Language Processing (CS224d) at Stanford University
- Language and Vision (Comp 790-133) course at UNC Chapel Hill
- Topics in Computer Vision: Deep Learning in Computer Vision (CSC2523) course at University of Toronto
Books
- Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press
Libraries
- PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials, as well as the short sketch after this list
- Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
- TensorFlow: popular deep learning library from Google
- Theano: another popular deep learning library
- CNTK: Microsoft's Cognitive Toolkit deep learning library
- scikit-learn: Machine learning in Python
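To give a flavor of what the early assignments involve, below is a minimal, self-contained PyTorch sketch of an MLP trained with one step of SGD; the layer sizes and random toy data are illustrative placeholders, not assignment specifications:

    import torch
    import torch.nn as nn

    # Toy batch: 64 random inputs with 784 features, 10 target classes.
    x = torch.randn(64, 784)
    y = torch.randint(0, 10, (64,))

    # A small multi-layer perceptron of the kind covered in Week 2.
    model = nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # One stochastic gradient descent step: forward, backward, update.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"training loss: {loss.item():.3f}")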
Datasets
- ImageNet: Large-scale image classification dataset
- VQA: Visual Question Answering dataset
- Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
- LSMDC: Large-Scale Movie Description Dataset and challenge
- Madlibs: Visual fill-in-the-blank dataset
- ReferIt: Dataset of visual referring expressions
- VisDial: Visual dialog dataset
- ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
- VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
- VIST: VIsual StoryTelling dataset
- CLEVR: Compositional Language and Elementary Visual Reasoning dataset
- COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
- Toronto COCO-QA: Toronto question answering dataset
- Text-to-image coreference: Multi-sentence descriptions of RGB-D scenes, with annotations for image-to-text and text-to-text coreference
- MovieQA: Automatic story comprehension dataset from both video and text
- Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
- imSitu: Situation recognition dataset with annotations of main activities, participating actors, objects, substances, and locations, and the roles these participants play in the activity
- MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.