Topics in Artificial Intelligence (CPSC 532S):
Multimodal Learning with Vision, Language and Sound
Winter Term 1, 2022
Course Information
Multimodal machine learning is a multi-disciplinary research field which addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic, etc.). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are applicable broadly, the focus will be on studying them in the context of joint reasoning about and understanding of images/videos and language (text).
In addition to the fundamentals, we will study the rich body of recent research at the intersection of vision and language, including the problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding, and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.
Content Delivery and Covid Precautions: The lectures will be offered in-person only and no recordings will be made. Unfortunately, for this reason, a hybrid delivery of the material will not be available. We will experiment with hybrid office hours, as we believe this will benefit the students. Students are strongly encouraged and expected (but not required) to wear masks in class. This is largely for the benefit of your fellow students, with whom you will sit in close proximity. The instructor will not wear a mask when lecturing (this improves delivery of the material) but will put on a mask in close-interaction settings or when requested by students. If at any point a student is diagnosed with COVID or has symptoms, they are expected to follow UBC and provincial guidelines and isolate at home. Please inform the instructor of such cases and he will do his best to provide accommodations.
- Instructor: Leonid Sigal (lsigal@cs.ubc.ca)
- TAs: Rayat Hossain (rayat137@cs.ubc.ca), Tanzila Rahman (tanzila.himu@gmail.com)
- Office hours: TBD and by appointment (all communication to go through Piazza)
- Class meets: Tuesday, Thursday 11:00 - 12:30 pm, ICICS 246
- Piazza: piazza.com/ubc.ca/winterterm12022/cpsc532s/home
Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course, please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and Python.
Computational Requirements: Due to the size of the data, most of the assignments in this class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed to develop the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We also encourage you to use Google Colab, which should be sufficient for your assignments but likely not for the project. You may also register for student accounts with Amazon AWS (free $25 credit) or with Microsoft Azure (free $100 credit). Note that while TAs will do their best to help with your specific environment setup, due to the heterogeneity of setups that may result from these choices, it is ultimately up to the student to sort out the details.
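As a quick sanity check of your setup (local machine, Colab, AWS, or Azure), the minimal PyTorch sketch below reports whether a CUDA-capable GPU is visible and roughly how much memory it has; the 4 GB threshold simply mirrors the requirement above and is not an official checker for the assignments.

```python
# Minimal environment check (assumes PyTorch is installed).
import torch

def check_gpu(min_gb: float = 4.0) -> None:
    """Report whether a CUDA-capable GPU is visible and whether it meets the memory requirement."""
    if not torch.cuda.is_available():
        print("No CUDA-capable GPU detected; PyTorch will fall back to the CPU.")
        return
    props = torch.cuda.get_device_properties(0)   # properties of the first visible GPU
    total_gb = props.total_memory / 1024 ** 3     # bytes -> GiB
    print(f"Found {props.name} with {total_gb:.1f} GB of GPU RAM.")
    if total_gb < min_gb:
        print(f"Warning: less than {min_gb:.0f} GB of GPU RAM; some assignments may not fit.")

if __name__ == "__main__":
    check_gpu()
```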
Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Those unregistered who would like to audit are not expected, or required, to do any assignments or readings. Unregistered auditors are welcome to contact me and I will assess the feasibility of letting you attend the lectures (subject to space in the lecture room).
Grading
Assignments (five assignments in total) | 40%
   Assignment #0: Introduction to PyTorch (0% -- ungraded)
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
   Assignment #5: Advanced Neural Architectures (10%)
Research papers | 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Project (proposal, final presentation and web report) | 40%
Assignments (40% of the grade)
Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language, or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, not doing one will likely have a significant effect on your grade, as well as on your overall understanding of the material.
Research papers (20% of the grade)
In the second half of the course, we will read 2 papers a week as a class (additional papers will be presented in class, but will not be required reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to participate in a paper presentation. Note that all students are expected to attend all classes, including those where your peers present.
Reviews: Reviews should be succinct and to the point; bulleted lists are welcome when applicable. When you present, you do not need to hand in a review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).
Structure of the reviews:
- Short summary of the paper (3-4 sentences)
- Main contributions (2-3 bullet points)
- Positive and negative points (2-3 bullet points each)
- What did you not understand or was unclear about the paper? (2-3 bullet points)
Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Canvas.
Paper Presentation: Each student will need to present a paper in class (either individually or as a group, depending on enrollment). Students will be assigned to papers based on their preferences. A list of papers will be given out and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, be clear, and be practiced. We are likely to resort to pre-recorded presentations, rather than live ones. You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class. More details will be given as we get closer to paper readings and presentations.
Structure of the paper presentation:
- High-level overview of the problem and motivation
- Clear statement of the problem
- Overview of the technical details of the method, including necessary background
- Relationship of the approach and method to others discussed in class
- Discussion of strengths and weaknesses of the approach
- Discussion of strengths and weaknesses of the evaluation
- Discussion of extensions (published or potential)
Project (40% of the grade)
Details on projects to follow ....
Schedule
Date | Topic | Reading and Resources |
W1: Sept 6 | Grad Classes Canceled |
|
W1: Sept 8 | Introduction to the Course (slides) - What is multi-modal learning? - Challenges in multi-modal learning - Course expectations and grading |
1. (optional) Reading List for Topics in Multimodal Machine Learning by Liang 2. (optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser 3. Multimodal Machine Learning: A Survey and Taxonomy by Baltrusaitis, Ahuja and Morency 4. Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods by Mogadala, Kalimuthu and Klakow |
W1: Sept 9 |
Assignment 0 out (download) Credit: Assignment 0 is adapted from the course assignment given out by Justin Johnson in EECS 498/598 at the University of Michigan (see link). The adaptation was done by our own Suhail Mohammed. |
|
W2: Sept 13 | Introduction to Deep Learning [Part 1] (slides) - Multi-layer Perceptron (MLP) - Stochastic Gradient Descent - Computational graphs - NN as Universal Approximators Assignment 1 out (download)
|
Deep Learning in Nature by LeCun et al. Automatic Differentiation in Machine Learning: a Survey by Baydin et al. |
W2: Sept 15 | Introduction to Deep Learning [Part 2] (slides) - More on activation functions - Regularization (L1, L2, batch norm, dropout) - Terminology and practical advice on optimization - Simple loss functions - Structure, parameters and hyper-parameters |
|
W3: Sept 20 |
Introduction to Deep Learning [Part 3] (slides) - Debugging strategies and techniques Introduction to Computer Vision (slides) - History - Basic operations and problems - Image filtering and features Convolutional Neural Networks [Part 1] (slides) - CNN Basics - CNN layer |
Chapter 9, 9.1-9.3 of Deep Learning Book |
W3: Sept 21 |
Assignment 1 due
|
|
W3: Sept 22 |
Convolutional Neural Networks [Part 2] (slides) - CNN, Pooling Layers - Invariance vs. Equivariance - Regularization, Data Augmentation - Pre-training and transferability - VGG Assignment 2 out (download, data [~10gb])
|
CNNs for Computer Vision by Srinivas et al |
W4: Sept 27 |
Convolutional Neural Networks [Part 3] (slides) - CNNs learning positional information - Model ensembling and soups - Static vs. dynamic computational graphs - Image classification - AlexNet, VGG - GoogLeNet, ResNet |
CNNs for Computer Vision by Srinivas et al |
W4: Sept 29 |
Convolutional Neural Networks [Part 4] (slides) - Vanishing and exploding gradients - ResNet (+ theory) - Segmentation networks - Fully convolutional networks (transpose convolutions) - CNNs for object detection (RCNN) |
Mask R-CNN by He et al |
W5: Oct 3 | Assignment 2 due |
|
W5: Oct 4 |
Convolutional Neural Networks [Part 4] (slides) - CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, Mask RCNN, YOLO) Visualizing CNNs (slides) - Guided BackProp - Gradient ascent - Adversarial examples Introduction to Natural Language Processing (slides) - Tasks in NLP - Why NLP is difficult - Representing words and text Assignment 3 out (download)
|
|
W5: Oct 6 |
Recurrent Neural Networks [Part 1] (slides) - Representing words and text - Intro to language modeling - Recurrent Neural Networks (RNNs) - Encoder-decoder RNNs |
Efficient Estimation of Word Representations in Vector Space by Mikolov et al. Chapter 10 of Deep Learning Book |
W6: Oct 11 |
Recurrent Neural Networks [Part 2] (slides) - Encoder-decoder RNNs - Translation models - Long Short Term Memory Networks (LSTMs) - Gated Recurrent Units (GRUs) - Attention models |
|
W6: Oct 13 |
Recurrent Neural Networks [Part 3] (slides) - Attention models - Forms of attention - Transformer - Applications: Language Translation, BERT, Image Captioning |
|
W7: Oct 17 |
Recurrent Neural Networks Applications [Part 2] (slides) - Masked Language Modeling (BERT) - Sequential Language Modeling (GPT3) - Image Captioning - Visual Question Answering, Visual Dialogs |
|
W7: Oct 20 |
Recurrent Neural Networks Applications [Part 3] (slides) - Activity Recognition - Vision Transformers, SWIN Transformers - DETR, Language Grounding Unsupervised Representation Learning (slides) - Autoencoders, Denoising Autoencoders - Stacked Autoencoders, Context Encoders - Bottleneck Theory Project Teams Formed Assignment 4 out (download)
|
|
W7: Oct 23 | Assignment 3 due |
|
W8: Oct 25 |
Unsupervised Representation Learning (slides) - Bottleneck Theory Multimodal Learning [part 1] (slides) - Intro to Multimodal Learning - Multimodal Joint Representations - Canonical Correlation Analysis (CCA) |
Unified Visual-Semantic Embeddings by Kiros et al |
W8: Oct 27 |
Multimodal Learning [part 2] (slides) - Joint embedding models - Applications Generative Models [part 1] (slides) - PixelRNN, PixelCNN |
|
W9: Nov 1 | Final Project Pitches (Part 1) |
|
W9: Nov 3 | Final Project Pitches (Part 2) |
|
W10: Nov 8 |
Generative Models [part 2] (slides) - Variational Autoencoders (VAEs) - Vector Quantized Variational Autoencoders (VQ-VAEs) - Applications Assignment 4 due |
|
W10: Nov 10 | No Class | |
W11: Nov 15 |
Generative Models [part 3] (slides) - Vector Quantized Variational Autoencoders (VQ-VAEs) - Generative Adversarial Networks (GANs) - DCGAN, Conditional GAN - Image-to-Image Translation: pix2pix, CycleGAN - Laplacian Pyramid GAN, InfoGAN, Adversarial Autoencoders |
|
W11: Nov 16 |
Paper presentation selection quiz due
|
|
W11: Nov 17 |
Diffusion Models (slides) Guest lecture by Saeid Naderiparizi. Assignment 5 out (download)
|
Reading: Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon or Denoising Diffusion Probabilistic Models by Ho, Jain and Abbeel (Only read ONE. Song has a nice blog explaining this.) |
W12: Nov 22 | (slides) | Reading: Graph Attention Networks by Velickovic et al. |
W12: Nov 24 | (slides) | |
W13: Nov 29 | Deep Reinforcement Learning (slides) - Introduction - Value-based RL, Policy-based RL, Q-Learning, REINFORCE - RL Applications | |
W13: Dec 1 | (slides) | |
W13: Dec 2 |
Paper presentation due
|
|
W14: Dec 6 |
(slides) Assignment 5 due |
Resources
Related Classes
This course was very heavily inspired by courses at other institutions. Most notably:
- Deep Learning for Vision and Language (COMP 646) course at Rice
- Multi-Modal Machine Learning (11-777) course at CMU
- Visual Recognition with Text (CSC2539) course at University of Toronto
- Deep Learning (CS 4803 / 7643) course at Georgia Tech
as well as:
- Convolutional Neural Networks for Visual Recognition (CS231n) at Stanford University
- Deep Learning for Natural Language Processing (CS224d) at Stanford University
- Language and Vision (Comp 790-133) course at UNC Chapel Hill
- Topics in Computer Vision: Deep Learning in Computer Vision (CSC2540) course at University of Toronto
Books
- Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press
Libraries
- PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
- Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
- TensorFlow: popular deep learning library from Google
- Theano: another popular deep learning library
- CNTK: Microsoft's deep learning cognitive toolkit library
- scikit: Machine learning in Python
Datasets
- ImageNet: Large-scale image classification dataset
- VQA: Visual Question Answering dataset
- Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
- LSMDC: Large-Scale Movie Description Dataset and challenge
- Madlibs: Visual fill-in-the-blank dataset
- ReferIt: Dataset of visual referring expressions
- VisDial: Visual dialog dataset
- ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
- VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
- VIST: Visual Storytelling dataset
- CLEVR: Compositional Language and Elementary Visual Reasoning dataset
- COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
- Toronto COCO-QA: Toronto question answering dataset
- Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
- MovieQA: automatic story comprehension dataset from both video and text.
- Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
- imSitu: Situational recognition datasets with annotations of main activities, participating actors, objects, substances, and locations and the roles these participants play in the activity.
- MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.