Topics in Artificial Intelligence (CPSC 532S):
Multimodal Learning with Vision, Language and Sound
Winter Term 1, 2022
Course Information
Multimodal machine learning is a multi-disciplinary research field which addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic, etc.). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are applicable broadly, the focus will be on studying them in the context of joint reasoning about and understanding of images/videos and language (text).
In addition to the fundamentals, we will study the rich body of recent research at the intersection of vision and language, including the problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding, and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.
Content Delivery and Covid Precautions: The lectures will be offered in-person only and no recordings will be made. Unfortunately, for this reason, a hybrid delivery of the material will not be available. We will experiment with hybrid office hours, as we believe this will benefit the students. Students are strongly encouraged and expected (but not required) to wear masks in class. This is largely for the benefit of your fellow students, with whom you will sit in close proximity. The instructor will not wear a mask when lecturing (this improves delivery of the material) but will put on a mask in close-interaction settings or when requested by students. If at any point a student is diagnosed with COVID or has symptoms, they are expected to follow UBC and provincial guidelines and isolate at home. Please inform the instructor of such cases and he will do his best to provide accommodations.
- Instructor: Leonid Sigal (lsigal@cs.ubc.ca)
- TAs: Rayat Hossain (rayat137@cs.ubc.ca), Tanzila Rahman (tanzila.himu@gmail.com)
- Office hours: TBD and by appointment (all communication to go through Piazza)
- Class meets: Tuesday, Thursday 11:00 - 12:30 pm, ICICS 246
- Piazza: piazza.com/ubc.ca/winterterm12022/cpsc532s/home
Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course, please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and Python.
Computational Requirements: Due to the size of the data, most of the assignments in this class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed to develop the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We also encourage you to use Google Colab, which should be sufficient for your assignments but likely not for the project. You may also register for student accounts with Amazon AWS (free $25 credit) or with Microsoft Azure (free $100 credit). Note that while TAs will do their best to help with your specific environment setup, due to the heterogeneity of setups that may result from these choices, it is ultimately up to the student to sort out the details.
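As a quick sanity check of your setup (local machine, Colab, AWS, or Azure), the minimal PyTorch sketch below reports whether a CUDA-capable GPU is visible and roughly how much memory it has; the 4 GB threshold simply mirrors the requirement above and is not an official checker for the assignments.

```python
# Minimal environment check (assumes PyTorch is installed).
import torch

def check_gpu(min_gb: float = 4.0) -> None:
    """Report whether a CUDA-capable GPU is visible and whether it meets the memory requirement."""
    if not torch.cuda.is_available():
        print("No CUDA-capable GPU detected; PyTorch will fall back to the CPU.")
        return
    props = torch.cuda.get_device_properties(0)   # properties of the first visible GPU
    total_gb = props.total_memory / 1024 ** 3     # bytes -> GiB
    print(f"Found {props.name} with {total_gb:.1f} GB of GPU RAM.")
    if total_gb < min_gb:
        print(f"Warning: less than {min_gb:.0f} GB of GPU RAM; some assignments may not fit.")

if __name__ == "__main__":
    check_gpu()
```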
Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Those unregistered who would like to audit are not expected, or required, to do any assignments or readings. Unregistered auditors are welcome to contact me and I will assess the feasibility of letting you attend the lectures (subject to space in the lecture room).
Grading
Assignments (five assignments in total) | 40%
   Assignment #0: Introduction to PyTorch (0% -- ungraded)
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
   Assignment #5: Advanced Neural Architectures (10%)
Research papers | 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Project (proposal, final presentation and web report) | 40%
Assignments (40% of the grade)
Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language, or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, not doing one will likely have a significant effect on your grade, as well as on your overall understanding of the material.
Research papers (20% of the grade)
In the second half of the course, we will read 2 papers a week as a class (additional papers will be presented in class, but will not be required reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to participate in a paper presentation. Note that all students are expected to attend all classes, including those where your peers present.
Reviews: Reviews should be succinct and to the point; bulleted lists are welcome when applicable. When you present, you do not need to hand in a review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).
Structure of the reviews:
- Short summary of the paper (3-4 sentences)
- Main contributions (2-3 bullet points)
- Positive and negative points (2-3 bullet points each)
- What did you not understand or was unclear about the paper? (2-3 bullet points)
Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Canvas.
Paper Presentation: Each student will need to present a paper in class (either individually or as a group, depending on enrollment). Students will be assigned to papers based on their preferences. A list of papers will be given out and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, be clear, and be practiced. We are likely to resort to pre-recorded presentations, rather than live ones. You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class. More details will be given as we get closer to paper readings and presentations.
Structure of the paper presentation:
- High-level overview of the problem and motivation
- Clear statement of the problem
- Overview of the technical details of the method, including necessary background
- Relationship of the approach and method to others discussed in class
- Discussion of strengths and weaknesses of the approach
- Discussion of strengths and weaknesses of the evaluation
- Discussion of extensions (published or potential)
Project (40% of the grade)
Details on projects to follow ....
Schedule
Date | Topic | Reading and Resources |
W1: Sept 6 | Grad Classes Canceled |
|
W1: Sept 8 | Introduction to the Course (slides) - What is multi-modal learning? - Challenges in multi-modal learning - Course expectations and grading |
1. (optional) Reading List for Topics in Multimodal Machine Learning by Liang 2. (optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser 3. Multimodal Machine Learning: A Survey and Taxonomy by Baltrusaitis, Ahuja and Morency 4. Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods by Mogadala, Kalimuthu and Klakow |
W1: Sept 9 |
Assignment 0 out (download) Credit: Assignment 0 is adapted from the course assignment given out by Justin Johnson in EECS 498/598 at the University of Michigan (see link). The adaptation was done by our own Suhail Mohammed. |
|
W2: Sept 13 | Introduction to Deep Learning [Part 1] (slides) - Multi-layer Perceptron (MLP) - Stochastic Gradient Descent - Computational graphs - NN as Universal Approximators Assignment 1 out (download)
|
Deep Learning in Nature by LeCun et al. Automatic Differentiation in Machine Learning: a Survey by Baydin et al. |
W2: Sept 15 | Introduction to Deep Learning [Part 2] (slides) - More on activation functions - Regularization (L1, L2, batch norm, dropout) - Terminology and practical advice on optimization - Simple loss functions - Structure, parameters and hyper-parameters |
|
W3: Sept 20 |
Introduction to Deep Learning [Part 3] (slides) - Debugging strategies and techniques Introduction to Computer Vision (slides) - History - Basic operations and problems - Image filtering and features Convolutional Neural Networks [Part 1] (slides) - CNN Basics - CNN layer |
Chapter 9, 9.1-9.3 of Deep Learning Book |
W3: Sept 21 |
Assignment 1 due
|
|
W3: Sept 22 |
Convolutional Neural Networks [Part 2] (slides) - CNN, Pooling Layers - Invariance vs. Equivariance - Regularization, Data Augmentation - Pre-training and transferability - VGG Assignment 2 out (download, data [~10gb])
|
CNNs for Computer Vision by Srinivas et al |
W4: Sept 27 |
Convolutional Neural Networks [Part 3] (slides) - CNNs learning positional information - Model ensembling and soups - Static vs. dynamic computational graphs - Image classification - AlexNet, VGG - GoogLeNet, ResNet |
CNNs for Computer Vision by Srinivas et al |
W4: Sept 29 |
Convolutional Neural Networks [Part 4] (slides) - Vanishing and exploding gradients - ResNet (+ theory) - Segmentation networks - Fully convolutional networks (transpose convolutions) - CNNs for object detection (RCNN) |
Mask R-CNN by He et al |
W5: Oct 3 | Assignment 2 due |
|
W5: Oct 4 |
Convolutional Neural Networks [Part 4] (slides) - CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, Mask RCNN, YOLO) Visualizing CNNs (slides) - Guided BackProp - Gradient ascent - Adversarial examples Introduction to Natural Language Processing (slides) - Tasks in NLP - Why NLP is difficult - Representing words and text Assignment 3 out (download)
|
|
W5: Oct 6 |
Recurrent Neural Networks [Part 1] (slides) - Representing words and text - Intro to language modeling - Recurrent Neural Networks (RNNs) - Encoder-decoder RNNs |
Efficient Estimation of Word Representations in Vector Space by Mikolov et al. Chapter 10 of Deep Learning Book |
W6: Oct 11 |
Recurrent Neural Networks [Part 2] (slides) - Encoder-decoder RNNs - Translation models - Long Short Term Memory Networks (LSTMs) - Gated Recurrent Units (GRUs) - Attention models |
|
W6: Oct 13 |
Recurrent Neural Networks [Part 3] (slides) - Attention models - Forms of attention - Transformer - Applications: Language Translation, BERT, Image Captioning |
|
W7: Oct 17 |
Recurrent Neural Networks Applications [Part 2] (slides) - Masked Language Modeling (BERT) - Sequential Language Modeling (GPT3) - Image Captioning - Visual Question Answering, Visual Dialogs |
|
W7: Oct 20 |
Recurrent Neural Networks Applications [Part 3] (slides) - Activity Recognition - Vision Transformers, SWIN Transformers - DETR, Language Grounding Unsupervised Representation Learning (slides) - Autoencoders, Denoising Autoencoders - Stacked Autoencoders, Context Encoders - Bottleneck Theory Project Teams Formed Assignment 4 out (download)
|
|
W7: Oct 23 | Assignment 3 due |
|
W8: Oct 25 |
Unsupervised Representation Learning (slides) - Bottleneck Theory Multimodal Learning [part 1] (slides) - Intro to Multimodal Learning - Multimodal Joint Representations - Canonical Correlation Analysis (CCA) |
Unified Visual-Semantic Embeddings by Kiros et al |
W8: Oct 27 |
Multimodal Learning [part 2] (slides) - Joint embedding models - Applications Generative Models [part 1] (slides) - PixelRNN, PixelCNN |
|
W9: Nov 1 | Final Project Pitches (Part 1) |
|
W9: Nov 3 | Final Project Pitches (Part 2) |
|
W10: Nov 8 |
Generative Models [part 2] (slides) - Variational Autoencoders (VAEs) - Vector Quantized Variational Autoencoders (VQ-VAEs) - Applications Assignment 4 due |
|
W10: Nov 10 | No Class | |
W11: Nov 15 |
Generative Models [part 3] (slides) - Vector Quantized Variational Autoencoders (VQ-VAEs) - Generative Adversarial Networks (GANs) - DCGAN, Conditional GAN - Image-to-Image Translation: pix2pix, CycleGAN - Laplacian Pyramid GAN, InfoGAN, Adversarial Autoencoders |
|
W11: Nov 16 |
Paper presentation selection quiz due
|
|
W11: Nov 17 |
Diffusion Models (slides) Guest lecture by Saeid Naderiparizi. Assignment 5 out (download)
|
Reading: Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon or Denoising Diffusion Probabilistic Models by Ho, Jain and Abbeel (Only read ONE. Song has a nice blog explaining this.) |
W12: Nov 22 | (slides) | Reading: Graph Attention Networks by Velickovic et al. |
W12: Nov 24 | (slides) | |
W13: Nov 29 | Deep Reinforcement Learning (slides) - Introduction - Value-based RL, Policy-based RL, Q-Learning, REINFORCE - RL Applications | |
W13: Dec 1 | (slides) | |
W13: Dec 2 |
Paper presentation due
|
|
W14: Dec 6 |
(slides) Assignment 5 due |
Resources
Related Classes
This course was very heavily inspired by courses at other institutions. Most notably:
- Deep Learning for Vision and Language (COMP 646) course at Rice
- Multi-Modal Machine Learning (11-777) course at CMU
- Visual Recognition with Text (CSC2539) course at University of Toronto
- Deep Learning (CS 4803 / 7643) course at Georgia Tech
as well as:
- Convolutional Neural Networks for Visual Recognition (CS231n) at Stanford University
- Deep Learning for Natural Language Processing (CS224d) at Stanford University
- Language and Vision (Comp 790-133) course at UNC Chapel Hill
- Topics in Computer Vision: Deep Learning in Computer Vision (CSC2540) course at University of Toronto
Books
- Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press
Libraries
- PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
- Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
- TensorFlow: popular deep learning library from Google
- Theano: another popular deep learning library
- CNTK: Microsoft's deep learning cognitive toolkit library
- scikit: Machine learning in Python
Datasets
- ImageNet: Large-scale image classification dataset
- VQA: Visual Question Answering dataset
- Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
- LSMDC: Large-Scale Movie Description Dataset and challenge
- Madlibs: Visual fill-in-the-blank dataset
- ReferIt: Dataset of visual referring expressions
- VisDial: Visual dialog dataset
- ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
- VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
- VIST: Visual Storytelling dataset
- CLEVR: Compositional Language and Elementary Visual Reasoning dataset
- COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
- Toronto COCO-QA: Toronto question answering dataset
- Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
- MovieQA: automatic story comprehension dataset from both video and text.
- Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
- imSitu: Situational recognition datasets with annotations of main activities, participating actors, objects, substances, and locations and the roles these participants play in the activity.
- MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.