Elements of Deep Learning

Anupam Sobti

Introduction

Deep Learning (3-1-0)

  • Practicum: 25% assignments and a semester project
    • Project needs to be a novel contribution to deep learning
      • existing method applied to a novel setting
      • novel method for existing problem
      • TBD on a public git repo in groups of up to 4. Grades will be distributed according to commits.
  • Math heavy (at times)
  • Code heavy (always)
    • Pytorch

How we run the labs

  • Code examples of things we do in class
  • Try to understand all code
  • Read documentation
  • Feel free to generate GPT code but be ready to be quizzed on everything generated

Course Evaluation

  • Quizzes: Top 3 of 4 (30%) [held every month]
  • Assignments: 2 (20%)
  • Project: 40%
    • Mid-sem: 10%
    • Weekly Reviews (top 4 scores): 5%
    • End-Sem: 25%
  • Attendance: 10%

Attendance marks matrix:

  • More than 80% = 10 marks (full)
  • 70 to 80 % = 8 marks
  • 60 to 70 % = 6 marks
  • Less than 60 % = 0 marks

Teaching Assistant

Project Timeline

  • Mid-sem report
    • Problem statement proposal
      • Proposed changes to the state of the art
    • Related Work Study
  • End-sem report
    • Planned experiments
    • Executed experiments
    • Results and Analysis
    • Paper Submission

Star Projects from last year

  • WarpNet
  • Enhancing Yield Prediction
  • Airfoil Design
  • Dynamic Hand Gesture Recognition
  • Classifying seizures
  • Poverty Prediction from Satellite Images
  • Evaluating UI Screens

Course References

Surviving in deep learning

  • Learn to read papers
    • If you can’t read papers, your DL will be outdated in 2 weeks’ time
  • Learn to read/write math required for DL
  • Be ready to read and write code
  • Learn to structure experiments (wandb)

Compute

  • Laptop
  • Coming soon..

Teaching

  • 28 lectures
  • 2 paper reading sessions (student presentations)
    • [groups of up to 4]
  • Tentative lecture plan:

Lectures-1

Lectures-2

Why Deep Learning?

  • Data and compute availability in large quantities
  • Hard to define features. Need to be automatically learnt.
  • DL allows a hierarchical feature composition to handle complexity
    • Real life data often has the same property. Think images, text, sound, etc.


Hierarchical Feature Composition

Source: link

The process

  • The data that we can learn from.
  • A model of how to transform the data.
  • An objective function that quantifies how well (or badly) the model is doing.
  • An algorithm to adjust the model’s parameters to optimize the objective function.

Things to wonder!

  • How do we train models with more parameters than the data samples?
    • Alexnet + Imagenet -> 60 million parameters with 1.2 million images
    • How do they still generalize?
  • Theoretically, two layers should be sufficient (universal approximation). Why aren’t they enough in practice?

Kinds of (Supervised) Machine Learning Problems

  • Classification
    • Binary, Multiclass, Hierarchical
  • Tagging
    • Multilabel classification, e.g., topics in a forum or categories of clothes
  • Regression
  • Search
    • How to rank pages as per the relevance for the query (original PageRank from Google didn’t do this!)
  • Sequence Learning
  • Recommender systems
    • Will this user like this item? (Classification)
    • Rating prediction for the user (Regression)

Multilabel Classification

d2l.ai

Unsupervised Learning

  • Clustering
  • Subspace Estimation (PCA)
  • Learn the distribution where data comes from (and generate it!)
  • Self supervised learning

Deep Reinforcement Learning

  • “Learn” policies
  • “Learn” Q values for state-action pairs

Why now?

Decade   Dataset                                Memory   Floating point ops/s
1970     100 (Iris)                             1 KB     100 KF (Intel 8080)
1980     1 K (house prices in Boston)           100 KB   1 MF (Intel 80186)
1990     10 K (optical character recognition)   10 MB    10 MF (Intel 80486)
2000     10 M (web pages)                       100 MB   1 GF (Intel Core)
2010     10 G (advertising)                     1 GB     1 TF (NVIDIA C2050)
2020     1 T (social network)                   100 GB   1 PF (NVIDIA DGX-2)
  • Growth in datasets and compute
  • Increased complexity and memory without an explicit increase in the number of parameters (attention)
  • Scaling behavior of transformers
  • Multi-stage designs with memory networks/neural interpreters
  • Generative modeling with GANs, Diffusion Models
  • Distributed computation (7-minute training for ResNet models)
  • DL Frameworks (TF, Pytorch)

How does it change your life?

  • Intelligent assistants
  • ChatGPT
  • The AI Strategist - Go
  • AI for Science - Drug Discovery
  • Self driving (real time image perception capabilities)

Let’s start with

Reading papers

Three pass approach

How to read a paper

  • The first pass (10-15 mins)
    • Title, Abstract, Introduction
    • Section and Sub-section headings
    • Conclusions

Answer the five Cs

  • Category
  • Context
  • Correctness
  • Contributions
  • Clarity

The second pass (1 hour)

  • Figures, Diagrams, Illustrations, Graphs
  • References! (this is how you find the really good papers)

After this pass, you

  • understand the paper content
  • are able to summarize the paper appropriately into your own paper

The third pass (virtual re-implementation)

  • 4-5 hours for beginners, about 1 hour for experienced readers
  • Given the input, how would you approach it?
  • Identify and challenge every assumption in the paper
  • Attention to detail

Finding papers

  • Google Scholar
    • Where was it published? (CVPR, ICCV, ACCV, WACV, ICML, ICLR, AAAI, Neurips)
    • Who cited this?
    • Which references did they cite?
  • Semantic Scholar
  • Connected Papers
  • Some new custom GPTs

Math basics

Math basics: Dot products

  • Dot Products for vectors \(x\) and \(y\) (vector-vector products)
  • \(x^Ty\) or \(\lvert x \rvert \lvert y \rvert \cos\theta\)
import numpy as np

# Let's take two vectors in 3D space
x = np.array([2, 3, 4])
y = np.array([1, 5, 7])

# Compute the dot product algebraically
dot_product_algebraic = np.dot(x, y)

# Compute the magnitudes of x and y
magnitude_x = np.linalg.norm(x)
magnitude_y = np.linalg.norm(y)

# Compute the angle between x and y using the dot product and magnitudes
# Using the dot product formula: x . y = |x| * |y| * cos(theta)
# We solve for cos(theta) as: cos(theta) = (x . y) / (|x| * |y|)
cos_theta = dot_product_algebraic / (magnitude_x * magnitude_y)

# Now we can calculate the dot product geometrically
dot_product_geometric = magnitude_x * magnitude_y * cos_theta

(dot_product_algebraic, dot_product_geometric, np.isclose(dot_product_algebraic, dot_product_geometric))

Math basics - rotations & projections

  • \(y = Ax\) (Vector matrix products)
    • Rotation if A is a square (orthogonal) matrix
    • Projection if an n-dimensional vector is converted to an m-dimensional vector
import torch

# Example values (from the d2l.ai tensor examples) that reproduce the output shown below
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)
A.shape, x.shape, torch.mv(A, x), A@x
(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))

Math basics - composing projections

  • \(C = AB\)
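
A quick PyTorch check of this composition (a minimal sketch; the matrices are arbitrary examples): applying B and then A to a vector gives the same result as applying the single composed matrix C = AB.

import torch

A = torch.randn(2, 3)   # maps R^3 -> R^2
B = torch.randn(3, 4)   # maps R^4 -> R^3
C = A @ B               # composed map R^4 -> R^2

x = torch.randn(4)
print(torch.allclose(A @ (B @ x), C @ x, atol=1e-6))  # True: same transformation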

Rate of change of scalars: Calculus

  • Differentiation
    • Analytical
      • Sum and Product Rules
    • Numerical
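
A minimal sketch contrasting the two approaches (the function and step size are chosen arbitrarily): the analytical derivative from the product rule versus a central-difference numerical estimate.

import numpy as np

f = lambda x: x**2 * np.sin(x)                       # example function
f_prime = lambda x: 2*x*np.sin(x) + x**2*np.cos(x)   # product rule

x, h = 1.5, 1e-5
numerical = (f(x + h) - f(x - h)) / (2 * h)          # central difference
print(numerical, f_prime(x), np.isclose(numerical, f_prime(x)))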

Multivariate calculus

Please derive

  • \(\nabla_x x^T A = A\)
  • \(\nabla_x Ax = A^T\)
  • \(\nabla_x x^T A x = (A + A^T) x\)
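
These identities can be sanity-checked with autograd; a minimal sketch for the quadratic form (A and x are random placeholders):

import torch

D = 4
A = torch.randn(D, D)
x = torch.randn(D, requires_grad=True)

# f(x) = x^T A x is a scalar, so backward() gives its gradient w.r.t. x
(x @ A @ x).backward()
print(torch.allclose(x.grad, (A + A.T) @ x))  # True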

Automatic Differentiation

  • Reference Link
    • Differential for scalars
    • Differential for vectors
  • Must watch: Build autograd from scratch with Andrej Karpathy

Probability and Statistics

  • Probability: Underlying parameter of the Data
  • Statistics: Looks at data and finds “estimators” of the probability
    • The estimators then converge to underlying probability
      • The converged probability generates data (that one can expect to be generated in the future)

Data -> Sample -> Estimators -> Probability -> Data

Covered in class

  • Loss functions for
    • linear regression - least squares, MLE
    • linear classification - cross entropy
  • Analytical solution for regression
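
For the analytical (closed-form) regression solution, a minimal sketch using the normal equations \(w = (X^\top X)^{-1} X^\top y\) on synthetic data (the sizes and true weights here are arbitrary):

import torch

N, D = 100, 3
X = torch.randn(N, D)
w_true = torch.tensor([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * torch.randn(N)

# Least-squares solution: solve (X^T X) w = X^T y
w_hat = torch.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true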

Generality of log likelihood

Convolutional Neural Networks

Use cases for CNNs

  • What is a convolution?
  • Common terms: stride, padding, dilation, kernel size
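
A small PyTorch sketch of how these terms affect the output shape (channel counts and input size are arbitrary); the output spatial size follows \(\lfloor (H + 2p - d(k-1) - 1)/s \rfloor + 1\):

import torch
from torch import nn

# kernel_size=3, stride=2, padding=1, dilation=1 on a 32x32 input
conv = nn.Conv2d(in_channels=3, out_channels=8,
                 kernel_size=3, stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)   # torch.Size([1, 8, 16, 16])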

Why CNN?

  • Capture spatial locality and maintain locality in feature space
  • Introducing Spatial Invariance (pooling) in feature space
    • similar response to a feature present anywhere in the image

Why CNN?

  • Increasing receptive field

Source: UDLBook

Much faster convergence due to increased model (inductive) bias

Convolution Effectiveness

Dimensionality

  • 1D CNN - audio, text, finance, etc.
  • 2D CNN - images

(Down and Up) Sampling

Down and up-sampling (UDLbook)

Transposed convolutions (upsampling)

Other operations

  • Change number of channels
    • point-wise convolutions (1x1)
  • Learning channel-wise filters
    • depth-wise convolution
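
A minimal sketch of both operations in PyTorch (channel counts are arbitrary): a 1x1 point-wise convolution changes only the channel dimension, while setting groups equal to the number of input channels gives a depth-wise convolution.

import torch
from torch import nn

x = torch.randn(1, 32, 28, 28)

# Point-wise (1x1) convolution: only changes the number of channels
pointwise = nn.Conv2d(32, 64, kernel_size=1)
print(pointwise(x).shape)   # torch.Size([1, 64, 28, 28])

# Depth-wise convolution: one spatial filter per channel (groups = in_channels)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
print(depthwise(x).shape)   # torch.Size([1, 32, 28, 28])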

The DL revival?

  • What started the DL revival?

DL Meme

The Imagenet dataset

Imagenet dataset considerations

Alexnet

AlexNet

What progress looks like

ImageNet performance

Perception of images

  • Classification
  • Object Detection
  • Semantic Segmentation
  • Pose
  • Dense Pose

Detectron 2 Tasks

Perception of images

  • Monocular depth prediction
  • Stereo depth prediction
  • Anomaly detection

Depth Prediction

Depth Anything Model

Creation of images

  • Conditioned on text prompt/other images
  • High resolution from low resolution - forensics, satellite, movies, etc.

Midjourney creations

Modifying images

Style Transfer

Modifying images - contd (LEDITS++)

Potter

Convolution Evolution

Improving Convolutions

  • Proving convolutions work: AlexNet
  • Improving convolutions: the breakthrough in deeper networks
    • AlexNet (8 layers) to VGG (16-19 layers) improved performance, but going deeper did not help further
    • Why?
      • vanishing/exploding gradients (handled by smart initialization)
      • Common step size in optimizers might move optimization to unrelated gradient points
      • Shattered Gradients

Shattered Gradients

  • In deeper networks, gradients become increasingly noise-like with respect to the input (see the Shattered Gradients figure)

Saviors: Residual connections + batch normalization

  • Intuitively, allows for different complexities in the learned concepts

Residual

Layers

Unraveling

Residual connections

Gradient is easier Order

Expressive functions

Expressive ability

Loss surface with skip connections

On CIFAR-10

Numeric control

  • Control activations (forward pass) & gradients (backward pass)
    • (keep same expected variance)
    • He initialization: \(\beta\) initialized to 0, \(\omega\) initialized from a normal distribution \(\mathcal{N}\left(0, \frac{2}{D_h}\right)\), where \(D_h\) is the number of hidden units in the previous layer
    • Residual connections: activations might still explode since variance grows through the layers; batch normalization re-establishes a numerical range after every layer
      • Batch Normalization
        • Step1: Normalization
        • Step2: Scale & Shift
        • \(m_h\) and \(s_h\) at test time?: calculated from the entire training dataset
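
A minimal sketch of the two batch-normalization steps on a batch of 1D activations (shapes arbitrary); at test time the batch statistics are replaced by statistics computed over the training set.

import torch

x = torch.randn(128, 64)    # (batch, features)
gamma = torch.ones(64)      # learnable scale
beta = torch.zeros(64)      # learnable shift

# Step 1: normalize each feature with the batch mean and variance
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

# Step 2: scale and shift
y = gamma * x_hat + beta

# Each feature of x_hat now has roughly zero mean and unit variance
print(x_hat.mean(dim=0).abs().max().item(), x_hat.var(dim=0).mean().item())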

Bottleneck residuals

  • Limiting the number of parameters

Residual blocks

Image Classification with residual blocks

Resnet-200 Alternate representation

DenseNet

  • either concatenate or do a 1x1 convolution to add a weighted combination
  • comparable performance to ResNet (see the DenseNet architecture figure)

U-net

  • Brings spatial dexterity (precise localization) together with semantics (see the architecture figure)

Semantic Segmentation examples

Semantic Segmentation

Pose Prediction Example (Hourglass)

Hourglass

Different types of normalization

  • Note: Input is 1D

Normalization Types

[End of UDL References]

Handling backprop in shared weights

  • For calculating the update for the kernel weights,
    • calculate gradient on all points of the image/activation/feature map
    • sum gradient
      • rather than updating a weight based solely on its performance at a single position, the update considers its effect across all the positions it was applied to
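
A tiny autograd check of this (a 1x1 kernel, so the shared weight is a single scalar): with a sum loss, the gradient on the shared weight is the sum of its per-position gradients, here simply the sum of the inputs.

import torch
from torch import nn

conv = nn.Conv1d(1, 1, kernel_size=1, bias=False)   # one shared weight
x = torch.randn(1, 1, 5)

conv(x).sum().backward()
# The kernel's gradient accumulates contributions from every position it touched
print(conv.weight.grad.item(), x.sum().item())      # the two values match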

Other influential advances: NiN (Network in Network)

  • purpose: reduce computation by removing the large fully connected layers
  • convolution sizes similar to AlexNet, with 3x3 max-pooling between blocks
  • the final block uses 1x1 convolutions to make the number of output channels equal to the number of labels, followed by global average pooling instead of a fully connected layer
  • slower training

NiN Training times

ref: D2L.ai

GoogLeNet (2014 ImageNet Winner)

  • First clear distinction between stem (data ingest), body (processing), head (prediction)

Inception Cell Deeper

Architecture

ResNext

ResNext

  • A grouped convolution is roughly \(g\times\) cheaper (in parameters and compute) than a dense convolution (see the sketch below)
  • The 1x1 convolutions allow information sharing between groups
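
A parameter-count comparison (a sketch with arbitrary channel counts) illustrating the \(g\times\) saving:

import torch
from torch import nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(64, 64, kernel_size=3, bias=False)
grouped = nn.Conv2d(64, 64, kernel_size=3, groups=8, bias=False)   # g = 8

print(n_params(dense), n_params(grouped))   # 36864 vs 4608: 8x fewer weights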

Learning embeddings from images

Image Augmentations

  • Flips
  • Crops/Resize/Rescale
  • Color Augmentations
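
A typical training-time pipeline built from these augmentations, as a torchvision sketch (the parameter values are illustrative, and `img` would be a PIL image from your dataset):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),   # crop/resize/rescale
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# augmented = augment(img)   # img is a hypothetical PIL image; output is a (3, 224, 224) tensor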

Augmentations

Single Shot Multibox Detector (2016)

SSD
  • Multibox detector -> Assigns ground truth boxes to anchor boxes

Loss Function
  • Loss is a weighted sum of confidence and localization loss. N = number of matched boxes.

SSD - Loss functions

Loss Functions
  • How to handle class imbalance (many more negative boxes than positive ones)?
    • Hard negative mining: keep only the most confident negative boxes, up to a 3:1 negative-to-positive ratio

SSD Vs YOLO

YOLO Comparison

Retina Net

RetinaNet

CornerNet

Corner detection; corner grouping with embeddings (figures)

Proposal based methods

ROI Pooling ROI Generation and Pooling

Deformable convolutions

Deformable Convolutions

CentripetalNet

CentripetalNet

Semantic Segmentation: Capturing multi-scale context (from deeplabv3)

Capturing multi-scale context

Side: Spatial pooling pyramids

Blending with laplacian pyramids

Learning similar embeddings

  • Face Recognition

Triplet Loss

Loss Function
  • Object re-identification also uses a triplet-loss framework on object-level representations
  • Note that re-identification and detection can have conflicting objectives
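
A minimal triplet-loss sketch in PyTorch (embedding sizes and margin are arbitrary): the anchor is pulled towards the positive and pushed away from the negative.

import torch
from torch import nn

triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(16, 128, requires_grad=True)   # embeddings of the query identity
positive = torch.randn(16, 128, requires_grad=True)   # same identity
negative = torch.randn(16, 128, requires_grad=True)   # different identity

loss = triplet(anchor, positive, negative)
loss.backward()
print(loss.item())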

ReID Task

FairMOT

FairMOT

How would you implement?

  • signature verification
  • face verification
  • face recognition

Cosine Similarity

  • Take a q-dimensional output -> L2 Normalization -> Use cosine similarity
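
A minimal sketch of this recipe (the embedding dimension is arbitrary): after L2 normalization, the cosine similarity is just a dot product.

import torch
import torch.nn.functional as F

a = torch.randn(1, 128)    # q-dimensional embedding of sample 1
b = torch.randn(1, 128)    # q-dimensional embedding of sample 2

a_n, b_n = F.normalize(a, dim=1), F.normalize(b, dim=1)
cos = (a_n * b_n).sum(dim=1)
print(cos, F.cosine_similarity(a, b, dim=1))   # the two agree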

Cosine Similarity Concept Formula

Fine grained/Hierarchical classification

Fine grained classification dataset Cars

  • How to highlight distinctions?
    • Part based embeddings

Part based classification Joint localization and classification

Link1 Link2

Visualizing embeddings

What did we learn?

Video Link

Adaptive Style Transfer

  • Adaptive instance normalization allows the activation distribution to match a style activation distribution

Adaptive Instance Normalization

Adaptive Style Transfer

Example outputs (Adaptive Style Transfer)

output

Generating Images

  • Covered in visualization
    • Neural style transfer
    • Fast style transfer
    • Adaptive Style Transfer
  • Generating from scratch
    • GANs
    • DCGANs
    • Stable Diffusion

GANs (Generative Adversarial Networks)

GANs Complete Loss Function

ref: you think you know GANs?

Common issues in GANs

  1. Mode collapse (see the Mode Collapse figure)
  2. Vanishing gradients
    • When the discriminator performs much better than the generator, the generator receives vanishing gradients and no challenging feedback to learn from
  3. Convergence issues (see the Convergence figure)

ref

DCGAN

  • Deep Convolutional Generative Adversarial Networks
    • both generator and discriminator are built from convolutions/transposed convolutions
    • Wasserstein loss (a critic instead of a discriminator) is often used to handle the common issues above

Conditional GANs

Conditioning on label/other modalities of data

Style GAN

StyleGAN Architecture
  • Style Mixing (Mixing regularization)
  • Stochastic variation by introducing noise
  • Separation of global effects from stochasticity
    • focuses on generating effects indistinguishable by discriminator

Ref

Example outputs

Example outputs: StyleGAN

Stable Diffusion

  • to be continued after attention

Detecting anomalies in images

Learning distributions from where data is generated

Autoencoders

  • Simple autoencoder

AnoGAN

Anomaly detection through GANs Reference

GANomaly

GANomaly Reference

Masked autoencoders

MAE link

Other methods for anomaly detection

  • Variational autoencoders
  • Self supervision based anomalies
  • Patch based anomalies

Speaking of anomalies

That you are here—that life exists and identity, That the powerful play goes on, and you may contribute a verse.

  • Walt Whitman

Understanding sequences

reference

Structure of data

  • Tabular data (\(x \in \mathbb{R}^d\))
    • can be stored in a table. Fixed length input/output
  • Image data
    • Still fixed size but typically single input -> single output
    • Could be variable size (purely convolutional networks)
  • How do we handle
    • sound, video, text, etc.?
    • not fixed length inputs but sequences of fixed length inputs

Common Applications

Discriminative

  • Sentiment Analysis
  • Machine Translation
  • Video Understanding

Generative

  • Generating music
  • Writing answers to questions
  • Creating videos!

Some videos

Sora

Recurrent Neural Networks

  • Ordered list of features \(x_1, \ldots, x_T\) where \(x_t\) is the \(t^{th}\) time step in the sequence
  • Sequence independence, not feature independence
  • Aligned (e.g., POS tagging) and unaligned predictions (translation, sentiment)

RNNs POS, Sentiment

Auto-regressive models

  • Quantity of interest: \(P(x_t | x_1, x_2 .. x_{t-1})\)
    • Notice the varying sequence length after every time step -> do we need a new model after every timestep?

Stock Value over years

Handling varying input length?

  • Take \(\tau\) past values only (\(\tau^{th}\) order Markov assumption)
  • Keep a summary (\(h_t\)) of past values
    • latent autoregressive models: \(h_t = g(h_{t-1}, x_{t-1})\)
    • stationarity assumption with respect to relationship with past values (relationship between the \(t-1\) step and current step is learned and thus implicitly assumed stationary)

Sequence Models

  • Modeling the joint probability of all features in a sequence (sequence model)
    • If the sequence is a sentence in a language, it is called a language model.
    • \(P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^T P(x_t \mid x_{t-1}, \ldots, x_1)\)
      • Automatically takes an autoregressive shape when used in this form

Example - Markov model: input

  • Markov Assumption
    • \(P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^T P(x_t \mid x_{t-1})\)

Input data

Model: k-step ahead prediction

1-step ahead

Linear Regression Model Output

k-step ahead

k-step comparison k-step

Why build language models?

  • Generate text based on human input
  • Rule out unlikely sentences (in e.g., Speech to Text)
  • How?
    • \(\begin{split}\begin{aligned}&P(\textrm{deep}, \textrm{learning}, \textrm{is}, \textrm{fun}) \\ =&P(\textrm{deep}) P(\textrm{learning} \mid \textrm{deep}) P(\textrm{is} \mid \textrm{deep}, \textrm{learning}) P(\textrm{fun} \mid \textrm{deep}, \textrm{learning}, \textrm{is}).\end{aligned}\end{split}\)
    • How to calculate these probabilities? The frequentist approach.

Modeling text

  • Read text
  • Tokenize (by enumerating the whole vocabulary of words or all characters; a design decision)
import re
# TimeMachine, data, and raw_text come from the d2l.ai text-preprocessing pipeline
@d2l.add_to_class(TimeMachine)  #@save
def _preprocess(self, text):
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']

Token Frequency

Zipf’s Law: \(n_i \propto \frac{1}{i^\alpha}\)

What do the top ten words tell you?

  • Word frequencies
[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]
  • Bigrams
[('of--the', 309),
 ('in--the', 169),
 ('i--had', 130),
 ('i--was', 112),
 ('and--the', 109),
 ('the--time', 102),
 ('it--was', 99),
 ('to--the', 85),
 ('as--i', 78),
 ('of--a', 73)]
  • Trigrams
[('the--time--traveller', 59),
('the--time--machine', 30),
('the--medical--man', 24),
('it--seemed--to', 16),
('it--was--a', 15),
('here--and--there', 15),
('seemed--to--me', 14),
('i--did--not', 14),
('i--saw--the', 13),
('i--began--to', 13)]

Zipf’s Law
  • Indicates structure

Modeling probabilities

  • Model probabilities over some large corpus, e.g., Wikipedia, Project Gutenberg, etc.

\[\begin{split}\begin{aligned} \text{Unigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4),\\ \text{Bigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3),\\ \text{Trigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3). \end{aligned}\end{split}\]

Laplace Smoothing

  • Problem: words and combinations unseen in training still occur frequently at test time (zero counts)
  • \(n\): number of words, \(m\): number of unique words; \(\epsilon\): smoothing hyperparameter (\(\epsilon \in [0, \infty)\)) \[ \begin{split}\begin{aligned} \hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\ \hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\ \hat{P}(x'' \mid x,x') & = \frac{n(x, x',x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}. \end{aligned}\end{split} \]
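
A toy sketch of the unigram case of this smoothing (the corpus and \(\epsilon\) are made up): unseen words still receive non-zero probability.

from collections import Counter

tokens = "the time traveller for so it will be convenient to speak of him".split()
counts = Counter(tokens)
n, m, eps = len(tokens), len(counts), 1.0

def p_hat(word):
    # P_hat(x) = (n(x) + eps/m) / (n + eps)
    return (counts[word] + eps / m) / (n + eps)

print(p_hat("the"), p_hat("unseen_word"))   # the unseen word still gets a small probability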

Problems with frequentist approach

  • “Meaning” of words is ignored
  • All counts need to be stored
  • n-grams often occur rarely, especially for larger n values

Training the network

  • Fixed length neural network based language model

I/O

  • skip a random number \(d\) of tokens at the start to get random subsequences
  • m partitioned subsequences: \(\mathbf x_d, \mathbf x_{d+n}, \ldots, \mathbf x_{d+n(m-1)}\) where \(m = \lfloor (T-d)/n \rfloor\)

Measuring model quality - perplexity

“It is raining outside”

“It is raining banana tree”

“It is raining piouw;kcj pwepoiut”

  • Likelihood by itself is not a good measure
    • Shorter sequences are always more likely
  • Surprise (cross entropy) modeled over entire sequence
    • \(CE = \frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)\)
    • Perplexity = \(e^{CE} \in [1, \infty)\)
    • Baseline: uniform distribution across all tokens
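
A small sketch relating the two quantities (the per-token probabilities are made up): perplexity is the exponentiated average negative log-probability, and a uniform model over V tokens has perplexity V.

import torch

p = torch.tensor([0.2, 0.5, 0.1, 0.4])   # model's probability for each observed token

cross_entropy = -torch.log(p).mean()
perplexity = torch.exp(cross_entropy)
print(cross_entropy.item(), perplexity.item())

# Baseline: a uniform model over a vocabulary of size V has perplexity V
V = 28
print(torch.exp(-torch.log(torch.tensor(1.0 / V))).item())   # 28.0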

The breakthrough

  • What is a good \(\tau\)?
    • How do we eliminate \(\tau\) altogether?
    • Could we capture historical values in a compressed fashion?
    • Could we maintain state from longer past sequences?
    • \(P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1})\)
    • Continual update to hidden state: \(h_t = f(x_{t}, h_{t-1})\)

Recurrent Neural Networks

  • Refer notes

Cell Architecture

Character level RNN-based language model

Gradient Clipping

  • Lipschitz Continuity with constant L \(|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|\)

  • Gradient Update \(|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|\)

  • How to control gradient?
    • Gradient Clipping \(\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.\)

Refer jupyter notebook for implementation
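
A minimal sketch of the clipping step in PyTorch (the model and the threshold \(\theta = 1\) are placeholders); `clip_grad_norm_` rescales the global gradient norm in place before the optimizer step.

import torch
from torch import nn

model = nn.GRU(input_size=8, hidden_size=16)
out, _ = model(torch.randn(5, 2, 8))     # (steps, batch, features)
out.sum().backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(total_norm)   # at most ~1.0 after clipping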

BPTT

  • Refer notes

Truncation

  • Comparing strategies for computing gradients in RNNs. From top to bottom:
    • randomized truncation
    • regular truncation
    • full computation

Alternate cells

  • LSTMs
  • GRUs

LSTM

LSTM

GRU

GRU

Deep RNNs

Deep Recurrent Neural Network

Other tasks - fill in the blanks

  • I am ___.
  • I am ___ hungry.
  • I am ___ hungry, and I can eat half a pig.

Architecture

Bi-directional RNN computation

\[ \begin{split}\begin{aligned} \overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{\textrm{xh}}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{\textrm{hh}}^{(f)} + \mathbf{b}_\textrm{h}^{(f)}),\\ \overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{\textrm{xh}}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{\textrm{hh}}^{(b)} + \mathbf{b}_\textrm{h}^{(b)}), \end{aligned}\end{split} \]

  • concatenate the forward and backward hidden states \(\overrightarrow{\mathbf{H}_t}\) and \(\overleftarrow{\mathbf{H}_{t}}\) to obtain the hidden state.
  • \(\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{\textrm{hq}} + \mathbf{b}_\textrm{q}\)

Class FAQ

  • What are we covering in the course?
    • How architectures evolved
    • What are the foundational concepts
    • 6 broad topics - Math (losses, likelihoods), CNNs, RNNs, GNNs, Transformers, SSL
    • Why so much?
      • In a fast moving area, it’s perhaps better to know the field and study what’s required rather than know something in detail which is not used?
  • What are we not covering in the course?
    • Completely understanding an architecture/code involved. Recommendation: read papers/code
  • Why aren’t we coding more?
    • DL coding takes time. Spend more time in the project by delving into more detail.
  • What will I be able to do after the course?
    • Guess what architecture might be used for a certain application
    • Code 1 type of DL application (the project you do)
    • Read papers and judge how novel/relevant a paper is
  • Exam format
    • MCQ with negative marking and/or written questions

Class feedback form

  • Anonymous
  • Will be done after every class. Also, open after class but please refer to a specific class while giving feedback.
  • Link QR Code for class feedback

Use cases - Encoder Decoder

Encoder Decoder Architectures

Sequence2Sequence Learning
  • Teacher forcing
  • Loss?

Test time - encoder decoder

Test time
  • Beam Search

Beam Search

Metric for seq2seq models

\(BLEU = \exp\left(\min\left(0, 1 - \frac{\textrm{len}_{\textrm{label}}}{\textrm{len}_{\textrm{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}\)

  • BLEU = Bilingual Evaluation Understudy
  • Target = A, B, C, D, E, F
  • Prediction = A, B, B, C, D
  • \(p_n\) = precision of predicted n-grams in the original sequence. \(p_1 = 4/5, p_2 = 3/4, p_3 = 1/3, p_4 = 0\)
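
A short sketch computing BLEU for exactly this example (the helper is written from the formula above, not a library call); since \(p_4 = 0\) here, the overall score collapses to 0.

import math
from collections import Counter

def bleu(pred, label, k=4):
    score = math.exp(min(0, 1 - len(label) / len(pred)))   # brevity penalty
    for n in range(1, k + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        label_ngrams = Counter(tuple(label[i:i + n]) for i in range(len(label) - n + 1))
        matches = sum(min(c, label_ngrams[g]) for g, c in pred_ngrams.items())
        p_n = matches / max(len(pred) - n + 1, 1)
        score *= p_n ** (0.5 ** n)
    return score

print(bleu(list("ABBCD"), list("ABCDEF")))   # 0.0, because p_4 = 0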

Building applications in NLP

Structure

Pre-training/Tokenization

  • Modeling words is not enough

    • Let’s go sit on the bank
    • They looted a bank
  • Could we learn more contextualized embeddings?

  • What are embeddings?

    • Mapping words -> n-dimensional real number space

How to represent words?

  • One-hot?
    • Cosine similarity between any two distinct one-hot vectors is 0
      • Fails to capture similarity
  • Need representations to capture both similarity and context

Word2Vec

  • Skip-gram
    • Context window size = 2, center word = “loves”
    • ‘The man loves his son’
  • Continuous bag of words

Skip-gram

  • \(P(\textrm{"the"},\textrm{"man"},\textrm{"his"},\textrm{"son"}\mid\textrm{"loves"})\)
  • \(P(\textrm{"the"}\mid\textrm{"loves"})\cdot P(\textrm{"man"}\mid\textrm{"loves"})\cdot P(\textrm{"his"}\mid\textrm{"loves"})\cdot P(\textrm{"son"}\mid\textrm{"loves"})\)

probability

  • Model probability with softmax, \(P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\)

  • For multiple context words, \(\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)})\)

Training: Word2Vec

  • Maximize the log-likelihood (equivalently, minimize cross-entropy) \(\log P(w_o \mid w_c) =\mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right)\)

  • \(\begin{split}\begin{aligned}\frac{\partial \textrm{log}\, P(w_o \mid w_c)}{\partial \mathbf{v}_c}&= \mathbf{u}_o - \frac{\sum_{j \in \mathbb{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\\&= \mathbf{u}_o - \sum_{j \in \mathbb{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j\\&= \mathbf{u}_o - \sum_{j \in \mathbb{V}} P(w_j \mid w_c) \mathbf{u}_j.\end{aligned}\end{split}\)

Training Word2vec: CBOW

  • \(P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}{ \sum_{i \in \mathbb{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}\)
  • \(\frac{\partial \log\, P(w_c \mid \mathbb{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m} \left(\mathbf{u}_c - \sum_{j \in \mathbb{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)} \right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathbb{V}} P(w_j \mid \mathbb{W}_o) \mathbf{u}_j \right)\)

Applications of NLP

POS Tagging

POS Tagger

Architecture

Source

Summarization

  • Extractive
  • Abstractive
  • Extractive summarization through ranking (TextRank)

  • Abstractive Summarization

Source

Sentiment Classification

Sentiment Classification

Source

Image Captioning

Image Captioning

Source

Visual QnA

Visual QnA

Source

Many others

  • Language Translation
  • Audio translation
  • Video classification
  • Human Activity Recognition
  • Traffic Prediction and Anomaly Detection
  • Weather forecasting
  • Stock market prediction etc…

Revisiting networks

Types of networks discussed so far

  • MLP - fixed length input, output
  • Convolutions - fixed array input, output
    • or at least fixed structure
  • RNN (LSTM, GRU, etc.)
    • fixed input per time step but variable timesteps
  • Graph Convolution Networks

The challenge of graphs

  • Variable Topology
    • Sizes and number of connections change often
    • How to design expressive functions?
  • Requires Scalable methods
    • Graphs can run into millions of nodes
  • There may only be a single monolithic graph (e.g., the Linkedin Graph)
    • Train on samples and test on others may not apply

Source: UDLBook

What is a graph?

The graphs of our lives

  • Nodes and edges
    • Typically sparse
  • More examples of graphs - social network, computer programs, protein interactions, scientific literature, geometric point clouds, images

Examples of graphs

Examples

  • Which of these can you process with DL?

Apart from the structure

  • Node embeddings (attributes)
    • Fixed length embedding of say, person name, age group, interests, etc.
  • Edge embeddings (attributes)
    • Traffic, number of lanes on the road, whether footpaths exist
  • The three graphical musketeers
    • \(A, X, E\)
    • Adjacency Matrix, Node Embedding, Edge Embedding

Graphical musketeers

Source: UDLBook

Properties of adjacency matrix

Source: UDLbook

Graph Permutations

  • Graph nodes are indexed arbitrarily. Therefore, the network should be invariant to every permutation of the node indices
  • How to express a permutation of a graph mathematically?
    • \(P\) matrix with each row containing only one 1
    • \(X' = XP\)
      • What happens if P(m, n) is 1?
    • \(A' = P^TAP\)
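
A small numerical sketch of these relations (a random graph, with node embeddings stored as columns of X as in UDLBook): permuting the node order re-indexes X and A consistently, and permutation-invariant quantities such as mean pooling are unchanged.

import torch

N, D = 5, 3
X = torch.randn(D, N)                      # node embeddings as columns
A = (torch.rand(N, N) > 0.7).float()
A = torch.triu(A, 1); A = A + A.T          # symmetric adjacency, no self-loops

perm = torch.randperm(N)
P = torch.eye(N)[:, perm]                  # permutation matrix: a single 1 per row and column

X_perm = X @ P                             # X' = XP
A_perm = P.T @ A @ P                       # A' = P^T A P

print(torch.equal(A_perm, A[perm][:, perm]))                # edges follow the same re-indexing: True
print(torch.allclose(X.mean(dim=1), X_perm.mean(dim=1)))    # mean pooling is invariant: True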

Tasks on “entire” graphs

  • Predict temperature for state change (Regression)
  • Predict if a molecule is poisonous (Classification)

Approach - Graph Task

  • For individual nodes, understand the “meaning & context” of the nodes
    • Take \(X\) and \(A\) as input -> pass through \(K\) layers -> hidden representations \(H_k, \forall k \in [1, K]\)
    • For the graph task,
      • Average all node embeddings (also called mean pooling)
      • Follow it up with a linear layer/MLP
      • Use BCE for classification/MSE for regression

      Classification

Approach - Node Task

  • For example, predict the part of the plane using the point cloud node
  • Node embeddings are used as input to classifier/regression stage directly
  • Same \(\beta_k\) and \(\omega_k\) are used across nodes

Per node optimization

Approach - Edge Prediction Task

  • For example, “friend” suggestions
  • A similarity metric for two nearby nodes to predict if an edge should ‘indeed’ exist
    • Dot product for similarity

Graph tasks

Requirements of equivariance and invariance

  • Equivariance/Covariance, e.g., image segmentation
    • f[t[x]] = t[f[x]]
  • Invariance, e.g., classification of a flip
    • f[t[x]] = f[x]
  • For graph tasks, mean pooling is ________ to permutation matrix \(P\)

Graph Convolutions

  • The \(K\) layers induce a relational inductive bias
    • e.g., predict plane part using nodes of point cloud

GCN

The convolution in GCN

  • Convolution: Aggregates information from neighbors
    • Message passing from neighbors
    • Fixed weights according to the location of neighbor in a CNN. What about graphs?
  • Uniformly weighted combination of all neighbors
    • Writing combined for all nodes
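
A minimal sketch of one such layer in the UDLBook style (node embeddings as columns; \(\Omega\), \(\beta\), and the sizes are placeholders): each node uniformly aggregates itself and its neighbours, then a shared linear map and a ReLU are applied.

import torch

def gcn_layer(H, A, Omega, beta):
    N = A.shape[0]
    agg = H @ (A + torch.eye(N))            # uniform sum over neighbours + self
    return torch.relu(Omega @ agg + beta)   # shared weights; beta broadcasts over nodes

N, D_in, D_out = 6, 8, 16
H = torch.randn(D_in, N)
A = (torch.rand(N, N) > 0.6).float(); A = torch.triu(A, 1); A = A + A.T
Omega, beta = torch.randn(D_out, D_in), torch.randn(D_out, 1)
print(gcn_layer(H, A, Omega, beta).shape)   # torch.Size([16, 6])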

Check design considerations

  • equivariant to permutations of node indices
  • can cope with any number of neighbors
  • exploits graph structure to provide relational inductive bias
  • shares weights across nodes

Example: Drug classification

  • 118 elements in the periodic table, N = number of nodes in graph
  • \(X \in R^{118\times N}\), \(A \in R^{N \times N}\), \(\Omega_0 \in R^{D \times 118}\)

Batching the compute

  • Given I training graphs \(\{X_i, A_i\}\), \(y_i\) labels, one can learn \(\{\Omega_k, \beta_k\}_{k=1}^K\) for the K layers with BCE loss and SGD
  • Batching:
    • Supergraph with modified adjacency matrix
    • Separate pooling per graph

Inductive vs transductive

  • Supervised vs semi-supervised (imagine labeling more nodes in the academic literature graph)
  • Graph tasks -> inductive
  • Node, edge tasks -> inductive or transductive

Node task example

  • Transductive task on a large graph - label each node
  • Impossible to fit the entire graph in memory, i.e., we must subselect
  • How to sample nodes? k-hop sampling

Smarter subsampling

  • Neighborhood sampling (randomly subsample from hops)
    • Think dropout
  • Graph partitioning
    • Think clustering graph nodes to simplify graphs or
    • Removing some edges which might be ‘weaker’

Many ways of combination with neighbors

  • Sum
  • (Learned) Weighted sum
  • Residual connections
  • Mean aggregation (instead of sum aggregation)

Edge graphs

  • Transform into a node graph and use

Unifying modalities

Attention

The impact of attention

  • 1980s to 2010s
    • CNNs and RNNs remain almost unchallenged
    • Improvements in compute, data storage, even optimization but minimal changes to architecture
  • Today
    • SOTA architectures on most things are Transformer Models

Intuition of attention

  • Consider the seq2seq model,
    • Instead of compressing information into a single vector, can we revisit input again?
    • while decoding (generating) different words, shouldn’t we be looking at different parts of the input?
  • How?
    • In the encoding step, generate a “representation” equal to the input length
    • At the decode time, use a weighted sum of the “right” input “representations” to “understand” the context
    • ideally, the process of finding the “right” inputs should be learnable

Queries, Keys and Values

Consider a database where you’re retrieving first names from last names

  • We can design queries that operate on (key, value) pairs in such a manner as to be valid regardless of the database size.
  • The same query can receive different answers, according to the contents of the database.
  • The “code” being executed for operating on a large state space (the database) can be quite simple (e.g., exact match, approximate match, top-k).
  • There is no need to compress or simplify the database to make the operations effective.

Retrieving values from a database

  • \(\mathbb{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots (\mathbf{k}_m, \mathbf{v}_m)\}\) is queried with a query \(q\)
  • \(\textrm{Attention}(\mathbf{q}, \mathbb{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,\)
    • For a typical database, \(\alpha\) = exact match
    • Could we design an \(\alpha\) that captures “relevance” in some learnable sense
    • Instead of returning a single value, we return a weighted sum of values

Properties of attention-based querying

  • Typically generates a linear sum over “relevant” values. Some special cases:

Attention Pooling Architecture

Scoring Function

  • Dot Product Attention
  • All keys are assumed to be zero mean, unit variance (normalization)
  • Ideally, we’d like the dot products to have unit variance as well.
  • Therefore, we normalize by \(1/\sqrt{d}\) where \(k \in R^d\)

\(\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(\mathbf{q}^\top \mathbf{k}_i / \sqrt{d})}{\sum_{j} \exp(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d})}\)

Handling variable length input for softmax

Input

Dive into Deep Learning
Learn to code blank
Hello world blank blank

Masked Softmax

  • For padded positions, the attention scores are set to \(-\infty\), so their softmax weights, and hence the contribution of their values \(v_i\), become 0
def masked_softmax(X, valid_lens):  #@save
    """Perform softmax operation by masking elements on the last axis."""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    def _sequence_mask(X, valid_len, value=0):
        maxlen = X.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=X.device)[None, :] < valid_len[:, None]
        X[~mask] = value
        return X
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = _sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

Matrix Multiplication for Dot Product Attention

  • n Queries, m key-value pairs in the database
  • \(Q \in R^{n\times d}\), \(K \in R^{m\times d}\), \(V \in R^{m\times v}\)
  • Note that Q and K are assumed to be the same dimension \(d\)
  • \(Attention = \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}\)
class DotProductAttention(nn.Module):  #@save
    """Scaled dot product attention."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    # Shape of queries: (batch_size, no. of queries, d)
    # Shape of keys: (batch_size, no. of key-value pairs, d)
    # Shape of values: (batch_size, no. of key-value pairs, value dimension)
    # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with keys.transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

Different dimensions for keys and queries?

  • Additive attention
    • \(a(\mathbf q, \mathbf k) = \mathbf w_v^\top \textrm{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R}\)
    • where \(\mathbf W_q\in\mathbb R^{h\times q}\), \(\mathbf W_k\in\mathbb R^{h\times k}\), and \(\mathbf w_v\in\mathbb R^{h}\)
class AdditiveAttention(nn.Module):  #@save
    """Additive attention."""
    def __init__(self, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.LazyLinear(num_hiddens, bias=False)
        self.W_q = nn.LazyLinear(num_hiddens, bias=False)
        self.w_v = nn.LazyLinear(1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of queries: (batch_size, no. of
        # queries, 1, num_hiddens) and shape of keys: (batch_size, 1, no. of
        # key-value pairs, num_hiddens). Sum them up with broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        # There is only one output of self.w_v, so we remove the last
        # one-dimensional entry from the shape. Shape of scores: (batch_size,
        # no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of values: (batch_size, no. of key-value pairs, value
        # dimension)
        return torch.bmm(self.dropout(self.attention_weights), values)

Limitations of the RNN

  • Too much is expected of the state
  • In Bahdanau attention, when predicting a token, if not all the input tokens are relevant, the model aligns (attends) only to the parts of the input sequence deemed relevant to the current prediction. This context is then used to update the current state before generating the next token.

Using attention to enhance state

\(\mathbf{c}_{t'} = \sum_{t=1}^{T} \alpha(\mathbf{s}_{t' - 1}, \mathbf{h}_{t}) \mathbf{h}_{t}\)

  • Identify query, key, value
class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.attention = d2l.AdditiveAttention(num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(d2l.init_seq2seq)

    def init_state(self, enc_outputs, enc_valid_lens):
        # Shape of outputs: (num_steps, batch_size, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        outputs, hidden_state = enc_outputs
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        enc_outputs, hidden_state, enc_valid_lens = state
        # Shape of the output X: (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        for x in X:
            # Shape of query: (batch_size, 1, num_hiddens)
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # Shape of context: (batch_size, 1, num_hiddens)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape x as (1, batch_size, embed_size + num_hiddens)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After fully connected layer transformation, shape of outputs:
        # (num_steps, batch_size, vocab_size)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights

Introducing the relevance in decoding

Dynamically choosing \(h_t\)s.

Attention in sequences

Link

Interpretability ++

Attention in Images

CNN + RNN + Attention

Attention in graphs

Limitations?

  • Only one type of relationship captured?
    • Relevance
  • The order of input isn’t captured

Multiple heads of attention

Mathematically, MHA

Self attention

  • In self-attention, the queries, keys, and values are all linear transformations of the same input X; with multiple heads, several such projections are learned in parallel. This is why it is called self-attention.

num_hiddens, num_heads = 100, 5
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))

Wait, what order was that in?

Positional Encodings

  • Space efficiency is of importance
  • Should be able to uniquely identify location
0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111
  • Represented using continuous functions for space efficiency

\(\begin{split} p_{i, 2j} = \sin\left(\frac{i}{10000^{2j/d}}\right), p_{i, 2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right).\end{split}\)
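
A short sketch generating these encodings (sequence length and dimension are arbitrary); even columns get the sine terms and odd columns the cosines.

import torch

def positional_encoding(num_positions, d):
    P = torch.zeros(num_positions, d)
    i = torch.arange(num_positions, dtype=torch.float32).reshape(-1, 1)
    freq = torch.pow(10000, torch.arange(0, d, 2, dtype=torch.float32) / d)
    P[:, 0::2] = torch.sin(i / freq)    # p_{i, 2j}
    P[:, 1::2] = torch.cos(i / freq)    # p_{i, 2j+1}
    return P

print(positional_encoding(num_positions=60, d=32).shape)   # torch.Size([60, 32])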

Understanding positional encoding

  • Encouraging unique encodings through decreasing frequencies and offsets

  • Learning different frequency information for long sequences

Relative positional encoding

  • Should be easy to learn ‘one token before, two tokens after’ kind of relationships

\(\begin{split}\begin{aligned} \begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \\ \end{bmatrix} \begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \\ \end{bmatrix} =&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \\ \end{bmatrix}\\ =&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\ \cos\left((i+\delta) \omega_j\right) \\ \end{bmatrix}\\ =& \begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \\ \end{bmatrix}, \end{aligned}\end{split}\)

Transformer

  • An architecture without recurrent connections that can capture sequential as well as long range information
  • Four elements
    • Multi-head self attention (MHA)
    • Feedforward Network (FFN)
    • Residual connections
    • Encoder-Decoder Attention
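
As a sketch of how these elements are packaged, PyTorch's built-in encoder block (hyperparameters here are arbitrary) combines multi-head self-attention and an FFN, each wrapped in a residual connection with layer normalization:

import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, dim_feedforward=256,
                                   batch_first=True)
x = torch.randn(2, 10, 128)    # (batch, sequence length, d_model)
print(layer(x).shape)          # torch.Size([2, 10, 128])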

Vision Transformer

Swin Transformer

  • Hierarchical feature maps for linear scaling with image size (see the Hierarchical Features & Shifted Windows figure)

Graph Transformer

Pros

  • Long range connections
  • Navigating the maze with the map
  • Node update still unrelated to graph structure

Cons

  • Identifiability of nodes (need better encodings)
  • Loss of relational inductive bias
  • Increased computational complexity

Source

Graph Transformer detailed

Multimodal (Unified) transformers

  • VQA: Visual Question Answering
  • SNLI-VE: Stanford Natural Language Inference - Visual Entailment
  • MNLI: Multi-Genre Natural Language Inference
  • QNLI: Question Natural Language Inference
  • QQP: Quora Question Pairs
  • SST-2: Stanford Sentiment Treebank - 2

Architecture

The unification

Source

Pre-training

Purpose

  • Improve generalization
  • Find ways to learn (pretext) tasks without labels such that “relevant” information about underlying data is captured in some “embedding”
    • Completing incomplete images
    • Solving puzzles
    • Correcting corrupted files/text
    • Finding signal from noise

Encoder-only

  • BERT (Bidirectional Encoding Representations from Transformers)
    • Masking words randomly
  • output embedding is projected for classification/regression tasks

Fine-tune BERT

Encoder-decoder: Pre-training T5

  • T5 = Text to text transfer transformer
    • Includes the “task” in the encoder, e.g., summarize..

Fine-tuning T5

Decoder only: GPT

Using GPT2 for different tasks without fine-tuning

Scalability of transformers

Image SSL

Source1, Video, Slides

preferred source -> Source2

Visual pretext tasks:

  • Relative positions
  • Puzzles
  • Predicting rotations

The secret ingredient

Currently,

  • Pretext Tasks
  • Tricks to ensure non-trivial solutions

Sometimes, these things go beyond imagination

  • OpenAI CLIP (Contrastive Language Image Pretraining)
    • SSL on 400 million text-image pairs
    • Closes the robustness gap by up to 75%
    • Matches supervised ImageNet classification accuracy without using ImageNet labels (zero-shot)

Built on

  • modern architectures like the Transformer
  • VirTex which explored autoregressive language modeling
  • ICMLM which investigated masked language modeling
  • ConVIRT, which studied the same contrastive objective used in CLIP, but in the field of medical imaging.

Architecture

  • Train time?
    • 256 GPUs for 2 weeks

Variants of SSL

  • Text pretraining
  • Classification and segmentation of images
  • Speech Recognition
  • CURL (Reinforcement Learning)
  • Drug Discovery
  • Learning from demonstrations

On satellite images

SatCLIP

Geospatial Computer Vision Group at Plaksha

  • SSL + Satellite images for
    • Agriculture
    • Urban Planning
  • What are the events that are frequently occurring phenomena at multiple places?
    • Agriculture: Crops growing and harvesting
    • Urban Infrastructure development
    • Socio-economic development
  • Could this “information/insight” be converted into value for a societal segment?
    • Farmers
    • City dwellers
    • Governments, etc.