Elements of Deep Learning

Anupam Sobti

Introduction

Deep Learning (3-1-0)

  • Practicum: 25% assignments and a semester project
    • Project needs to be a novel contribution to deep learning
      • existing method applied to a novel setting
      • novel method for existing problem
      • TBD on a public git repo in groups of up to 4. Grades will be distributed according to commits.
  • Math heavy (at times)
  • Code heavy (always)
    • Pytorch

How we run the labs

  • Code examples of things we do in class
  • Try to understand all code
  • Read documentation
  • Feel free to generate GPT code but be ready to be quizzed on everything generated

Course Evaluation

  • Quizzes: Top 3 of 4 (30%) [held every month]
  • Assignments: 2 (20%)
  • Project: 40%
    • Mid-sem: 10%
    • Weekly Reviews (top 4 scores): 5%
    • End-Sem: 25%
  • Attendance: 10%

Attendance marks matrix:

  • More than 80% = 10 marks (full)
  • 70 to 80 % = 8 marks
  • 60 to 70 % = 6 marks
  • Less than 60 % = 0 marks

Teaching Assistant

Project Timeline

  • Mid-sem report
    • Problem statement proposal
      • Proposed changes to the state of the art
    • Related Work Study
  • End-sem report
    • Planned experiments
    • Executed experiments
    • Results and Analysis
    • Paper Submission

Star Projects from last year

  • WarpNet
  • Enhancing Yield Prediction
  • Airfoil Design
  • Dynamic Hand Gesture Recognition
  • Classifying seizures
  • Poverty Prediction from Satellite Images
  • Evaluating UI Screens

Course References

Surviving in deep learning

  • Learn to read papers
    • If you can’t read papers, your DL will be outdated in 2 weeks’ time
  • Learn to read/write math required for DL
  • Be ready to read and write code
  • Learn to structure experiments (wandb)

Compute

  • Laptop
  • Coming soon..

Teaching

  • 28 lectures
  • 2 paper reading sessions (student presentations)
    • [groups of up to 4]
  • Tentative lecture plan:

Lectures-1

Lectures-2

Why Deep Learning?

  • Data and compute availability in large quantities
  • Hard to define features. Need to be automatically learnt.
  • DL allows a hierarchical feature composition to handle complexity
    • Real life data often has the same property. Think images, text, sound, etc.


Hierarchical Feature Composition

Source: link

The process

  • The data that we can learn from.
  • A model of how to transform the data.
  • An objective function that quantifies how well (or badly) the model is doing.
  • An algorithm to adjust the model’s parameters to optimize the objective function.

Things to wonder!

  • How do we train models with more parameters than the data samples?
    • Alexnet + Imagenet -> 60 million parameters with 1.2 million images
    • How do they still generalize?
  • Theoretically, two layers should be sufficient (universal approximation). Why aren’t they enough in practice?

Kinds of (Supervised) Machine Learning Problems

  • Classification
    • Binary, Multiclass, Hierarchical
  • Tagging
    • Multilabel classification, e.g., topics in a forum or categories of clothes
  • Regression
  • Search
    • How to rank pages as per the relevance for the query (original PageRank from Google didn’t do this!)
  • Sequence Learning
  • Recommender systems
    • Will this user like this item? (Classification)
    • Rating prediction for the user (Regression)

Multilabel Classification

d2l.ai

Unsupervised Learning

  • Clustering
  • Subspace Estimation (PCA)
  • Learn the distribution where data comes from (and generate it!)
  • Self supervised learning

Deep Reinforcement Learning

  • “Learn” policies
  • “Learn” Q values for state-action pairs

Why now?

Decade   Dataset                                Memory   Floating point ops/s
1970     100 (Iris)                             1 KB     100 KF (Intel 8080)
1980     1 K (house prices in Boston)           100 KB   1 MF (Intel 80186)
1990     10 K (optical character recognition)   10 MB    10 MF (Intel 80486)
2000     10 M (web pages)                       100 MB   1 GF (Intel Core)
2010     10 G (advertising)                     1 GB     1 TF (NVIDIA C2050)
2020     1 T (social network)                   100 GB   1 PF (NVIDIA DGX-2)
  • Growth in datasets and compute
  • Increased complexity and memory without an explicit increase in the number of parameters (attention)
  • Scaling behavior of transformers
  • Multi-stage designs with memory networks/neural interpreters
  • Generative modeling with GANs, Diffusion Models
  • Distributed computation (7-minute training for ResNet models)
  • DL Frameworks (TF, Pytorch)

How does it change your life?

  • Intelligent assistants
  • ChatGPT
  • The AI Strategist - Go
  • AI for Science - Drug Discovery
  • Self driving (real time image perception capabilities)

Let’s start with

Reading papers

Three pass approach

How to read a paper

  • The first pass (10-15 mins)
    • Title, Abstract, Introduction
    • Section and Sub-section headings
    • Conclusions

Answer the five Cs

  • Category
  • Context
  • Correctness
  • Contributions
  • Clarity

The second pass (1 hour)

  • Figures, Diagrams, Illustrations, Graphs
  • References! (this is how you find the really good papers)

After this pass, you

  • understand the paper content
  • are able to summarize the paper appropriately into your own paper

The third pass (virtual re-implementation)

  • 4-5 hours for beginners, about 1 hour for experienced readers
  • Given the input, how would you approach it?
  • Identify and challenge every assumption in the paper
  • Attention to detail

Finding papers

  • Google Scholar
    • Where was it published? (CVPR, ICCV, ACCV, WACV, ICML, ICLR, AAAI, Neurips)
    • Who cited this?
    • Which references did they cite?
  • Semantic Scholar
  • Connected Papers
  • Some new custom GPTs

Math basics

Math basics: Dot products

  • Dot Products for vectors \(x\) and \(y\) (vector-vector products)
  • \(x^Ty\) or \(\lvert x \rvert \lvert y \rvert \cos\theta\)
import numpy as np

# Let's take two vectors in 3D space
x = np.array([2, 3, 4])
y = np.array([1, 5, 7])

# Compute the dot product algebraically
dot_product_algebraic = np.dot(x, y)

# Compute the magnitudes of x and y
magnitude_x = np.linalg.norm(x)
magnitude_y = np.linalg.norm(y)

# Compute the angle between x and y using the dot product and magnitudes
# Using the dot product formula: x . y = |x| * |y| * cos(theta)
# We solve for cos(theta) as: cos(theta) = (x . y) / (|x| * |y|)
cos_theta = dot_product_algebraic / (magnitude_x * magnitude_y)

# Now we can calculate the dot product geometrically
dot_product_geometric = magnitude_x * magnitude_y * cos_theta

(dot_product_algebraic, dot_product_geometric, np.isclose(dot_product_algebraic, dot_product_geometric))

Math basics - rotations & projections

  • \(y = Ax\) (Vector matrix products)
    • Rotation if A is a square (orthogonal) matrix
    • Projection if an n-dimensional vector is converted to an m-dimensional vector
import torch

# Example values (from the d2l.ai tensor examples) that reproduce the output shown below
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)
A.shape, x.shape, torch.mv(A, x), A@x
(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))

Math basics - composing projections

  • \(C = AB\)
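
A quick PyTorch check of this composition (a minimal sketch; the matrices are arbitrary examples): applying B and then A to a vector gives the same result as applying the single composed matrix C = AB.

import torch

A = torch.randn(2, 3)   # maps R^3 -> R^2
B = torch.randn(3, 4)   # maps R^4 -> R^3
C = A @ B               # composed map R^4 -> R^2

x = torch.randn(4)
print(torch.allclose(A @ (B @ x), C @ x, atol=1e-6))  # True: same transformation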

Rate of change of scalars: Calculus

  • Differentiation
    • Analytical
      • Sum and Product Rules
    • Numerical
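
A minimal sketch contrasting the two approaches (the function and step size are chosen arbitrarily): the analytical derivative from the product rule versus a central-difference numerical estimate.

import numpy as np

f = lambda x: x**2 * np.sin(x)                       # example function
f_prime = lambda x: 2*x*np.sin(x) + x**2*np.cos(x)   # product rule

x, h = 1.5, 1e-5
numerical = (f(x + h) - f(x - h)) / (2 * h)          # central difference
print(numerical, f_prime(x), np.isclose(numerical, f_prime(x)))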

Multivariate calculus

Please derive

  • \(\nabla_x x^T A = A\)
  • \(\nabla_x Ax = A^T\)
  • \(\nabla_x x^T A x = (A + A^T) x\)
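
These identities can be sanity-checked with autograd; a minimal sketch for the quadratic form (A and x are random placeholders):

import torch

D = 4
A = torch.randn(D, D)
x = torch.randn(D, requires_grad=True)

# f(x) = x^T A x is a scalar, so backward() gives its gradient w.r.t. x
(x @ A @ x).backward()
print(torch.allclose(x.grad, (A + A.T) @ x))  # True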

Automatic Differentiation

  • Reference Link
    • Differential for scalars
    • Differential for vectors
  • Must watch: Build autograd from scratch with Andrej Karpathy

Probability and Statistics

  • Probability: Underlying parameter of the Data
  • Statistics: Looks at data and finds “estimators” of the probability
    • The estimators then converge to underlying probability
      • The converged probability generates data (that one can expect to be generated in the future)

Data -> Sample -> Estimators -> Probability -> Data

Covered in class

  • Loss functions for
    • linear regression - least squares, MLE
    • linear classification - cross entropy
  • Analytical solution for regression
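
For the analytical (closed-form) regression solution, a minimal sketch using the normal equations \(w = (X^\top X)^{-1} X^\top y\) on synthetic data (the sizes and true weights here are arbitrary):

import torch

N, D = 100, 3
X = torch.randn(N, D)
w_true = torch.tensor([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * torch.randn(N)

# Least-squares solution: solve (X^T X) w = X^T y
w_hat = torch.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true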

Generality of log likelihood

Convolutional Neural Networks

Use cases for CNNs

  • What is a convolution?
  • Common terms: stride, padding, dilation, kernel size
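
A small PyTorch sketch of how these terms affect the output shape (channel counts and input size are arbitrary); the output spatial size follows \(\lfloor (H + 2p - d(k-1) - 1)/s \rfloor + 1\):

import torch
from torch import nn

# kernel_size=3, stride=2, padding=1, dilation=1 on a 32x32 input
conv = nn.Conv2d(in_channels=3, out_channels=8,
                 kernel_size=3, stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)   # torch.Size([1, 8, 16, 16])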

Why CNN?

  • Capture spatial locality and maintain locality in feature space
  • Introducing Spatial Invariance (pooling) in feature space
    • similar response to a feature present anywhere in the image

Why CNN?

  • Increasing receptive field

Source: UDLBook

Much faster convergence due to increased model (inductive) bias

Convolution Effectiveness

Dimensionality

  • 1D CNN - audio, text, finance, etc.
  • 2D CNN - images

(Down and Up) Sampling

Down and up-sampling (UDLbook)

Transposed convolutions (upsampling)

Other operations

  • Change number of channels
    • point-wise convolutions (1x1)
  • Learning channel-wise filters
    • depth-wise convolution
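
A minimal sketch of both operations in PyTorch (channel counts are arbitrary): a 1x1 point-wise convolution changes only the channel dimension, while setting groups equal to the number of input channels gives a depth-wise convolution.

import torch
from torch import nn

x = torch.randn(1, 32, 28, 28)

# Point-wise (1x1) convolution: only changes the number of channels
pointwise = nn.Conv2d(32, 64, kernel_size=1)
print(pointwise(x).shape)   # torch.Size([1, 64, 28, 28])

# Depth-wise convolution: one spatial filter per channel (groups = in_channels)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
print(depthwise(x).shape)   # torch.Size([1, 32, 28, 28])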

The DL revival?

  • What started the DL revival?

DL Meme

The Imagenet dataset

Imagenet dataset considerations

Alexnet

AlexNet

What progress looks like

ImageNet performance

Perception of images

  • Classification
  • Object Detection
  • Semantic Segmentation
  • Pose
  • Dense Pose

Detectron 2 Tasks

Perception of images

  • Monocular depth prediction
  • Stereo depth prediction
  • Anomaly detection

Depth Prediction

Depth Anything Model

Creation of images

  • Conditioned on text prompt/other images
  • High resolution from low resolution - forensics, satellite, movies, etc.

Midjourney creations

Modifying images

Style Transfer

Modifying images - contd (LEDITS++)

Potter

Convolution Evolution

Improving Convolutions

  • Proving convolutions work: AlexNet
  • Improving convolutions: the breakthrough in deeper networks
    • AlexNet (8 layers) to VGG (16-19 layers) improved performance, but going deeper did not help further
    • Why?
      • vanishing/exploding gradients (handled by smart initialization)
      • Common step size in optimizers might move optimization to unrelated gradient points
      • Shattered Gradients

Shattered Gradients

  • In deeper networks, gradients become increasingly noise-like with respect to the input (see the Shattered Gradients figure)

Saviors: Residual connections + batch normalization

  • Intuitively, allows for different complexities in the learned concepts

Residual

Layers

Unraveling

Residual connections

Gradient is easier Order

Expressive functions

Expressive ability

Loss surface with skip connections

On CIFAR-10

Numeric control

  • Control activations (forward pass) & gradients (backward pass)
    • (keep same expected variance)
    • He initialization: \(\beta\) initialized to 0, \(\omega\) initialized from a normal distribution \(\mathcal{N}\left(0, \frac{2}{D_h}\right)\), where \(D_h\) is the number of hidden units in the previous layer
    • Residual connections: activations might still explode since variance grows through the layers; batch normalization re-establishes a numerical range after every layer
      • Batch Normalization
        • Step1: Normalization
        • Step2: Scale & Shift
        • \(m_h\) and \(s_h\) at test time?: calculated from the entire training dataset
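
A minimal sketch of the two batch-normalization steps on a batch of 1D activations (shapes arbitrary); at test time the batch statistics are replaced by statistics computed over the training set.

import torch

x = torch.randn(128, 64)    # (batch, features)
gamma = torch.ones(64)      # learnable scale
beta = torch.zeros(64)      # learnable shift

# Step 1: normalize each feature with the batch mean and variance
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

# Step 2: scale and shift
y = gamma * x_hat + beta

# Each feature of x_hat now has roughly zero mean and unit variance
print(x_hat.mean(dim=0).abs().max().item(), x_hat.var(dim=0).mean().item())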

Bottleneck residuals

  • Limiting the number of parameters

Residual blocks

Image Classification with residual blocks

Resnet-200 Alternate representation

DenseNet

  • either concatenate or do a 1x1 convolution to add a weighted combination
  • comparable performance to ResNet (see the DenseNet architecture figure)

U-net

  • Brings spatial dexterity (precise localization) together with semantics (see the architecture figure)

Semantic Segmentation examples

Semantic Segmentation

Pose Prediction Example (Hourglass)

Hourglass

Different types of normalization

  • Note: Input is 1D

Normalization Types

[End of UDL References]

Handling backprop in shared weights

  • For calculating the update for the kernel weights,
    • calculate gradient on all points of the image/activation/feature map
    • sum gradient
      • rather than updating a weight based solely on its performance at a single position, the update considers its effect across all the positions it was applied to
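
A tiny autograd check of this (a 1x1 kernel, so the shared weight is a single scalar): with a sum loss, the gradient on the shared weight is the sum of its per-position gradients, here simply the sum of the inputs.

import torch
from torch import nn

conv = nn.Conv1d(1, 1, kernel_size=1, bias=False)   # one shared weight
x = torch.randn(1, 1, 5)

conv(x).sum().backward()
# The kernel's gradient accumulates contributions from every position it touched
print(conv.weight.grad.item(), x.sum().item())      # the two values match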

Other influential advances: NiN (Network in Network)

  • purpose: reduce computation by removing the large fully connected layers
  • convolution sizes similar to AlexNet, with 3x3 max-pooling between blocks
  • the final block uses 1x1 convolutions to make the number of output channels equal to the number of labels, followed by global average pooling instead of a fully connected layer
  • slower training

NiN Training times

ref: D2L.ai

GoogLeNet (2014 ImageNet Winner)

  • First clear distinction between stem (data ingest), body (processing), head (prediction)

Inception Cell Deeper

Architecture

ResNext

ResNext

  • A grouped convolution is roughly \(g\times\) cheaper (in parameters and compute) than a dense convolution (see the sketch below)
  • The 1x1 convolutions allow information sharing between groups
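
A parameter-count comparison (a sketch with arbitrary channel counts) illustrating the \(g\times\) saving:

import torch
from torch import nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(64, 64, kernel_size=3, bias=False)
grouped = nn.Conv2d(64, 64, kernel_size=3, groups=8, bias=False)   # g = 8

print(n_params(dense), n_params(grouped))   # 36864 vs 4608: 8x fewer weights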

Learning embeddings from images

Image Augmentations

  • Flips
  • Crops/Resize/Rescale
  • Color Augmentations
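
A typical training-time pipeline built from these augmentations, as a torchvision sketch (the parameter values are illustrative, and `img` would be a PIL image from your dataset):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),   # crop/resize/rescale
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# augmented = augment(img)   # img is a hypothetical PIL image; output is a (3, 224, 224) tensor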

Augmentations

Single Shot Multibox Detector (2016)

SSD
  • Multibox detector -> Assigns ground truth boxes to anchor boxes

Loss Function
  • Loss is a weighted sum of confidence and localization loss. N = number of matched boxes.

SSD - Loss functions

Loss Functions
  • How to handle class imbalance (many more negative boxes than positive ones)?
    • Hard negative mining: keep only the most confident negative boxes, up to a 3:1 negative-to-positive ratio

SSD Vs YOLO

YOLO Comparison

Retina Net

RetinaNet

CornerNet

Corner detection; corner grouping with embeddings (figures)

Proposal based methods

ROI Pooling ROI Generation and Pooling

Deformable convolutions

Deformable Convolutions

CentripetalNet

CentripetalNet

Semantic Segmentation: Capturing multi-scale context (from deeplabv3)

Capturing multi-scale context

Side: Spatial pooling pyramids

Blending with laplacian pyramids

Learning similar embeddings

  • Face Recognition

Triplet Loss

Loss Function
  • Object re-identification also uses a triplet-loss framework on object-level representations
  • Note that re-identification and detection can have conflicting objectives
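
A minimal triplet-loss sketch in PyTorch (embedding sizes and margin are arbitrary): the anchor is pulled towards the positive and pushed away from the negative.

import torch
from torch import nn

triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(16, 128, requires_grad=True)   # embeddings of the query identity
positive = torch.randn(16, 128, requires_grad=True)   # same identity
negative = torch.randn(16, 128, requires_grad=True)   # different identity

loss = triplet(anchor, positive, negative)
loss.backward()
print(loss.item())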

ReID Task

FairMOT

FairMOT

How would you implement?

  • signature verification
  • face verification
  • face recognition

Cosine Similarity

  • Take a q-dimensional output -> L2 Normalization -> Use cosine similarity
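
A minimal sketch of this recipe (the embedding dimension is arbitrary): after L2 normalization, the cosine similarity is just a dot product.

import torch
import torch.nn.functional as F

a = torch.randn(1, 128)    # q-dimensional embedding of sample 1
b = torch.randn(1, 128)    # q-dimensional embedding of sample 2

a_n, b_n = F.normalize(a, dim=1), F.normalize(b, dim=1)
cos = (a_n * b_n).sum(dim=1)
print(cos, F.cosine_similarity(a, b, dim=1))   # the two agree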

Cosine Similarity Concept Formula

Fine grained/Hierarchical classification

Fine grained classification dataset Cars

  • How to highlight distinctions?
    • Part based embeddings

Part based classification Joint localization and classification

Link1 Link2

Visualizing embeddings

What did we learn?

Video Link

Adaptive Style Transfer

  • Adaptive instance normalization allows the activation distribution to match a style activation distribution

Adaptive Instance Normalization

Adaptive Style Transfer

Example outputs (Adaptive Style Transfer)

output

Generating Images

  • Covered in visualization
    • Neural style transfer
    • Fast style transfer
    • Adaptive Style Transfer
  • Generating from scratch
    • GANs
    • DCGANs
    • Stable Diffusion

GANs (Generative Adversarial Networks)

GANs Complete Loss Function

ref: you think you know GANs?

Common issues in GANs

  1. Mode collapse (see the Mode Collapse figure)
  2. Vanishing gradients
    • When the discriminator performs much better than the generator, the generator receives vanishing gradients and no challenging feedback to learn from
  3. Convergence issues (see the Convergence figure)

ref

DCGAN

  • Deep Convolutional Generative Adversarial Networks
    • both generator and discriminator are built from convolutions/transposed convolutions
    • Wasserstein loss (a critic instead of a discriminator) is often used to handle the common issues above

Conditional GANs

Conditioning on label/other modalities of data

Style GAN

StyleGAN Architecture
  • Style Mixing (Mixing regularization)
  • Stochastic variation by introducing noise
  • Separation of global effects from stochasticity
    • focuses on generating effects indistinguishable by discriminator

Ref

Example outputs

Example outputs: StyleGAN

Stable Diffusion

  • to be continued after attention

Detecting anomalies in images

Learning distributions from where data is generated

Autoencoders

  • Simple autoencoder

AnoGAN

Anomaly detection through GANs Reference

GANomaly

GANomaly Reference

Masked autoencoders

MAE link

Other methods for anomaly detection

  • Variational autoencoders
  • Self supervision based anomalies
  • Patch based anomalies

Speaking of anomalies

That you are here—that life exists and identity, That the powerful play goes on, and you may contribute a verse.

  • Walt Whitman

Understanding sequences

reference

Structure of data

  • Tabular data (\(x \in \mathbb{R}^d\))
    • can be stored in a table. Fixed length input/output
  • Image data
    • Still fixed size but typically single input -> single output
    • Could be variable size (purely convolutional networks)
  • How do we handle
    • sound, video, text, etc.?
    • not fixed length inputs but sequences of fixed length inputs

Common Applications

Discriminative

  • Sentiment Analysis
  • Machine Translation
  • Video Understanding

Generative

  • Generating music
  • Writing answers to questions
  • Creating videos!

Some videos

Sora

Recurrent Neural Networks

  • Ordered list of features \(x_1, \ldots, x_T\) where \(x_t\) is the \(t^{th}\) time step in the sequence
  • Sequence independence, not feature independence
  • Aligned (e.g., POS tagging) and unaligned predictions (translation, sentiment)

RNNs POS, Sentiment

Auto-regressive models

  • Quantity of interest: \(P(x_t | x_1, x_2 .. x_{t-1})\)
    • Notice the varying sequence length after every time step -> do we need a new model after every timestep?

Stock Value over years

Handling varying input length?

  • Take \(\tau\) past values only (\(\tau^{th}\) order Markov assumption)
  • Keep a summary (\(h_t\)) of past values
    • latent autoregressive models: \(h_t = g(h_{t-1}, x_{t-1})\)
    • stationarity assumption with respect to relationship with past values (relationship between the \(t-1\) step and current step is learned and thus implicitly assumed stationary)

Sequence Models

  • Modeling the joint probability of all features in a sequence (sequence model)
    • If the sequence is a sentence in a language, it is called a language model.
    • \(P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^T P(x_t \mid x_{t-1}, \ldots, x_1)\)
      • Automatically takes an autoregressive shape when used in this form

Example - Markov model: input

  • Markov Assumption
    • \(P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^T P(x_t \mid x_{t-1})\)

Input data

Model: k-step ahead prediction

1-step ahead

Linear Regression Model Output

k-step ahead

k-step comparison k-step

Why build language models?

  • Generate text based on human input
  • Rule out unlikely sentences (in e.g., Speech to Text)
  • How?
    • \(\begin{split}\begin{aligned}&P(\textrm{deep}, \textrm{learning}, \textrm{is}, \textrm{fun}) \\ =&P(\textrm{deep}) P(\textrm{learning} \mid \textrm{deep}) P(\textrm{is} \mid \textrm{deep}, \textrm{learning}) P(\textrm{fun} \mid \textrm{deep}, \textrm{learning}, \textrm{is}).\end{aligned}\end{split}\)
    • How to calculate these probabilities? The frequentist approach.

Modeling text

  • Read text
  • Tokenize (by enumerating the whole vocabulary of words or all characters; a design decision)
import re
# TimeMachine, data, and raw_text come from the d2l.ai text-preprocessing pipeline
@d2l.add_to_class(TimeMachine)  #@save
def _preprocess(self, text):
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']

Token Frequency

Zipf’s Law: \(n_i \propto \frac{1}{i^\alpha}\)

What do the top ten words tell you?

  • Word frequencies
[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]
  • Bigrams
[('of--the', 309),
 ('in--the', 169),
 ('i--had', 130),
 ('i--was', 112),
 ('and--the', 109),
 ('the--time', 102),
 ('it--was', 99),
 ('to--the', 85),
 ('as--i', 78),
 ('of--a', 73)]
  • Trigrams
[('the--time--traveller', 59),
('the--time--machine', 30),
('the--medical--man', 24),
('it--seemed--to', 16),
('it--was--a', 15),
('here--and--there', 15),
('seemed--to--me', 14),
('i--did--not', 14),
('i--saw--the', 13),
('i--began--to', 13)]

Zipf’s Law
  • Indicates structure

Modeling probabilities

  • Model probabilities over some large corpus, e.g., Wikipedia, Project Gutenberg, etc.

\[\begin{split}\begin{aligned} \text{Unigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4),\\ \text{Bigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3),\\ \text{Trigram } P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3). \end{aligned}\end{split}\]

Laplace Smoothing

  • Problem: words and combinations unseen in training still occur frequently at test time (zero counts)
  • \(n\): number of words, \(m\): number of unique words; \(\epsilon\): smoothing hyperparameter (\(\epsilon \in [0, \infty)\)) \[ \begin{split}\begin{aligned} \hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\ \hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\ \hat{P}(x'' \mid x,x') & = \frac{n(x, x',x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}. \end{aligned}\end{split} \]
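
A toy sketch of the unigram case of this smoothing (the corpus and \(\epsilon\) are made up): unseen words still receive non-zero probability.

from collections import Counter

tokens = "the time traveller for so it will be convenient to speak of him".split()
counts = Counter(tokens)
n, m, eps = len(tokens), len(counts), 1.0

def p_hat(word):
    # P_hat(x) = (n(x) + eps/m) / (n + eps)
    return (counts[word] + eps / m) / (n + eps)

print(p_hat("the"), p_hat("unseen_word"))   # the unseen word still gets a small probability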

Problems with frequentist approach

  • “Meaning” of words is ignored
  • All counts need to be stored
  • n-grams often occur rarely, especially for larger n values

Training the network

  • Fixed length neural network based language model

I/O

  • skip a random number \(d\) of tokens at the start to get random subsequences
  • m partitioned subsequences: \(\mathbf x_d, \mathbf x_{d+n}, \ldots, \mathbf x_{d+n(m-1)}\) where \(m = \lfloor (T-d)/n \rfloor\)

Measuring model quality - perplexity

“It is raining outside”

“It is raining banana tree”

“It is raining piouw;kcj pwepoiut”

  • Likelihood by itself is not a good measure
    • Shorter sequences are always more likely
  • Surprise (cross entropy) modeled over entire sequence
    • \(CE = \frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)\)
    • Perplexity = \(e^{CE} \in [1, \infty)\)
    • Baseline: uniform distribution across all tokens
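
A small sketch relating the two quantities (the per-token probabilities are made up): perplexity is the exponentiated average negative log-probability, and a uniform model over V tokens has perplexity V.

import torch

p = torch.tensor([0.2, 0.5, 0.1, 0.4])   # model's probability for each observed token

cross_entropy = -torch.log(p).mean()
perplexity = torch.exp(cross_entropy)
print(cross_entropy.item(), perplexity.item())

# Baseline: a uniform model over a vocabulary of size V has perplexity V
V = 28
print(torch.exp(-torch.log(torch.tensor(1.0 / V))).item())   # 28.0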

The breakthrough

  • What is a good \(\tau\)?
    • How do we eliminate \(\tau\) altogether?
    • Could we capture historical values in a compressed fashion?
    • Could we maintain state from longer past sequences?
    • \(P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1})\)
    • Continual update to hidden state: \(h_t = f(x_{t}, h_{t-1})\)

Recurrent Neural Networks

  • Refer notes

Cell Architecture

Character level RNN-based language model

Gradient Clipping

  • Lipschitz Continuity with constant L \(|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|\)

  • Gradient Update \(|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|\)

  • How to control gradient?
    • Gradient Clipping \(\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.\)

Refer jupyter notebook for implementation
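
A minimal sketch of the clipping step in PyTorch (the model and the threshold \(\theta = 1\) are placeholders); `clip_grad_norm_` rescales the global gradient norm in place before the optimizer step.

import torch
from torch import nn

model = nn.GRU(input_size=8, hidden_size=16)
out, _ = model(torch.randn(5, 2, 8))     # (steps, batch, features)
out.sum().backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(total_norm)   # at most ~1.0 after clipping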

BPTT

  • Refer notes

Truncation

  • Comparing strategies for computing gradients in RNNs. From top to bottom:
    • randomized truncation
    • regular truncation
    • full computation

Alternate cells

  • LSTMs
  • GRUs

LSTM

LSTM

GRU

GRU

Deep RNNs

Deep Recurrent Neural Network

Other tasks - fill in the blanks

  • I am ___.
  • I am ___ hungry.
  • I am ___ hungry, and I can eat half a pig.

Architecture

Bi-directional RNN computation

\[ \begin{split}\begin{aligned} \overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{\textrm{xh}}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{\textrm{hh}}^{(f)} + \mathbf{b}_\textrm{h}^{(f)}),\\ \overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{\textrm{xh}}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{\textrm{hh}}^{(b)} + \mathbf{b}_\textrm{h}^{(b)}), \end{aligned}\end{split} \]

  • concatenate the forward and backward hidden states \(\overrightarrow{\mathbf{H}_t}\) and \(\overleftarrow{\mathbf{H}_{t}}\) to obtain the hidden state.
  • \(\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{\textrm{hq}} + \mathbf{b}_\textrm{q}\)

Class FAQ

  • What are we covering in the course?
    • How architectures evolved
    • What are the foundational concepts
    • 6 broad topics - Math (losses, likelihoods), CNNs, RNNs, GNNs, Transformers, SSL
    • Why so much?
      • In a fast moving area, it’s perhaps better to know the field and study what’s required rather than know something in detail which is not used?
  • What are we not covering in the course?
    • Completely understanding an architecture/code involved. Recommendation: read papers/code
  • Why aren’t we coding more?
    • DL coding takes time. Spend more time in the project by delving into more detail.
  • What will I be able to do after the course?
    • Guess what architecture might be used for a certain application
    • Code 1 type of DL application (the project you do)
    • Read papers and judge how novel/relevant a paper is
  • Exam format
    • MCQ with negative marking and/or written questions

Class feedback form

  • Anonymous
  • Will be done after every class. Also, open after class but please refer to a specific class while giving feedback.
  • Link QR Code for class feedback

Use cases - Encoder Decoder

Encoder Decoder Architectures

Sequence2Sequence Learning
  • Teacher forcing
  • Loss?

Test time - encoder decoder

Test time
  • Beam Search

Beam Search

Metric for seq2seq models

\(BLEU = \exp\left(\min\left(0, 1 - \frac{\textrm{len}_{\textrm{label}}}{\textrm{len}_{\textrm{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}\)

  • BLEU = Bilingual Evaluation Understudy
  • Target = A, B, C, D, E, F
  • Prediction = A, B, B, C, D
  • \(p_n\) = precision of predicted n-grams in the original sequence. \(p_1 = 4/5, p_2 = 3/4, p_3 = 1/3, p_4 = 0\)
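
A short sketch computing BLEU for exactly this example (the helper is written from the formula above, not a library call); since \(p_4 = 0\) here, the overall score collapses to 0.

import math
from collections import Counter

def bleu(pred, label, k=4):
    score = math.exp(min(0, 1 - len(label) / len(pred)))   # brevity penalty
    for n in range(1, k + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        label_ngrams = Counter(tuple(label[i:i + n]) for i in range(len(label) - n + 1))
        matches = sum(min(c, label_ngrams[g]) for g, c in pred_ngrams.items())
        p_n = matches / max(len(pred) - n + 1, 1)
        score *= p_n ** (0.5 ** n)
    return score

print(bleu(list("ABBCD"), list("ABCDEF")))   # 0.0, because p_4 = 0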

Building applications in NLP

Structure

Pre-training/Tokenization

  • Modeling words is not enough

    • Let’s go sit on the bank
    • They looted a bank
  • Could we learn more contextualized embeddings?

  • What are embeddings?

    • Mapping words -> n-dimensional real number space

How to represent words?

  • One-hot?
    • Cosine similarity between any two distinct one-hot vectors is 0
      • Fails to capture similarity
  • Need representations to capture both similarity and context

Word2Vec

  • Skip-gram
    • Context window size = 2, center word = “loves”
    • ‘The man loves his son’
  • Continuous bag of words

Skip-gram

  • \(P(\textrm{"the"},\textrm{"man"},\textrm{"his"},\textrm{"son"}\mid\textrm{"loves"})\)
  • \(P(\textrm{"the"}\mid\textrm{"loves"})\cdot P(\textrm{"man"}\mid\textrm{"loves"})\cdot P(\textrm{"his"}\mid\textrm{"loves"})\cdot P(\textrm{"son"}\mid\textrm{"loves"})\)

probability

  • Model probability with softmax, \(P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\)

  • For multiple context words, \(\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)})\)

Training: Word2Vec

  • Maximize the log-likelihood (equivalently, minimize cross-entropy) \(\log P(w_o \mid w_c) =\mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right)\)

  • \(\begin{split}\begin{aligned}\frac{\partial \textrm{log}\, P(w_o \mid w_c)}{\partial \mathbf{v}_c}&= \mathbf{u}_o - \frac{\sum_{j \in \mathbb{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\\&= \mathbf{u}_o - \sum_{j \in \mathbb{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j\\&= \mathbf{u}_o - \sum_{j \in \mathbb{V}} P(w_j \mid w_c) \mathbf{u}_j.\end{aligned}\end{split}\)

Training Word2vec: CBOW

  • \(P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}{ \sum_{i \in \mathbb{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}}) \right)}\)
  • \(\frac{\partial \log\, P(w_c \mid \mathbb{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m} \left(\mathbf{u}_c - \sum_{j \in \mathbb{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{ \sum_{i \in \mathbb{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)} \right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathbb{V}} P(w_j \mid \mathbb{W}_o) \mathbf{u}_j \right)\)

Applications of NLP

POS Tagging

POS Tagger

Architecture

Source

Summarization

  • Extractive
  • Abstractive
  • Extractive summarization through ranking (TextRank)

  • Abstractive Summarization

Source

Sentiment Classification

Sentiment Classification

Source

Image Captioning

Image Captioning

Source

Visual QnA

Visual QnA

Source

Many others

  • Language Translation
  • Audio translation
  • Video classification
  • Human Activity Recognition
  • Traffic Prediction and Anomaly Detection
  • Weather forecasting
  • Stock market prediction etc…

Revisiting networks

Types of networks discussed so far

  • MLP - fixed length input, output
  • Convolutions - fixed array input, output
    • or at least fixed structure
  • RNN (LSTM, GRU, etc.)
    • fixed input per time step but variable timesteps
  • Graph Convolution Networks

The challenge of graphs

  • Variable Topology
    • Sizes and number of connections change often
    • How to design expressive functions?
  • Requires Scalable methods
    • Graphs can run into millions of nodes
  • There may only be a single monolithic graph (e.g., the Linkedin Graph)
    • Train on samples and test on others may not apply

Source: UDLBook

What is a graph?

The graphs of our lives

  • Nodes and edges
    • Typically sparse
  • More examples of graphs - social network, computer programs, protein interactions, scientific literature, geometric point clouds, images

Examples of graphs

Examples

  • Which of these can you process with DL?

Apart from the structure

  • Node embeddings (attributes)
    • Fixed length embedding of say, person name, age group, interests, etc.
  • Edge embeddings (attributes)
    • Traffic, number of lanes on the road, whether footpaths exist
  • The three graphical musketeers
    • \(A, X, E\)
    • Adjacency Matrix, Node Embedding, Edge Embedding

Graphical musketeers

Source: UDLBook

Properties of adjacency matrix

Source: UDLbook

Graph Permutations

  • Graph nodes are indexed arbitrarily. Therefore, the network should be invariant to every permutation of the node indices
  • How to express a permutation of a graph mathematically?
    • \(P\) matrix with each row containing only one 1
    • \(X' = XP\)
      • What happens if P(m, n) is 1?
    • \(A' = P^TAP\)
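
A small numerical sketch of these relations (a random graph, with node embeddings stored as columns of X as in UDLBook): permuting the node order re-indexes X and A consistently, and permutation-invariant quantities such as mean pooling are unchanged.

import torch

N, D = 5, 3
X = torch.randn(D, N)                      # node embeddings as columns
A = (torch.rand(N, N) > 0.7).float()
A = torch.triu(A, 1); A = A + A.T          # symmetric adjacency, no self-loops

perm = torch.randperm(N)
P = torch.eye(N)[:, perm]                  # permutation matrix: a single 1 per row and column

X_perm = X @ P                             # X' = XP
A_perm = P.T @ A @ P                       # A' = P^T A P

print(torch.equal(A_perm, A[perm][:, perm]))                # edges follow the same re-indexing: True
print(torch.allclose(X.mean(dim=1), X_perm.mean(dim=1)))    # mean pooling is invariant: True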

Tasks on “entire” graphs

  • Predict temperature for state change (Regression)
  • Predict if a molecule is poisonous (Classification)

Approach - Graph Task

  • For individual nodes, understand the “meaning & context” of the nodes
    • Take \(X\) and \(A\) as input -> pass through \(K\) layers -> hidden representations \(H_k, \forall k \in [1, K]\)
    • For the graph task,
      • Average all node embeddings (also called mean pooling)
      • Follow it up with a linear layer/MLP
      • Use BCE for classification/MSE for regression

      Classification

Approach - Node Task

  • For example, predict the part of the plane using the point cloud node
  • Node embeddings are used as input to classifier/regression stage directly
  • Same \(\beta_k\) and \(\omega_k\) are used across nodes

Per node optimization

Approach - Edge Prediction Task

  • For example, “friend” suggestions
  • A similarity metric for two nearby nodes to predict if an edge should ‘indeed’ exist
    • Dot product for similarity

Graph tasks

Requirements of equivariance and invariance

  • Equivariance/Covariance, e.g., image segmentation
    • f[t[x]] = t[f[x]]
  • Invariance, e.g., classification of a flip
    • f[t[x]] = f[x]
  • For graph tasks, mean pooling is ________ to permutation matrix \(P\)

Graph Convolutions

  • The \(K\) layers induce a relational inductive bias
    • e.g., predict plane part using nodes of point cloud

GCN

The convolution in GCN

  • Convolution: Aggregates information from neighbors
    • Message passing from neighbors
    • Fixed weights according to the location of neighbor in a CNN. What about graphs?
  • Uniformly weighted combination of all neighbors
    • Writing combined for all nodes
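
A minimal sketch of one such layer in the UDLBook style (node embeddings as columns; \(\Omega\), \(\beta\), and the sizes are placeholders): each node uniformly aggregates itself and its neighbours, then a shared linear map and a ReLU are applied.

import torch

def gcn_layer(H, A, Omega, beta):
    N = A.shape[0]
    agg = H @ (A + torch.eye(N))            # uniform sum over neighbours + self
    return torch.relu(Omega @ agg + beta)   # shared weights; beta broadcasts over nodes

N, D_in, D_out = 6, 8, 16
H = torch.randn(D_in, N)
A = (torch.rand(N, N) > 0.6).float(); A = torch.triu(A, 1); A = A + A.T
Omega, beta = torch.randn(D_out, D_in), torch.randn(D_out, 1)
print(gcn_layer(H, A, Omega, beta).shape)   # torch.Size([16, 6])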

Check design considerations

  • equivariant to permutations of node indices
  • can cope with any number of neighbors
  • exploits graph structure to provide relational inductive bias
  • shares weights across nodes

Example: Drug classification

  • 118 elements in the periodic table, N = number of nodes in graph
  • \(X \in R^{118\times N}\), \(A \in R^{N \times N}\), \(\Omega_0 \in R^{D \times 118}\)

Batching the compute

  • Given I training graphs \(\{X_i, A_i\}\), \(y_i\) labels, one can learn \(\{\Omega_k, \beta_k\}_{k=1}^K\) for the K layers with BCE loss and SGD
  • Batching:
    • Supergraph with modified adjacency matrix
    • Separate pooling per graph

Inductive vs transductive

  • Supervised vs semi-supervised (imagine labeling more nodes in the academic literature graph)
  • Graph tasks -> inductive
  • Node, edge tasks -> inductive or transductive

Node task example

  • Transductive task on a large graph - label each node
  • Impossible to fit the entire graph in memory, i.e., we must subselect
  • How to sample nodes? k-hop sampling

Smarter subsampling

  • Neighborhood sampling (randomly subsample from hops)
    • Think dropout
  • Graph partitioning
    • Think clustering graph nodes to simplify graphs or
    • Removing some edges which might be ‘weaker’

Many ways of combination with neighbors

  • Sum
  • (Learned) Weighted sum
  • Residual connections
  • Mean aggregation (instead of sum aggregation)

Edge graphs

  • Transform into a node graph and use

Unifying modalities

Attention

The impact of attention

  • 1980s to 2010s
    • CNNs and RNNs remain almost unchallenged
    • Improvements in compute, data storage, even optimization but minimal changes to architecture
  • Today
    • SOTA architectures on most things are Transformer Models

Intuition of attention

  • Consider the seq2seq model,
    • Instead of compressing information into a single vector, can we revisit input again?
    • while decoding (generating) different words, shouldn’t we be looking at different parts of the input?
  • How?
    • In the encoding step, generate a “representation” equal to the input length
    • At the decode time, use a weighted sum of the “right” input “representations” to “understand” the context
    • ideally, the process of finding the “right” inputs should be learnable

Queries, Keys and Values

Consider a database where you’re retrieving first names from last names

  • We can design queries that operate on (key, value) pairs in such a manner as to be valid regardless of the database size.
  • The same query can receive different answers, according to the contents of the database.
  • The “code” being executed for operating on a large state space (the database) can be quite simple (e.g., exact match, approximate match, top-k).
  • There is no need to compress or simplify the database to make the operations effective.

Retrieving values from a database

  • \(\mathbb{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots (\mathbf{k}_m, \mathbf{v}_m)\}\) is queried with a query \(q\)
  • \(\textrm{Attention}(\mathbf{q}, \mathbb{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,\)
    • For a typical database, \(\alpha\) = exact match
    • Could we design an \(\alpha\) that captures “relevance” in some learnable sense
    • Instead of returning a single value, we return a weighted sum of values

Properties of attention-based querying

  • Typically generates a linear sum over “relevant” values. Some special cases:

Attention Pooling Architecture

Scoring Function

  • Dot Product Attention
  • All keys are assumed to be zero mean, unit variance (normalization)
  • Ideally, we’d like the dot products to have unit variance as well.
  • Therefore, we normalize by \(1/\sqrt{d}\) where \(k \in R^d\)

\(\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(\mathbf{q}^\top \mathbf{k}_i / \sqrt{d})}{\sum_{j} \exp(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d})}\)

Handling variable length input for softmax

Input

Dive into Deep Learning
Learn to code blank
Hello world blank blank

Masked Softmax

  • For padded positions, the attention scores are set to \(-\infty\), so their softmax weights, and hence the contribution of their values \(v_i\), become 0
def masked_softmax(X, valid_lens):  #@save
    """Perform softmax operation by masking elements on the last axis."""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    def _sequence_mask(X, valid_len, value=0):
        maxlen = X.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=X.device)[None, :] < valid_len[:, None]
        X[~mask] = value
        return X
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = _sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

Matrix Multiplication for Dot Product Attention

  • n Queries, m key-value pairs in the database
  • \(Q \in R^{n\times d}\), \(K \in R^{m\times d}\), \(V \in R^{m\times v}\)
  • Note that Q and K are assumed to be the same dimension \(d\)
  • \(Attention = \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}\)
class DotProductAttention(nn.Module):  #@save
    """Scaled dot product attention."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    # Shape of queries: (batch_size, no. of queries, d)
    # Shape of keys: (batch_size, no. of key-value pairs, d)
    # Shape of values: (batch_size, no. of key-value pairs, value dimension)
    # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with keys.transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

Different dimensions for keys and queries?

  • Additive attention
    • \(a(\mathbf q, \mathbf k) = \mathbf w_v^\top \textrm{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R}\)
    • where \(\mathbf W_q\in\mathbb R^{h\times q}\), \(\mathbf W_k\in\mathbb R^{h\times k}\), and \(\mathbf w_v\in\mathbb R^{h}\)
class AdditiveAttention(nn.Module):  #@save
    """Additive attention."""
    def __init__(self, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.LazyLinear(num_hiddens, bias=False)
        self.W_q = nn.LazyLinear(num_hiddens, bias=False)
        self.w_v = nn.LazyLinear(1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of queries: (batch_size, no. of
        # queries, 1, num_hiddens) and shape of keys: (batch_size, 1, no. of
        # key-value pairs, num_hiddens). Sum them up with broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        # There is only one output of self.w_v, so we remove the last
        # one-dimensional entry from the shape. Shape of scores: (batch_size,
        # no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of values: (batch_size, no. of key-value pairs, value
        # dimension)
        return torch.bmm(self.dropout(self.attention_weights), values)

Limitations of the RNN

  • Too much is expected of the state
  • In Bahdanau attention, when predicting a token, if not all the input tokens are relevant, the model aligns (attends) only to the parts of the input sequence deemed relevant to the current prediction. This context is then used to update the current state before generating the next token.

Using attention to enhance state

\(\mathbf{c}_{t'} = \sum_{t=1}^{T} \alpha(\mathbf{s}_{t' - 1}, \mathbf{h}_{t}) \mathbf{h}_{t}\)

  • Identify query, key, value
class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.attention = d2l.AdditiveAttention(num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(d2l.init_seq2seq)

    def init_state(self, enc_outputs, enc_valid_lens):
        # Shape of outputs: (num_steps, batch_size, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        outputs, hidden_state = enc_outputs
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        enc_outputs, hidden_state, enc_valid_lens = state
        # Shape of the output X: (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        for x in X:
            # Shape of query: (batch_size, 1, num_hiddens)
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # Shape of context: (batch_size, 1, num_hiddens)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape x as (1, batch_size, embed_size + num_hiddens)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After fully connected layer transformation, shape of outputs:
        # (num_steps, batch_size, vocab_size)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights

Introducing the relevance in decoding

Dynamically choosing \(h_t\)s.

Attention in sequences

Link

Interpretability ++

Attention in Images

CNN + RNN + Attention

Attention in graphs

Limitations?

  • Only one type of relationship captured?
    • Relevance
  • The order of input isn’t captured

Multiple heads of attention

Mathematically, MHA

Self attention

  • In self-attention, the queries, keys, and values are all linear transformations of the same input X; with multiple heads, several such projections are learned in parallel. This is why it is called self-attention.

num_hiddens, num_heads = 100, 5
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))

Wait, what order was that in?

Positional Encodings

  • Space efficiency is of importance
  • Should be able to uniquely identify location
0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111
  • Represented using continuous functions for space efficiency

\(\begin{split} p_{i, 2j} = \sin\left(\frac{i}{10000^{2j/d}}\right), p_{i, 2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right).\end{split}\)
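
A short sketch generating these encodings (sequence length and dimension are arbitrary); even columns get the sine terms and odd columns the cosines.

import torch

def positional_encoding(num_positions, d):
    P = torch.zeros(num_positions, d)
    i = torch.arange(num_positions, dtype=torch.float32).reshape(-1, 1)
    freq = torch.pow(10000, torch.arange(0, d, 2, dtype=torch.float32) / d)
    P[:, 0::2] = torch.sin(i / freq)    # p_{i, 2j}
    P[:, 1::2] = torch.cos(i / freq)    # p_{i, 2j+1}
    return P

print(positional_encoding(num_positions=60, d=32).shape)   # torch.Size([60, 32])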

Understanding positional encoding

  • Encouraging unique encodings through decreasing frequencies and offsets

  • Learning different frequency information for long sequences

Relative positional encoding

  • Should be easy to learn ‘one token before, two tokens after’ kind of relationships

\(\begin{split}\begin{aligned} \begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \\ \end{bmatrix} \begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \\ \end{bmatrix} =&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \\ \end{bmatrix}\\ =&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\ \cos\left((i+\delta) \omega_j\right) \\ \end{bmatrix}\\ =& \begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \\ \end{bmatrix}, \end{aligned}\end{split}\)

Transformer

  • An architecture without recurrent connections that can capture sequential as well as long range information
  • Four elements
    • Multi-head self attention (MHA)
    • Feedforward Network (FFN)
    • Residual connections
    • Encoder-Decoder Attention
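
As a sketch of how these elements are packaged, PyTorch's built-in encoder block (hyperparameters here are arbitrary) combines multi-head self-attention and an FFN, each wrapped in a residual connection with layer normalization:

import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, dim_feedforward=256,
                                   batch_first=True)
x = torch.randn(2, 10, 128)    # (batch, sequence length, d_model)
print(layer(x).shape)          # torch.Size([2, 10, 128])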

Vision Transformer

Swin Transformer

  • Hierarchical feature maps for linear scaling with image size (see the Hierarchical Features & Shifted Windows figure)

Graph Transformer

Pros

  • Long range connections
  • Navigating the maze with the map
  • Node update still unrelated to graph structure

Cons

  • Identifiability of nodes (need better encodings)
  • Loss of relational inductive bias
  • Increased computational complexity

Source

Graph Transformer detailed

Multimodal (Unified) transformers

  • VQA: Visual Question Answering
  • SNLI-VE: Stanford Natural Language Inference - Visual Entailment
  • MNLI: Multi-Genre Natural Language Inference
  • QNLI: Question Natural Language Inference
  • QQP: Quora Question Pairs
  • SST-2: Stanford Sentiment Treebank - 2

Architecture

The unification

Source

Pre-training

Purpose

  • Improve generalization
  • Find ways to learn (pretext) tasks without labels such that “relevant” information about underlying data is captured in some “embedding”
    • Completing incomplete images
    • Solving puzzles
    • Correcting corrupted files/text
    • Finding signal from noise

Encoder-only

  • BERT (Bidirectional Encoding Representations from Transformers)
    • Masking words randomly
  • output embedding is projected for classification/regression tasks

Fine-tune BERT

Encoder-decoder: Pre-training T5

  • T5 = Text to text transfer transformer
    • Includes the “task” in the encoder, e.g., summarize..

Fine-tuning T5

Decoder only: GPT

Using GPT2 for different tasks without fine-tuning

Scalability of transformers

Image SSL

Source1, Video, Slides

preferred source -> Source2

Visual pretext tasks:

  • Relative positions
  • Puzzles
  • Predicting rotations

The secret ingredient

Currently,

  • Pretext Tasks
  • Tricks to ensure non-trivial solutions

Sometimes, these things go beyond imagination

  • OpenAI CLIP (Contrastive Language Image Pretraining)
    • SSL on 400 million text-image pairs
    • Closes the robustness gap by up to 75%
    • Matches supervised ImageNet classification accuracy without using ImageNet labels (zero-shot)

Built on

  • modern architectures like the Transformer
  • VirTex which explored autoregressive language modeling
  • ICMLM which investigated masked language modeling
  • ConVIRT, which studied the same contrastive objective used in CLIP, but in the field of medical imaging.

Architecture

  • Train time?
    • 256 GPUs for 2 weeks

Variants of SSL

  • Text pretraining
  • Classification and segmentation of images
  • Speech Recognition
  • CURL (Reinforcement Learning)
  • Drug Discovery
  • Learning from demonstrations

On satellite images

SatCLIP

Geospatial Computer Vision Group at Plaksha

  • SSL + Satellite images for
    • Agriculture
    • Urban Planning
  • What are the events that are frequently occurring phenomena at multiple places?
    • Agriculture: Crops growing and harvesting
    • Urban Infrastructure development
    • Socio-economic development
  • Could this “information/insight” be converted into value for a societal segment?
    • Farmers
    • City dwellers
    • Governments, etc.