Language Vision Model (LVM)
Classic Paper List
- ResNet
- Transformer (gives a way for feature extraction with the self-attention mechanism)
- $Attention(Q, K, V) = Softmax_{row}\left(\frac{Q_{N \times D_k} K_{M \times D_k}^T}{\sqrt{D_k}}\right) V_{M \times D_k}$
- $Q_{N \times D_k}$ is the query; $K_{M \times D_k}$ and $V_{M \times D_k}$ are the keys and values, which share the same dimension.
- Each row of $\frac{Q_{N \times D_k} K_{M \times D_k}^T}{\sqrt{D_k}}$ contains an individual weight ($M$ weights), one for each of the $M$ values.
- It uses self-attention: in the encoder module the keys, values, and queries all come from the same input.
- It can be used as a feature-extraction block; the input and output have the same dimension.
- It uses multiple heads so there is something to learn, since a single attention head is just dot products between queries and keys. Each head applies its own learned projections before attention:
- $MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h) W^O$, where $head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)$
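As a concrete sketch, the scaled dot-product attention formula above can be written in a few lines of NumPy (the sizes `N`, `M`, `D_k` and the random inputs are arbitrary choices for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_k)
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)     # (N, M): one weight per value
    weights = softmax(scores, axis=-1)  # row-wise softmax
    return weights @ V                  # (N, D_k)

rng = np.random.default_rng(0)
N, M, D_k = 4, 6, 8
Q = rng.normal(size=(N, D_k))
K = rng.normal(size=(M, D_k))
V = rng.normal(size=(M, D_k))

out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output row per query
```

For self-attention, simply pass the same matrix as `Q`, `K`, and `V`.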
- BERT (gives a way for pre-training language models with masked words)
- It is used as a model pre-trained on a large language dataset.
- It uses WordPiece tokenization.
- Unlike machine translation (which uses single-direction information), a pre-trained model can use bidirectional information.
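A minimal sketch of the masked-word objective (simplified: real BERT also sometimes swaps a chosen token for a random token or leaves it unchanged; the 15% rate and `[MASK]` string follow the paper, while the helper name and seed are made up for illustration):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    # BERT-style pre-training target (simplified): hide a fraction of
    # tokens and ask the model to predict the originals, using context
    # from both the left and the right (bidirectional information).
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)   # no loss on unmasked positions
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked)  # → ['[MASK]', 'cat', 'sat', 'on', 'the', 'mat']
```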
- ViT (gives a way for using image patches as input to transformer blocks)
- MAE (gives a way for masked-image pre-training of large vision models without running out of memory, since the encoder only processes the unmasked patches)
- It combines ideas from BERT (masked pre-training of language features) and ViT (visual features)
- Swin Transformer
- CLIP
- GPT
- DALL-E2 (unCLIP)
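The patch idea behind ViT (and reused by MAE) can be sketched as follows; the image size, patch size, and function name here are illustrative assumptions:

```python
import numpy as np

def patchify(img, patch=4):
    # Split an (H, W, C) image into non-overlapping patches and flatten
    # each patch into a vector; ViT feeds these vectors (plus position
    # embeddings) into standard transformer blocks as a token sequence.
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    rows, cols = H // patch, W // patch
    out = np.zeros((rows * cols, patch * patch * C))
    for i in range(rows):
        for j in range(cols):
            block = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch, :]
            out[i * cols + j] = block.reshape(-1)
    return out

img = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
seq = patchify(img, patch=4)
print(seq.shape)  # (4, 48): 4 patches, each a 4*4*3 = 48-dim token
```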
- Auto-Encoder (AE)
- An encoder encodes an image into a compact code; a decoder reconstructs the original image from it.
- De-noising Auto-Encoder (DAE)
- Noise is added to the input image while the target is still the original image.
- Masked Auto-Encoder (MAE)
- Masks are applied to the input image while the target is still the original image.
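The DAE and MAE objectives differ only in how the input is corrupted; a toy NumPy sketch (the image size, noise scale, and the ~75% mask ratio reported by the MAE paper are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))              # stand-in for a clean input image

# DAE-style corruption: add Gaussian noise; the reconstruction target
# is still the clean image.
noisy = img + rng.normal(scale=0.1, size=img.shape)

# MAE-style corruption: zero out a large random subset of positions
# (here pixels, in MAE whole patches); the target is again the clean image.
mask = rng.random(img.shape) < 0.75
masked = np.where(mask, 0.0, img)

print(noisy.shape, masked.shape)
```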
- Variational Auto-Encoder (VAE)
- The encoder generates a Gaussian distribution ($\mu, \Sigma$) instead of a fixed code; latent codes are sampled from it and then decoded.
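A minimal sketch of what that encoder output is used for, assuming a diagonal Gaussian (the concrete `mu`/`log_var` values are made up): sampling via the reparameterization trick, plus the KL regularizer against the standard normal prior from the VAE loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input (diagonal Gaussian).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: z = mu + sigma * eps, so gradients can flow
# through mu and sigma; z would then be fed to the decoder.
eps = rng.normal(size=mu.shape)
z = mu + sigma * eps

# KL divergence between N(mu, sigma^2) and the standard normal prior,
# the regularization term added to the reconstruction loss.
kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - log_var)
print(z.shape, kl)
```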
- Vector Quantised-Variational AutoEncoder (VQ-VAE)
- The encoder generates a feature-index map (the features live in a $K \times D$ codebook matrix) instead of a continuous distribution.
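The quantization step can be sketched in NumPy (the codebook size $K = 16$, $D = 4$, and the 3x3 feature grid are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 16, 4                          # codebook: K vectors of dimension D
codebook = rng.normal(size=(K, D))

# Pretend the encoder produced a 3x3 grid of D-dim feature vectors.
features = rng.normal(size=(3, 3, D))

# Quantization: replace each feature with the index of its nearest
# codebook entry (squared Euclidean distance), giving the index map.
flat = features.reshape(-1, D)                                 # (9, D)
d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (9, K)
index_map = d2.argmin(axis=1).reshape(3, 3)

quantized = codebook[index_map]       # (3, 3, D), fed to the decoder
print(index_map.shape, quantized.shape)
```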
- Diffusion Model
- Noise is added step by step, and then removed step by step (de-noising).
- A DNN is used to predict the noise instead of the image.
- It uses a U-Net with skip connections + a time embedding (indicating which de-noising step) + attention (transformer blocks).
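The forward (noise-adding) process has a closed form, which is what makes training convenient; a sketch with a linear beta schedule (the endpoints and $T = 1000$ follow common DDPM settings, but are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal fraction

x0 = rng.random((8, 8))               # stand-in for a clean image

def q_sample(x0, t, eps):
    # Closed form of the forward process:
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    # Training asks the DNN to predict eps from (x_t, t), where t is
    # supplied through the time embedding mentioned above.
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.normal(size=x0.shape)
x_t = q_sample(x0, t=500, eps=eps)
print(x_t.shape)  # (8, 8): same shape, but mostly noise at large t
```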
This blog is converted from language-vision-model.ipynb
Written on April 1, 2023