Language Vision Model (LVM)

Classic Paper List

  • ResNet (Gives a way to train very deep networks with residual skip connections; a sketch follows)
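A minimal NumPy sketch of the residual connection, ResNet's core idea. The residual branch here is a toy linear-plus-ReLU stand-in; real ResNet blocks use two or three conv layers with batch norm:

```python
import numpy as np

def residual_block(x, f):
    # Learn a residual f(x) and add it back through a shortcut (skip)
    # connection: identity mappings become easy, so very deep nets train well.
    return x + f(x)

# Toy residual branch: a linear map followed by ReLU (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.01
y = residual_block(rng.standard_normal(64), lambda x: np.maximum(0.0, W @ x))
```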
  • Transformer (Gives a way to extract features with the self-attention mechanism)
    • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}_{row}\left(\frac{Q_{N \times D_k} K_{M \times D_k}^T}{\sqrt{D_k}}\right) V_{M \times D_v}$
      • $Q_{N \times D_k}$ holds the queries; $K_{M \times D_k}$ and $V_{M \times D_v}$ hold the keys and values ($M$ key–value pairs; in the original paper $D_v = D_k$).
      • Each row of $\mathrm{softmax}_{row}\left(\frac{Q K^T}{\sqrt{D_k}}\right)$ contains an individual weight for each value ($M$ weights for $M$ values).
    • It uses self-attention: in the encoder, the queries, keys, and values all come from the same input sequence (see the sketch after this item).
    • It can be used as a feature-extraction block: the input and output have the same dimension.
      • It uses multiple heads so there is something to learn (per-head projections), since the attention itself is just dot products between queries and keys.
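A minimal NumPy sketch of the formula above (a single head, no learned projections; shapes follow the notation in the bullet points):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (N, D_k) queries; K: (M, D_k) keys; V: (M, D_v) values.
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)              # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)           # row-wise: M weights per query
    return weights @ V                           # (N, D_v) weighted sums of values

# Self-attention: queries, keys, and values are all the same input sequence,
# so the output has the same shape as the input.
X = np.random.randn(5, 64)                       # 5 tokens, 64-dim features
out = attention(X, X, X)                         # (5, 64)
```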
  • BERT (Gives a way to pre-train an LLM by predicting masked words; see the masking sketch after this item)
    • It is pre-trained on a large text corpus and then fine-tuned for downstream tasks.
    • It uses WordPiece tokenization.
    • Unlike machine translation (which decodes using single-direction information), a pre-trained encoder can use bidirectional context.
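A sketch of BERT-style masking. The 80/10/10 split is from the paper; the concrete ids below are illustrative (103 and 30522 match bert-base-uncased's [MASK] id and vocab size):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    # Pick ~15% of positions as prediction targets. Of those, 80% become
    # [MASK], 10% become a random token, and 10% keep the original token.
    if rng is None:
        rng = np.random.default_rng(0)
    ids = token_ids.copy()
    targets = rng.random(len(ids)) < mask_prob
    for i in np.flatnonzero(targets):
        r = rng.random()
        if r < 0.8:
            ids[i] = mask_id
        elif r < 0.9:
            ids[i] = rng.integers(vocab_size)
        # else: leave the token unchanged
    return ids, targets  # the model predicts the original ids where targets is True

ids = np.array([101, 2023, 2003, 1037, 7953, 102])       # illustrative WordPiece ids
masked, targets = mask_tokens(ids, mask_id=103, vocab_size=30522)
```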
  • ViT (Gives a way to use image patches as input tokens for transformer blocks; see the patchify sketch below)
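A sketch of how an image becomes a token sequence (ViT-Base numbers: 16×16 patches of a 224×224 RGB image give 196 tokens of dimension 768):

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into non-overlapping patches and flatten each
    # one; the result is the token sequence fed into the transformer blocks.
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # (num_patches, p*p*C)

tokens = patchify(np.random.rand(224, 224, 3))   # (196, 768)
# A learned linear projection then maps each patch vector to the model width,
# and a position embedding is added.
```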
  • MAE (Gives a way to pre-train large vision models with masked-image modeling without running out of memory: the encoder only processes the visible patches, as sketched below)
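A sketch of MAE's random masking; only the visible ~25% of patches go through the large encoder, which is where the memory savings come from:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    # Keep a random subset of patches (25% by default). The encoder runs on
    # these alone; a light decoder later reconstructs the masked patches.
    if rng is None:
        rng = np.random.default_rng(0)
    n = patches.shape[0]
    keep = int(n * (1 - mask_ratio))
    visible_idx = np.sort(rng.permutation(n)[:keep])
    return patches[visible_idx], visible_idx

visible, idx = random_masking(np.random.rand(196, 768))  # 49 visible patches
```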

  • Swin Transformer (Gives a way to build hierarchical vision features with shifted-window self-attention)
  • CLIP (Gives a way to align image and text embeddings with contrastive pre-training on image–text pairs; a loss sketch follows)
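A NumPy sketch of the symmetric contrastive loss over a batch of B matched image–text pairs (the temperature value is illustrative):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, compute all pairwise cosine similarities, and treat the
    # diagonal (the true pairs) as the targets of a softmax classification,
    # in both the image->text and text->image directions.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B)
    labels = np.arange(logits.shape[0])

    def xent(l):                                 # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # pick the diagonal entries

    return (xent(logits) + xent(logits.T)) / 2

loss = clip_loss(np.random.randn(8, 512), np.random.randn(8, 512))
```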

  • GPT (Gives a way to pre-train an LLM autoregressively: predict the next token with causal attention, as sketched below)
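A sketch of the autoregressive setup: shift the sequence by one to get next-token targets, and use a causal mask so no position sees the future:

```python
import numpy as np

tokens = np.array([7, 42, 3, 99, 5])             # illustrative token ids
inputs, targets = tokens[:-1], tokens[1:]        # predict the next token everywhere

# Causal mask, added to the attention scores before the softmax:
# 0 on and below the diagonal, -inf strictly above it.
T = len(inputs)
causal_mask = np.triu(np.full((T, T), -np.inf), k=1)
```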

  • DALL-E 2 (unCLIP) (Gives a way to generate images from text: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder turns it into an image)
  • ViLT
    • It combines ideas from BERT (language features) and ViT (visual features): a single transformer runs over word embeddings and image-patch embeddings, with no convolutional region features.
  • Auto Encoder (AE)
    • An encoder compresses an image into a latent code, and a decoder reconstructs the original image from that code (the sketch after the MAE item below covers AE, DAE, and MAE together).
  • De-noising Auto Encoder (DAE)
    • Adds noise to the input image while the reconstruction target is still the clean original.
  • Masked Auto Encoder (MAE)
    • Masks parts of the input image while the reconstruction target is still the original image.
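One sketch covers all three variants above (a toy one-layer encoder/decoder with illustrative dimensions; the training loop is omitted). Only the input changes between AE, DAE, and MAE; the target is always the clean image:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((784, 32)) * 0.01    # encoder weights (28x28 images)
W_dec = rng.standard_normal((32, 784)) * 0.01    # decoder weights

def encode(x): return np.maximum(0.0, x @ W_enc) # image -> low-dim latent code
def decode(z): return z @ W_dec                  # latent code -> reconstruction

x = rng.random((16, 784))                        # a batch of flattened images
x_noisy = x + 0.1 * rng.standard_normal(x.shape) # DAE input: add noise
x_masked = x * (rng.random(x.shape) > 0.5)       # MAE input: mask out regions
# The loss always compares the reconstruction against the CLEAN x:
loss = np.mean((decode(encode(x_noisy)) - x) ** 2)
```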
  • Variational Auto Encoder (VAE)
    • The encoder outputs the parameters of a Gaussian distribution ($\mu$, $\Sigma$); the decoder reconstructs from a latent vector sampled from it (see the sketch below).
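A sketch of the two pieces that distinguish a VAE, assuming a diagonal covariance so $\Sigma$ is represented as a log-variance vector:

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    # Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps with
    # eps ~ N(0, I): the reparameterization trick keeps sampling differentiable.
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, I)): the regularizer added to the
    # reconstruction loss, per example.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```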
  • Vector Quantised-Variational AutoEncoder (VQ-VAE)
    • The encoder output is quantised into a feature-index map (the features live in a $K \times D$ codebook matrix) instead of a continuous distribution (see the sketch below).
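A sketch of the quantisation step: a nearest-neighbour lookup into the $K \times D$ codebook (the K and D values below are illustrative):

```python
import numpy as np

def quantize(z, codebook):
    # Map each D-dim feature vector to its nearest codebook entry.
    # z: (..., D) encoder features; codebook: (K, D).
    flat = z.reshape(-1, z.shape[-1])                             # (N, D)
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return idx.reshape(z.shape[:-1]), codebook[idx].reshape(z.shape)

codebook = np.random.randn(512, 64)              # K=512 entries of dimension D=64
indices, z_q = quantize(np.random.randn(8, 8, 64), codebook)
# indices is the (8, 8) feature-index map; z_q is the quantised feature grid.
```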
  • Diffusion Model
    • Gradually adds noise in a forward process, then learns to de-noise step by step (a sketch of the forward process follows this list).
    • The DNN is trained to predict the noise instead of the image.
    • Uses a U-Net with skip connections, a time embedding (indicating the current de-noising step), and attention (transformer) blocks.
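A sketch of the DDPM forward process and the noise-prediction target (schedule values from the DDPM paper; the U-Net itself is omitted):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)              # cumulative signal fraction

def add_noise(x0, t, rng=None):
    # q(x_t | x_0): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = np.random.rand(32, 32, 3)                   # a clean image
xt, eps = add_noise(x0, t=500)
# Training loss: || unet(xt, t) - eps ||^2, i.e. the network predicts the noise.
```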
This blog was converted from language-vision-model.ipynb
Written on April 1, 2023