Language Vision Model (LVM)

Classic Paper List

  • ResNet (Gives a way to train very deep networks with residual skip connections; a sketch follows)
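A minimal NumPy sketch of the residual connection, ResNet's core idea. The residual branch here is a toy linear-plus-ReLU stand-in; real ResNet blocks use two or three conv layers with batch norm:

```python
import numpy as np

def residual_block(x, f):
    # Learn a residual f(x) and add it back through a shortcut (skip)
    # connection: identity mappings become easy, so very deep nets train well.
    return x + f(x)

# Toy residual branch: a linear map followed by ReLU (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.01
y = residual_block(rng.standard_normal(64), lambda x: np.maximum(0.0, W @ x))
```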
  • Transformer (Gives a way to extract features with the self-attention mechanism)
    • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}_{row}\left(\frac{Q_{N \times D_k} K_{M \times D_k}^T}{\sqrt{D_k}}\right) V_{M \times D_v}$
      • $Q_{N \times D_k}$ holds the queries; $K_{M \times D_k}$ and $V_{M \times D_v}$ hold the keys and values ($M$ key–value pairs; in the original paper $D_v = D_k$).
      • Each row of $\mathrm{softmax}_{row}\left(\frac{Q K^T}{\sqrt{D_k}}\right)$ contains an individual weight for each value ($M$ weights for $M$ values).
    • It uses self-attention: in the encoder, the queries, keys, and values all come from the same input sequence (see the sketch after this item).
    • It can be used as a feature-extraction block: the input and output have the same dimension.
      • It uses multiple heads so there is something to learn (per-head projections), since the attention itself is just dot products between queries and keys.
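A minimal NumPy sketch of the formula above (a single head, no learned projections; shapes follow the notation in the bullet points):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (N, D_k) queries; K: (M, D_k) keys; V: (M, D_v) values.
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)              # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)           # row-wise: M weights per query
    return weights @ V                           # (N, D_v) weighted sums of values

# Self-attention: queries, keys, and values are all the same input sequence,
# so the output has the same shape as the input.
X = np.random.randn(5, 64)                       # 5 tokens, 64-dim features
out = attention(X, X, X)                         # (5, 64)
```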
  • BERT (Gives a way to pre-train an LLM by predicting masked words; see the masking sketch after this item)
    • It is pre-trained on a large text corpus and then fine-tuned for downstream tasks.
    • It uses WordPiece tokenization.
    • Unlike machine translation (which decodes using single-direction information), a pre-trained encoder can use bidirectional context.
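A sketch of BERT-style masking. The 80/10/10 split is from the paper; the concrete ids below are illustrative (103 and 30522 match bert-base-uncased's [MASK] id and vocab size):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    # Pick ~15% of positions as prediction targets. Of those, 80% become
    # [MASK], 10% become a random token, and 10% keep the original token.
    if rng is None:
        rng = np.random.default_rng(0)
    ids = token_ids.copy()
    targets = rng.random(len(ids)) < mask_prob
    for i in np.flatnonzero(targets):
        r = rng.random()
        if r < 0.8:
            ids[i] = mask_id
        elif r < 0.9:
            ids[i] = rng.integers(vocab_size)
        # else: leave the token unchanged
    return ids, targets  # the model predicts the original ids where targets is True

ids = np.array([101, 2023, 2003, 1037, 7953, 102])       # illustrative WordPiece ids
masked, targets = mask_tokens(ids, mask_id=103, vocab_size=30522)
```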
  • ViT (Gives a way to use image patches as input tokens for transformer blocks; see the patchify sketch below)
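A sketch of how an image becomes a token sequence (ViT-Base numbers: 16×16 patches of a 224×224 RGB image give 196 tokens of dimension 768):

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into non-overlapping patches and flatten each
    # one; the result is the token sequence fed into the transformer blocks.
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # (num_patches, p*p*C)

tokens = patchify(np.random.rand(224, 224, 3))   # (196, 768)
# A learned linear projection then maps each patch vector to the model width,
# and a position embedding is added.
```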
  • MAE (Gives a way to pre-train large vision models with masked-image modeling without running out of memory: the encoder only processes the visible patches, as sketched below)
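A sketch of MAE's random masking; only the visible ~25% of patches go through the large encoder, which is where the memory savings come from:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    # Keep a random subset of patches (25% by default). The encoder runs on
    # these alone; a light decoder later reconstructs the masked patches.
    if rng is None:
        rng = np.random.default_rng(0)
    n = patches.shape[0]
    keep = int(n * (1 - mask_ratio))
    visible_idx = np.sort(rng.permutation(n)[:keep])
    return patches[visible_idx], visible_idx

visible, idx = random_masking(np.random.rand(196, 768))  # 49 visible patches
```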

  • Swin Transformer (Gives a way to build hierarchical vision features with shifted-window self-attention)
  • CLIP (Gives a way to align image and text embeddings with contrastive pre-training on image–text pairs; a loss sketch follows)
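A NumPy sketch of the symmetric contrastive loss over a batch of B matched image–text pairs (the temperature value is illustrative):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, compute all pairwise cosine similarities, and treat the
    # diagonal (the true pairs) as the targets of a softmax classification,
    # in both the image->text and text->image directions.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B)
    labels = np.arange(logits.shape[0])

    def xent(l):                                 # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # pick the diagonal entries

    return (xent(logits) + xent(logits.T)) / 2

loss = clip_loss(np.random.randn(8, 512), np.random.randn(8, 512))
```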

  • GPT (Gives a way to pre-train an LLM autoregressively: predict the next token with causal attention, as sketched below)
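A sketch of the autoregressive setup: shift the sequence by one to get next-token targets, and use a causal mask so no position sees the future:

```python
import numpy as np

tokens = np.array([7, 42, 3, 99, 5])             # illustrative token ids
inputs, targets = tokens[:-1], tokens[1:]        # predict the next token everywhere

# Causal mask, added to the attention scores before the softmax:
# 0 on and below the diagonal, -inf strictly above it.
T = len(inputs)
causal_mask = np.triu(np.full((T, T), -np.inf), k=1)
```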

  • DALL-E 2 (unCLIP) (Gives a way to generate images from text: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder turns it into an image)
  • ViLT
    • It combines ideas from BERT (language features) and ViT (visual features): a single transformer runs over word embeddings and image-patch embeddings, with no convolutional region features.
  • Auto Encoder (AE)
    • An encoder compresses an image into a latent code, and a decoder reconstructs the original image from that code (the sketch after the MAE item below covers AE, DAE, and MAE together).
  • De-noising Auto Encoder (DAE)
    • Adds noise to the input image while the reconstruction target is still the clean original.
  • Masked Auto Encoder (MAE)
    • Masks parts of the input image while the reconstruction target is still the original image.
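One sketch covers all three variants above (a toy one-layer encoder/decoder with illustrative dimensions; the training loop is omitted). Only the input changes between AE, DAE, and MAE; the target is always the clean image:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((784, 32)) * 0.01    # encoder weights (28x28 images)
W_dec = rng.standard_normal((32, 784)) * 0.01    # decoder weights

def encode(x): return np.maximum(0.0, x @ W_enc) # image -> low-dim latent code
def decode(z): return z @ W_dec                  # latent code -> reconstruction

x = rng.random((16, 784))                        # a batch of flattened images
x_noisy = x + 0.1 * rng.standard_normal(x.shape) # DAE input: add noise
x_masked = x * (rng.random(x.shape) > 0.5)       # MAE input: mask out regions
# The loss always compares the reconstruction against the CLEAN x:
loss = np.mean((decode(encode(x_noisy)) - x) ** 2)
```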
  • Variational Auto Encoder (VAE)
    • The encoder outputs the parameters of a Gaussian distribution ($\mu$, $\Sigma$); the decoder reconstructs from a latent vector sampled from it (see the sketch below).
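A sketch of the two pieces that distinguish a VAE, assuming a diagonal covariance so $\Sigma$ is represented as a log-variance vector:

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    # Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps with
    # eps ~ N(0, I): the reparameterization trick keeps sampling differentiable.
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, I)): the regularizer added to the
    # reconstruction loss, per example.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```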
  • Vector Quantised-Variational AutoEncoder (VQ-VAE)
    • The encoder output is quantised into a feature-index map (the features live in a $K \times D$ codebook matrix) instead of a continuous distribution (see the sketch below).
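A sketch of the quantisation step: a nearest-neighbour lookup into the $K \times D$ codebook (the K and D values below are illustrative):

```python
import numpy as np

def quantize(z, codebook):
    # Map each D-dim feature vector to its nearest codebook entry.
    # z: (..., D) encoder features; codebook: (K, D).
    flat = z.reshape(-1, z.shape[-1])                             # (N, D)
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return idx.reshape(z.shape[:-1]), codebook[idx].reshape(z.shape)

codebook = np.random.randn(512, 64)              # K=512 entries of dimension D=64
indices, z_q = quantize(np.random.randn(8, 8, 64), codebook)
# indices is the (8, 8) feature-index map; z_q is the quantised feature grid.
```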
  • Diffusion Model
    • Gradually adds noise in a forward process, then learns to de-noise step by step (a sketch of the forward process follows this list).
    • The DNN is trained to predict the noise instead of the image.
    • Uses a U-Net with skip connections, a time embedding (indicating the current de-noising step), and attention (transformer) blocks.
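A sketch of the DDPM forward process and the noise-prediction target (schedule values from the DDPM paper; the U-Net itself is omitted):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)              # cumulative signal fraction

def add_noise(x0, t, rng=None):
    # q(x_t | x_0): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = np.random.rand(32, 32, 3)                   # a clean image
xt, eps = add_noise(x0, t=500)
# Training loss: || unet(xt, t) - eps ||^2, i.e. the network predicts the noise.
```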
This blog was converted from language-vision-model.ipynb
Written on April 1, 2023