OpenVision 3 encodes VAE latents (instead of pixels) and trains the resulting representation with two complementary signals:
- Reconstruction loss → preserves low-level visual detail
- Contrastive + Captioning losses → injects high-level semantics
Why feed VAE latents to a ViT? → Understanding and Generation help each other!
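The three training signals above can be sketched as one combined objective. This is a minimal NumPy illustration, not the paper's actual code: the tensor shapes, the equal loss weights, and the toy decoder output are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(pred, target):
    # Reconstruction loss: preserves low-level visual detail in the VAE latents.
    return np.mean((pred - target) ** 2)

def info_nce(img_emb, txt_emb, temperature=0.07):
    # Contrastive loss (CLIP-style): aligns image and text embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    n = len(logits)
    return -np.mean(log_probs[np.arange(n), np.arange(n)])  # matched pairs on the diagonal

def caption_ce(logits, tokens):
    # Captioning loss: cross-entropy of the predicted caption tokens.
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(tokens)), tokens])

# Toy batch: 4 images whose VAE latents are 8-dim vectors (shapes are illustrative).
latents    = rng.normal(size=(4, 8))
recon      = latents + 0.1 * rng.normal(size=(4, 8))  # stand-in for the decoder output
img_emb    = rng.normal(size=(4, 16))                 # ViT image embeddings
txt_emb    = rng.normal(size=(4, 16))                 # text embeddings
cap_logits = rng.normal(size=(4, 32))                 # caption logits over a toy 32-token vocab
cap_tokens = rng.integers(0, 32, size=4)

# Combined objective: reconstruction + contrastive + captioning (equal weights assumed).
total = mse(recon, latents) + info_nce(img_emb, txt_emb) + caption_ce(cap_logits, cap_tokens)
```

In practice each term would carry its own weight and be backpropagated through the ViT; the point here is only that low-level reconstruction and high-level semantic losses share one encoder.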
-
Empirically, we observe that this design unlocks a non-trivial synergy:
- Training on Understanding alone actually improves pixel reconstruction.
- Training on Generation alone benefits semantic alignment.