- Simple and Effective Architecture: Our tokenizer is a simple VAE + ViT encoder.
- Unified Visual Representations: A single visual representation serves both image understanding and image generation.
- Superior Generation Performance: Our gFID on ImageNet is 1.89 (lower is better), compared with 2.54 for CLIP+RAE, 2.06 for VAE+SiT, and 2.51 for UniTok+LlamaGen.
- Remarkable Reconstruction Performance: OpenVision 3 achieves an rFID of 0.216 on ImageNet 256×256.
- Competitive Understanding Ability: Our tokenizer performs comparably to CLIP (62.4 vs. 62.2 on SeedBench, 83.7 vs. 82.9 on POPE).
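
To put the generation numbers in perspective, the short sketch below computes how much lower our gFID is relative to each baseline. It uses only the gFID values quoted in the list above; the helper names (`relative_reduction`, `summarize`) are illustrative and not part of any released code.

```python
# gFID values as quoted in the highlights above (lower is better).
OURS = 1.89
BASELINES = {
    "CLIP+RAE": 2.54,
    "VAE+SiT": 2.06,
    "UniTok+LlamaGen": 2.51,
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Return the fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


def summarize(ours: float, baselines: dict) -> dict:
    """Map each baseline name to the relative gFID reduction, in percent (1 decimal)."""
    return {
        name: round(100 * relative_reduction(ours, value), 1)
        for name, value in baselines.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(OURS, BASELINES).items():
        print(f"{name}: gFID {BASELINES[name]} -> {OURS} ({pct}% lower)")
```

With the quoted values, this works out to roughly 25.6% lower than CLIP+RAE, 8.3% lower than VAE+SiT, and 24.7% lower than UniTok+LlamaGen.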