OpenVision 3 encodes VAE latents (instead of pixels) and trains the resulting representation with two complementary signals:
- Reconstruction loss → preserves low-level visual detail
- Contrastive + Captioning losses → injects high-level semantics
Why feed VAE latents to a ViT? → Understanding and Generation help each other!
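The three training signals above can be sketched as one combined objective. This is a minimal NumPy illustration, not the paper's actual code: the tensor shapes, the equal loss weights, and the toy decoder output are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(pred, target):
    # Reconstruction loss: preserves low-level visual detail in the VAE latents.
    return np.mean((pred - target) ** 2)

def info_nce(img_emb, txt_emb, temperature=0.07):
    # Contrastive loss (CLIP-style): aligns image and text embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    n = len(logits)
    return -np.mean(log_probs[np.arange(n), np.arange(n)])  # matched pairs on the diagonal

def caption_ce(logits, tokens):
    # Captioning loss: cross-entropy of the predicted caption tokens.
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(tokens)), tokens])

# Toy batch: 4 images whose VAE latents are 8-dim vectors (shapes are illustrative).
latents    = rng.normal(size=(4, 8))
recon      = latents + 0.1 * rng.normal(size=(4, 8))  # stand-in for the decoder output
img_emb    = rng.normal(size=(4, 16))                 # ViT image embeddings
txt_emb    = rng.normal(size=(4, 16))                 # text embeddings
cap_logits = rng.normal(size=(4, 32))                 # caption logits over a toy 32-token vocab
cap_tokens = rng.integers(0, 32, size=4)

# Combined objective: reconstruction + contrastive + captioning (equal weights assumed).
total = mse(recon, latents) + info_nce(img_emb, txt_emb) + caption_ce(cap_logits, cap_tokens)
```

In practice each term would carry its own weight and be backpropagated through the ViT; the point here is only that low-level reconstruction and high-level semantic losses share one encoder.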
-
Empirically, we observe that this design unlocks a non-trivial synergy:
- Training on Understanding alone actually improves pixel reconstruction.
- Training on Generation alone benefits semantic alignment.