OpenVision 3

A Family of Unified Visual Encoder for Both Understanding and Generation

Letian Zhang*1 · Sucheng Ren*2 · Yanqing Liu1 · Xianhang Li1 · Zeyu Wang1 · Yuyin Zhou1 · Huaxiu Yao3 · Zeyu Zheng4 · Weili Nie5 · Guilin Liu5 · Zhiding Yu5 · Cihang Xie1

1UC Santa Cruz · 2JHU · 3UNC-Chapel Hill · 4UC Berkeley · 5NVIDIA

OpenVision 3 Teaser

Abstract

This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
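To make the data flow concrete, below is a minimal PyTorch-style sketch of the dual-objective training step described above. It is illustrative only, not the released OpenVision 3 code: the module interfaces (vae, vit_encoder, vit_decoder, text_encoder, caption_head), the pooling, the temperature, and the loss weighting are placeholder assumptions; only the overall structure (VAE latents → shared ViT representation → reconstruction + contrastive + captioning objectives) follows the description above.

    import torch
    import torch.nn.functional as F

    def training_step(vae, vit_encoder, vit_decoder, text_encoder, caption_head,
                      images, captions, recon_weight=1.0):
        # 1) Compress images into VAE latents; these are the encoder's input.
        with torch.no_grad():
            latents = vae.encode(images)                  # [B, C, H/8, W/8]

        # 2) Shared visual representation from the ViT encoder.
        visual_tokens = vit_encoder(latents)              # [B, N, D]

        # 3) Generative branch: a ViT decoder reconstructs the original image.
        recon = vit_decoder(visual_tokens)                # [B, 3, H, W]
        loss_recon = F.mse_loss(recon, images)

        # 4) Semantic branch: CLIP-style contrastive loss + image captioning.
        img_emb = F.normalize(visual_tokens.mean(dim=1), dim=-1)  # pooled image embedding
        txt_emb = F.normalize(text_encoder(captions), dim=-1)     # [B, D]
        logits = img_emb @ txt_emb.t() / 0.07                     # assumed temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_contrast = 0.5 * (F.cross_entropy(logits, targets) +
                               F.cross_entropy(logits.t(), targets))
        loss_caption = caption_head(visual_tokens, captions)      # autoregressive NLL

        # Jointly optimize reconstruction- and semantics-driven signals.
        return recon_weight * loss_recon + loss_contrast + loss_caption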

Key Contributions

  • Simple and Effective Architecture: Our tokenizer is a simple VAE + ViT encoder.
  • Unified Visual Representations: A single, unified visual representation that can serve both image understanding and image generation.
  • Superior Generation Performance: Our gFID is 1.89 on ImageNet, substantially better than CLIP+RAE (2.54), SD-VAE+SiT (2.06), and UniTok+LlamaGen (2.51).
  • Remarkable Reconstruction Performance: OpenVision 3 achieves 0.216 rFID on ImageNet 256x256.
  • Competitive Understanding Ability: Our tokenizer performs comparably to CLIP (62.4 vs 62.2 on SeedBench, 83.7 vs 82.9 on POPE).

Performance on Generation, Reconstruction and Understanding

Reconstruction Performance Comparison

OpenVision 3 outperforms existing unified tokenizers across all metrics on ImageNet 256x256.
Even compared with specialized generation-oriented tokenizers, our model remains competitive or better.

Class-conditional Image Generation Performance

OpenVision 3 achieves higher generation fidelity under the RAE framework on ImageNet than SD-VAE+SiT, UniTok+LlamaGen, CLIP+RAE, and OpenVision+RAE.

Understanding Performance under LLaVA-1.5 Framework

LLaVA-1.5 Performance Comparison

OpenVision demonstrates comparable performance to CLIP on multiple multimodal benchmarks with much less computational cost.
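For context, a frozen vision encoder is attached to LLaVA-1.5 through a small two-layer MLP projector (with GELU) that maps visual tokens into the LLM embedding space. The sketch below illustrates this wiring; the dimensions are assumed placeholders and this is not the exact evaluation code.

    import torch
    import torch.nn as nn

    class VisionProjector(nn.Module):
        """LLaVA-1.5-style two-layer MLP projector (dimensions are assumed)."""
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, visual_tokens):           # [B, N, vision_dim]
            return self.mlp(visual_tokens)          # [B, N, llm_dim]

    # The vision encoder stays frozen during evaluation/fine-tuning;
    # only the projector and the LLM receive gradients, e.g.:
    #   for p in vision_encoder.parameters():
    #       p.requires_grad_(False)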

Loss Visualization with Only Semantic Loss

We trained our tokenizer both with and without the reconstruction loss.

Understanding alone benefits reconstruction: In Figures (a) and (b), both pixel-level and latent-level reconstruction losses decrease significantly even in the absence of explicit reconstruction signals.
Reconstruction loss does not harm understanding: Figures (c) and (d) demonstrate that the incorporation of the reconstruction loss has no adverse impact on the losses of the understanding branch.

Figure panels: (a) Pixel Reconstruction Loss, (b) Latents Reconstruction Loss, (c) Caption Loss, (d) Contrastive Loss.

Loss Visualization with Only Reconstruction Loss

We trained our tokenizer both with and without the understanding loss.

Adding semantic signals facilitates reconstruction: In Figure (a), the inclusion of semantic loss leads to a lower image reconstruction loss, suggesting that semantic supervision can, in turn, enhance reconstruction performance.
Reconstruction facilitates semantic generative tasks: Figures (c) and (d) show that in the absence of semantic supervision, the contrastive loss remains almost stagnant, whereas the caption loss exhibits a marginal decline. This indicates that the reconstruction task intrinsically facilitates semantic tasks that are also generative in nature.
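Schematically, these two ablations amount to switching off one group of terms in the training objective while still tracking all four losses for the plots. A minimal sketch (the grouping and flags are illustrative, not the exact training configuration):

    def total_loss(losses, use_recon=True, use_semantic=True):
        # losses: dict with 'pixel_recon', 'latent_recon', 'contrastive', 'caption'
        recon = losses["pixel_recon"] + losses["latent_recon"]
        semantic = losses["contrastive"] + losses["caption"]
        return (recon if use_recon else 0.0) + (semantic if use_semantic else 0.0)

    # "Only semantic loss" run:        total_loss(losses, use_recon=False)
    # "Only reconstruction loss" run:  total_loss(losses, use_semantic=False)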

Figure panels: (a) Pixel Reconstruction Loss, (b) Latents Reconstruction Loss, (c) Caption Loss, (d) Contrastive Loss.

OpenVision 3 - Model Zoo

We release all ViT checkpoints in OpenVision 3 below. All ViT checkpoints should be paired with the FLUX.1 VAE, which has a downsample factor of 8.

Model                 Resolution    ViT Patch    # Tokens    ViT Link
OpenVision 3-Base     224           2            196         HF
OpenVision 3-Large    256           2            256         HF
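As a quick sanity check on the token counts above: with the FLUX.1 VAE downsample factor of 8 and the ViT patch size applied on the latents, the number of tokens is ((resolution / 8) / patch)^2. The helper below simply reproduces this arithmetic for the two released configurations.

    def num_vit_tokens(resolution, vae_downsample=8, vit_patch=2):
        """Tokens = ((resolution / vae_downsample) / vit_patch) ** 2."""
        side = resolution // vae_downsample // vit_patch
        return side * side

    print(num_vit_tokens(224))   # 196 -> OpenVision 3-Base
    print(num_vit_tokens(256))   # 256 -> OpenVision 3-Large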

Acknowledgement

We would like to thank the TPU Research Cloud (TRC) program and Google Cloud Research Credits program for supporting our computing needs.

BibTeX


  @article{zhang2026openvision,
    title   = {OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
    author  = {Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
    journal = {arXiv preprint arXiv:2601.15369},
    year    = {2026}
  }