OpenVision 3

A Family of Unified Visual Encoder for Both Understanding and Generation

Letian Zhang*1 · Sucheng Ren*2 · Yanqing Liu1 · Xianhang Li1 · Zeyu Wang1 · Yuyin Zhou1 · Huaxiu Yao3
Zeyu Zheng4 · Weili Nie5 · Guilin Liu5 · Zhiding Yu5 · Cihang Xie1

1UC Santa Cruz · 2JHU · 3UNC-Chapel Hill · 4UC Berkeley · 5NVIDIA

OpenVision 3 Teaser

Abstract

This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive-learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.

What's New?

    OpenVision 3 encodes VAE latents (instead of pixels) and trains the resulting representation with two complementary signals:

  • Reconstruction loss → preserves low-level visual detail
  • Contrastive + Captioning losses → injects high-level semantics

Why feed VAE latents to a ViT? → Understanding and Generation help each other!

    Empirically, we observe this design unlocks a non-trivial synergy:

  • Training on Understanding alone actually improves pixel reconstruction.
  • Training on Generation alone benefits semantics alignment.
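The dual-objective setup above can be sketched as follows. Everything here is illustrative: the toy sub-modules, dimensions, and equal loss weighting are assumptions, not the released architecture, and the captioning loss is omitted for brevity. Only the overall structure (frozen VAE → shared ViT representation → reconstruction + contrastive losses) mirrors the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Stand-in for the image VAE: 8x spatial downsampling, no learned weights."""
    def encode(self, x):
        return F.avg_pool2d(x, 8)

class ToyViTEncoder(nn.Module):
    """Patchifies the latent grid (patch size 2) into a token sequence (B, N, D)."""
    def __init__(self, c_in=3, dim=32):
        super().__init__()
        self.proj = nn.Conv2d(c_in, dim, kernel_size=2, stride=2)
    def forward(self, z):
        return self.proj(z).flatten(2).transpose(1, 2)

class ToyDecoder(nn.Module):
    """Maps the shared representation back to pixel space for reconstruction."""
    def __init__(self, dim=32, c_out=3):
        super().__init__()
        self.proj = nn.Linear(dim, c_out)
    def forward(self, f, size):
        b, n, _ = f.shape
        h = int(n ** 0.5)
        x = self.proj(f).transpose(1, 2).reshape(b, -1, h, h)
        return F.interpolate(x, size=size, mode="nearest")

def unified_losses(images, text_emb, temperature=0.07):
    """One training step's losses: reconstruction + image-text contrastive."""
    vae, enc, dec = ToyVAE(), ToyViTEncoder(), ToyDecoder()
    with torch.no_grad():                       # VAE stays frozen
        z = vae.encode(images)                  # (B, 3, H/8, W/8)
    feats = enc(z)                              # shared representation (B, N, D)
    # Reconstruction branch: decode back to pixels, preserving low-level detail.
    recon = dec(feats, images.shape[-2:])
    loss_recon = F.mse_loss(recon, images)
    # Semantic branch: symmetric InfoNCE between pooled image and text embeddings.
    img = F.normalize(feats.mean(dim=1), dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(images.shape[0])
    loss_con = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
    return loss_recon + loss_con                # equal weighting, for illustration
```

Because both losses backpropagate through the same `feats`, the encoder is pushed to keep generative detail and semantic alignment in one latent space, which is the synergy the bullets above describe.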
OpenVision vs Proprietary Encoders

Key Contributions

  • Simple and Effective Architecture: Our tokenizer is a simple VAE + ViT encoder.
  • Unified Visual Representations: A single, unified visual representation that can serve both image understanding and image generation.
  • Superior Generation Performance: Our gFID is 1.89 on ImageNet, substantially better than CLIP+RAE (2.54), VAE+SiT (2.06), and UniTok+LlamaGen (2.51).
  • Remarkable Reconstruction Performance: OpenVision 3 achieves an rFID of 0.216 on ImageNet 256x256.
  • Competitive Understanding Ability: Our tokenizer performs comparably with CLIP (62.4 vs 62.2 on SeedBench, 83.7 vs 82.9 on POPE).

Performance on Generation, Reconstruction and Understanding

Reconstruction Performance Comparison


OpenVision 3 outperforms existing unified tokenizers across all metrics on ImageNet 256x256.
Even in comparison with specialized generation-oriented tokenizers, our model maintains competitive or better results.

Class-conditional Image Generation Performance


OpenVision 3 achieves higher generation fidelity under the RAE framework on ImageNet than SD-VAE+SiT, UniTok+LlamaGen, CLIP+RAE, and OpenVision+RAE.

Understanding Performance under LLaVA-1.5 Framework

LLaVA-1.5 Performance Comparison

OpenVision 3 demonstrates comparable performance with CLIP on multiple multimodal benchmarks with much less computational cost.

OpenVision 3 - Model Zoo

We release all ViT checkpoints in OpenVision 3 below. All ViT checkpoints should be paired with FLUX.1 VAE. The downsample factor of FLUX.1 VAE is 8.
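The token counts in the table below follow directly from this pairing: the FLUX.1 VAE shrinks each spatial side by 8x, and the ViT then patchifies the latent grid. A small sketch of the arithmetic (the helper name is ours, not part of any release):

```python
def num_tokens(resolution: int, vae_downsample: int = 8, vit_patch: int = 2) -> int:
    """Number of tokens the ViT produces for a square input image."""
    latent_side = resolution // vae_downsample  # e.g. 256 -> 32
    grid_side = latent_side // vit_patch        # e.g. 32 -> 16
    return grid_side * grid_side

print(num_tokens(256))  # 256, matching the 256-resolution checkpoints
print(num_tokens(224))  # 196, matching the 224-resolution checkpoint
```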

| Model | Resolution | ViT Patch | # Tokens | ViT Link |
| --- | --- | --- | --- | --- |
| OpenVision 3-Base | 256 | 2 | 256 | HF |
| OpenVision 3-Base | 224 | 2 | 196 | HF |
| OpenVision 3-Large | 256 | 2 | 256 | HF |

Acknowledgement

We would like to thank the TPU Research Cloud (TRC) program and Google Cloud Research Credits program for supporting our computing needs. This project is also partly supported by the National Center for Transportation Cybersecurity and Resiliency (TraCR) (a U.S. Department of Transportation National University Transportation Center) headquartered at Clemson University, Clemson, South Carolina, USA (USDOT Grant #69A3552344812). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of TraCR, and the U.S. Government assumes no liability for the contents or use thereof.

BibTeX


  @article{zhang2026openvision,
    title   = {OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
    author  = {Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
    journal = {arXiv preprint arXiv:2601.15369},
    year    = {2026}
  }