Fully-Open Vision Encoders • Generative Pretraining

OpenVision 2

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning.
OpenVision 2: A Family of Generative Pretrained Visual Encoders that removes the text encoder and contrastive loss, training with caption-only supervision.

  • Training speed: 1.5–2× faster
  • Memory footprint: ~1.8× lower
  • Scale: up to 1B+ parameters
  • Benchmarks: OCR/TextVQA ↑
[Figure: OpenVision 2 teaser]

What’s New in OpenVision 2

  • Caption-only generative training: ViT vision encoder + decoder-only text model. No text encoder, no contrastive loss (a minimal sketch follows this list).
  • Training–inference alignment: Pretraining pipeline mirrors modern MLLM usage (e.g., LLaVA), reducing objective mismatch.
  • Efficiency at scale: CLIPA two-stage curriculum (low-res pretraining → short high-res finetune) + visual token masking (keeping only ~25–35% of visual tokens).
  • High-quality captions: ReCap-DataComp-1B v2 (LLaMA-3 powered, conditioned on alt-text with weighted top-k sampling) for richer, grounded supervision.
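
To make the recipe concrete, below is a minimal PyTorch sketch of caption-only supervision with visual token masking. Everything in it is illustrative: the module names, dimensions, and the prefix-style conditioning are assumptions for exposition, not the OpenVision 2 implementation (see the JAX code in this repo for the actual training setup).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionOnlyHead(nn.Module):
    """Toy decoder-only captioner conditioned on a masked visual-token prefix.
    Illustrative only; not the OpenVision 2 architecture or training code."""

    def __init__(self, vision_dim=1024, dim=512, vocab=32000, layers=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, dim)                 # visual tokens -> decoder width
        self.embed = nn.Embedding(vocab, dim)                  # caption token embeddings
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, layers)    # run with a causal mask below
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, vis_tokens, caption_ids, keep_ratio=0.3):
        B, N, _ = vis_tokens.shape
        # Visual token masking: keep ~25-35% of patch tokens, drop the rest entirely.
        keep = max(1, int(N * keep_ratio))
        idx = torch.rand(B, N, device=vis_tokens.device).argsort(1)[:, :keep]
        vis = vis_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, vis_tokens.size(-1)))

        # Prefix the kept visual tokens to the caption and decode autoregressively.
        x = torch.cat([self.proj(vis), self.embed(caption_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=causal)

        # The only loss: next-token cross-entropy on the caption (no contrastive term).
        logits = self.lm_head(h[:, keep:-1])                   # predicts caption tokens 1..T-1
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))

# Usage with a ViT that emits (B, N, vision_dim) patch tokens:
# loss = CaptionOnlyHead()(patch_tokens, caption_ids); loss.backward()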

Benchmarks (OpenVision vs OpenVision 2)

We report results under two MLLM frameworks. MME is shown as Perception/Cognition.

LLaVA-1.5

| Method | Vision Encoder | Params | Res | TextVQA | ChartQA | OCR | MME (P/C) | SEED | SQA | GQA | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVision | L/14 | 304M | 224 | 57.7 | 13.9 | 315 | 1487/317 | 69.5 | 73.6 | 62.9 | 86.4 |
| OpenVision 2 | L/14 | 304M | 224 | 59.0 | 13.7 | 327 | 1460/312 | 69.3 | 76.5 | 62.6 | 87.1 |
| OpenVision | L/14 | 304M | 336 | 61.2 | 15.7 | 339 | 1525/315 | 70.5 | 75.1 | 63.7 | 87.2 |
| OpenVision 2 | L/14 | 304M | 336 | 63.0 | 14.5 | 357 | 1486/321 | 70.1 | 77.5 | 63.0 | 87.7 |
| OpenVision | SoViT-400M/14 | 400M | 384 | 62.4 | 16.1 | 357 | 1493/320 | 70.4 | 72.4 | 63.8 | 88.0 |
| OpenVision 2 | SoViT-400M/14 | 400M | 384 | 64.3 | 15.0 | 387 | 1472/310 | 70.7 | 74.9 | 63.5 | 87.5 |
| OpenVision 2 | H/14 | 632M | 224 | 60.2 | 13.5 | 340 | 1470/305 | 69.3 | 75.4 | 62.5 | 87.2 |
| OpenVision 2 | H/14 | 632M | 336 | 63.4 | 16.3 | 391 | 1470/311 | 70.6 | 76.4 | 63.1 | 88.4 |
| OpenVision 2 | H/14 | 632M | 448 | 65.6 | 18.1 | 416 | 1499/331 | 70.6 | 75.6 | 63.1 | 88.7 |
| OpenVision 2 | g/14 | 1.01B | 224 | 60.2 | 13.7 | 338 | 1469/290 | 69.3 | 75.0 | 62.6 | 86.9 |

Open-LLaVA-Next

| Method | Vision Encoder | Params | Res | TextVQA | ChartQA | OCR | MME (P/C) | SEED | SQA | GQA | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVision | L/14 | 304M | 224 | 65.7 | 61.5 | 503 | 1567/332 | 73.1 | 73.1 | 64.7 | 87.8 |
| OpenVision 2 | L/14 | 304M | 224 | 66.1 | 60.4 | 501 | 1577/297 | 73.1 | 68.4 | 64.6 | 87.6 |
| OpenVision | L/14 | 304M | 336 | 68.3 | 68.0 | 547 | 1520/310 | 73.3 | 75.4 | 64.4 | 88.1 |
| OpenVision 2 | L/14 | 304M | 336 | 68.9 | 62.3 | 537 | 1585/278 | 73.4 | 75.2 | 64.6 | 88.4 |
| OpenVision | SoViT-400M/14 | 400M | 384 | 67.4 | 63.1 | 540 | 1500/353 | 72.2 | 73.5 | 63.4 | 87.8 |
| OpenVision 2 | SoViT-400M/14 | 400M | 384 | 69.0 | 63.4 | 549 | 1521/319 | 72.2 | 72.7 | 63.1 | 87.7 |
| OpenVision 2 | H/14 | 632M | 224 | 66.4 | 60.2 | 514 | 1597/314 | 73.3 | 76.2 | 64.7 | 88.4 |
| OpenVision 2 | H/14 | 632M | 336 | 69.9 | 64.8 | 573 | 1572/337 | 73.8 | 74.5 | 64.4 | 87.8 |
| OpenVision 2 | H/14 | 632M | 448 | 71.9 | 64.9 | 590 | 1542/324 | 74.1 | 75.6 | 64.4 | 88.8 |
| OpenVision 2 | g/14 | 1.01B | 224 | 67.3 | 62.4 | 514 | 1558/323 | 73.4 | 74.4 | 64.7 | 88.0 |

Efficiency & Scaling

Training (TPU v4-512)

| Model | Backbone | Res | v4-512 Hours | FLOPs / Image |
| --- | --- | --- | --- | --- |
| OpenVision | L/14 | 224 | 83 | 271.75 |
| OpenVision 2 | L/14 | 224 | 57 | 208.90 |
| OpenVision | SoViT-400M/14 | 384 | 241 | 1636.75 |
| OpenVision 2 | SoViT-400M/14 | 384 | 121 | 1017.74 |

Memory (TPU v4-64, GB/chip)

| Model (L/14) | Res | Batch | Peak Mem (GB/chip) |
| --- | --- | --- | --- |
| OpenVision | 224 | 2k | 24.5 |
| OpenVision | 224 | 4k | OOM |
| OpenVision 2 | 224 | 2k | 13.8 |
| OpenVision 2 | 224 | 4k | 22.1 |
| OpenVision 2 | 224 | 8k | 28.4 |

| Model (SoViT-400M/14) | Res | Batch | Peak Mem (GB/chip) |
| --- | --- | --- | --- |
| OpenVision | 384 | 512 | 27.4 |
| OpenVision | 384 | 1k | OOM |
| OpenVision 2 | 384 | 512 | 14.5 |
| OpenVision 2 | 384 | 1k | 28.8 |

Strategy Ablations (ViT-L/14 @ 224, v4-64)

| Method | CLIPA | Token Mask | Time (h) |
| --- | --- | --- | --- |
| CapPa baseline | – | – | 217 |
| OpenVision 2 (Mask only) | – | ✓ | 190 |
| OpenVision 2 (CLIPA only) | ✓ | – | 67 |
| OpenVision 2 (both) | ✓ | ✓ | 55 |
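
For intuition on where these savings come from, the sketch below walks through a hypothetical two-stage CLIPA-style schedule combined with token masking. The resolutions, keep ratio, and step counts are placeholders for exposition, not the released training configs.

# Illustrative CLIPA-style curriculum: most training at low resolution, then a short
# high-resolution finetune, with visual token masking throughout. Numbers are placeholders.
stages = [
    {"name": "pretrain", "image_size": 112, "keep_ratio": 0.3, "steps": 200_000},
    {"name": "finetune", "image_size": 224, "keep_ratio": 0.3, "steps": 20_000},
]

for s in stages:
    n_patches = (s["image_size"] // 14) ** 2      # ViT-L/14 patch grid at this resolution
    kept = int(n_patches * s["keep_ratio"])
    print(f'{s["name"]}: {s["image_size"]}px -> {kept}/{n_patches} visual tokens, {s["steps"]} steps')

Both levers shrink the number of visual tokens the encoder and decoder must process per step, which is consistent with the wall-clock reductions in the ablation above.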

How to Load the Converted Vision-Only Encoder

Use our OpenCLIP-compatible interface to load vision-only checkpoints from the Hugging Face Hub.

import torch
from open_clip.factory import create_vision_encoder_and_transforms

# Replace with your HF repo containing the converted vision-only weights
hf_repo = "UCSC-VLAA/openvision2-vit-large-patch14-224-vision-only"

vision_encoder = create_vision_encoder_and_transforms(
    model_name=f"hf-hub:{hf_repo}"
)
vision_encoder.eval()

dummy = torch.ones(1, 3, 224, 224)  # dummy batch: one RGB image at the 224px resolution
with torch.no_grad():
    pooled, tokens = vision_encoder(dummy)  # pooled image embedding and per-patch tokens
print("pooled:", tuple(pooled.shape), "tokens:", tuple(tokens.shape))

Installation

# Clone
git clone https://github.com/UCSC-VLAA/OpenVision.git
cd OpenVision

# TPU-oriented environment (example)
bash setup.sh DEVICE=tpu JAX_VERSION=0.4.38

# Optional (PyTorch-side use)
pip install open_clip_torch huggingface_hub

Data & Training (Overview)

Citation

@article{li2025openvision,
  title   = {OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning},
  author  = {Li, Xianhang and Liu, Yanqing and Tu, Haoqin and Zhu, Hongru and Xie, Cihang},
  journal = {arXiv preprint arXiv:2505.04601},
  year    = {2025}
}
@article{liu2025openvision2,
  title   = {OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning},
  author  = {Liu, Yanqing and Li, Xianhang and Zhang, Letian and Wang, Zirui and Zheng, Zeyu and Zhou, Yuyin and Xie, Cihang},
  journal = {arXiv preprint arXiv:2509.01644},
  year    = {2025}
}

License: Apache-2.0. JAX implementation builds on Big Vision; PyTorch side builds on OpenCLIP, with references to timm and MAE. Thanks to TPU Research Cloud (TRC) and Google Cloud Research Credits for support.