Fully-Open Vision Encoders • Generative Pretraining

OpenVision 2

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning.
OpenVision 2: A Family of Generative Pretrained Visual Encoders that removes the text encoder and contrastive loss, training with caption-only supervision.

  • Training speed: 1.5–2× faster
  • Memory footprint: ~1.8× lower
  • Scale: up to 1B+ parameters
  • Benchmarks: OCR/TextVQA ↑
[Figure: OpenVision 2 teaser]

What’s New in OpenVision 2

  • Caption-only generative training: ViT vision encoder + decoder-only text model. No text encoder, no contrastive loss (a minimal sketch follows this list).
  • Training–inference alignment: Pretraining pipeline mirrors modern MLLM usage (e.g., LLaVA), reducing objective mismatch.
  • Efficiency at scale: CLIPA two-stage curriculum (low-res pretraining → short high-res finetune) + visual token masking (keeping only ~25–35% of visual tokens).
  • High-quality captions: ReCap-DataComp-1B v2 (LLaMA-3 powered, conditioned on alt-text with weighted top-k sampling) for richer, grounded supervision.
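
To make the recipe concrete, below is a minimal PyTorch sketch of caption-only supervision with visual token masking. Everything in it is illustrative: the module names, dimensions, and the prefix-style conditioning are assumptions for exposition, not the OpenVision 2 implementation (see the JAX code in this repo for the actual training setup).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionOnlyHead(nn.Module):
    """Toy decoder-only captioner conditioned on a masked visual-token prefix.
    Illustrative only; not the OpenVision 2 architecture or training code."""

    def __init__(self, vision_dim=1024, dim=512, vocab=32000, layers=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, dim)                 # visual tokens -> decoder width
        self.embed = nn.Embedding(vocab, dim)                  # caption token embeddings
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, layers)    # run with a causal mask below
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, vis_tokens, caption_ids, keep_ratio=0.3):
        B, N, _ = vis_tokens.shape
        # Visual token masking: keep ~25-35% of patch tokens, drop the rest entirely.
        keep = max(1, int(N * keep_ratio))
        idx = torch.rand(B, N, device=vis_tokens.device).argsort(1)[:, :keep]
        vis = vis_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, vis_tokens.size(-1)))

        # Prefix the kept visual tokens to the caption and decode autoregressively.
        x = torch.cat([self.proj(vis), self.embed(caption_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=causal)

        # The only loss: next-token cross-entropy on the caption (no contrastive term).
        logits = self.lm_head(h[:, keep:-1])                   # predicts caption tokens 1..T-1
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))

# Usage with a ViT that emits (B, N, vision_dim) patch tokens:
# loss = CaptionOnlyHead()(patch_tokens, caption_ids); loss.backward()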

Benchmarks (OpenVision vs OpenVision 2)

We report results under two MLLM frameworks. MME is shown as Perception/Cognition.

LLaVA-1.5

| Method | Vision Encoder | Params | Res | TextVQA | ChartQA | OCR | MME (P/C) | SEED | SQA | GQA | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVision | L/14 | 304M | 224 | 57.7 | 13.9 | 315 | 1487/317 | 69.5 | 73.6 | 62.9 | 86.4 |
| OpenVision 2 | L/14 | 304M | 224 | 59.0 | 13.7 | 327 | 1460/312 | 69.3 | 76.5 | 62.6 | 87.1 |
| OpenVision | L/14 | 304M | 336 | 61.2 | 15.7 | 339 | 1525/315 | 70.5 | 75.1 | 63.7 | 87.2 |
| OpenVision 2 | L/14 | 304M | 336 | 63.0 | 14.5 | 357 | 1486/321 | 70.1 | 77.5 | 63.0 | 87.7 |
| OpenVision | SoViT-400M/14 | 400M | 384 | 62.4 | 16.1 | 357 | 1493/320 | 70.4 | 72.4 | 63.8 | 88.0 |
| OpenVision 2 | SoViT-400M/14 | 400M | 384 | 64.3 | 15.0 | 387 | 1472/310 | 70.7 | 74.9 | 63.5 | 87.5 |
| OpenVision 2 | H/14 | 632M | 224 | 60.2 | 13.5 | 340 | 1470/305 | 69.3 | 75.4 | 62.5 | 87.2 |
| OpenVision 2 | H/14 | 632M | 336 | 63.4 | 16.3 | 391 | 1470/311 | 70.6 | 76.4 | 63.1 | 88.4 |
| OpenVision 2 | H/14 | 632M | 448 | 65.6 | 18.1 | 416 | 1499/331 | 70.6 | 75.6 | 63.1 | 88.7 |
| OpenVision 2 | g/14 | 1.01B | 224 | 60.2 | 13.7 | 338 | 1469/290 | 69.3 | 75.0 | 62.6 | 86.9 |

Open-LLaVA-Next

| Method | Vision Encoder | Params | Res | TextVQA | ChartQA | OCR | MME (P/C) | SEED | SQA | GQA | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVision | L/14 | 304M | 224 | 65.7 | 61.5 | 503 | 1567/332 | 73.1 | 73.1 | 64.7 | 87.8 |
| OpenVision 2 | L/14 | 304M | 224 | 66.1 | 60.4 | 501 | 1577/297 | 73.1 | 68.4 | 64.6 | 87.6 |
| OpenVision | L/14 | 304M | 336 | 68.3 | 68.0 | 547 | 1520/310 | 73.3 | 75.4 | 64.4 | 88.1 |
| OpenVision 2 | L/14 | 304M | 336 | 68.9 | 62.3 | 537 | 1585/278 | 73.4 | 75.2 | 64.6 | 88.4 |
| OpenVision | SoViT-400M/14 | 400M | 384 | 67.4 | 63.1 | 540 | 1500/353 | 72.2 | 73.5 | 63.4 | 87.8 |
| OpenVision 2 | SoViT-400M/14 | 400M | 384 | 69.0 | 63.4 | 549 | 1521/319 | 72.2 | 72.7 | 63.1 | 87.7 |
| OpenVision 2 | H/14 | 632M | 224 | 66.4 | 60.2 | 514 | 1597/314 | 73.3 | 76.2 | 64.7 | 88.4 |
| OpenVision 2 | H/14 | 632M | 336 | 69.9 | 64.8 | 573 | 1572/337 | 73.8 | 74.5 | 64.4 | 87.8 |
| OpenVision 2 | H/14 | 632M | 448 | 71.9 | 64.9 | 590 | 1542/324 | 74.1 | 75.6 | 64.4 | 88.8 |
| OpenVision 2 | g/14 | 1.01B | 224 | 67.3 | 62.4 | 514 | 1558/323 | 73.4 | 74.4 | 64.7 | 88.0 |

Efficiency & Scaling

Training (TPU v4-512)

| Model | Backbone | Res | v4-512 Hours | FLOPs / Image |
| --- | --- | --- | --- | --- |
| OpenVision | L/14 | 224 | 83 | 271.75 |
| OpenVision 2 | L/14 | 224 | 57 | 208.90 |
| OpenVision | SoViT-400M/14 | 384 | 241 | 1636.75 |
| OpenVision 2 | SoViT-400M/14 | 384 | 121 | 1017.74 |

Memory (TPU v4-64, GB/chip)

| Model (L/14) | Res | Batch | Peak Mem (GB/chip) |
| --- | --- | --- | --- |
| OpenVision | 224 | 2k | 24.5 |
| OpenVision | 224 | 4k | OOM |
| OpenVision 2 | 224 | 2k | 13.8 |
| OpenVision 2 | 224 | 4k | 22.1 |
| OpenVision 2 | 224 | 8k | 28.4 |

| Model (SoViT-400M/14) | Res | Batch | Peak Mem (GB/chip) |
| --- | --- | --- | --- |
| OpenVision | 384 | 512 | 27.4 |
| OpenVision | 384 | 1k | OOM |
| OpenVision 2 | 384 | 512 | 14.5 |
| OpenVision 2 | 384 | 1k | 28.8 |

Strategy Ablations (ViT-L/14 @ 224, v4-64)

| Method | CLIPA | Token Mask | Time (h) |
| --- | --- | --- | --- |
| CapPa baseline | – | – | 217 |
| OpenVision 2 (Mask only) | – | ✓ | 190 |
| OpenVision 2 (CLIPA only) | ✓ | – | 67 |
| OpenVision 2 (both) | ✓ | ✓ | 55 |
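
For intuition on where these savings come from, the sketch below walks through a hypothetical two-stage CLIPA-style schedule combined with token masking. The resolutions, keep ratio, and step counts are placeholders for exposition, not the released training configs.

# Illustrative CLIPA-style curriculum: most training at low resolution, then a short
# high-resolution finetune, with visual token masking throughout. Numbers are placeholders.
stages = [
    {"name": "pretrain", "image_size": 112, "keep_ratio": 0.3, "steps": 200_000},
    {"name": "finetune", "image_size": 224, "keep_ratio": 0.3, "steps": 20_000},
]

for s in stages:
    n_patches = (s["image_size"] // 14) ** 2      # ViT-L/14 patch grid at this resolution
    kept = int(n_patches * s["keep_ratio"])
    print(f'{s["name"]}: {s["image_size"]}px -> {kept}/{n_patches} visual tokens, {s["steps"]} steps')

Both levers shrink the number of visual tokens the encoder and decoder must process per step, which is consistent with the wall-clock reductions in the ablation above.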

How to Load the Converted Vision-Only Encoder

Use our OpenCLIP-compatible interface to load vision-only checkpoints from the Hugging Face Hub.

import torch
from open_clip.factory import create_vision_encoder_and_transforms

# Replace with your HF repo containing the converted vision-only weights
hf_repo = "UCSC-VLAA/openvision2-vit-large-patch14-224-vision-only"

vision_encoder = create_vision_encoder_and_transforms(
    model_name=f"hf-hub:{hf_repo}"
)
vision_encoder.eval()

dummy = torch.ones(1, 3, 224, 224)  # dummy batch: one RGB image at the 224px resolution
with torch.no_grad():
    pooled, tokens = vision_encoder(dummy)  # pooled image embedding and per-patch tokens
print("pooled:", tuple(pooled.shape), "tokens:", tuple(tokens.shape))

Installation

# Clone
git clone https://github.com/UCSC-VLAA/OpenVision.git
cd OpenVision

# TPU-oriented environment (example)
bash setup.sh DEVICE=tpu JAX_VERSION=0.4.38

# Optional (PyTorch-side use)
pip install open_clip_torch huggingface_hub

Data & Training (Overview)

Citation

@article{li2025openvision,
  title   = {OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning},
  author  = {Li, Xianhang and Liu, Yanqing and Tu, Haoqin and Zhu, Hongru and Xie, Cihang},
  journal = {arXiv preprint arXiv:2505.04601},
  year    = {2025}
}
@article{liu2025openvision2,
  title   = {OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning},
  author  = {Liu, Yanqing and Li, Xianhang and Zhang, Letian and Wang, Zirui and Zheng, Zeyu and Zhou, Yuyin and Xie, Cihang},
  journal = {arXiv preprint arXiv:2509.01644},
  year    = {2025}
}

License: Apache-2.0. JAX implementation builds on Big Vision; PyTorch side builds on OpenCLIP, with references to timm and MAE. Thanks to TPU Research Cloud (TRC) and Google Cloud Research Credits for support.