OpenVision

A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Xianhang Li* · Yanqing Liu* · Haoqin Tu · Hongru Zhu · Cihang Xie

University of California, Santa Cruz

Abstract

OpenAI's CLIP, released in early 2021, has long been the go-to vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work (e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data) while revealing multiple key insights for enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver stronger multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.

Key Contributions

  • Fully Open Vision Encoders: Datasets, training recipes, and model checkpoints are entirely public, fostering reproducibility and transparency in multimodal research.
  • Wide Range of Model Scales: A comprehensive family of vision encoders from Tiny (5.9M) to Huge (632.1M) parameters, supporting deployment from edge devices to high-capacity servers.
  • Superior Multimodal Performance: Matches or surpasses proprietary vision encoders (e.g., OpenAI-CLIP, SigLIP) across popular multimodal benchmarks (e.g., LLaVA-1.5, Open-LLaVA-Next).
  • Efficient Progressive Resolution Training: Demonstrates significant efficiency improvements (2×–3× faster) compared to proprietary counterparts through a progressive, multi-stage resolution training strategy.
  • Flexible Patch-Size Configuration: Supports adaptive encoding with 8×8 or 16×16 patches, allowing either more detailed visual understanding or more efficient processing depending on practical needs (see the token-count sketch after this list).
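
As a rough illustration of the patch-size/resolution trade-off mentioned above, the sketch below (illustrative only, not part of the released code) computes how many visual tokens a ViT produces for configurations from the model zoo: at a fixed resolution, 8×8 patches yield a token grid four times denser than 16×16 patches, which is what makes the smaller patch size more detailed but also more expensive.

# Illustrative only: visual-token counts for OpenVision-style ViT configurations.
# A ViT splits a square image of side `resolution` into (resolution // patch)^2
# non-overlapping patches, each of which becomes one visual token.

def num_visual_tokens(resolution: int, patch: int) -> int:
    grid = resolution // patch
    return grid * grid

# Configurations taken from the model zoo table below.
configs = [
    ("patch16 @ 224", 224, 16),   # 196 tokens
    ("patch8  @ 224", 224, 8),    # 784 tokens
    ("patch14 @ 224", 224, 14),   # 256 tokens
    ("patch14 @ 336", 336, 14),   # 576 tokens
]

for name, resolution, patch in configs:
    print(f"{name}: {num_visual_tokens(resolution, patch)} visual tokens")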

Detailed Comparisons and Efficiency

OpenVision vs. Proprietary Encoders

OpenVision encoders match or outperform proprietary models like OpenAI's CLIP and Google's SigLIP across multimodal tasks.

Performance under LLaVA-1.5 Framework

OpenVision demonstrates strong performance improvements over existing CLIP models under the LLaVA-1.5 multimodal framework.

Performance under Open-LLaVA-Next Framework

Under Open-LLaVA-Next, OpenVision maintains its competitive edge, excelling particularly in document-heavy multimodal tasks.

Efficiency Comparison

OpenVision achieves superior multimodal performance with significantly reduced training time compared to proprietary alternatives.
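
The speed-up stems largely from the progressive, multi-stage resolution strategy noted in the key contributions: most optimization steps run at a low input resolution, and only short later stages run at the full resolution. The toy sketch below illustrates that general idea with random data and a dummy encoder; the stage schedule, loss, and model here are placeholders, not the official OpenVision training recipe.

# Toy illustration of progressive multi-stage resolution training
# (placeholders throughout; not the released OpenVision training code).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical schedule: (input resolution, optimization steps at that resolution).
stages = [(160, 300), (224, 30), (384, 10)]

patch, dim = 16, 64

# Dummy patch-embedding encoder: a conv stem that works at any resolution,
# followed by mean pooling over the resulting patch grid.
encoder = nn.Sequential(
    nn.Conv2d(3, dim, kernel_size=patch, stride=patch),
    nn.Flatten(2),  # -> (batch, dim, num_patches)
)
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-3)

for resolution, num_steps in stages:
    for _ in range(num_steps):
        images = torch.randn(8, 3, resolution, resolution)      # stand-in image batch
        targets = torch.randn(8, dim)                           # stand-in text embeddings
        feats = encoder(images).mean(dim=-1)                    # pooled image features
        loss = 1 - F.cosine_similarity(feats, targets).mean()   # stand-in alignment loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"finished stage at resolution {resolution}")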

Model Zoo (ImageNet-1K)

We report ImageNet-1K Top-1 accuracy across OpenVision variants; all models are available in both JAX and PyTorch formats. A sketch of how such numbers are typically measured follows the table.

| Model | Size | Patch | Resolution | Top-1 | Link (JAX / PyTorch) |
| --- | --- | --- | --- | --- | --- |
| ViT-Tiny | 5M | 16 | 160 | 46.9% | HF |
| ViT-Tiny | 5M | 16 | 224 | 49.6% | HF |
| ViT-Tiny | 5M | 16 | 384 | 51.5% | HF |
| ViT-Tiny | 5M | 8 | 160 | 51.9% | HF |
| ViT-Tiny | 5M | 8 | 224 | 53.5% | HF |
| ViT-Tiny | 5M | 8 | 384 | 53.9% | HF |
| ViT-Small | 22M | 16 | 160 | 63.5% | HF |
| ViT-Small | 22M | 16 | 224 | 65.9% | HF |
| ViT-Small | 22M | 16 | 384 | 67.1% | HF |
| ViT-Small | 22M | 8 | 160 | 67.3% | HF |
| ViT-Small | 22M | 8 | 224 | 68.6% | HF |
| ViT-Small | 22M | 8 | 384 | 68.5% | HF |
| ViT-Base | 86M | 16 | 160 | 72.4% | HF |
| ViT-Base | 86M | 16 | 224 | 73.9% | HF |
| ViT-Base | 86M | 16 | 384 | 74.5% | HF |
| ViT-Base | 86M | 8 | 160 | 74.8% | HF |
| ViT-Base | 86M | 8 | 224 | 75.4% | HF |
| ViT-Base | 86M | 8 | 384 | 75.6% | HF |
| ViT-Large | 307M | 14 | 84 | 74.7% | HF |
| ViT-Large | 307M | 14 | 224 | 78.5% | HF |
| ViT-Large | 307M | 14 | 336 | 78.9% | HF |
| SoViT-400M | 400M | 14 | 84 | 76.2% | HF |
| SoViT-400M | 400M | 14 | 224 | 79.7% | HF |
| SoViT-400M | 400M | 14 | 384 | 79.9% | HF |
| ViT-Huge | 632M | 14 | 84 | 77.4% | HF |
| ViT-Huge | 632M | 14 | 224 | 80.4% | HF |
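
The Top-1 accuracies above are presumably obtained with CLIP-style zero-shot classification: each class name is wrapped in a text prompt, and an image is assigned to the class whose text embedding it matches best. The sketch below shows that procedure for a handful of example class names, using the same customized open_clip API as the usage example in the next section; an actual ImageNet-1K evaluation covers all 1,000 classes and usually averages several prompt templates per class.

# Minimal zero-shot classification sketch (illustrative; a real ImageNet-1K run
# uses all 1,000 class names and multiple prompt templates).
import torch
import torch.nn.functional as F
from PIL import Image

from open_clip import create_model_from_pretrained, get_tokenizer  # customized repo version

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224')

class_names = ["goldfish", "tabby cat", "golden retriever", "espresso"]  # tiny subset
text = tokenizer([f"a photo of a {name}" for name in class_names],
                 context_length=model.context_length)
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder local image

with torch.no_grad():
    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(text), dim=-1)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))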

Model Usage

With Our Customized OpenCLIP Tokenizer

⚠️ IMPORTANT: Make sure you're importing from src/convert_upload/open_clip/ in this repo.
The tokenizer implementation here is customized and not yet available in the official OpenCLIP repo or PyPI release.


import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image

# ⚠️ Use our repo's tokenizer implementation at src/convert_upload/open_clip/
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
    

Acknowledgement

We would like to thank the TPU Research Cloud (TRC) program and Google Cloud Research Credits program for supporting our computing needs.

BibTeX


@article{li2025openvision,
  title   = {OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning},
  author  = {Li, Xianhang and Liu, Yanqing and Tu, Haoqin and Zhu, Hongru and Xie, Cihang},
  journal = {arXiv preprint arXiv:2505.04601},
  year    = {2025}
}