ViLBench: A Suite for Vision-Language Process Reward Modeling


1UC Santa Cruz, 2UT Dallas, 3Amazon Research

We first benchmark current vision-language models as different types of reward models and present ViLBench, a benchmark that requires intensive step-wise reward signals. We then collect 73K+ step-wise preference examples to train a vision-language process reward model, ViLPRM, which outperforms other baselines.

Motivations

Process-supervised reward models (PRMs) serve as fine-grained functions that provide detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite these advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results reveal that neither ORMs nor PRMs consistently outperform across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward examples using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release our code, model, and benchmark data at https://ucsc-vlaa.github.io/ViLBench.

Part I: Benchmarking Vision Large Language Models as Reward Models

Key Findings

Figure 1. We adopt the Best-of-N (BoN) setting, where VLLMs select the best response from a pool of N candidate responses. Specifically, we use GPT-4o as the base solution sampler to generate 24 solutions per question, then employ 7 different VLLMs as deterministic scorers that assign each reasoning step a score from 1 to 5 and pick the best response among the candidates. This setup leads to our first finding:

Finding 1: Neither ORM nor PRM excels across all vision-language tasks.
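To make the protocol in Figure 1 concrete, the sketch below scores every reasoning step of each candidate on a 1 to 5 scale and returns the highest-scoring candidate. The `score_step` callable stands in for whichever VLLM scorer is used, and mean aggregation over steps is an illustrative assumption rather than the exact implementation.

```python
from typing import Callable, List

def best_of_n(
    question: str,
    candidates: List[List[str]],                          # N candidates, each a list of reasoning steps
    score_step: Callable[[str, List[str], str], float],   # (question, previous steps, current step) -> score in [1, 5]
) -> int:
    """Return the index of the candidate with the highest aggregate step score."""
    best_idx, best_score = 0, float("-inf")
    for i, steps in enumerate(candidates):
        # Score each step given the question and the preceding steps, then aggregate.
        # Mean aggregation is an assumption; min or last-step aggregation are alternatives.
        scores = [score_step(question, steps[:t], steps[t]) for t in range(len(steps))]
        agg = sum(scores) / max(len(scores), 1)
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx
```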

Figure 2. Correlation between a model's own task performance and its reward performance on the MMStar and MathVista datasets.

Finding 2: Better vision-language models do not necessarily lead to better reward models.

Figure 3. Model performance when only the last n step rewards are used for response selection under the Best-of-N paradigm.

Finding 3: The best practice for a vision-language reward model usually lies between PRM and ORM.
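Finding 3 suggests interpolating between pure PRM scoring (all steps) and pure ORM scoring (final step only) by aggregating only the last n step rewards, as varied in Figure 3. A minimal sketch of that aggregation, with mean aggregation assumed:

```python
from typing import List

def aggregate_last_n(step_scores: List[float], n: int) -> float:
    """Aggregate only the last n step rewards of a response.

    n = 1 recovers ORM-style scoring (final step only);
    n = len(step_scores) recovers full PRM-style scoring.
    Mean aggregation is an assumption, not the paper's exact rule.
    """
    if not step_scores:
        return 0.0
    tail = step_scores[-n:] if n > 0 else step_scores
    return sum(tail) / len(tail)
```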

Table 1. Average performance gain across 7 reward models on text-dominant versus vision-dominant examples of MathVerse, using ORM or PRM scoring.

Finding 4: VLLMs as reward models provide greater benefits on text-dominant examples.

ViLBench: A Vision-Language Benchmark Requiring Intensive Reward Feedback

Table 2. We leverage six open-weight VLLMs to filter for samples on which they perform well as PRMs but poorly as ORMs under the BoN setting.
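The selection rule described in Table 2 can be sketched as follows, assuming per-sample BoN outcomes for each judge VLLM when used as a PRM and as an ORM; the agreement threshold `min_models` is hypothetical and not the paper's exact criterion.

```python
from typing import Dict, List, Tuple

def select_vilbench_samples(
    records: Dict[str, List[Tuple[bool, bool]]],  # sample id -> one (prm_correct, orm_correct) pair per judge VLLM
    min_models: int = 4,                          # hypothetical agreement threshold
) -> List[str]:
    """Keep samples on which judges succeed as PRMs but fail as ORMs under BoN."""
    selected = []
    for sample_id, outcomes in records.items():
        # Count judges whose PRM-selected answer is correct while their ORM-selected answer is not.
        prm_only_wins = sum(1 for prm_ok, orm_ok in outcomes if prm_ok and not orm_ok)
        if prm_only_wins >= min_models:
            selected.append(sample_id)
    return selected
```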

Part II: ViLPRM: A Vision-Language Process Reward Model

ViLReward-73K: A Step-wise Vision-Language Process Reward Dataset

Table 3. We collect ViLReward-73K, a vision-language process reward preference dataset, using an improved MCTS-based search engine.
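Our data engine is an enhanced tree-search algorithm; as a rough sketch of how such engines commonly derive step-level labels, the function below estimates each step's quality by the fraction of Monte Carlo rollouts from that step that reach a correct final answer. The `sample_completion` generator, `is_correct` checker, and rollout budget are placeholders, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

def label_steps_by_rollout(
    question: str,
    steps: List[str],
    sample_completion: Callable[[str, List[str]], str],  # placeholder generator: completes from a step prefix
    is_correct: Callable[[str], bool],                    # placeholder checker for the final answer
    n_rollouts: int = 8,
) -> List[Tuple[str, float]]:
    """Attach a soft correctness label to every reasoning step via Monte Carlo rollouts.

    For each step prefix, sample completions and use the fraction that reach a
    correct final answer as that step's reward label. This captures the spirit of
    tree-search PRM data engines, not the paper's exact algorithm.
    """
    labeled = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        wins = sum(
            is_correct(sample_completion(question, prefix)) for _ in range(n_rollouts)
        )
        labeled.append((steps[t - 1], wins / n_rollouts))
    return labeled
```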

An Example of a Search Tree Collected by Our MCTS-based Engine

Overall Performance of the Trained ViLPRM

Figure 4. Overall performance of the trained ViLPRM compared with other reward models.

An Example of Process Reward Scoring

Figure 5. An example of process scores provided by URSA and our ViLPRM. We mark different scores with different colors.

Datasets and Model Zoo

We are pleased to release our benchmark ViLBench and our step-wise preference dataset ViLReward-73K. Stay tuned for the upcoming release of our process reward model ViLPRM-3B!

Dataset

Dataset         Num. of Samples   URL
ViLBench        600               https://huggingface.co/datasets/UCSC-VLAA/ViLBench
ViLReward-73K   73.5K             https://huggingface.co/datasets/UCSC-VLAA/ViLReward-73K
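Both datasets are hosted on Hugging Face and can be loaded with the `datasets` library; the split names below are assumptions, so consult each dataset card for the exact configuration.

```python
from datasets import load_dataset

# Load the ViLBench evaluation set and the ViLReward-73K preference data.
# Split names are assumptions; check the dataset cards for the actual schema.
vilbench = load_dataset("UCSC-VLAA/ViLBench", split="train")
vilreward = load_dataset("UCSC-VLAA/ViLReward-73K", split="train")

print(vilbench)
print(vilreward[0])
```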

Model

Model       Type                                 URL
ViLPRM-3B   Our VL PRM based on Qwen2.5-VL-3B    Coming soon...

Acknowledgments

We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.

BibTeX


@article{tu2025vilbench,
  title   = {ViLBench: A Suite for Vision-Language Process Reward Modeling},
  author  = {Tu, Haoqin and Feng, Weitao and Chen, Hardy and Liu, Hui and Tang, Xianfeng and Xie, Cihang},
  journal = {arXiv preprint arXiv:2503.20271},
  year    = {2025}
}