Benchmark the real failure mode
CVR builds negatives that are close in language and appearance but wrong in how the scene should evolve. This turns retrieval into a test of sequential understanding.
CAST: Modeling Visual State Transitions for Consistent Video Retrieval. CAST introduces Consistent Video Retrieval (CVR), a benchmark that isolates state and identity failures in multi-step activities, together with a lightweight adapter that learns action-conditioned state transitions in frozen video-language embedding space.
CAST studies retrieval in settings where actions happen in sequence and every step changes the world. The paper introduces CVR, a benchmark that turns multi-step activities into consistency-sensitive retrieval questions, and proposes a compact State Transition Adapter that predicts the next visual state through residual transition modeling. Instead of relying on heavier cross-encoders, CAST operates on top of a frozen backbone, predicts a next-state embedding from visual history and the current instruction, and reuses the same inference signals for retrieval and generation guidance.
CAST keeps the video-language backbone frozen and predicts the next-state embedding from the anchor state, current instruction, and visual history through a lightweight residual adapter.
The same learned transition score can guide candidate selection for video generation, improving both automatic evaluation and human preference.
The benchmark uses step-level clips from YouCook2, CrossTask, and COIN. Query clips define the current context, the natural-language instruction specifies the desired next step, and negatives are sampled to violate either state changes or object identity while remaining semantically plausible.
YouCook2 uses the official train split for training and CVR evaluation on val. COIN uses official train for training and CVR evaluation on test. CrossTask uses a video-disjoint 80/20 split with random seed 42.
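The video-disjoint CrossTask split can be reproduced with a simple video-level shuffle. The helper name below is mine; only the 80/20 ratio and seed 42 come from the text. A minimal sketch:

```python
import random

def video_disjoint_split(video_ids, train_frac=0.8, seed=42):
    """Split at the video level so no video contributes clips to both sides."""
    ids = sorted(set(video_ids))   # deterministic order before shuffling
    rng = random.Random(seed)      # fixed seed, as in the paper (42)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

train_ids, eval_ids = video_disjoint_split([f"vid_{i}" for i in range(100)])
```

Splitting by video ID before expanding to step-clips is what keeps the evaluation videos fully unseen during training.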
Each query is evaluated against 10 candidates: 1 ground-truth clip, up to 3 hard state negatives, up to 3 hard identity negatives, and easy negatives to fill the remaining slots. The context history uses up to the previous L=5 clips.
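The candidate-pool construction above can be sketched directly; the function and argument names are mine, and only the pool size and per-type caps follow the protocol:

```python
import random

def build_candidate_pool(gt, state_negs, identity_negs, easy_negs,
                         pool_size=10, seed=0):
    """Assemble one fixed 1-vs-9 CVR candidate pool for a single query."""
    rng = random.Random(seed)
    pool = [gt]
    pool += state_negs[:3]         # up to 3 hard state negatives
    pool += identity_negs[:3]      # up to 3 hard identity negatives
    fill = pool_size - len(pool)   # easy negatives fill the remaining slots
    pool += rng.sample(easy_negs, fill)
    rng.shuffle(pool)
    return pool

pool = build_candidate_pool("gt_clip",
                            [f"state_{i}" for i in range(3)],
                            [f"ident_{i}" for i in range(3)],
                            [f"easy_{i}" for i in range(20)])
```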
In the main experiments, each clip is represented by mean-pooled CLIP ViT-B/32 features from 3 uniformly sampled frames. CAST uses a frozen backbone and is trained with AdamW, a context window of L=5, and dataset-specific training epochs.
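The clip representation above (3 uniformly sampled frames, mean-pooled) can be sketched as follows. The encoder here is a random stand-in for the frozen CLIP ViT-B/32 image encoder, and the 512-d output dimension matches ViT-B/32; everything else is an illustrative assumption:

```python
import numpy as np

def uniform_frame_indices(num_frames, k=3):
    """k uniformly spaced frame indices over a clip of num_frames frames."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def clip_embedding(frames, encode_frame):
    """Mean-pool per-frame features into one clip vector, then re-normalize."""
    idx = uniform_frame_indices(len(frames), k=3)
    feats = np.stack([encode_frame(frames[i]) for i in idx])
    v = feats.mean(axis=0)
    return v / np.linalg.norm(v)

# Stand-in for the frozen CLIP ViT-B/32 image encoder (512-d features);
# a real pipeline would run the actual image tower here.
rng = np.random.default_rng(0)
fake_encoder = lambda frame: rng.standard_normal(512)
emb = clip_embedding([None] * 90, fake_encoder)   # a 90-frame clip
```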
Final evaluation-set size for each dataset under the fixed 1-vs-9 ranking protocol.
| Dataset | Split Protocol | #Videos | #Step-Clips | #Queries |
|---|---|---|---|---|
| YouCook2 | official train; CVR eval on val | 414 | 3,179 | 2,765 |
| COIN | official train; CVR eval on test | 2,134 | 6,241 | 4,107 |
| CrossTask | video-disjoint 80/20 split | 509 | 2,731 | 2,222 |
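Under the fixed 1-vs-9 ranking protocol, the two retrieval metrics reported later (Acc. and MnR) follow from the rank of the ground-truth clip in each query's candidate pool. A minimal sketch, assuming Acc. is top-1 accuracy in percent and MnR is the mean 1-indexed rank of the ground truth:

```python
import numpy as np

def retrieval_metrics(gt_ranks):
    """Top-1 accuracy (%) and mean rank of the ground truth over all queries.

    `gt_ranks` holds, per query, the 1-indexed rank of the ground-truth clip
    among its 10 candidates.
    """
    r = np.asarray(gt_ranks)
    acc = float((r == 1).mean() * 100)   # percent of queries ranked first
    mnr = float(r.mean())                # lower is better
    return acc, mnr

acc, mnr = retrieval_metrics([1, 1, 3, 2, 1, 5, 1, 1])
```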
CAST is a lightweight query-side adapter on top of a frozen backbone. Given the anchor state v_{t-1}, the instruction q_t, and the visual history H_t, it predicts the next-state embedding v̂_t through a residual update.
CAST predicts v̂_t = φ(v_{t-1} + Δ(v_{t-1}, q_t, H_t)), where Δ = Δ_cond + Δ_ctx. The instruction-conditioned branch uses MLP_cond([f_t(q_t); v_{t-1}]), while the context branch applies cross-attention over H_t followed by MLP_ctx.
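The residual update can be sketched with toy, randomly initialized weights. Shapes, the single-head attention form, and the choice of φ as L2 normalization are my assumptions for illustration; only the two-branch residual structure comes from the paper:

```python
import numpy as np

d = 512
rng = np.random.default_rng(0)
mlp = lambda W1, W2, x: W2 @ np.maximum(W1 @ x, 0)   # 2-layer ReLU MLP

# Randomly initialized parameters; training would fit these while the
# backbone stays frozen.
W1c, W2c = rng.standard_normal((d, 2 * d)) * 0.02, rng.standard_normal((d, d)) * 0.02
W1x, W2x = rng.standard_normal((d, d)) * 0.02, rng.standard_normal((d, d)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cast_predict(v_prev, q_text, history):
    """Residual next-state prediction: v̂_t = φ(v_{t-1} + Δ_cond + Δ_ctx)."""
    # Instruction-conditioned branch: MLP over [f(q_t); v_{t-1}]
    delta_cond = mlp(W1c, W2c, np.concatenate([q_text, v_prev]))
    # Context branch: cross-attention of v_{t-1} over the history H_t (L x d)
    scores = (history @ Wk.T) @ (Wq @ v_prev) / np.sqrt(d)   # (L,)
    ctx = softmax(scores) @ (history @ Wv.T)                 # (d,)
    delta_ctx = mlp(W1x, W2x, ctx)
    v_hat = v_prev + delta_cond + delta_ctx                  # residual update
    return v_hat / np.linalg.norm(v_hat)                     # φ as L2 norm (assumption)

v_hat = cast_predict(rng.standard_normal(d), rng.standard_normal(d),
                     rng.standard_normal((5, d)))            # L=5 history
```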
At inference time, CAST keeps the gallery fixed and scores each candidate with three signals: semantic alignment A, visual continuity B, and predicted future-state compatibility C. The final score uses the full ensemble S(q,c) = A(q,c) + w_v·B(q,c) + w_p·C(q,c).
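The three-signal ensemble can be sketched with cosine similarities; the weight values below are placeholders (the text does not specify w_v and w_p), and the function name is mine:

```python
import numpy as np

def score_candidates(q_emb, v_prev, v_hat, cand_embs, w_v=0.5, w_p=1.0):
    """Rank gallery candidates with S = A + w_v*B + w_p*C (cosine similarity)."""
    unit = lambda X: X / np.linalg.norm(X, axis=-1, keepdims=True)
    cands = unit(cand_embs)
    A = cands @ unit(q_emb)      # semantic alignment with the instruction
    B = cands @ unit(v_prev)     # visual continuity with the anchor state
    C = cands @ unit(v_hat)      # compatibility with the predicted next state
    S = A + w_v * B + w_p * C
    return np.argsort(-S)        # best candidate first

rng = np.random.default_rng(0)
order = score_candidates(rng.standard_normal(512), rng.standard_normal(512),
                         rng.standard_normal(512), rng.standard_normal((10, 512)))
```

Because all three signals are dot products in the same embedding space, reranking a fixed gallery stays as cheap as standard bi-encoder retrieval.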
The backbone encoder remains frozen, and only the CAST adapter is optimized. This preserves the scalability of standard retrieval pipelines while adding context-aware next-state prediction on the query side.
Illustrative cases are those where semantic matching looks reasonable but the retrieved clip violates the intended next state or the identity of the manipulated object.
The human study compares the top-ranked outputs selected by standard text matching and CAST reranking under randomized presentation and majority-vote judgment.
These cases illustrate typical CVR failure modes for context-free retrieval: the retrieved clip can remain semantically related while still violating the expected temporal state or the manipulated object identity.
CAST provides the strongest overall trade-off across datasets and diagnostic metrics, with especially clear gains on state-sensitive retrieval.
| Method | Context Modeling | YouCook2 Acc. | YouCook2 MnR | COIN Acc. | COIN MnR | CrossTask Acc. | CrossTask MnR | State | Ident. |
|---|---|---|---|---|---|---|---|---|---|
| CLIP Baseline | Context-Free | 25.03 | 3.60 | 14.10 | 3.91 | 16.83 | 4.15 | 45.52 | 28.90 |
| Late Fusion (Heuristic) | Fixed Weighting | 31.10 | 2.56 | 17.85 | 3.28 | 22.05 | 2.86 | 28.69 | 68.29 |
| Late Fusion (Learned) | Learned Weighting | 36.60 | 2.53 | 44.66 | 2.11 | 25.52 | 2.86 | 40.06 | 76.06 |
| Early Fusion | Feature Concat. | 35.99 | 2.28 | 15.12 | 2.60 | 35.29 | 2.36 | 31.14 | 83.59 |
| CAST (Ours) | State Transition | 44.77 | 2.15 | 40.47 | 2.16 | 47.39 | 2.14 | 53.81 | 74.67 |
CAST consistently improves over the corresponding zero-shot baseline while operating in each backbone's frozen vision-language embedding space.
| Backbone | Setting | YouCook2 Acc. | YouCook2 MnR | COIN Acc. | COIN MnR | CrossTask Acc. | CrossTask MnR | State | Ident. |
|---|---|---|---|---|---|---|---|---|---|
| **Category I: Video Foundation Models** | | | | | | | | | |
| InternVideo2-1B | Zero-Shot | 36.75 | 2.59 | 17.99 | 3.36 | 20.61 | 3.31 | 65.70 | 30.85 |
| InternVideo2-1B | + CAST | 71.68 | 1.48 | 51.03 | 1.90 | 64.36 | 1.71 | 75.43 | 77.77 |
| VideoPrism-B | Zero-Shot | 47.45 | 2.13 | 17.60 | 3.32 | 20.25 | 3.24 | 68.38 | 33.68 |
| VideoPrism-B | + CAST | 75.59 | 1.38 | 51.64 | 1.90 | 62.11 | 1.74 | 76.92 | 77.66 |
| **Category II: Multimodal Embedding Models** | | | | | | | | | |
| GME-Qwen2-VL-2B | Zero-Shot | 29.62 | 3.10 | 17.17 | 3.44 | 19.40 | 3.61 | 56.31 | 29.73 |
| GME-Qwen2-VL-2B | + CAST | 54.39 | 1.95 | 45.68 | 2.05 | 52.43 | 2.04 | 67.20 | 72.28 |
| Qwen3-VL-Embedding-2B | Zero-Shot | 33.45 | 2.89 | 17.73 | 3.50 | 19.44 | 3.56 | 58.44 | 29.79 |
| Qwen3-VL-Embedding-2B | + CAST | 56.64 | 1.85 | 44.87 | 2.09 | 48.96 | 2.09 | 66.18 | 69.11 |
With CLIP ViT-B/32 fixed, CAST reaches 44.77 Acc. on YouCook2 and 47.39 on CrossTask, while the learned late-fusion baseline is strongest only on COIN (44.66). This matches the paper's conclusion that scalar aggregation can exploit visual inertia on COIN but does not generalize to larger state changes.
The same adapter transfers across InternVideo2, VideoPrism, GME-Qwen2-VL-2B, and Qwen3-VL-Embedding-2B. Across all four backbones, CAST improves both retrieval accuracy and the diagnostic state / identity metrics without changing the frozen backbone encoders.
Ablation on the YouCook2 validation set. Residual modeling outperforms direct target prediction, and the dual-path CAST design improves over simple early fusion.
| Fusion | Target | Acc. | State | Ident. |
|---|---|---|---|---|
| Early (Concat) | Direct (v̂t) | 35.99 | 38.92 | 83.58 |
| Early (Concat) | Residual (Δ) | 38.95 | 43.51 | 81.99 |
| CAST (Ours) | Residual (Δ) | 44.77 | 51.03 | 78.48 |
Inference decomposition on the YouCook2 validation set. The full ensemble offers the best balance between state discrimination and identity preservation.
| Inference Strategy | Acc. | Ident. | State |
|---|---|---|---|
| A. Text Matching (q) | 25.03 | 25.32 | 50.45 |
| B. Visual Continuity (vt-1) | 25.90 | 81.95 | 27.70 |
| C. CAST Prediction (v̂t) | 42.60 | 75.99 | 50.81 |
| Semantic Ensemble (A+C) | 45.46 | 70.38 | 56.71 |
| Full Ensemble (A+B+C) | 44.77 | 78.48 | 51.03 |
This figure complements the inference-decomposition table by showing how CAST shifts top-1 retrieval outcomes toward exact matches and identity-consistent continuations.
The paper reports the largest jump when moving from the context-agnostic setting L=0 to using only the immediate predecessor L=1, with performance then largely saturating as longer history is added.
Qualitative example sequences: Ky0zf0v2F5A_7, PQ97HXmsFR0_2, T_o_T3LEYLY_6, p-NnIyGFZVw_2, qRSZEN6g8jY_11, ulrh6C5V_VI_6.
For each prompt, the generation experiment produces K=4 Veo candidates; CAST reranking and standard text matching each select one output, and the selections are compared in a blind human study.
The blind human study reports higher preference for CAST-guided candidate selection than for standard text matching across overall preference, physical plausibility, and temporal logic.
The study samples 300 prompts from the YouCook2 validation set. Standard text matching and CAST reranking each choose one video from the same candidate pool.
Blind human evaluation on the prompts where the two methods select different candidates.
| Selection Method | Overall Preference | Physical Plausibility | Temporal Logic |
|---|---|---|---|
| Standard Text Match | 38.6% | 39.9% | 38.6% |
| CAST Reranking | 55.1% | 50.6% | 52.5% |
| Human Tie | 6.3% | 9.5% | 8.9% |
@misc{liu2026castmodelingvisualstate,
title={CAST: Modeling Visual State Transitions for Consistent Video Retrieval},
author={Yanqing Liu and Yingcheng Liu and Fanghong Dong and Budianto Budianto and Cihang Xie and Yan Jiao},
year={2026},
eprint={2603.08648},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08648},
}