CAST

Modeling Visual State Transitions for Consistent Video Retrieval. CAST introduces Consistent Video Retrieval (CVR), a benchmark that isolates state and identity failures in multi-step activities, together with a lightweight adapter that learns action-conditioned state transitions in frozen video-language embedding space.

Yanqing Liu1,2,*, Yingcheng Liu1,3, Fanghong Dong1, Budianto Budianto1, Cihang Xie2, Yan Jiao1
1Google  ·  2University of California, Santa Cruz  ·  3MIT
*This work was done while Yanqing was a research intern at Google.
Consistent Video Retrieval · State Transition Modeling · State-Aware Generation Guidance
Overview

Sequential consistency matters more than semantic overlap.

Standard video retrieval works well when the query and the candidate clip are semantically related, but that is not enough for procedures. The retrieved clip also has to respect the current scene state and the identity of the objects being manipulated.

CAST studies retrieval in settings where actions happen in sequence and every step changes the world. The paper introduces CVR, a benchmark that turns multi-step activities into consistency-sensitive retrieval questions, and proposes a compact State Transition Adapter that predicts the next visual state through residual transition modeling. Instead of relying on heavier cross-encoders, CAST operates on top of a frozen backbone, predicts a next-state embedding from visual history and the current instruction, and reuses the same inference signals for retrieval and generation guidance.

Benchmark the real failure mode

CVR builds negatives that are close in language and appearance but wrong in how the scene should evolve. This turns retrieval into a test of sequential understanding.

Model transitions, not isolated clips

CAST keeps the video-language backbone frozen and predicts the next-state embedding from the anchor state, current instruction, and visual history through a lightweight residual adapter.

Transfer beyond retrieval

The same learned transition score can guide candidate selection for video generation, improving both automatic evaluation and human preference.

Benchmark

CVR diagnoses state and identity consistency.

Each example starts from an activity sequence, builds a language query for the next step, and then contrasts the correct next clip with negatives that are deliberately inconsistent with the current visual state.
Figure 2: Benchmark construction and failure-type overview.

How CVR is built

The benchmark uses step-level clips from YouCook2, CrossTask, and COIN. Query clips define the current context, the natural-language instruction specifies the desired next step, and negatives are sampled to violate either state changes or object identity while remaining semantically plausible.

Dataset split protocol

YouCook2 uses the official train split for training and CVR evaluation on val. COIN uses official train for training and CVR evaluation on test. CrossTask uses a video-disjoint 80/20 split with random seed 42.
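The video-disjoint split can be sketched as follows. This is a minimal illustration of the protocol described above, assuming a simple shuffle-and-cut over deduplicated video IDs; the paper's exact grouping code is not published here, so the helper name and ordering are assumptions.

```python
import random

def video_disjoint_split(video_ids, train_frac=0.8, seed=42):
    """Split video IDs so no video contributes clips to both sides.

    Sketch of the CrossTask protocol: an 80/20 video-disjoint split
    with random seed 42. The sort gives a deterministic base order
    before the seeded shuffle.
    """
    ids = sorted(set(video_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return set(ids[:cut]), set(ids[cut:])

train_ids, eval_ids = video_disjoint_split([f"vid{i}" for i in range(100)])
```

Because the split is at the video level, every step-clip from a given video lands entirely in train or entirely in eval, preventing context leakage across the split.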

1-vs-9 ranking protocol

Each query is evaluated against 10 candidates: 1 ground-truth clip, up to 3 hard state negatives, up to 3 hard identity negatives, and easy negatives to fill the remaining slots. The context history uses up to the previous L=5 clips.
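Assembling one candidate pool under this protocol can be sketched as below. The function name and the seeded sampling of easy negatives are illustrative assumptions; only the pool composition (1 ground truth, up to 3 state negatives, up to 3 identity negatives, easy negatives as filler) follows the text.

```python
import random

def build_candidate_pool(gt_clip, state_negs, identity_negs, easy_negs,
                         pool_size=10, seed=0):
    """Assemble a 1-vs-9 candidate pool as described above (a sketch;
    the paper's exact negative-sampling procedure may differ)."""
    rng = random.Random(seed)
    pool = [gt_clip]
    pool += state_negs[:3]        # up to 3 hard state negatives
    pool += identity_negs[:3]     # up to 3 hard identity negatives
    fill = pool_size - len(pool)  # easy negatives fill the remaining slots
    pool += rng.sample(easy_negs, fill)
    rng.shuffle(pool)             # randomize candidate order
    return pool
```

When fewer than three hard negatives of a type exist, the easy negatives absorb the slack, so every query is still ranked against exactly ten candidates.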

Feature and training setup

In the main experiments, each clip is represented by mean-pooled CLIP ViT-B/32 features from 3 uniformly sampled frames. CAST uses a frozen backbone and is trained with AdamW, a context window of L=5, and dataset-specific training epochs.
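The clip featurization reduces to two steps: pick 3 uniformly spaced frames, encode each with the frozen CLIP image encoder, and mean-pool. A minimal sketch (the exact frame offsets and any rounding convention are assumptions; the CLIP encoding itself is elided):

```python
import numpy as np

def uniform_frame_indices(num_frames, k=3):
    """k uniformly spaced frame indices over a clip (a common
    convention; the paper's exact sampling offsets may differ)."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def clip_embedding(frame_features):
    """Mean-pool per-frame CLIP features into one clip-level vector."""
    return np.asarray(frame_features).mean(axis=0)

idx = uniform_frame_indices(90)  # e.g. a 90-frame clip -> [0, 44, 89]
```

Mean-pooling keeps the clip representation in the same 512-dimensional CLIP space as the text embeddings, which is what lets the adapter operate on top of the frozen backbone.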

CVR benchmark statistics

Final evaluation-set size for each dataset under the fixed 1-vs-9 ranking protocol.

| Dataset | Split Protocol | #Videos | #Step-Clips | #Queries |
|---|---|---|---|---|
| YouCook2 | official train; CVR eval on val | 414 | 3,179 | 2,765 |
| COIN | official train; CVR eval on test | 2,134 | 6,241 | 4,107 |
| CrossTask | video-disjoint 80/20 split | 509 | 2,731 | 2,222 |
Method

CAST learns action-conditioned transitions in embedding space.

Rather than matching the query text directly to the next clip, CAST predicts the target embedding v̂t from the anchor state vt-1, instruction qt, and visual history Ht, then ranks candidates in the same frozen embedding space.
Figure 3: State Transition Adapter and inference scoring.

State Transition Adapter

CAST is a lightweight query-side adapter on top of a frozen backbone. Given the anchor state vt-1, the instruction qt, and the visual history Ht, it predicts the next-state embedding v̂t through a residual update.

Residual next-state prediction

CAST predicts v̂t = φ(vt-1 + Δ(vt-1, qt, Ht)), where Δ = Δcond + Δctx. The instruction-conditioned branch computes MLPcond([ft(qt); vt-1]), while the context branch applies cross-attention over Ht followed by MLPctx.
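As a concrete illustration, here is a minimal NumPy sketch of the residual update. The single-layer stand-ins for MLPcond and MLPctx, the single-head attention, and the choice of φ as L2 normalization are all assumptions for readability; the paper only specifies the residual structure Δ = Δcond + Δctx.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # embedding dim (CLIP ViT-B/32)

# Hypothetical toy parameters; the real adapter uses learned MLPs.
W_cond = rng.normal(0, 0.02, (d, 2 * d))  # stand-in for MLP_cond on [f_T(q_t); v_{t-1}]
W_q = rng.normal(0, 0.02, (d, d))         # cross-attention projections
W_k = rng.normal(0, 0.02, (d, d))
W_v = rng.normal(0, 0.02, (d, d))
W_ctx = rng.normal(0, 0.02, (d, d))       # stand-in for MLP_ctx after attention

def predict_next_state(v_prev, q_feat, history):
    """v̂_t = φ(v_{t-1} + Δ_cond + Δ_ctx), with φ taken here as
    L2 normalization (an assumption)."""
    # Instruction-conditioned branch: linear map over [f_T(q_t); v_{t-1}]
    delta_cond = W_cond @ np.concatenate([q_feat, v_prev])
    # Context branch: single-head cross-attention of v_{t-1} over H_t
    query = W_q @ v_prev
    keys = history @ W_k.T                # (L, d)
    vals = history @ W_v.T                # (L, d)
    attn = np.exp(keys @ query / np.sqrt(d))
    attn /= attn.sum()                    # softmax weights over history
    delta_ctx = W_ctx @ (attn @ vals)
    out = v_prev + delta_cond + delta_ctx # residual update
    return out / np.linalg.norm(out)

v_hat = predict_next_state(rng.normal(size=d), rng.normal(size=d),
                           rng.normal(size=(5, d)))
```

The residual form means the adapter only has to learn the change induced by the instruction and context, not the full next-state embedding from scratch.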

Plug-and-play inference

At inference time, CAST keeps the gallery fixed and scores each candidate with three signals: semantic alignment A, visual continuity B, and predicted future-state compatibility C. The final score uses the full ensemble S(q,c) = A(q,c) + wv·B(q,c) + wp·C(q,c).
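The scoring rule above can be sketched directly. Cosine similarity as the scorer for each signal and the placeholder weight values are assumptions; only the three-term ensemble structure comes from the text.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(q_feat, v_prev, v_hat, candidates, w_v=0.5, w_p=1.0):
    """S(q, c) = A + w_v * B + w_p * C, as described above.
    The weight values here are placeholders, not the paper's."""
    scores = []
    for c in candidates:
        A = cos(q_feat, c)   # semantic alignment: query text vs candidate
        B = cos(v_prev, c)   # visual continuity: anchor state vs candidate
        C = cos(v_hat, c)    # compatibility with the predicted next state
        scores.append(A + w_v * B + w_p * C)
    return int(np.argmax(scores)), scores
```

Because A and B reuse embeddings that standard retrieval already computes, the only extra cost at inference is one adapter forward pass to obtain v̂t.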

Why it stays lightweight

The backbone encoder remains frozen, and only the CAST adapter is optimized. This preserves the scalability of standard retrieval pipelines while adding context-aware next-state prediction on the query side.

Qualitative

Qualitative retrieval and evaluation examples.

These examples complement the benchmark tables by showing the concrete failure modes behind state inconsistency and identity inconsistency.
Figure 4: Baseline retrieval failures versus CAST on CVR.

Retrieval failure cases

The most instructive examples are cases where semantic matching looks reasonable but the retrieved clip violates the intended next state or the identity of the manipulated object.

Human evaluation setup
Figure 8: Human-study setup for the generation comparison.

Human evaluation protocol

The human study compares the top-ranked outputs selected by standard text matching and CAST reranking under randomized presentation and majority-vote judgment.

Qualitative summary

These cases illustrate typical CVR failure modes for context-free retrieval: the retrieved clip can remain semantically related while still violating the expected temporal state or the manipulated object identity.

Results

Quantitative results on the CVR benchmark.

The quantitative analysis centers on two questions: does explicit state-transition modeling outperform standard context aggregation, and does CAST transfer across diverse frozen backbones?

CVR Benchmark Results (CLIP-B/32)

CAST provides the strongest overall trade-off across datasets and diagnostic metrics, with especially clear gains on state-sensitive retrieval.

| Method | Context Modeling | YouCook2 Acc. | YouCook2 MnR | COIN Acc. | COIN MnR | CrossTask Acc. | CrossTask MnR | State | Ident. |
|---|---|---|---|---|---|---|---|---|---|
| CLIP Baseline | Context-Free | 25.03 | 3.60 | 14.10 | 3.91 | 16.83 | 4.15 | 45.52 | 28.90 |
| Late Fusion (Heuristic) | Fixed Weighting | 31.10 | 2.56 | 17.85 | 3.28 | 22.05 | 2.86 | 28.69 | 68.29 |
| Late Fusion (Learned) | Learned Weighting | 36.60 | 2.53 | 44.66 | 2.11 | 25.52 | 2.86 | 40.06 | 76.06 |
| Early Fusion | Feature Concat. | 35.99 | 2.28 | 15.12 | 2.60 | 35.29 | 2.36 | 31.14 | 83.59 |
| CAST (Ours) | State Transition | 44.77 | 2.15 | 40.47 | 2.16 | 47.39 | 2.14 | 53.81 | 74.67 |

Universality Across Backbones

CAST consistently improves over the corresponding zero-shot baseline while operating in the same frozen native vision-language embedding space for each backbone.

Category I: Video Foundation Models

| Backbone | Setting | YouCook2 Acc. | YouCook2 MnR | COIN Acc. | COIN MnR | CrossTask Acc. | CrossTask MnR | State | Ident. |
|---|---|---|---|---|---|---|---|---|---|
| InternVideo2-1B | Zero-Shot | 36.75 | 2.59 | 17.99 | 3.36 | 20.61 | 3.31 | 65.70 | 30.85 |
| InternVideo2-1B | + CAST | 71.68 | 1.48 | 51.03 | 1.90 | 64.36 | 1.71 | 75.43 | 77.77 |
| VideoPrism-B | Zero-Shot | 47.45 | 2.13 | 17.60 | 3.32 | 20.25 | 3.24 | 68.38 | 33.68 |
| VideoPrism-B | + CAST | 75.59 | 1.38 | 51.64 | 1.90 | 62.11 | 1.74 | 76.92 | 77.66 |

Category II: Multimodal Embedding Models

| Backbone | Setting | YouCook2 Acc. | YouCook2 MnR | COIN Acc. | COIN MnR | CrossTask Acc. | CrossTask MnR | State | Ident. |
|---|---|---|---|---|---|---|---|---|---|
| GME-Qwen2-VL-2B | Zero-Shot | 29.62 | 3.10 | 17.17 | 3.44 | 19.40 | 3.61 | 56.31 | 29.73 |
| GME-Qwen2-VL-2B | + CAST | 54.39 | 1.95 | 45.68 | 2.05 | 52.43 | 2.04 | 67.20 | 72.28 |
| Qwen3-VL-Embedding-2B | Zero-Shot | 33.45 | 2.89 | 17.73 | 3.50 | 19.44 | 3.56 | 58.44 | 29.79 |
| Qwen3-VL-Embedding-2B | + CAST | 56.64 | 1.85 | 44.87 | 2.09 | 48.96 | 2.09 | 66.18 | 69.11 |

Effectiveness of the CAST mechanism

With CLIP ViT-B/32 fixed, CAST reaches 44.77 Acc. on YouCook2 and 47.39 on CrossTask, while the learned late-fusion baseline is strongest only on COIN (44.66). This matches the paper's conclusion that scalar aggregation can exploit visual inertia on COIN but does not generalize to larger state changes.

Universality in frozen embedding spaces

The same adapter transfers across InternVideo2, VideoPrism, GME-Qwen2-VL-2B, and Qwen3-VL-Embedding-2B. Across all four backbones, CAST improves both retrieval accuracy and the diagnostic state / identity metrics without changing the frozen backbone encoders.

Ablation

Ablation of architecture, inference, and context length.

The ablations isolate where CAST's gains come from: residual target prediction, dual-path architecture, ensemble inference, and the amount of visual history provided to the model.

Target formulation and architecture

YouCook2 validation benchmark. Residual modeling outperforms direct target prediction, and the dual-path CAST design improves over simple early fusion.

| Fusion | Target | Acc. | State | Ident. |
|---|---|---|---|---|
| Early (Concat) | Direct (v̂t) | 35.99 | 38.92 | 83.58 |
| Early (Concat) | Residual (Δ) | 38.95 | 43.51 | 81.99 |
| CAST (Ours) | Residual (Δ) | 44.77 | 51.03 | 78.48 |

Inference signal decomposition

YouCook2 validation benchmark. The full ensemble offers the best balance between state discrimination and identity preservation.

| Inference Strategy | Acc. | Ident. | State |
|---|---|---|---|
| A. Text Matching (qt) | 25.03 | 25.32 | 50.45 |
| B. Visual Continuity (vt-1) | 25.90 | 81.95 | 27.70 |
| C. CAST Prediction (v̂t) | 42.60 | 75.99 | 50.81 |
| Semantic Ensemble (A+C) | 45.46 | 70.38 | 56.71 |
| Full Ensemble (A+B+C) | 44.77 | 78.48 | 51.03 |
Figure: Breakdown of top-1 retrieval outcomes into exact-match, state-misaligned, and identity-inconsistent cases.

Qualitative trade-off of the inference ensemble

This figure complements the inference-decomposition table by showing how CAST shifts top-1 retrieval outcomes toward exact matches and identity-consistent continuations.

Effect of context history length
Figure: Accuracy and identity gains as context length increases.

Context length ablation

The paper reports the largest jump when moving from the context-agnostic setting L=0 to using only the immediate predecessor L=1, with performance then largely saturating as longer history is added.

Demos

Sequential narrative demos.

Each stitched video shows the recent context, the continuation selected by standard text matching, and the continuation selected by CAST, making consistency visible over the full temporal sequence.

Example 1

Sequence ID: Ky0zf0v2F5A_7

Example 2

Sequence ID: PQ97HXmsFR0_2

Example 3

Sequence ID: T_o_T3LEYLY_6

Example 4

Sequence ID: p-NnIyGFZVw_2

Example 5

Sequence ID: qRSZEN6g8jY_11

Example 6

Sequence ID: ulrh6C5V_VI_6

Generation

CAST also helps guide black-box video generation.

Given one visual-context image and a next-step instruction, CAST reranks the same pool of K=4 Veo candidates that standard text matching scores; the outputs each method selects are then compared in a blind human study.
Figure: Blind human preference evaluation of CAST-guided selection versus standard text matching over Veo candidates.

Human preference summary

The blind human study reports higher preference for CAST-guided candidate selection than for standard text matching across overall preference, physical plausibility, and temporal logic.

Blind human evaluation protocol

The study samples 300 prompts from the YouCook2 validation benchmark. Standard text matching and CAST reranking each choose one video from the same candidate pool.

Guiding video generation

Blind human evaluation on the prompts where the two methods select different candidates.

| Selection Method | Overall Preference | Physical Plausibility | Temporal Logic |
|---|---|---|---|
| Standard Text Match | 38.6% | 39.9% | 38.6% |
| CAST Reranking | 55.1% | 50.6% | 52.5% |
| Human Tie | 6.3% | 9.5% | 8.9% |
Citation

BibTeX

ArXiv citation for CAST.
@misc{liu2026castmodelingvisualstate,
    title={CAST: Modeling Visual State Transitions for Consistent Video Retrieval}, 
    author={Yanqing Liu and Yingcheng Liu and Fanghong Dong and Budianto Budianto and Cihang Xie and Yan Jiao},
    year={2026},
    eprint={2603.08648},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2603.08648}, 
}