A Real-Time, Personalized Agent for the Physical World

VisualClaw combines edge-side video compression, hot/cold skill injection, and memory-guided skill evolution so multimodal agents can see, act, and improve without updating VLM weights.

Haoqin Tu1*, Jianwen Chen2*, Zijun Wang1, Siwei Han2, Juncheng Wu1, Hardy Chen1, Haonian Ji2, Kaiwen Xiong2, Jiaqi Liu2, Peng Xia2, Jieru Mei3, Hongliang Fei3, Jason Eshraghian1, Zeyu Zheng4, Yuyin Zhou1, Huaxiu Yao2, Cihang Xie1
* equal technical contribution
1UC Santa Cruz    2UNC-Chapel Hill    3Google    4UC Berkeley
-98.1% API cost vs. full-frame upload
+15.8% peak EgoSchema gain
200 agentic scenarios
24.4 average rounds/scenario
VisualClaw in action: a real-time personalized agent that filters streaming visual evidence, reasons with cloud VLMs, and evolves memory and skills over time.

Abstract

Multimodal agents still face three deployment gaps: dense video frames are expensive, static scaffolds do not improve after deployment, and standard video-QA benchmarks rarely test tool-using workspace behavior. VisualClaw addresses these gaps through hybrid encoding and self-evolving skill banks, and introduces VisualClawArena to evaluate visual evidence use inside executable multimodal workflows.

Three-Timescale System

The system separates fast edge filtering, per-question retrieval, and lower-frequency skill evolution so expensive multimodal context is used only when it matters.

Per frame

Cascaded Encoding Gate

As a live stream arrives, perceptual hashes, a 128-dimensional CPU encoder, and an adaptive change gate classify each frame as major, minor, or skipped.
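
A minimal sketch of the hash-plus-gate part of this cascade, under stated assumptions: it uses a simple average hash and a running-mean baseline, and it omits the 128-dimensional encoder stage entirely. The class name `ChangeGate`, the ratio thresholds, and the momentum value are hypothetical choices for illustration, not values from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

Frame = Sequence[Sequence[int]]  # grayscale rows, pixel values 0..255


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 where the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | int(p > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


@dataclass
class ChangeGate:
    # All thresholds below are illustrative assumptions.
    minor_ratio: float = 1.5   # distance / baseline needed for a minor frame
    major_ratio: float = 3.0   # distance / baseline needed for a major frame
    momentum: float = 0.9      # smoothing of the running change baseline
    floor: float = 1.0         # lowest value the baseline may take
    baseline: float = 1.0
    last_hash: Optional[int] = None

    def classify(self, frame: Frame) -> str:
        h = average_hash(frame)
        if self.last_hash is None:
            # The first frame is always kept as a major keyframe.
            self.last_hash = h
            return "major"
        d = hamming(h, self.last_hash)
        if d >= self.major_ratio * self.baseline:
            label = "major"
        elif d >= self.minor_ratio * self.baseline:
            label = "minor"
        else:
            label = "skipped"
        # Adapt the baseline so busy streams need larger changes to trigger.
        self.baseline = max(
            self.floor,
            self.momentum * self.baseline + (1 - self.momentum) * d,
        )
        if label != "skipped":
            # Compare future frames against the last frame that was kept.
            self.last_hash = h
        return label
```

In this sketch each frame is compared against the last kept frame rather than the previous raw frame, so slow drift accumulates until it crosses a threshold instead of being skipped indefinitely.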

Per question

Hot/Cold Skill Injection

The top-k retrieved skills are inlined as hot context, while the rest of the bank stays available as a compact catalogue to keep prompt cost bounded.
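
The hot/cold split can be sketched as follows, assuming skills are ranked by cosine similarity between a query embedding and a per-skill embedding. The `Skill` fields, the `build_skill_context` function, and the prompt layout are hypothetical; the page does not specify the retrieval metric or the catalogue format.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class Skill:
    name: str
    summary: str                 # one-line entry shown in the cold catalogue
    body: str                    # full text, inlined only when the skill is hot
    embedding: Sequence[float]   # assumed to come from some text encoder


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_skill_context(query: Sequence[float], bank: List[Skill], k: int = 2) -> str:
    """Inline the top-k skills in full and list the rest by name and summary."""
    ranked = sorted(bank, key=lambda s: cosine(query, s.embedding), reverse=True)
    hot, cold = ranked[:k], ranked[k:]
    lines = ["[HOT SKILLS]"]
    for s in hot:
        lines.append(f"{s.name}:\n{s.body}")
    lines.append("[SKILL CATALOGUE]")
    for s in cold:
        lines.append(f"- {s.name}: {s.summary}")
    return "\n".join(lines)
```

Because cold skills contribute only one short line each, the prompt grows slowly with bank size while the full text of at most `k` skills is paid for per question.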

Per session

Memory-Guided Evolution

Correct examples enter memory, failures trigger an offline evolver, and bank hygiene keeps skill updates useful over long deployment histories.
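
One way to picture this loop is the sketch below. It assumes the evolver is an arbitrary callable that maps failures and memory to proposed skills, and that bank hygiene means skipping duplicate skill text and capping the bank size. All names (`Episode`, `SkillBank`, `SessionState`, `record`, `evolve`) and the cap of 8 are hypothetical; the actual evolver and hygiene rules are described in the paper, not on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    question: str
    answer: str
    correct: bool


@dataclass
class SkillBank:
    max_skills: int = 8  # hypothetical cap on bank size
    skills: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, text: str) -> None:
        self.skills[name] = text
        # Hygiene: evict the oldest entries once the cap is exceeded.
        while len(self.skills) > self.max_skills:
            del self.skills[next(iter(self.skills))]


@dataclass
class SessionState:
    memory: List[Episode] = field(default_factory=list)
    failures: List[Episode] = field(default_factory=list)
    bank: SkillBank = field(default_factory=SkillBank)


Evolver = Callable[[List[Episode], List[Episode]], Dict[str, str]]


def record(state: SessionState, ep: Episode) -> None:
    """Correct episodes go to memory; failures are queued for the evolver."""
    (state.memory if ep.correct else state.failures).append(ep)


def evolve(state: SessionState, evolver: Evolver) -> int:
    """Run the offline evolver on queued failures; return how many skills were added."""
    if not state.failures:
        return 0
    proposals = evolver(state.failures, state.memory)
    added = 0
    for name, text in proposals.items():
        if text in state.bank.skills.values():
            continue  # Hygiene: skip skills whose text is already in the bank.
        state.bank.add(name, text)
        added += 1
    state.failures.clear()
    return added
```

In a deployment, `evolver` would presumably wrap a language-model call; here it is left abstract so the control flow stays visible. No model weights are touched at any point, only the text in the bank.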

VisualClaw pipeline diagram
VisualClaw pipeline: edge-side visual compression, skill/memory retrieval, and meta-evolution of the language-layer scaffold.

VisualClawArena

VisualClawArena turns video examples into multimodal agentic scenarios with documents, chat/audio traces, dynamic updates, and executable checks.

Suite

200 Scenarios

Built from Indoor/VSI, EgoSchema, and QVHighlights videos, with an average of 24.4 steps per scenario.

Grounding

18.1 Average Visual Steps

Each scenario averages 18.1 visual-required steps after timestamp-grounded construction and text-only leakage checks.

Scoring

Executable Workspaces

Agents must reconcile visual facts with files and leave a workspace that can be automatically scored.
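
As an illustration of what an executable check could look like, the sketch below scores a workspace by comparing fields in JSON files against expected values. The file layout, the `score_workspace` name, and the all-or-nothing per-field rule are assumptions for this example; the benchmark's real checkers are not shown on this page.

```python
import json
from pathlib import Path
from typing import Any, Dict


def score_workspace(workspace: Path, expected: Dict[str, Dict[str, Any]]) -> float:
    """Return the fraction of expected (file, key, value) checks that pass."""
    passed = 0
    total = 0
    for rel_path, fields in expected.items():
        try:
            data = json.loads((workspace / rel_path).read_text())
        except (OSError, json.JSONDecodeError):
            data = {}  # A missing or malformed file fails all of its checks.
        if not isinstance(data, dict):
            data = {}
        for key, want in fields.items():
            total += 1
            passed += int(data.get(key) == want)
    return passed / total if total else 0.0
```

A call such as `score_workspace(Path("ws"), {"inventory.json": {"chairs": 4}})` would pass only if the agent wrote the visually observed count into the file, which is the kind of reconciliation the scoring description refers to.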

VisualClawArena data curation pipeline
Five-stage curation pipeline for executable, diverse, visually grounded multimodal agent tasks.
Example VisualClawArena benchmark scenario with video clip, workspace files, and multi-round instructions
Example VisualClawArena case: each scenario combines a video clip, role and identity files, user context, workspace artifacts, and multi-round instructions that require visual grounding and executable workspace updates.

Results

The same design improves static video-QA, multimodal agentic workflows, and cost efficiency.

General Video-QA Results

Accuracy across four video-QA benchmarks using cascade encoding and the skill/memory evolution variants from the paper.

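| Benchmark | Model | Plain | Seed | +Evolve | +SkillMemCat | FullEvo Cat. | FullEvo Guide | Uniform-8 Plain |
|---|---|---|---|---|---|---|---|---|
| EgoSchema | Gemini 3 Flash | 52.60 | 67.20 | 68.00 | 65.20 | 64.60 | 68.40 | 60.60 |
| EgoSchema | GPT-5.2 | 64.00 | 66.60 | 66.20 | 67.20 | 65.60 | 68.00 | 70.60 |
| Video-MME long | Gemini 3 Flash | 60.33 | 61.56 | 61.33 | 62.78 | 62.56 | 64.22 | 61.44 |
| Video-MME long | GPT-5.2 | 55.89 | 54.00 | 52.22 | 54.67 | 52.78 | 55.89 | 58.78 |
| EgoPlan-Bench | Gemini 3 Flash | 24.62 | 30.80 | 29.93 | 28.31 | 30.04 | 28.85 | 37.96 |
| EgoPlan-Bench | GPT-5.2 | 28.42 | 28.74 | 28.31 | 28.09 | 28.85 | 29.39 | 43.06 |
| NextQA | Gemini 3 Flash | 72.70 | 75.10 | 73.90 | 75.50 | 75.70 | 74.50 | 77.70 |
| NextQA | GPT-5.2 | 73.20 | 72.00 | 72.30 | 70.90 | 72.50 | 73.30 | 78.90 |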

Agentic Results on VisualClawArena

Macro and micro accuracy on the 200-scenario multimodal workspace benchmark, averaging 24.4 rounds per scenario, with Codex and Claude Code backends.

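| Backend | Setting | Early | Mid | Late | Micro | Macro |
|---|---|---|---|---|---|---|
| Codex | VisualClaw Cat. | 49.69 | 56.19 | 57.38 | 59.88 | 54.27 |
| Codex | VisualClaw Guide | 50.66 | 54.66 | 56.88 | 59.50 | 53.89 |
| Codex | VisualClaw w/o FullEvo | 48.10 | 52.93 | 53.47 | 57.95 | 51.35 |
| Codex | Uniform-8 | 46.74 | 50.49 | 53.90 | 56.08 | 50.25 |
| Claude Code | VisualClaw Cat. | 52.03 | 50.91 | 53.50 | 57.79 | 52.16 |
| Claude Code | VisualClaw Guide | 50.87 | 49.98 | 51.40 | 56.54 | 50.77 |
| Claude Code | VisualClaw w/o FullEvo | 48.80 | 48.51 | 49.68 | 55.34 | 49.00 |
| Claude Code | Uniform-8 | 40.25 | 44.49 | 47.52 | 49.10 | 43.99 |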
Per-day accuracy on VisualClawArena showing VisualClaw evolution over time
Per-day accuracy over 200 VisualClawArena scenarios. The curve shows how VisualClaw with memory-to-evolver concatenation improves across within-scenario steps, demonstrating real adaptation over time rather than a one-shot prompt gain.
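
For readers unfamiliar with the Micro and Macro columns in the agentic results table, the sketch below shows the conventional definitions, assuming micro accuracy pools all rounds across scenarios and macro accuracy averages per-scenario accuracies. This page does not state the exact definitions used, so treat this as the standard reading rather than a confirmed one. The scenario ids and outcomes are made up for illustration.

```python
from typing import Dict, List


def micro_accuracy(results: Dict[str, List[bool]]) -> float:
    """Pool every round from every scenario, then take one overall accuracy."""
    flat = [ok for rounds in results.values() for ok in rounds]
    return sum(flat) / len(flat)


def macro_accuracy(results: Dict[str, List[bool]]) -> float:
    """Compute accuracy per scenario, then average so each scenario counts equally."""
    per_scenario = [sum(r) / len(r) for r in results.values() if r]
    return sum(per_scenario) / len(per_scenario)


# Hypothetical outcomes: one long, mostly-correct scenario and one short, harder one.
example = {
    "scenario_long": [True] * 9 + [False],
    "scenario_short": [False, True],
}
```

Under these definitions, micro accuracy exceeds macro accuracy whenever longer scenarios tend to be answered more accurately than shorter ones, since long scenarios contribute more rounds to the pooled count.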

Cost Comparisons

Frame count, token count, and Gemini 3 Flash API spend against full-frame upload and Uniform-8 + FullEvo baselines.

| Dataset | Configuration | KF/Q | Tokens/Q | $/run | vs Full-frame | vs U-8 + FullEvo |
|---|---|---|---|---|---|---|
| EgoSchema | Full-frame @1fps | ~180 | ~192,841 | $28.93 | - | - |
| EgoSchema | Uniform-8 + FullEvo | 8.00 | 13,419 | $2.01 | - | - |
| EgoSchema | VisualClaw | 2.95 | 9,524 | $1.44 | -95.0% | -28.4% |
| Video-MME long | Full-frame @1fps | ~1,800 | ~1,926,361 | $520.12 | - | - |
| Video-MME long | VisualClaw | 5.41 | 13,420 | $3.63 | -99.3% | -15.2% |
| NextQA | Uniform-8 + FullEvo | 8.00 | 14,025 | $4.21 | - | - |
| NextQA | VisualClaw | 1.51 | 8,207 | $2.47 | -74.6% | -41.3% |
| All experiments | VisualClaw total | - | - | $10.51 | -98.1% | -25.9% |
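
The relative savings in the cost table can be reproduced from its $/run column. The short sketch below recomputes four of them, using only dollar figures copied from the table; the helper name `pct_change` is introduced here for illustration. Rows whose baseline cost is not listed on this page (for example the Video-MME long Uniform-8 + FullEvo run) are not recomputed.

```python
def pct_change(new_cost: float, baseline_cost: float) -> float:
    """Relative change of new_cost against baseline_cost, in percent, to one decimal."""
    return round(100.0 * (new_cost - baseline_cost) / baseline_cost, 1)


# $/run values copied from the cost table.
egoschema_vs_full = pct_change(1.44, 28.93)     # VisualClaw vs Full-frame @1fps
egoschema_vs_u8 = pct_change(1.44, 2.01)        # VisualClaw vs Uniform-8 + FullEvo
videomme_vs_full = pct_change(3.63, 520.12)     # VisualClaw vs Full-frame @1fps
nextqa_vs_u8 = pct_change(2.47, 4.21)           # VisualClaw vs Uniform-8 + FullEvo

print(egoschema_vs_full, egoschema_vs_u8, videomme_vs_full, nextqa_vs_u8)
# prints: -95.0 -28.4 -99.3 -41.3
```

These match the percentages reported in the corresponding rows of the table.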

Case Study

VisualClaw case studies
Two representative wins: one driven by evolved skills and one driven by memory-conditioned evolution.
  1. EgoSchema: a single major keyframe is enough; evolved skills recover the purpose-clause answer and flip the baseline choice to the ground truth.
  2. NextQA: the memory bank retrieves prior balance-and-stability patterns, steering the answer away from a misleading motion interpretation.
  3. Deployment signal: both examples compress the video to one keyframe, so the accuracy gain comes from the evolving language-layer scaffold.

BibTeX

@misc{tu2026visualclawrealtimepersonalizedagent,
  title={VisualClaw: A Real-Time, Personalized Agent for the Physical World},
  author={Haoqin Tu and Jianwen Chen and Zijun Wang and Siwei Han and Juncheng Wu and Hardy Chen and Haonian Ji and Kaiwen Xiong and Jiaqi Liu and Peng Xia and Jieru Mei and Hongliang Fei and Jason Eshraghian and Zeyu Zheng and Yuyin Zhou and Huaxiu Yao and Cihang Xie},
  year={2026},
  eprint={2606.16295},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.16295}
}