
VLAA-GUI: Knowing When to STOP, RECOVER, and SEARCH

A Modular Framework for Reliable GUI Automation

1UC Santa Cruz, 2CMU, 3UNC-Chapel Hill, 4Salesforce Research, 5UC Berkeley
*Equal contribution
OSWorld-Verified: 77.5% · WindowsAgentArena: 61.0% · Surpasses Human (72.4%)

TL;DR

Autonomous GUI agents suffer from two chronic failure modes: early stopping (declaring success before the task is actually done) and repetitive loops (cycling through the same failing action without recovering). VLAA-GUI is a modular framework with three integrated components that tell the agent when to STOP, RECOVER, and SEARCH:

  • A mandatory Completeness Verifier enforces UI-observable success criteria at every finish step, double-checked by an independent verifier model.
  • A mandatory Loop Breaker detects repeated actions / recurring screen states / reflection-signaled stalls and escalates across three tiers.
  • An on-demand Search Agent queries a search-grounded LLM directly in text, skipping the overhead of browser-based visual search.

Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified with Claude Opus 4.6 and 61.0% on WindowsAgentArena. Three of five backbones surpass human-level (72.4%) on OSWorld in a single pass, and VLAA-GUI with Sonnet 4.6 at only 15 action steps already outperforms the best published 50-step system.

Two Problems We Found

Across benchmarks like OSWorld and WindowsAgentArena, the same two behaviors account for most failures:

1. Early Stopping

Agents routinely declare done() prematurely—after opening a Save As dialog without writing the file, or after toggling a setting without confirming the state actually changed. Completion is left to the model's implicit judgment rather than verified against observable UI evidence. In our analysis, over 86% of all failures involve the agent incorrectly believing it has succeeded.

2. Repetitive Loops

Agents fall into loops—repeating the same failing action, or cycling through the same screen states—burning step budget with no progress. Existing anti-looping heuristics operate at a single granularity and cannot escalate across interaction modalities or planning strategies.

VLAA-GUI targets both issues directly with mandatory post-action checks that run after every step, rather than bolt-on reflection that the agent is free to ignore.

System Overview

VLAA-GUI is built around a single Manager Agent that perceives, reasons, and acts on the GUI in one loop—no hierarchical subtask decomposition, no persistent memory module. After every action, two mandatory tools run: the Completeness Verifier and the Loop Breaker. Three on-demand tools sit in the same action space and are invoked by the Manager whenever a situation calls for them: the Search Agent for retrieving tutorials, the Coding Agent for bulk edits and computation, and the Grounding Agent for precise click localization.

VLAA-GUI framework: the Manager Agent interacts with the Environment via screenshots and actions, and coordinates Mandatory Tools (Verifier, Loop Breaker) and On-demand Tools (Searcher, Coder, Grounder).
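The control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `run_episode`, `Step`, and the four callables are hypothetical names, and the verifier and loop checker are passed in as plain functions so the loop stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str                                     # e.g. "click(Save)" or "done()"
    notes: List[str] = field(default_factory=list)  # feedback from the mandatory tools


def run_episode(
    propose: Callable[[List[Step]], str],                 # Manager: history -> next action
    execute: Callable[[str], None],                       # applies an action to the GUI
    verify_done: Callable[[List[Step]], Optional[str]],   # None = accept, str = rejection reason
    check_loop: Callable[[List[Step]], Optional[str]],    # None = no loop, str = directive
    max_steps: int = 100,
) -> bool:
    """Return True once a done() proposal is accepted, False if the budget runs out."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = Step(propose(history))
        history.append(step)
        if step.action == "done()":
            reason = verify_done(history)
            if reason is None:
                return True
            # A rejected done() is kept in the trajectory so the next step sees why.
            step.notes.append(f"verifier rejected done(): {reason}")
        else:
            execute(step.action)
        # The loop check runs after every step, whatever the action was.
        directive = check_loop(history)
        if directive is not None:
            step.notes.append(directive)
    return False
```

The point of the sketch is that both checks sit inside the loop itself, so the Manager cannot skip them; on-demand tools would simply be further actions that `propose` can return.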

When to STOP, RECOVER, and SEARCH

STOP — Completeness Verifier

A two-level check (prompt-level and agent-level), backed by per-action micro-verification rules, that prevents the agent from declaring success without evidence.

  • Completion Gate (prompt-level). At task start, the Manager rewrites the instruction into 1–3 UI-observable success criteria. Every step, it re-evaluates each criterion against the current screenshot. Only when all criteria pass and the UI is stable may the next action be agent.done().
  • Verifier Judge (agent-level). Whenever the Manager proposes done(), an independent verifier model cross-examines the completion claim with conservative decision rules—any ambiguity, missing file-save confirmation, unverified exact value, or open dialog forces rejection. Rejection reasons are appended to the trajectory so the next step starts with context.
  • Micro-verification rules. Each action type has an expected visual outcome: clicking a button must reveal a new UI element, toggling must flip a state label, typing must show the text in the field, saving must show a new file or success toast.

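Impact: reduces the false-completion rate by up to 3.9 percentage points on OSWorld (Sonnet 4.6: 30.4% → 26.5%). On WAA Office tasks, removing the verifier costs 21.0 points of success rate (32.6% → 11.6%).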
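A minimal sketch of the two checks, assuming the screenshot has already been reduced to a handful of boolean observations. In the real system a model reads these from the screen; here the `Evidence` fields, function names, and rejection messages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Evidence:
    criteria: Dict[str, bool]          # success criterion -> observed as satisfied on screen
    ui_stable: bool = True
    open_dialog: bool = False
    save_confirmed: bool = True
    exact_value_verified: bool = True


def completion_gate(ev: Evidence) -> bool:
    """Prompt-level gate: done() is allowed only if all criteria pass and the UI is stable."""
    return bool(ev.criteria) and all(ev.criteria.values()) and ev.ui_stable


def verifier_judge(ev: Evidence) -> List[str]:
    """Agent-level judge: return rejection reasons; an empty list means accept."""
    reasons = [f"criterion not met: {c}" for c, ok in ev.criteria.items() if not ok]
    if ev.open_dialog:
        reasons.append("a dialog is still open")
    if not ev.save_confirmed:
        reasons.append("no file-save confirmation")
    if not ev.exact_value_verified:
        reasons.append("exact value not verified")
    return reasons


# The rejected done() from the case study: wrong color and no save.
ev = Evidence({"slide number is red": False}, save_confirmed=False)
print(completion_gate(ev))   # False
print(verifier_judge(ev))
```

The judge is deliberately conservative: any single flag produces a rejection, and the reasons are returned as text so they can be appended to the trajectory.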

RECOVER — Loop Breaker

Three escalating tiers detect and break repetitive behavior after every action step:

  • Tier 1 — Modality Switch. If the same action on the same target produces no visible change twice, the next action must use a different modality (e.g., keyboard shortcut → menu click → command-line).
  • Tier 2 — Strategy Change. If the same screen state recurs three times, the agent must switch its overall strategy (e.g., menu navigation → programmatic file editing via the Coding Agent).
  • Tier 3 — Reflection-Driven Judge. An external model judge inspects the recent trajectory and emits KEEP or SWITCH. On SWITCH, a hard directive is injected that blacklists the repeated action on the next step.

Impact: nearly halves wasted steps for loop-prone backbones (Gemini 3 Flash: 4.9% → 2.8%).
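The three tiers can be sketched as one function over the recent trajectory. The thresholds (two no-op repeats, three recurrences of a screen state) come from the text above; everything else is an assumption made for illustration: screen states are represented by opaque hash strings, the strongest applicable tier wins, the judge is an optional callable consulted on every call, and the directive strings are invented.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# (action, screen_hash_before, screen_hash_after); hashes are hypothetical stand-ins
# for whatever screen-state comparison the real system uses.
Step = Tuple[str, str, str]


def loop_breaker(
    history: List[Step],
    judge: Optional[Callable[[List[Step]], str]] = None,  # returns "KEEP" or "SWITCH"
) -> Optional[str]:
    """Return an escalation directive, or None if no loop is detected."""
    if not history:
        return None
    last_action, _, last_after = history[-1]

    # Tier 3: an external judge can force a switch and blacklist the repeated action.
    if judge is not None and judge(history) == "SWITCH":
        return f"TIER3: do not repeat {last_action}"

    # Tier 2: the resulting screen state has now been reached three times.
    if Counter(after for _, _, after in history)[last_after] >= 3:
        return "TIER2: change strategy"

    # Tier 1: the same action left the screen unchanged twice in a row.
    run = 0
    for action, before, after in reversed(history):
        if action != last_action or before != after:
            break
        run += 1
    if run >= 2:
        return "TIER1: switch modality"
    return None
```

With this ordering, a third identical no-op escalates naturally from Tier 1 to Tier 2, because the unchanged screen also counts as a recurring state.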

SEARCH — Search Agent

GUI agents stall on unfamiliar application workflows. Prior work retrieves tutorials via a visual browser sub-agent, which burns steps and inherits the same grounding failures it is trying to compensate for. VLAA-GUI's Search Agent takes a more direct route: it formulates a targeted “How to” query and issues it to a search-grounded LLM (Gemini 3 Pro), which returns a plain-text tutorial. The result is injected into the Manager's context as complementary knowledge.

This unifies all information in the text domain, avoiding the overhead of browser interaction entirely. The Search Agent is invoked only when the Manager is uncertain about a GUI workflow and a tutorial is likely to exist.
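A rough sketch of the text-only route, assuming the search-grounded LLM is reachable through some callable that maps a query string to a plain-text answer. `ask_llm`, `build_query`, and the context format are hypothetical; no real API is implied.

```python
from typing import Callable, List


def build_query(goal: str, app: str) -> str:
    """Format a targeted "How to" query for one application workflow."""
    return f"How to {goal} in {app}?"


def search_agent(
    goal: str,
    app: str,
    ask_llm: Callable[[str], str],   # stand-in for the search-grounded LLM
    context: List[str],              # the Manager's text context
) -> str:
    """Fetch a plain-text tutorial and inject it into the Manager's context."""
    query = build_query(goal, app)
    tutorial = ask_llm(query).strip()
    if tutorial:  # inject only non-empty answers
        context.append(f"[tutorial for: {query}]\n{tutorial}")
    return tutorial


# Usage with a stub in place of the real model:
context: List[str] = []
stub = lambda q: "Open View > Master Slide, select the slide number field, set the font color."
search_agent("change color of slide number", "LibreOffice Impress", stub, context)
print(context[0].splitlines()[0])
```

Because both the query and the answer are plain strings, nothing in this path consumes GUI action steps or depends on click grounding.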

Two auxiliary on-demand components complete the system: the Coding Agent handles bulk edits (≥20 cells/lines), non-trivial computation, and tasks where the GUI route is blocked, running Python/Bash in an independent execution loop. The Grounding Agent (Seed 1.8 by default) translates natural-language element descriptions into precise screen coordinates.
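The conditions for handing work to the Coding Agent can be read as a simple predicate. In the actual system the Manager makes this call itself; the function below is only a hypothetical way to write the stated rule down, with the 20-item threshold taken from the text.

```python
def should_use_coding_agent(
    items_to_edit: int = 0,
    needs_computation: bool = False,
    gui_blocked: bool = False,
    bulk_threshold: int = 20,   # ">= 20 cells/lines" counts as a bulk edit
) -> bool:
    """True if any of the three stated conditions for the Coding Agent holds."""
    return items_to_edit >= bulk_threshold or needs_computation or gui_blocked


print(should_use_coding_agent(items_to_edit=25))                   # True: bulk edit
print(should_use_coding_agent(items_to_edit=3))                    # False: stay in the GUI
print(should_use_coding_agent(items_to_edit=3, gui_blocked=True))  # True: GUI route blocked
```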

📊 Headline Results at a Glance

VLAA-GUI is the first framework to surpass human performance on OSWorld, while each mandatory component contributes measurable reductions in failure modes.

Scaling plot: VLAA-GUI is the first to surpass human performance (72.4%) on OSWorld-Verified, reaching 77.5% with Opus 4.6 at a 100-step budget; the plot also covers the Opus 4.5, Sonnet 4.6, and Gemini 3.1 Pro backbones.
VLAA-GUI is the first system to cross the human-level (72.4%) bar on OSWorld-Verified, reaching 77.5% with Opus 4.6.
Bar chart: the Completeness Verifier reduces false-completion rate by 3.9 points for Sonnet 4.6 (30.4% → 26.5%) and 0.5 points for Gemini 3 Flash (30.2% → 29.7%).
The Completeness Verifier cuts the false-completion rate by up to 3.9 percentage points, turning silent failures into recoverable ones.
Scatter plot: the Loop Breaker shifts both Sonnet 4.6 (7.2%,3.2% → 5.0%,2.1%) and Gemini 3 Flash (12.5%,4.9% → 10.6%,2.8%) toward lower loop ratios and fewer wasted steps.
The Loop Breaker nearly halves wasted steps for loop-prone backbones (Gemini 3 Flash: 4.9% → 2.8%).
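The changes quoted above are differences between two percentages, so it helps to separate the absolute change (in percentage points) from the relative change. The short calculation below uses only the figures from the captions; the helper name `reduction` is made up for this example.

```python
def reduction(before: float, after: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    points = round(before - after, 1)
    relative = round(100 * (before - after) / before, 1)
    return points, relative


# False-completion rate with the Completeness Verifier
print(reduction(30.4, 26.5))  # Sonnet 4.6     -> (3.9, 12.8)
print(reduction(30.2, 29.7))  # Gemini 3 Flash -> (0.5, 1.7)

# Wasted steps with the Loop Breaker
print(reduction(4.9, 2.8))    # Gemini 3 Flash -> (2.1, 42.9)
```

A relative drop of about 43% in wasted steps is what "nearly halves" refers to, while the 3.9 for the verifier is an absolute difference in percentage points.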

🏆 OSWorld-Verified Results

Success rate (%) across five application domains. Human performance on OSWorld is 72.4%.

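| Method | Steps | OS | Office | Daily | Prof. | Workflow | Avg. |
|---|---|---|---|---|---|---|---|
| Prior Work (100 steps) | | | | | | | |
| OpenCUA-72B | 100 | 61.1 | 44.7 | 50.0 | 72.6 | 22.2 | 44.9 |
| Claude-Sonnet-4.5 | 100 | 70.8 | 72.6 | 61.4 | 63.3 | 49.5 | 62.8 |
| CoAct-1 w/ GPT-5 | 100 | 75.0 | 62.9 | 57.9 | 71.4 | 47.9 | 59.9 |
| OS-Symphony w/ GPT-5 | 100 | 79.2 | 65.7 | 67.8 | 69.2 | 58.0 | 65.8 |
| Agent S3 w/ Opus 4.5 | 100 | 75.0 | 76.1 | 67.5 | 59.2 | 59.0 | 67.5 |
| HIPPO w/ Opus 4.5 | 100 | 87.5 | 74.3 | 69.3 | 95.9 | 64.3 | 74.5 |
| VLAA-GUI (ours) | | | | | | | |
| VLAA-GUI w/ Opus 4.6 | 100 | 91.7 | 82.9 | 75.2 | 83.7 | 65.6 | 77.5 |
| VLAA-GUI w/ Opus 4.5 + MAI-UI | 100 | 91.7 | 84.3 | 72.8 | 83.7 | 61.1 | 76.3 |
| VLAA-GUI w/ Opus 4.5 | 100 | 87.5 | 79.4 | 76.6 | 81.6 | 61.0 | 74.9 |
| VLAA-GUI w/ Gemini 3 Flash | 100 | 91.7 | 64.9 | 74.9 | 67.3 | 63.4 | 68.8 |
| VLAA-GUI w/ Gemini 3.1 Pro | 100 | 83.3 | 76.6 | 73.8 | 73.5 | 62.9 | 72.5 |
| VLAA-GUI w/ Sonnet 4.6 | 100 | 83.3 | 79.2 | 69.3 | 57.1 | 68.8 | 71.7 |
| Tight Budget: VLAA-GUI at 15 steps already beats the best 50-step system | | | | | | | |
| VLAA-GUI w/ Sonnet 4.6 | 15 | 83.3 | 69.7 | 58.7 | 57.1 | 60.2 | 64.1 |
| VLAA-GUI w/ Opus 4.6 | 15 | 83.3 | 60.7 | 66.4 | 79.6 | 55.9 | 64.8 |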

Three backbones (Opus 4.6, Opus 4.5, Gemini 3.1 Pro) surpass human-level performance in a single pass—a first for GUI agents on OSWorld-Verified.

🌟 WindowsAgentArena Results

Success rate (%) on 154 Windows tasks with VLAA-GUI (Gemini 3 Flash). Each ablation removes one component from the full system.

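| Method | Office | Web | Sys. | Code | Media | Util. | Overall |
|---|---|---|---|---|---|---|---|
| Max 50 Steps | | | | | | | |
| UI-TARS-2 | - | - | - | - | - | - | 50.6 |
| Agent S3 w/ GPT-5 | - | - | - | - | - | - | 54.1 |
| VLAA-GUI | 32.6 | 73.3 | 87.5 | 66.7 | 52.4 | 75.0 | 60.4 |
| − Completeness Verifier | 11.6 | 73.3 | 70.8 | 54.2 | 61.9 | 83.3 | 51.3 |
| − Loop Breaker | 27.9 | 66.7 | 83.3 | 58.3 | 38.1 | 75.0 | 52.6 |
| − Search Agent | 18.6 | 63.3 | 75.0 | 66.7 | 38.1 | 66.7 | 49.4 |
| Max 100 Steps | | | | | | | |
| GTA1-32B w/ o3 | - | - | - | - | - | - | 51.2 |
| Agent S3 w/ GPT-5 | - | - | - | - | - | - | 56.6 |
| VLAA-GUI | 35.0 | 73.3 | 87.5 | 66.7 | 52.4 | 83.3 | 61.0 |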
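To read the 50-step ablation rows, it helps to compute how many points of success rate each removal costs. The snippet below does that from the table values only; the variable and function names are chosen for this example.

```python
CATEGORIES = ["Office", "Web", "Sys.", "Code", "Media", "Util.", "Overall"]
FULL = [32.6, 73.3, 87.5, 66.7, 52.4, 75.0, 60.4]   # full VLAA-GUI, max 50 steps
ABLATIONS = {
    "Completeness Verifier": [11.6, 73.3, 70.8, 54.2, 61.9, 83.3, 51.3],
    "Loop Breaker":          [27.9, 66.7, 83.3, 58.3, 38.1, 75.0, 52.6],
    "Search Agent":          [18.6, 63.3, 75.0, 66.7, 38.1, 66.7, 49.4],
}


def drops(removed: str) -> dict:
    """Points of success rate lost per category when `removed` is ablated (negative = gain)."""
    return {c: round(f - a, 1) for c, f, a in zip(CATEGORIES, FULL, ABLATIONS[removed])}


for name in ABLATIONS:
    d = drops(name)
    print(f"without {name}: overall drop {d['Overall']}, Office drop {d['Office']}")
```

Each removal lowers the overall score (by 9.1, 7.8, and 11.0 points respectively), and the largest single-category effect is the 21.0-point Office drop without the Completeness Verifier. A drop can also be negative: in the table, Media and Util. are higher without the verifier.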

🎥 Video Demonstrations

Real trajectories from VLAA-GUI with Claude Opus 4.6 on OSWorld-Verified. Each video shows the agent completing a real task end-to-end on an Ubuntu desktop.

Chrome · Startup Page Fix

“On my Surface Pro, whenever I launch Chrome it always opens ‘funbrain.com’. I don't want this. I cleared my cache but it still happens—can you fix it?”

Case Study: When the Components Work Together

Task: “I am preparing a PPT in LibreOffice Impress. Help me change the color of the slide number to red.”

Case study: VLAA-GUI recovers from an early done() via the Completeness Verifier and Search Agent on a LibreOffice Impress task.
  1. Phase 1 — Premature completion. The agent enters the Master Slide, changes the font color, and calls agent.done(). But in Normal View the slide number color has not changed. FAILED.
  2. Phase 2 — Verifier & Search. The Completeness Verifier rejects the done(): Slide Number Not Red and File Not Saved. The agent then invokes the Search Agent with the query “How to change color of slide number in LibreOffice Impress?”, which surfaces a hidden GUI structure—a second master slide named OBJECT—revealing the underlying template mismatch.
  3. Phase 3 — Recovery. Armed with the search insight, the agent enters the 2nd Master Slide, applies the font color change, and saves the file. Verifier: PASS.

Without the Completeness Verifier, the premature done() in Phase 1 would have ended the episode: the agent would have reported success while the task scored 0. Without the Search Agent, the agent would never have learned about the second master slide. Together, STOP, RECOVER, and SEARCH turn a silent failure into a verified completion.

Conclusion and Future Work

We introduced VLAA-GUI, a modular GUI agent framework organized around three decisions every long-horizon agent must make well: when to STOP, when to RECOVER, and when to SEARCH. A mandatory Completeness Verifier closes the false-completion gap that dominates GUI failures, a tiered Loop Breaker escalates across modalities and strategies to escape repetitive behavior, and a text-only Search Agent supplies external workflow knowledge without paying the cost of browser interaction. Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified and 61.0% on WindowsAgentArena, with three backbones surpassing human-level performance in a single pass—while the 15-step configuration already beats the best published 50-step system.

Several directions remain open:

  • More Advanced Planning Strategies. The current Manager operates in a flat, iterative planning mode—one screenshot in, one action out—with no long-horizon task decomposition or lookahead. Richer planning strategies could help on the tasks where VLAA-GUI still stalls.
  • Comprehensive Memory System. VLAA-GUI currently carries only the in-trajectory context of a single task—there is no persistent memory across tasks, across applications, or even across sessions on the same application. A well-structured memory system would let the agent leverage the experience of previous interactions, providing a richer context for decision-making.
  • Visual Grounding with Tools. The current Grounding Agent consumes the full 1920×1080 screenshot and returns coordinates in a single shot—adequate for large, distinct UI elements but brittle on small icons, dense toolbars, or visually similar controls. A more capable grounder would expose a small visual tool library that it can call iteratively: crop to a predicted region, zoom in to resolve sub-pixel ambiguity, annotate candidate elements with numeric labels for disambiguation, and re-ground on the focused region. Integrating these tools also lets the same crop/zoom stream feed the Completeness Verifier, tightening evidence-based success checks for actions whose outcome is localized.
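As a rough illustration of the crop-and-zoom idea in the last item, the sketch below shows the coordinate bookkeeping such a tool library would need: choosing a crop box that stays on screen, and mapping a point found in the zoomed crop back to full-screenshot coordinates. This is not part of VLAA-GUI; the function names and the uniform zoom factor are assumptions.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in screen pixels


def crop_box(cx: int, cy: int, w: int, h: int,
             screen_w: int = 1920, screen_h: int = 1080) -> Box:
    """Return a w x h box centered near (cx, cy), shifted to stay inside the screen."""
    left = min(max(cx - w // 2, 0), screen_w - w)
    top = min(max(cy - h // 2, 0), screen_h - h)
    return left, top, w, h


def to_screen(x_zoomed: float, y_zoomed: float, box: Box, zoom: float) -> Tuple[float, float]:
    """Map a point in the zoomed crop back to full-screenshot coordinates."""
    left, top, _, _ = box
    return left + x_zoomed / zoom, top + y_zoomed / zoom


# A first-pass guess near the bottom-right corner, refined on a 4x zoomed crop.
box = crop_box(1900, 1000, 480, 270)
print(box)                            # (1440, 810, 480, 270)
print(to_screen(960, 540, box, 4.0))  # (1680.0, 945.0)
```

Re-grounding on the zoomed crop and mapping the result back through `to_screen` is what would let a grounder refine its first guess on small or densely packed elements.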

We hope VLAA-GUI's modular view—that reliability comes from mandatory, post-action checks rather than better single-shot reasoning—offers a useful foundation for the next generation of GUI agents.

Acknowledgements

We are grateful to the excellent GUI agent teams who generously shared their code and insights, including CoAct, OS-Symphony, and Agent S3. We sincerely thank the OSWorld team for their support in verifying our trajectories and providing feedback on our evaluation.

BibTeX

@article{han2026vlaagui,
  title={Knowing When to \texttt{STOP}, \texttt{RECOVER}, and \texttt{SEARCH}: A Modular Framework for GUI Automation},
  author={Qijun Han and Haoqin Tu and Zijun Wang and Haoyue Dai and Yiyang Zhou and Nancy Lau and Alvaro A. Cardenas and Yuhui Xu and Ran Xu and Caiming Xiong and Zeyu Zheng and Huaxiu Yao and Yuyin Zhou and Cihang Xie},
  journal={arXiv preprint arXiv:2604.21375},
  year={2026},
  url={https://arxiv.org/abs/2604.21375}
}