VLAA-GUI: A Modular Framework for GUI Automation

TL;DR

Autonomous GUI agents suffer from two chronic failure modes: early stopping (declaring success before the task is actually done) and repetitive loops (cycling through the same failing action without recovering). VLAA-GUI is a modular framework with three integrated components that tell the agent when to STOP, RECOVER, and SEARCH:

A mandatory Completeness Verifier enforces UI-observable success criteria at every finish step, double-checked by an independent verifier model.
A mandatory Loop Breaker detects repeated actions / recurring screen states / reflection-signaled stalls and escalates across three tiers.
An on-demand Search Agent queries a search-grounded LLM directly in text, skipping the overhead of browser-based visual search.

Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified with Claude Opus 4.6 and 61.0% on WindowsAgentArena. Three of the five backbones surpass the human-level score (72.4%) on OSWorld in a single pass, and VLAA-GUI with Sonnet 4.6 at only 15 action steps already outperforms the best published 50-step system.

Two Problems We Found

Across benchmarks like OSWorld and WindowsAgentArena, the same two behaviors account for most failures:

1. Early Stopping

Agents routinely declare done() prematurely—after opening a Save As dialog without writing the file, or after toggling a setting without confirming the state actually changed. Completion is left to the model's implicit judgment rather than verified against observable UI evidence. In our analysis, over 86% of all failures involve the agent incorrectly believing it has succeeded.

2. Repetitive Loops

Agents fall into loops—repeating the same failing action, or cycling through the same screen states—burning step budget with no progress. Existing anti-looping heuristics operate at a single granularity and cannot escalate across interaction modalities or planning strategies.

VLAA-GUI targets both issues directly with mandatory post-action checks that run after every step, rather than bolt-on reflection that the agent is free to ignore.

System Overview

VLAA-GUI is built around a single Manager Agent that perceives, reasons, and acts on the GUI in one loop—no hierarchical subtask decomposition, no persistent memory module. After every action, two mandatory tools run: the Completeness Verifier and the Loop Breaker. Three on-demand tools sit in the same action space and are invoked by the Manager whenever a situation calls for them: the Search Agent for retrieving tutorials, the Coding Agent for bulk edits and computation, and the Grounding Agent for precise click localization.

VLAA-GUI framework: the Manager Agent interacts with the Environment via screenshots and actions, and coordinates Mandatory Tools (Verifier, Loop Breaker) and On-demand Tools (Searcher, Coder, Grounder).
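
The control flow described above can be sketched as one flat loop. This is a minimal illustration under stated assumptions, not the released implementation: `run_episode`, `Trajectory`, and the callables `propose`, `execute`, `verify`, and `loop_check` are hypothetical names, and the per-step criteria re-evaluation is folded into `verify`, which is consulted when the Manager proposes `done`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """In-trajectory context only; nothing persists across tasks."""
    actions: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def run_episode(
    propose: Callable[[Trajectory], str],                # Manager: context -> next action
    execute: Callable[[str], None],                      # environment step (stubbed by caller)
    verify: Callable[[Trajectory], Optional[str]],       # rejection reason, or None to accept
    loop_check: Callable[[Trajectory], Optional[str]],   # recovery directive, or None
    max_steps: int = 15,
) -> bool:
    """Return True only if a proposed done() survives the verifier."""
    traj = Trajectory()
    for _ in range(max_steps):
        action = propose(traj)
        if action == "done":
            reason = verify(traj)
            if reason is None:
                return True
            # Rejection reasons are kept so the next step starts with context.
            traj.notes.append(f"verifier rejected: {reason}")
            continue
        execute(action)
        traj.actions.append(action)
        directive = loop_check(traj)  # mandatory check after every action
        if directive is not None:
            traj.notes.append(directive)
    return False
```

In this sketch a rejected `done` still consumes one step of the budget, which is a design assumption rather than something the text specifies.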

When to `STOP`, `RECOVER`, and `SEARCH`

`STOP` — Completeness Verifier

A two-level check that prevents the agent from declaring success without evidence.

Completion Gate (prompt-level). At task start, the Manager rewrites the instruction into 1–3 UI-observable success criteria. Every step, it re-evaluates each criterion against the current screenshot. Only when all criteria pass and the UI is stable may the next action be agent.done().
Verifier Judge (agent-level). Whenever the Manager proposes done(), an independent verifier model cross-examines the completion claim with conservative decision rules—any ambiguity, missing file-save confirmation, unverified exact value, or open dialog forces rejection. Rejection reasons are appended to the trajectory so the next step starts with context.
Micro-verification rules. Each action type has an expected visual outcome: clicking a button must reveal a new UI element, toggling must flip a state label, typing must show the text in the field, saving must show a new file or success toast.
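
The two levels above can be illustrated with a small rule-based sketch. In the framework both checks are carried out by models; here they are reduced to boolean flags so the decision rules are visible. All names (`completion_gate`, `verifier_judge`, `EXPECTED_OUTCOME`, the flag keys) are hypothetical, and treating a missing flag as a failure is an assumption chosen to match the conservative spirit of the Verifier Judge.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    may_finish: bool
    failed: List[str]


def completion_gate(criteria_status: Dict[str, bool], ui_stable: bool) -> GateResult:
    """done() is allowed only when every criterion passes and the UI is stable."""
    failed = [name for name, ok in criteria_status.items() if not ok]
    if not ui_stable:
        failed.append("ui_not_stable")
    return GateResult(may_finish=not failed, failed=failed)


def verifier_judge(evidence: Dict[str, bool]) -> List[str]:
    """Return rejection reasons; an empty list means the claim is accepted.

    A flag that is absent counts as a problem (conservative default, assumed).
    """
    rules = {
        "ambiguous": "ambiguous evidence",
        "save_unconfirmed": "missing file-save confirmation",
        "value_unverified": "unverified exact value",
        "dialog_open": "open dialog",
    }
    return [msg for key, msg in rules.items() if evidence.get(key, True)]


# Micro-verification: expected visual outcome per action type, as listed in the text.
EXPECTED_OUTCOME = {
    "click": "a new UI element is revealed",
    "toggle": "the state label flips",
    "type": "the typed text appears in the field",
    "save": "a new file or success toast appears",
}
```

For example, `completion_gate({"slide number is red": True, "file saved": False}, ui_stable=True)` blocks `done()` and reports `"file saved"` as the failing criterion.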

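Impact: reduces the false-completion rate by up to 3.9 percentage points on OSWorld (Sonnet 4.6: 30.4% → 26.5%). On WAA Office tasks, removing the verifier lowers the success rate by 21.0 points (32.6% → 11.6%).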

`RECOVER` — Loop Breaker

Three escalating tiers detect and break repetitive behavior after every action step:

Tier 1 — Modality Switch. If the same action on the same target produces no visible change twice, the next action must use a different modality (e.g., keyboard shortcut → menu click → command-line).
Tier 2 — Strategy Change. If the same screen state recurs three times, the agent must switch its overall strategy (e.g., menu navigation → programmatic file editing via the Coding Agent).
Tier 3 — Reflection-Driven Judge. An external model judge inspects the recent trajectory and emits KEEP or SWITCH. On SWITCH, a hard directive is injected that blacklists the repeated action on the next step.
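
The three triggers above can be sketched as a single detection function. This is an illustration only: `Step`, `screen_hash`, and `loop_tier` are hypothetical names, "twice" is read as two consecutive steps, and returning the highest triggered tier is an assumed priority order. The Tier 3 judge is represented by a boolean that stands in for an external model emitting SWITCH.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    action: str       # e.g. "click"
    target: str       # e.g. "File > Save"
    screen_hash: str  # hypothetical fingerprint of the screenshot after the action
    changed: bool     # whether the action produced a visible change


def loop_tier(history: List[Step], judge_says_switch: bool = False) -> Optional[int]:
    """Return the highest tier triggered by the trajectory so far, or None."""
    # Tier 3: the external judge emitted SWITCH.
    if judge_says_switch:
        return 3
    # Tier 2: the current screen state has now been seen three times.
    if history:
        counts = Counter(step.screen_hash for step in history)
        if counts[history[-1].screen_hash] >= 3:
            return 2
    # Tier 1: same action on the same target, no visible change, twice in a row.
    if len(history) >= 2:
        prev, last = history[-2], history[-1]
        same = (prev.action, prev.target) == (last.action, last.target)
        if same and not prev.changed and not last.changed:
            return 1
    return None
```

A caller would map tier 1 to a modality switch, tier 2 to a strategy change, and tier 3 to a directive that blacklists the repeated action on the next step.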

Impact: nearly halves wasted steps for loop-prone backbones (Gemini 3 Flash: 4.9% → 2.8%).

`SEARCH` — Search Agent

GUI agents stall on unfamiliar application workflows. Prior work retrieves tutorials via a visual browser sub-agent, which burns steps and inherits the same grounding failures it is trying to compensate for. VLAA-GUI's Search Agent takes a more direct route: it formulates a targeted “How to” query and issues it to a search-grounded LLM (Gemini 3 Pro), which returns a plain-text tutorial. The result is injected into the Manager's context as complementary knowledge.

This unifies all information in the text domain, avoiding the overhead of browser interaction entirely. The Search Agent is invoked only when the Manager is uncertain about a GUI workflow and a tutorial is likely to exist.
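
A minimal sketch of this text-only route, assuming the search-grounded LLM is reachable through some callable: `ask_llm` is a stand-in supplied by the caller, not a real client API, and `build_query` and `search_and_inject` are hypothetical helper names. The query wording follows the "How to" pattern described above.

```python
from typing import Callable, List


def build_query(app: str, goal: str) -> str:
    """Format a targeted how-to query (illustrative wording)."""
    return f"How to {goal.strip().rstrip('.?')} in {app}?"


def search_and_inject(
    context: List[str],
    app: str,
    goal: str,
    ask_llm: Callable[[str], str],  # stand-in for a search-grounded LLM call
    uncertain: bool,
) -> List[str]:
    """Append a plain-text tutorial to the Manager context, only when uncertain."""
    if not uncertain:
        return context
    tutorial = ask_llm(build_query(app, goal))
    return context + [f"[search tutorial]\n{tutorial}"]
```

Because the tutorial arrives as text, no browser steps are spent and no screenshots need to be grounded to obtain it.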

Two auxiliary on-demand components complete the system: the Coding Agent handles bulk edits (≥20 cells/lines), non-trivial computation, and tasks where the GUI route is blocked, running Python/Bash in an independent execution loop. The Grounding Agent (Seed 1.8 by default) translates natural-language element descriptions into precise screen coordinates.
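
The delegation criteria for the two auxiliary tools can be written down as a simple routing rule. In the framework the Manager makes this decision itself; the sketch below only encodes the conditions named in the text. `TaskHint`, `route`, and the returned labels are hypothetical, and the threshold of 20 is the bulk-edit figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class TaskHint:
    cells_or_lines: int = 0            # size of the pending edit
    needs_computation: bool = False    # non-trivial computation required
    gui_blocked: bool = False          # the GUI route is unavailable
    needs_precise_click: bool = False  # element must be localized on screen


BULK_EDIT_THRESHOLD = 20  # bulk edits are >= 20 cells/lines


def route(hint: TaskHint) -> str:
    """Pick a tool label for the next step based on the stated criteria."""
    if (
        hint.cells_or_lines >= BULK_EDIT_THRESHOLD
        or hint.needs_computation
        or hint.gui_blocked
    ):
        return "coding_agent"
    if hint.needs_precise_click:
        return "grounding_agent"
    return "manager"
```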

📊 Headline Results at a Glance

VLAA-GUI is the first framework to surpass human performance on OSWorld, while each mandatory component contributes measurable reductions in failure modes.

Scaling plot: VLAA-GUI is the first system to cross the human-level bar (72.4%) on OSWorld-Verified, reaching 77.5% with Opus 4.6 at a 100-step budget; the plot also covers the Opus 4.5, Sonnet 4.6, and Gemini 3.1 Pro backbones.

Bar chart: the Completeness Verifier reduces the false-completion rate by 3.9 points for Sonnet 4.6 (30.4% → 26.5%) and 0.5 points for Gemini 3 Flash (30.2% → 29.7%), turning silent failures into recoverable ones.

Scatter plot: the Loop Breaker shifts both Sonnet 4.6 (7.2%, 3.2% → 5.0%, 2.1%) and Gemini 3 Flash (12.5%, 4.9% → 10.6%, 2.8%) toward lower loop ratios and fewer wasted steps, nearly halving wasted steps for Gemini 3 Flash (4.9% → 2.8%).
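
Since these figures mix absolute and relative changes, a short calculation helps keep them apart. The helper below (`reduction` is a hypothetical name) takes a before/after pair of rates and returns the drop in percentage points together with the relative drop; the inputs are the values quoted above.

```python
from typing import Tuple


def reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# False-completion rate, Sonnet 4.6: 30.4% -> 26.5%
print(reduction(30.4, 26.5))  # (3.9, 12.8)

# Wasted steps, Gemini 3 Flash: 4.9% -> 2.8%
print(reduction(4.9, 2.8))    # (2.1, 42.9)
```

The 3.9 figure is therefore an absolute drop in percentage points, while "nearly halves" refers to the roughly 43% relative drop in wasted steps.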

🏆 OSWorld-Verified Results

Success rate (%) across five application domains. Human performance on OSWorld is 72.4%.

Method	Steps	OS	Office	Daily	Prof.	Workflow	Avg.
Prior Work (100 steps)
OpenCUA-72B	100	61.1	44.7	50.0	72.6	22.2	44.9
Claude-Sonnet-4.5	100	70.8	72.6	61.4	63.3	49.5	62.8
CoAct-1 w/ GPT-5	100	75.0	62.9	57.9	71.4	47.9	59.9
OS-Symphony w/ GPT-5	100	79.2	65.7	67.8	69.2	58.0	65.8
Agent S3 w/ Opus 4.5	100	75.0	76.1	67.5	59.2	59.0	67.5
HIPPO w/ Opus 4.5	100	87.5	74.3	69.3	95.9	64.3	74.5
VLAA-GUI (ours)
VLAA-GUI w/ Opus 4.6	100	91.7	82.9	75.2	83.7	65.6	77.5
VLAA-GUI w/ Opus 4.5 + MAI-UI	100	91.7	84.3	72.8	83.7	61.1	76.3
VLAA-GUI w/ Opus 4.5	100	87.5	79.4	76.6	81.6	61.0	74.9
VLAA-GUI w/ Gemini 3 Flash	100	91.7	64.9	74.9	67.3	63.4	68.8
VLAA-GUI w/ Gemini 3.1 Pro	100	83.3	76.6	73.8	73.5	62.9	72.5
VLAA-GUI w/ Sonnet 4.6	100	83.3	79.2	69.3	57.1	68.8	71.7
Tight Budget — VLAA-GUI at 15 steps already beats best 50-step system
VLAA-GUI w/ Sonnet 4.6	15	83.3	69.7	58.7	57.1	60.2	64.1
VLAA-GUI w/ Opus 4.6	15	83.3	60.7	66.4	79.6	55.9	64.8

Three backbones (Opus 4.6, Opus 4.5, Gemini 3.1 Pro) surpass human-level performance in a single pass—a first for GUI agents on OSWorld-Verified.
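
The claim can be checked directly against the Avg. column of the table. The snippet below copies the 100-step averages for the five backbones (the Opus 4.5 + MAI-UI variant is left out so that each backbone appears once) and compares them with the 72.4% human baseline; variable names are illustrative.

```python
HUMAN_LEVEL = 72.4

# Avg. column, 100-step VLAA-GUI rows from the table above.
avg_100_steps = {
    "Opus 4.6": 77.5,
    "Opus 4.5": 74.9,
    "Gemini 3 Flash": 68.8,
    "Gemini 3.1 Pro": 72.5,
    "Sonnet 4.6": 71.7,
}

above_human = sorted(
    name for name, avg in avg_100_steps.items() if avg > HUMAN_LEVEL
)
print(above_human)  # ['Gemini 3.1 Pro', 'Opus 4.5', 'Opus 4.6']
```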

🌟 WindowsAgentArena Results

Success rate (%) on 154 Windows tasks with VLAA-GUI (Gemini 3 Flash). Each ablation removes one component from the full system.

Method	Office	Web	Sys.	Code	Media	Util.	Overall
Max 50 Steps
UI-TARS-2	–	–	–	–	–	–	50.6
Agent S3 w/ GPT-5	–	–	–	–	–	–	54.1
VLAA-GUI	32.6	73.3	87.5	66.7	52.4	75.0	60.4
− Completeness Verifier	11.6	73.3	70.8	54.2	61.9	83.3	51.3
− Loop Breaker	27.9	66.7	83.3	58.3	38.1	75.0	52.6
− Search Agent	18.6	63.3	75.0	66.7	38.1	66.7	49.4
Max 100 Steps
GTA1-32B w/ o3	–	–	–	–	–	–	51.2
Agent S3 w/ GPT-5	–	–	–	–	–	–	56.6
VLAA-GUI	35.0	73.3	87.5	66.7	52.4	83.3	61.0
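
To read the ablation rows at a glance, the drop relative to the full system can be computed from the 50-step numbers in the table. The snippet below does this for the Overall and Office columns; the values are copied from the table and the variable names are illustrative.

```python
# 50-step rows: (Office, Overall) success rates in percent.
full = {"office": 32.6, "overall": 60.4}
ablated = {
    "Completeness Verifier": {"office": 11.6, "overall": 51.3},
    "Loop Breaker": {"office": 27.9, "overall": 52.6},
    "Search Agent": {"office": 18.6, "overall": 49.4},
}

# Drop in percentage points when each component is removed.
drops = {
    name: {col: round(full[col] - scores[col], 1) for col in full}
    for name, scores in ablated.items()
}

for name, d in drops.items():
    print(f"- {name}: overall -{d['overall']}, office -{d['office']}")
```

On these numbers, removing the Search Agent costs the most overall (11.0 points), while removing the Completeness Verifier costs the most on Office tasks (21.0 points).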

🎥 Video Demonstrations

Real trajectories from VLAA-GUI with Claude Opus 4.6 on OSWorld-Verified. Each video shows the agent completing a real task end-to-end on an Ubuntu desktop.

Chrome · Profile Rename

“Lately I have changed my English name to Thomas. I want to update my username. Could you help me change the username in Chrome profiles to Thomas?”

GIMP · Layer Fill

“Could you fill the background layer with green color, leaving the object layer as is?”

LibreOffice Calc · Period Rate + Highlight

“Please calculate the period rate for my data in a new column with header ‘Period Rate (%)’, convert the results as number type, and highlight the highest result with green (#00ff00) font.”

LibreOffice Impress · Slide Background

“Please make the background blue on all my slides. I was stuck by finding the entrance to do that for a while...”

LibreOffice Writer · Subscript Formatting

“Help me change the 2 in ‘H2O’ to a subscript.”

VS Code · Keyboard Shortcut

“Please help me create a shortcut ‘ctrl+j’ to move cursor focus from terminal to editor in VS Code.”

Thunderbird · Folder + Filter Rule

“Create a local folder called ‘Promotions’ and create a filter to auto-move the inbox emails whose subject contains ‘discount’ to the new folder.”

VLC · Global Hotkey Setup

“Could you help me change the setting to allow pausing the video using a keyboard shortcut without minimizing the PDF reader? I want to focus on the lecture note and not be disturbed by app switching.”

Ubuntu OS · File Recovery

“I am currently using an Ubuntu system, and I have wrongly deleted a poster of party night. Could you help me recover it from the Trash?”

Multi-App · Calc to Chrome Search

“Could you help me copy the data in Cell B6 in this LibreOffice Calc file and search it in the Chrome browser?”

Case Study: When the Components Work Together

Task: “I am preparing a PPT in LibreOffice Impress. Help me change the color of the slide number to red.”

Case study: VLAA-GUI recovers from an early done() via the Completeness Verifier and Search Agent on a LibreOffice Impress task.

Phase 1 — Premature completion. The agent enters the Master Slide, changes the font color, and calls agent.done(). But in Normal View the slide number color has not changed. FAILED.
Phase 2 — Verifier & Search. The Completeness Verifier rejects the done(): Slide Number Not Red and File Not Saved. The agent then invokes the Search Agent with the query “How to change color of slide number in LibreOffice Impress?”, which surfaces a hidden GUI structure—a second master slide named OBJECT—revealing the underlying template mismatch.
Phase 3 — Recovery. Armed with the search insight, the agent enters the 2nd Master Slide, applies the font color change, and saves the file. Verifier: PASS.

Without the Completeness Verifier, the premature done() in Phase 1 would have ended the episode with the agent claiming success while the task evaluator scored it 0. Without the Search Agent, the agent would never have learned about the second master slide. Together, STOP, RECOVER, and SEARCH turn a silent failure into a verified completion.

Conclusion and Future Work

We introduced VLAA-GUI, a modular GUI agent framework organized around three decisions every long-horizon agent must make well: when to STOP, when to RECOVER, and when to SEARCH. A mandatory Completeness Verifier closes the false-completion gap that dominates GUI failures, a tiered Loop Breaker escalates across modalities and strategies to escape repetitive behavior, and a text-only Search Agent supplies external workflow knowledge without paying the cost of browser interaction. Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified and 61.0% on WindowsAgentArena, with three backbones surpassing human-level performance in a single pass—while the 15-step configuration already beats the best published 50-step system.

Several directions remain open:

More Advanced Planning Strategies. The current Manager operates in a flat, iterative planning mode—one screenshot in, one action out—with no long-horizon task decomposition or lookahead. Richer planning strategies could help on the tasks where VLAA-GUI still stalls.
Comprehensive Memory System. VLAA-GUI currently carries only the in-trajectory context of a single task—there is no persistent memory across tasks, across applications, or even across sessions on the same application. A well-structured memory system would let the agent leverage the experience of previous interactions, providing a richer context for decision-making.
Visual Grounding with Tools. The current Grounding Agent consumes the full 1920×1080 screenshot and returns coordinates in a single shot—adequate for large, distinct UI elements but brittle on small icons, dense toolbars, or visually similar controls. A more capable grounder would expose a small visual tool library that it can call iteratively: crop to a predicted region, zoom in to resolve sub-pixel ambiguity, annotate candidate elements with numeric labels for disambiguation, and re-ground on the focused region. Integrating these tools also lets the same crop/zoom stream feed the Completeness Verifier, tightening evidence-based success checks for actions whose outcome is localized.
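
One piece of the crop-and-zoom idea is easy to make concrete: a point predicted on a zoomed crop has to be mapped back to full-screenshot pixels before it can be clicked. The sketch below shows that mapping under simple assumptions (axis-aligned crop, uniform zoom, 1920×1080 screen). `Crop` and `to_screen` are hypothetical names and not part of the current Grounding Agent.

```python
from dataclasses import dataclass
from typing import Tuple

SCREEN_W, SCREEN_H = 1920, 1080


@dataclass(frozen=True)
class Crop:
    left: int    # x of the crop's top-left corner on the full screenshot
    top: int     # y of the crop's top-left corner on the full screenshot
    zoom: float  # uniform upscaling factor applied before re-grounding


def to_screen(crop: Crop, x_zoomed: float, y_zoomed: float) -> Tuple[int, int]:
    """Map a point on the zoomed crop back to full-screenshot coordinates."""
    x = crop.left + x_zoomed / crop.zoom
    y = crop.top + y_zoomed / crop.zoom
    # Clamp so a slightly out-of-range prediction still yields a valid click.
    x = min(max(round(x), 0), SCREEN_W - 1)
    y = min(max(round(y), 0), SCREEN_H - 1)
    return x, y


# A toolbar region starting at (1600, 40), zoomed 4x; the grounder picks (640, 360).
print(to_screen(Crop(left=1600, top=40, zoom=4.0), 640, 360))  # (1760, 130)
```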

We hope VLAA-GUI's modular view—that reliability comes from mandatory, post-action checks rather than better single-shot reasoning—offers a useful foundation for the next generation of GUI agents.

Acknowledgements

We are grateful to the excellent GUI agent teams who generously shared their code and insights, including CoAct, OS-Symphony, and Agent S3. We sincerely thank the OSWorld team for their support in verifying our trajectories and providing feedback on our evaluation.

BibTeX

@article{han2026vlaagui,
  title={Knowing When to \texttt{STOP}, \texttt{RECOVER}, and \texttt{SEARCH}: A Modular Framework for GUI Automation},
  author={Qijun Han and Haoqin Tu and Zijun Wang and Haoyue Dai and Yiyang Zhou and Nancy Lau and Alvaro A. Cardenas and Yuhui Xu and Ran Xu and Caiming Xiong and Zeyu Zheng and Huaxiu Yao and Yuyin Zhou and Cihang Xie},
  journal={arXiv preprint arXiv:2604.21375},
  year={2026},
  url={https://arxiv.org/abs/2604.21375}
}
