Studio Extension

Witness

Observe running applications via DOM, accessibility, or vision

Your agents can see your running application — not through expensive screenshots, but through structured DOM extraction that uses a fraction of the tokens. Witness gives agents eyes on your app so they can verify, debug, and record proof of what's actually happening.

Witness features in chat
Launch and observe PowerPoint
Website screenshot capture
Open website in tab

Witness tools available directly in the agent chat — observe, capture, and verify without leaving the conversation.

Capabilities

What it does

Tier 1: DOM Extraction

For web applications, extracts the DOM tree — elements, attributes, text content, computed styles. The most token-efficient way to observe an app. Agents get structured data, not pixels.
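
To make "structured data, not pixels" concrete, here is a minimal sketch of turning markup into a compact element outline. It is an illustration only, not Witness's extractor: it parses a static HTML string with Python's standard library, the names `DomOutline` and `extract_outline` are hypothetical, and computed styles are left out because they require a live browser.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomOutline(HTMLParser):
    """Collects one dict per element: tag, attributes, direct text, depth."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # flat list of elements in document order
        self._stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "",
                "depth": len(self._stack)}
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            node = self._stack[-1]
            node["text"] = (node["text"] + " " + text).strip()


def extract_outline(html):
    parser = DomOutline()
    parser.feed(html)
    parser.close()
    return parser.nodes


page = ('<form id="login"><label>Email</label>'
        '<input type="email" name="email">'
        '<button disabled>Sign in</button></form>')
outline = extract_outline(page)
for node in outline:
    print("  " * node["depth"] + node["tag"], node["attrs"], repr(node["text"]))
```

An outline like this is a few short lines of text per element, which is the kind of input an agent can read directly instead of interpreting an image.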

Tier 2: Accessibility Tree

For native applications, reads the OS accessibility tree: the AX API on macOS, UI Automation on Windows, and AT-SPI on Linux. Works with any application that exposes accessibility nodes.

Tier 3: Screenshot + Vision

When structure isn't available, captures a screenshot and sends it to a vision-capable LLM. The most expensive tier — used as fallback, not default. Quality is configurable.

Record Proof

Capture application state as evidence — for QA, compliance, or debugging. Each observation step is recorded with timestamp, tier used, and data captured.
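
As a rough sketch of what one recorded step could look like, the example below stores the three fields named above (timestamp, tier used, data captured) and serializes them as JSON Lines. The record shape and the names `ObservationStep` and `ProofLog` are assumptions for illustration; the format Witness actually writes is not described here.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class ObservationStep:
    step: int
    tier: int   # 1 = DOM, 2 = accessibility tree, 3 = screenshot + vision
    data: dict  # whatever the tier captured
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class ProofLog:
    """Append-only list of observation steps (hypothetical helper)."""

    def __init__(self):
        self.steps = []

    def record(self, tier, data):
        entry = ObservationStep(step=len(self.steps) + 1, tier=tier, data=data)
        self.steps.append(entry)
        return entry

    def to_jsonl(self):
        # One JSON object per line, so the log can be appended to and diffed.
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


log = ProofLog()
log.record(tier=1, data={"button#save": {"text": "Save", "enabled": True}})
log.record(tier=3, data={"summary": "Save dialog is visible"})
print(log.to_jsonl())
```

Keeping the tier on every record makes it possible to audit later how often a session fell back to the expensive path.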

Configurable Quality

Choose the tier, quality level, and session step limits to balance token cost against observation depth. Default to Tier 1 for efficiency and escalate only when needed.
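
The sketch below shows one way the settings described above might fit together: start at the cheapest tier, escalate only when a tier returns nothing, and stop once the session step limit is reached. Every name in it (`ObserveConfig`, `Session`, `start_tier`, `max_tier`, `quality`, `max_steps`) is hypothetical and is not taken from Witness's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObserveConfig:
    start_tier: int = 1      # default to the cheapest tier
    max_tier: int = 3        # highest tier the agent may escalate to
    quality: str = "medium"  # hypothetical quality label passed to each tier
    max_steps: int = 20      # session step limit


class StepLimitReached(RuntimeError):
    pass


class Session:
    def __init__(self, config, observers):
        # observers maps tier number -> callable(quality) returning data or None
        self.config = config
        self.observers = observers
        self.steps_used = 0

    def observe(self):
        if self.steps_used >= self.config.max_steps:
            raise StepLimitReached(f"limit of {self.config.max_steps} steps reached")
        self.steps_used += 1
        # Try tiers from cheapest to most expensive within the allowed range.
        for tier in range(self.config.start_tier, self.config.max_tier + 1):
            observer = self.observers.get(tier)
            data = observer(self.config.quality) if observer else None
            if data is not None:
                return tier, data
        raise LookupError("no tier within the configured range produced data")


# Stand-in observers: a native app has no DOM, so Tier 1 returns None.
observers = {
    1: lambda quality: None,
    2: lambda quality: {"role": "window", "title": "Untitled"},
    3: lambda quality: {"description": f"{quality}-quality screenshot summary"},
}
session = Session(ObserveConfig(max_steps=2), observers)
print(session.observe())
```

Capping `max_tier` at 2 in a sketch like this would keep a session from ever reaching the screenshot tier, at the price of failing when no structure is available.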

How it works

From install to first use.

1. Point at your app

Tell the agent which application to observe. For web apps, provide the URL or window. For native apps, the plugin discovers running applications via the OS.

2. Agent observes

The agent calls the observe tool. Tier 1 extracts the DOM tree. If unavailable, Tier 2 reads the accessibility tree. Tier 3 (screenshot) is fallback only.

3. Structured response

The agent receives structured data (element names, text content, interactive controls, state), not a raw image. This keeps token usage low and responses precise.

4. Interact and verify

Agents can click, type, and navigate. After each action, they observe again to verify the result. Steps are recorded for proof and debugging.

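
The act, observe, verify cycle in the last step can be sketched as a small loop. This is a toy model under stated assumptions: `FakeCounterApp` stands in for a real running application, and `act_and_verify` is a hypothetical helper, not one of the Witness tools.

```python
class FakeCounterApp:
    """Stand-in for a running app; a real session would drive a browser or window."""

    def __init__(self):
        self.count = 0

    def click(self, target):
        # Clicks on unknown targets are ignored, like a click that misses.
        if target == "increment":
            self.count += 1

    def observe(self):
        # Structured state, in the spirit of a Tier 1 observation.
        return {"label#count": {"text": str(self.count)},
                "button#increment": {"enabled": True}}


def act_and_verify(app, action, expected_text, history):
    """Run one action, observe again, and record whether the result matched."""
    action(app)
    state = app.observe()
    actual = state["label#count"]["text"]
    passed = actual == expected_text
    history.append({"expected": expected_text, "actual": actual, "passed": passed})
    return passed


app = FakeCounterApp()
history = []
act_and_verify(app, lambda a: a.click("increment"), "1", history)
act_and_verify(app, lambda a: a.click("increment"), "2", history)
act_and_verify(app, lambda a: a.click("missing-button"), "3", history)
for entry in history:
    print(entry)
```

The third action targets a control that does not exist, so the follow-up observation still reads "2" and the step is recorded as failed. Checking the observed state rather than assuming the action worked is what makes the history usable for debugging.
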
Why local matters

See your app without expensive vision calls.

Tier 1 DOM extraction costs a fraction of what screenshot-based vision costs. Your agents see more, understand better, and spend less — all running locally.

~50× fewer tokens than screenshot + vision LLM
Local: Tier 1 & 2 run entirely on your machine
3 observation tiers: choose the right cost/depth
0 app data sent to the cloud in Tier 1 & 2

Witness ships with the Studio. No extra install, no extra cost.