What is an evidence-carrying multimodal agent?

An agent that stores typed evidence for every action-critical condition before risky actions, instead of only claiming that it saw something on screen.

When is a screenshot not enough evidence?

Whenever the observation authorizes a privileged action: sending, paying, deleting, rescheduling, extracting private data, or granting access. Use DOM, accessibility, OCR, document, API, or user-approval evidence.

Do all browser agents need this?

No. It is often too heavy for read-only research. It is worth it for irreversible, external, private, or financial actions.

AI Agents Need Evidence Before They Click

Multimodal agents should not treat screenshots as permission. The safer rule: every risky action needs typed evidence from DOM, accessibility tree, OCR, or API.

The short version

A browser agent sees a button. It describes the button correctly. Then it clicks.

That sounds harmless until the button sends money, emails a customer, deletes an account, or extracts private data from a document.

The useful shift in “Hallucination as Exploit: Evidence-Carrying Multimodal Agents” is not just: multimodal models should hallucinate less.

The sharper point is this:

A false visual claim can become an authorization bug.

If an agent says “this is the right invoice” or “the recipient is selected,” that is not evidence yet. It is model text. For risky actions, model text should not count as permission.

risky clicks from model prose alone

evidence table before the tool call

good sources: DOM, AX, OCR, API

The mistake: screenshot feeling as a permission slip

Many computer-use demos work roughly like this:

Screenshot → model description → tool call

That is fine while the agent only reads or navigates. It gets thin when the description becomes a precondition for action.

Examples:

“The recipient is Max” → send email.
“The amount is €49” → confirm payment.
“This is the right customer account” → export data.
“The checkbox is enabled” → store consent.
“The document has no sensitive data” → upload file.

If those claims only come from free-form vision prose, the agent has no hard brake. It has a plausible story.

Claim vs. evidence

Weak

“I see the right button.”
“The amount looks correct.”
“The invoice belongs to customer X.”
“This is the login page.”

Better

Selector + accessible name + visible text match.
Parsed field amount === expected amount.
Document field / API ID / OCR box proves customer X.
Origin, title, form action, and known markers match.

The small safety harness

You do not need a research project for this. For many internal agents, a small gate before risky actions is enough.

Evidence gate before actions

prove first, act second

01

Action
02

Predicates
03

Evidence types
04

Check sources
05

Block or tool call
06

Audit log

The rule:

No risky action while an action-critical condition is only supported by model text.

Action-critical means anything that permits, limits, or changes the risk of the action:

Who is the recipient, account, customer, or patient?
Which amount, contract, plan, or time range?
Which file, table, message, or data source?
Is the action reversible?
Was consent actually set?

What can count as evidence

DOM: selector, value, attribute, form state, URL, origin.
Accessibility tree: role, name, checked/disabled/expanded state.
OCR: text plus bounding box when no DOM exists.
Parsed document/API: field value, ID, signature, server response.

Steal this: the predicate prompt

Put this before every risky tool call in your agent:

Before executing this action, create an evidence table.

Action: [Tool Call]
Risk: read-only | external | destructive | private | financial

For each action-critical condition:
- Predicate: what must be true?
- Required evidence type: DOM | AX | OCR | parsed document | API | user approval
- Observed evidence: concrete value
- Source pointer: selector, bounding box, field path, URL, response id, or approval text
- Status: proven | missing | conflicting

Rule:
If any condition is missing or only supported by free-form model prose, do not execute the action. Ask for confirmation or use a safer tool.

This is not slowness. It is a seatbelt.

For browser agents

Do

✓ split risky actions into predicates first
✓ log typed evidence next to the tool call
✓ prefer DOM/AX/API over screenshot prose
✓ block on missing evidence instead of guessing

Avoid

× treat screenshot descriptions as permission
× click buttons only by visual position
× trust OCR without bounding box or field context
× build audit logs only from model summaries

Where it is worth it

For an agent that finds recipes, this is too much.

For an agent that fills forms, checks invoices, moves customer data, changes bookings, or sends messages, this is exactly the right boring infrastructure.

The point is not to distrust the model. The point is to cut its role cleanly:

The model may interpret.
The harness must prove.
The action may happen only after that.

That is the practical version of agent safety: low drama, a clear brake, and better everyday reliability.