Attack Detection

How it works

For security-sensitive triggers, the platform adds an attack detection step before execution:

1. Trigger arrives: The agent receives an event from an untrusted channel (e.g., an incoming email or SMS).

2. Attack detection: The platform analyzes the trigger content for manipulation attempts.

3. Decision: If an attack is detected, the trigger is blocked and admins are alerted. Otherwise, the agent proceeds to act on the event.

Attack detection runs in a separate context from the main agent. This separation is critical: the detector evaluates the raw trigger content before the agent processes it, so manipulation attempts are caught before they can influence the agent’s behavior.
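The flow above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `Trigger`, `handle_trigger`, and the three callables are hypothetical names, and the detector is passed in as a plain function so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Trigger:
    """An event from an untrusted channel (hypothetical shape)."""
    channel: str   # e.g. "email" or "sms"
    content: str   # raw, unprocessed payload


def handle_trigger(
    trigger: Trigger,
    detect_attack: Callable[[str], bool],
    run_agent: Callable[[Trigger], str],
    alert_admins: Callable[[Trigger], None],
) -> Optional[str]:
    """Screen the raw content first; only clean triggers reach the agent."""
    # The detector receives only the raw content, never the agent's context.
    if detect_attack(trigger.content):
        alert_admins(trigger)
        return None  # blocked: the agent never processes this trigger
    return run_agent(trigger)
```

The point of the shape is ordering: `run_agent` is only reachable after `detect_attack` has returned `False`, so a blocked payload never enters the agent's context.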

What it checks for

Attack detection evaluates trigger content for the following:

Check	What it looks for
Prompt injection	Attempts to override the agent’s instructions through crafted input
Data exfiltration	Instructions designed to trick the agent into leaking data
Jailbreak	Attempts to bypass the agent’s safety boundaries
Reconnaissance	Probing to discover what the agent has access to
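One way to picture the outcome of these checks is as a result object that lists the categories found. The sketch below is an assumption for illustration only: `AttackCategory` and `DetectionResult` are hypothetical names, and the platform's actual result format is not documented here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AttackCategory(Enum):
    """The four checks from the table above, with their descriptions."""
    PROMPT_INJECTION = "Attempts to override the agent's instructions through crafted input"
    DATA_EXFILTRATION = "Instructions designed to trick the agent into leaking data"
    JAILBREAK = "Attempts to bypass the agent's safety boundaries"
    RECONNAISSANCE = "Probing to discover what the agent has access to"


@dataclass
class DetectionResult:
    """Hypothetical container for the categories a detector flagged."""
    findings: List[AttackCategory] = field(default_factory=list)

    @property
    def is_attack(self) -> bool:
        # Any single finding is enough to treat the trigger as an attack.
        return bool(self.findings)

    def summary(self) -> str:
        if not self.findings:
            return "no attack detected"
        names = ", ".join(c.name.lower() for c in self.findings)
        return f"blocked: {names}"
```

A single trigger can match more than one category, which is why `findings` is a list rather than a single value.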

When does it run?

Attack detection runs automatically for:

Incoming emails: Anyone can send an email to your agent.

Incoming SMS: Anyone can text your agent’s number.

Script escalations: When a script escalates to the full agent pipeline (unless the script explicitly opts out).

Other trigger types (scheduled tasks, webhooks from authenticated sources, etc.) do not require attack detection because the content source is already controlled.
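The rules above amount to a small decision function. The sketch below assumes hypothetical trigger type strings (`"email"`, `"sms"`, `"script_escalation"`) and a hypothetical `script_opted_out` flag; the platform's real identifiers may differ.

```python
# Trigger types that are always screened, per the list above (names assumed).
ALWAYS_SCREENED = {"email", "sms"}


def requires_attack_detection(trigger_type: str, script_opted_out: bool = False) -> bool:
    """Return True when a trigger should pass through attack detection."""
    if trigger_type in ALWAYS_SCREENED:
        return True
    if trigger_type == "script_escalation":
        # Escalations are screened unless the script explicitly opts out.
        return not script_opted_out
    # Scheduled tasks, authenticated webhooks, etc.: the source is already controlled.
    return False
```

For example, `requires_attack_detection("script_escalation", script_opted_out=True)` returns `False`, while the same call without the opt-out returns `True`.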

Configuring alerts

Workspace administrators can configure attack detection alert behavior in Workspace → Settings.

Setting	What it does
Alert recipients	Choose whether alerts go to all workspace admins, admins plus additional emails, or a custom email list
Include blocked content	Optionally include the blocked trigger payload in alert emails for debugging
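To make the two settings concrete, here is a minimal sketch of how they might combine when an alert is sent. Everything in it is an assumption for illustration: `AlertSettings`, the mode strings, `resolve_recipients`, and `build_alert_body` are hypothetical and do not describe the platform's actual configuration schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertSettings:
    # Hypothetical mode names: "all_admins", "admins_plus", or "custom".
    recipient_mode: str = "all_admins"
    extra_emails: List[str] = field(default_factory=list)
    include_blocked_content: bool = False


def resolve_recipients(settings: AlertSettings, admin_emails: List[str]) -> List[str]:
    """Work out who receives an alert under each recipient mode."""
    if settings.recipient_mode == "all_admins":
        return list(admin_emails)
    if settings.recipient_mode == "admins_plus":
        # Merge both lists, dropping duplicates while preserving order.
        return list(dict.fromkeys([*admin_emails, *settings.extra_emails]))
    if settings.recipient_mode == "custom":
        return list(settings.extra_emails)
    raise ValueError(f"unknown recipient mode: {settings.recipient_mode!r}")


def build_alert_body(settings: AlertSettings, reason: str, payload: str) -> str:
    """Attach the blocked payload only when the setting allows it."""
    body = f"Trigger blocked: {reason}"
    if settings.include_blocked_content:
        body += f"\n\nBlocked content:\n{payload}"
    return body
```

Keeping the payload out of the alert by default reflects that the setting is optional: the blocked content is, by definition, suspected attack text, so it is included only when an admin asks for it for debugging.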

FAQ

Does attack detection check every trigger?

No. It screens only untrusted trigger content: primarily incoming emails and SMS, plus script escalations that have not opted out. Triggers from authenticated or controlled sources (like scheduled tasks or authenticated webhooks) skip this step.

Can attack detection be bypassed by clever prompting?

It is designed to resist this. Attack detection runs before the agent processes the content, and in a separate context, so the detection model evaluates the raw input independently of the agent. This makes it resistant to the manipulation techniques it is designed to catch.

What if attack detection blocks a legitimate message?

Check the activity log to understand why it was blocked. If the message was legitimate, review the content for patterns that might look suspicious (e.g., instructions that resemble prompt injection). You can also adjust the agent’s instructions to make its expected interactions clearer.

Learn more

Guardrails: Code-enforced constraints like whitelists and approval requirements

User Approval: Human-in-the-loop for sensitive actions

Activity Monitoring: See attack detection results in the activity log
