You’ve built an agent that works. But what happens when you tweak the instructions? Change the model? Add a new capability? Agent Evals let you define “working correctly” once, then verify it automatically anytime. Instead of manually testing after every change, run your evals and know in seconds if something broke—or improved.
Evals tab showing a complete eval suite for a target screening agent

Why use evals

  • Catch regressions. You improve one behavior and accidentally break another. Evals catch this before your users do.
  • Compare models. Run the same tests across Claude, GPT, and Gemini to find the best fit for your use case—or discover a cheaper model works just as well.
  • Ensure consistency. Unlike manual testing, evals check the same criteria every time. No more “it worked when I tried it.”
  • Iterate with confidence. Refine instructions, adjust capabilities, experiment freely—evals tell you if you’re making progress.

Quick start

The fastest way to understand evals is to create one.
  1. Navigate to your agent’s Evals tab and click Add Eval
  2. Give it a name: “Knows its own name”
  3. Enter a trigger prompt: “What is your name?”
  4. Enter a grader prompt: “Pass if the agent identifies itself as Scout (or whatever your agent is called). Fail if it gives a different name or says it doesn’t have one.”
  5. Click Save, then click Run
Check the Eval Results tab. You’ll see whether it passed or failed, and the grader’s reasoning for its decision. That’s the core of evals: define what you expect, run the test, verify the result. Everything else—validation rules, tool overrides, model comparison—builds on this foundation.
Replace “Scout” with your agent’s actual name. This is what makes graders deterministic—you include the expected answer.

How evals work

Each eval follows a simple flow:
  1. Trigger — You define a prompt that simulates a user message or scenario
  2. Execute — The agent processes the prompt using its regular configuration
  3. Validate — Optional validation rules check tool usage (was tool X called?)
  4. Grade — An optional grader LLM evaluates the response against your criteria
Validation rules give you fast, deterministic checks. Grading gives you nuanced evaluation. You can use either or both.
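To make that flow concrete, here is a minimal Python sketch of a single eval run. It is purely illustrative; the names (EvalCase, run_agent, grade_with_llm) are hypothetical stand-ins, not the platform’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative sketch of one eval run; these names are hypothetical,
# not the platform's API.

@dataclass
class EvalCase:
    trigger_prompt: str                                     # 1. Trigger: simulated user message
    validation_rules: list = field(default_factory=list)    # callables: transcript -> bool
    grader_prompt: Optional[str] = None                     # optional LLM grading criteria

def run_eval(run_agent: Callable, grade_with_llm: Callable, case: EvalCase) -> dict:
    # 2. Execute: the agent processes the prompt with its regular configuration.
    transcript = run_agent(case.trigger_prompt)

    # 3. Validate: fast, deterministic checks on tool usage run first.
    for rule in case.validation_rules:
        if not rule(transcript):
            # A failed rule fails the eval immediately and skips the grader.
            return {"result": "fail", "reason": "validation rule failed"}

    # 4. Grade: an optional grader LLM judges the transcript against your criteria.
    if case.grader_prompt:
        return grade_with_llm(case.grader_prompt, transcript)

    # No grader and no rules: a smoke test that passes if the agent responded without errors.
    return {"result": "pass", "reason": "agent responded without errors"}
```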

Creating an eval

Navigate to the Evals tab on your agent’s Instructions page and click Add Eval. The eval editor has three tabs: Trigger, Tools, and Grader.
Evals tab showing list of configured evaluations

Trigger tab

Write the prompt your agent will receive when the eval runs. This simulates a user message or scenario.
Edit eval modal showing the Trigger tab
For a target screening agent called “Deal Scout”, you might write:
  • “Evaluate this company: Acme Industrial, $45M revenue, manufacturing sector, Germany”
  • “Research Nordic Components AB and assess fit against our investment criteria”
  • “Is TechStartup Inc ($2M revenue, SaaS, San Francisco) worth pursuing?”
You can also simulate scenarios by including context directly: “Based on this company profile, evaluate fit against our criteria: Acme Corp, $30M revenue, logistics sector, Netherlands…”

Tools tab (optional)

Here you can exclude or fake specific tools during the eval.
Edit eval modal showing the Tools tab with included, excluded, and faked options
Mode | Description
Included | Tool runs normally (default)
Excluded | Tool is unavailable to the agent during this eval
Faked | Tool returns a canned response instead of actually executing
When to fake tools:
  • Prevent side effects — Fake send_email so the agent thinks it sent an email, but nothing actually goes out
  • Control inputs for deterministic testing — Fake perplexity_research to return specific company data, so you know exactly what answer to expect
  • Speed up tests — Fake slow external API calls
When to exclude tools:
  • Test how the agent handles missing capabilities
  • Ensure the agent doesn’t use certain tools for specific scenarios
Faked tools return a default message: “This tool was faked for testing. Assume it succeeded and continue normally.” You can customize this per-tool.
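In effect, faking a tool swaps its real implementation for a stub that returns a canned string and never executes anything. A minimal sketch of that idea, using the default message above and a hypothetical helper make_fake_tool (not the platform’s internals):

```python
# Illustrative sketch of tool faking; make_fake_tool is a hypothetical helper,
# not the platform's internals.

DEFAULT_FAKE_RESPONSE = (
    "This tool was faked for testing. Assume it succeeded and continue normally."
)

def make_fake_tool(canned_response: str = DEFAULT_FAKE_RESPONSE):
    # Return a stub that ignores its arguments and returns a canned response
    # instead of executing the real tool.
    def stub(**kwargs) -> str:
        return canned_response
    return stub

# A faked send_email never sends anything; a faked perplexity_research returns
# controlled data so the grader knows exactly what answer to expect.
send_email = make_fake_tool()
perplexity_research = make_fake_tool(
    "Nordic Components AB: €52M revenue, family-owned since 1985, automotive parts "
    "manufacturing, headquarters in Sweden with plants in Germany and Poland"
)
```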

Grader tab

This is where you define how the eval result is determined.
Edit eval modal showing the Grader tab with grading options and validation rules
Grader prompt (optional) — If provided, an LLM examines the transcript and uses your prompt to determine the result. For a screening agent:
  • “Pass if the agent correctly identified this target doesn’t meet our minimum revenue criteria ($20M) and rejected it.”
  • “Rate the research completeness from 1-10. Award 2 points each for: revenue verified, ownership structure found, industry position assessed, geographic presence confirmed, M&A history checked.”
Grading type:
Type | When to use
Pass/Fail | Clear-cut criteria: “Did it reject non-fits?”
Rating (1-10) | Quality assessment: “How thorough was the research?”
Grader scope — What should the grader see?
Scope | When to use
Full transcript | You care about the process—what tools were called, what steps the agent took
Final response only | You only care about the end result, not how the agent got there
Validation rules (optional) — Code-based assertions about tool usage. These run before the grader and fail fast if not met.
Rule | Example use case
Tool called | When doing research, it should use the perplexity_search tool
Tool not called | When asked to draft an email, it should NOT use send_email—just draft it
If a validation rule fails, the eval fails immediately and the grader is skipped (saving credits). What if you have no grader prompt and no validation rules? Then it’s a smoke test—the eval passes if the agent responds without errors. This is more of a technical platform check and isn’t typical for most use cases.
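Under the hood, a “Tool called” or “Tool not called” rule amounts to a cheap assertion over the transcript, which is why it can run before (and instead of) the grader. A rough sketch, assuming a simple list-of-steps transcript shape that is not necessarily the platform’s actual format:

```python
# Rough sketch of the two rule types; the transcript shape (a list of dicts
# with a "tool" key) is an assumption, not the platform's actual format.

def tool_called(transcript: list, tool_name: str) -> bool:
    # Passes if the agent called the named tool at least once.
    return any(step.get("tool") == tool_name for step in transcript)

def tool_not_called(transcript: list, tool_name: str) -> bool:
    # Passes if the agent never called the named tool.
    return not tool_called(transcript, tool_name)

# Example rules matching the table above:
#   research evals:      tool_called(transcript, "perplexity_search")
#   "draft, don't send": tool_not_called(transcript, "send_email")
```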

Running evals

Once your eval is defined, click the play button on that row to run it, or click Run All to run all evals at once. Results appear in a modal when complete. You can also view them later in the Eval Results tab.
Evals consume credits for both execution and grading. Running against multiple models multiplies usage proportionally.

Model settings

Expand Model Settings in the Evals tab to configure:
Model Settings showing eval models and grader model configuration
Setting | Description
Eval models | Which models to test with. Select multiple to compare results side-by-side.
Grader model | Which model evaluates the results.
Thinking | Enable extended thinking for eval execution and/or grading.
Which model for grading? Consider how hard the grading task is. Look at your grader prompts—if they require nuanced judgment, use a more capable model. For simple checks (“does the response contain X?”), a faster model works fine.

Viewing results

The Eval Results tab shows your eval history organized by run.
Eval Results tab showing history of eval runs
Each run displays pass/fail counts (or average rating), credits consumed, and duration. Switch between views:
  • Table view — Results listed by eval and model
  • Grid view — A matrix of eval × model, where each cell shows pass/fail/rating. Useful for comparing how different models perform on the same tests.
Eval run summary in grid view comparing results across models

Result details

Click the Pass/Fail/Rating badge to see more details about the eval run, including:
  • The agent’s complete response
  • Grader reasoning (why it passed or failed)
  • Validation rule results
  • Full conversation transcript including tool calls
Example of a successful eval run:
Eval result showing grader reasoning and transcript
Example of a failed eval run:
Eval result showing validation rule failure
You can examine the full conversation transcript to see the tool calls and the agent’s response.

Writing effective evals

Now that you understand the mechanics, here’s how to write evals that actually catch problems.

Write deterministic grader prompts

Include the expected answer so the grader can simply check—don’t make it figure out what “correct” means. Think of it like a teacher intern grading a test. The intern shouldn’t solve the problems themselves—they should have an answer key. You’re testing the agent, not the grader.
Good (deterministic):
  • “The target should be rejected. Revenue 2 million USD is below our 20 million USD minimum threshold.”
  • “Research completeness should include: revenue source identified, ownership structure mapped, industry position assessed.”
Bad (requires grader to evaluate):
  • “Check if the investment analysis is accurate.”
  • “Verify the research is thorough.”

Control inputs for deterministic testing

To write deterministic graders, you need to control what the agent sees. There are two approaches (a sketch of the first follows this list).
1. Fake tools with predetermined responses. For a screening agent, fake perplexity_research to return specific company data:
  • Tool override: perplexity_research → Faked with: “Nordic Components AB: €52M revenue, family-owned since 1985, automotive parts manufacturing, headquarters in Sweden with plants in Germany and Poland”
  • Grader: “Check that the agent identified: revenue €52M (above threshold ✓), family-owned (succession opportunity ✓), manufacturing sector (matches criteria ✓), European presence (✓). Should recommend adding to shortlist.”
2. Simulate scenarios in the trigger:
  • Trigger: “Evaluate this target based on the following data—do not perform additional research: Company: Acme Industrial, Revenue: $45M, Sector: Manufacturing, Ownership: Family (founder retiring), Location: Munich, Germany”
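Putting the first approach together: the faked research data and the grader prompt form a matched pair, where the override fixes the input and the grader checks for the answer that input implies. Here is that pairing sketched as plain data; the field names are hypothetical, not the platform’s export or API format.

```python
# Illustrative only: the faked-tool response and the deterministic grader prompt
# paired as data. Field names here are hypothetical, not an export or API format.
deterministic_eval = {
    "name": "Shortlist a qualified family-owned target",
    "trigger": "Research Nordic Components AB and assess fit against our investment criteria",
    "tool_overrides": {
        "perplexity_research": {
            "mode": "faked",
            "response": (
                "Nordic Components AB: €52M revenue, family-owned since 1985, "
                "automotive parts manufacturing, headquarters in Sweden with "
                "plants in Germany and Poland"
            ),
        }
    },
    "grader": (
        "Check that the agent identified: revenue €52M (above threshold), "
        "family-owned (succession opportunity), manufacturing sector (matches criteria), "
        "European presence. Should recommend adding to shortlist."
    ),
}
```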

Example: Target Screening Agent eval suite

Here’s how a complete eval suite might look for an investment screening agent:

“Reject obvious non-fits” (Pass/Fail + Validation)

Scenario: Agent should reject targets that don’t meet basic criteria.
  • Trigger: “Research and evaluate this target: TechStartup Inc, $2M revenue, SaaS, VC-backed, San Francisco”
  • Validation: perplexity_research must be called
  • Grader: “Pass if the agent correctly identified this doesn’t meet criteria (revenue below $20M minimum, wrong sector, wrong geography) and rejected it without extensive research.”
What it catches: Agent being too optimistic, flagging everything “for review.” Agent not doing actual research.

“Score against investment criteria” (Rating 1-10)

Scenario: Assess the quality of the agent’s scoring logic.
  • Trigger: “Evaluate: Acme Industrial, $45M revenue, manufacturing sector, family-owned (founder age 68), based in Germany”
  • Tool overrides: perplexity_research → Faked with predetermined company profile
  • Grader: “Rate scoring accuracy 1-10. A qualified target should score 7-8/10 based on: revenue above threshold (✓), manufacturing sector (✓), European location (✓), succession situation (✓). Deduct points if scoring seems arbitrary or doesn’t reference specific criteria.”
What it catches: Inconsistent or inflated scoring. One team found their agent was giving everything 8+/10—this eval revealed the problem.

“Research completeness” (Rating 1-10)

Scenario: Ensure the agent gathers sufficient information before scoring.
  • Trigger: “Research and evaluate: Nordic Components AB”
  • Tool overrides: perplexity_research → Faked with partial company data
  • Grader: “Rate research completeness 1-10. Award 2 points each for: revenue verified, ownership structure identified, industry position assessed, geographic footprint mapped, M&A history checked.”
What it catches: Agent rushing to conclusions. Also useful for model comparison—does a cheaper model still do thorough research?

“Handle missing data gracefully” (Pass/Fail)

Scenario: When information is unavailable, the agent should acknowledge uncertainty.
  • Trigger: “Evaluate: Private Holdings GmbH”
  • Tool overrides: perplexity_research → Faked with very limited results
  • Grader: “Pass if the agent explicitly notes which data points couldn’t be verified and either: (a) recommends direct outreach before scoring, or (b) provides a tentative score with clear caveats. Fail if it presents guesses as facts or gives a confident score despite missing information.”
What it catches: Agent hallucinating data or overconfident with incomplete information.

Using evals to improve your agent

Scenario: Agent was too optimistic
The “Reject obvious non-fits” eval kept failing—the agent was adding everything to the shortlist “for further review.”
Fix: Added instruction: “If a target clearly fails to meet minimum criteria (revenue below $20M, outside target sectors, wrong geography), reject immediately. Don’t shortlist for ‘potential’ or ‘further review.’”
Result: Eval now passes. Analysts spend less time reviewing obvious non-fits.
Scenario: Research quality varied by model
Ran the “Research completeness” eval across three models:
  • Claude Sonnet: 8/10
  • Claude Haiku: 5/10
  • Gemini Flash: 6/10
Decision: Kept Sonnet for screening. The cost savings from cheaper models weren’t worth incomplete research—analysts had to redo the work manually.
Scenario: Scoring was inconsistent
The “Score against criteria” eval showed wild variance—the same target got 6/10 one run, 9/10 the next.
Fix: Added explicit scoring rubric to instructions: “Score 2 points for each criterion met: revenue above $20M, target sector, European presence, succession situation, strategic fit. Maximum 10 points.”
Result: Scores now consistent within ±1 point across runs.

Import and export

Share evals between agents using JSON export/import. To export: Click the menu on any eval and select Export, or export all evals at once. To import: Click Import and paste JSON or upload a file. Evals with duplicate names are skipped.
Tool overrides reference tools by name, so the target agent needs the same tools for those settings to apply.

Letting the agent manage its own evals

Toggle the Evals capability on to let your agent find, create, edit, and run its own evals. This is useful for bootstrapping your eval suite. Ask the agent to create evals covering edge cases you might have missed:
Agent creating new evals to cover edge cases
Then ask it to run them across different models to compare performance:
Agent running evals and reporting results across models
The agent can identify which models struggle with specific tasks—in this case, Haiku failed the comparison task (didn’t look up both documents), while Gemini Flash failed the borderline case (rejected instead of shortlisting with caveats).
Eval-driven development: Define the evals first, then ask the agent to iterate on its own instructions until the evals pass. It’s like test-driven development for AI agents.

Eval design tips

  • Start simple, add complexity. Begin with basic evals to verify functionality. Add grading criteria once you understand what “good” looks like.
  • Control inputs for deterministic grading. Fake tools or simulate scenarios so you know exactly what answer to expect.
  • Use validation rules for tool behavior. They’re faster and cheaper than graders for checking which tools were called.
  • Fake tools with side effects. Always fake or exclude tools that send emails, make API calls, or modify external systems.
  • Run evals after changes. Change instructions → run evals. Add capability → run evals. Update model → run evals.
  • Use ratings for quality tracking. Pass/fail tells you if something works. Ratings (1-10) tell you if it’s getting better.

FAQ

Do I need a grader prompt for every eval?
No. Without a grader prompt, the eval operates in “smoke test” mode (passes if no errors) or “validation-only” mode (passes if validation rules pass). Add a grader when you need LLM-based evaluation.
Can running an eval have side effects?
Yes, an eval can have side effects depending on which tools the agent has access to. For example, if document editing is enabled, the agent may edit a document during an eval.
Can the grader check the arguments passed to a tool?
Yes, if you mention that in the grader prompt. The standard validation rules only check whether a tool was called, not its arguments.
Can evals run automatically?
Evals are normally triggered manually, but if you enable the Evals capability on your agent, you can ask it to run them on a schedule.
Is there a limit on how many evals I can create?
No. Create as many test cases as you need to thoroughly validate your agent’s behavior.
Which model should I use for grading?
It depends on the complexity of the grading criteria. If the criteria are simple, a cheaper model is fine for grading; if they’re complex, use a more capable model. Experiment!