Agent Testing

You cannot unit-test an agent as a black box. Traditional unit tests confirm that `input A` produces `output B`, but agents are stochastic. Testing an agent requires evaluating the **trajectory**—the sequence of tool calls and reasoning steps—not just the final answer.

The Agent Testing Pyramid

| Tier | Strategy | Frequency |

|---|---|---|

| **L1: Unit Tests** | Test individual tool functions and prompt-template parsers. | Every Commit |

| **L2: Deterministic Traces** | Replay a recorded sequence of LLM responses against the orchestration code. | Every Commit |

| **L3: Rollout Evals** | Run the full agent on a fixed set of 50-100 tasks with mocked tools. | Every Prompt Change |

| **L4: Shadow Production** | Run a candidate agent alongside production, comparing outputs but not serving them. | Weekly / Monthly |

Trajectory-Based Evaluation

A "pass" in agent testing is defined by three factors:

1. **Success:** Was the goal achieved?

2. **Efficiency:** Did it use the minimum number of tool calls?

3. **Safety:** Did it avoid forbidden tool sequences?

Reference Test Case (JSON Schema)

```json

{

"test_id": "cancel_sub_001",

"prompt": "Cancel user 42's subscription and refund the last 30 days.",

"expected_outcome": {

"status": "cancelled",

"refund_amount": 29.99

},

"mandatory_tools": ["lookup_user", "cancel_subscription", "issue_refund"],

"forbidden_tools": ["delete_user"],

"max_steps": 5

}

```

LLM-as-Judge Calibration

Using a stronger model (e.g., GPT-4o or Claude 3.5 Sonnet) to grade a smaller agent's performance is efficient but requires calibration.

**The Calibration Loop:**

1. Select 100 agent trajectories.

2. Have a human expert grade them on a 1-5 scale.

3. Have the Judge LLM grade the same 100 using the same rubric.

4. Compute the **Cohen's Kappa** coefficient. If it is below 0.7, your rubric is too vague or your Judge prompt needs more constraints.

Shadow Production

Shadowing is the most reliable way to catch "vibe-based" regressions. Run the new agent logic on a stream of production requests. Store the results in a database and run a diff.

```python

Conceptual Shadow Dispatcher

async def shadow_dispatch(request):

prod_task = run_agent(v1_prompt, request)

shadow_task = run_agent(v2_candidate_prompt, request)

prod_res, shadow_res = await asyncio.gather(prod_task, shadow_task)

if prod_res.trajectory != shadow_res.trajectory:

log_regression(request, prod_res, shadow_res)

return prod_res.final_answer

```

Avoiding the "Cost Spike" Regression

Always track **Tokens Per Success (TPS)**. A new prompt that improves success rate by 2% but increases average tool calls from 3 to 12 is a failed experiment in a production environment.

Further Reading

- [AgenticWorkflowDesign](AgenticWorkflowDesign) — Architecture patterns for reliable agents.

- [LlmEvaluationMetrics](LlmEvaluationMetrics) — Detailed breakdown of ROUGE, BLEU, and BERTScore.

- [AgentObservability](AgentObservability) — How to capture the traces needed for L2 testing.