Agent Prompt Engineering

Agent prompt engineering is the practice of writing prompts that turn LLMs into reliable autonomous workers. Unlike chat prompting, agent prompts must produce structured tool calls, recover from errors, and stay on task across many turns.

Most agent failures are prompt failures. This page covers what works.

What an agent prompt does

An agent prompt typically defines:

- The agent's role and goals

- Available tools and how to use them

- Output format

- Error handling expectations

- Stopping conditions

- Constraints (what not to do)

The system prompt is loaded once; the conversation evolves with tool calls and results.

System prompt structure

Effective agent system prompts have a consistent structure:

```

[Role]

You are an agent that ___.

[Tools]

Available tools: ___ (with descriptions)

[Process]

For each task:

1. ___

2. ___

3. ___

[Output format]

Tool calls in format: ___

Final answer in format: ___

[Constraints]

Never: ___

Always: ___

[Error handling]

If a tool fails: ___

If you're stuck: ___

```

Each section serves a specific purpose. Skipping any creates failure modes.

Tool descriptions

Tools are the agent's hands. Bad descriptions = bad usage.

Good tool description includes:

- **Purpose**: when to use this tool

- **Parameters**: what each does

- **Returns**: what to expect

- **Examples**: concrete input/output pairs

- **Failure modes**: what to do if it fails

Bad tool description:

```

search_wiki(query): Searches the wiki.

```

Better:

```

search_wiki(query: str)

Searches the wiki for pages matching the query.

Use for: finding existing pages on a topic.

Don't use for: getting full content (use get_page).

Returns: list of {title, snippet, url}.

Example: search_wiki("authentication") →

[{"title": "AuthOverview", "snippet": "...", "url": "..."}]

```

The verbose version saves tokens overall by reducing failed tool calls.

Reasoning prompts

Many agents benefit from "think before acting" patterns:

- **ReAct**: Thought, Action, Observation cycles

- **Chain-of-thought**: think step by step before answering

- **Plan-and-execute**: plan all steps, then execute

Modern LLMs do this naturally with a hint. Explicit reasoning helps for complex tasks.

Output formats

JSON

Most reliable for tool calls. Use schemas.

Risk: malformed JSON. Mitigate with constrained generation (Outlines, Instructor) or repair logic.

Function calling

Native function calling APIs (OpenAI, Anthropic) handle JSON formatting reliably.

Markdown / structured text

For human-readable outputs. Less reliable for parsing.

XML tags

Anthropic's models work well with XML-tagged outputs:

```

<thinking>...</thinking>

<answer>...</answer>

```

Pick a format and be consistent.

Error recovery

Agents must handle failures:

- Tool returns error

- Tool returns unexpected format

- Tool times out

- LLM produces invalid output

Patterns

**Retry with backoff**: transient errors.

**Reformulate**: if a tool fails, try different parameters.

**Fallback tool**: if primary fails, use alternative.

**Ask for help**: explicit "I need clarification" path.

**Fail loudly**: don't silently swallow errors.

The system prompt should describe these patterns explicitly.

Stopping conditions

Agents need clear stop signals:

- Goal achieved

- Maximum turns reached

- Stuck in loop

- User asked to stop

Detecting stuck agents

Common patterns:

- Same tool call repeated with same arguments

- "I'm not sure" loops

- Hallucinating tools that don't exist

Many agent frameworks include detection for these.

Constraints

Tell the agent what NOT to do:

- "Never modify production data without confirmation"

- "Never create files outside /tmp"

- "Always preserve existing user data"

Constraints are tested at every decision point. Be specific.

Few-shot examples

For complex tasks, include examples:

```

Example task: Fix the broken authentication

Example trace:

read_file('auth.py') → ...

detect bug

edit_file('auth.py', fix)

run_tests() → pass

Done.

```

Few-shot examples in agent prompts dramatically improve quality.

Context management

Long agent runs blow context windows.

Strategies

- **Summarization**: condense history periodically

- **Selective retention**: keep important context, drop noise

- **Memory tools**: store and retrieve as needed

- **Sub-agents**: delegate sub-tasks to fresh contexts

Each adds complexity. Start simple.

Multi-agent patterns

When tasks need multiple specialized agents:

Hierarchical

Manager agent delegates to worker agents.

Pipeline

Sequential specialists (research → write → review).

Debate / consensus

Multiple agents propose; vote.

Adds reliability but multiplies cost. Use only when needed.

Token budget

Long system prompts cost on every turn.

Trim:

- Redundant instructions

- Examples that don't pull weight

- Verbose tool descriptions for unused tools

But: don't trim what's actually needed. Quality regressions from prompt cuts are common.

Testing agent prompts

Without tests, prompts drift.

Approaches

- **Eval suite**: representative tasks; pass/fail rate

- **A/B testing**: compare prompt versions on real traffic

- **Adversarial tests**: edge cases, ambiguous inputs

- **Regression suite**: tasks that previously worked

Build evals before iterating.

Common failure patterns

Vague role

"You are a helpful assistant" → undefined behavior.

Missing constraints

Agent does what wasn't intended because nothing forbade it.

Tool descriptions too sparse

Agent uses tools wrong; wastes turns.

No error handling guidance

Agent gives up or hallucinates after first failure.

Inconsistent output format

Mix of JSON and markdown. Parsing breaks.

Prompt drift

Prompt updated without testing. Quality regression unnoticed.

Over-instructed

Hundreds of rules; agent loses the goal.

No stopping condition

Agent loops forever or until token budget exhausted.

Specific examples

Tool-use agent

```

You complete tasks by calling tools.

For each request:

1. Identify which tool(s) you need

2. Call them with appropriate arguments

3. Observe results

4. Iterate or return final answer

Tools:

- search(query): full-text search, returns top 10 results

- read(url): get content of URL

- write(url, content): create or update document

Format tool calls as JSON.

Format final answer as plain text after all tool calls.

Never call write without first reading.

Stop after 10 turns or when goal complete.

```

Coding agent

```

You write and modify code to complete tasks.

Process:

1. Read relevant files

2. Plan changes

3. Make edits

4. Run tests

5. Iterate until tests pass

Tools: read_file, edit_file, run_command

Never:

- Modify files outside the project

- Skip tests

- Force-push

If tests fail after 3 fix attempts, ask for help.

```

Iteration

Agent prompts evolve. Track:

- What changed

- Why

- Effect on eval suite

Treat prompts as code: versioned, tested, reviewed.

Further Reading

- [PromptCaching](PromptCaching) — Caching long prompts

- [TransformerArchitecture](TransformerArchitecture) — How LLMs work

- [OpenSourceLlmEcosystem](OpenSourceLlmEcosystem) — Open models

- [Generative AI Hub](GenerativeAIHub) — Cluster index