Token Metrics
LLMs operate on tokens — units of text that the model processes. Costs are per-token; context windows are measured in tokens; performance scales with tokens.
For agents and skill systems, understanding token usage is operational essentials.
What is a token
Tokens are units a tokenizer produces from text:
- "hello" → 1 token
- "world" → 1 token
- "antidisestablishmentarianism" → multiple tokens
Rough rule: 1 token ≈ 4 characters of English. So 1000 tokens ≈ 750 words.
The exact tokenization depends on the model. Anthropic's tokenizer differs slightly from OpenAI's, etc.
Where tokens are consumed
System prompt
The instructions the model receives at the start. Always present; counts every conversation.
Conversation history
Every message in the conversation. Grows as conversation progresses.
Tool calls and results
Each tool invocation has input and output. Both count.
Skill invocations
Loaded skill content. Counts when invoked.
File reads
Read tool returns file content. The whole content goes into context.
The two flavors
Input tokens
Sent to the model. Generally cheaper.
Output tokens
Generated by the model. Generally more expensive (often 5× input).
For agents that read a lot and write a lot, both matter.
Why it matters
Cost
Direct cost. More tokens = more spend.
Latency
More tokens = slower responses. The model processes everything in context; longer context = longer per-turn time.
Context limits
Models have maximum context windows. 200K-1M tokens for current Claude. Long conversations can exceed limits.
Quality
Beyond a certain point, more context doesn't help — the model focuses worse with too much information.
Measurement
Per-conversation total
How many tokens did this conversation use? Cost = total × per-token rate.
Per-turn
Each turn (message + response) has a token count. Watching turn-by-turn shows where consumption spikes.
Per-tool-call
Tool calls have input (the call) and output (the result). Both add to context.
Per-skill
When invoked, how many tokens does this skill consume?
Cache hit ratio
Anthropic's prompt caching: tokens reused from previous turns are cheaper. Cache hit ratio matters for cost.
Common consumption patterns
File reads dominating
Reading lots of files puts file content in context. Each read = file size in tokens.
For large files, partial reads (with offset/limit) save tokens.
Long conversation history
The longer the conversation, the more history. Each turn carries the full prior context.
For very long sessions, summarization or context cleanup helps.
Verbose tool outputs
Tools that emit extensive output bloat context. See [ToolOutputOptimization](ToolOutputOptimization).
Skill loads
Each invoked skill adds its content. Multiple skills compound.
Subagent overhead
Subagents have their own contexts. Parent context isn't bloated by subagent work.
For independent work, subagents save parent context.
Optimization
Smaller skill bodies
See [SkillPerformance](SkillPerformance). Brief skills + references save tokens.
Concise tool output
See [ToolOutputOptimization](ToolOutputOptimization).
Targeted reads
`Read(file, offset=100, limit=20)` instead of `Read(file)` for large files.
Search before read
`grep` to find what you need; `Read` only the relevant portion.
Cache utilization
Anthropic's caching reuses repeated context. Don't break cache by changing system prompt frequently.
Subagents for parallelism
Independent work in subagents keeps parent context cleaner.
Periodic compaction
Long conversations: summarize earlier portions.
What's NOT worth optimizing
Trivial savings
Saving 50 tokens isn't worth a complex code change.
Pre-mature optimization
Optimize when usage shows a problem; not speculatively.
Quality at cost of efficiency
A skill that uses 200 more tokens but produces dramatically better output is worth it.
Tools and metrics
Anthropic console
Shows token usage per conversation, per session.
Programmatic access
Anthropic API returns token counts in responses:
```json
{
"usage": {
"input_tokens": 1500,
"output_tokens": 300,
"cache_creation_input_tokens": 500,
"cache_read_input_tokens": 1000
}
}
```
Custom tracking
For agent systems, log token usage per operation. Find which tools, skills, or workflows cost most.
Common failure patterns
Reading everything
Agent loads many large files when partial reads would do.
Verbose tool output unfiltered
Tool dumps logs; all goes into context.
Skill bloat
Skills that grow over time; don't trim.
No measurement
Don't know which operations are expensive.
Long conversations without compaction
Context fills; quality degrades.
Cache breakage
Frequent system prompt changes invalidate caches.
A reasonable approach
For agent systems:
1. Measure: track token usage per operation type
2. Identify the heavy hitters
3. Optimize the top items: better tool output, smaller skills, targeted reads
4. Use caching effectively
5. Subagents for parallelism
6. Compact periodically if conversations are long
Further Reading
- [ToolOutputOptimization](ToolOutputOptimization) — Tool-side
- [SkillPerformance](SkillPerformance) — Skill-side
- [CustomSkillsArchitecture](CustomSkillsArchitecture) — Foundations
- [AgenticAi Hub](AgenticAiHub) — Cluster index