Token Metrics

LLMs operate on tokens — units of text that the model processes. Costs are per-token; context windows are measured in tokens; performance scales with tokens.

For agents and skill systems, understanding token usage is operational essentials.

What is a token

Tokens are units a tokenizer produces from text:

"hello" → 1 token
"world" → 1 token
"antidisestablishmentarianism" → multiple tokens

Rough rule: 1 token ≈ 4 characters of English. So 1000 tokens ≈ 750 words.

The exact tokenization depends on the model. Anthropic's tokenizer differs slightly from OpenAI's, etc.

Where tokens are consumed

System prompt

The instructions the model receives at the start. Always present; counts every conversation.

Conversation history

Every message in the conversation. Grows as conversation progresses.

Tool calls and results

Each tool invocation has input and output. Both count.

Skill invocations

Loaded skill content. Counts when invoked.

File reads

Read tool returns file content. The whole content goes into context.

The two flavors

Input tokens

Sent to the model. Generally cheaper.

Output tokens

Generated by the model. Generally more expensive (often 5× input).

For agents that read a lot and write a lot, both matter.

Why it matters

Cost

Direct cost. More tokens = more spend.

Latency

More tokens = slower responses. The model processes everything in context; longer context = longer per-turn time.

Context limits

Models have maximum context windows. 200K-1M tokens for current Claude. Long conversations can exceed limits.

Quality

Beyond a certain point, more context doesn't help — the model focuses worse with too much information.

Measurement

Per-conversation total

How many tokens did this conversation use? Cost = total × per-token rate.

Per-turn

Each turn (message + response) has a token count. Watching turn-by-turn shows where consumption spikes.

Per-tool-call

Tool calls have input (the call) and output (the result). Both add to context.

Per-skill

When invoked, how many tokens does this skill consume?

Cache hit ratio

Anthropic's prompt caching: tokens reused from previous turns are cheaper. Cache hit ratio matters for cost.

Common consumption patterns

File reads dominating

Reading lots of files puts file content in context. Each read = file size in tokens.

For large files, partial reads (with offset/limit) save tokens.

Long conversation history

The longer the conversation, the more history. Each turn carries the full prior context.

For very long sessions, summarization or context cleanup helps.

Verbose tool outputs

Tools that emit extensive output bloat context. See ToolOutputOptimization.

Skill loads

Each invoked skill adds its content. Multiple skills compound.

Subagent overhead

Subagents have their own contexts. Parent context isn't bloated by subagent work.

For independent work, subagents save parent context.

Optimization

Smaller skill bodies

See SkillPerformance. Brief skills + references save tokens.

Concise tool output

See ToolOutputOptimization.

Targeted reads

Read(file, offset=100, limit=20) instead of Read(file) for large files.

Search before read

grep to find what you need; Read only the relevant portion.

Cache utilization

Anthropic's caching reuses repeated context. Don't break cache by changing system prompt frequently.

Subagents for parallelism

Independent work in subagents keeps parent context cleaner.

Periodic compaction

Long conversations: summarize earlier portions.

What's NOT worth optimizing

Trivial savings

Saving 50 tokens isn't worth a complex code change.

Pre-mature optimization

Optimize when usage shows a problem; not speculatively.

Quality at cost of efficiency

A skill that uses 200 more tokens but produces dramatically better output is worth it.

Tools and metrics

Anthropic console

Shows token usage per conversation, per session.

Programmatic access

Anthropic API returns token counts in responses:

{
    "usage": {
        "input_tokens": 1500,
        "output_tokens": 300,
        "cache_creation_input_tokens": 500,
        "cache_read_input_tokens": 1000
    }
}

Custom tracking

For agent systems, log token usage per operation. Find which tools, skills, or workflows cost most.

Common failure patterns

Reading everything

Agent loads many large files when partial reads would do.

Verbose tool output unfiltered

Tool dumps logs; all goes into context.

Skill bloat

Skills that grow over time; don't trim.

No measurement

Don't know which operations are expensive.

Long conversations without compaction

Context fills; quality degrades.

Cache breakage

Frequent system prompt changes invalidate caches.

A reasonable approach

For agent systems:

Measure: track token usage per operation type
Identify the heavy hitters
Optimize the top items: better tool output, smaller skills, targeted reads
Use caching effectively
Subagents for parallelism
Compact periodically if conversations are long