Agent Memory: Architectures, Substrates, and State Management

Atomic Answer: Agent memory is the dynamic state management system that allows stateless Large Language Models (LLMs) to retain context, learn from interactions, and maintain continuity across sessions. It encompasses short-term working memory for immediate tasks and long-term memory for storing facts, user preferences, and historical data via retrieval systems.

Large Language Models (LLMs) are inherently stateless, and that statelessness is one of the main obstacles to building agents on top of them.

Left to their own devices, LLMs "forget" everything after each interaction. To transform these stateless engines into continuous, learning, and personalized entities capable of multi-step reasoning, we must equip them with an Agent Memory system.

Agent memory functions as an external nervous system for AI. Because an LLM’s internal weights are frozen at the time of its training, a dynamic memory architecture allows the agent to maintain context, adapt to new user-provided information, and learn from its past actions without requiring costly model fine-tuning.

This article explores the comprehensive state management problem across different temporal and functional memory dimensions, detailing the specific storage substrates, eviction policies, and control mechanisms necessary for a robust AI agent.

1. The Core Memory Taxonomy

Atomic Answer: AI agent memory taxonomy divides state management into temporal and functional categories. Temporally, it separates short-term working memory from long-term memory systems. Functionally, it classifies data into episodic experiences, semantic facts, and procedural skills, enabling models to accurately recall context, user preferences, and execution workflows.

At a high level, agent memory is categorized into two temporal scopes, mimicking human cognition:

Short-Term (Working) Memory: This acts as the agent’s "RAM." It holds the immediate context required for the current task, such as the ongoing conversation thread, recent tool outputs, and active reasoning steps.
Management Techniques: Because working memory resides entirely within the LLM's context window, it must be actively managed through summarization, sliding windows, and token budgeting. Without that management, model performance degrades and inference costs balloon as the context grows.
Long-Term Memory (LTM): This persists beyond a single session or task execution. It allows the agent to remember user preferences, historical events, and domain-specific facts over weeks, months, or years.
LTM Implementation: LTM is typically implemented via Retrieval-Augmented Generation (RAG), storing data in external databases and fetching it only when contextually relevant.
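
The sliding-window and token-budgeting techniques named above for working memory can be sketched as follows. This is a minimal illustration under stated assumptions: `fit_to_budget` and `approx_tokens` are hypothetical names, and counting whitespace-separated words is only a crude stand-in for the model's real tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Assumed tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_budget(messages: list[str], budget: int,
                  count: Callable[[str], int] = approx_tokens) -> list[str]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so recent context is preferred.
    for msg in reversed(messages):
        cost = count(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["alpha beta gamma", "delta epsilon", "zeta", "eta theta"]
window = fit_to_budget(history, budget=5)
print(window)  # ['delta epsilon', 'zeta', 'eta theta']
```

A plain sliding window like this discards the oldest messages first, so on its own it is best suited to conversational context where early turns carry no standing instructions.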

Beyond temporal scope, research breaks memory down functionally into cognitive categories:

Episodic Memory: Stores concrete past experiences, sequential interaction logs, and specific task outcomes (e.g., "What did we discuss in our meeting last Tuesday?").
Semantic Memory: Handles factual knowledge, user preferences, and broad domain specifications (e.g., remembering that a user's favorite programming language is Python).
Procedural Memory: Encompasses learned skills, workflows, and behavioral rules (e.g., knowing the exact sequence of tools to call to execute a company reporting process).
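
One way to carry this functional taxonomy into code is to tag every stored record with its category. The sketch below is illustrative only: `MemoryKind`, `MemoryRecord`, and `by_kind` are hypothetical names, and the record contents (including the tool names in the procedural entry) are made-up examples.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryKind(Enum):
    EPISODIC = "episodic"      # concrete past experiences
    SEMANTIC = "semantic"      # facts and preferences
    PROCEDURAL = "procedural"  # skills, workflows, rules


@dataclass
class MemoryRecord:
    kind: MemoryKind
    content: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def by_kind(records: list[MemoryRecord], kind: MemoryKind) -> list[MemoryRecord]:
    """Return only the records belonging to one functional category."""
    return [r for r in records if r.kind is kind]


records = [
    MemoryRecord(MemoryKind.EPISODIC, "Meeting last Tuesday: discussed the report"),
    MemoryRecord(MemoryKind.SEMANTIC, "User's favorite programming language is Python"),
    MemoryRecord(MemoryKind.PROCEDURAL, "Reporting: fetch_data, build_report, send_email"),
]
print([r.content for r in by_kind(records, MemoryKind.SEMANTIC)])
```

Tagging at write time lets retrieval code treat the categories differently, for example querying episodic records by date and semantic records by key.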

2. The Four Memory Channels

Atomic Answer: Agent memory operates across four primary channels to manage lifecycle and storage. These include active scratch reasoning for context, rolling summarization for tool history, structured fact extraction for working memory, and specialized database substrates for long-term recall, each with tailored retention and eviction policies.

In practical implementation, agent memory operates across four distinct channels. Each channel has a specific lifetime, requires a tailored storage substrate, and demands a unique eviction policy.

A. Scratch Reasoning: Context Management

Scratch reasoning tokens (such as "Chain-of-Thought" outputs) are essential for guiding the model through complex logic. However, they quickly consume the token budget and raise inference costs, and once the turn is complete they add little lasting value to the session.

Storage Substrate: Active Model Context Window.
Eviction Policy: Retain only the current turn's reasoning. Once the agent has taken an action or provided an output, prior reasoning steps should be dropped during context compression.
Telemetry and Auditing: While reasoning should be purged from the active context window, the full reasoning trace must be logged to an external observer or telemetry store. This ensures developers can debug agent logic without congesting the prompt.
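
The eviction and telemetry rules above might be combined as in the following sketch. It assumes each message carries a role and a turn index; `Message` and `evict_stale_reasoning` are hypothetical names, and a plain list stands in for the telemetry store.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str     # "user", "assistant", "tool", or "reasoning"
    turn: int     # index of the turn that produced the message
    content: str


def evict_stale_reasoning(context: list[Message], current_turn: int,
                          telemetry: list[Message]) -> list[Message]:
    """Drop reasoning from earlier turns, copying it to telemetry first."""
    kept = []
    for msg in context:
        if msg.role == "reasoning" and msg.turn < current_turn:
            telemetry.append(msg)  # trace stays available for debugging
        else:
            kept.append(msg)
    return kept


context = [
    Message("user", 1, "Cancel my order"),
    Message("reasoning", 1, "Need the order ID before calling the cancel tool"),
    Message("assistant", 1, "Which order should I cancel?"),
    Message("user", 2, "Order 42"),
    Message("reasoning", 2, "Have the ID; call the cancel tool"),
]
telemetry = []
context = evict_stale_reasoning(context, current_turn=2, telemetry=telemetry)
print(len(context), len(telemetry))  # 4 1
```

This sketch logs a trace only at eviction time. A production system would more likely write each reasoning step to telemetry as soon as it is generated, so that nothing is lost if the process stops mid-turn.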

B. Tool History: Rolling Summarization

Agents frequently execute tools in loops. The standard conversational policy of simply "dropping the oldest messages" when the context window fills is a catastrophic failure mode for agents, as it risks deleting the user's initial instruction or goal.

Storage Substrate: Model Context (Summarized).
Eviction Policy: Summarization by age with a pinned goal:
- Pin the Goal: The initial user instruction and top-level constraints must always be pinned at the top of the prompt.
- Rolling Summary: When the context reaches a specific threshold (e.g., 50% capacity), summarize the older turns (e.g., turns 1 through $N-5$) into a concise paragraph. Preserve only the most recent turns ($N-4$ to $N$) in full fidelity.
- Preserve Entities: The summarization prompt must explicitly instruct the LLM to retain critical entities—such as IDs, names, and exact numeric values—without distortion.
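
The three rules above can be sketched as a single compression step. The function and constant names here (`compress_history`, `SUMMARY_INSTRUCTION`, `fake_summarize`) are hypothetical, the 50% threshold and five-turn tail mirror the examples in the text, and `fake_summarize` is only a placeholder for a real LLM call.

```python
from typing import Callable

# Instruction passed to the summarizer so identifiers survive compression.
SUMMARY_INSTRUCTION = (
    "Summarize the turns below in one paragraph. Copy every ID, name, "
    "and numeric value exactly as written."
)


def compress_history(goal: str, turns: list[str], used_tokens: int,
                     capacity: int,
                     summarize: Callable[[str, list[str]], str],
                     threshold: float = 0.5,
                     keep_recent: int = 5) -> list[str]:
    """Return prompt parts: pinned goal, optional summary, recent turns."""
    if used_tokens < threshold * capacity or len(turns) <= keep_recent:
        return [goal, *turns]  # under threshold: nothing to compress
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize(SUMMARY_INSTRUCTION, older)
    return [goal, f"Summary of earlier turns: {summary}", *recent]


def fake_summarize(instruction: str, turns: list[str]) -> str:
    """Placeholder for an LLM call that would receive instruction + turns."""
    return f"{len(turns)} earlier turns condensed"


turns = [f"turn {i}" for i in range(1, 9)]  # turns 1..8
prompt = compress_history("GOAL: refund order 42", turns,
                          used_tokens=600, capacity=1000,
                          summarize=fake_summarize)
print(prompt[:2])
# ['GOAL: refund order 42', 'Summary of earlier turns: 3 earlier turns condensed']
```

Because the goal is always the first element of the returned list, no amount of summarization can push it out of the prompt.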

C. Working Memory: Fact Extraction

Working memory consists of high-signal facts the agent discovers during a task that must drive future actions. These facts should be stored in structured "slots" rather than unstructured prose.

Storage Substrate: System Prompt / Structured Data format (e.g., JSON blocks).
Update Mechanism: The agent is provided with a dedicated tool, such as update_working_memory, to explicitly save new facts (e.g., account_id=9876, current_step=payment_verification).
Benefit: By storing facts in a structured block within the system prompt, these details survive all conversational summarization passes. Furthermore, it allows an external orchestrator to programmatically monitor the agent's progress or detect if it has stalled.
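
A minimal sketch of such a slot store follows. Only the tool name `update_working_memory` and the example facts come from the text; the `WorkingMemory` class, its `snapshot` and `render` methods, and the `WORKING_MEMORY` label are assumptions made for illustration.

```python
import json


class WorkingMemory:
    """Structured slots that live in the system prompt, not the chat history."""

    def __init__(self) -> None:
        self._slots: dict[str, object] = {}

    def update_working_memory(self, key: str, value: object) -> str:
        """Tool entry point the agent calls to save or overwrite a fact."""
        self._slots[key] = value
        return f"saved {key}"

    def snapshot(self) -> dict[str, object]:
        """Copy of the slots for an external orchestrator to inspect."""
        return dict(self._slots)

    def render(self) -> str:
        """JSON block to embed in the system prompt on every call."""
        return "WORKING_MEMORY = " + json.dumps(self._slots, sort_keys=True)


memory = WorkingMemory()
memory.update_working_memory("account_id", 9876)
memory.update_working_memory("current_step", "payment_verification")
print(memory.render())
# WORKING_MEMORY = {"account_id": 9876, "current_step": "payment_verification"}
```

An orchestrator could compare successive `snapshot()` results and flag a stall if a slot such as `current_step` has not changed after several loop iterations.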

D. Long-Term Memory: Storage Selection

Selecting the appropriate storage substrate for long-term memory is critical for ensuring high retrieval quality, low latency, and accurate context grounding. Unbounded memory is an anti-pattern; every channel needs a retention policy.

Use Case	Substrate	Rationale
Semantic Recall	Vector Database (Pinecone, Milvus)	Enables fuzzy matching on past interactions and semantic search over unstructured text.
User Preferences	Relational SQL DB (PostgreSQL)	Allows for typed columns and strict schemas for settings like `timezone`, `role`, or `output_format`.
Exact Recall	Transaction Log / Audit DB	Crucial for definitive answers (e.g., "Was refund #123 issued?"). Requires exact keyword matching.
Relational Knowledge	Knowledge Graph (Neo4j)	Best for mapping complex entities and their interconnected relations, providing highly structured multi-hop retrieval.
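
The difference between the first and third rows of the table can be made concrete with a toy comparison. Everything below is an in-memory stand-in rather than a real vector database or audit log: the three-dimensional embeddings are hand-written, the stored texts and the refund status are invented sample data, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "vector index": text -> hand-made embedding (a real system would
# obtain these from an embedding model).
vector_index = {
    "User asked about the refund policy": [0.9, 0.1, 0.0],
    "User prefers dark mode": [0.0, 0.2, 0.9],
}

# Toy "audit log": exact key -> recorded status.
audit_log = {"refund#123": "issued"}


def semantic_recall(query_vec: list[float], k: int = 1) -> list[str]:
    """Fuzzy match: rank stored texts by similarity to the query vector."""
    ranked = sorted(vector_index,
                    key=lambda text: cosine(query_vec, vector_index[text]),
                    reverse=True)
    return ranked[:k]


def exact_recall(key: str):
    """Exact match: the key is either present or it is not."""
    return audit_log.get(key)


print(semantic_recall([1.0, 0.0, 0.0]))  # ['User asked about the refund policy']
print(exact_recall("refund#123"))        # issued
print(exact_recall("refund#124"))        # None
```

Semantic recall always returns its nearest neighbor, even when nothing relevant is stored, which is why definitive yes-or-no questions are better served by the exact lookup.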

Forgetting and Retention Policies:

Privacy: Agents must provide a deletion path to comply with regulations such as GDPR (e.g., deleting a specific user's nodes in a graph or that user's vectors in a vector database).
Time-To-Live (TTL): Apply TTL to volatile facts that quickly become stale, such as current_location or active_session_id.
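
Both policies can be sketched with a small in-memory store. This is an assumption-laden illustration, not a substitute for the TTL or deletion features of a real database: `FactStore`, `put`, `get`, and `forget_user` are hypothetical names, and the clock is injectable so expiry can be demonstrated without waiting.

```python
import time
from typing import Callable, Optional


class FactStore:
    """Per-user facts with optional expiry and a per-user deletion path."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # (user_id, key) -> (value, expiry timestamp or None)
        self._facts: dict = {}

    def put(self, user_id: str, key: str, value: str,
            ttl_seconds: Optional[float] = None) -> None:
        expires = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._facts[(user_id, key)] = (value, expires)

    def get(self, user_id: str, key: str) -> Optional[str]:
        entry = self._facts.get((user_id, key))
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and self._clock() >= expires:
            del self._facts[(user_id, key)]  # lazily evict stale fact
            return None
        return value

    def forget_user(self, user_id: str) -> int:
        """Delete every fact for one user; returns how many were removed."""
        doomed = [k for k in self._facts if k[0] == user_id]
        for k in doomed:
            del self._facts[k]
        return len(doomed)


now = [0.0]  # fake clock so the example is deterministic
store = FactStore(clock=lambda: now[0])
store.put("u1", "timezone", "UTC")
store.put("u1", "current_location", "Berlin", ttl_seconds=60)
now[0] = 61.0
print(store.get("u1", "current_location"))  # None
print(store.get("u1", "timezone"))          # UTC
print(store.forget_user("u1"))              # 1
```

Expiry here is lazy (checked on read). A real deployment would also need a background sweep so that facts nobody reads are still removed.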

3. The Memory Management Cycle

Atomic Answer: The memory management cycle is an orchestrated control loop that continuously observes interactions, curates valuable data for storage, retrieves relevant context for new tasks, and consolidates raw logs into high-level knowledge. This process ensures efficient memory utilization while preventing database bloat and performance degradation.

A production-grade memory architecture requires more than just a database; it requires sophisticated control logic—a "Manager"—to orchestrate the flow of information.

The Manager operates in a continuous loop known as the Memory Cycle:

Observe: Collect data from the current interaction, tool outputs, and user inputs.
Store & Update (Curation): The control logic decides if the observed data is worth saving, preventing memory bloat. High-value data is routed to the appropriate LTM substrate.
Retrieve: During new tasks, fetch relevant memories to ground the LLM’s reasoning. This involves semantic similarity searches or precise SQL/graph queries.
Reflect & Evolve (Consolidation): Periodically, the system must run background jobs to reorganize, summarize, or consolidate raw interaction logs into higher-level knowledge. This process distills recurring patterns into compact, reusable entries and prunes redundant memories.
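
The four stages above can be sketched as methods on a single manager object. This is a deliberately simplified illustration: `MemoryManager` and its method names are hypothetical, Python lists stand in for the storage substrates, keyword overlap stands in for semantic search, and the summarizer is passed in as a plain function rather than an LLM call.

```python
from typing import Callable


class MemoryManager:
    def __init__(self) -> None:
        self.raw_log: list[str] = []    # everything observed
        self.long_term: list[str] = []  # curated memories

    def observe(self, event: str) -> None:
        """Observe: record the raw event."""
        self.raw_log.append(event)

    def store(self, event: str, worth_saving: Callable[[str], bool]) -> bool:
        """Store & Update: keep only new, high-value events."""
        if worth_saving(event) and event not in self.long_term:
            self.long_term.append(event)
            return True
        return False

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Retrieve: rank memories by shared words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), m)
                  for m in self.long_term]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in hits[:k]]

    def consolidate(self, summarize: Callable[[list[str]], str]) -> None:
        """Reflect & Evolve: fold the raw log into one entry, drop duplicates."""
        if self.raw_log:
            self.long_term.append(summarize(self.raw_log))
            self.raw_log.clear()
        self.long_term = list(dict.fromkeys(self.long_term))


manager = MemoryManager()
for event in ["user prefers Python", "ok", "user prefers Python"]:
    manager.observe(event)
    manager.store(event, worth_saving=lambda e: len(e.split()) >= 3)

print(manager.retrieve("which language does the user prefer"))
# ['user prefers Python']
manager.consolidate(lambda log: f"{len(log)} events observed")
print(manager.long_term)
# ['user prefers Python', '3 events observed']
```

The word-count filter used for `worth_saving` is only a placeholder; in practice the curation decision is often delegated to a classifier or to the LLM itself.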

4. Cross-Session Continuity and Verification

Atomic Answer: Cross-session continuity maintains an AI agent's persistent identity by injecting core user context, session summaries, and learned preferences into future prompts. System reliability is ensured through continuous automated evaluations measuring goal retention, fact persistence, and recall latency to optimize overall memory performance and accuracy.

To maintain the illusion of a continuous, living entity across multiple distinct sessions, a minimal continuity stack is required.

The Continuity Stack:

User Context: Core demographic and preference data injected into the system prompt of every turn.
Session Summary: A background process summarizes the last $N$ turns of a concluded session. This summary is injected into the context at the start of the next session.
Preference Injection: Structured facts (e.g., "prefers code output in Python") are persisted in the SQL database, subject to the retention and deletion policies above, and injected into the system prompt.
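
Assembling the three layers into a system prompt might look like the sketch below. The function name `build_system_prompt`, the section labels, and the sample values are all assumptions for illustration; in a real system the inputs would be loaded from the relevant stores at session start.

```python
from typing import Optional


def build_system_prompt(base: str,
                        user_context: dict[str, str],
                        session_summary: Optional[str],
                        preferences: list[str]) -> str:
    """Combine the continuity layers into one system prompt string."""
    parts = [base]
    if user_context:
        pairs = "; ".join(f"{k}={v}" for k, v in sorted(user_context.items()))
        parts.append("User context: " + pairs)
    if session_summary:
        parts.append("Previous session: " + session_summary)
    if preferences:
        parts.append("Preferences:\n" + "\n".join(f"- {p}" for p in preferences))
    return "\n\n".join(parts)


prompt = build_system_prompt(
    base="You are a helpful assistant.",
    user_context={"timezone": "UTC", "role": "developer"},
    session_summary="User debugged a failing report job.",
    preferences=["prefers code output in Python"],
)
print(prompt)
```

Empty layers are skipped, so a first-time user with no history simply receives the base prompt.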

Verification and Evaluation:

Memory quality must be continuously measured via automated evaluations (evals):

Goal Retention: Can the agent successfully execute the final action corresponding to the initial instruction after a 15+ turn detour?
Fact Persistence: Does the agent accurately recall a specific ticket ID provided 10 turns ago, despite rolling summarization?
Recall Latency: Are vector retrieval operations introducing unacceptable delays? Monitor the latency added versus the tangible quality gain in the agent's output.
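
Two of these evals, fact persistence and recall latency, can be sketched with a stand-in agent. Everything here is hypothetical scaffolding: `ToyAgent` simply remembers its last few messages and is not a real LLM agent, and the ticket ID is invented test data. The point is the shape of the harness, not the agent.

```python
import time
from typing import Callable


class ToyAgent:
    """Stand-in agent that remembers only the last `window` user messages."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.history: list[str] = []

    def __call__(self, message: str) -> str:
        self.history = (self.history + [message])[-self.window:]
        if message == "What is my ticket ID?":
            for past in self.history:
                if past.startswith("My ticket ID is "):
                    return past
            return "I don't know."
        return "ok"


def fact_persistence_eval(agent: Callable[[str], str], ticket_id: str,
                          filler_turns: int = 10) -> bool:
    """State a fact, add unrelated turns, then check the fact is recalled."""
    agent(f"My ticket ID is {ticket_id}.")
    for i in range(filler_turns):
        agent(f"Unrelated question {i}")
    return ticket_id in agent("What is my ticket ID?")


def measure_latency_ms(fn: Callable, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


print(fact_persistence_eval(ToyAgent(window=20), "T-4821"))  # True
print(fact_persistence_eval(ToyAgent(window=5), "T-4821"))   # False
_, elapsed_ms = measure_latency_ms(sorted, [3, 1, 2])
print(f"retrieval stand-in took {elapsed_ms:.3f} ms")
```

Wrapping the real retrieval call in `measure_latency_ms` and running the same eval with retrieval on and off gives the latency-versus-quality comparison described above.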

By implementing these structured channels, substrates, and continuous evaluation loops, developers can build AI agents that not only reason effectively but also accumulate lasting intelligence and context over their lifespans.