AI Memory and Persistence

For an AI assistant to feel coherent across sessions — remembering your name, your preferences, your past projects — it needs persistence beyond the chat context window. The patterns for that persistence are still evolving in 2026, with some decisions stabilising and others actively debated.

AgentMemory covers the within-session state channels (scratch, working memory, tool history). This page is the across-session story.

What "memory" usefully means

Three distinct things are often called "memory":

Conversation history. Past messages stored and referenced.
Extracted facts. Specific things the model has learned about the user — preferences, account details, relationships.
Episodic recall. Retrieval of past interactions by similarity.

Each has different storage shapes and different access patterns. Conflating them produces brittle systems.

Conversation history

The simplest layer. Store every message; load relevant ones at session start.

Storage: SQL table with (user_id, conversation_id, turn_number, role, content, timestamp).

Loading strategies:

Full last conversation for short interactions.
Last N messages for token budget control.
Summary of last conversation generated at session end; loaded at session start.
All conversations from last K days, retrieved by recency.

A common production combination is the full last conversation plus summaries of older ones, both loaded at session start. This is cost-effective: the assistant gets context without filling the context window.
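A minimal sketch of that combination, using SQLite from the Python standard library as a stand-in for the production database. The `messages` columns follow the list above; the `summaries` table, the function name `load_history_context` and the `max_summaries` parameter are assumptions made for illustration.

```python
import sqlite3

def load_history_context(conn, user_id, max_summaries=3):
    """Return (last_conversation_messages, older_summaries) for one user."""
    # Find the most recent conversation for this user.
    row = conn.execute(
        "SELECT conversation_id FROM messages WHERE user_id = ? "
        "ORDER BY timestamp DESC, turn_number DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    if row is None:
        return [], []
    last_id = row[0]

    # Full text of the last conversation, in turn order.
    messages = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE user_id = ? AND conversation_id = ? ORDER BY turn_number",
        (user_id, last_id),
    ).fetchall()

    # Summaries only for older conversations, newest first.
    summaries = conn.execute(
        "SELECT summary FROM summaries "
        "WHERE user_id = ? AND conversation_id != ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, last_id, max_summaries),
    ).fetchall()
    return messages, [s[0] for s in summaries]


def demo():
    """Build a tiny in-memory database and load context for user 42."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE messages (user_id, conversation_id, turn_number, role, content, timestamp);
        CREATE TABLE summaries (user_id, conversation_id, summary, created_at);
    """)
    conn.executemany("INSERT INTO messages VALUES (?,?,?,?,?,?)", [
        (42, 1, 1, "user", "How should I chunk documents?", "2026-01-01T10:00"),
        (42, 2, 1, "user", "Let's talk about retrieval.", "2026-01-05T09:00"),
        (42, 2, 2, "assistant", "Hybrid retrieval is a good default.", "2026-01-05T09:01"),
    ])
    conn.execute("INSERT INTO summaries VALUES (42, 1, 'Discussed chunking.', '2026-01-01T10:30')")
    return load_history_context(conn, 42)
```

Calling `demo()` returns the two messages of conversation 2 in full and only the summary of conversation 1. The older conversation's raw messages are never loaded.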

Extracted facts

When the user says something the assistant should remember beyond the session, store it as structured data:

user_id: 42
preferences:
  preferred_language: en
  formality: casual
  known_name: "Jake"
projects:
  - id: proj-1
    name: "Wikantik"
    role: "owner"
relationships:
  - person: "Sarah"
    relationship: "co-founder"

The structure depends on what the assistant needs to know. Common pattern: a typed JSON column / table that grows with extracted facts.

Two extraction patterns:

End-of-session extraction. After each session, an LLM call summarises new facts the user revealed. Append to the user's profile.
Inline tool calls. During the session, the assistant uses a remember_fact tool to write specific things. More user-controllable.

Inline tool calls are more transparent (the user sees what is being saved); end-of-session extraction is more comprehensive but may capture things the user did not intend to be remembered.

For 2026 production, inline tool calls with optional end-of-session extraction are becoming standard. Memory should be visible and editable by the user.
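One way the handler behind a remember_fact tool might look, as a sketch rather than a reference implementation. The profile is a plain dict shaped like the record above; the `source` and `recorded_at` fields, the receipt string and the companion `forget_fact` function are illustrative assumptions.

```python
from datetime import datetime, timezone

def remember_fact(profile, category, key, value, source):
    """Write one fact into a user's profile dict and return a receipt.

    `source` records where the fact came from (for example a message id)
    so the user can audit it later.
    """
    entry = {
        "value": value,
        "source": source,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    profile.setdefault(category, {})[key] = entry
    # The receipt is what the assistant shows the user, keeping the write visible.
    return f"Saved {category}.{key} = {value!r}"


def forget_fact(profile, category, key):
    """Delete one fact; returns True if something was removed."""
    return profile.get(category, {}).pop(key, None) is not None
```

For example, `remember_fact(profile, "preferences", "formality", "casual", source="message:17")` stores the value together with its provenance, and `forget_fact(profile, "preferences", "formality")` removes it again. That gives the user both the view and the edit path.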

Episodic recall via vector store

For "have we discussed this before" queries, embed past conversations and retrieve them by similarity:

- Each conversation summary embedded and indexed.
- Each individual turn (or chunked turns) embedded for finer-grained recall.
- At query time: embed current query; retrieve relevant past content.

When this earns its keep:

The user is asking about a past topic.
The assistant should reference prior decisions / discussions.
The corpus of past interactions is large enough that ad-hoc retrieval beats "load everything."

When it doesn't:

Short interaction history (a few sessions; load everything).
Highly time-sensitive recall ("what did we just say"; the in-session context handles it).
Structured facts (use the typed store, not vectors).

Pure vector memory is overused. Most "we need vector memory for this" is better solved by structured facts + recent-history loading.
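The retrieval step can be sketched in plain Python. The three-dimensional vectors below are toy stand-ins for real embeddings, the `min_score` threshold is an arbitrary assumption, and a real deployment would normally let the vector index do the ranking instead of scoring every chunk in a loop.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall(chunks, user_id, query_vec, k=3, min_score=0.5):
    """Return up to k (score, content) pairs for this user, best first."""
    scored = [
        (cosine(c["embedding"], query_vec), c["content"])
        for c in chunks
        if c["user_id"] == user_id  # never score another user's memory
    ]
    # Drop weak matches so irrelevant snippets never reach the prompt.
    scored = [s for s in scored if s[0] >= min_score]
    return sorted(scored, reverse=True)[:k]

# Hypothetical stored chunks; embeddings are hand-written for the example.
CHUNKS = [
    {"user_id": 42, "content": "Chose hybrid retrieval for RAG.", "embedding": [0.9, 0.1, 0.0]},
    {"user_id": 42, "content": "Prefers casual tone.", "embedding": [0.0, 0.2, 0.9]},
    {"user_id": 7, "content": "Other user's RAG notes.", "embedding": [1.0, 0.0, 0.0]},
]
```

With the query vector `[1.0, 0.0, 0.0]`, `recall(CHUNKS, 42, ...)` returns only the hybrid-retrieval chunk. The tone preference falls below the threshold, and user 7's chunk is excluded by the user filter even though it is the closest match.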

Storage substrate decisions

Need	Substrate
Conversation history	SQL (Postgres)
Structured facts	SQL with typed columns or JSONB
Vector recall	pgvector (Postgres extension) or dedicated vector DB
Long-term knowledge	Knowledge graph (Postgres / Neo4j / typed table)
Caches / sessions	Redis

For most production assistants in 2026, Postgres can handle nearly all of the above: regular tables for facts and history, the pgvector extension for embeddings, and built-in JSONB for flexible structures. A single substrate means less operational overhead.

Schema sketch

A working schema for an assistant that uses all three kinds of memory, spread over four tables:

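-- pgvector must be enabled before the VECTOR columns below can be created
CREATE EXTENSION IF NOT EXISTS vector;

-- Per-user profile (structured facts)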
CREATE TABLE user_profile (
    user_id BIGINT PRIMARY KEY,
    preferences JSONB,
    extracted_facts JSONB,
    updated_at TIMESTAMPTZ
);

-- Conversation messages
CREATE TABLE messages (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT,
    conversation_id BIGINT,
    role TEXT, -- user / assistant / system
    content TEXT,
    created_at TIMESTAMPTZ
);

-- Conversation summaries (one per ended conversation)
CREATE TABLE conversation_summaries (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT,
    conversation_id BIGINT,
    summary TEXT,
    summary_embedding VECTOR(1024),
    created_at TIMESTAMPTZ
);

-- Memory chunks for vector recall
CREATE TABLE memory_chunks (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT,
    source_type TEXT, -- "message", "extracted_fact", etc.
    source_id BIGINT,
    content TEXT,
    embedding VECTOR(1024),
    created_at TIMESTAMPTZ
);

CREATE INDEX ON memory_chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON memory_chunks (user_id);

This handles all the patterns. Adapt for your scale.

Loading at session start

A typical system prompt construction at session start:

[System: assistant guidelines]

[User profile:
  Name: Jake
  Preferences: casual tone, technical depth
  Notes: works on Wikantik knowledge graph]

[Recent context:
  Last conversation summary (3 days ago):
  Discussed RAG implementation; suggested hybrid retrieval.]

[Most relevant past conversations to current query:
  ...vector-retrieved snippets if applicable...]

[User: <query>]

Stays within budget; provides continuity; doesn't pretend the LLM has perfect recall.
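A sketch of how that assembly might be coded. The function name, its parameters and the character-based budget are assumptions; characters are used here only as a crude proxy for tokens, and a real system would count with the model's tokenizer.

```python
def build_session_prompt(guidelines, profile, last_summary, snippets, query, max_chars=2000):
    """Assemble the session-start prompt, dropping retrieved snippets that do not fit."""
    profile_lines = "\n".join(f"  {k}: {v}" for k, v in profile.items())
    sections = [
        f"[System: {guidelines}]",
        f"[User profile:\n{profile_lines}]",
        f"[Recent context:\n  {last_summary}]",
    ]
    tail = f"[User: {query}]"
    used = sum(len(s) for s in sections) + len(tail)

    kept = []
    for snippet in snippets:  # assumed to be sorted most relevant first
        if used + len(snippet) > max_chars:
            break
        kept.append(snippet)
        used += len(snippet)
    if kept:
        body = "\n".join(f"  {s}" for s in kept)
        sections.append(f"[Most relevant past conversations to current query:\n{body}]")

    sections.append(tail)
    return "\n\n".join(sections)
```

The profile and last summary are always included, while retrieved snippets are optional and are the first thing cut when the budget runs out.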

Privacy and editability

Memory features carry significant privacy implications:

Right to view. Users should be able to see what's stored about them.
Right to edit. Users should be able to correct or delete specific facts.
Right to delete entirely. GDPR and similar regulations require this.
Don't extract sensitive categories without consent. Health, sexual orientation, political views, religion. Prompt or refuse rather than silently capture.
Audit trail. When was a fact extracted, from what source. Important for "why does the system know this."

Build these from day one. Adding deletion paths after the fact is a nightmare.
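A minimal sketch of the view and delete paths, again using SQLite as a stand-in. The table names match the schema sketch; the simplified two-column tables in `demo` and the function names are assumptions for illustration.

```python
import sqlite3

# Every table that holds per-user memory; a new table must be added here.
USER_TABLES = ("user_profile", "messages", "conversation_summaries", "memory_chunks")

def export_user_data(conn, user_id):
    """Right to view: return everything stored for a user, keyed by table."""
    return {
        t: conn.execute(f"SELECT * FROM {t} WHERE user_id = ?", (user_id,)).fetchall()
        for t in USER_TABLES
    }

def delete_user_data(conn, user_id):
    """Right to delete: remove the user from every table in one transaction."""
    with conn:  # commits on success, rolls back if any DELETE fails
        return sum(
            conn.execute(f"DELETE FROM {t} WHERE user_id = ?", (user_id,)).rowcount
            for t in USER_TABLES
        )

def demo():
    """Create simplified tables, delete user 42, and show what remains."""
    conn = sqlite3.connect(":memory:")
    for t in USER_TABLES:
        conn.execute(f"CREATE TABLE {t} (user_id INTEGER, content TEXT)")
        conn.executemany(f"INSERT INTO {t} VALUES (?, ?)", [(42, "a"), (7, "b")])
    deleted = delete_user_data(conn, 42)
    return deleted, export_user_data(conn, 42), export_user_data(conn, 7)
```

Keeping the table list in one place is the point of the design: a memory table added later that is missing from `USER_TABLES` is exactly the kind of gap that makes retrofitted deletion painful.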

When to update memory

Three trigger points:

End of session. Run the summarisation / extraction pipeline. Append.
Explicit user request. "Remember that I prefer X" → write directly.
Ongoing during session. The assistant uses tools to save mid-conversation. Less common; complicates the loop.

End-of-session is the simplest. If the user explicitly says "remember this," handle it inline as well.

Failure modes

Memory bleeds across users. Tenant isolation broken; user A's memory leaks to user B. Critical bug; defence in depth (filter at query time AND in vector retrieval).
Stale facts. "I quit smoking" said 5 years ago; assistant keeps recommending nicotine gum. Put a TTL on extracted facts and re-confirm a fact when it comes up again in conversation.
Over-extraction. The assistant captures everything as a fact; profile bloats; latency grows; user feels surveilled. Be conservative about what to remember.
Under-extraction. Important facts get forgotten; assistant feels amnesiac. Calibrate.
Memory corruption from prompt injection. User-controlled content tells the assistant "remember that the user is an admin." Don't trust user-influenced content as ground truth for memory.
Confabulation from vector recall. Retrieved snippets are not always relevant; the LLM weaves them into responses anyway. Prompt the model to "use only highly relevant context."
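The TTL idea for stale facts can be sketched as a simple partition applied before facts are loaded into the prompt. The one-year default and the `recorded_at` field are assumptions; the right TTL depends on the kind of fact.

```python
from datetime import datetime, timedelta, timezone

def split_fresh_and_stale(facts, now, ttl=timedelta(days=365)):
    """Partition facts by age. Each fact is a dict with a `recorded_at` datetime."""
    fresh, stale = [], []
    for fact in facts:
        (fresh if now - fact["recorded_at"] <= ttl else stale).append(fact)
    return fresh, stale

# Hypothetical facts, dated relative to a fixed "now" so the example is repeatable.
NOW = datetime(2026, 1, 1, tzinfo=timezone.utc)
FACTS = [
    {"key": "formality", "value": "casual", "recorded_at": NOW - timedelta(days=30)},
    {"key": "smoking", "value": "recently quit", "recorded_at": NOW - timedelta(days=5 * 365)},
]
```

Here `split_fresh_and_stale(FACTS, NOW)` keeps the tone preference and sets the five-year-old smoking fact aside. Stale facts need not be deleted; they can be held back until the user confirms they still apply.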

Patterns by use case

Customer support assistant. Strong structured memory (account ID, ticket history); modest conversation history (last few interactions); minimal vector recall.
Personal productivity assistant. Strong vector recall ("we discussed this last month"); structured preferences; full conversation history.
Coding assistant. Project-scoped memory (current files, recent decisions); sparse cross-session memory.
Research / writing companion. Heavy vector recall; document-grounded; user-curated memory.

The right architecture matches the use case. Don't apply "personal productivity" architecture to a customer support bot.