LLM Memory Systems: Building AI Applications That Remember
Why LLM Memory Is Harder Than It Looks
Here's an uncomfortable truth about every LLM you've ever used: it doesn't remember you. Not from yesterday. Not from five minutes ago. Not even from the last sentence — unless that sentence is still in the prompt.
Large language models are stateless functions. They take in a prompt, produce a response, and immediately forget everything. The "conversation" you have with ChatGPT? That's your chat client re-sending the entire conversation history with every single message. The model itself has the long-term memory of a goldfish with amnesia.
This creates three forces that conspire against you the moment you try to build a real application:
Force 1: Token cost scales with history. Every turn you remember, you pay for again. A 10-turn conversation burns ~2,000 tokens before the user even asks their next question. And because the full history is re-sent with every message, at $3 per million input tokens the 100th message costs roughly 100x what the first one did.
Force 2: Context windows have hard limits. Even with 128K-token models, that's about 96,000 words — maybe 2 hours of dense conversation. Hit the ceiling and you must decide what to forget. A 4K-token model gets you barely 15 exchanges.
Force 3: Not all memories are equal. "I prefer PostgreSQL" matters weeks later. "Sounds good, thanks!" does not. But a naive system treats both identically.
The solution isn't one clever trick — it's a toolkit of memory strategies, each suited to different situations. In this post, we'll build five memory systems from scratch, each progressively more sophisticated:
- Buffer — Store everything (simple, expensive)
- Window — Keep recent, drop old (bounded cost)
- Summary — Compress history into summaries (10x compression)
- Entity — Extract structured facts (never forget what matters)
- Semantic — Persistent, queryable long-term memory (cross-session recall)
If you've ever worked with human memory research, these map neatly: buffer is working memory, window is attention span, summary is episodic memory, entity is semantic memory, and long-term is autobiographical memory. We'll carry this analogy throughout.
Conversation Buffer — The Naive Baseline
The simplest memory system is no system at all: store every message and send everything to the model every time. This is what most chatbot tutorials teach, and it works fine — until it doesn't.
Think of this as your brain's working memory: it holds everything you're currently thinking about, but capacity is brutally limited. For humans, that's about 7 items. For LLMs, it's the context window.
The math is stark. Each user turn averages ~50 tokens, each assistant response ~150 tokens. That's 200 tokens per exchange. A 4K-token context fills up in about 20 exchanges — subtract the system prompt and you're down to maybe 15. Even a 128K model gives you roughly 600 exchanges before overflow.
Worse, the cumulative cost curve is quadratic. Turn 1 sends ~200 tokens. Turn 2 sends ~400. Turn 50 sends ~10,000. You're re-transmitting the entire history every time. At $3/M input tokens, a 100-turn conversation re-sends over a million cumulative tokens — about $3 total — and the last message alone costs 100x what the first one did.
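The cumulative math is easy to sketch, assuming the ~200 tokens per exchange and $3/M pricing above:

```python
# Back-of-the-envelope: cumulative cost of re-sending the full history
TOKENS_PER_EXCHANGE = 200   # ~50 user + ~150 assistant
PRICE_PER_M_INPUT = 3.00    # dollars per million input tokens

def conversation_cost(turns):
    """Total input tokens (and dollars) for a buffer-memory conversation."""
    # Turn i re-sends all i exchanges so far
    total_tokens = sum(TOKENS_PER_EXCHANGE * i for i in range(1, turns + 1))
    return total_tokens, total_tokens / 1_000_000 * PRICE_PER_M_INPUT

tokens, dollars = conversation_cost(100)
print(f"{tokens:,} tokens, ${dollars:.2f}")  # 1,010,000 tokens, $3.03
```

The quadratic growth is the whole story: doubling the conversation length roughly quadruples the total token bill.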
When is a buffer enough? Short, focused conversations under 20 turns. Prototyping. Situations where cost doesn't matter. For everything else, you need to be smarter about what you keep.
```python
class ConversationBuffer:
    """Store every message. Simplest memory — and most expensive."""

    def __init__(self, max_tokens=4096, tokens_per_word=1.3):
        self.messages = []
        self.max_tokens = max_tokens
        self.tpw = tokens_per_word

    def add(self, role, content):
        tokens = int(len(content.split()) * self.tpw)
        self.messages.append({"role": role, "content": content, "tokens": tokens})

    def get_messages(self):
        return [{"role": m["role"], "content": m["content"]} for m in self.messages]

    def token_count(self):
        return sum(m["tokens"] for m in self.messages)

    def is_near_limit(self, threshold=0.85):
        return self.token_count() > self.max_tokens * threshold


# Usage
buf = ConversationBuffer(max_tokens=4096)
buf.add("user", "I'm building a web app with React and PostgreSQL")
buf.add("assistant", "Great stack choice! What features are you planning?")
buf.add("user", "User auth, a dashboard, and real-time notifications")

print(f"Messages: {len(buf.messages)}")      # 3
print(f"Tokens used: {buf.token_count()}")   # ~30
print(f"Near limit: {buf.is_near_limit()}")  # False

# After 50 turns, token_count() will approach max_tokens
# and is_near_limit() will return True — time for a smarter strategy
```
Simple and honest. But notice we're already tracking token counts and warning when we approach the limit. Even the naive approach needs guardrails.
Sliding Window — Keep Recent, Drop Old
The first real optimization: only keep the last N messages. Like your brain's attention span — you're aware of the last few things said, but earlier details fade unless they were important enough to stick.
A fixed window of 20 messages gives you bounded memory: at ~100 tokens per message, cost never exceeds ~2,000 tokens regardless of conversation length. But fixed windows have a nasty problem called the context cliff. Imagine this: on turn 5, the user says "I'm using PostgreSQL." On turn 30, the model suggests MySQL — because turn 5 fell outside the 20-message window.
The fix is importance weighting. Not all messages deserve equal shelf life. A question matters more than "sounds good." A technical decision matters more than small talk. Code matters more than chatter.
Our improved window keeps the most recent K messages plus the highest-importance messages from older history. Think of it as keeping your recent conversation context while also pinning the important decisions to a mental corkboard.
```python
class SlidingWindowMemory:
    """Keep recent messages + high-importance older ones."""

    def __init__(self, recent_k=10, important_k=5):
        self.all_messages = []
        self.recent_k = recent_k
        self.important_k = important_k

    def _score(self, msg):
        """Heuristic importance: decisions > questions > code > chat."""
        text = msg["content"].lower()
        score = 0.1  # baseline
        if "?" in msg["content"]:
            score += 0.3
        if any(kw in text for kw in ["decide", "use", "prefer", "choose", "switch"]):
            score += 0.4
        if any(kw in text for kw in ["def ", "class ", "import ", "select ", "create "]):
            score += 0.25
        if len(text.split()) > 30:
            score += 0.15  # longer messages tend to carry more info
        return min(score, 1.0)

    def add(self, role, content):
        msg = {"role": role, "content": content, "turn": len(self.all_messages)}
        msg["importance"] = self._score(msg)
        self.all_messages.append(msg)

    def get_window(self):
        if len(self.all_messages) <= self.recent_k + self.important_k:
            merged = self.all_messages  # everything fits
        else:
            recent = self.all_messages[-self.recent_k:]
            older = self.all_messages[:-self.recent_k]
            # Top important_k from older messages, by importance score
            important = sorted(older, key=lambda m: -m["importance"])[:self.important_k]
            # Merge and sort back into original turn order
            merged = sorted(important + recent, key=lambda m: m["turn"])
        return [{"role": m["role"], "content": m["content"]} for m in merged]


# Example: PostgreSQL survives the window
mem = SlidingWindowMemory(recent_k=3, important_k=2)
mem.add("user", "I decide to use PostgreSQL for the database")  # high importance
mem.add("assistant", "Good choice for relational data!")
mem.add("user", "sounds good")  # low importance
mem.add("assistant", "What's next?")
mem.add("user", "Let's work on the API layer")
mem.add("assistant", "I'll set up Express routes")
mem.add("user", "Add error handling too")

window = mem.get_window()
# The PostgreSQL decision (turn 0) survives even at turn 6
# because its importance score (0.5) beats "sounds good" (0.1)
print(f"Window size: {len(window)}")  # 5 (3 recent + 2 important)
```
The importance heuristic here is simple — keyword matching and message length. In production, you'd use an LLM to score importance, but even these basic heuristics prevent the worst context cliff disasters.
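For illustration, an LLM-based importance scorer might look like the sketch below. `call_llm` is a placeholder for whatever model client you use, and the prompt wording is just one reasonable choice, not a canonical one:

```python
IMPORTANCE_PROMPT = """Rate how important this message is for future
conversation turns, from 0.0 (small talk) to 1.0 (a decision or fact
the assistant must remember). Reply with only the number.

Message: {message}"""

def llm_importance(message, call_llm):
    """Score a message with an LLM. `call_llm` is any prompt -> string
    function (OpenAI, Anthropic, a local model, ...). Falls back to a
    neutral score if the reply doesn't parse as a number."""
    reply = call_llm(IMPORTANCE_PROMPT.format(message=message))
    try:
        # Clamp to [0, 1] in case the model over- or undershoots
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.5  # neutral fallback on unparseable output

# Works with any backend — here, a canned stub for demonstration
score = llm_importance("We decided on PostgreSQL", lambda prompt: "0.9")
print(score)  # 0.9
```

The clamp-and-fallback handling matters more than the prompt itself: models occasionally return prose instead of a number, and a scoring failure should never crash the memory pipeline.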
Summary Memory — Compressed Conversation History
What if instead of keeping raw messages, you periodically summarize the conversation and keep the summary? This is your brain's episodic memory — you don't remember every word of yesterday's meeting, but you remember the key decisions and action items.
The compression ratio is remarkable. A 50-turn conversation might consume ~15,000 tokens as raw messages. A summary of the same conversation? About 1,500 tokens — a 10x reduction while preserving the essential facts.
The technique is progressive summarization: every N messages, call an LLM to update a running summary document. The summary grows slowly while the raw conversation is discarded. If you've read our context window strategies post, this is chunk-and-summarize applied to conversation history.
The tradeoff is nuance. Summaries lose conversational tone, hesitation, exact phrasing — the things that matter in therapeutic or coaching AI. But for task-oriented applications (coding assistants, customer support, project management), summaries preserve what matters and discard what doesn't.
One subtle danger: summary drift. After many rounds of summarization, information gets subtly altered — like a game of telephone. "User prefers PostgreSQL for its JSONB support" might drift to "User prefers PostgreSQL" after several summarization rounds, losing the why. The fix is pinned facts: critical information that's preserved verbatim and never summarized away.
```python
class SummaryMemory:
    """Compress conversation into a running summary every N turns."""

    def __init__(self, summarize_every=6):
        self.buffer = []    # unsummarized messages
        self.summary = ""   # running summary document
        self.pinned = []    # facts that survive summarization
        self.summarize_every = summarize_every
        self.turn_count = 0

    def add(self, role, content):
        self.buffer.append({"role": role, "content": content})
        self.turn_count += 1
        # Auto-summarize when buffer reaches threshold
        if len(self.buffer) >= self.summarize_every:
            self._summarize()

    def pin(self, fact):
        """Pin a critical fact so it's never lost to summarization."""
        if fact not in self.pinned:
            self.pinned.append(fact)

    def _summarize(self):
        """Compress buffer into the running summary."""
        conversation = "\n".join(
            f"{m['role']}: {m['content']}" for m in self.buffer
        )
        # In production, you'd call an LLM with a prompt like:
        #   "Update the running summary with new conversation details.
        #    Current summary: {self.summary}
        #    New conversation: {conversation}
        #    Preserve all key facts, decisions, and preferences."
        self.summary = self._simulate_summary(conversation)
        self.buffer = []  # clear after summarizing

    def _simulate_summary(self, conversation):
        """Simulate LLM summarization by extracting key sentences."""
        lines = conversation.split("\n")
        key_words = {"decide", "use", "prefer", "build", "need", "want", "plan"}
        kept = []
        for line in lines:
            if any(w in line.lower() for w in key_words):
                kept.append(line.split(": ", 1)[-1] if ": " in line else line)
        new_part = ". ".join(kept[:3]) if kept else lines[0].split(": ", 1)[-1]
        return f"{self.summary} {new_part}".strip() if self.summary else new_part

    def get_context(self):
        """Assemble memory for the prompt: pinned facts + summary + recent buffer."""
        parts = []
        if self.pinned:
            parts.append("PINNED FACTS:\n" + "\n".join(f"- {f}" for f in self.pinned))
        if self.summary:
            parts.append(f"CONVERSATION SUMMARY:\n{self.summary}")
        if self.buffer:
            parts.append("RECENT MESSAGES:\n" + "\n".join(
                f"{m['role']}: {m['content']}" for m in self.buffer
            ))
        return "\n\n".join(parts)


# Usage
mem = SummaryMemory(summarize_every=4)
mem.pin("User's database: PostgreSQL")  # pinned — never lost
mem.add("user", "I want to build a REST API with Express")
mem.add("assistant", "Let's plan the route structure")
mem.add("user", "I need endpoints for users, posts, and comments")
mem.add("assistant", "Three resources — I'll use RESTful naming")
# ^ After 4 messages, auto-summarization triggers
mem.add("user", "Add pagination to all list endpoints")

print(mem.get_context())
# Shows: pinned fact + compressed summary + recent buffer
```
Notice how the pinned fact "User's database: PostgreSQL" survives no matter how many summarization cycles run. This is how you prevent summary drift from erasing critical context.
Entity Memory — Structured Facts About the World
Summary memory compresses conversations. Entity memory does something different: it extracts structured facts about specific entities (people, projects, tools, preferences) and maintains a knowledge base. This is your brain's semantic memory — you know that Paris is the capital of France regardless of when you learned it.
The core data structure is simple: entity-attribute-value triples. ("User", "database", "PostgreSQL"). ("Project", "framework", "React 18"). ("API", "auth_method", "JWT"). Each triple is a discrete, addressable fact.
This solves the PostgreSQL problem permanently. The fact is extracted once from the conversation, stored as a triple, and injected into every future prompt. The model never forgets because the fact lives outside the conversation window.
But entity memory has its own challenge: contradictions. "Actually, let's switch to MySQL." This isn't a new fact — it's an update to an existing one. Your system needs to detect when a new extraction contradicts a stored fact and resolve the conflict, typically by trusting the more recent statement.
```python
class EntityMemory:
    """Extract and maintain structured facts about entities."""

    def __init__(self):
        self.entities = {}   # {entity_name: {attribute: value}}
        self.changelog = []  # track what changed and when

    def extract_and_store(self, message, turn_number):
        """Extract entity facts from a message and update the store."""
        # In production: use an LLM with a structured extraction prompt
        triples = self._extract_triples(message)
        for entity, attribute, value in triples:
            if entity not in self.entities:
                self.entities[entity] = {}
            old_value = self.entities[entity].get(attribute)
            if old_value and old_value != value:
                # Contradiction detected — resolve by recency
                self.changelog.append({
                    "turn": turn_number,
                    "entity": entity,
                    "attribute": attribute,
                    "old": old_value,
                    "new": value
                })
            self.entities[entity][attribute] = value

    def _extract_triples(self, message):
        """Simulate entity extraction (in production, use an LLM)."""
        triples = []
        text = message.lower()
        # Simple pattern matching — real systems use LLM extraction
        db_keywords = {"postgresql": "PostgreSQL", "mysql": "MySQL",
                       "mongodb": "MongoDB", "sqlite": "SQLite"}
        for kw, name in db_keywords.items():
            if kw in text:
                triples.append(("Project", "database", name))
        framework_kw = {"react": "React", "vue": "Vue", "angular": "Angular",
                        "express": "Express", "fastapi": "FastAPI", "django": "Django"}
        for kw, name in framework_kw.items():
            if kw in text:
                triples.append(("Project", "framework", name))
        if "jwt" in text:
            triples.append(("API", "auth_method", "JWT"))
        if "oauth" in text:
            triples.append(("API", "auth_method", "OAuth"))
        return triples

    def get_relevant(self, query=""):
        """Get entity facts relevant to the current conversation."""
        # Simple: return everything (for small stores)
        # Production: filter by keyword overlap or embedding similarity
        lines = []
        for entity, attrs in self.entities.items():
            facts = ", ".join(f"{k}: {v}" for k, v in attrs.items())
            lines.append(f"[{entity}] {facts}")
        return "\n".join(lines)

    def get_changelog(self):
        return self.changelog


# Usage — watch contradiction resolution in action
mem = EntityMemory()
mem.extract_and_store("We're building with React and PostgreSQL", turn_number=1)
mem.extract_and_store("Authentication will use JWT tokens", turn_number=5)
mem.extract_and_store("Actually, let's switch to MySQL instead", turn_number=12)

print(mem.get_relevant())
# [Project] database: MySQL, framework: React
# [API] auth_method: JWT

print(mem.get_changelog())
# [{'turn': 12, 'entity': 'Project', 'attribute': 'database',
#   'old': 'PostgreSQL', 'new': 'MySQL'}]
```
The changelog is a small but powerful addition. When the model sees that the database changed from PostgreSQL to MySQL on turn 12, it can reference the migration intelligently rather than pretending MySQL was always the plan.
For larger entity stores, you'll need relevance filtering — you can't inject every fact into every prompt. Keyword matching works for small stores; embedding similarity works better at scale. Our retrieval reranking post covers quality filtering in depth.
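A keyword-overlap filter along those lines might look like this sketch, assuming the `{entity: {attribute: value}}` shape used by `EntityMemory` above (the helper name is mine, not a library function):

```python
def filter_relevant_facts(entities, query, max_facts=5):
    """Rank entity facts by word overlap with the query — a cheap
    relevance filter suitable for small stores."""
    query_words = set(query.lower().split())
    scored = []
    for entity, attrs in entities.items():
        for attribute, value in attrs.items():
            fact_words = set(f"{entity} {attribute} {value}".lower().split())
            overlap = len(query_words & fact_words)
            scored.append((overlap, f"[{entity}] {attribute}: {value}"))
    scored.sort(key=lambda x: -x[0])
    # Keep only facts that share at least one word with the query
    return [fact for overlap, fact in scored[:max_facts] if overlap > 0]

entities = {
    "Project": {"database": "MySQL", "framework": "React"},
    "API": {"auth_method": "JWT"},
}
print(filter_relevant_facts(entities, "Which database should I use?"))
# ['[Project] database: MySQL']
```

Swapping the overlap count for an embedding similarity turns this into the scalable version without changing the surrounding interface.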
Semantic Long-Term Memory — Persistent, Queryable, Evolving
Now we reach the most sophisticated system: vector-based memory that persists across sessions and enables semantic retrieval. This is autobiographical memory — the rich, personal, evolving record of experiences that makes you you.
The concept: after each conversation, extract "memory-worthy" moments — insights, preferences, decisions, emotional reactions. Embed each memory as a vector. Store with metadata: timestamp, importance score, conversation ID, memory type. Before each response, retrieve the K most relevant memories via cosine similarity.
The magic moment: a user discusses their pytest preferences in Session 1. Three months later, in a completely different conversation, they ask about testing. The system retrieves their previous preferences and responds as if it actually knows them. This is the "it remembers me" feeling that turns a tool into an assistant.
Long-term memory also needs management. Old memories become less relevant (decay). Similar memories should merge (consolidation). Irrelevant memories waste retrieval bandwidth (pruning). We implement decay with a simple exponential: relevance = base_score * 0.95^days_old. A memory that scored 0.9 is worth about 0.54 after 10 days and 0.32 after 20 — still retrievable if highly relevant, but naturally yielding to fresher memories.
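The decay arithmetic is easy to verify in a few lines:

```python
def decayed_score(base_score, days_old, decay_rate=0.95):
    """Exponential time decay applied to a memory's base score."""
    return base_score * decay_rate ** days_old

print(round(decayed_score(0.9, 10), 2))  # 0.54
print(round(decayed_score(0.9, 20), 2))  # 0.32
```

Tuning `decay_rate` sets the half-life of your memories: 0.95 halves a score in about two weeks, while 0.99 stretches that to over two months.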
If you've read our RAG from scratch post, this will feel familiar — semantic memory is essentially RAG applied to personal conversation history instead of a document corpus. Same embedding, same retrieval, different source material.
```python
import zlib
from datetime import datetime

import numpy as np


class SemanticMemoryStore:
    """Vector-based long-term memory with decay and retrieval."""

    def __init__(self, decay_rate=0.95, embed_dim=64):
        self.memories = []
        self.decay_rate = decay_rate
        self.embed_dim = embed_dim

    def _embed(self, text):
        """Toy deterministic embedding: sum of hashed character trigrams
        (production: use sentence-transformers or an embeddings API).
        Shared trigrams give related strings like "pytest" and "tests"
        overlapping vectors; crc32 keeps seeds stable across runs,
        unlike Python's salted hash()."""
        vec = np.zeros(self.embed_dim)
        t = text.lower()
        for i in range(len(t) - 2):
            seed = zlib.crc32(t[i:i + 3].encode())
            vec += np.random.default_rng(seed).standard_normal(self.embed_dim)
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def _cosine_sim(self, a, b):
        return float(np.dot(a, b))

    def store(self, text, memory_type="general", importance=0.5, timestamp=None):
        """Store a memory-worthy moment."""
        self.memories.append({
            "text": text,
            "embedding": self._embed(text),
            "type": memory_type,
            "importance": importance,
            "timestamp": timestamp or datetime.now(),
        })

    def retrieve(self, query, top_k=3, now=None):
        """Retrieve the most relevant memories, accounting for decay."""
        if not self.memories:
            return []
        now = now or datetime.now()
        q_embed = self._embed(query)
        scored = []
        for mem in self.memories:
            similarity = self._cosine_sim(q_embed, mem["embedding"])
            days_old = (now - mem["timestamp"]).days
            decay = self.decay_rate ** days_old
            # Final score: semantic similarity * importance * time decay
            score = similarity * mem["importance"] * decay
            scored.append((score, mem))
        scored.sort(key=lambda x: -x[0])
        return [(s, m["text"], m["type"]) for s, m in scored[:top_k]]


# Simulate a user across multiple sessions
store = SemanticMemoryStore()
day1 = datetime(2026, 1, 15)
day30 = datetime(2026, 2, 14)

# Session 1 (day 1): project setup
store.store("User prefers pytest with verbose output and coverage reports",
            memory_type="preference", importance=0.8, timestamp=day1)
store.store("Project uses PostgreSQL 16 with pgvector extension",
            memory_type="decision", importance=0.9, timestamp=day1)

# Session 2 (day 30): user asks about testing
results = store.retrieve("How should I set up tests?", top_k=2, now=day30)
for score, text, mtype in results:
    print(f"[{mtype}] (score: {score:.3f}) {text}")

# With real embeddings, the pytest preference surfaces even 30 days later:
# its importance is high (0.8) and "tests" is semantically close to "pytest"
```
The scoring formula — similarity * importance * decay — is the core insight. A highly relevant old memory beats a slightly relevant new one. A critical decision persists longer than a casual preference. The three factors work together to surface exactly the right memories at the right time.
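Plugging illustrative numbers (made up for this example) into the formula shows the interplay:

```python
# similarity * importance * decay, for two competing memories:
# an old but on-topic decision vs. a fresh but tangential remark
old_memory = 0.95 * 0.9 * (0.95 ** 10)  # relevant, important, 10 days old
new_memory = 0.40 * 0.6 * (0.95 ** 0)   # weakly relevant, brand new
print(round(old_memory, 3), round(new_memory, 3))  # 0.512 0.24
```

Even after ten days of decay, the on-topic memory scores more than twice as high and wins the retrieval slot.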
The Full Memory Architecture — Combining All Five
Real applications don't choose one memory type — they combine several. A coding assistant might use a sliding window for the current conversation, entity memory for project facts, and semantic memory for cross-session recall. The challenge is assembling the prompt within a token budget.
Think of it as packing a suitcase. You have a fixed budget (say, 4,096 tokens). System instructions take ~500. The user's message takes ~100. That leaves ~3,500 for memory. How do you allocate?
A practical split: 500 tokens for entity facts, 500 for long-term memories, 2,000 for the conversation window, and 500 reserved for the response. The hybrid manager below implements exactly this — ranking and selecting memories across all stores to maximize relevance within the budget.
```python
class HybridMemoryManager:
    """Combine sliding window + entity memory + semantic long-term memory."""

    def __init__(self, token_budget=3500):
        self.window = SlidingWindowMemory(recent_k=10, important_k=3)
        self.entities = EntityMemory()
        self.longterm = SemanticMemoryStore()
        self.token_budget = token_budget
        self.tpw = 1.3  # tokens per word estimate

    def _est_tokens(self, text):
        return int(len(text.split()) * self.tpw)

    def add_message(self, role, content, turn):
        """Process a new message through all memory systems."""
        self.window.add(role, content)
        if role == "user":
            self.entities.extract_and_store(content, turn)
            # In production: use an LLM to decide if the moment is
            # memory-worthy; here, any non-trivial user message qualifies
            if len(content.split()) > 5:
                self.longterm.store(content, importance=0.6)

    def assemble_prompt(self, user_message, system_prompt="You are a helpful assistant."):
        """Build the full prompt within the token budget."""
        parts = [{"role": "system", "content": system_prompt}]
        budget = self.token_budget - self._est_tokens(system_prompt)

        # 1. Entity memory (highest priority — structured facts)
        entity_ctx = self.entities.get_relevant(user_message)
        if entity_ctx:
            entity_block = f"Known facts:\n{entity_ctx}"
            cost = self._est_tokens(entity_block)
            if cost < budget * 0.15:  # cap at 15% of budget
                parts[0]["content"] += f"\n\n{entity_block}"
                budget -= cost

        # 2. Long-term memory (cross-session recall)
        lt_results = self.longterm.retrieve(user_message, top_k=3)
        if lt_results:
            lt_texts = [text for _, text, _ in lt_results]
            lt_block = "Relevant memories:\n" + "\n".join(f"- {t}" for t in lt_texts)
            cost = self._est_tokens(lt_block)
            if cost < budget * 0.15:
                parts[0]["content"] += f"\n\n{lt_block}"
                budget -= cost

        # 3. Conversation window (fills remaining budget)
        window_msgs = self.window.get_window()
        for msg in window_msgs:
            cost = self._est_tokens(msg["content"])
            if cost <= budget:
                parts.append(msg)
                budget -= cost

        # 4. Current user message
        parts.append({"role": "user", "content": user_message})
        return parts


# Full workflow
mgr = HybridMemoryManager(token_budget=3500)
mgr.add_message("user", "I'm building a SaaS dashboard with React and PostgreSQL", turn=1)
mgr.add_message("assistant", "Great stack! Let's plan the architecture.", turn=2)
mgr.add_message("user", "I prefer pytest for testing with coverage reports", turn=3)
mgr.add_message("assistant", "I'll set up pytest-cov in the project config.", turn=4)
# ... many turns later ...
mgr.add_message("user", "Now let's set up the deployment pipeline", turn=50)

prompt = mgr.assemble_prompt("Which database should the CI pipeline test against?")
# Entity memory injects: [Project] database: PostgreSQL, framework: React
# Long-term memory surfaces stored user messages relevant to the query
# The window carries the recent conversation
# All assembled within the 3,500-token budget
```
The assembly order matters: entity facts first (most dense, highest information-per-token), long-term memories second, then the conversation window fills remaining space. This ensures critical facts are never crowded out by verbose recent chatter.
Choosing the Right Memory System
Here's how the five systems compare across the dimensions that matter:
| System | Token Cost | Recall | Cross-Session | Best For |
|---|---|---|---|---|
| Buffer | O(n) — grows linearly | Perfect (until overflow) | No | Short chats, prototyping |
| Window | O(k) — bounded | Recent + important | No | Long conversations, cost-sensitive |
| Summary | O(1) — compressed | Facts yes, nuance no | Possible | Task-oriented apps, support bots |
| Entity | O(entities) | Structured facts | Yes | Project assistants, CRM |
| Semantic | O(k) — top-k retrieval | Semantic similarity | Yes | Personal assistants, long-term tools |
Most production systems use a hybrid: entity memory for structured facts, a sliding window for conversational context, and optionally semantic memory for cross-session recall. The summary approach is excellent for customer support where you need "what happened so far" without the full transcript.
Try It: Memory System Comparison
Watch a 30-turn conversation play out under each memory system. The user discusses a project, changes a requirement on turn 17, and asks about an early decision on turn 29. See which systems remember — and which forget.
Try It: Memory Extraction Visualizer
Select a conversation turn to see what each memory system extracts from it. Different systems capture different aspects of the same message.
Conclusion
Every LLM application is a memory management problem in disguise. The model itself remembers nothing — your architecture decides what persists, what's compressed, and what's forgotten.
Start simple. A conversation buffer works fine for prototyping and short interactions. When you hit token limits or cost ceilings, graduate to a sliding window. When conversations get long and you need the gist without the bulk, add summary memory. When you need structured facts that survive across sessions, add entity extraction. And when you want the "it knows me" magic, invest in semantic long-term memory.
The hybrid approach — entity facts plus a conversation window plus selective long-term memory — is the sweet spot for most production applications. It's what systems like MemGPT pioneered: treating context management as an operating system problem, with different memory tiers for different access patterns.
Your users won't notice your memory architecture when it works. But they'll absolutely notice when it doesn't — when your assistant forgets their name, re-asks a question they answered an hour ago, or suggests MySQL when they told you PostgreSQL three times. Good memory is invisible. Bad memory is unforgivable.
References & Further Reading
- Packer et al. — "MemGPT: Towards LLMs as Operating Systems" — OS-inspired virtual context management with tiered memory (2023)
- Park et al. — "Generative Agents: Interactive Simulacra of Human Behavior" — memory-driven agent architecture with reflection and retrieval (2023)
- LangChain — Memory Documentation — practical implementations of buffer, summary, and entity memory
- Zhong et al. — "MemoryBank: Enhancing LLMs with Long-Term Memory" — Ebbinghaus forgetting curve applied to LLM memory (2023)
- DadOps — Context Window Strategies — the broader context management problem that memory is a special case of
- DadOps — RAG From Scratch — the retrieval architecture that semantic memory is built on
- DadOps — Embeddings From Scratch — how the vectors powering semantic memory work
- DadOps — Building AI Agents — agents need memory for state across tool calls