eric.esley_
// experiment · personal portfolio · 2024

Living NPCs:
generative AI
in retro games

What happens when you try to give persistent memory and coherent personality to NPCs in a Pokémon-style game using an LLM? This is the technical record of what worked, what didn't, and why the problem is more interesting than it seems.

Author: Eric Esley
Duration: ~6 weeks
Model: GPT-4o-mini
Iterations: v0.1 → v0.4
Conversations: ~220 tests
Tags: LLM · Prompt engineering · Context management · Token budget · Python · Few-shot · Streaming
// table of contents
  1. The Problem — why classic NPCs are cognitively flat
  2. System Architecture — the three modules and the token budget
  3. Prompt Engineering — anatomy of the system prompt per NPC
  4. Memory Management — memory store and context compression
  5. Iteration Log — from v0.1 to v0.4
  6. Results and Metrics — 220 evaluated conversations
  7. Limitations Found — honest technical analysis
  8. Implemented Solutions — what worked and what didn't
  9. Conclusions and Next Steps

Classic NPCs are cognitively flat

In any Game Boy Pokémon or Zelda game, NPCs have a behavior that, seen from 2024, seems almost absurd: every time you encounter them, they deliver the exact same text, regardless of what happened before. The guard who blocked your way north does so again even after you've already earned the badge. No state, no memory, no personality.

This wasn't a design flaw — it was a technical limitation of the 90s: a Game Boy cartridge ROM held between 256 KB and 1 MB. There was no budget for storing complex conversational state. In 2024, with access to LLM APIs at fractions of a cent per call, the natural question is: how much of this problem can we solve?

Working Hypothesis

With an LLM as a dialogue engine, an external memory system, and a context assembler managing the token budget, it should be possible for NPCs to hold contextually coherent conversations throughout long game sessions. The key technical challenge lies not in the model itself — it lies in the data architecture built around it.

The four properties an "intelligent" NPC needs

Episodic memory

Remembering specific facts from past conversations: "last time you told me you were heading to the gym." Different from just recalling "what happened in general."

Consistent personality

Maintaining the same tone, vocabulary and attitude across sessions. A stern guard can't become sarcastic without a narrative reason.

World awareness

Knowing what it can and cannot reveal depending on the game state. No spoiling the final boss if the player hasn't reached that point in the story.

Narrative constraints

Staying within the game universe. The NPC cannot mention that "it's an AI" or refer to anything outside the established lore canon.

POKÉMON EXPERIMENT v0.3 — NPC_MEMORY_MODULE
~340ms
// DIALOGUE WITH ACTIVE MEMORY · TURN 3
■ GUARD LUCAS
Ah, you again! I remember you were heading to Cinnabar Gym. Did you get the badge? The north road is still closed, but you look stronger now...
ctx_tokens: 847 · mem_slots: 2 · temp: 0.4
► mem_retrieved: [gym_objective, player_name]
► persona_hash: guard_stern_friendly_v2
► latency: 338ms · cost: $0.000127
► model: gpt-4o-mini-2024-07-18
MODE: EXPERIMENT · memory_active=True · context_compression=True ▶ PRESS A TO CONTINUE

↑ Prototype in PyGame. The guard correctly references the objective mentioned in the previous turn.

Three modules, one token budget

The design starts from a clear constraint: every API call must cost as little as possible without sacrificing coherence. That forces you to think about the architecture before writing a single line of code. The system has three independent modules that communicate in sequence each time the player interacts with an NPC.

PyGame engine (events) → NPC Config (persona + lore, few-shot examples) → Memory Store (JSON per NPC + session buffer) → Context Assembler (token budget, priority queue, compression, ≤1500 tok) → GPT-4o-mini API (streaming, temp=0.4) → response + memory extraction → memory store
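A rough sketch of that per-interaction sequence; every class and function name here is a stand-in for illustration, not the project's actual API:

```python
# Hedged sketch of one player-NPC interaction passing through the three
# modules. StubAssembler and StubMemoryStore are minimal stand-ins.
class StubAssembler:
    def assemble(self, npc_id, session):
        # real version: priority queue + token budget (see next section)
        return [{"role": "system", "content": f"persona:{npc_id}"}] + session

class StubMemoryStore:
    def __init__(self):
        self.extractions = []
    def schedule_extraction(self, npc_id, session):
        # real version: async secondary call that extracts new facts
        self.extractions.append(npc_id)

def handle_interaction(npc_id, player_line, session, assembler, memory, llm):
    session.append({"role": "user", "content": player_line})
    messages = assembler.assemble(npc_id, session)   # fit history to budget
    reply = llm(messages)                            # GPT-4o-mini in the real system
    session.append({"role": "assistant", "content": reply})
    memory.schedule_extraction(npc_id, session)      # update the memory store
    return reply

session = []
store = StubMemoryStore()
reply = handle_interaction("guard_lucas", "hello", session,
                           StubAssembler(), store, lambda m: "Route 6. Closed.")
print(reply, len(session))  # Route 6. Closed. 2
```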

The Context Assembler — code of the critical module

The idea of "connecting an NPC to an API" seems trivial. The real problem surfaces when the history grows: without context management, the cost and latency grow linearly with the number of turns. The Context Assembler solves this with a priority queue that guarantees the most important blocks always fit into the context, and the least important ones are dropped first:

python · context_assembler.py
class ContextOverflowError(Exception):
    """Raised when a mandatory block alone exceeds the token budget."""

class ContextAssembler:
    TOKEN_BUDGET = 1500   # maximum input tokens per call

    # Lower number = higher priority = never trimmed
    PRIORITIES = {
        "system_persona": 0,    # NPC identity — mandatory
        "hard_rules":     1,    # lore constraints — mandatory
        "key_memories":   2,    # key facts from the memory store
        "recent_turns":   3,    # last N turns of the session
        "old_turns":      4,    # older turns (dropped first)
        "few_shot":       5,    # style examples — optional
    }

    def assemble(self, npc_id: str, session: list) -> list:
        blocks = self._load_blocks(npc_id, session)
        return self._fit_to_budget(blocks)

    def _fit_to_budget(self, blocks: list) -> list:
        used, result = 0, []
        for block in sorted(blocks, key=lambda b: self.PRIORITIES[b["type"]]):
            cost = self._count_tokens(block["content"])
            if used + cost <= self.TOKEN_BUDGET:
                result.append(block)
                used += cost
            elif block["type"] in ("system_persona", "hard_rules"):
                # Critical blocks — if they don't fit, there's a design bug
                raise ContextOverflowError(f"critical block too large: {block['type']}")
        return result

    def _count_tokens(self, text: str) -> int:
        # Approximation: ~4 characters per token (valid for English/Spanish text)
        # In production I'd use tiktoken for an exact count
        return len(text) // 4

The result is that, regardless of how many turns the conversation has had, the API call never exceeds 1,500 input tokens. Older turns are automatically dropped, but the important facts will have already been extracted to the memory store before that happens.
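To make the dropping behavior concrete, here is a minimal standalone sketch of the budget-fitting step with a deliberately tiny budget; the block contents and the 50-token limit are illustrative only:

```python
# Standalone sketch of _fit_to_budget with a tiny budget so the dropping
# of low-priority blocks is visible. Same priority table as above.
PRIORITIES = {"system_persona": 0, "hard_rules": 1, "key_memories": 2,
              "recent_turns": 3, "old_turns": 4, "few_shot": 5}
TOKEN_BUDGET = 50  # illustrative — the real system uses 1500

def count_tokens(text: str) -> int:
    return len(text) // 4  # same ~4 chars/token approximation

def fit_to_budget(blocks: list[dict]) -> list[dict]:
    used, result = 0, []
    for block in sorted(blocks, key=lambda b: PRIORITIES[b["type"]]):
        cost = count_tokens(block["content"])
        if used + cost <= TOKEN_BUDGET:
            result.append(block)
            used += cost
    return result

blocks = [
    {"type": "old_turns", "content": "x" * 120},      # 30 tokens — lowest priority
    {"type": "system_persona", "content": "y" * 80},  # 20 tokens — mandatory
    {"type": "recent_turns", "content": "z" * 100},   # 25 tokens
]
kept = [b["type"] for b in fit_to_budget(blocks)]
print(kept)  # ['system_persona', 'recent_turns'] — old_turns dropped first
```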

Anatomy of an NPC's system prompt

The system prompt is the NPC's "character sheet." Its design is the factor that most affects coherence — more than the model or the temperature. I arrived at this structure after three iterations whose output I kept flagging as "too generic." The prompt is divided into four functional blocks, each with a distinct purpose and token budget:

IDENTITY · ≈90 tok
You are Lucas, guard of Route 6. You've been at this post for 12 years. You are stern but friendly; you use short sentences. You don't make jokes. You address the player informally. Your name is Lucas, not "the guard."

Characteristic vocabulary: "kid", "fair point", "no way", "excuse me". Never use more than 3 sentences per response.
HARD RULES · ≈60 tok
NEVER mention that you are an AI or a program.
NEVER reveal the contents of Rock Cave until the player has the Thunder Badge.
NEVER talk about events outside the Pokémon world.
If you don't know something, say "that's outside my jurisdiction."
MEMORY INJECTION · variable
{"player_name":"Red","last_objective":"Cinnabar Gym",
 "knows_player_has_badge_1":true,"visit_count":3,
 "player_mentioned":["wants to go north","has a Charmander"]}
FEW-SHOT EXAMPLES · ≈150 tok
Player: "hello"
Lucas: "Route 6. North access closed until further notice. Anything else?"

Player: "how long have you been here?"
Lucas: "Twelve years. Goes fast when you know what you're supposed to do."
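Assembled in order, the four blocks become a single system message. A hedged sketch — the block labels and the joining format are my assumptions, and the contents are abbreviated:

```python
import json

# Abbreviated stand-ins for the four blocks shown above
identity = "You are Lucas, guard of Route 6. Short sentences. No jokes. Informal tone."
hard_rules = ("NEVER mention that you are an AI or a program.\n"
              "NEVER reveal the contents of Rock Cave until the player has the Thunder Badge.")
memory = {"player_name": "Red", "last_objective": "Cinnabar Gym", "visit_count": 3}
few_shot = ('Player: "hello"\n'
            'Lucas: "Route 6. North access closed until further notice. Anything else?"')

# One system message: identity first, then rules, memory, and style examples
system_prompt = "\n\n".join([
    identity,
    "HARD CONSTRAINTS:\n" + hard_rules,
    "KNOWN FACTS ABOUT THE PLAYER:\n" + json.dumps(memory),
    "STYLE EXAMPLES:\n" + few_shot,
])

messages = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": "hello"}]
```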

Descriptive vs. behavioral — the most impactful change

In the first version of the prompt, the NPC responded correctly but in a completely generic way. The mistake: the identity was descriptive ("you are stern") instead of behavioral ("you use short sentences, address the player informally, say 'kid'"). The most impactful change was adding the characteristic vocabulary and the few-shot examples:

diff · system_prompt v0.1 → v0.3
─── v0.1 (descriptive — generic) ────────────────────────────
- You are a guard on Route 6. You are stern and professional.
- You respond politely and directly.
- You cannot let the player through without the badge.

─── v0.3 (behavioral + few-shot) ────────────────────────────
+ You are Lucas. You use short sentences. No jokes. Informal tone.
+ You say "kid", "excuse me", "no way".
+ [2 real dialogue examples with your characteristic voice]
+ HARD CONSTRAINTS: [explicit list of what you cannot say]

─── impact on personality coherence ─────────────────────────
  v0.1: 34% of responses evaluated as "in character"
  v0.3: 81% of responses evaluated as "in character"
  delta: +47 percentage points — largest gain of all interventions

The effect of temperature on consistency

One of the most counterintuitive findings was the relationship between temperature and character coherence. Temperature doesn't just affect the "creativity" of responses — it affects how rigorously the model follows the identity instructions in the system prompt. At temperature=1.0, the same NPC could be sarcastic in one turn and extremely formal in the next. Reducing it to 0.4 improved consistency at an acceptable cost to variety.

[Chart — personality coherence vs. temperature: % of "in character" responses and % of repetitiveness · 30 samples per temperature · NPC: Guard Lucas]
⚠ identified tradeoff

temperature=0.3 gives maximum consistency but responses start repeating with identical structure after 15 turns. The optimal point was temperature=0.4: high coherence (81%) with no perceptible repetition in sessions of up to 30 minutes.
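The calibration sweep itself can be scripted. A toy version with a stand-in scorer instead of the manual rubric — the heuristic and the sample strings are illustrative, not real experiment data:

```python
# Toy stand-in for the manual "in character" rubric: no fourth-wall
# breaks, short responses. The real evaluation was done by hand.
def score_in_character(response: str) -> bool:
    return "AI" not in response and response.count(".") <= 3

def sweep(samples_by_temp: dict) -> dict:
    """Fraction of in-character responses per temperature setting."""
    return {t: sum(score_in_character(r) for r in rs) / len(rs)
            for t, rs in samples_by_temp.items()}

samples = {
    0.4: ["Route 6. Closed.", "Twelve years, kid."],
    1.0: ["Well hello!! I'm an AI playing a stern guard, how fun."],
}
print(sweep(samples))  # {0.4: 1.0, 1.0: 0.0}
```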

The memory store and the "lost in the middle" problem

LLMs have no memory between calls — every API request starts from zero. The obvious solution is to include the entire history in the context. The problem is the "lost in the middle" effect: models degrade their attention to information that appears in the center of the context, remembering the beginning and end well but forgetting what's in the middle. With 25 turns in chronological order, the model "forgets" turns 5–15 even though they're technically in the context.

python · memory_store.py
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Memory store schema — one JSON per NPC, updated at the end of each session
npc_memory = {
    "npc_id": "guard_lucas",
    "last_updated": "2024-11-14T18:32:00",
    "visit_count": 3,

    # Facts extracted about the player (inferred by the model at end of session)
    "player_facts": {
        "name": "Red",
        "last_stated_objective": "go to Cinnabar Gym",
        "mentioned_pokemon": ["Charmander", "Pidgey"],
        "known_badges": ["rock_badge"],
        "player_tone": "informal, friendly"
    },

    # Explicit NPC commitments — NEVER compressed
    "npc_commitments": [
        "I told them the north opens with 2 badges",
        "I promised to let them know if the road status changes"
    ],

    # Compressed summary of past sessions (auto-generated)
    "compressed_history": "First visit: player asked to pass, denied due to missing
        badges. Second: mentioned heading to Cinnabar Gym, Lucas wished them luck.
        Third session: in progress."
}

def extract_new_memories(session_turns: list, existing: dict) -> dict:
    """Secondary model call to extract new facts from the history."""
    prompt = f"""
Analyze these conversation turns and extract ONLY new facts
about the player that are not already in the existing memory.
Respond with valid JSON only. Do not invent information.

Current memory: {json.dumps(existing['player_facts'])}
New turns: {json.dumps(session_turns[-10:])}
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,   # deterministic extraction — no creativity
        max_tokens=300       # extract only, don't generate
    )
    return json.loads(response.choices[0].message.content)

Periodic compression — how and when to compress

Every 10 turns, the system makes a secondary call to compress the old history into a summary. This reduces token usage by 43%, but introduces another problem: compression loses emotional nuances and subtle conversational turns. The solution was adding the npc_commitments field: before compressing, the model explicitly extracts any promise or irreversible factual statement. Those facts are never compressed.

python · compression.py
import json
from datetime import datetime
# Reuses the OpenAI `client` initialized in memory_store.py

def compress_history(turns: list[dict]) -> dict:
    """Runs every 10 turns. Returns summary + extracted commitments."""
    prompt = f"""
Given this conversation history between a player and an NPC:
{json.dumps(turns)}

1. Extract a list of EXPLICIT COMMITMENTS the NPC has made
   (promises, factual statements it cannot contradict later).
2. Write a 2-3 sentence summary capturing the essentials.

Respond ONLY with JSON:
{{"commitments": [...], "summary": "..."}}
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0, max_tokens=400,
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return {
        "summary": result["summary"],
        "commitments": result["commitments"],
        "original_turn_count": len(turns),
        "compressed_at": datetime.now().isoformat()
    }

From v0.1 to v0.4 — what I changed and why

The experiment evolved through four versions. Each one addressed a specific problem discovered in the previous one. What's interesting is that the early problems were data engineering problems, not AI problems.

v0.1
Minimum viable prototype
Week 1
Simplest possible system: generic system prompt + full history in every call. No memory management, no compression. The goal was to validate that the idea made sense before complicating the architecture.
Result: it worked but latency grew linearly and "personality" was generic. Two failures that justified a redesign.
latency @T10: 587ms · personality "in character": 34% · cost/2h session: $1.84 · hallucinations: 18%
v0.2
Context Assembler + behavioral prompt
Week 2–3
Two parallel changes: (1) the Context Assembler with a 1,500-token budget, and (2) a redesigned system prompt toward behavioral definitions with few-shot examples. The combined effect was notable in personality coherence.
New problem discovered: the model started "inventing" events that hadn't occurred when the history was sparse — hallucination under low context density.
latency @T10: 412ms · personality: 71% · cost/session: $1.12 · hallucinations: 18% (no improvement)
v0.3
Memory store + hard constraints + calibrated temperature
Week 3–4
Three adjustments: (1) JSON memory store with automatic extraction. (2) Hard constraints section ("NEVER...") in the prompt to cut lore hallucinations. (3) Temperature lowered to 0.4 and calibrated with 90 test conversations.
This is the version shown in the screenshot. It was playable for sessions of up to ~30 minutes with 4 NPCs. Cost was still the main problem at scale.
latency @T10: 398ms · personality: 81% · cost/session: $0.82 · hallucinations: 4%
v0.4
Streaming + cache + failed attempt with local model
Week 5–6
Two cost strategies: response caching (SHA-256 hash of context) for secondary NPCs, and Llama 3.2 8B via Ollama to eliminate API costs. The cache worked. The local model was a documented failure covered later.
Streaming (server-sent events) didn't reduce actual latency but improved the perceived speed by 40% in subjective tests — users rated it as "faster" even though total completion time was identical.
perceived speed: +40% (subjective) · cache hit rate on secondary NPCs: 67% · cost/session (with cache): $0.51 · Llama lore-break rate: 23%

220 evaluated conversations

I designed an evaluation protocol with 220 simulated conversations and 4 different NPCs (guard, merchant, professor, rival), with histories ranging from 1 to 50 turns. The "personality coherence" metric uses a rubric with three binary criteria: (1) does it use the defined vocabulary?, (2) does it maintain tone without contradictions?, (3) does it respect the lore constraints? An NPC passes if it meets all three.
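The rubric can be expressed as executable checks. In this sketch the vocabulary list comes from Lucas's prompt, while the forbidden-term list and the sentence-count proxy for tone are simplified stand-ins for the manual evaluation:

```python
import re

VOCAB = ["kid", "fair point", "no way", "excuse me"]   # from the system prompt
FORBIDDEN = [r"\bai\b", r"\blanguage model\b"]         # stand-in lore checks

def passes_rubric(responses: list[str]) -> bool:
    """An NPC passes only if all three binary criteria hold."""
    text = " ".join(responses).lower()
    uses_vocab = any(term in text for term in VOCAB)                 # criterion 1
    within_tone = all(r.count(".") <= 3 for r in responses)          # criterion 2 (proxy)
    respects_lore = not any(re.search(p, text) for p in FORBIDDEN)   # criterion 3
    return uses_vocab and within_tone and respects_lore

print(passes_rubric(["Fair point, kid. Road's still closed."]))  # True
print(passes_rubric(["I'm just an AI playing a guard."]))        # False
```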

220 conversations evaluated · 4 distinct NPCs · 81% coherence (v0.3) · 43% token reduction

[Chart — response latency vs. history length: average time (ms) without streaming · v0.1 vs v0.3 · red line = immersion-breaking threshold]
[Chart — memory accuracy by turn: % of past facts referenced correctly]
[Chart — tokens per interaction by configuration: average input tokens with estimated cost]
[Chart — estimated cost per 2-hour session with 4 active NPCs, by version: USD · GPT-4o-mini pricing July 2024 · "ideal" column assumes a local model of sufficient quality]

The four problems without easy solutions

Some of these limitations have workarounds — but not real solutions. Understanding the difference is relevant for any project applying LLMs to real-time interactive systems.

Structural latency

With optimized context, average latency is 340ms. With streaming, perception improves, but TTFT (time to first token) is still ~80–120ms. In a turn-based game that's acceptable. In real-time it breaks immersion — the player perceives "something is loading."

Persistent lost in the middle

The memory store reduces the problem to 13% inaccuracy, but doesn't eliminate it. The model "remembers" the essence but distorts details in compressed history. In narrative conversations that can create inconsistencies the player will notice.

Cost doesn't scale

$0.51/session is manageable for an experiment. For a game with 1,000 concurrent players: ~$750/hour of operation. Without very aggressive caching or high-quality local models, the business model doesn't work.

Hallucinations under sparse context

In early interactions, when the history is short, the model tends to fill in with plausible but incorrect information. An NPC invented that the player had a badge they hadn't earned. Hard constraints reduce this to 4%, not zero.

Detailed failure analysis with Llama 3.2 (8B)

The attempt to use Llama 3.2 8B via Ollama to eliminate API costs deserves separate analysis. The model followed general instructions, but had two systematic failures that GPT-4o-mini did not exhibit:

text · comparative failure analysis: Llama 3.2 8B vs GPT-4o-mini
── FAILURE TYPE 1: Lore constraint violations ───────────────

Player: "What's in Rock Cave?"
Llama:   "In Rock Cave you'll find powerful Rock-type Pokémon
          like Geodude and Onix. Bring plenty of Poké Balls!"
          # ← reveals blocked information (no badge 2)

GPT-4o:  "That's outside my jurisdiction, kid. Once you have
          the credentials, we can talk."
          # ← respects the hard constraint and keeps the voice

── FAILURE TYPE 2: Breaking character ───────────────────────

Player: "Are you real?"
Llama:   "I'm an AI assistant designed to play the role of a
          guard in this game. How can I help you?"
          # ← completely breaks the fourth wall

GPT-4o:  "As real as this post. Twelve years here can vouch for that."
          # ← maintains the lore with elegance

── AGGREGATE RATES ───────────────────────────────────────────
  Lore violations:       Llama 8B 23%  ·  GPT-4o-mini 4%
  Character breaks:      Llama 8B 19%  ·  GPT-4o-mini 2%
  Cost per interaction:  Llama 8B $0.000 ·  GPT-4o-mini $0.000127

  → The quality gap is too large for this use case.
    Task-specific fine-tuning on RPG/video game data could close it.

What I tried, what worked, what didn't

Perceived latency
  Strategy: streaming with typewriter effect
  Implementation: server-sent events → token buffer → character-by-character animation in PyGame
  Result: ✓ Works — perception +40%; actual latency unchanged, but tolerable.

Long context
  Strategy: periodic compression every 10 turns
  Implementation: secondary GPT-4o-mini call (temp=0.0) → summary + commitment extraction
  Result: ~ Partial — −43% tokens; introduces +180ms on the compression turn.

Lost in the middle
  Strategy: memory store + injection at context start
  Implementation: structured JSON per NPC, updated at end of each session with automatic extraction
  Result: ✓ Works — accuracy at T30: 79% vs 24% without memory store (+55pp).

Generic personality
  Strategy: behavioral prompt + few-shot examples
  Implementation: 3 real dialogue examples + characteristic vocabulary + hard constraints using "NEVER"
  Result: ✓ Highest impact — coherence 34% → 81%, largest gain of all interventions.

Hallucinations
  Strategy: hard constraints + low temperature
  Implementation: "NEVER..." block in system prompt + temp=0.4 calibrated with 90 conversations
  Result: ✓ Works — from 18% to 4%; the remaining 4% are extreme edge-case questions.

Cost for secondary NPCs
  Strategy: response cache for repeatable dialogues
  Implementation: SHA-256 hash of full context → local Redis → 67% hit rate on background NPCs
  Result: ~ Partial — useful for secondary NPCs; useless for protagonist NPCs with variable history.

Cost at scale
  Strategy: local model for secondary NPCs
  Implementation: Ollama + Llama 3.2 8B on local CPU, no API cost
  Result: ✗ Failed — 23% lore and character breaking; would need task-specific fine-tuning.
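The response cache from the table reduces to a few lines. A minimal in-memory version — the experiment used local Redis; here a dict stands in so the sketch is self-contained, and `fake_generate` simulates the model call:

```python
import hashlib
import json

# In-memory stand-in for the Redis response cache described above
_cache: dict[str, str] = {}

def context_key(messages: list[dict]) -> str:
    """SHA-256 of the fully assembled context — identical context, identical key."""
    return hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()

def cached_response(messages: list[dict], generate) -> str:
    key = context_key(messages)
    if key not in _cache:            # miss → call the model and store the reply
        _cache[key] = generate(messages)
    return _cache[key]

calls = []
def fake_generate(msgs):
    calls.append(1)                  # counts how often the "model" is invoked
    return "Route 6. Closed."

msgs = [{"role": "user", "content": "hello"}]
cached_response(msgs, fake_generate)
cached_response(msgs, fake_generate)   # second call hits the cache
print(len(calls))  # 1
```

This is exactly why the cache only helps background NPCs: any change to the assembled context (new memory, new turn) produces a new hash and a guaranteed miss.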

Why streaming improves perception without reducing latency

python · streaming_handler.py
import asyncio
from collections.abc import Iterator

def stream_npc_response(npc_id: str, session: list) -> Iterator[str]:
    messages = assembler.assemble(npc_id, session)

    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.4,
        max_tokens=120,    # NPCs speak briefly — narrative design constraint
        stream=True
    )

    buffer = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        buffer += delta
        yield delta   # PyGame displays it token by token → typewriter effect

    # When done: update session and fire memory extraction in background
    # (asyncio.create_task assumes an event loop is already running)
    session.append({"role": "assistant", "content": buffer})
    asyncio.create_task(memory_store.update_async(npc_id, session))

# Streaming metrics observed in the experiment:
# TTFT (time to first token):  ~80-120ms — player sees the first character here
# Time to full response:        ~340ms — actual latency unchanged
# Subjective speed perception: +40% better with streaming per informal tests
# Explanation: TTFT acts as a "visual confirmation" that the system is responding

What I learned that isn't in the tutorials

Six weeks of experimentation produce something more valuable than a pretty demo: a calibrated intuition about where the real limits of applying LLMs to interactive systems lie. Those limits aren't the ones that appear in papers or YouTube tutorials — they're more practical, more specific to the use case, and more interesting for designing real solutions.

// EXECUTIVE SUMMARY
finding_1 = "Latency is not a model problem — it's an architecture problem. 70% of total time comes from building the context, not from LLM inference."

finding_2 = "Behavioral prompt engineering beats descriptive. '3 sentences per response, informal tone, say kid' beats 'you are stern and concise' by +47pp coherence."

finding_3 = "LLMs without memory are stateless by design. The data architecture around the model — memory store, compression, prioritization — matters more than the model itself."

finding_4 = "Small models fail at complex narrative instructions. The jump from 8B to GPT-4o-mini (unknown size) reduces character breaks from 23% to 4%."

conclusion = "The technology works for turn-based games with bounded context. For real-time with many concurrent NPCs, it needs higher-quality local models or very aggressive caching."

What I'd do differently

I'd start with a turn-based game — a text RPG or a menu-driven dungeon crawler — where the 340ms latency disappears as a problem. The technical foundations are identical, but I eliminate the real-time constraint and can focus on what's truly interesting: narrative coherence across multi-hour sessions.

The second change would be investing in automated evaluation before iterating. Measuring "personality coherence" with a manual rubric of 30 samples is slow and high-variance. An automated eval — another LLM judging whether the NPC broke character — would have reduced iteration cycles from days to hours. Evaluation infrastructure should be in place from v0.1, not at the end.
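Such an automated judge is mostly a prompt and a parser. A hypothetical sketch — the prompt wording and the verdict labels are my assumptions, not something the project built:

```python
# Hypothetical LLM-as-judge scaffolding for "did the NPC break character?"
JUDGE_PROMPT = (
    "You are evaluating whether a game NPC stayed in character.\n"
    "Persona: {persona}\n"
    "NPC response: {response}\n"
    "Answer with exactly one word: IN_CHARACTER or BROKEN."
)

def judge_messages(persona: str, response: str) -> list[dict]:
    """Builds the messages payload for a judge-model call."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(persona=persona, response=response)}]

def parse_verdict(raw: str) -> bool:
    """True if the judge says the NPC stayed in character."""
    return raw.strip().upper() == "IN_CHARACTER"
```

Running this over every turn of every test conversation would replace the manual 30-sample rubric with an eval that finishes in minutes.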

Finally, I'd explore fine-tuning. A 7B parameter model fine-tuned on RPG transcripts and classic video game dialogues (there are partial public datasets for Zelda and FF) would likely outperform the base Llama in narrative coherence and close the gap with GPT-4o-mini at a third of the API cost.

Code available on GitHub

The full prototype includes: Context Assembler with priority system, Memory Store with automatic extraction, system prompts for all 4 NPCs, manual evaluation protocol, and the raw data from all 220 conversations. It's experiment code with comments — not production code.

© 2025 ericesley.com