What happens when you try to give persistent memory and coherent personality to NPCs in a Pokémon-style game using an LLM? This is the technical record of what worked, what didn't, and why the problem is more interesting than it seems.
In any Game Boy Pokémon or Zelda game, NPCs have a behavior that, seen from 2024, seems almost absurd: every time you encounter them, they deliver the exact same text, regardless of what happened before. The guard who blocked your way north does so again even after you've already earned the badge. No state, no memory, no personality.
This wasn't a design flaw — it was a technical limitation of the 90s: a Game Boy cartridge ROM had 256KB or 1MB of space. There was no budget for storing complex conversational states. In 2024, with access to LLM APIs at fractions of a cent per call, the natural question is: how much of this problem can we solve?
With an LLM as a dialogue engine, an external memory system, and a context assembler managing the token budget, it should be possible for NPCs to hold contextually coherent conversations throughout long game sessions. The key technical challenge lies not in the model itself — it lies in the data architecture built around it.
Remembering specific facts from past conversations: "last time you told me you were heading to the gym." Different from just recalling "what happened in general."
Maintaining the same tone, vocabulary and attitude across sessions. A stern guard can't become sarcastic without a narrative reason.
Knowing what it can and cannot reveal depending on the game state. No spoiling the final boss if the player hasn't reached that point in the story.
Staying within the game universe. The NPC cannot mention that "it's an AI" or refer to anything outside the established lore canon.
↑ Prototype in PyGame. The guard correctly references the objective mentioned in the previous turn.
The design starts from a clear constraint: every API call must cost as little as possible without sacrificing coherence. That forces you to think about the architecture before writing a single line of code. The system has three independent modules that communicate in sequence each time the player interacts with an NPC.
The idea of "connecting an NPC to an API" seems trivial. The real problem surfaces when the history grows: without context management, the cost and latency grow linearly with the number of turns. The Context Assembler solves this with a priority queue that guarantees the most important blocks always fit into the context, and the least important ones are dropped first:
class ContextAssembler: TOKEN_BUDGET = 1500 # maximum input tokens per call # Lower number = higher priority = never trimmed PRIORITIES = { "system_persona": 0, # NPC identity — mandatory "hard_rules": 1, # lore constraints — mandatory "key_memories": 2, # key facts from the memory store "recent_turns": 3, # last N turns of the session "old_turns": 4, # older turns (dropped first) "few_shot": 5, # style examples — optional } def assemble(self, npc_id: str, session: list) -> list: blocks = self._load_blocks(npc_id, session) return self._fit_to_budget(blocks) def _fit_to_budget(self, blocks: list) -> list: used, result = 0, [] for block in sorted(blocks, key=lambda b: self.PRIORITIES[b["type"]]): cost = self._count_tokens(block["content"]) if used + cost <= self.TOKEN_BUDGET: result.append(block) used += cost elif block["type"] in ("system_persona", "hard_rules"): # Critical blocks — if they don't fit, there's a design bug raise ContextOverflowError(f"critical block too large: {block['type']}") return result def _count_tokens(self, text: str) -> int: # Approximation: ~4 characters per token (valid for English/Spanish text) # In production I'd use tiktoken for an exact count return len(text) // 4
The result is that, regardless of how many turns the conversation has had, the API call never exceeds 1,500 input tokens. Older turns are automatically dropped, but the important facts will have already been extracted to the memory store before that happens.
The system prompt is the NPC's "character sheet." Its design is the factor that most affects coherence — more so than the model or the temperature. I arrived at this structure after three failed iterations I described as "too generic." The prompt is divided into four functional blocks with distinct purposes and token budgets:
In the first version of the prompt, the NPC responded correctly but in a completely generic way. The mistake: the identity was descriptive ("you are stern") instead of behavioral ("you use short sentences, address the player informally, say 'kid'"). The most impactful change was adding the characteristic vocabulary and the few-shot examples:
─── v0.1 (descriptive — generic) ──────────────────────────── - You are a guard on Route 6. You are stern and professional. - You respond politely and directly. - You cannot let the player through without the badge. ─── v0.3 (behavioral + few-shot) ──────────────────────────── + You are Lucas. You use short sentences. No jokes. Informal tone. + You say "kid", "excuse me", "no way". + [2 real dialogue examples with your characteristic voice] + HARD CONSTRAINTS: [explicit list of what you cannot say] ─── impact on personality coherence ───────────────────────── v0.1: 34% of responses evaluated as "in character" v0.3: 81% of responses evaluated as "in character" delta: +47 percentage points — largest gain of all interventions
One of the most counterintuitive findings was the relationship between temperature and character coherence. Temperature doesn't just affect the "creativity" of responses — it affects whether the model follows the identity instructions in the system prompt with rigor or not. At temperature=1.0, the same NPC could be sarcastic in one turn and extremely formal in the next. Reducing to 0.4 improved consistency at an acceptable cost to variety.
temperature=0.3 gives maximum consistency but responses start repeating with identical structure after 15 turns. The optimal point was temperature=0.4: high coherence (81%) with no perceptible repetition in sessions of up to 30 minutes.
LLMs have no memory between calls — every API request starts from zero. The obvious solution is to include the entire history in the context. The problem is the "lost in the middle" effect: models degrade their attention to information that appears in the center of the context, remembering the beginning and end well but forgetting what's in the middle. With 25 turns in chronological order, the model "forgets" turns 5–15 even though they're technically in the context.
# Memory store schema — one JSON per NPC, updated at the end of each session npc_memory = { "npc_id": "guard_lucas", "last_updated": "2024-11-14T18:32:00", "visit_count": 3, # Facts extracted about the player (inferred by the model at end of session) "player_facts": { "name": "Red", "last_stated_objective": "go to Cinnabar Gym", "mentioned_pokemon": ["Charmander", "Pidgey"], "known_badges": ["rock_badge"], "player_tone": "informal, friendly" }, # Explicit NPC commitments — NEVER compressed "npc_commitments": [ "I told them the north opens with 2 badges", "I promised to let them know if the road status changes" ], # Compressed summary of past sessions (auto-generated) "compressed_history": "First visit: player asked to pass, denied due to missing badges. Second: mentioned heading to Cinnabar Gym, Lucas wished them luck. Third session: in progress." } def extract_new_memories(session_turns: list, existing: dict) -> dict: """Secondary model call to extract new facts from the history.""" prompt = f""" Analyze these conversation turns and extract ONLY new facts about the player that are not already in the existing memory. Respond with valid JSON only. Do not invent information. Current memory: {json.dumps(existing['player_facts'])} New turns: {json.dumps(session_turns[-10:])} """ response = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}], temperature=0.0, # deterministic extraction — no creativity max_tokens=300 # extract only, don't generate ) return json.loads(response.choices[0].message.content)
Every 10 turns, the system makes a secondary call to compress the old history into a summary. This reduces token usage by 43%, but introduces another problem: compression loses emotional nuances and subtle conversational turns. The solution was adding the npc_commitments field: before compressing, the model explicitly extracts any promise or irreversible factual statement. Those facts are never compressed.
def compress_history(turns: list[dict]) -> dict: """Runs every 10 turns. Returns summary + extracted commitments.""" prompt = f""" Given this conversation history between a player and an NPC: {json.dumps(turns)} 1. Extract a list of EXPLICIT COMMITMENTS the NPC has made (promises, factual statements it cannot contradict later). 2. Write a 2-3 sentence summary capturing the essentials. Respond ONLY with JSON: {{"commitments": [...], "summary": "..."}} """ response = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}], temperature=0.0, max_tokens=400, response_format={"type": "json_object"} ) result = json.loads(response.choices[0].message.content) return { "summary": result["summary"], "commitments": result["commitments"], "original_turn_count": len(turns), "compressed_at": datetime.now().isoformat() }
The experiment evolved through four versions. Each one addressed a specific problem discovered in the previous one. What's interesting is that the early problems were data engineering problems, not AI problems.
I designed an evaluation protocol with 220 simulated conversations and 4 different NPCs (guard, merchant, professor, rival), with histories ranging from 1 to 50 turns. The "personality coherence" metric uses a rubric with three binary criteria: (1) does it use the defined vocabulary?, (2) does it maintain tone without contradictions?, (3) does it respect the lore constraints? An NPC passes if it meets all three.
Some of these limitations have workarounds — but not real solutions. Understanding the difference is relevant for any project applying LLMs to real-time interactive systems.
With optimized context, average latency is 340ms. With streaming, perception improves, but TTFT (time to first token) is still ~80–120ms. In a turn-based game that's acceptable. In real-time it breaks immersion — the player perceives "something is loading."
The memory store reduces the problem to 13% inaccuracy, but doesn't eliminate it. The model "remembers" the essence but distorts details in compressed history. In narrative conversations that can create inconsistencies the player will notice.
$0.51/session is manageable for an experiment. For a game with 1,000 concurrent players: ~$750/hour of operation. Without very aggressive caching or high-quality local models, the business model doesn't work.
In early interactions, when the history is short, the model tends to fill in with plausible but incorrect information. An NPC invented that the player had a badge they hadn't earned. Hard constraints reduce this to 4%, not zero.
The attempt to use Llama 3.2 8B via Ollama to eliminate API costs deserves separate analysis. The model followed general instructions, but had two systematic failures that GPT-4o-mini did not exhibit:
── FAILURE TYPE 1: Lore constraint violations ─────────────── Player: "What's in Rock Cave?" Llama: "In Rock Cave you'll find powerful Rock-type Pokémon like Geodude and Onix. Bring plenty of Poké Balls!" # ← reveals blocked information (no badge 2) GPT-4o: "That's outside my jurisdiction, kid. Once you have the credentials, we can talk." # ← respects the hard constraint and keeps the voice ── FAILURE TYPE 2: Breaking character ─────────────────────── Player: "Are you real?" Llama: "I'm an AI assistant designed to play the role of a guard in this game. How can I help you?" # ← completely breaks the fourth wall GPT-4o: "As real as this post. Twelve years here can vouch for that." # ← maintains the lore with elegance ── AGGREGATE RATES ─────────────────────────────────────────── Lore violations: Llama 8B 23% · GPT-4o-mini 4% Character breaks: Llama 8B 19% · GPT-4o-mini 2% Cost per interaction: Llama 8B $0.000 · GPT-4o-mini $0.000127 → The quality gap is too large for this use case. Task-specific fine-tuning on RPG/video game data could close it.
| Problem | Strategy | Technical implementation | Result |
|---|---|---|---|
| Perceived latency | Streaming with typewriter effect | Server-sent events → token buffer → character-by-character animation in PyGame | ✓ Works Perception +40%. Actual latency unchanged, but tolerable. |
| Long context | Periodic compression every 10 turns | Secondary GPT-4o-mini call (temp=0.0) → summary + commitment extraction | ~ Partial −43% tokens. Introduces +180ms on the compression turn. |
| Lost in the middle | Memory store + injection at context start | Structured JSON per NPC, updated at end of each session with automatic extraction | ✓ Works Accuracy at T30: 79% vs 24% without memory store. +55pp improvement. |
| Generic personality | Behavioral prompt + few-shot examples | 3 real dialogue examples + characteristic vocabulary + hard constraints using "NEVER" | ✓ Highest impact Coherence: 34% → 81%. Largest gain of all interventions. |
| Hallucinations | Hard constraints + low temperature | "NEVER..." block in system prompt + temp=0.4 calibrated with 90 conversations | ✓ Works From 18% to 4%. The remaining 4% are extreme edge-case questions. |
| Cost for secondary NPCs | Response cache for repeatable dialogues | SHA-256 hash of full context → local Redis → 67% hit rate on background NPCs | ~ Partial Useful for secondary NPCs. Useless for protagonist NPCs with variable history. |
| Cost at scale | Local model for secondary NPCs | Ollama + Llama 3.2 8B on local CPU · no API cost | ✗ Failed 23% lore and character breaking. Would need task-specific fine-tuning. |
def stream_npc_response(npc_id: str, session: list) -> Iterator[str]: messages = assembler.assemble(npc_id, session) stream = client.chat.completions.create( model="gpt-4o-mini", messages=messages, temperature=0.4, max_tokens=120, # NPCs speak briefly — narrative design constraint stream=True ) buffer = "" for chunk in stream: delta = chunk.choices[0].delta.content or "" buffer += delta yield delta # PyGame displays it token by token → typewriter effect # When done: update session and fire memory extraction in background session.append({"role": "assistant", "content": buffer}) asyncio.create_task(memory_store.update_async(npc_id, session)) # Streaming metrics observed in the experiment: # TTFT (time to first token): ~80-120ms — player sees the first character here # Time to full response: ~340ms — actual latency unchanged # Subjective speed perception: +40% better with streaming per informal tests # Explanation: TTFT acts as a "visual confirmation" that the system is responding
Six weeks of experimentation produce something more valuable than a pretty demo: a calibrated intuition about where the real limits of applying LLMs to interactive systems lie. Those limits aren't the ones that appear in papers or YouTube tutorials — they're more practical, more specific to the use case, and more interesting for designing real solutions.
I'd start with a turn-based game — a text RPG or a menu-driven dungeon crawler — where the 340ms latency disappears as a problem. The technical foundations are identical, but I eliminate the real-time constraint and can focus on what's truly interesting: narrative coherence across multi-hour sessions.
The second change would be investing in automated evaluation before iterating. Measuring "personality coherence" with a manual rubric of 30 samples is slow and high-variance. An automated eval — another LLM judging whether the NPC broke character — would have reduced iteration cycles from days to hours. Evaluation infrastructure should be in place from v0.1, not at the end.
Finally, I'd explore fine-tuning. A 7B parameter model fine-tuned on RPG transcripts and classic video game dialogues (there are partial public datasets for Zelda and FF) would likely outperform the base Llama in narrative coherence and close the gap with GPT-4o-mini at a third of the API cost.
The full prototype includes: Context Assembler with priority system, Memory Store with automatic extraction, system prompts for all 4 NPCs, manual evaluation protocol, and the raw data from all 220 conversations. It's experiment code with comments — not production code.