Demystify AI

2026-05-03 · 9 min read

Tokens, context windows, attention — model mechanics without math

A working mental model for the path from prompt to returned text: tokens, context windows, and attention without a single equation.

Tags: demystify, primer, tokens, context-window, attention, model-mechanics

Tuesday Morning Ate Two Days

A platform team came in Tuesday morning and spent the first hour pasting their entire incident-response runbook into a chat session, asking the assistant to draft a postmortem template against it. Good session. Useful answers. They left it open.

Wednesday afternoon, halfway through a long debugging conversation in the same window, the lead asked the model to “use the runbook section on database failover from earlier” and got back a generic answer with invented runbook steps. Nobody noticed for a day. The postmortem shipped with fabricated procedure references. Two days of cleanup followed.

The conversation still looked intact on screen. The model had not gotten worse. The runbook had simply scrolled out of the context window, and nothing in the chat UI disclosed the drop. Most teams hit this exact failure at least once before someone explains the mechanism under the hood.

Picture a sliding whiteboard

Picture the model working at a whiteboard with a fixed width. Everything currently visible — prompt, system instructions, conversation history, pasted documents — has to fit on the board. When new content arrives and the board fills up, the oldest content gets erased from the left to make room.

[ system prompt | early conversation | mid conversation | YOUR LATEST MESSAGE ]
                 ^ erased first when the board fills

The model only responds from the whiteboard content currently visible. It has no memory of erased content. It does not know erasing happened.

This picture explains most long-session failures. Stop here and most odd behavior in long AI chats becomes legible.
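The erasing step can be sketched in a few lines. This is a minimal illustration of one possible policy (keep the system prompt, drop the oldest turns first), not how any particular product works. The names `fit_to_window` and `word_count` are hypothetical, and the word counter is a stand-in for a real tokenizer.

```python
from collections import deque

def fit_to_window(system_prompt, turns, limit, count_tokens):
    """Drop the oldest turns until everything fits in `limit` tokens.

    Returns (kept_turns, dropped_turns). The system prompt is always kept,
    which is an assumption of this sketch, not a universal rule.
    """
    budget = limit - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the window")
    kept = deque()
    used = 0
    # Walk from newest to oldest so the latest message is kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # everything older than this turn is erased
        kept.appendleft(turn)
        used += cost
    dropped = turns[: len(turns) - len(kept)]
    return list(kept), dropped

def word_count(text):
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())

turns = [
    "runbook: failover is blocked during business hours",
    "draft a postmortem template",
    "here is a long debugging log " + "x " * 20,
    "use the runbook section on database failover",
]
kept, dropped = fit_to_window("you are a helpful assistant", turns, 40, word_count)
print("dropped:", dropped)
```

The runbook turn lands in `dropped`, and nothing in `kept` records that it ever existed. That is the Tuesday-morning failure in miniature.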

Tokens, Not Characters; Attention Weighs, Not Copies

The whiteboard is measured in tokens, not characters or words. Tokens are the chunks of text the model actually processes, and in English one token averages roughly three-quarters of a word. “Postmortem” might be one token; “irreproducibility” might be four. A 200,000-token context window sounds enormous, and it is, but a single pasted log file can burn 30,000 tokens in one shot.
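For back-of-envelope sizing, the three-quarters rule can be turned into a rough estimator. This is a heuristic sketch, not a tokenizer: the function names are hypothetical, the 0.75 ratio is the rule of thumb from the paragraph above, and logs or code often tokenize worse than plain prose.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose; an assumption, not a tokenizer

def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)

def share_of_window(text, window_tokens=200_000):
    """Fraction of the context window this text would consume."""
    return estimate_tokens(text) / window_tokens

log_file = "ERROR connection reset by peer " * 4500  # 22,500 words
print(estimate_tokens(log_file))            # 30000
print(f"{share_of_window(log_file):.0%}")   # 15%
```

For anything that matters, swap `estimate_tokens` for the provider's own tokenizer. The estimate only says whether an input is in the right neighborhood.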

The other adjustment: generation does not read the whiteboard left-to-right. The model looks at every token on the board simultaneously and weighs how much each one matters for predicting the next token. This weighing is attention. For each new token, the model decides which earlier tokens deserve weight and which can fade.
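A toy version of the weighing looks like this. The scores below are invented for illustration; in a real model they are learned and computed across many layers. The only real ingredient is softmax, the standard function that turns raw scores into weights summing to 1.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for five tokens on the board.
tokens = ["the", "database", "failover", "step", "is"]
scores = [0.1, 2.0, 3.0, 1.0, 0.1]
weights = softmax(scores)

# Print tokens from most to least heavily weighted.
for token, weight in sorted(zip(tokens, weights), key=lambda pair: -pair[1]):
    print(f"{token:>10}  {weight:.2f}")
```

Every token gets some weight, none gets zero, and the highest-scoring token takes the largest share.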

Why the looseness is the feature

Most explanations skip the useful part: weighted attention is what makes the model useful at all.

If the model had to use every token on the board equally, long context would fail. A question about line 12 would drown in 50,000 tokens of unrelated logs. If it only used the most recent tokens, long arguments would collapse. Attention lets the model decide, per token, what matters. This mechanism is what makes it possible to ask about clause 7 of a 40-page contract.

The looseness is also why the model can’t promise it noticed something. Attention is a soft weighting, not a guaranteed read. A token can be on the whiteboard and still get under-weighted into irrelevance. “It’s in the context” is necessary but not sufficient.
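One way to see “necessary but not sufficient” is to watch a single token's share shrink as the board fills. This is a deliberately crude sketch: one key token with a fixed score competing against distractors that all score zero. The function name and the numbers are assumptions, and real models have many attention heads and layers that this ignores.

```python
import math

def weight_of_key_token(key_score, distractor_count, distractor_score=0.0):
    """Softmax weight of one token competing with `distractor_count` others."""
    key = math.exp(key_score)
    rest = distractor_count * math.exp(distractor_score)
    return key / (key + rest)

# Same token, same score, progressively more unrelated content around it.
shares = [weight_of_key_token(5.0, n) for n in (10, 1_000, 50_000)]
for n, share in zip((10, 1_000, 50_000), shares):
    print(f"{n:>6} distractors -> weight {share:.4f}")
```

The token never leaves the board and its weight never reaches zero, but its share keeps falling as unrelated content piles up.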

Position matters, and the window is hard

Two mechanical details worth carrying around.

First, position matters. Information at the very start of the context (system prompts, early instructions) and at the very end (the most recent message) tends to be used more reliably than information in the middle. Researchers have repeatedly measured this “lost in the middle” effect, and production use makes it visible. A critical instruction buried halfway through a long document faces a higher risk of being soft-ignored than the same instruction placed at the top or bottom.

Second, the window is hard, not soft. When input exceeds the context limit, something has to give. Some tools silently truncate the oldest messages. Some summarize older history into a compressed note. Some return an error. The behavior depends entirely on the wrapper around the model (ChatGPT, Claude.ai, Copilot, an internal RAG app), not the model itself. Two products on the same underlying model can behave completely differently when the window fills, and few tell the user clearly when content has been dropped.
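The three wrapper behaviors can be sketched side by side. Everything here is hypothetical: `OverflowPolicy`, `apply_policy`, and `note_reserve` are illustrative names, and the “summary” is a placeholder string where a real wrapper would call a model to compress the dropped history.

```python
from enum import Enum

class OverflowPolicy(Enum):
    TRUNCATE = "truncate"
    SUMMARIZE = "summarize"
    ERROR = "error"

class ContextOverflow(Exception):
    """Raised when the input does not fit and the policy is ERROR."""

def apply_policy(messages, limit, policy, count_tokens, note_reserve=8):
    total = sum(count_tokens(m) for m in messages)
    if total <= limit:
        return list(messages)  # everything fits; no policy needed
    if policy is OverflowPolicy.ERROR:
        raise ContextOverflow(f"{total} tokens exceeds limit of {limit}")
    # A summary note needs room of its own, so reserve a few tokens for it.
    budget = limit - note_reserve if policy is OverflowPolicy.SUMMARIZE else limit
    kept = list(messages)
    dropped = 0
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # oldest message goes first
        dropped += 1
    if policy is OverflowPolicy.SUMMARIZE:
        # Placeholder only: a real wrapper would generate an actual summary.
        kept.insert(0, f"[summary of {dropped} earlier messages]")
    return kept

def count_words(text):
    # Stand-in for a real tokenizer.
    return len(text.split())

# Five messages of ten "tokens" each against a 35-token window.
messages = [f"message {i} " + "word " * 8 for i in range(5)]
truncated = apply_policy(messages, 35, OverflowPolicy.TRUNCATE, count_words)
summarized = apply_policy(messages, 35, OverflowPolicy.SUMMARIZE, count_words)
print(len(truncated), "messages kept by truncation")
print(summarized[0])
```

Only the `ERROR` branch tells the caller anything. The other two return a shorter list that looks like a normal conversation, which is why the drop goes unnoticed.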

How this fails in the wild

Silent truncation. The Tuesday-morning scenario. Long session, original context scrolled out, model confidently answers from nothing. Check for it by asking whether the task depends on information much earlier in the session and whether the only evidence is model memory.

Middle blindness. A long document contains the key constraint on page 6 of 14. The model gives a fluent answer and violates the constraint. Spot it by re-pasting the constraint near the question instead of relying on “in there somewhere.”

Token sticker shock. A small-looking PDF turns into 80,000 tokens because of OCR noise or repeated headers, and the budget disappears quietly. Spot it by watching cost or latency spike on modest-looking inputs.

Five things to do Monday morning

  1. Treat long chat sessions as suspect. If a conversation has run more than an hour or covers more than one major topic, start a fresh session for the next task and re-paste only what matters.
  2. Put critical instructions at the top or bottom of long inputs. Never in the middle. The model’s attention has known geography.
  3. Re-state the constraint near the question. “Given the runbook above, with failover blocked during business hours, draft…” beats trusting the model to find it.
  4. Know the tool’s truncation behavior. Ask vendors directly: what happens when the context fills? Silent drop, summarization, or error? The answer changes product use.
  5. Measure tokens, not characters, when sizing inputs. Most providers expose a tokenizer. Use it before architecting document pipelines at scale.
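Items 2, 3, and 5 can be combined into a small prompt-assembly helper. This is a sketch under stated assumptions: `build_prompt` and `check_budget` are hypothetical names, the 20% headroom for the reply is an arbitrary illustrative choice, and the word counter stands in for the provider's tokenizer.

```python
def build_prompt(constraint, document, question):
    """Place the constraint at the top and restate it next to the question."""
    return "\n\n".join([
        f"Constraint: {constraint}",
        document,
        f"Given the document above, with this constraint: {constraint}",
        question,
    ])

def check_budget(prompt, limit, count_tokens, headroom=0.2):
    """Return tokens used, or raise if less than `headroom` of the window is left."""
    used = count_tokens(prompt)
    allowed = int(limit * (1 - headroom))
    if used > allowed:
        raise ValueError(f"prompt uses {used} tokens; budget is {allowed}")
    return used

def count_words(text):
    # Stand-in for a real tokenizer; swap in the provider's own.
    return len(text.split())

prompt = build_prompt(
    constraint="failover blocked during business hours",
    document="RUNBOOK TEXT",
    question="Draft the postmortem.",
)
used = check_budget(prompt, limit=100, count_tokens=count_words)
print(prompt)
print("tokens used:", used)
```

The constraint appears twice on purpose: once where the model reads first and once right beside the question.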

Worth reading next

  • Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023). The empirical paper on how answer accuracy changes with the position of the relevant information in the context. Readable, with clear charts. arxiv.org/abs/2307.03172
  • Stephen Wolfram, What Is ChatGPT Doing… and Why Does It Work? (2023). The accessible long-form explainer walks through tokens and attention without requiring linear algebra. writings.stephenwolfram.com

The whiteboard model gives AI tooling conversations a durable picture. Once the board filling up becomes visible, most weird behavior stops looking weird.

Next in the Demystify AI series: temperature, sampling, and why the same prompt gives different answers — the dial almost nobody explains.

Axioms touched

Lighter touch than the Learn series — primer pieces don't usually lean heavily on the axiom catalog, but where they do it's noted.