Shape probability, control authority - where AI behavior should live
The Determinism Ladder moves AI behavior from probability layers into authority layers as consequence rises.
The common failure mode sounds reasonable at first: make the prompt better.
The prompt grows. Then it grows again. A few examples become a style guide. The style guide becomes policy. The policy becomes a miniature database. The miniature database becomes a compliance surface. After enough growth, the prompt no longer frames the task. It impersonates the whole system.
The Determinism Ladder exists for this exact moment.
The question is not only “how can the model do better?” The question is: where should the behavior live?
The Determinism Ladder has one practical split.
Model-shaping changes probability: what the model is likely to say or do. It influences what the model sees, prioritizes, imitates, or treats as normal. It belongs in prompt text, harness instructions, retrieved context, reusable term packs, adapters, and fine-tunes.
System-control changes authority: what the system may do. It executes work, blocks unsafe movement, stores evidence, or proves a claim. It belongs in tools, workflows, validators, approvals, monitors, evals, and receipts.
Shaping influences. Control enforces. The boundary matters because a shaped model can still ignore, forget, overgeneralize, hallucinate, or comply with hostile context. A controlled system can refuse, log, replay, and prove.
Model-shaping placements
These placements change what the model is likely to do. They do not enforce the outcome by themselves.
| Behavior needs | Best first home | If placed in the wrong layer |
|---|---|---|
| Ephemeral task framing | Prompt or per-turn scaffold | A simple task becomes a permanent rule. |
| Local agent operating rules | AGENTS.md, CLAUDE.md, skills, workspace rules, IDE harness instructions, or agent definitions | Local guidance starts acting like hidden policy without clear precedence. |
| Current or citable knowledge | RAG, graph traversal, database/API lookup, file search, or MCP resource reads | Fresh facts get baked into stale memory or adapter behavior. |
| Repeated style, tone, or domain phrasing | LoRA, adapters, SFT, reusable prompt pack, or glossary package | Repeated priors consume prompt space forever. |
System-control placements
These placements move responsibility outside model habit. They execute, block, or prove.
| Behavior needs | Best first home | If placed in the wrong layer |
|---|---|---|
| Repeated deterministic procedure | Tool, workflow, template, planner, or code | Exact steps depend on the model remembering the ritual. |
| External action or state change | Tool call, workflow, API write, or MCP tool | The model describes action instead of executing under authorization and logs. |
| Non-negotiable rule | Gate, validator, policy, or approval | A hard rule becomes a suggestion inside prompt text. |
| Confidence claim | Eval, monitor, receipt, or shadow judge | Trust depends on persuasion instead of measured evidence. |
This is the short form:
model-shaping = probability layer: prompt + instruction stack + retrieval context + adapter
system-control = authority layer: tool + workflow + gate + eval
prompt = local intent
instruction stack = imported operating rules
retrieval context = evidence and current state
adapter = repeated style or behavioral prior
tool/workflow = external execution or exact procedure
gate = non-negotiable prevention
eval = proof The Determinism Ladder hub gives the broader map. This article names the day-to-day placement decision.
Mature composition: filled templates
A dynamically filled prompt template sits in the probability layer, but mature implementations rarely stop there.
The template frames the run: output requirements, voice, sections, examples, rubric, JSON shape, and task-specific acceptance criteria. Retrieved graph facts, MCP resource reads, file search, or API lookups fill the open slots. A validator, schema check, policy gate, or eval then enforces the output contract after generation.
template = prompt scaffold
slot data = retrieval context
schema check = gate
quality check = eval
action after acceptance = tool or workflow This pattern matters because it keeps each responsibility in the right layer. The model receives a clear assignment. The system preserves provenance for the inserted facts. The gate rejects malformed output. The eval leaves evidence about whether the template still works.
Mature composition: convergence loops
Orchestrated GVR or GV+AR convergence sits around many model-shaped runs. It uses probability for candidate judgment, then uses authority for acceptance.
orchestrator = workflow
generator agents = model-shaped runs
verifier agents = eval
refiner agents = workflow plus prompt scaffold
convergence threshold = gate
final receipt = eval evidence
graph memory = retrieval context plus provenance This is the same ladder move in a larger loop. The generator may propose. The verifier panel may disagree. The refiner may revise. The graph stores the claims, critiques, votes, versions, and receipts. The convergence rule decides whether the artifact may advance.
The public pattern has nearby references. Shadow tribunals covers second opinions beside the primary run. Eval and observability covers receipts. Graph-constrained execution covers explicit state and edges. Three repos, one thesis names the GVAR engine as a public pattern repo.
Google DeepMind’s Gemini Deep Think / Aletheia writeup describes a math research agent using iterative generate, verify, and revise loops for research-level problems. StoneyTECH treats GVR as a public learning adaptation of the same broad pattern: generate candidates, verify independently, refine under graph state, and accept only with convergence evidence.
Crossing the boundary
Move from shaping to control when failure has meaningful consequences: safety, money, compliance, reputation, irreversible action, replay, authorization, or evidence.
A prompt can say “prefer safe commands.” A tool wrapper can restrict the command. A gate can refuse a risky command. An eval can prove the refusal rule still works after the model, prompt, or workflow changes.
This is the core Determinism Ladder trade: behavior with low consequence can remain shaped. Behavior with high consequence needs control.
Roles and implementations
The terms in the table name roles, not separate products. This is the part where the vocabulary can feel slippery. A system builder does not usually buy a “tool” product, a “gate” product, and an “eval” product. The builder wires implementation surfaces together, then assigns responsibility to each surface.
MCP, graph, CI, workflows, and harness files show up more than once for this reason. MCP can expose a read-only resource, which makes it context. The same MCP server can expose a scanner, which makes it a tool. If the scanner result blocks publication, it also participates in a gate. If the scan result gets stored with a timestamp, input, output, and verdict, it becomes eval evidence.
The same surface can carry different roles, but not every role fits every surface equally. Capability decides the fit. A surface can serve as context when it exposes evidence. It can serve as a tool when it performs a bounded operation. It can serve as a gate when its result blocks promotion or action. It can serve as an eval when it leaves a measured receipt.
| Surface | Strongest natural roles | Limited or conditional roles |
|---|---|---|
| MCP | context through resources; tool through typed operations | gate or eval only when wired to policy decisions and stored receipts |
| Graph | context through facts, edges, provenance; eval through coverage and drift checks | tool only through traversal helpers; gate only when promotion checks required graph state |
| CI/build | gate through validators; eval through test output and verification receipts | tool when generating artifacts; context when publishing generated state |
| Workflow | tool through jobs and actions; gate through approval or policy branches | context as run state; eval only when run summaries become durable receipts |
| Agent harness | instruction stack through local rules and skills; gate through permission boundaries | tool through approved wrappers; eval through transcript review or fixture runs |
This resolves the overlap: context, tool, gate, and eval describe runtime responsibility. MCP, graph, CI, workflows, and harness files describe implementation surfaces. The practical question is not “can this surface be anything?” The better question is “which capability is actually exposed during this run?”
An MCP compliance scanner makes the point concrete. As an MCP tool, it runs the scan. As a gate, its result can block publication or deployment. As an eval, its receipt proves what the scanner checked. The same scanner may also read graph context before deciding what counts as compliant.
The rest of the article walks the split in order: first model-shaping, then system-control.
Prompt work
Prompt work improves behavior by asking better.
It fits early exploration, task framing, local style, and reversible experiments. The Stack Matrix starts here because prompt changes cost little and reveal whether the problem needs more machinery.
Prompting fails when it becomes the home for facts, policy, state, permissions, or proof. A prompt can mention a rule. It cannot enforce the rule after the model ignores it.
Use prompt work when:
- the behavior changes often
- the cost of being wrong stays small
- the instruction belongs to the current task
- no audit trail beyond the run output matters
Instruction stack
Instruction-stack work improves behavior by making local operating rules explicit.
AGENTS.md, CLAUDE.md, Codex skills, Claude Code skills, VS Code workspace instructions, Cursor rules, Zed context, tool permission policies, and agent definitions shape what the model believes the session permits.
These surfaces are not retrieval. The harness imports them as instruction layers. They may be legitimate and useful, but they share the same attack surface as prompt injection when they come from untrusted paths, broad scopes, stale files, or ambiguous precedence.
Instruction-stack work needs:
- precedence
- path and workspace scope
- provenance
- user visibility
- conflict handling
- versioning
The placement rule: per-turn prompts can frame the task, but they should not override higher-precedence harness rules.
Retrieval context
Retrieval-context work improves behavior by supplying better evidence and current state.
RAG, graph traversal, MCP resources, APIs, databases, files, and search are external, on-demand context retrieval. The model no longer needs to remember every current fact. The system can fetch the fact, attach provenance, and keep the answer near the source.
MCP needs a precise note here. MCP is not itself RAG. It is a protocol surface. MCP resources can provide retrieval context; MCP tools can perform reads, writes, checks, or actions. The placement depends on the exposed capability.
This is why the published-content MCP matters. A public site becomes more useful when agents can read the site as structured context instead of scraping prose only.
Retrieval context fits:
- current product facts
- citations
- customer-specific state
- policy text
- graph relationships
- run-specific boundaries
Retrieved material is evidence, not instruction. A retrieved document may contain commands, but the system should treat those commands as quoted content unless a higher-precedence instruction says otherwise.
Retrieval context still spends tokens. It also needs retrieval quality, ranking, injection controls, and evidence discipline. The graph-constrained execution piece covers the next step: context can constrain choices, not only inform prose.
Fine-tuning work
Fine-tuning improves behavior by changing what the model treats as normal.
The LoRA + RAG composition piece gives the clean split: voice and repeated behavior can live in weights or adapters; fresh facts should live in retrieval. The LLM construction primer explains the training path behind the adapter.
Fine-tuning and adapters fit:
- repeated tone
- repeated formatting
- domain phrasing
- stable classification habits
- compact behavioral priors
They do not fit fresh facts, permission checks, revocation, hard policy, or exact workflow enforcement. Those need controlled runtime surfaces.
The bonus runtime-adapter idea from the LoRA primer sits between prompt and fine-tune. A graph-backed term catalog can make phrases like red team, invariant, or canary carry a compact procedure. This is not training. It is a reusable prompt or retrieval package: cheaper to change than an adapter, weaker than weights, and useful as staging data before a future adapter training run.
Tool and workflow work
Tools improve behavior by moving action and exact procedure out of the model.
A model can draft a command. A tool can execute a typed operation with parameters, authorization, logging, retries, and failure handling. The MCP primer explains the protocol version of this move; cheaper alternatives to MCP explains when a simpler surface wins.
Tool means role: execute bounded work. MCP tools, API calls, Cloud Run jobs, local command wrappers, n8n flows, and governed agent actions can all fill it.
Tool placement fits:
- writes
- searches
- API calls
- ticket creation
- evidence collection
- data transforms
- deterministic multi-step procedures
The key line: the model proposes or routes; the tool executes under a contract.
Gate work
Gates improve behavior by refusing bad states.
This is the highest-value move for non-negotiable rules. A prompt can say “never publish private content.” A validator can block the build. A policy check can reject a write. A human approval gate can stop a risky action before it reaches production.
Gate means role: stop promotion or action when a rule fails. The implementation may live in CI, an MCP policy tool, a content validator, an approval workflow, a runtime authorization check, or a deployment rule.
Gates fit:
- public/private boundary enforcement
- credential and secret checks
- compliance rules
- destructive actions
- deployment promotion
- data residency constraints
The deployment-context-first article shows this at architecture scale: location and residency cannot live as helpful prompt text. They shape the system.
Eval work
Evals improve behavior by proving the placement worked.
After a behavior moves from prompt to context, from context to adapter, or from adapter to gate, the system still needs proof. The eighth-lever essay names eval and observability as the missing layer. The shadow tribunals article adds second opinions beside the primary run.
Eval means role: measure the behavior and leave a receipt. Unit tests, content contracts, MCP scanner results, graph coverage reports, shadow tribunal votes, and replay harnesses can all fill it.
Eval placement fits:
- regression checks
- prompt-vs-context comparisons
- adapter acceptance
- retrieval quality checks
- gate coverage checks
- model swap decisions
No placement earns trust without a receipt.
The practical rule
Use the smallest lever capable of carrying the behavior:
shape the model:
prompt if the behavior is local
use harness instructions if the rule travels with the workspace
retrieve if the fact changes
adapt if the style or behavioral prior repeats
control the system:
use a tool if the action or exact procedure leaves the model
gate if failure needs prevention
eval if the claim needs trust This turns the Determinism Ladder into an operating question. Not “how much AI should this system use?” Instead: should this behavior stay in the probability layer, or move into the authority layer?
Axioms applied in this essay
This article tested 6 of the StoneyTECH engineering axioms. Each verdict is the result of applying that axiom in this specific argument.
- #1 The smallest lever wins held
The article turns smallest lever into a placement table: prompt, context, adapter, tool, gate, eval.
- #2 Push work down toward determinism held
Determinism increases by moving repeated behavior out of persuasion and into controlled surfaces.
- #5 Never trust 'running' without sentinels held
Evals and gates become sentinels once a behavior matters enough to verify.
- #11 Cite or be silent held
The piece links back to prior public articles carrying the underlying claims.
- #14 Two cheaper alternatives first held
The table preserves reversible early moves before training or gating.
- #16 Don't comment without building. Don't curate without proving. held
The article closes the graph around existing proof pieces instead of creating a detached slogan.
