2026-05-17· 8 min read

Shape probability, control authority - where AI behavior should live

The Determinism Ladder moves AI behavior from probability layers into authority layers as consequence rises.

determinism-ladderpromptingragfine-tuningloraevalsgovernance

On this page Overview

The common failure mode sounds reasonable at first: make the prompt better.

The prompt grows. Then it grows again. A few examples become a style guide. The style guide becomes policy. The policy becomes a miniature database. The miniature database becomes a compliance surface. After enough growth, the prompt no longer frames the task. It impersonates the whole system.

The Determinism Ladder exists for this exact moment.

The question is not only “how can the model do better?” The question is: where should the behavior live?

The Determinism Ladder has one practical split.

Model-shaping changes probability: what the model is likely to say or do. It influences what the model sees, prioritizes, imitates, or treats as normal. It belongs in prompt text, harness instructions, retrieved context, reusable term packs, adapters, and fine-tunes.

System-control changes authority: what the system may do. It executes work, blocks unsafe movement, stores evidence, or proves a claim. It belongs in tools, workflows, validators, approvals, monitors, evals, and receipts.

Shaping influences. Control enforces. The boundary matters because a shaped model can still ignore, forget, overgeneralize, hallucinate, or comply with hostile context. A controlled system can refuse, log, replay, and prove.

Model-shaping placements

These placements change what the model is likely to do. They do not enforce the outcome by themselves.

Behavior needs	Best first home	If placed in the wrong layer
Ephemeral task framing	Prompt or per-turn scaffold	A simple task becomes a permanent rule.
Local agent operating rules	`AGENTS.md`, `CLAUDE.md`, skills, workspace rules, IDE harness instructions, or agent definitions	Local guidance starts acting like hidden policy without clear precedence.
Current or citable knowledge	RAG, graph traversal, database/API lookup, file search, or MCP resource reads	Fresh facts get baked into stale memory or adapter behavior.
Repeated style, tone, or domain phrasing	LoRA, adapters, SFT, reusable prompt pack, or glossary package	Repeated priors consume prompt space forever.

System-control placements

These placements move responsibility outside model habit. They execute, block, or prove.

Behavior needs	Best first home	If placed in the wrong layer
Repeated deterministic procedure	Tool, workflow, template, planner, or code	Exact steps depend on the model remembering the ritual.
External action or state change	Tool call, workflow, API write, or MCP tool	The model describes action instead of executing under authorization and logs.
Non-negotiable rule	Gate, validator, policy, or approval	A hard rule becomes a suggestion inside prompt text.
Confidence claim	Eval, monitor, receipt, or shadow judge	Trust depends on persuasion instead of measured evidence.

This is the short form:

model-shaping = probability layer: prompt + instruction stack + retrieval context + adapter
system-control = authority layer: tool + workflow + gate + eval

prompt = local intent
instruction stack = imported operating rules
retrieval context = evidence and current state
adapter = repeated style or behavioral prior
tool/workflow = external execution or exact procedure
gate = non-negotiable prevention
eval = proof

Map from behavior need into probability-layer surfaces, authority-layer surfaces, and eval receipt. — The Determinism Ladder moves responsibility from influence toward enforcement as consequence rises.

The Determinism Ladder hub gives the broader map. This article names the day-to-day placement decision.

Mature composition: filled templates

A dynamically filled prompt template sits in the probability layer, but mature implementations rarely stop there.

The template frames the run: output requirements, voice, sections, examples, rubric, JSON shape, and task-specific acceptance criteria. Retrieved graph facts, MCP resource reads, file search, or API lookups fill the open slots. A validator, schema check, policy gate, or eval then enforces the output contract after generation.

template = prompt scaffold
slot data = retrieval context
schema check = gate
quality check = eval
action after acceptance = tool or workflow

This pattern matters because it keeps each responsibility in the right layer. The model receives a clear assignment. The system preserves provenance for the inserted facts. The gate rejects malformed output. The eval leaves evidence about whether the template still works.

Mature composition: convergence loops

Orchestrated GVR or GV+AR convergence sits around many model-shaped runs. It uses probability for candidate judgment, then uses authority for acceptance.

orchestrator = workflow
generator agents = model-shaped runs
verifier agents = eval
refiner agents = workflow plus prompt scaffold
convergence threshold = gate
final receipt = eval evidence
graph memory = retrieval context plus provenance

This is the same ladder move in a larger loop. The generator may propose. The verifier panel may disagree. The refiner may revise. The graph stores the claims, critiques, votes, versions, and receipts. The convergence rule decides whether the artifact may advance.

The public pattern has nearby references. Shadow verifier panels covers second opinions beside the primary run. Eval and observability covers receipts. Graph-constrained execution covers explicit state and edges. Three repos, one thesis names the GVAR engine as a public pattern repo.

Google DeepMind’s Gemini Deep Think / Aletheia writeup describes a math research agent using iterative generate, verify, and revise loops for research-level problems. StoneyTECH treats GVR as a public learning adaptation of the same broad pattern: generate candidates, verify independently, refine under graph state, and accept only with convergence evidence.

Crossing the boundary

Move from shaping to control when failure has meaningful consequences: safety, money, compliance, reputation, irreversible action, replay, authorization, or evidence.

A prompt can say “prefer safe commands.” A tool wrapper can restrict the command. A gate can refuse a risky command. An eval can prove the refusal rule still works after the model, prompt, or workflow changes.

This is the core Determinism Ladder trade: behavior with low consequence can remain shaped. Behavior with high consequence needs control.

Roles and implementations

The terms in the table name roles, not separate products. This is the part where the vocabulary can feel slippery. A system builder does not usually buy a “tool” product, a “gate” product, and an “eval” product. The builder wires implementation surfaces together, then assigns responsibility to each surface.

MCP, graph, CI, workflows, and harness files show up more than once for this reason. MCP can expose a read-only resource, which makes it context. The same MCP server can expose a scanner, which makes it a tool. If the scanner result blocks publication, it also participates in a gate. If the scan result gets stored with a timestamp, input, output, and verdict, it becomes eval evidence.

The same surface can carry different roles, but not every role fits every surface equally. Capability decides the fit. A surface can serve as context when it exposes evidence. It can serve as a tool when it performs a bounded operation. It can serve as a gate when its result blocks promotion or action. It can serve as an eval when it leaves a measured receipt.

Surface	Strongest natural roles	Limited or conditional roles
MCP	context through resources; tool through typed operations	gate or eval only when wired to policy decisions and stored receipts
Graph	context through facts, edges, provenance; eval through coverage and drift checks	tool only through traversal helpers; gate only when promotion checks required graph state
CI/build	gate through validators; eval through test output and verification receipts	tool when generating artifacts; context when publishing generated state
Workflow	tool through jobs and actions; gate through approval or policy branches	context as run state; eval only when run summaries become durable receipts
Agent harness	instruction stack through local rules and skills; gate through permission boundaries	tool through approved wrappers; eval through transcript review or fixture runs

This resolves the overlap: context, tool, gate, and eval describe runtime responsibility. MCP, graph, CI, workflows, and harness files describe implementation surfaces. The practical question is not “can this surface be anything?” The better question is “which capability is actually exposed during this run?”

An MCP compliance scanner makes the point concrete. As an MCP tool, it runs the scan. As a gate, its result can block publication or deployment. As an eval, its receipt proves what the scanner checked. The same scanner may also read graph context before deciding what counts as compliant.

Capability map showing MCP, graph, CI build, workflow, and agent harness surfaces passing through a capability check before context, tool, gate, or eval roles. — Surfaces do not automatically fill every role. Exposed capability decides the safe placement.

The rest of the article walks the split in order: first model-shaping, then system-control.

Prompt work

Prompt work improves behavior by asking better.

It fits early exploration, task framing, local style, and reversible experiments. The Stack Matrix starts here because prompt changes cost little and reveal whether the problem needs more machinery.

Prompting fails when it becomes the home for facts, policy, state, permissions, or proof. A prompt can mention a rule. It cannot enforce the rule after the model ignores it.

Use prompt work when:

the behavior changes often
the cost of being wrong stays small
the instruction belongs to the current task
no audit trail beyond the run output matters

Instruction stack

Instruction-stack work improves behavior by making local operating rules explicit.

AGENTS.md, CLAUDE.md, Codex skills, Claude Code skills, VS Code workspace instructions, Cursor rules, Zed context, tool permission policies, and agent definitions shape what the model believes the session permits.

These surfaces are not retrieval. The harness imports them as instruction layers. They may be legitimate and useful, but they share the same attack surface as prompt injection when they come from untrusted paths, broad scopes, stale files, or ambiguous precedence.

Instruction-stack work needs:

precedence
path and workspace scope
provenance
user visibility
conflict handling
versioning

The placement rule: per-turn prompts can frame the task, but they should not override higher-precedence harness rules.

Retrieval context

Retrieval-context work improves behavior by supplying better evidence and current state.

RAG, graph traversal, MCP resources, APIs, databases, files, and search are external, on-demand context retrieval. The model no longer needs to remember every current fact. The system can fetch the fact, attach provenance, and keep the answer near the source.

MCP needs a precise note here. MCP is not itself RAG. It is a protocol surface. MCP resources can provide retrieval context; MCP tools can perform reads, writes, checks, or actions. The placement depends on the exposed capability.

This is why the published-content MCP matters. A public site becomes more useful when agents can read the site as structured context instead of scraping prose only.

Retrieval context fits:

current product facts
citations
customer-specific state
policy text
graph relationships
run-specific boundaries

Retrieved material is evidence, not instruction. A retrieved document may contain commands, but the system should treat those commands as quoted content unless a higher-precedence instruction says otherwise.

Retrieval context still spends tokens. It also needs retrieval quality, ranking, injection controls, and evidence discipline. The graph-constrained execution piece covers the next step: context can constrain choices, not only inform prose.

Fine-tuning work

Fine-tuning improves behavior by changing what the model treats as normal.

The LoRA + RAG composition piece gives the clean split: voice and repeated behavior can live in weights or adapters; fresh facts should live in retrieval. The LLM construction primer explains the training path behind the adapter.

Fine-tuning and adapters fit:

repeated tone
repeated formatting
domain phrasing
stable classification habits
compact behavioral priors

They do not fit fresh facts, permission checks, revocation, hard policy, or exact workflow enforcement. Those need controlled runtime surfaces.

The bonus runtime-adapter idea from the LoRA primer sits between prompt and fine-tune. A graph-backed term catalog can make phrases like red team, invariant, or canary carry a compact procedure. This is not training. It is a reusable prompt or retrieval package: cheaper to change than an adapter, weaker than weights, and useful as staging data before a future adapter training run.

Tool and workflow work

Tools improve behavior by moving action and exact procedure out of the model.

A model can draft a command. A tool can execute a typed operation with parameters, authorization, logging, retries, and failure handling. The MCP primer explains the protocol version of this move; cheaper alternatives to MCP explains when a simpler surface wins.

Tool means role: execute bounded work. MCP tools, API calls, Cloud Run jobs, local command wrappers, n8n flows, and governed agent actions can all fill it.

Tool placement fits:

writes
searches
API calls
ticket creation
evidence collection
data transforms
deterministic multi-step procedures

The key line: the model proposes or routes; the tool executes under a contract.

Gate work

Gates improve behavior by refusing bad states.

This is the highest-value move for non-negotiable rules. A prompt can say “never publish private content.” A validator can block the build. A policy check can reject a write. A human approval gate can stop a risky action before it reaches production.

Gate means role: stop promotion or action when a rule fails. The implementation may live in CI, an MCP policy tool, a content validator, an approval workflow, a runtime authorization check, or a deployment rule.

Gates fit:

public/private boundary enforcement
credential and secret checks
compliance rules
destructive actions
deployment promotion
data residency constraints

The deployment-context-first article shows this at architecture scale: location and residency cannot live as helpful prompt text. They shape the system.

Eval work

Evals improve behavior by proving the placement worked.

After a behavior moves from prompt to context, from context to adapter, or from adapter to gate, the system still needs proof. The eighth-lever essay names eval and observability as the missing layer. The shadow verifier panels article adds second opinions beside the primary run.

Eval means role: measure the behavior and leave a receipt. Unit tests, content contracts, MCP scanner results, graph coverage reports, shadow verifier panel votes, and replay harnesses can all fill it.

Eval placement fits:

regression checks
prompt-vs-context comparisons
adapter acceptance
retrieval quality checks
gate coverage checks
model swap decisions

No placement earns trust without a receipt.

The practical rule

Use the smallest lever capable of carrying the behavior:

shape the model:
prompt if the behavior is local
use harness instructions if the rule travels with the workspace
retrieve if the fact changes
adapt if the style or behavioral prior repeats

control the system:
use a tool if the action or exact procedure leaves the model
gate if failure needs prevention
eval if the claim needs trust

This turns the Determinism Ladder into an operating question. Not “how much AI should this system use?” Instead: should this behavior stay in the probability layer, or move into the authority layer?

Axioms applied in this essay

This article tested 6 of the StoneyTECH engineering axioms. Each verdict is the result of applying that axiom in this specific argument.

#1 The smallest lever wins held
The article turns smallest lever into a placement table: prompt, context, adapter, tool, gate, eval.
#2 Push work down toward determinism held
Determinism increases by moving repeated behavior out of persuasion and into controlled surfaces.
#5 Never trust 'running' without sentinels held
Evals and gates become sentinels once a behavior matters enough to verify.
#11 Cite or be silent held
The piece links back to prior public articles carrying the underlying claims.
#14 Two cheaper alternatives first held
The table preserves reversible early moves before training or gating.
#16 Don't comment without building. Don't curate without proving. held
The article closes the graph around existing proof pieces instead of creating a detached slogan.