2026-05-11 · 11 min read

Deployment context first — when on-prem, sovereign-cloud, and public-cloud are different architectures

Deployment context comes before model choice. Three contexts, changing levers, and shippable architectures make axiom #18 concrete.

deployment-context · architecture · determinism-ladder · axiom-18 · data-residency


The model-portability piece opened with this story, and it earns a second telling here: an EU healthcare team committed to a closed-frontier US-hosted model in week one. The architecture looked beautiful. Demos landed. Then in week twenty-six, legal explained that patient data could not leave the EU, and the model was not hosted in any region where that data could legally go. Six weeks of architecture work disappeared into what should have been a 1-day model swap.

It would have been a 1-day swap. If they’d known to ask the question on day one.

The inaugural piece named deployment context as decision-zero. The model-portability essay named cases where deployment context flips the model decision into week one. This piece walks deeper: three deployment contexts (public cloud, sovereign cloud / private cloud, on-prem / air-gap), and the shape each agentic-stack lever takes inside each one. Same lever, three different architectures.

In the determinism-ladder lens

Every other essay in this series talks about pushing model autonomy down into deterministic execution. The deployment-context lens runs the same trade from a different axis: every context trades capability for constraint about where the system runs. Public cloud trades little for capability: frontier model, hosted vector store, off-the-shelf trace store. Sovereign cloud trades some capability for residency determinism. On-prem trades more capability for full control over every byte of every request.

The architectural mistake treats these as one architecture with three deployment options. They differ structurally. The same lever — say, RAG — becomes one artifact in public cloud (Pinecone + hosted embedding model + hosted vector reranker), another artifact in sovereign cloud (region-pinned Pinecone or in-region pgvector + regional embedding endpoint + smaller open reranker), and another artifact on-prem (pgvector + locally hosted embedding via text-embeddings-inference + CPU-bound reranker on the database host).
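The "same lever, three artifacts" point can be written down as data. The sketch below is illustrative only: the `RagArtifact` type, the context keys, and the `rag_for` helper are names invented for this example, and the entries simply restate the RAG stacks listed in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagArtifact:
    """One concrete shape of the RAG lever (hypothetical type for illustration)."""
    vector_store: str
    embedding: str
    reranker: str


# The three RAG artifacts named in the text, keyed by deployment context.
RAG_BY_CONTEXT = {
    "public-cloud": RagArtifact(
        vector_store="Pinecone (hosted)",
        embedding="hosted embedding model",
        reranker="hosted reranker",
    ),
    "sovereign-cloud": RagArtifact(
        vector_store="region-pinned Pinecone or in-region pgvector",
        embedding="regional embedding endpoint",
        reranker="smaller open reranker",
    ),
    "on-prem": RagArtifact(
        vector_store="pgvector",
        embedding="text-embeddings-inference (locally hosted)",
        reranker="CPU-bound reranker on the database host",
    ),
}


def rag_for(context: str) -> RagArtifact:
    """Return the RAG artifact for a context; unknown contexts fail loudly."""
    try:
        return RAG_BY_CONTEXT[context]
    except KeyError:
        raise ValueError(f"unknown deployment context: {context!r}") from None
```

Keying the lookup on context rather than on lever name makes the structural difference visible: asking for "RAG" without a context is not a valid question in this model.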

The decision tree from the inaugural piece is correct: deployment context first, model within context, then the rest of the stack. This piece walks each context end-to-end so that “first” becomes concrete.

1. Public cloud, default region

Opening anecdote. A B2B SaaS team building a customer-support assistant. US-only customers, no PHI, no PCI, standard enterprise DPA. They reached for Anthropic’s Claude API, Pinecone for retrieval, LangSmith for tracing, and shipped in three weeks.

What this context actually means.

Customer data is not sovereignty-constrained.
The provider’s default data-retention policy satisfies customer requirements, or a negotiated override is in place.
Network policy allows egress to provider APIs.
The cost-per-token of frontier models is acceptable for the use case.

Lever choices.

Lever	Public-cloud default
Model	Closed-frontier hosted: Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro
API	Hosted via the provider’s SSE/HTTP endpoint; per-environment keys; prompt caching on
LoRA	If needed, the provider’s fine-tuning product (Anthropic Custom Models, OpenAI Fine-tuning) — not LoRA in the technical sense, but the same outcome
RAG	Hosted vector store: Pinecone, Weaviate Cloud, Turbopuffer; hosted embedding endpoints
Skills	Provider/client-specific (Claude Skills, etc.); pulled from the public ecosystem with a review
MCP	Hosted MCP server on a managed app/runtime platform with per-installation tokens
Agents	Cloud-hosted via the provider’s Agent SDK or LangGraph / OpenAI Agents on Modal / Daytona
Eval & observability	LangSmith, Langfuse Cloud, Phoenix-as-a-service, PromptLayer
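The "per-environment keys" entry in the API row is easy to state and easy to get wrong in practice. A minimal sketch, assuming keys are supplied through environment variables; the variable naming scheme `MODEL_API_KEY_<ENV>` and the helper `api_key_for` are hypothetical, not any provider's convention.

```python
import os
from typing import Mapping, Optional


def api_key_for(environment: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Look up the API key for one environment (e.g. "staging", "prod").

    Assumption: keys live in variables named MODEL_API_KEY_<ENV>. The lookup
    never falls back to another environment's key, so a missing staging key
    cannot silently become a production key.
    """
    source = os.environ if env is None else env
    name = f"MODEL_API_KEY_{environment.upper()}"
    key = source.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; refusing to fall back to another key")
    return key
```

Passing the mapping explicitly, as in `api_key_for("staging", {"MODEL_API_KEY_STAGING": "..."})`, keeps the function testable without touching the real process environment.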

The trade. Capability and time-to-ship rise. Cost becomes the running tax because frontier inference remains expensive. Sovereignty and audit stay minimal: provider contract, provider regions, and provider incident-response playbook carry trust.

When to escape this context. Customer DPAs start specifying region constraints. Enterprise procurement asks “where does the data actually go.” A regulator names a compliance regime prohibiting inference egress. Escape rarely arrives as one moment; procurement questions turn into six-week migrations.

Failure mode named. Public cloud is the fastest path, until a regulator names the cost.

2. Sovereign cloud / private cloud / region-pinned

Opening anecdote. An EU fintech team spent three years building a customer-onboarding agent. Initial architecture: AWS Northern Virginia + OpenAI + Pinecone-Cloud (default region). When counsel classified financial decisioning as high-risk under the EU AI Act, the team had four months to migrate to in-region inference, in-region embedding, and in-region observability without breaking the customer-facing flow running on the existing architecture.

What this context actually means.

Customer data must stay within a specific jurisdiction (EU, France, Germany, India, Australia, Singapore, etc.).
The provider’s region-pinned offering must carry a contractual residency guarantee, not just “the data stays in the region as a default.”
Some controls (audit logs, data-retention policies, incident notification) must be auditable to a regulatory standard.
The model lever may face constraint: not every closed-frontier provider offers region-pinned versions of its best model. Sometimes the binding constraint becomes “best model legally available in this region.”

Lever choices.

Lever	Sovereign-cloud version
Model	Closed-frontier with region-pinned offering (Anthropic via AWS Bedrock EU; OpenAI via Azure with EU residency; Gemini via GCP EU regions) — OR an in-region self-hosted open-weight model when no closed-frontier option meets the residency contract
API	Provider’s regional endpoint with contractual residency guarantee; per-environment keys; prompt caching on if the provider offers regional cache isolation
LoRA	For open-weight self-hosting: in-region training on in-region accelerator compute (AWS Trainium or Azure ND-series GPU VMs in EU regions). For closed models: provider regional fine-tuning offering, when available.
RAG	Region-pinned vector store: Pinecone with region pinning, Weaviate Cloud EU, OR in-region pgvector on a database in the same region as the source data
Skills	Skills loaded from a privately-hosted registry; signed by the team; not pulled from public registries
MCP	MCP server in the same region as the data; auth boundary contractual; audit logs in-region
Agents	Agent runtime in-region on the selected cloud/runtime platform
Eval & observability	Self-hosted in-region: Langfuse self-host on in-region Postgres; Phoenix self-host; OpenLLMetry → in-region OTel collector. NOT LangSmith Cloud unless they offer EU residency.

The trade. Capability becomes moderate because the regional frontier model usually trails the global best. Time-to-ship increases because every component needs in-region placement or an in-region option. Sovereignty and audit become contractual: “where does the data actually go” gets answered with a region-pinning clause and audit log.

When to escape this context. Almost never. Law or contract usually placed the system here. Escape up to public cloud requires a customer-facing data-classification change, usually outside scope.

Sometimes-it-bites edge. Region pinning is contractual, not always default. Many hosted vector stores, Pinecone among them, can default to multi-region behavior unless region pinning gets selected and contracted. Many “EU presence” providers share this ambiguity. Read the contract; distrust marketing shorthand.

Failure mode named. In-region is a contract clause, not a default. The default global service may only have an EU point of presence.
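One way to keep "in-region is a contract clause, not a default" from living only in people's heads is a check that runs against a description of the stack. This is a sketch under assumptions: the `Component` record, its fields, and the example region names are invented for illustration, and a real check would read the same facts from the deployment manifest and the signed contract.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Component:
    lever: str                 # e.g. "model", "rag", "observability"
    service: str               # human-readable service name
    region: Optional[str]      # None means "provider default", i.e. not pinned
    residency_contract: bool   # True only if the contract guarantees residency


def residency_violations(components: Iterable[Component],
                         allowed_regions: Set[str]) -> List[str]:
    """Return one message per component that fails the residency rule."""
    violations = []
    for c in components:
        if c.region is None:
            violations.append(f"{c.lever}: {c.service} has no pinned region")
        elif c.region not in allowed_regions:
            violations.append(f"{c.lever}: {c.service} runs in {c.region}")
        elif not c.residency_contract:
            violations.append(
                f"{c.lever}: {c.service} is in-region by default, not by contract"
            )
    return violations


# Hypothetical stack: only the model endpoint is both pinned and contracted.
stack = [
    Component("model", "regional model endpoint", "eu-central-1", True),
    Component("rag", "hosted vector store", None, False),
    Component("observability", "trace store", "eu-central-1", False),
]
problems = residency_violations(stack, {"eu-central-1", "eu-west-1"})
```

The third branch is the one the edge case above describes: a component can sit in the right region today and still fail, because nothing contractual keeps it there.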

3. On-prem / air-gap / restricted-network

Opening anecdote. A defense-contractor team needed an internal coding-assistant agent. The dev environment lived in an air-gapped lab with no internet egress except a small internal artifact repository. First plan: Claude via VPN from the lab. Security needed two weeks to explain that a VPN does not turn a closed-frontier API call into an air-gap-compatible call. Architecture changed: self-hosted Qwen 3.6 14B with merge-time LoRAs for coding voice, pgvector on internal Postgres for code-search RAG, OpenLLMetry -> internal OTel collector for traces.

What this context actually means.

No network egress, or extremely restricted egress.
All inference, all retrieval, all observability, and all artifact storage on hardware the customer owns or controls.
Some contexts (true air-gap) cannot make outbound HTTPS to any external API; others (restricted-network) can call a small allow-list (e.g. api.anthropic.com only, or an approved internal artifact repository only).
Available hardware constrains model capability, typically to open-weight models in the 7B to 70B range plus specialized models for niche tasks.

Lever choices.

Lever	On-prem version
Model	Self-hosted open-weight: Qwen 3.6 14B / 70B, Llama 3.3 70B, Mistral Mixtral, DeepSeek-Coder-V2 for code-specialized work. Specialized open models when they outperform the general frontier (medical imaging foundation models; genomics models; legal-doc specialists).
API	Self-hosted inference engine: `vllm` or `tensorrt-llm` for production scale; `llama.cpp` for CPU-only / smaller deployments. Behind an internal HTTPS auth boundary; key rotation handled by the internal IDP.
LoRA	In-house training pipeline on internal GPUs. Reproducible from a signed input (dataset hash + hyperparameters + base-weight hash). LoRA adapters merged at load time for production.
RAG	pgvector on internal Postgres OR Qdrant / Weaviate self-hosted. Local embedding model: `bge-small-en` runs on CPU; `bge-large-en-v1.5` when a small GPU can serve embeddings. Cross-encoder reranker (`bge-reranker-base`) on the same box.
Skills	Internal-only skill registry. Signed at publish; verified at install. No public skills.
MCP	MCP server inside the secured perimeter, behind the internal auth boundary. Hosted on whatever the secured environment uses for internal services (Kubernetes in the secured cluster, internal Lambda, etc.).
Agents	Agent runtime inside the secured perimeter. Bounded agency (allowlist of mutating verbs, human-in-the-loop on the dangerous ones). Network egress allow-list at the agent’s container or VPC boundary.
Eval & observability	OpenLLMetry -> internal OTel collector -> existing Tempo / Loki / Grafana stack. Phoenix OSS or Langfuse self-host on internal Postgres for higher-level UI. NEVER hosted SaaS, even with VPN.
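The LoRA row asks for training that is reproducible from a signed input: dataset hash, hyperparameters, and base-weight hash. A minimal sketch of the fingerprint half of that idea, using only the standard library. The function names are hypothetical, and the signing step itself (whatever the internal PKI provides) is deliberately left out.

```python
import hashlib
import json
from typing import Any, Mapping


def sha256_file(path: str) -> str:
    """Hash a dataset or weight file in 1 MiB chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def training_fingerprint(dataset_hash: str,
                         base_weight_hash: str,
                         hyperparameters: Mapping[str, Any]) -> str:
    """Combine the three inputs into one deterministic SHA-256 fingerprint.

    Keys are sorted and separators fixed so the same inputs always serialize
    to the same bytes, regardless of dict ordering.
    """
    payload = json.dumps(
        {
            "dataset": dataset_hash,
            "base_weights": base_weight_hash,
            "hyperparameters": dict(hyperparameters),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to each adapter gives an auditor one value to compare: if a retrain from the recorded inputs produces a different fingerprint, one of the three inputs changed.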

The trade. Capability faces constraint: a generation or two behind the closed frontier on raw quality, sometimes more. Time-to-ship becomes longest because everything needs owned operation. Sovereignty and audit become total: every byte of every request, every model artifact, every retrieval stays on controlled hardware. The operating team can answer every “where does the data go” question with “nowhere external.”

True air-gap vs. restricted-network. True air-gap (no egress whatsoever) requires physical model-artifact transfer or approved one-way transfer, blocks public-source dataset augmentation, and places updates on security-controlled release cadence. Restricted-network (egress to a tightly bound allow-list) softens the rule: approved registries can provide weights, certain provider APIs can work when contractually permitted, and updates can move faster. Architectural choices remain similar; operational tempo changes.
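The difference between the two classes reduces to the contents of one set. The sketch below is an application-level preflight check only, under the assumption that real enforcement sits at the container or VPC boundary; the host `artifacts.internal.example` and the helper name `egress_allowed` are hypothetical.

```python
from typing import FrozenSet
from urllib.parse import urlsplit

# True air-gap: the allow-list is empty, so every outbound call is refused.
TRUE_AIR_GAP: FrozenSet[str] = frozenset()

# Restricted-network: a small, explicit allow-list (hypothetical host name).
RESTRICTED_NETWORK: FrozenSet[str] = frozenset({"artifacts.internal.example"})


def egress_allowed(url: str, allow_list: FrozenSet[str]) -> bool:
    """Allow a call only if the URL's host matches an allow-list entry exactly.

    Exact matching is deliberate: suffix or substring matching would let
    "artifacts.internal.example.attacker.test" through.
    """
    host = urlsplit(url).hostname
    return host is not None and host in allow_list
```

With `TRUE_AIR_GAP` every URL is refused, which is the code-level form of "air-gap is binary"; `RESTRICTED_NETWORK` changes the operational tempo without changing the shape of the check.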

Sometimes-it-bites edge. Around week three, someone suggests that “an exception for the API call can probably happen.” It almost never does. Security teams exist to say no to exceptions, and the regulatory framework that placed the team in an air gap usually forbids them too. If the system belongs to the air-gap class, build for the air gap from day one.

Failure mode named. Air-gap is binary. There’s no almost-air-gap.

The decision tree (deployment-context first)

This repeats the inaugural decision tree with deployment context as the first filter. Step 0 conditions every later step:

0. Pick the deployment context. Public cloud, sovereign cloud / private cloud, on-prem / restricted-network, true air-gap. This decision precedes every other decision below.
1. Pick the model within context. In public cloud, this remains reversible (the inaugural piece’s “swap later” advice applies). In sovereign cloud, model choice travels with context. In on-prem / air-gap, model choice happens before context permits many other decisions. (The model-portability-exceptions essay walks the cases.)
2. Pick the API within context. Hosted at the provider regional endpoint, hosted with negotiated residency, or self-hosted with vllm / tensorrt-llm. The decision sits downstream of model and context, not independent.
3. Pick the rest of the levers within context. Each lever has a context table above. Public-cloud defaults differ from sovereign-cloud defaults, which differ from on-prem defaults. Same lever name; different artifact.
4. Decide threat-surface controls (companion to axiom #17). The deployment context multiplies the threat surface. A confused-deputy attack on a public-cloud agent is different from a confused-deputy attack on an air-gapped agent: the blast radius and the auditing capability are both different. (The threat-surface-layer-by-layer essay walks the per-layer controls.)
5. Verify in code, not in runbooks. Per-region constraints, per-context placement decisions, and audit-log requirements should all be verifiable by inspection of the deployment manifest (Terraform / Pulumi / k8s YAML / service-runtime config). Axiom #7 (every escalation in code, not in backlogs) applies here too.
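The first step of the tree can be sketched as a function that checks the most restrictive constraint first. This is a simplification under stated assumptions: the three boolean inputs and the names `Constraints` and `pick_context` are invented for illustration, and a real classification also weighs hardware ownership, contracts, and regulatory regime.

```python
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    PUBLIC_CLOUD = "public cloud"
    SOVEREIGN_CLOUD = "sovereign cloud / private cloud"
    RESTRICTED_NETWORK = "on-prem / restricted-network"
    TRUE_AIR_GAP = "true air-gap"


@dataclass(frozen=True)
class Constraints:
    any_egress_permitted: bool          # False means no outbound traffic at all
    egress_limited_to_allow_list: bool  # True means only approved hosts
    residency_required: bool            # data must stay in a jurisdiction


def pick_context(c: Constraints) -> Context:
    """Return the deployment context, testing the tightest constraint first."""
    if not c.any_egress_permitted:
        return Context.TRUE_AIR_GAP
    if c.egress_limited_to_allow_list:
        return Context.RESTRICTED_NETWORK
    if c.residency_required:
        return Context.SOVEREIGN_CLOUD
    return Context.PUBLIC_CLOUD
```

The ordering matters: a residency requirement on an air-gapped system does not make it a sovereign-cloud system, so the egress questions are asked before the residency question.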

When the context bites unexpectedly

Three patterns where the deployment-context decision shifts after launch:

The customer mix changes. A US-only B2B starts adding EU customers, or the first enterprise customer has a DPA clause about data residency. The context changes from public-cloud to sovereign-cloud mid-flight, and every lever needs a parallel sovereign-region version.
The data classification changes. A team built on the assumption “data is just text” discovers personal health information inside the text after a customer pastes a medical question into chat. Suddenly the trace store, retrieval store, and model-training exposure become sovereignty-bound. Context stayed; data classification shifted the implication.
The regulator names a new regime. EU AI Act, FedRAMP IL5/IL6 expansion, healthcare-specific frameworks, financial-services-specific frameworks. The team didn’t change deployment contexts; the regulator widened what “sovereign” means.

In every case, rework cost scales with the number of lever decisions made without the context-first lens. A hosted vector store picked on day one because it was the “standard choice” triggers a sovereign-region migration under all three patterns. A team that chose pgvector from the start (same lever, smaller-lever version) simply moves on to the next problem.
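The rework claim can be made countable. The sketch below assumes a hand-written table of which artifacts fit which contexts; the `FITS` table is an illustrative reading of the lever tables in this essay, not an authoritative compatibility matrix, and "pgvector on own Postgres" is assumed to run on a database the team can place in any required region.

```python
from typing import Dict, List

PUBLIC, SOVEREIGN, ON_PREM = "public-cloud", "sovereign-cloud", "on-prem"

# artifact -> contexts it can serve without migration (illustrative assumption)
FITS = {
    "hosted vector store, default region": {PUBLIC},
    "pgvector on own Postgres": {PUBLIC, SOVEREIGN, ON_PREM},
    "hosted tracing SaaS": {PUBLIC},
    "self-hosted Langfuse": {PUBLIC, SOVEREIGN, ON_PREM},
}


def levers_to_migrate(stack: Dict[str, str], new_context: str) -> List[str]:
    """List the levers whose current artifact does not fit the new context."""
    return sorted(
        lever for lever, artifact in stack.items()
        if new_context not in FITS[artifact]
    )


standard_choice = {
    "rag": "hosted vector store, default region",
    "observability": "hosted tracing SaaS",
}
smallest_lever = {
    "rag": "pgvector on own Postgres",
    "observability": "self-hosted Langfuse",
}
```

Comparing `levers_to_migrate(standard_choice, SOVEREIGN)` with `levers_to_migrate(smallest_lever, SOVEREIGN)` shows the difference the paragraph describes: one stack has work queued for every lever, the other has none.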

Spirit

The Determinism Ladder series treats every architectural decision as a lever-trade between model autonomy and system determinism. The deployment-context lens is the same trade from the perspective of where the system runs. Public cloud gives up almost nothing for capability in the median case, and gives up control over location in the regulated case. Sovereign cloud trades a little capability for control over location and contractual residency. On-prem trades a generation of capability for control over everything.

The structural error caught by the v3.2 panel: treating the model as decision-zero. The model is one lever among many; deployment context supplies the constraint set inside which every lever gets decided. Right ordering keeps the rest of the architecture optional. Wrong ordering creates six-week redo cycles.

Pick deployment context first. Pick the smallest-lever version of every other lever within context. Verify placement in code, not in a postmortem.

Axiom #18 in operating form.

Next in the Determinism Ladder series: shape probability, control authority — where AI behavior should live once the deployment context locks.

Axioms applied in this essay

This article tested 7 of the StoneyTECH engineering axioms. Each verdict is the result of applying that axiom in this specific argument.

#18 Pick the deployment context before the model (refined)
The inaugural piece named deployment context as decision-zero. This piece refines the axiom from 'pick deployment context first' to 'pick deployment context first AND walk every lever through context constraints; the same lever names a different artifact in each context.' Three contexts walk every lever explicitly.
#17 Threat-model the surface (assume adversarial input) (held)
Each deployment context multiplies threat surface differently. Companion to threat-surface-layer-by-layer; the two essays compose axioms #17 and #18 together: security and deployment share the same desk.
#14 Two cheaper alternatives first (held)
Each context's lever choices follow the cheaper-alternatives-first discipline: in public cloud, a hosted API is the cheapest path; in on-prem, a self-hosted open-weight model with pgvector and OpenLLMetry is the cheapest path. The lever doesn't change; the cheapest version of the lever does.
#11 Cite or be silent (held)
Cites GDPR, EU AI Act, FedRAMP IL5/IL6, HIPAA, SOC 2, NIST AI RMF, and provider-specific data-residency contracts. Cross-references the inaugural's three-contexts table and the model-portability-exceptions essay without reproducing them.
#10 Story-anchor every claim (held)
Three opening anecdotes, one per context: a B2B SaaS team's three-week public-cloud launch; an EU fintech team's forced sovereign-region migration; a defense-contractor team's air-gap surprise. The introduction adds the EU healthcare team's week-26 legal sit-down.
#1 The smallest lever wins (held)
The smallest-lever rule applies per context: pick the smallest lever satisfying the context's binding constraint; avoid expensive levers just because the context carries more constraint.
#2 Push work down toward determinism (held)
Each context pushes a different unit of uncertainty down into deterministic execution. Public cloud trades determinism for capability; sovereign region trades capability for determinism about residency; on-prem trades more capability for full control.