Deployment Mode · 3 of 5
Self-Hosted LLMs.
Open-source frontier-class models - Llama, Mistral, Qwen, DeepSeek - running on dedicated GPU infrastructure that BrainPack operates on your behalf, or on GPUs you own. No Anthropic in the data path. No OpenAI. No Google. No Microsoft. Just the model, the GPU, your data, and the BrainPack operating layer. For workloads where the requirement is "no third-party AI provider can ever see this" - this is the deployment mode. We run the GPUs. We operate the models. You get the outcomes.
The Provider Cannot See Data It Never Receives.
ZDR is a contract. Public cloud is a configuration. Self-hosted is a different proposition entirely - there is no AI provider in the data path. The model runs on a GPU under our control or yours. The query never leaves the perimeter. The response never enters someone else's log file, even briefly. For data classes where the answer to "could the provider technically see this for 200 milliseconds during inference" is no, the deployment mode has to be self-hosted. There is no other answer.
Self-hosted used to mean "build it yourself, run it yourself, hope the talent stays" - which is why most enterprises avoided it. Open-source models were behind frontier capability by 12-18 months. GPU infrastructure was a six-figure CapEx decision. Operations required ML engineers most enterprises could not retain. The math did not work outside a few specific industries.
In 2026, the math has changed. Open-source models - Llama, Mistral, Qwen, DeepSeek - have closed most of the capability gap on the workloads that actually matter. Managed self-hosted services run the GPU layer for you. The capability gap is now a few months for cutting-edge tasks and zero for production workloads. The GPU economics work for any enterprise above the smallest scale. The last remaining barrier is operational complexity - and BrainPack handles that as part of the managed layer.
A Control Boundary Decision, Not a Vendor Preference.
Self-hosted means running inference on infrastructure that BrainPack or you control directly, with no third-party AI provider involved. The data goes from your environment to the GPU and back. Nothing else is in the data path. No AI vendor sees the prompt, the response, or the model's reasoning because no AI vendor is part of the call.
The models in this category are open-weight only: Llama (Meta), Mistral, Qwen (Alibaba), DeepSeek, and the long tail of fine-tuned variants built on top of them. The frontier closed models (Claude, GPT, Gemini) are not available self-hosted; their providers do not release weights. For most production workloads, the capability gap is now small. For some cutting-edge tasks - deep research, specialized reasoning - it remains real.
Self-hosted is not automatically the right choice. GPUs reserved for your inference carry a lower unit cost per token at sufficient volume, but below that volume the hardware sits idle and costs more. The break-even is workload-dependent, generally somewhere between 10 million and 50 million tokens per day. Self-hosted is appropriate for some data classes and uneconomic for others. The deployment decision is a control-boundary-and-volume decision, not a vendor-trust decision.
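To make that arithmetic concrete, here is a minimal back-of-envelope sketch of the break-even, using illustrative figures (not BrainPack or vendor pricing) that happen to land inside the 10-50 million range:

```python
# Back-of-envelope break-even between a reserved GPU and pay-per-token APIs.
# Both figures below are illustrative assumptions, not BrainPack or vendor pricing.

GPU_MONTHLY_COST_USD = 9_000        # assumed: one reserved GPU node, fully loaded
API_PRICE_PER_M_TOKENS_USD = 10.00  # assumed: blended frontier-API rate per million tokens

def breakeven_tokens_per_day(gpu_monthly_cost: float, api_price_per_m: float) -> float:
    """Daily volume at which the fixed GPU cost equals the pay-per-token bill."""
    tokens_per_month = gpu_monthly_cost / api_price_per_m * 1_000_000
    return tokens_per_month / 30

print(f"{breakeven_tokens_per_day(GPU_MONTHLY_COST_USD, API_PRICE_PER_M_TOKENS_USD):,.0f} tokens/day")
# -> 30,000,000 tokens/day with these toy numbers; real break-evens shift with
#    model size, utilization, and the actual API rates in the contract.
```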
BrainPack treats Self-Hosted as one execution surface among five. The Connect, Orchestrate, and Govern layers do not change. What changes is where the inference actually executes and the fact that no AI vendor is in the data path at all.
When Self-Hosted Is The Right Mode.
Six Workloads Where It Wins.
Six workload categories where self-hosted is the appropriate answer - and where ZDR or on-premise are usually the alternatives BrainPack also supports.
CORE IP AND COMPETITIVE TECHNOLOGY
Source code for products you sell, proprietary algorithms, trade secrets, manufacturing process documentation, R&D pipelines. The data defines your competitive position. "The provider saw it for 200 milliseconds and discarded it" is not the answer - the data should never have left your control. Self-hosted is the answer.
PRE-ANNOUNCEMENT MATERIAL FINANCIAL INFORMATION
Quarterly earnings before release, M&A documents, board materials, executive compensation discussions, trading strategies. Even ZDR is too permissive - the data goes through a third-party provider, however briefly. Self-hosted eliminates the provider entirely.
HIGH-VOLUME WORKLOADS WHERE TOKEN ECONOMICS MATTER
Customer service automation processing millions of interactions per day. Internal knowledge agents serving thousands of employees. Document processing pipelines handling tens of thousands of documents. Above ~10-50M tokens per day, dedicated GPU becomes cheaper per token than frontier API rates - and the savings compound.
WORKLOADS WHERE LATENCY MATTERS
Self-hosted models on dedicated GPUs serve responses with predictable latency, no public-cloud noisy-neighbor effects, no rate-limit waits, no provider-side queueing. For real-time voice, sub-second agent loops, or high-frequency analytical workloads, self-hosted often outperforms cloud APIs.
SOVEREIGN AI REQUIREMENTS
Workloads where the data must demonstrably never leave a specific national jurisdiction, must be processed on infrastructure owned by domestic entities, or must comply with sovereignty rules that cloud AI providers cannot satisfy. Self-hosted on regional GPU infrastructure addresses this.
"NO US TECH COMPANY IN THE DATA PATH" REQUIREMENTS
Some defense, financial, healthcare, and government workloads in non-US jurisdictions explicitly preclude US-headquartered AI providers from the inference path. Self-hosted with European or other open-source models on regional infrastructure is the answer.
When Self-Hosted Is The Wrong Mode.
And Where The Workload Should Go Instead.
Five workload categories where self-hosted is the wrong answer and where BrainPack routes work to public cloud, ZDR, on-premise, or air-gapped instead.
Low-Volume Or Unpredictable Workloads
Workloads that do not clear the 10–50M tokens-per-day break-even leave dedicated GPUs sitting idle. The economics invert: pay-per-token public cloud or ZDR is cheaper, sometimes by a factor of ten. Reserve self-hosted for steady, high-throughput pipelines, not bursty experimentation.
Frontier-Capability Tasks
Deep research, complex multi-step reasoning, the newest multimodal capabilities, advanced coding agents. Open-weight models have closed most of the gap, but on the frontier edge, Claude, GPT, and Gemini still lead - sometimes by a generation. If the workload genuinely needs that capability and the data class permits, public cloud or ZDR is the right surface.
Data Subject To Strict Residency Or Air-Gap Rules
Self-hosted in BrainPack's infrastructure still means a data center somewhere - likely not in the jurisdiction the regulator cares about, and not air-gapped. Defense classifications, banking sovereignty rules, and government workloads under FedRAMP High require on-premise or air-gapped deployment in the specified region. Self-hosted on shared infrastructure does not satisfy those requirements.
General Productivity Work
Drafting emails, summarizing public documents, brainstorming, code completion on non-sensitive repos. The data class does not demand the control boundary, the volume rarely justifies the GPU reservation, and the model selection is narrower. Public cloud does this work faster, cheaper, and on better models.
Workloads Where Time-To-Capability Matters More Than Control
A pilot that needs to ship in a week. A new use case where the team is still validating whether AI solves the problem at all. Self-hosted requires GPU procurement, model selection, fine-tuning decisions, and operational standup. Public cloud ships in days. Validate first, then graduate to self-hosted if the data class and volume support it.
How Self-Hosted Orchestrates.
With Every Other Deployment Mode.
Self-hosted is rarely the only deployment mode in a real enterprise. It runs alongside public cloud, ZDR, on-premise, and air-gapped - each handling the workloads that fit it best. The Govern layer routes automatically.
A real BrainPack deployment looks like this:
Same user. Same conversational interface. Same agent library. Same governance policies. Five different inference paths — selected automatically by the Govern layer based on data classification, regulatory framework, and policy.
The user never picks the deployment mode. The mode picks itself.
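A minimal sketch of what that policy-driven selection can look like in code - the classification labels and mode names here are hypothetical, not BrainPack's actual schema:

```python
# Sketch of policy-based mode selection from a data classification.
# Labels and mode names are hypothetical illustrations, not BrainPack's API.

DEPLOYMENT_POLICY = {
    "public":           "public_cloud",
    "internal":         "zdr",
    "customer_pii":     "zdr",
    "core_ip":          "self_hosted",
    "pre_announcement": "self_hosted",
    "sovereign":        "self_hosted",
    "classified":       "air_gapped",
}

def select_mode(data_class: str) -> str:
    """Pick the inference path from the workload's data classification."""
    # Unknown classifications fail closed to the most restrictive mode.
    return DEPLOYMENT_POLICY.get(data_class, "air_gapped")

assert select_mode("core_ip") == "self_hosted"
assert select_mode("unlabelled") == "air_gapped"
```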
Self-Hosted Inside the BrainPack Layer.
What BrainPack Adds On Top Of A Raw API Call.
Running an open-source LLM on a GPU is technically straightforward. Running it as production infrastructure with the operational rigor enterprise requires is not. Several things BrainPack does on top of the GPU layer make self-hosted production-grade.
GPU procurement, sizing, and lifecycle management
We size the GPU capacity to your workload mix, procure the hardware (or operate yours), handle firmware updates, manage utilization, and replace hardware as it ages. You do not run a GPU operations team.
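As a rough illustration of the sizing step, the capacity estimate reduces to a few lines - the throughput and headroom figures below are assumptions, not measured numbers:

```python
# Rough capacity sizing: how many GPUs a steady workload needs.
import math

def gpus_needed(tokens_per_day: float,
                tokens_per_sec_per_gpu: float = 1_500,  # assumed sustained throughput per GPU
                peak_to_avg: float = 3.0,               # assumed peak-hour multiplier
                headroom: float = 0.7) -> int:          # target utilization ceiling
    avg_tokens_per_sec = tokens_per_day / 86_400
    peak_tokens_per_sec = avg_tokens_per_sec * peak_to_avg
    return math.ceil(peak_tokens_per_sec / (tokens_per_sec_per_gpu * headroom))

print(gpus_needed(30_000_000))    # ~1 GPU at these assumptions
print(gpus_needed(300_000_000))   # ~10 GPUs
```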
Model evaluation, selection, and migration
New open-source models ship monthly. We evaluate each on your specific workload patterns, deploy the production-ready ones, and migrate workloads when newer models outperform on your tasks. Your AI capability does not freeze the moment a model is deployed.
Observability and incident response
Every inference call is monitored. Latency anomalies trigger alerts. Quality regressions surface before they impact users. Failed inferences are diagnosed and resolved. The operational maturity is comparable to a mature SaaS - not a research environment.
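As a toy illustration of the latency-anomaly check, assuming a rolling window of recent call latencies is kept per model:

```python
# Toy latency-anomaly detector: flag a call that sits far outside the recent distribution.
from statistics import mean, stdev

def is_latency_anomaly(recent_ms: list[float], current_ms: float, sigmas: float = 3.0) -> bool:
    if len(recent_ms) < 30:                            # not enough history to judge
        return False
    mu, sd = mean(recent_ms), stdev(recent_ms)
    return current_ms > mu + sigmas * max(sd, 1.0)     # floor the noise estimate at 1 ms
```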
Multi-model routing within self-hosted
The orchestrator routes within self-hosted too. A simple lookup goes to a 7B parameter model on minimal GPU. A complex reasoning task goes to a 70B parameter model. A coding task goes to Qwen-Coder. The routing is transparent to users; the cost optimization is significant.
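Sketched as a routing table - the model names and task labels are illustrative, not a fixed BrainPack catalog:

```python
# Size-tiered routing inside the self-hosted surface (illustrative models and labels).
SELF_HOSTED_ROUTES = {
    "lookup":    "llama-3.1-8b",       # simple retrieval-and-answer on minimal GPU
    "reasoning": "llama-3.1-70b",      # multi-step analysis on the larger tier
    "coding":    "qwen2.5-coder-32b",  # code-specialized open-weight model
}

def route_within_self_hosted(task_type: str) -> str:
    # Default up, not down: unknown task types get the most capable tier.
    return SELF_HOSTED_ROUTES.get(task_type, "llama-3.1-70b")
```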
Failover to other modes
If a self-hosted GPU has an outage, the orchestrator fails over to ZDR endpoints automatically - preserving the contractual posture as much as possible while keeping the AI available. The user does not see the failover; the audit log records it.
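A simplified sketch of that failover path, assuming hypothetical per-mode endpoint clients that expose a `complete()` method:

```python
# Sketch of mode failover with audit logging; the endpoint clients are hypothetical.
import logging

logger = logging.getLogger("brainpack.failover")   # hypothetical logger name

FAILOVER_ORDER = ["self_hosted", "zdr"]            # never silently fall back to public cloud

def run_with_failover(prompt: str, endpoints: dict) -> str:
    last_error = None
    for mode in FAILOVER_ORDER:
        try:
            response = endpoints[mode].complete(prompt)    # hypothetical client call
            if mode != FAILOVER_ORDER[0]:
                logger.warning("failover: served by %s instead of %s",
                               mode, FAILOVER_ORDER[0])    # audit log records the switch
            return response
        except Exception as exc:                           # GPU outage, timeout, etc.
            last_error = exc
    raise RuntimeError("all permitted modes unavailable") from last_error
```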
Fine-tuning and customization
Self-hosted enables true fine-tuning on your data - full access to the model weights, which frontier API providers do not offer. BrainPack manages fine-tuning pipelines, evaluation, and deployment of customized variants for workloads where it produces meaningful improvement.
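One common way to do this on open-weight models is LoRA adapters via Hugging Face's peft library; a minimal sketch follows (not BrainPack's actual pipeline, and the model id is only an example):

```python
# Minimal LoRA fine-tuning setup on an open-weight base model (sketch only).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% of the base weights
# Training itself (data pipeline, Trainer config, evaluation) is omitted here.
```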
Cost transparency and TCO modeling
We track utilization, cost per token across self-hosted vs other modes, and provide chargeback reports. When self-hosted is cheaper than alternatives, you see that. When it is more expensive, you see that too - and the orchestrator can be policy-tuned to optimize accordingly.
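A sketch of the per-mode unit-cost comparison behind a chargeback report - usage and cost figures are illustrative only:

```python
# Per-mode cost-per-token comparison for a single month (illustrative numbers).
monthly_tokens = {"self_hosted": 1_200_000_000, "zdr": 150_000_000, "public_cloud": 40_000_000}
monthly_cost   = {"self_hosted": 9_000.0,       "zdr": 1_800.0,     "public_cloud": 400.0}

for mode, tokens in monthly_tokens.items():
    per_million = monthly_cost[mode] / (tokens / 1_000_000)
    print(f"{mode:>12}: ${per_million:.2f} per million tokens")
# self_hosted wins here only because utilization is high; the same fixed $9,000
# spread over a tenth of the volume would be ten times the unit cost.
```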
Costs And Speed.
What You Actually Get.
Self-hosted is the slowest deployment mode to stand up and the cheapest unit cost at volume. Both statements come with caveats.
Time to first capability: GPU procurement, model selection, fine-tuning decisions, and infrastructure standup. No shortcut on the timeline.
Latency per call: open-weight models on dedicated GPU are competitive on most workloads. Reasoning-heavy tasks run slower than frontier closed models - the gap is closing but not closed.
Cost structure: fixed monthly cost, not pay-per-token. Below the break-even, it is the most expensive mode by a wide margin. Above it, the cheapest by a wide margin.
Break-even: the tokens-per-day volume - roughly 10 to 50 million, depending on workload mix - at which self-hosted becomes cheaper than pay-per-token APIs. BrainPack models this before deployment, not after the GPUs are ordered.
The real expense of self-hosted is not the GPU bill - it is reserved capacity sitting idle because the workload mix did not match the projection. The Govern layer routes spillover and bursty work to public cloud or ZDR automatically, keeping the dedicated infrastructure at the utilization the math assumed.
Self-Hosted, Fully Controlled.
Alongside Every Other Mode, Per Data Class.
Self-hosted is operating in production environments today, alongside other deployment modes for non-IP-sensitive workloads.
Self-hosted Llama running on dedicated GPU handles financial analysis on unannounced quarterly numbers. Public cloud handles marketing copy. ZDR handles individual customer interactions. Three modes, one operating layer.
Self-hosted Mistral processes supplier contract analysis where the negotiation positions cannot transit any third-party provider. Public cloud handles supply chain analytics. ZDR handles customer support cases.
Self-hosted Llama on hospital-owned GPUs handles clinical research data. ZDR handles administrative HR queries. Air-gapped handles classified compliance investigations. Three sensitivity levels, three appropriate modes.
When the Provider Cannot Be in the Data Path.
Self-hosted is the deployment mode for workloads where third-party AI providers are not acceptable - at any retention posture, for any duration, under any contract. Talk to an architect about which workloads in your environment require self-hosted, and how the orchestration policy should split work across all five modes.