Deployment Mode · 3 of 5
Self-Hosted LLMs.
Open-source frontier-class models - Llama, Mistral, Qwen, DeepSeek - running on dedicated GPU infrastructure that BrainPack operates on your behalf, or on GPUs you own. No Anthropic in the data path. No OpenAI. No Google. No Microsoft. Just the model, the GPU, your data, and the BrainPack operating layer. For workloads where the requirement is "no third-party AI provider can ever see this" - this is the deployment mode. We run the GPUs. We operate the models. You get the outcomes.
The Provider Cannot See Data It Never Receives.
ZDR is a contract. Public cloud is a configuration. Self-hosted is a different proposition entirely - there is no AI provider in the data path. The model runs on a GPU under our control or yours. The query never leaves the perimeter. The response never enters someone else's log file, even briefly. For data classes where the answer to "could the provider technically see this for 200 milliseconds during inference" is no, the deployment mode has to be self-hosted. There is no other answer.
Self-hosted used to mean "build it yourself, run it yourself, hope the talent stays" - which is why most enterprises avoided it. Open-source models were behind frontier capability by 12-18 months. GPU infrastructure was a six-figure CapEx decision. Operations required ML engineers most enterprises could not retain. The math did not work outside a few specific industries.
In 2026, the math has changed. Open-source models - Llama, Mistral, Qwen, DeepSeek - have closed most of the capability gap on the workloads that actually matter. Managed self-hosted services run the GPU layer for you. The capability gap is now a few months for cutting-edge tasks and zero for production workloads. The GPU economics work for any enterprise above the smallest scale. The last remaining barrier is operational complexity - and BrainPack handles that as part of the managed layer.
A Control Boundary Decision, Not a Vendor Preference.
Self-hosted means running inference on infrastructure that BrainPack or you control directly, with no third-party AI provider involved. The data goes from your environment to the GPU and back. Nothing else is in the data path. No AI vendor sees the prompt, the response, or the model's reasoning because no AI vendor is part of the call.
The models in this category are open-weight only: Llama (Meta), Mistral, Qwen (Alibaba), DeepSeek, and the long tail of fine-tuned variants built on top of them. The frontier closed models (Claude, GPT, Gemini) are not available self-hosted; their providers do not release weights. For most production workloads, the capability gap is now small. For some cutting-edge tasks - deep research, specialized reasoning - it remains real.
Self-hosted is not automatically the right choice. GPUs reserved for your inference carry a lower unit cost per token at sufficient volume, but below that volume the hardware sits idle and costs more. The break-even is workload-dependent, generally somewhere between 10 million and 50 million tokens per day. Self-hosted is appropriate for some data classes and uneconomic for others. The deployment decision is a control-boundary-and-volume decision, not a vendor-trust decision.
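To make that arithmetic concrete, here is a minimal back-of-envelope sketch of the break-even, using illustrative figures (not BrainPack or vendor pricing) that happen to land inside the 10-50 million range:

```python
# Back-of-envelope break-even between a reserved GPU and pay-per-token APIs.
# Both figures below are illustrative assumptions, not BrainPack or vendor pricing.

GPU_MONTHLY_COST_USD = 9_000        # assumed: one reserved GPU node, fully loaded
API_PRICE_PER_M_TOKENS_USD = 10.00  # assumed: blended frontier-API rate per million tokens

def breakeven_tokens_per_day(gpu_monthly_cost: float, api_price_per_m: float) -> float:
    """Daily volume at which the fixed GPU cost equals the pay-per-token bill."""
    tokens_per_month = gpu_monthly_cost / api_price_per_m * 1_000_000
    return tokens_per_month / 30

print(f"{breakeven_tokens_per_day(GPU_MONTHLY_COST_USD, API_PRICE_PER_M_TOKENS_USD):,.0f} tokens/day")
# -> 30,000,000 tokens/day with these toy numbers; real break-evens shift with
#    model size, utilization, and the actual API rates in the contract.
```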
BrainPack treats Self-Hosted as one execution surface among five. The Connect, Orchestrate, and Govern layers do not change. What changes is where the inference actually executes and the fact that no AI vendor is in the data path at all.
When Self-Hosted Is The Right Mode.
Six Workloads Where It Wins.
Six workload categories where self-hosted is the appropriate answer - and where ZDR or on-premise are usually the alternatives BrainPack also supports.
CORE IP AND COMPETITIVE TECHNOLOGY
Source code for products you sell, proprietary algorithms, trade secrets, manufacturing process documentation, R&D pipelines. The data defines your competitive position. "The provider saw it for 200 milliseconds and discarded it" is not the answer - the data should never have left your control. Self-hosted is the answer.
PRE-ANNOUNCEMENT MATERIAL FINANCIAL INFORMATION
Quarterly earnings before release, M&A documents, board materials, executive compensation discussions, trading strategies. Even ZDR is too permissive - the data goes through a third-party provider, however briefly. Self-hosted eliminates the provider entirely.
HIGH-VOLUME WORKLOADS WHERE TOKEN ECONOMICS MATTER
Customer service automation processing millions of interactions per day. Internal knowledge agents serving thousands of employees. Document processing pipelines handling tens of thousands of documents. Above ~10-50M tokens per day, dedicated GPU becomes cheaper per token than frontier API rates - and the savings compound.
WORKLOADS WHERE LATENCY MATTERS
Self-hosted models on dedicated GPUs serve responses with predictable latency, no public-cloud noisy-neighbor effects, no rate-limit waits, no provider-side queueing. For real-time voice, sub-second agent loops, or high-frequency analytical workloads, self-hosted often outperforms cloud APIs.
SOVEREIGN AI REQUIREMENTS
Workloads where the data must demonstrably never leave a specific national jurisdiction, must be processed on infrastructure owned by domestic entities, or must comply with sovereignty rules that cloud AI providers cannot satisfy. Self-hosted on regional GPU infrastructure addresses this.
"NO US TECH COMPANY IN THE DATA PATH" REQUIREMENTS
Some defense, financial, healthcare, and government workloads in non-US jurisdictions explicitly preclude US-headquartered AI providers from the inference path. Self-hosted with European or other open-source models on regional infrastructure is the answer.
When Self-Hosted Is The Wrong Mode.
And Where The Workload Should Go Instead.
Five workload categories where self-hosted is the wrong answer and where BrainPack routes work to public cloud, ZDR, on-premise, or air-gapped instead.
Low-Volume Or Unpredictable Workloads
Workloads that do not clear the 10–50M tokens-per-day break-even leave dedicated GPUs sitting idle. The economics invert: pay-per-token public cloud or ZDR is cheaper, sometimes by a factor of ten. Reserve self-hosted for steady, high-throughput pipelines, not bursty experimentation.
Frontier-Capability Tasks
Deep research, complex multi-step reasoning, the newest multimodal capabilities, advanced coding agents. Open-weight models have closed most of the gap, but on the frontier edge, Claude, GPT, and Gemini still lead - sometimes by a generation. If the workload genuinely needs that capability and the data class permits, public cloud or ZDR is the right surface.
Data Subject To Strict Residency Or Air-Gap Rules
Self-hosted in BrainPack's infrastructure still means a data center somewhere - likely not in the jurisdiction the regulator cares about, and not air-gapped. Defense classifications, banking sovereignty rules, and government workloads under FedRAMP High require on-premise or air-gapped deployment in the specified region. Self-hosted on shared infrastructure does not satisfy those requirements.
General Productivity Work
Drafting emails, summarizing public documents, brainstorming, code completion on non-sensitive repos. The data class does not demand the control boundary, the volume rarely justifies the GPU reservation, and the model selection is narrower. Public cloud does this work faster, cheaper, and on better models.
Workloads Where Time-To-Capability Matters More Than Control
A pilot that needs to ship in a week. A new use case where the team is still validating whether AI solves the problem at all. Self-hosted requires GPU procurement, model selection, fine-tuning decisions, and operational standup. Public cloud ships in days. Validate first, then graduate to self-hosted if the data class and volume support it.
How Self-Hosted Orchestrates.
With Every Other Deployment Mode.
Self-hosted is rarely the only deployment mode in a real enterprise. It runs alongside public cloud, ZDR, on-premise, and air-gapped - each handling the workloads that fit it best. The Govern layer routes automatically.
A real BrainPack deployment looks like this:
Same user. Same conversational interface. Same agent library. Same governance policies. Five different inference paths — selected automatically by the Govern layer based on data classification, regulatory framework, and policy.
The user never picks the deployment mode. The mode picks itself.
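A minimal sketch of what that policy-driven selection can look like in code - the classification labels and mode names here are hypothetical, not BrainPack's actual schema:

```python
# Sketch of policy-based mode selection from a data classification.
# Labels and mode names are hypothetical illustrations, not BrainPack's API.

DEPLOYMENT_POLICY = {
    "public":           "public_cloud",
    "internal":         "zdr",
    "customer_pii":     "zdr",
    "core_ip":          "self_hosted",
    "pre_announcement": "self_hosted",
    "sovereign":        "self_hosted",
    "classified":       "air_gapped",
}

def select_mode(data_class: str) -> str:
    """Pick the inference path from the workload's data classification."""
    # Unknown classifications fail closed to the most restrictive mode.
    return DEPLOYMENT_POLICY.get(data_class, "air_gapped")

assert select_mode("core_ip") == "self_hosted"
assert select_mode("unlabelled") == "air_gapped"
```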
Self-Hosted Inside the BrainPack Layer.
What BrainPack Adds On Top Of A Raw API Call.
Running an open-source LLM on a GPU is technically straightforward. Running it as production infrastructure with the operational rigor enterprise requires is not. Several things BrainPack does on top of the GPU layer make self-hosted production-grade.
GPU procurement, sizing, and lifecycle management
We size the GPU capacity to your workload mix, procure the hardware (or operate yours), handle firmware updates, manage utilization, and replace hardware as it ages. You do not run a GPU operations team.
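As a rough illustration of the sizing step, the capacity estimate reduces to a few lines - the throughput and headroom figures below are assumptions, not measured numbers:

```python
# Rough capacity sizing: how many GPUs a steady workload needs.
import math

def gpus_needed(tokens_per_day: float,
                tokens_per_sec_per_gpu: float = 1_500,  # assumed sustained throughput per GPU
                peak_to_avg: float = 3.0,               # assumed peak-hour multiplier
                headroom: float = 0.7) -> int:          # target utilization ceiling
    avg_tokens_per_sec = tokens_per_day / 86_400
    peak_tokens_per_sec = avg_tokens_per_sec * peak_to_avg
    return math.ceil(peak_tokens_per_sec / (tokens_per_sec_per_gpu * headroom))

print(gpus_needed(30_000_000))    # ~1 GPU at these assumptions
print(gpus_needed(300_000_000))   # ~10 GPUs
```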
Model evaluation, selection, and migration
New open-source models ship monthly. We evaluate each on your specific workload patterns, deploy the production-ready ones, and migrate workloads when newer models outperform on your tasks. Your AI capability does not freeze the moment a model is deployed.
Observability and incident response
Every inference call is monitored. Latency anomalies trigger alerts. Quality regressions surface before they impact users. Failed inferences are diagnosed and resolved. The operational maturity is comparable to a mature SaaS - not a research environment.
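As a toy illustration of the latency-anomaly check, assuming a rolling window of recent call latencies is kept per model:

```python
# Toy latency-anomaly detector: flag a call that sits far outside the recent distribution.
from statistics import mean, stdev

def is_latency_anomaly(recent_ms: list[float], current_ms: float, sigmas: float = 3.0) -> bool:
    if len(recent_ms) < 30:                            # not enough history to judge
        return False
    mu, sd = mean(recent_ms), stdev(recent_ms)
    return current_ms > mu + sigmas * max(sd, 1.0)     # floor the noise estimate at 1 ms
```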
Multi-model routing within self-hosted
The orchestrator routes within self-hosted too. A simple lookup goes to a 7B parameter model on minimal GPU. A complex reasoning task goes to a 70B parameter model. A coding task goes to Qwen-Coder. The routing is transparent to users; the cost optimization is significant.
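Sketched as a routing table - the model names and task labels are illustrative, not a fixed BrainPack catalog:

```python
# Size-tiered routing inside the self-hosted surface (illustrative models and labels).
SELF_HOSTED_ROUTES = {
    "lookup":    "llama-3.1-8b",       # simple retrieval-and-answer on minimal GPU
    "reasoning": "llama-3.1-70b",      # multi-step analysis on the larger tier
    "coding":    "qwen2.5-coder-32b",  # code-specialized open-weight model
}

def route_within_self_hosted(task_type: str) -> str:
    # Default up, not down: unknown task types get the most capable tier.
    return SELF_HOSTED_ROUTES.get(task_type, "llama-3.1-70b")
```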
Failover to other modes
If a self-hosted GPU has an outage, the orchestrator fails over to ZDR endpoints automatically - preserving the contractual posture as much as possible while keeping the AI available. The user does not see the failover; the audit log records it.
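A simplified sketch of that failover path, assuming hypothetical per-mode endpoint clients that expose a `complete()` method:

```python
# Sketch of mode failover with audit logging; the endpoint clients are hypothetical.
import logging

logger = logging.getLogger("brainpack.failover")   # hypothetical logger name

FAILOVER_ORDER = ["self_hosted", "zdr"]            # never silently fall back to public cloud

def run_with_failover(prompt: str, endpoints: dict) -> str:
    last_error = None
    for mode in FAILOVER_ORDER:
        try:
            response = endpoints[mode].complete(prompt)    # hypothetical client call
            if mode != FAILOVER_ORDER[0]:
                logger.warning("failover: served by %s instead of %s",
                               mode, FAILOVER_ORDER[0])    # audit log records the switch
            return response
        except Exception as exc:                           # GPU outage, timeout, etc.
            last_error = exc
    raise RuntimeError("all permitted modes unavailable") from last_error
```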
Fine-tuning and customization
Self-hosted enables true fine-tuning on your data - full access to the model weights, which frontier API providers do not offer. BrainPack manages fine-tuning pipelines, evaluation, and deployment of customized variants for workloads where it produces meaningful improvement.
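One common way to do this on open-weight models is LoRA adapters via Hugging Face's peft library; a minimal sketch follows (not BrainPack's actual pipeline, and the model id is only an example):

```python
# Minimal LoRA fine-tuning setup on an open-weight base model (sketch only).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% of the base weights
# Training itself (data pipeline, Trainer config, evaluation) is omitted here.
```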
Cost transparency and TCO modeling
We track utilization, cost per token across self-hosted vs other modes, and provide chargeback reports. When self-hosted is cheaper than alternatives, you see that. When it is more expensive, you see that too - and the orchestrator can be policy-tuned to optimize accordingly.
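A sketch of the per-mode unit-cost comparison behind a chargeback report - usage and cost figures are illustrative only:

```python
# Per-mode cost-per-token comparison for a single month (illustrative numbers).
monthly_tokens = {"self_hosted": 1_200_000_000, "zdr": 150_000_000, "public_cloud": 40_000_000}
monthly_cost   = {"self_hosted": 9_000.0,       "zdr": 1_800.0,     "public_cloud": 400.0}

for mode, tokens in monthly_tokens.items():
    per_million = monthly_cost[mode] / (tokens / 1_000_000)
    print(f"{mode:>12}: ${per_million:.2f} per million tokens")
# self_hosted wins here only because utilization is high; the same fixed $9,000
# spread over a tenth of the volume would be ten times the unit cost.
```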
Costs And Speed.
What You Actually Get.
Self-hosted is the slowest deployment mode to stand up and the cheapest unit cost at volume. Both statements come with caveats.
Time to first capability: GPU procurement, model selection, fine-tuning decisions, and infrastructure standup. No shortcut on the timeline.
Latency per call: open-weight models on dedicated GPU are competitive on most workloads. Reasoning-heavy tasks run slower than frontier closed models - the gap is closing but not closed.
Cost structure: fixed monthly cost, not pay-per-token. Below the break-even, it is the most expensive mode by a wide margin. Above it, the cheapest by a wide margin.
Break-even: the tokens-per-day volume - roughly 10 to 50 million, depending on workload mix - at which self-hosted becomes cheaper than pay-per-token APIs. BrainPack models this before deployment, not after the GPUs are ordered.
The real expense of self-hosted is not the GPU bill - it is reserved capacity sitting idle because the workload mix did not match the projection. The Govern layer routes spillover and bursty work to public cloud or ZDR automatically, keeping the dedicated infrastructure at the utilization the math assumed.
Self-Hosted, Fully Controlled.
Alongside Every Other Mode, Per Data Class.
Self-hosted is operating in production environments today, alongside other deployment modes for non-IP-sensitive workloads.
Self-hosted Llama running on dedicated GPU handles financial analysis on unannounced quarterly numbers. Public cloud handles marketing copy. ZDR handles individual customer interactions. Three modes, one operating layer.
Self-hosted Mistral processes supplier contract analysis where the negotiation positions cannot transit any third-party provider. Public cloud handles supply chain analytics. ZDR handles customer support cases.
Self-hosted Llama on hospital-owned GPUs handles clinical research data. ZDR handles administrative HR queries. Air-gapped handles classified compliance investigations. Three sensitivity levels, three appropriate modes.
When the Provider Cannot Be in the Data Path.
Self-hosted is the deployment mode for workloads where third-party AI providers are not acceptable - at any retention posture, for any duration, under any contract. Talk to an architect about which workloads in your environment require self-hosted, and how the orchestration policy should split work across all five modes.