
AI on Kubernetes: How Cloud Native Infrastructure Becomes the Foundation for Intelligent Applications


AI workloads don't arrive on a blank slate. In the real world, they typically land on multi-cluster, multi-cloud Kubernetes environments that are already straining under their own operational complexity. The question organizations face today is not whether to run AI on Kubernetes. After all, 36% of cloud native developers are already doing it. The real question is whether their platform infrastructure can keep up with the pace of AI adoption, or whether the GPU layer will expose every gap the platform layer has been quietly accumulating over the years.

Kubernetes Is Already the AI Substrate

The CNCF's most recent cloud native survey counted 15.6 million cloud native developers. Of those, 52%, more than 7.1 million people, are doing AI/ML work. 36% are running AI workloads on Kubernetes in some form, with another 18% actively planning to. The convergence of Kubernetes and AI/ML is well underway.

This was not accidental. Kubernetes was well positioned to absorb the AI wave because AI infrastructure has to solve the same problems Kubernetes has spent years solving: resilient multi-node scheduling, declarative workload management, autoscaling, and self-healing. The cloud native ecosystem responded quickly: KServe, LLM-D, AIBrix, and a growing set of CNCF projects extended the Kubernetes API surface to cover AI/ML pipelines natively.

At the platform level, Kubernetes 1.34 and 1.35 marked an inflection point. Dynamic Resource Allocation (DRA) reached general availability, bringing GPU and hardware accelerator scheduling into the same declarative model as storage provisioning. For teams running heterogeneous GPU fleets across clouds, this is significant: workloads declare the devices they need as claims rather than competing for static node allocations, and the scheduler matches those claims against the accelerators that are actually available.
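To make the claim-based model concrete, here is a minimal sketch in Python that builds a ResourceClaim and a Pod that consumes it, then prints them as JSON that kubectl can apply. It assumes the `resource.k8s.io/v1` field layout; verify it against your cluster's API version. The device class name `gpu.example.com` and the container image are placeholders, not real defaults.

```python
import json

# Assumption: placeholder for whatever DeviceClass your DRA driver installs.
DEVICE_CLASS = "gpu.example.com"


def resource_claim(name, device_class=DEVICE_CLASS, count=1):
    """Build a ResourceClaim asking for `count` devices of one class."""
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    {
                        "name": "gpu",
                        "exactly": {
                            "deviceClassName": device_class,
                            "allocationMode": "ExactCount",
                            "count": count,
                        },
                    }
                ]
            }
        },
    }


def pod_using_claim(pod_name, claim_name, image):
    """Build a Pod whose single container consumes the named claim."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Refers to the pod-level entry in spec.resourceClaims.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
            "resourceClaims": [
                {"name": "gpu", "resourceClaimName": claim_name}
            ],
        },
    }


def to_manifest_json(objects):
    """Wrap objects in a v1 List so one file can hold all of them."""
    bundle = {"apiVersion": "v1", "kind": "List", "items": list(objects)}
    return json.dumps(bundle, indent=2)


if __name__ == "__main__":
    claim = resource_claim("train-gpus", count=2)
    # Hypothetical image name, for illustration only.
    pod = pod_using_claim("trainer", "train-gpus",
                          "registry.example.com/trainer:latest")
    print(to_manifest_json([claim, pod]))
```

The point of the sketch is the shape of the request: the Pod names what it needs and leaves placement to the scheduler, much as a PersistentVolumeClaim does for storage.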

The Infrastructure Is Harder Than It Looks

When enterprises first adopted cloud, they expected a clean, simple interaction between a data center and a cloud environment. What they actually operate today, however, looks very different: multiple cloud providers, edge environments, private cloud, and hybrid cloud, all stitched together with APIs that were never designed to work together. What started as one cluster has become thousands. Twenty-plus clusters per team is unremarkable.

That scale creates concrete operational problems. Without a centralized view, clusters drift. GitOps tooling, built for smaller fleet sizes, starts to buckle under the weight of multi-cloud environments with hundreds of clusters across regions. Visibility degrades as the environment grows. Teams trying to apply consistent policies across this sprawl find that neither the tooling nor the staffing model was designed for it.
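Drift is easier to reason about once it is written down as a comparison. The sketch below, in Python, checks each cluster's observed settings against one desired baseline and reports every mismatch. The setting names and cluster names are hypothetical, and a real fleet would pull the observed values from each cluster's API rather than from a literal dictionary.

```python
def find_drift(desired, observed_by_cluster):
    """Return {cluster: {setting: (desired, observed)}} for each mismatch.

    A setting that is missing on a cluster is reported with None as the
    observed value.
    """
    drift = {}
    for cluster, observed in observed_by_cluster.items():
        diffs = {
            key: (want, observed.get(key))
            for key, want in desired.items()
            if observed.get(key) != want
        }
        if diffs:
            drift[cluster] = diffs
    return drift


# Hypothetical baseline and fleet state, for illustration only.
DESIRED = {
    "kubernetes_version": "1.34",
    "gpu_operator": "enabled",
    "network_policy": "default-deny",
}

OBSERVED = {
    "prod-eu-1": dict(DESIRED),
    "prod-us-1": {**DESIRED, "kubernetes_version": "1.32"},
    "edge-7": {"kubernetes_version": "1.34", "gpu_operator": "enabled"},
}

if __name__ == "__main__":
    for cluster, diffs in find_drift(DESIRED, OBSERVED).items():
        for key, (want, got) in diffs.items():
            print(f"{cluster}: {key} is {got!r}, expected {want!r}")
```

With three clusters this is trivial. With hundreds, the same comparison has to run continuously and feed a central view, which is the part most fleets are missing.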

The AI layer compounds this. Provisioning a GPU instance from a major hyperscaler can take six weeks. Every YAML file you have to write, every driver you have to configure manually, and every cluster you have to provision by hand is time not spent building the application. Without automation at every layer, modern AI infrastructure is simply too slow to iterate on.

Compliance pressure is increasing alongside this. Sovereignty requirements, particularly in Europe with the Digital Operational Resilience Act (DORA) and related frameworks, are creating new audit, data residency, and security obligations that cut across the entire infrastructure stack. Meeting those requirements without consistent visibility across clusters is nearly impossible.

GPU Multi-Tenancy: The Problem the Industry Is Getting Wrong

High-end GPUs cost upward of $20,000 per card. No organization can afford to assign a dedicated GPU to every developer or team. Multi-tenancy is the practical reality. But the industry is treating it as a solved problem when it is not even well-defined.

Every GPU platform on the market today claims multi-tenant support. The word appears in every RFP, on every vendor slide, in every comparison matrix. The problem is that "multi-tenant" now means two completely different things depending on who is saying it, and the difference between those two things is the difference between a viable platform and a liability.

The first meaning is namespace-based isolation inside a shared Kubernetes cluster. This is what most platforms ship by default. It works by assigning workloads to separate namespaces with RBAC rules governing access. The second meaning is hardware-enforced isolation at the GPU, compute, and network layer: Multi-Instance GPU (MIG) partitioning that creates physically separated instances, cluster-per-tenant Kubernetes architecture, and network isolation enforced by a data processing unit (DPU) at the NIC level.

The operators building production GPU clouds have largely stopped treating these as equivalent. Namespace isolation is not an enforcement boundary. It is a naming convention with RBAC on top. At scale, the noisy-neighbor problem stops being an efficiency concern and becomes an SLA commitment the operator cannot make. In regulated verticals, namespace isolation is not a story the compliance team can take to an auditor. And in commercial GPUaaS, the difference between a single-tenant leasing model and a shared, governed, multi-tenant service model is the margin improvement that makes the business viable.

There are real tradeoffs in how a GPU is shared, too. Time slicing, where multiple workloads share a GPU through time-multiplexed access, creates noisy-neighbor problems because the workloads share the same memory space with no isolation between them. MIG avoids memory sharing but introduces fragmentation: because partitions come in fixed sizes, some capacity can sit stranded depending on the workload mix. The right answer depends on workload characteristics, and it differs between NVIDIA and AMD hardware. Platform teams building multi-tenant GPU infrastructure need an explicit orchestration strategy for this, not just Kubernetes defaults.
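The fragmentation cost of fixed-size partitions is easy to underestimate, so here is a small worked sketch in Python. It assumes an 80 GB card that can be split into partitions of 10, 20, 40, or 80 GB, and it packs workloads by memory only, using first-fit decreasing. The partition sizes are illustrative. Real MIG profiles also constrain compute slices and placement, which this model ignores.

```python
# Assumption: illustrative partition sizes for a single 80 GB card.
PARTITION_SIZES_GB = (10, 20, 40, 80)
GPU_MEMORY_GB = 80


def smallest_partition(need_gb):
    """Round a memory requirement up to the smallest partition that fits."""
    for size in PARTITION_SIZES_GB:
        if need_gb <= size:
            return size
    raise ValueError(f"{need_gb} GB does not fit on one {GPU_MEMORY_GB} GB GPU")


def pack(workloads_gb):
    """First-fit decreasing. Returns one list of (need, partition) per GPU."""
    gpus = []
    for need in sorted(workloads_gb, reverse=True):
        part = smallest_partition(need)
        for gpu in gpus:
            if sum(p for _, p in gpu) + part <= GPU_MEMORY_GB:
                gpu.append((need, part))
                break
        else:
            # No existing GPU has room for this partition: open a new one.
            gpus.append([(need, part)])
    return gpus


def stranded_gb(gpus):
    """Memory paid for but not needed by any workload."""
    needed = sum(need for gpu in gpus for need, _ in gpu)
    return len(gpus) * GPU_MEMORY_GB - needed


if __name__ == "__main__":
    workloads = [12] * 6  # six workloads that each need 12 GB
    placement = pack(workloads)
    print(f"GPUs used: {len(placement)}")
    print(f"Stranded memory: {stranded_gb(placement)} GB")
```

Six 12 GB workloads need 72 GB in total, which would fit on one card if memory could be split freely. With fixed partitions each one rounds up to 20 GB, so only four fit per card and a second GPU is needed. That gap is what an orchestration strategy has to manage.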

The practical consequence is that multi-GPU orchestration needs to sit alongside multi-cluster orchestration as a first-class concern. When the evaluation gets technical and customers start asking how isolation is actually enforced, the answer determines whether the conversation moves to commercial terms or stalls.

The Open Source AI Stack Has Caught Up

Open source AI workloads on Kubernetes organize into three layers, each at a different maturity point. 

  • Training is the most established: PyTorch holds roughly 80% of the model training activity on Hugging Face, and the tooling around distributed training on Kubernetes is reasonably mature. 

  • Inference is where the current investment is concentrated. AIBrix, entering the CNCF ecosystem, provides a GenAI inference infrastructure layer focused on efficient serving at scale. LLM-D uses the Kubernetes Inference Gateway to build distributed inference with cache-aware routing and KV cache management. Both reflect the same insight: inference at scale is a distributed systems problem, and Kubernetes is the right substrate.

  • Agents are the frontier. KAgent provides a Kubernetes-native framework for orchestrating multi-agent systems, and MCP (Model Context Protocol) servers are becoming the standardization layer between agent runtimes and the tools they interact with.
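To see why the inference bullet above calls serving a distributed systems problem, consider routing. A replica that has already processed a prompt prefix can reuse its KV cache, so where a request lands affects both latency and cost. The Python sketch below prefers the replica with the longest cached prefix and breaks ties by load. It is an illustration of the general idea, not the algorithm used by LLM-D or AIBrix, and the replica data structure is hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt_tokens, replicas):
    """Choose a replica name.

    `replicas` maps a name to {"cached": [token lists], "in_flight": int}.
    Preference order: longest cached prefix, then fewest in-flight
    requests, then name (to keep the choice deterministic).
    """
    def score(name):
        replica = replicas[name]
        best = max(
            (shared_prefix_len(prompt_tokens, c) for c in replica["cached"]),
            default=0,
        )
        return (-best, replica["in_flight"], name)

    return min(replicas, key=score)


# Hypothetical fleet state, for illustration only.
REPLICAS = {
    "pod-a": {"cached": [[1, 2, 3, 4]], "in_flight": 5},
    "pod-b": {"cached": [[1, 2]], "in_flight": 0},
    "pod-c": {"cached": [], "in_flight": 0},
}

if __name__ == "__main__":
    print(pick_replica([1, 2, 3, 9], REPLICAS))  # shares a prefix with pod-a
    print(pick_replica([7, 8], REPLICAS))        # no cache hit anywhere
```

A production router would also weigh queue depth against cache benefit rather than always chasing the longest prefix, since a busy replica with a warm cache can still be the slower choice.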

The tooling exists. The challenge now is how to assemble it coherently.

Building the AI-Native Platform

Running AI workloads well takes more than just having available GPUs. Platform teams need a comprehensive framework that covers the full lifecycle: 

  • Developer experience - Internal self-service portals so developers do not need to understand the full CNCF landscape

  • Security - Software supply chain assurance and policy enforcement before AI workloads hit production, not retrofitted afterward

  • CI/CD and infrastructure-as-code foundations

  • Resilience engineering that accounts for AI-specific failure modes like GPU job preemption and inference latency spikes

  • Cost and observability with GPU-aware attribution

These layers are not optional. Without them, teams have no way of ensuring reliability and economic discipline.
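GPU-aware cost attribution, the last item in the list above, can be sketched in a few lines of Python. The example assumes you already have GPU-hours per tenant for a billing period and a flat hourly rate. It charges each tenant for what it used plus a proportional share of idle capacity. That split is one possible policy, not a standard, and the tenant names and rate are made up.

```python
def attribute_gpu_cost(gpu_hours_by_tenant, total_gpu_hours, hourly_rate):
    """Split the full cost of a GPU pool across tenants.

    Each tenant pays for its own GPU-hours plus a share of idle hours
    proportional to its usage, so the charges add up to the total cost
    of the pool whenever at least one tenant used it.
    """
    used = sum(gpu_hours_by_tenant.values())
    if used > total_gpu_hours:
        raise ValueError("tenants used more GPU-hours than the pool provides")
    idle = total_gpu_hours - used
    costs = {}
    for tenant, hours in gpu_hours_by_tenant.items():
        share = hours / used if used else 0.0
        costs[tenant] = round((hours + idle * share) * hourly_rate, 2)
    return costs


if __name__ == "__main__":
    # Hypothetical numbers: a 100 GPU-hour pool billed at 2.00 per hour.
    usage = {"team-a": 60, "team-b": 20}
    for tenant, cost in attribute_gpu_cost(usage, 100, 2.0).items():
        print(f"{tenant}: {cost:.2f}")
```

Making idle capacity visible in each tenant's bill is what creates the economic discipline: a pool that sits 20% idle shows up as a cost someone owns, not as background noise.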

This blog is based on a talk by Bharath Nallapeta, Prithvi Raj, and Satyam Bhardwaj at India Impact AI Summit 2026. Watch a recording of their presentation below.

k0rdent: Composable AI-Ready Kubernetes in 15 Minutes

It can take weeks to manually set up AI-ready infrastructure, including NVIDIA drivers, GPU operators, cluster configuration, service mesh, ingress, and monitoring. Every week spent on platform setup is a week not spent building the application. k0rdent is Mirantis's open source platform for collapsing that timeline, and its defining design principle is composability.

Composability is practical and necessary. A development team needs a 4-node test cluster. A QA team needs a 10-node cluster. A chaos engineering team needs 4 nodes distributed across cloud regions for realistic stress testing. If your platform cannot serve all of these from shared templates with different configuration values, you end up hand-crafting solutions for each team. That does not scale.
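The "shared templates, different values" idea can be sketched as a deep merge, shown here in Python. One base definition serves all three teams from the paragraph above, and each team supplies only the values that differ. The template name, field names, and region names are hypothetical and are not k0rdent's actual schema.

```python
import copy

# Hypothetical shared template: every team starts from this.
BASE_TEMPLATE = {
    "template": "gpu-cluster",
    "config": {
        "regions": ["region-1"],
        "workers": {"count": 4, "gpu_type": "T4"},
    },
}

# Each team overrides only what it needs to change.
TEAM_OVERRIDES = {
    "dev": {},
    "qa": {"config": {"workers": {"count": 10}}},
    "chaos": {"config": {"regions": ["region-1", "region-2"]}},
}


def deep_merge(base, overrides):
    """Return a new dict: `base` with `overrides` applied recursively.

    Nested dicts are merged key by key; any other value (including a
    list) replaces the base value outright.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render(team):
    """Produce the cluster definition for one team."""
    return deep_merge(BASE_TEMPLATE, TEAM_OVERRIDES[team])


if __name__ == "__main__":
    for team in TEAM_OVERRIDES:
        cfg = render(team)["config"]
        print(team, cfg["workers"]["count"], cfg["regions"])
```

Because overrides are data rather than hand-edited copies, adding a fourth team or changing the accelerator type is a one-line change, and the base template stays the single place where defaults live.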

k0rdent splits the problem into three areas. 

  • Cluster management handles Day 0 provisioning using Cluster API (CAPI), the upstream Kubernetes standard for declarative cluster lifecycle management. CAPI providers exist for every major environment, from AWS to bare metal, so k0rdent works consistently regardless of where you are provisioning. 

  • State management handles Day 2 operations through Sveltos, an open source project that integrates with GitOps workflows and provides a service catalog model: install the services you need, nothing more. 

  • Observability rounds out the picture with Prometheus, Grafana, OpenCost, and OpenTelemetry bundled as part of the platform.

k0rdent avoids the infrastructure-as-code toolchain that most platform teams default to. Ansible, Terraform, and Crossplane are powerful, but they require specialists, and finding people with that expertise is a real hiring constraint. k0rdent's bet is simpler: if you know YAML, Helm charts, and kubectl, you have everything you need. Cluster templates are pre-built for each cloud provider. You select the template, specify your configuration, and the cluster comes up. End-to-end, from provisioning to a running AI workload, the process takes 15 to 20 minutes. All it takes to swap GPU types, e.g., from a T4 to an H100 or A100, is changing a single configuration value.

What Comes Next

After the infrastructure and GPU layers are built, the next frontier is integration: making clusters and the applications running on them capable of interacting directly with MCP servers, moving toward platforms that respond to intent rather than explicit configuration. k0rdent's roadmap runs in this direction, with integrated MCP support as the next layer above the Kubernetes and GPU foundation already in place.

For platform engineers building AI infrastructure today, the practical takeaway is that the open source ecosystem has caught up with demand. The challenge now is assembling it coherently, without having to rebuild the stack every time a team's requirements change. Composable, conformant, observable platforms should not be a future goal, but rather the baseline for shipping AI reliably today.

Learn more about open source k0rdent and Mirantis k0rdent AI for service providers and enterprises.

Benjamin Lam

Jr. Technical Marketing Engineer
