
GPU Infrastructure's 15-Minute Miracle: When Complexity Meets Composability


Let's be honest for a second. If you've ever tried to stand up a production GPU cluster for AI workloads, you know the drill. It's a multi-week, if not multi-month, odyssey through dependency hell, where every framework wants its own special-snowflake configuration, and by the time you're done, your best engineers have aged visibly.

At AI Infrastructure Field Day, we had the opportunity to show the industry a different way to build AI infrastructure. When our CTO Shaun O'Meara claimed we could spin up a fully operational NVIDIA Run.ai inference cluster in about 15 minutes, the room full of infrastructure experts was understandably interested.

The Problem We're All Facing

Here's what we're seeing across the industry: companies are dumping millions into GPU hardware, expecting magic. What they're getting instead is a pile of expensive silicon that sits there while teams struggle to make it actually useful. The gap between "we bought GPUs" and "we're running production AI workloads" is measured in months, not minutes.

The secret? It's not the hardware that's hard. It's everything else: the orchestration, the multi-tenancy, the resource scheduling, the observability stack. Oh, and making sure your $40,000 GPU isn't sitting idle running someone's hello-world experiment.

Our VP of Product Management, Kevin Kamel, opened our presentation by addressing three brutal realities we hear from customers every day:

  • Converting single-tenant GPU hardware into multi-tenant services is a nightmare

  • The talent shortage means you can't just hire your way out of the problem

  • Everyone expects hyperscaler-level experiences now - self-service portals, integrated observability, efficient monetization

We've spent years building infrastructure for some of the world's largest clouds. From our early days as OpenStack pioneers to stewarding Kubernetes and acquiring Docker Enterprise and Lens, we've been in the trenches of infrastructure complexity. GPU infrastructure is just the latest chapter - but arguably a critical one.

Showing, Not Just Telling

Our Product Marketing Specialist, Anjelica Ambrosio, took the stage to prove our point. No pre-baked environments, no smoke and mirrors - just a real deployment from scratch. Using the Mirantis k0rdent AI Cloud Service Provider portal, she:

  • Used the Customer Portal to create an inference cluster template defining a new AI-optimized host cluster (and showed how this could be added to the customer’s Marketplace as a one-click option).

  • Used the CSP Operator Portal Product Builder to create a new service: building an AI host cluster with metrics onboard, integrated with the provider’s Grafana front end.

  • Used the IaaS Portal to demonstrate bare metal provisioning and cluster creation within a tenant, with dashboards displaying integrated Grafana monitoring for Kubernetes and for VMs hosted on KubeVirt.

  • Used the GPU PaaS portal to deploy a complete inference cluster: selecting GPU nodes, configuring Kubernetes, adding Run.ai dependencies like Argo CD and Knative, and finishing by accessing the completed cluster and its running workloads through the automatically integrated Run.ai web UI.

Fifteen minutes. That's all it took. And most of that (almost 14 minutes) was waiting for AWS machines to boot and downloading the configuration and credential information needed to manage them.

All those painful dependencies that normally take weeks to configure - cert-manager, GPU operators, Argo Workflows - were automatically provisioned and configured. No manual YAML wrangling, no debugging version conflicts.
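To make that concrete, here's a minimal sketch of the kind of post-deploy sanity check an operator might run against the new cluster using the kubernetes Python client. The namespaces and deployment names below are illustrative assumptions about where these components typically land, not a statement of what k0rdent AI installs:

```python
# A minimal post-deploy sanity check, assuming kubeconfig access to the new
# cluster. Namespaces and deployment names are illustrative guesses - adjust
# them to match what your deployment actually installs.
from kubernetes import client, config

# Deployments we expect the platform to have provisioned automatically,
# keyed as namespace -> deployment name.
EXPECTED = {
    "cert-manager": "cert-manager",
    "argocd": "argocd-server",
    "gpu-operator": "gpu-operator",
}

def main():
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    for namespace, name in EXPECTED.items():
        dep = apps.read_namespaced_deployment(name=name, namespace=namespace)
        ready = dep.status.ready_replicas or 0
        wanted = dep.spec.replicas or 0
        status = "OK" if ready >= wanted else "NOT READY"
        print(f"{namespace}/{name}: {ready}/{wanted} replicas ready [{status}]")

if __name__ == "__main__":
    main()
```

The point isn't the script itself - it's that after the automated deployment, a check like this comes back green without anyone having hand-configured a single component.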

The Architecture Behind the Magic

Shaun walked through how we've organized the platform services layer above the GPU infrastructure. We didn't just automate the existing painful process - we fundamentally rethought it. Instead of forcing everyone to build from scratch, we've created composable service templates for training, inference, and data services.

The key insight? Services should be building blocks, not monoliths. They can be chained, extended, and validated without custom integration work for every new workload. When we demonstrated adding Run.ai to the cluster, it wasn't a special case requiring custom work - it was just another building block from our catalog.
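To illustrate the idea (and only the idea - this is a toy model, not k0rdent's actual API), here's what "services as building blocks" looks like in miniature: a catalog of composable blocks where adding Run.ai simply pulls in whatever it depends on, in order:

```python
# A conceptual sketch of the "building blocks" idea - NOT the k0rdent API,
# just an illustration of composing services from a catalog rather than
# hand-integrating each one.
from dataclasses import dataclass, field

@dataclass
class ServiceBlock:
    """One catalog entry: a deployable service plus the blocks it needs."""
    name: str
    depends_on: list[str] = field(default_factory=list)

CATALOG = {
    "cert-manager": ServiceBlock("cert-manager"),
    "argo-cd": ServiceBlock("argo-cd"),
    "knative": ServiceBlock("knative", depends_on=["cert-manager"]),
    "run-ai": ServiceBlock("run-ai", depends_on=["argo-cd", "knative"]),
}

def resolve(name: str, ordered: list[str] | None = None) -> list[str]:
    """Return an install order with dependencies ahead of dependents."""
    ordered = ordered if ordered is not None else []
    for dep in CATALOG[name].depends_on:
        resolve(dep, ordered)
    if name not in ordered:
        ordered.append(name)
    return ordered

# Adding Run.ai is "just another block": the resolver pulls in what it needs.
print(resolve("run-ai"))  # ['argo-cd', 'cert-manager', 'knative', 'run-ai']
```

Swap a block out, add a new one to the catalog, and nothing upstream needs to change - that's the property that makes the approach hold up as workloads evolve.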

Our labeling system automatically tags GPU nodes during cluster creation, and Run.ai validates these labels to ensure workloads land where they belong. GPU workloads on GPU nodes, everything else on CPU nodes. Simple? Yes. But it's the kind of simple that only comes from learning what breaks in production.
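Here's a sketch of that label-driven placement pattern using the kubernetes Python client. The label key and node name are hypothetical stand-ins (our actual label scheme isn't shown here); nvidia.com/gpu is the standard resource name exposed by NVIDIA's device plugin:

```python
# A sketch of label-driven GPU placement, assuming kubeconfig access.
# The label key/value and node name are hypothetical, not the platform's
# actual labels.
from kubernetes import client, config

GPU_LABEL = {"example.com/accelerator": "nvidia-gpu"}  # hypothetical key

config.load_kube_config()
v1 = client.CoreV1Api()

# Tag a node as GPU-capable (the platform does this during cluster creation).
v1.patch_node("gpu-node-1", {"metadata": {"labels": GPU_LABEL}})

# A pod that must land on a GPU node: the nodeSelector mirrors the label,
# and the resource limit requests an actual device from the NVIDIA plugin.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-worker"),
    spec=client.V1PodSpec(
        node_selector=GPU_LABEL,
        containers=[
            client.V1Container(
                name="worker",
                image="nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)
```

Pods without the selector and GPU limit schedule onto CPU nodes by default, which is exactly the "GPU workloads on GPU nodes, everything else on CPU nodes" behavior described above.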

Answering the Hard Questions

The Field Day delegates had many questions, and that's exactly what we wanted. Here's what real operators care about:

"Can you mix frameworks - say Run.ai with Kubeflow - in the same deployment?"

Absolutely. Our catalog approach means you can compose based on what teams actually need. Today it's Run.ai, tomorrow add MLflow, next week swap in something else. No architectural redesign required. It's Lego blocks for AI infrastructure.

"What about sovereign clouds and air-gapped deployments?"

We shared the story of Nebul, a sovereign AI cloud in the Netherlands. They were drowning - managing thousands of Kubernetes clusters, enforcing strict multi-tenancy, dealing with stranded GPU resources. After adopting k0rdent AI, their small team could focus on business growth instead of infrastructure firefighting. And yes, it works completely disconnected from the internet.

"How do you handle the skills gap?"

We've been here before. During the early OpenStack era, we helped enterprises build private clouds when nobody knew how. Same playbook, different decade - we offer everything from managed services to skills transfer, meeting organizations wherever they are on the expertise spectrum.

Composability: The Real Game-Changer

We're not trying to build the One Platform to Rule Them All. We're building Lego blocks for GPU infrastructure.

Our Product Builder demo showed this philosophy in action. An operator can log into the self-service portal and within minutes:

  • Create new cluster products

  • Set parameters

  • Deploy to an internal marketplace

  • Monitor everything with real-time observability dashboards

This isn't just about deployment speed. It's about being able to evolve your AI infrastructure without starting from scratch every time requirements change.

Making GPU Infrastructure a Business Asset

Let's talk about what really matters to the business. Every minute your GPU cluster isn't running production workloads is money burned. Our platform doesn't just deploy infrastructure - it makes it billable.

Whether you need internal chargeback or want to sell services externally, k0rdent AI transforms racks of GPUs into metered AI services. We've built in flexible pricing models too:

  • OPEX consumption-based pricing for clouds that want to pay as they grow

  • CAPEX-aligned licensing for enterprises with budget constraints

  • FedRAMP support for government contracts

This isn't a science project; it's a business platform designed for real-world deployment.

Why We Built This

We've been building and operating infrastructure for over a decade. We've seen every "revolutionary" automation platform that works great in demos and falls apart in production. That's why we didn't just automate the existing painful process - we redesigned it from first principles.

GPU infrastructure doesn't have to be special. It doesn't need its own unique operational model that only three people in your organization understand. It can be as consumable as traditional compute - if you approach it right.

See It For Yourself

The entire AI Infrastructure Field Day presentation is available to watch, including the full demo and technical deep dives. We believe in showing our work, not just talking about it.

If you're sitting on GPU hardware wondering why it's so hard to make it useful, or if you're a service provider trying to compete with hyperscalers without their army of engineers, let's talk. The question isn't whether you need better GPU infrastructure automation - you do. The question is whether you want to spend the next six months building it yourself or fifteen minutes deploying something that already works.

GPU complexity doesn't have to be your reality. We've proven it can be solved. Now it's time to put that solution to work in your environment.

Edward Ionel

Head of Growth
