Vivid Planet Software GmbH
Challenges using Azure Kubernetes Service
Description
In his devjobs.at TechTalk, Daniel Karnutsch of Vivid Planet Software discusses key issues that arise when using Azure Kubernetes Service – infrastructure as code, costs, and capacity planning.
Video Summary
In “Challenges using Azure Kubernetes Service,” Daniel Karnutsch (Vivid Planet Software GmbH) shares practical lessons from adopting managed AKS across infrastructure as code with Terraform, cost estimation, and capacity planning. He explains pitfalls from mismatched provider defaults (e.g., OS disk types) and how to change the default node pool VM size with zero downtime by temporarily swapping system node pools via the Azure UI/CLI and then reconciling Terraform state. Viewers learn to treat node spend as a lower bound, validate estimates with a near‑production setup and a fixed-versus-linear cost split, leverage FinOps, and proactively request quotas while planning for region shortages and reservation lock‑in.
Azure Kubernetes Service in the Real World: IaC Pitfalls, Cost Surprises, and Capacity Risks — Notes from “Challenges using Azure Kubernetes Service” by Daniel Karnutsch (Vivid Planet Software GmbH)
Why this talk matters to engineering teams
In “Challenges using Azure Kubernetes Service,” Daniel Karnutsch, a software engineer at Vivid Planet Software GmbH, unpacks three practical challenges his team faced adopting Azure Kubernetes Service (AKS): infrastructure as code, cost estimation, and capacity planning. Vivid Planet is a ~60-person software development company near Salzburg delivering custom solutions for large enterprises, mainly on the web. That context makes the talk resonate: AKS promises to lower operational burden for smaller teams without taking away the flexibility required for enterprise-grade workloads.
From our DevJobs.at vantage point, the session is refreshingly pragmatic. It shows where managed Kubernetes genuinely reduces complexity—and where it introduces new, less obvious trade-offs that teams must manage deliberately.
The decision space: Hosting Kubernetes across a flexibility/complexity spectrum
Daniel starts by mapping common options for running Kubernetes—useful for any team revisiting their platform strategy:
- On-premises, self-managed Kubernetes: maximum control from hardware to orchestration, with corresponding complexity and staffing needs.
- Self-managed Kubernetes in the public cloud: the cloud provider handles hardware; you retain full Kubernetes control. Flexible, but still complex to operate.
- Managed Kubernetes in the public cloud (e.g., Azure Kubernetes Service): the provider deploys and manages the control plane, offers sensible defaults, and takes over some administrative tasks. You keep enough flexibility for your requirements but lose direct access to control-plane components.
- Platform-as-a-Service (e.g., Red Hat OpenShift in the public cloud): more opinionated, often handles routing and similar concerns for you, at the cost of the least flexibility.
Vivid Planet chose managed Kubernetes in the public cloud—specifically AKS—because it strikes the right balance for a smaller team and brings one big benefit: easy access to other cloud services like managed databases or storage. Daniel also points out that managed Kubernetes has matured significantly. You can start by deploying applications while the provider manages the control plane and offers options like automatic Kubernetes version upgrades.
That said, some responsibilities remain firmly on your side even with AKS: routing/load balancing, multi-tenancy, and security setup are your job.
Challenge 1: Infrastructure as code—when provider defaults don’t match platform reality
Daniel uses infrastructure as code (IaC) to illustrate a broader theme with managed services: abstraction is helpful, but never perfectly aligned with how the underlying platform evolves. That is where friction—and outages if you’re unlucky—can originate.
Why IaC in the first place? A quick recap
- Version control: Infrastructure definitions are files; you gain pull requests, reviews, history, and contributor workflows—just like application code.
- Reproducibility: You can tear environments down and bring them back up; cloning configuration lets you spin up new environments with less risk and effort.
- Automation: Existing CI/CD pipelines can deploy infrastructure just as they deploy applications.
Vivid Planet uses Terraform (with Pulumi mentioned as another popular tool). IaC brings real benefits, but Daniel highlights two concrete friction points that teams should expect.
Pain point A: Diverging defaults in the Terraform provider
A consistent pattern: Terraform providers ship their own default values, and those values don’t necessarily match current cloud-provider defaults or recommendations. Daniel’s AKS example is instructive:
- The OS disk type default in the Terraform provider is “managed disks.”
- According to Daniel, these are “not recommended” because of lower performance; the team wanted “ephemeral disks.”
- If you rely on provider defaults without reviewing them, you might end up with a suboptimal platform configuration.
The practical implication is clear: IaC isn’t a replacement for understanding the platform’s current recommendations. If the Terraform provider lags behind platform changes, your desired configuration and your actual configuration can drift apart—silently.
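One way to guard against silently inherited defaults is to set the contested parameters explicitly in your Terraform configuration. A minimal sketch with the `azurerm` provider, where all names, sizes, and counts are placeholders, might look like this:

```hcl
# Sketch: pin the OS disk type explicitly rather than relying on the
# Terraform provider default ("Managed"). All identifiers are placeholders.
resource "azurerm_kubernetes_cluster" "main" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"

  default_node_pool {
    name         = "system"
    vm_size      = "Standard_D4s_v3"
    os_disk_type = "Ephemeral" # explicit; the provider would default to "Managed"
    node_count   = 3
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Writing the value out, even when it happens to match the current default, turns a silent assumption into a reviewable line in a pull request.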
Pain point B: Changes that Terraform says will destroy the cluster—but the platform can do zero-downtime
The second example concerns a common day-2 operation: changing the VM size of the default node pool in an AKS cluster. Daniel’s observation:
- A Terraform plan for this change proposes destroying the default node pool, which also destroys the entire cluster—implying downtime.
But AKS allows a zero-downtime path using platform tools:
- Use the Azure portal or CLI to create a temporary system node pool to serve as the default.
- Delete the old default node pool (you must always have at least one system node pool).
- Create a new default node pool with the desired VM size.
- Delete the temporary node pool.
Daniel reports that this can be done in production with zero downtime. The catch: Terraform doesn’t know you’ve done this. You must manually edit the Terraform state to reflect the new reality. After updating state, a Terraform plan will show no changes—desired and actual state are re-aligned.
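The swap described above can be sketched as an Azure CLI runbook. This is an illustrative fragment, not the exact procedure from the talk: resource names, pool names, and VM sizes are placeholders, and the state-reconciliation step is one of several possible approaches (editing pulled state, or removing and re-importing the resource).

```shell
RG=example-rg
CLUSTER=example-aks

# 1. Add a temporary system node pool so the cluster always keeps one.
az aks nodepool add --resource-group "$RG" --cluster-name "$CLUSTER" \
  --name temp --mode System --node-count 2

# 2. Delete the old default pool; workloads drain onto "temp".
az aks nodepool delete --resource-group "$RG" --cluster-name "$CLUSTER" \
  --name system

# 3. Recreate the default pool with the desired VM size.
az aks nodepool add --resource-group "$RG" --cluster-name "$CLUSTER" \
  --name system --mode System --node-vm-size Standard_D8s_v3 --node-count 3

# 4. Remove the temporary pool.
az aks nodepool delete --resource-group "$RG" --cluster-name "$CLUSTER" \
  --name temp

# 5. Reconcile Terraform state with the new reality, e.g. pull, edit the
#    default_node_pool attributes (and bump the serial), then push back.
terraform state pull > state.json
# ...edit vm_size in state.json to match the new pool...
terraform state push state.json
terraform plan   # should now report no changes
```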
The meta-lesson:
- Some changes may look impossible or destructive in Terraform.
- The underlying platform can still support them safely via native tools.
- Teams need runbooks for both: how to perform platform-native operations and how to reconcile IaC state afterward.
Challenge 2: Cost and cost estimation—beyond the naive node-only model
Daniel invokes a familiar truth that becomes painfully concrete with cloud bills:
“Prediction is very difficult, especially about the future.”
The naive estimate for AKS often goes like this: “We only pay for nodes.” You check VM prices, do the math, and declare victory. Daniel shows why that’s incomplete.
What else you end up paying for
- SLAs: If your customers require an SLA, you must back your components accordingly—and that costs money.
- Support plans: Many cloud providers limit standard support to billing-only topics. For technical assistance, you need an additional support plan.
- Networking specifics: Requirements like a static IP entail extra networking components. Those are often bandwidth-priced, making them hard to estimate up front.
- Observability: To understand resource usage in production, you need monitoring and logging. Those tools are usage-priced (by number of containers, log lines, etc.), which makes their cost difficult to predict early on.
Allocation is hard in multi-tenant or multi-project clusters
Even with a good total estimate, cost allocation is tricky:
- Node costs can be distributed based on resource requests.
- Linear, usage-based costs (networking, observability) still need to be split, and doing that fairly is difficult.
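A pro-rata split based on resource requests, the approach mentioned for node costs, can be sketched in a few lines. The function name and the figures below are illustrative, not from the talk, and the caveat stands: requests are a reasonable proxy for node spend but a poor one for bandwidth-priced components.

```python
def allocate_by_requests(shared_cost: float, requests: dict[str, float]) -> dict[str, float]:
    """Distribute a shared cost pro rata to per-tenant resource requests
    (e.g., requested CPU cores). Illustrative sketch, not a FinOps tool."""
    total = sum(requests.values())
    return {tenant: shared_cost * r / total for tenant, r in requests.items()}

if __name__ == "__main__":
    monthly_node_cost = 1000.0                        # illustrative figure
    cpu_requests = {"project-a": 6.0, "project-b": 2.0}  # requested cores
    print(allocate_by_requests(monthly_node_cost, cpu_requests))
```

For linear, usage-based costs you would need a different key (egress bytes, log volume), which is exactly where the fair-split problem Daniel describes begins.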
Daniel’s practical guidance
- Treat node costs as an absolute lower bound. Daniel’s concrete figure: in their experience, nodes accounted for about 50% of total cluster costs—so the real bill was roughly double the naive node-only estimate.
- If you can afford it, run a near-production setup for a longer period and use the observed costs as your baseline.
- Divide costs into fixed and linear parts: initial adoption incurs fixed costs, then costs scale linearly with each new application.
- Look to FinOps developments for cost assignment and savings opportunities.
The mental model shifts from “VM prices × node count” to “nodes = minimum; significant, usage-driven costs sit around them.” That helps engineering and finance meet in the middle with realistic expectations and fewer surprises.
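That mental model is simple enough to write down. The sketch below encodes Daniel's two rules of thumb; the 50% node share comes from his experience, while the fixed and per-app figures are placeholder assumptions:

```python
def estimate_total_cost(node_cost: float, node_share: float = 0.5) -> float:
    """Treat node spend as a lower bound: if nodes are ~50% of the bill
    (the share Daniel reports), the real total is roughly node_cost / node_share."""
    return node_cost / node_share

def fixed_plus_linear(fixed: float, per_app: float, n_apps: int) -> float:
    """Split costs into a fixed platform part and a part that scales
    linearly with each deployed application (illustrative model)."""
    return fixed + per_app * n_apps

if __name__ == "__main__":
    print(estimate_total_cost(500.0))       # naive node-only estimate, doubled
    print(fixed_plus_linear(300.0, 50.0, 4))  # placeholder fixed/per-app costs
```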
Challenge 3: Capacity planning—“It’s just somebody else’s computer”
The second quote Daniel cites is as blunt as it is accurate:
“There is no cloud. It’s just somebody else’s computer.”
Even hyperscale clouds run on finite, managed physical resources. What does this mean for AKS workloads?
Availability isn’t guaranteed—for node types or regions
Daniel highlights three scenarios:
- Specific VM types can become unavailable, either due to high demand or because the provider deprecates them in favor of newer generations.
- Entire regions can be “full.” Daniel notes he wouldn’t have imagined this a few years ago, but they have experienced it in practice.
- Reservations or committed usage may reduce your bill but also limit your flexibility: you’re tied to certain products or node types and can’t always jump to the newest options.
Mitigations—with trade-offs
- Migrating to a less-used region: possible, but at the cost of added latency. If your cluster is stateless and databases sit outside the cluster, moving the cluster without moving the database increases latency. The alternative—migrating databases as well—is non-trivial. Neither path is ideal.
- Increasing quotas early and generously: estimate how many nodes you’ll need, validate that you can actually use that amount, and keep headroom for cluster upgrades, new customers, or seasonal spikes.
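The quota arithmetic above can be made explicit. A minimal sketch, where the surge fraction and buffer are assumptions you would tune to your own upgrade strategy and growth forecast:

```python
import math

def quota_request(steady_nodes: int, max_surge_fraction: float, growth_buffer: int) -> int:
    """Quota to request up front: steady-state nodes, plus surge capacity
    for rolling cluster upgrades, plus headroom for new customers or
    seasonal spikes. All inputs are illustrative assumptions."""
    surge = math.ceil(steady_nodes * max_surge_fraction)
    return steady_nodes + surge + growth_buffer

if __name__ == "__main__":
    print(quota_request(steady_nodes=10, max_surge_fraction=0.33, growth_buffer=5))
```

The point is less the arithmetic than the habit: validate that the quota you hold actually matches the nodes you may need before the spike arrives.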
The common thread: public cloud capacity isn’t infinitely elastic in practice. Planning means doing the unglamorous work early so you have options when you need them.
Clear takeaways for engineers
Grounded in Daniel’s talk, we see five practical action areas:
1) Use IaC deliberately
- Don’t assume Terraform provider defaults match cloud-provider recommendations. Set critical parameters explicitly.
- Treat destructive Terraform plans as signals to explore platform-native procedures.
- Anticipate state drift: when you must change resources outside Terraform, plan to reconcile state manually.
2) Master the platform operations you’ll need
- For AKS-specific operations like swapping the default node pool, maintain runbooks using the portal or CLI—and procedures for bringing IaC state back in sync.
3) Be honest about costs
- Expect significant non-node costs: SLAs, support, networking, and observability.
- Use node costs as a lower bound, not the whole story.
- Validate estimates with a longer-running, near-production environment when possible.
- Separate fixed vs. linear costs; leverage FinOps practices for allocation and savings.
4) Treat capacity as a risk and a product attribute
- VM types can disappear or be over-subscribed; regions can be full.
- Reservations save money but trade away flexibility—enter them with eyes open.
- Proactively raise quotas and keep slack for upgrades, onboarding, and spikes.
5) Managed doesn’t mean hands-off
- AKS removes control plane burden, not your responsibilities for routing, multi-tenancy, security, and cost governance.
What “Challenges using Azure Kubernetes Service” leaves us with
Daniel Karnutsch’s session is a call for operational realism. AKS effectively reduces complexity where it hurts most—running and securing the control plane. But it simultaneously concentrates responsibility into three areas that teams must approach with discipline:
- IaC is an accelerator—but only if you recognize gaps between provider defaults and platform recommendations. The OS disk type example and the default node pool VM-size change demonstrate how attention to detail separates “downtime and rebuild” from “zero-downtime upgrade.”
- Costs are a system, not a single line item on a VM price list. Daniel’s 50% figure for node costs is a strong signal: without data from a longer-running, near-production setup, any estimate remains fragile.
- Capacity in public clouds is neither infinite nor guaranteed. Reservations lower bills but bind you. Region moves can restore capacity but introduce latency, particularly if databases remain out of cluster. Quotas raised early create room to maneuver when it matters.
For teams adopting or scaling AKS, this is encouraging and demanding in equal measure: managed Kubernetes has matured, and the real challenges are where architecture, operations, and cost models intersect—the very areas Daniel addresses head-on.
Closing: Focus where it counts
We leave “Challenges using Azure Kubernetes Service” with a pragmatic triad:
- Decide consciously: where does AKS abstract, and where do we remain accountable?
- Measure transparently: what costs actually arise, how do they distribute, and which assumptions survive contact with production?
- Plan ahead: what capacity risks could hit our setup, and what buffers give us room to maneuver?
Put differently, the two quotes Daniel cites stand as operating principles: predicting the future is hard—and the cloud is still someone else’s computer. Teams that internalize both can run AKS stably, scalably, and with fewer unpleasant surprises.