durchblicker.at
Out of the Dark, Into the Light
Description
In his devjobs.at TechTalk, Michael Karner of Durchblicker describes how the team moved its entire infrastructure to the Google Cloud Platform, and which surprises came up along the way.
Video Summary
In “Out of the Dark, Into the Light,” Michael Karner (durchblicker.at) explains the shift from a patchwork of multi-cloud VMs to mostly Google Cloud Run, anchored by Terraform-based IaC, centralized secrets, stronger logging/alerting, and clearly separated dev/stage/prod environments. He details AWS vs GCP POCs, the choice of GCP, hitting Cloud Run’s CPU throttling with RabbitMQ consumers, trying GKE Autopilot (too slow to scale), and ultimately resolving it with Cloud Run’s “CPU always on,” while migrating incrementally and dismantling a Jenkins-based frontend pipeline. Viewers can apply these lessons to prefer managed services, plan service-by-service migrations with Terraform and solid ops guardrails, and configure Cloud Run for background processing without wholesale rewrites.
From VM Graveyard to Cloud Run: How durchblicker.at modernized its platform on GCP
Context: A migration story grounded in production realities
In “Out of the Dark, Into the Light,” Michael Karner (durchblicker.at) walks through a multi-year journey: moving from a multi-cloud, VM-heavy setup to a platform where Google Cloud Run is the backbone. At DevJobs.at, we listened closely and distilled the technical narrative, decision points, and lessons that matter to engineering teams planning similar transformations.
durchblicker.at is Austria’s independent, market-leading online price comparison site. Founded in 2010, the company grew to over 80 employees and currently offers 28 comparisons. Earlier this year, it was sold to the NetRisk group, which operates leading online comparison portals in CEE. Karner has been a software developer since the late 1990s and joined durchblicker.at in 2019. He started on the backend, and now focuses on infrastructure, operations, and on-call.
The talk is a retrospective—no blame, just a clear-eyed account of why the team moved to GCP and how they made Cloud Run work at scale without a “big bang.”
Starting point: A decade of change meets an organically grown system
The early years (starting 2010) looked familiar: bare-metal, VMs, jQuery—while the ecosystem shifted dramatically around them:
- 2005: First version of Git
- 2013: Docker
- 2014: Kubernetes
- Mid-2010s: AWS Lambda, Google Cloud Functions
- 2019: AWS Fargate, Google Cloud Run
- 2021: GKE Autopilot
By 2020, the “startup archaeology” showed. “There’s a graveyard of ideas—mainly in our code base and infrastructure.” Concretely (early 2021):
- Around 120 Git repositories on a self-hosted GitLab; about 40 no longer used
- About 60 VMs and bare-metal servers across Google Compute Engine, DigitalOcean, and Hetzner
- DNS ran on AWS; a few S3 buckets held many terabytes of data
- MySQL databases in GCP and locally on VMs
- Around 20 core microservices
- Releasing “was a big fear” for some developers
A graphical commit history (2010–2021) told the same story: a lot of history, fragmentation, and eroding trust in the system’s single source of truth.
Pain points: DIY fatigue, noisy ops, and manual drift
Karner listed major pain points—“not exhaustive,” but painfully recognizable:
- Too much built in-house instead of using Platform-as-a-Service or Software-as-a-Service
- Secret management not handled optimally
- Build processes on Jenkins not optimal
- Logging and alerting “not good”; Graylog, Elastic, and Grafana were all run in-house
- Too much noise from non-errors—true errors were hard to find
- Constant manpower for security patches and OS updates
- New test systems slow to set up
- Loss of trust in the Ansible repo as the “single source of truth”; manual changes directly on VMs weren’t fed back into Ansible
- Certificate management wasn’t optimal
- Multi-cloud friction caused by network traffic between providers
- “No living DevOps culture,” and security issues
In short: tech debt, fragmented tooling, manual ops, and inconsistent automation weighed the team down.
Guardrails for the renewal
Against that backdrop, durchblicker.at set pragmatic, conservative guidelines:
- Fully migrate all core services into one cloud
- Make best-practice decisions—with consultants helping
- Containerize services
- No big bang: migrate service by service
- Conservative approach: migrate services “as-is” to isolate new bugs from old ones
- Central secret management as an essential building block
- Infrastructure as code as a must-have; no manual clicking through GUIs
- Three stages: development, staging, production
- Really good logging and error reporting
- Replace Jenkins with GitLab
These guardrails pointed the way: away from manual sprawl and toward a repeatable, declarative, observable delivery pipeline on a single platform.
Choosing the platform: POC in AWS and GCP, then a gut call
The team ran a dual POC: migrating a service to AWS (with a consultant) and to GCP (with help from Google). Findings:
- AWS CDK “was not optimal” for them; Terraform was preferred
- No clear platform winner at that time
In June 2021, they made a “more or less gut decision” for GCP. Reasons:
- The imperative vs. declarative split in infrastructure definitions (with Terraform on the declarative side)
- GCP had already worked well for durchblicker.at (load balancers and databases ran reliably there)
- If Kubernetes were ever needed, GCP's support for it looked better
They then engaged Haptic (Vienna) as their consulting partner. In July 2021, a joint kickoff with Haptic and Google set the migration in motion.
Planning vs reality: The underestimated roadmap
The original roadmap was “completely underestimated”: about a month of learning, then one to two services per week to Cloud Run—“done by October 2021.” Reality, as always, had a say. Complex service portfolios, conservative migration, and deep cleanup work take time.
The first pilot: “tariffs” under load
The first pilot was “tariffs,” the tariff calculation engine—selected because:
- It’s one of the older services (~11 years)
- It has few dependencies
- It has a lot of traffic—so failures show up quickly
By August 2021, containerization and Terraform were ready, and the Cloud Run service ran fine “for a few hours.” Then came the first major issue:
- Almost all services connect to RabbitMQ to handle asynchronous jobs.
- Cloud Run throttles CPU once a request ends. If a service is consuming from RabbitMQ during that time, jobs won’t be processed—throttled CPU means no worker cycles.
This was a deal-breaker for async processing. Two options emerged:
- Replace RabbitMQ with Pub/Sub and use HTTP push—would require rewriting all services
- Use GKE instead of Cloud Run and keep RabbitMQ
They tried GKE Autopilot. Load tests showed that scale-up was simply too slow. The team also didn’t want to run Kubernetes by hand; managed services were a priority.
Then the “miracle” happened. In September 2021, Google announced “CPU always on” for Cloud Run—a flag that keeps CPU on even when there’s no incoming request. That was the missing piece for RabbitMQ consumers. They migrated three services from Autopilot back to Cloud Run; one service was already on Cloud Run. Migration speed picked up significantly.
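To make the throttling problem concrete, here is a minimal sketch (ours, not code from the talk) of roughly the pattern described above: one container that serves HTTP and, in a background thread, consumes jobs from RabbitMQ. The broker host, queue name, and handler are illustrative assumptions. The point is that the consumer needs CPU even when no request is in flight, which is exactly what Cloud Run's default throttling takes away and what “CPU always on” (now labeled “CPU always allocated”) gives back.

```python
# Minimal sketch: an HTTP service that also consumes RabbitMQ jobs in the
# background. RABBITMQ_HOST and the "jobs" queue are illustrative names.
import os
import threading

import pika                      # RabbitMQ client
from flask import Flask, jsonify

app = Flask(__name__)


def handle_job(channel, method, properties, body):
    # Placeholder for real job processing.
    print(f"processing job: {body!r}")


def consume_jobs():
    # Long-lived consumer: it needs CPU even when no HTTP request is in flight.
    # With Cloud Run's default throttling this thread stalls between requests;
    # the "CPU always on" / "CPU always allocated" setting keeps it running.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host=os.environ.get("RABBITMQ_HOST", "localhost"))
    )
    channel = connection.channel()
    channel.queue_declare(queue="jobs", durable=True)
    channel.basic_consume(queue="jobs", on_message_callback=handle_job, auto_ack=True)
    channel.start_consuming()


@app.route("/healthz")
def healthz():
    return jsonify(status="ok")


if __name__ == "__main__":
    threading.Thread(target=consume_jobs, daemon=True).start()
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```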
End of 2021: Laying the foundations and gaining speed
By the end of 2021, a solid foundation was in place—scripted in Terraform:
- Organization and projects for stages in GCP
- IAM, org policies
- Billing alerts
- Alerting and Atlassian Opsgenie
- More than half of production services migrated
- A few MySQL databases migrated
- A few Cloud Functions introduced
- Using Cloud Storage and Cloud Scheduler
- Using Secret Manager
- Using Cloud Armor to limit access to some resources
- Cloud Debugger and Cloud Profiler in use
And the team still “had fun” working on the project—a rare and valuable signal mid-migration.
2022: Taming the frontend monster
January 2022 kicked off the frontend migration. The setup was… baroque:
- 11 Git repositories
- The legacy “old style” website
- A new Next.js React-based website
- Configuration data
- A service for publishing and removing downloadable PDF documents
- More than 100,000 JSON files and binaries
- Builds and deployments driven by Jenkins, which glued repos together, built artifacts, and rsynced to four servers
As a colleague put it: “It is a monster.”
It took half a year. Since June 1, 2022, the website has been running in production—“and we are still alive.”
Today’s picture: Managed, consistent, Cloud Run first
The “screenshot from tomorrow” looks strong: All services are managed by Google and running on Cloud Run. “Everything works fine.” Many of the goals have been reached, though “there’s still a lot to do.”
The verdict: Pros and cons after the journey
Honest reflections from the field:
Cons:
- “Google support is kind of adventurous.”
- “AWS seems to have more modern features than GCP”—GCP is “a little bit behind.”
- “GCP has a few limitations which are not understandable.”
- “The Google Cloud console app on smartphones is just weird”: “you can’t believe it’s really from Google.”
Pros:
- “Terraform is pretty cool.”
- “Working with Haptic is great.”
- “GCP is stable.”
- “Cloud Run is a real cool product.”
The team was a “gang of four” (plus two at times). And they are hiring.
Technical takeaways for engineering teams
The session yields concrete, practice-shaped lessons—anchored in what durchblicker.at actually did and discovered:
1) Prefer conservative, incremental migration
- Move services “as-is” to isolate new bugs from legacy issues.
- Avoid big-bang cutovers; migrate service by service.
2) One cloud, three environments
- Consolidation reduces multi-cloud friction (for example, network traffic considerations).
- Keep a consistent trio: development, staging, production.
3) Infrastructure as code—no exceptions
- Use Terraform to define org, projects, IAM, policies, billing alerts—make the platform reproducible.
- Manual changes on machines break trust in your source of truth; that was a real problem with Ansible drift.
4) Treat secret management as foundational
- Centralize secrets with the managed Secret Manager.
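As a rough illustration of what “centralized” means from a service’s point of view (our sketch, not code from the talk), reading a credential from Google Secret Manager at startup looks like this; the project and secret names are placeholders:

```python
# Minimal sketch: fetch a secret from Google Secret Manager at startup.
# Project ID and secret name are placeholders.
from google.cloud import secretmanager


def load_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")


if __name__ == "__main__":
    # e.g. a broker password that no longer lives in images, repos, or VMs
    rabbitmq_password = load_secret("my-gcp-project", "rabbitmq-password")
```

On Cloud Run the same secrets can also be exposed to a service as environment variables or mounted files, so credentials stay out of container images and Git history.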
5) Invest in observability early
- The pre-migration state had “not good” logging/alerting and too much noise from non-errors.
- Managed alerting and tools such as Opsgenie, Cloud Debugger, and Cloud Profiler help surface real issues.
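One low-effort piece of that toolbox is the profiler agent: a minimal sketch, assuming a Python service and the google-cloud-profiler package (the service name is ours, not from the talk):

```python
# Minimal sketch: enable Cloud Profiler at startup so CPU and heap profiles
# appear in the GCP console without running any profiling infrastructure.
import googlecloudprofiler

try:
    googlecloudprofiler.start(service="tariffs", service_version="1.0.0")
except (ValueError, NotImplementedError) as exc:
    # Profiling is best-effort; never block service startup on it.
    print(f"Cloud Profiler not started: {exc}")
```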
6) Simplify build/deploy paths
- Jenkins glue code that stitches repos together and rsyncs to servers is fragile.
- Replacing Jenkins with GitLab was an explicit goal.
7) Validate architecture choices under load
- RabbitMQ + Cloud Run exposed CPU throttling issues in real conditions.
- GKE Autopilot’s scale-up was too slow for their needs here.
- The arrival of “CPU always on” changed the calculus.
8) Managed-first mindset—without dogma
- “We didn’t want to run a Kubernetes cluster by hand.” Managed services were preferred.
- When Autopilot didn’t meet the scale-up need, Cloud Run with “CPU always on” became the right fit.
9) Bring in partners and anchor with a kickoff
- Haptic (Vienna) plus a kickoff with Google in July 2021 accelerated alignment and execution.
10) Be honest about roadmaps
- The original schedule was “a complete underestimation.” Recognizing that early is valuable.
A practical checklist we took away
- Have you listed concrete pain points, including monitoring noise and manual drift?
- Are you using managed services wherever they make sense, instead of building everything yourself?
- Is your secret management centralized and enforced?
- Is IaC truly your single source of truth—or do manual changes still creep into servers?
- Do you run three consistent environments (dev/staging/prod) with clear policies?
- Is there a path to move away from brittle Jenkins glue?
- Have you picked a high-traffic, low-dependency pilot to surface issues quickly?
- Do you understand the limits and modes of your chosen cloud features (e.g., CPU modes in Cloud Run)?
- Have you measured scale-up time under load—not just in idle conditions?
- Did you leave room in the roadmap for learning and surprises?
Lines that stick
“There’s a graveyard of ideas—mainly in our code base and in our infrastructure.”
“Cloud Run throttles CPU when the request ends. If you’re consuming from RabbitMQ, jobs won’t be handled anymore.”
“We didn’t want to run a Kubernetes cluster by hand.”
“Some kind of miracle would be nice—and the miracle happened: CPU always on for Cloud Run.”
“It is a monster.” (on the old frontend build/deploy pipeline)
“Since June 1, 2022, we run in production with the website—and we are still alive.”
“Terraform is pretty cool. GCP is stable. Cloud Run is a real cool product.”
Closing
“Out of the Dark, Into the Light” by Michael Karner (durchblicker.at) is a grounded account of platform renewal: take stock of reality, set guardrails, start with a focused pilot, lean on managed services and IaC, and let pragmatism—not dogma—guide your path. The specifics are durchblicker.at’s; the patterns are widely applicable. If you’re staring at your own VM graveyard, this story offers a clear map of what to prioritize and how to keep momentum until the cloud “light” actually turns on.