
Architecting for Scale

Description

In his TechTalk, David Leitner of SQUER discusses the technological and organizational challenges that arise when a system scales, and how to deal with them.


Video Summary

Architecting for Scale by David Leitner explains that modularization and distribution are separate axes and that microservices should be driven primarily by organizational autonomy and ownership, not as a fix for poor modularity. He shows how to design self-sufficient subsystems and vertically sliced teams (e.g., splitting payments into credit card vs. SEPA) that build and run their services, using event-driven streams, caching, circuit breakers, load balancers, deliberate chaos/game days, and agreed availability “nines” with platform and dependent teams. Viewers will learn to avoid a distributed monolith, architect around end-to-end responsibilities, and plan for external failure to keep lead time low as teams and systems grow.

Architecting for Scale: Designing Self‑Sufficient Subsystems and Autonomous Teams — Insights from “Architecting for Scale” by David Leitner (SQUER)

Why this session mattered

We at DevJobs.at joined “Architecting for Scale” in Linz to hear David Leitner (SQUER) lay out a crisp, experience‑based blueprint for scaling software delivery. Leitner, SQUER’s Chief Technology Officer and a self‑described “coding architect,” works hands‑on with teams: clarifying architectural constraints, shaping vision with stakeholders, and helping implement it. His core message: quality must exist from architecture down to design and implementation — and sustainable scale is a socio‑technical outcome, not just a technical one.

He framed the talk in SQUER’s four lines of work: enabling organizations to build great digital products, hands‑on product delivery (from design thinking to sustainable implementation), software transformation (like decomposing monoliths or migrating off out‑of‑maintenance tech), and cloud/platform engineering. Architecting for scale touches all of them, because structure and flow emerge from both architecture and organization.

From simple success to complex systems

The common trajectory: a small team builds something valuable, it works, revenue or impact grows, the team expands, multiple teams join in — and the system’s complexity climbs. That success should be a good problem to have, but many organizations notice the same pattern: adding people does not linearly increase value delivery. Worse, outages rise as the system and contributor base grow.

“We put more people on the product, but the value doesn’t go up linearly. And even worse, the failure rate goes up.”

That’s the opposite of optimizing the flow of value to customers — or, as Leitner put it, the opposite of “safely and sustainably reduc[ing] the lead time to ‘thank you.’” The natural reaction is to modularize. But here is where a widespread confusion creeps in.

Modularization is not distribution

Leitner separates two axes that often get conflated:

  • Modularization can be good or bad.
  • Deployment can be monolithic or distributed.

Historical growth frequently traps teams in a monolithic deployment with poor modularization — the “big ball of mud.” But moving to distributed deployment does not fix modularization by itself. It can even make things worse: you trade the simplicity of a monolith for the complexities of the network while keeping tight coupling — the “distributed monolith.”

“Monolithic deployment doesn’t automatically mean bad modularization. ‘Well‑structured monoliths’ exist — and they are often far better than distributed monoliths.”

A quote Leitner highlighted captures it succinctly:

“If you can’t build a well‑structured monolith, what makes you think microservices are the answer?” (Simon Brown)

So if you need modularity, modularize. Don’t expect distribution to do that job for you.

Why we still distribute: autonomy over raw tech

There are legitimate technical reasons for distributed systems:

  • individual scaling demands for certain services,
  • technology segmentation (say, ML in Python while core services run in Go/Java),
  • co‑location (e.g., deploy near a dependency or a specific region).

Leitner’s observation: these reasons exist but are not the main drivers in most real cases. The dominant driver is organizational — ownership and autonomy.

“You build it, you run it” is hard when teams share one deployment unit. If Team B can crash the shared runtime, Team A won’t promise to “own” their module’s uptime — they can’t control the blast radius. Distribution gives teams the control they need to accept responsibility. It’s a socio‑technical move: cut the system so that teams can build and run their parts independently.

The target state is clear in Leitner’s words: autonomous, self‑sufficient subsystems — often implemented as coarse‑grained microservices or self‑contained systems.

The core design principle: self‑sufficient subsystems

Leitner’s thesis in one line:

“The key to architect for scale is to design self‑sufficient subsystems that empower autonomous teams.”

He grounded this in an e‑commerce example: a Payment team/service and an Account team/service.

Outgoing dependencies: resilience over blame

Suppose the Payment service is down. The Payment team says, “It’s not our fault — the Account service didn’t return the users.” The appropriate response is: it is your service; you own the ability to process payments.

Resilient design here means:

  • Cache the user data Payment needs, instead of hard‑depending on Account at runtime.
  • Consume an event stream from Account (e.g., a Kafka topic of user‑changed events) and maintain a local, up‑to‑date projection as a protection layer.

With this, Account can go offline without taking Payment down. That’s the essence of event‑driven, reactive architecture in service of autonomy and resilience. Leitner noted a separate talk on “the rise of reactive microservices,” but kept the focus here on the principle: build protection against external failure.
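This protection layer can be sketched in a few lines. The following is a minimal, illustrative sketch of the idea, not a specific implementation from the talk: all names (`UserChangedEvent`, `UserProjection`, the sample data) are hypothetical, and a real system would consume the events from a Kafka topic and persist the projection.

```python
# Sketch: Payment keeps a local projection of the user data it needs,
# updated from a stream of user-changed events, so it never has to call
# the Account service at payment time. Names here are illustrative.
from dataclasses import dataclass

@dataclass
class UserChangedEvent:
    # Each event carries the user's full current state,
    # so applying it is an idempotent upsert.
    user_id: str
    name: str
    iban: str

class UserProjection:
    """Local, eventually consistent copy of the user data Payment needs."""
    def __init__(self):
        self._users = {}

    def apply(self, event: UserChangedEvent) -> None:
        self._users[event.user_id] = {"name": event.name, "iban": event.iban}

    def get(self, user_id: str):
        # Served entirely from local state -- works even if Account is down.
        return self._users.get(user_id)

projection = UserProjection()
projection.apply(UserChangedEvent("u1", "Ada", "AT611904300234573201"))
user = projection.get("u1")  # no runtime dependency on Account
```

The trade-off is eventual consistency: the projection may lag the Account service by a few events, which is usually acceptable in exchange for keeping Payment available.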

Incoming dependencies: you must defend yourself

Equally important: callers can overwhelm you through misconfiguration or bugs. You don’t control external systems, so defending your service is your responsibility:

  • Use load balancers effectively.
  • Implement circuit breakers, timeouts, and retries judiciously.
  • Cache responses where it reduces pressure and risk.

In short, architect with external failure in mind. In distributed systems, everything that can go wrong will go wrong at some point.

“Services that can run in a chaotic environment can run everywhere. Services that can only run in a stable environment can only run under such conditions.”
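To make the circuit-breaker pattern concrete, here is a minimal sketch under simplifying assumptions (a single-threaded breaker with a consecutive-failure counter; the class name and thresholds are illustrative, and production systems would typically reach for a battle-tested library instead):

```python
# Minimal circuit-breaker sketch: after `max_failures` consecutive errors
# the breaker opens and calls fail fast, without hitting the dependency,
# until `reset_after` seconds have passed (then one trial call is allowed).
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open -- failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)

def flaky():
    raise ValueError("dependency down")

for _ in range(2):          # two consecutive failures open the circuit
    try:
        breaker.call(flaky)
    except ValueError:
        pass
# Further calls now fail fast instead of piling load onto the dependency.
```

Failing fast is the point: an open circuit sheds load from the struggling dependency and gives your own service a quick, predictable error to degrade against.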

Expect and rehearse chaos: game days in daylight

Leitner encourages teams to inject controlled chaos on purpose while everyone is available — not to wait for a 3 a.m. incident. Game days at 2 p.m. with the whole team present:

  • “Turn off the database. What happens?”
  • “Break the network link between services. How do they behave?”
  • “Misconfigure a dependency. How does the error propagate?”

The aim is to harden processes, self‑healing, and recovery so they are ready when real incidents occur. Chaos is unavoidable in complex, distributed systems — plan for it.
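A game-day scenario can even be rehearsed as an automated check. The sketch below simulates "turn off the dependency" and verifies the service degrades instead of crashing; every name (`AccountClient`, `PaymentService`, the fallback behavior) is hypothetical and stands in for whatever your real drill exercises:

```python
# Sketch of a daylight chaos drill: flip the dependency to "unavailable"
# and verify the service falls back to cached data or defers the work.
class AccountClient:
    def __init__(self):
        self.available = True  # the switch a game day flips off

    def fetch_user(self, user_id):
        if not self.available:
            raise ConnectionError("Account service unreachable")
        return {"id": user_id, "name": "Ada"}

class PaymentService:
    def __init__(self, accounts):
        self.accounts = accounts
        self.cache = {}

    def pay(self, user_id, amount):
        try:
            user = self.accounts.fetch_user(user_id)
            self.cache[user_id] = user      # refresh the local copy
        except ConnectionError:
            user = self.cache.get(user_id)  # fall back to cached data
            if user is None:
                return "deferred"           # accept now, settle later
        return f"charged {user['name']} {amount}"

accounts = AccountClient()
payments = PaymentService(accounts)
payments.pay("u1", 10)       # warm the cache while everything is healthy
accounts.available = False   # game day: "turn off" the dependency
payments.pay("u1", 10)       # still served, from the cache
payments.pay("u2", 10)       # unknown user: payment deferred, not failed
```

The drill's value is in the assertions: you learn, at 2 p.m. with the team present, whether your fallbacks actually hold.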

Beyond technology: end‑to‑end responsibilities

A purely technical solution won’t unlock autonomy if teams lack the business capabilities to act end‑to‑end. Back to Payment: it executes a payment, but occasionally fails when the payment gateway has an error. One approach is to make the gateway more stable. Leitner invites a different question: why is the gateway a separate team/domain in the first place?

If Payment is supposed to own the end‑to‑end responsibility (customer enters card data; payment succeeds), then cut vertically. If the domain is too large for a single team, split by business capability — for example, one team for credit card payments and one for SEPA payments. Each owns an end‑to‑end flow, making them genuinely autonomous.

Complicated subsystems: carve‑out with clear expectations

Not everything belongs inside a stream‑aligned team. Some capabilities are complicated enough to be standalone — Leitner’s example: a fraud‑check subsystem. Offloading that protects the business‑value teams’ cognitive load, but it raises the bar for integration. What’s needed:

  • a mutual understanding of availability (“the nines”),
  • lightweight SLAs — not heavy contracts, but clear expectations,
  • integration designs that match the promised availability.

Leitner made the consequences concrete. If the fraud subsystem offers “three nines” (roughly 8.8 hours of annual downtime), Payment must architect differently than if it offers “five nines” (roughly five minutes per year). With three nines, Payment might accept the payment and apply it later when the internal error clears; with five nines, a user‑facing “please retry” dialog may be adequate.
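The arithmetic behind “the nines” is simple enough to check directly; the helper below is just an illustration of the calculation:

```python
# Allowed annual downtime implied by a given number of "nines".
def annual_downtime_minutes(nines: int) -> float:
    availability = 1 - 10 ** (-nines)        # e.g. 3 nines -> 0.999
    return (1 - availability) * 365 * 24 * 60

three_nines = annual_downtime_minutes(3)  # about 526 minutes (~8.8 hours)
five_nines = annual_downtime_minutes(5)   # about 5.3 minutes
```

That two-orders-of-magnitude gap is exactly why the promised nines must shape the integration design, not be discovered after the fact.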

Also critical: mean time to failure (MTTF). How frequently do failures occur? A legacy core backing system might have a different failure profile and demands stronger decoupling so a blip doesn’t stall the entire business process.

Platform teams as enablers — with SLAs, too

Leitner echoed patterns familiar from Team Topologies: platform teams that help stream‑aligned teams move faster and safer — for instance by operating Kubernetes clusters or owning shared cloud resources. The same rule applies: agree on expectations. How many nines does the platform provide? Which services are supported at which level? Teams need that clarity to architect for availability on their critical paths.

“We need to architect for availability in our critical path.”

The socio‑technical throughline

Leitner’s throughline is consistent: scaling is a socio‑technical discipline. Architectural excellence must pair with organizational design — ownership, team boundaries, and end‑to‑end responsibilities. Distribution is primarily a lever to enable autonomy, not a shortcut to good modularity.

He distilled four points from the session:

  1. Keep external failure in mind — treat it as given.
  2. Embrace chaos deliberately — game days, forced failures, recovery drills.
  3. Architect around end‑to‑end responsibilities — choose vertical slices.
  4. Architect for availability on critical paths — understand the nines and MTTF of your dependencies and design accordingly.

He noted that in a related talk he covered three additional points, but the focus here remained on these four.

Applying the lessons now: a practical checklist

Without inventing new tooling, teams can act on these ideas today:

  • Map dependencies thoroughly: both outgoing and incoming. Identify what’s critical to your “lead time to ‘thank you.’”
  • Recut services along end‑to‑end flows: prefer vertical slices (e.g., Credit Card vs. SEPA) over horizontal layers (e.g., a separate “gateway team”).
  • Establish protective seams: add caches and consume event streams (like user‑changed events) to build local projections — so another team’s downtime doesn’t become yours.
  • Turn on defensive patterns: circuit breakers, sensible timeouts, bounded retries, and load balancing that actually sheds risk.
  • Clarify the nines with dependent subsystems and platform teams: lightweight SLAs, clear expectations. Nine hours vs. six minutes of annual downtime lead to different integration strategies.
  • Track MTTF: frequency matters as much as duration. Frequent small failures can cascade; decouple accordingly.
  • Run daylight chaos experiments: scheduled game days. Turn off the DB, break links, misconfigure services — and learn together as a team.

Closing thought: architecture as an enabler of ownership

David Leitner’s “Architecting for Scale” (SQUER) argues for a precise kind of simplicity: cut the system so teams can own end‑to‑end flows and defend themselves against the realities of distributed computing. Distribution becomes a means to autonomy; event‑driven seams, caching, and defensive patterns keep failures local; platform expectations and “the nines” make critical paths predictable.

Leitner opened with the observation that growth without structure easily produces complex systems where adding people doesn’t add value and outage rates climb. His answer is to design for ownership: autonomous teams, self‑sufficient subsystems, chaos‑ready operations, and availability by design. That’s how you keep the lead time to “thank you” short — even as you scale.
