Microsoft Österreich
What is Chaos Engineering?
Description
In his devjobs.at TechTalk, Jürgen Etzlstorfer talks about how a system can be analyzed and improved by deliberately injecting faults.
Video Summary
In What is Chaos Engineering?, Jürgen Etzlstorfer defines chaos engineering as a disciplined way to inject faults under a hypothesis and steady-state goals to verify resilience across the stack—especially in cloud and Kubernetes settings—such as removing a node in a high-availability setup. He advocates measuring outcomes with SLIs/SLOs (e.g., response time <200 ms, error rates) and highlights tools like LitmusChaos, Chaos Mesh, and Gremlin that offer experiments including CPU/storage exhaustion and network faults. The takeaway is to move from ad hoc experiments to pipeline-integrated chaos testing with a dedicated chaos stage alongside load tests (k6/JMeter) and automated SLO evaluations using metrics from systems like Prometheus or Dynatrace.
Chaos Engineering, Done Right: Hypotheses, Faults, and Resilience — Key Insights from “What is Chaos Engineering?” by Jürgen Etzlstorfer (Microsoft Österreich)
Why Chaos Engineering matters right now
In “What is Chaos Engineering?” Jürgen Etzlstorfer (Microsoft Österreich) made a crisp point: Chaos engineering is an engineering discipline. The goal is to introduce faults in a controlled way and analyze how they affect your application and its stack. This is not an ad‑hoc stunt. It starts with a hypothesis, proceeds with a well‑defined experiment, and ends with measurement and improvement.
Cloud‑native systems make this imperative. Orchestration layers such as Kubernetes can shift workloads between machines, scale them up and down, and trigger failovers. Microservices add many inter‑service dependencies. And every incident steals developer time that would otherwise go into product improvements and new features. Chaos engineering is about investing that time upfront to build resilience and, in turn, protect the team’s capacity to deliver.
The problem space: layered stacks, orchestration dynamics, and fragile assumptions
Etzlstorfer framed the system view with a pyramid: the application sits at the very top, and beneath it lie multiple layers—compute, network, databases, orchestration (e.g., Kubernetes), and often third‑party services. Any of these layers can fail. Robust applications account for that reality.
He also emphasized how much we as developers operate on assumptions: that the network is available, that there is enough compute and storage, that third‑party services respond. In practice, there are things we know we know, things we know we don’t know, and things we have never considered. The takeaway: design your application so it can still respond in some way when failures occur outside your immediate scope.
Kubernetes as an orchestration layer
For those less familiar with Kubernetes, Etzlstorfer described it as an orchestration layer for containerized applications. It can move workloads from one server to another. Applications must be prepared for scaling, shutdowns, failover, load balancing, and relocation between nodes. Chaos engineering helps validate this behavior early and continuously.
Microservices and cascading impact
In microservice architectures, the outage of a single service can impact overall availability. Dependencies multiply risk. Testing one service in isolation isn’t enough—you need to validate how the entire system behaves under stress and partial failure.
From hypothesis to experiment: making resilience measurable
Resilience starts with a steady‑state condition and an explicit hypothesis. Etzlstorfer offered a clear example: a target response time of 200 ms. The question then becomes: with a specific fault introduced, does the system maintain that steady state?
- Hypothesis: “Our system responds in under 200 ms.”
- Experiment: “In a high‑availability setup, remove one node.”
- Observation: “Under load, do we still meet the 200 ms target?”
If the outcome is positive, your application is resilient to that failure mode. If not, you’ve found a weakness. Typical improvements include scaling, updating the load balancer configuration, or introducing caching. The point isn’t to prevent all possible faults, but to iterate methodically against the ones that matter, with measurable outcomes.
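To make the hypothesis → experiment → observation loop concrete, here is a minimal sketch of the node-removal scenario. It assumes kubectl access to the cluster, a hypothetical node name and service URL, and the 200 ms target from the talk; in practice the fault would be driven by one of the chaos frameworks discussed below rather than raw kubectl calls.

```python
"""Minimal sketch of the node-removal experiment (assumptions: kubectl is
configured for the cluster, "worker-2" is a node in the HA setup, and the
health endpoint URL below is a stand-in for the real service)."""
import statistics
import subprocess
import time
import urllib.request

NODE = "worker-2"                                # hypothetical node name
URL = "http://my-app.example.com/health"         # hypothetical service endpoint
SLO_MS = 200                                     # steady-state target from the talk

def measure_response_ms(samples: int = 50) -> list[float]:
    """Send a batch of requests and record each response time in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(URL, timeout=5) as resp:
            resp.read()
        timings.append((time.perf_counter() - start) * 1000)
    return timings

# Fault injection: take one node out of the high-availability setup.
subprocess.run(["kubectl", "drain", NODE, "--ignore-daemonsets"], check=True)
try:
    timings = measure_response_ms()
    p95 = statistics.quantiles(timings, n=20)[-1]  # 95th percentile
    print(f"p95 = {p95:.1f} ms -> {'SLO met' if p95 < SLO_MS else 'SLO missed'}")
finally:
    # Always restore the steady state, even if the measurement fails.
    subprocess.run(["kubectl", "uncordon", NODE], check=True)
```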
SRE concepts as the measurement framework: SLI and SLO
Chaos engineering borrows measurement concepts from Site Reliability Engineering:
- Service Level Indicator (SLI): A measurable metric, such as login error rate or the response time of a frontend service.
- Service Level Objective (SLO): The target for an SLI, such as “response time must be faster than 200 ms over a 10‑minute period or a 30‑day period,” or “login error rate less than 2% over a defined period.”
With SLIs and SLOs in place, each experiment ends with a clear evaluation: objectives met or missed—and a decision on what to change next.
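As a rough illustration of how this vocabulary can be expressed in code, here is a minimal sketch using the example thresholds from the talk. The metric names and measured values are made up, and how the SLI values are actually collected (Prometheus, Dynatrace, ...) is left out.

```python
"""Minimal sketch of SLIs and SLOs as data, with illustrative values."""
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # which indicator is measured (the SLI)
    threshold: float  # the objective the SLI has to stay below
    unit: str
    window: str       # evaluation period, e.g. "10m" or "30d"

    def met(self, measured: float) -> bool:
        return measured < self.threshold

slos = [
    SLO(sli="frontend_response_time_p95", threshold=200, unit="ms", window="10m"),
    SLO(sli="login_error_rate", threshold=2, unit="%", window="30d"),
]

# Measured SLI values for one experiment run (hypothetical numbers).
measured = {"frontend_response_time_p95": 180.0, "login_error_rate": 1.2}

for slo in slos:
    status = "met" if slo.met(measured[slo.sli]) else "missed"
    print(f"{slo.sli} < {slo.threshold}{slo.unit} over {slo.window}: {status}")
```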
Where to inject chaos: layers, fault types, and the need for a framework
Faults can be introduced on several layers, depending on the question you want to answer:
- In the application: intentional pauses or delays in code. Chaos is not bugs—don't introduce defects; simulate controlled delays instead.
- On the operating system: e.g., CPU or IO stress.
- On the network: packet drops, latency, or full disconnection.
- In cloud resources and infrastructure: remove servers/nodes, exhaust storage, make services temporarily unavailable.
Etzlstorfer stressed avoiding ad‑hoc actions like randomly switching off a machine. Use a framework—ideally with declarative definitions in an infrastructure‑as‑code style—so experiments are reproducible and orchestrated.
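For the in-application fault type, a minimal sketch of a controlled delay is shown below, assuming a hypothetical request handler and an environment variable (CHAOS_DELAY_MS) as the switch. The point is the one made in the talk: the fault is explicit and configurable, not a defect hidden in the code, and the toggle can be flipped by whatever framework orchestrates the experiment.

```python
"""Sketch of a controlled in-app delay (handler name and env var are hypothetical)."""
import os
import time
from functools import wraps

def chaos_delay(handler):
    """Add an artificial delay when CHAOS_DELAY_MS is set, e.g. by the chaos framework."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))
        if delay_ms > 0:
            time.sleep(delay_ms / 1000)  # simulate a slow downstream dependency
        return handler(*args, **kwargs)
    return wrapper

@chaos_delay
def get_account_balance(user_id: str) -> dict:
    # Hypothetical business logic standing in for a real request handler.
    return {"user": user_id, "balance": 42.0}

print(get_account_balance("demo-user"))
```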
Tools highlighted by Etzlstorfer
- Netflix’s Chaos Monkey: one of the earliest approaches, noted as not available anymore to his knowledge.
- LitmusChaos (open source): widely used, with a library of fault actions.
- Chaos Mesh (open source): forms the foundation of Azure Chaos Studio.
- Gremlin (commercial): a long‑standing option.
These tools commonly provide built‑in “chaos actions” (e.g., CPU or storage exhaustion, network removal), plus start/stop/orchestration capabilities. The advice is to leverage these instead of building your own ad‑hoc solution.
From one‑off experiments to continuous practice
Etzlstorfer drew a helpful parallel to load testing. You wouldn’t run load tests only once a month. You’d put them in the pipeline, ensuring every feature and bug fix is validated under load. Chaos experiments belong there as well: not ad hoc, but continuous.
A dedicated “chaos” stage in the delivery pipeline
As a concrete example, Etzlstorfer referenced the open‑source project Keptn, in which he is heavily involved. The idea is to introduce a dedicated pipeline stage for chaos. The flow looks like this:
- Deploy your application into the chaos stage.
- Run your load tests (e.g., with k6 or JMeter).
- Run your chaos testing framework to introduce the defined faults.
- Wait until tests finish (e.g., 10 minutes or an hour—duration is context‑dependent).
- Trigger SLO evaluation: fetch metrics from systems such as Prometheus or Dynatrace and evaluate against your defined objectives.
- Interpret results: SLO met → resilience confirmed. SLO missed → weakness identified, plan the next improvement.
Keptn orchestrates this sequence so you can operationalize evaluation and workflows. The broader message: move from ad‑hoc experiments to a mature, engineering‑driven approach by making chaos testing part of your pipelines.
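As an illustration of the SLO-evaluation step, here is a minimal sketch that queries the Prometheus HTTP API for a p95 response time and compares it against the 200 ms objective. The Prometheus URL and the metric name are assumptions, not taken from the talk; with Dynatrace or another backend the query changes but the gate logic stays the same.

```python
"""Minimal SLO-evaluation sketch against Prometheus (endpoint and metric name assumed)."""
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"   # hypothetical Prometheus endpoint
SLO_MS = 200                          # objective from the talk
# p95 response time over the last 10 minutes, converted to milliseconds.
QUERY = ('1000 * histogram_quantile(0.95, '
         'sum(rate(http_request_duration_seconds_bucket[10m])) by (le))')

def query_prometheus(promql: str) -> float:
    url = f"{PROM_URL}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    # The instant-query API returns [timestamp, "value"] pairs.
    return float(body["data"]["result"][0]["value"][1])

p95_ms = query_prometheus(QUERY)
print(f"p95 = {p95_ms:.1f} ms -> {'SLO met' if p95_ms < SLO_MS else 'SLO missed'}")
```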
A practical blueprint — step by step
Grounded in Etzlstorfer’s talk, teams can apply the following blueprint:
- Define the steady state
  - Choose meaningful SLIs/SLOs. Examples: “Response time < 200 ms over 10 minutes” or “Login error rate < 2% over a defined period.”
  - Decide on the load profile you’ll apply through load testing.
- Formulate the hypothesis
  - Example: “If one node fails, the system still responds under 200 ms.”
  - Alternative: “If a third‑party dependency is unavailable, the login error rate remains < 2%.”
- Select the fault
  - Remove a node (HA scenario).
  - Cut the network for specific services.
  - Exhaust CPU or storage.
  - Introduce in‑app delays (do not add bugs).
- Describe and automate the experiment declaratively
  - Pick a framework (e.g., LitmusChaos, Chaos Mesh, Gremlin).
  - Define experiments as code so they are reproducible.
- Combine load and chaos
  - Run load tests (e.g., k6 or JMeter) in parallel with chaos experiments.
  - Observe whether SLIs remain within SLO bounds.
- Measure and evaluate
  - Pull metrics from your observability systems (e.g., Prometheus, Dynatrace).
  - Trigger SLO evaluation and make a go/no‑go decision.
- Improve based on outcomes
  - If targets were met: document and keep the experiment for regression.
  - If targets were missed: address the weakness—scale out, tune the load balancer, add caching—and test again.
- Integrate into the pipeline
  - Establish a dedicated “chaos” stage so every change is validated automatically.
This mirrors the core sequence in the session: hypothesis → experiment → measurement → improvement, run continuously rather than once.
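One possible shape of such a chaos stage, sketched as a simple orchestration script: the `k6 run` command is k6's real CLI entry point, while the chaos and SLO-evaluation scripts are hypothetical placeholders for whatever framework and evaluation you use (for instance, the Keptn-driven flow described above).

```python
"""Sketch of a dedicated chaos stage (load test, chaos injection, SLO gate).

Assumptions: the application is already deployed into the chaos stage, k6 is
installed, and run_chaos_experiment.py / evaluate_slos.py are placeholders for
your chaos framework and SLO evaluation.
"""
import subprocess

# 1. Start the load test in the background (script path is an assumption).
load = subprocess.Popen(["k6", "run", "load/steady_state.js"])

# 2. Inject the declared faults while the load is running
#    (stand-in for LitmusChaos / Chaos Mesh / Gremlin).
subprocess.run(["python", "run_chaos_experiment.py", "--experiment", "node-loss"], check=True)

# 3. Wait until the load test finishes (its duration is defined in the k6 script).
load.wait()

# 4. Trigger the SLO evaluation, e.g. with the Prometheus query sketched earlier.
slo = subprocess.run(["python", "evaluate_slos.py"])

# 5. Gate the pipeline on the result: a non-zero exit code fails the stage.
raise SystemExit(slo.returncode)
```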
Guiding principles emphasized by Etzlstorfer
“Chaos Engineering is actually an engineering discipline.”
- Controlled, not arbitrary: avoid ad‑hoc toggling of machines. Prefer declared, orchestrated experiments.
- System thinking: the app is only the top of the pyramid—every layer can fail.
- Challenge assumptions: verify compute, storage, and third‑party dependencies under stress.
- Metrics first: select SLIs, set SLOs, and evaluate after each run.
- Make it continuous: just like load testing, integrate chaos testing into the pipeline.
Typical fault scenarios — and what they teach us
Etzlstorfer’s examples underscore how different layers expose different weaknesses. The learning comes from combining faults with load and objective evaluation:
- Removing a node in an HA setup: validates true fault tolerance and load redistribution.
- Cutting network connectivity: surfaces timeouts and dependency handling—does the app fail safe or hang?
- Exhausting CPU or storage: tests backpressure and graceful degradation.
- In‑app delays: simulates slow downstream services or internal bottlenecks.
The output is actionable: SLO met or missed, followed by specific changes such as scaling, load balancer tuning, or caching.
Team impact: reclaiming developer time
Etzlstorfer’s developer‑centric argument was simple: every outage drains developer time into firefighting, configuration tweaks, and stability fixes. By deliberately designing for failure and validating behavior under stress, teams protect their capacity to build features. Chaos engineering turns unpredictable interruptions into planned learning cycles.
From theory to habit: keeping chaos engineering effective
- Start small, measure what matters: pick one or two SLIs/SLOs and a clear fault (e.g., node loss) to begin.
- Standardize experiments: describe them declaratively, version them, and run them the same way every time.
- Add load: chaos results are only meaningful under realistic load conditions.
- Automate evaluation: trigger SLO checks after each run and make results visible.
- Translate results into improvements: scale, reconfigure, or add caching—and iterate.
These steps reflect the method outlined in the session: avoid spectacle, pursue repeatability.
Conclusion: a mature path to resilient systems
“What is Chaos Engineering?” by Jürgen Etzlstorfer (Microsoft Österreich) distills the craft into a simple arc: define steady state, inject controlled faults, measure against SLOs, and automate the loop in your delivery pipeline. Tools like LitmusChaos, Chaos Mesh (the foundation for Azure Chaos Studio), and Gremlin provide the building blocks. Load testing with k6 or JMeter and metrics from Prometheus or Dynatrace close the loop.
Most importantly, the mindset matters: chaos engineering is an engineering discipline. Treat it that way and you’ll avoid ad‑hoc disruption, gain reliable insights, and protect the scarcest resource in your team—developer time.
Key takeaways for engineering teams
- Start with hypotheses and steady‑state metrics (SLIs/SLOs).
- Move chaos experiments into your pipeline; don’t run them ad hoc.
- Use established tools rather than one‑off scripts.
- Measure systematically and evaluate SLOs after every run.
- Improve deliberately: scale, tune load balancers, add caching—whatever the data suggests.
- Think in layers: the app is only the top; any layer can fail.
- Design for outages beyond your immediate scope so the app still responds in some way.
That’s how chaos engineering moves from a buzzword to a practical quality practice—with clear metrics, reproducible experiments, and continuous improvement, just as outlined in “What is Chaos Engineering?”.