Chaos Engineering

When it comes to chaos engineering, tracking the right metrics is essential for understanding system behavior and improving resilience. Let’s explore the common metrics and baseline goals:

Baseline Metrics for Chaos Engineering:

  • Infrastructure Monitoring Metrics:
    • Resource Metrics: These include CPU utilization, I/O activity, disk space, and memory usage. Monitoring tools such as Datadog, New Relic, and SignalFx can collect these data points (see the collection sketch after this list).
    • State Metrics: Keep an eye on system shutdowns, active processes, and clock time.
    • Network Metrics: Measure DNS latency, packet loss, and overall network health.
  • Alerting and On-Call Metrics:
    • Total Alert Counts: Understand how many alerts each service generates per week.
    • Time to Resolution: Measure how quickly alerts are resolved for each service.
    • Noisy Alerts: Identify alerts that resolve on their own, and tune or silence the ones that fire without requiring action.
    • Top Frequent Alerts: Track the top 20 most frequent alerts per week for each service.
  • High Severity Incident (SEV) Metrics:
    • Establish a High Severity Incident Management (SEV) Program:
      • Define SEV levels (e.g., 0, 1, 2, and 3).
      • Measure total incidents per week by SEV level.
      • Track SEVs per week by service.
      • Calculate Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), and Mean Time Between Failures (MTBF) for SEVs by service (see the aggregation sketch after this list).
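
Resource metrics are the easiest place to start collecting. Below is a minimal sketch of host-level baseline collection using the Python psutil library (assumed here purely for illustration; in practice a Datadog, New Relic, or SignalFx agent would export the same data points):

```python
# Minimal baseline collector for host resource metrics using psutil
# (an assumed dependency; a monitoring agent would normally do this for you).
import time
import psutil

def sample_resource_metrics():
    """Capture one point-in-time snapshot of host resource metrics."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),      # CPU utilization over 1s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,     # root volume usage
        "disk_io": psutil.disk_io_counters()._asdict(),     # cumulative I/O activity
        "net_io": psutil.net_io_counters()._asdict(),       # bytes/packets sent and received
    }

if __name__ == "__main__":
    # Collect a few samples to establish a pre-experiment baseline.
    baseline = [sample_resource_metrics() for _ in range(5)]
    print(baseline[-1])
```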

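For the alerting and SEV metrics, the sketch below assumes incident records have already been exported into a pandas DataFrame; the service names, timestamps, and column names are illustrative, not taken from any particular tool:

```python
# A minimal aggregation sketch over hypothetical incident records.
import pandas as pd

incidents = pd.DataFrame({
    "service":     ["checkout", "checkout", "search"],
    "sev_level":   [1, 2, 1],
    "started_at":  pd.to_datetime(["2024-05-01 10:00", "2024-05-03 14:00", "2024-05-02 09:00"]),
    "detected_at": pd.to_datetime(["2024-05-01 10:06", "2024-05-03 14:02", "2024-05-02 09:15"]),
    "resolved_at": pd.to_datetime(["2024-05-01 11:00", "2024-05-03 15:30", "2024-05-02 10:00"]),
}).sort_values("started_at")

# Total incidents per week by SEV level, and per week by service.
weekly_by_sev = incidents.groupby([pd.Grouper(key="started_at", freq="W"), "sev_level"]).size()
weekly_by_service = incidents.groupby([pd.Grouper(key="started_at", freq="W"), "service"]).size()

# MTTD (start -> detection), MTTR (detection -> resolution), and MTBF
# (gap between successive incident starts) per service, in minutes.
incidents["ttd_min"] = (incidents["detected_at"] - incidents["started_at"]).dt.total_seconds() / 60
incidents["ttr_min"] = (incidents["resolved_at"] - incidents["detected_at"]).dt.total_seconds() / 60
incidents["tbf_min"] = incidents.groupby("service")["started_at"].diff().dt.total_seconds() / 60
summary = incidents.groupby("service").agg(
    mttd_min=("ttd_min", "mean"),
    mttr_min=("ttr_min", "mean"),
    mtbf_min=("tbf_min", "mean"),
)

# The same groupby pattern covers the alerting metrics, e.g. for a hypothetical
# `alerts` DataFrame with fired_at / service / alert_name columns:
#   alerts.groupby([pd.Grouper(key="fired_at", freq="W"), "service"]).size()  # weekly counts
#   alerts["alert_name"].value_counts().head(20)                              # top 20 alerts
print(weekly_by_sev, weekly_by_service, summary, sep="\n\n")
```
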
Setting Baseline Goals:

  • Incident Reduction: Determine an appropriate goal for incident reduction over the next few months, such as a 2x or 10x reduction relative to your baseline (a toy calculation follows this list).
  • CPU Spike Causes: Identify the top three causes of CPU spikes.
  • Downstream/Upstream Effects: Understand the typical downstream and upstream effects when CPU spikes occur.
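
As a quick illustration of turning a 2x or 10x target into a concrete number, here is a toy calculation assuming a hypothetical baseline of 12 high-severity incidents per month:

```python
# Translate a reduction goal into a concrete target; the baseline figure is illustrative.
baseline_sevs_per_month = 12
for factor in (2, 10):
    target = baseline_sevs_per_month / factor
    print(f"{factor}x goal: at most {target:.1f} high-severity incidents per month")
```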

Remember, collecting baseline metrics allows you to measure the impact of your chaos engineering experiments and set meaningful goals for improvement.