Chaos Engineering – InfiniteKB

Tracking the right metrics is essential to chaos engineering: they show how the system behaves under failure and whether its resilience is improving. The common baseline metrics and goals are outlined below:

Baseline Metrics for Chaos Engineering:

Infrastructure Monitoring Metrics:
- Resource Metrics: These include CPU utilization, I/O activity, disk space, and memory usage. Monitoring tools like Datadog, New Relic, and SignalFx can help collect these data points.
- State Metrics: Keep an eye on system shutdowns, active processes, and clock time.
- Network Metrics: Measure DNS latency, packet loss, and overall network health.
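As a rough illustration, two of the infrastructure metrics above (disk space and DNS latency) can be sampled with the Python standard library alone. This is a minimal sketch, not a replacement for a monitoring tool: the function names are hypothetical, and the default hostname and path are placeholder assumptions.

```python
import shutil
import socket
import time


def disk_used_percent(path="/"):
    """Return the percentage of disk space in use on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def dns_latency_ms(hostname):
    """Time one name lookup through the system resolver, in milliseconds.

    The operating system may cache results, so repeated lookups can look
    faster than a cold lookup would.
    """
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def sample_baseline(hostname="localhost", path="/"):
    """Collect one baseline sample (hostname and path are placeholders)."""
    return {
        "disk_used_percent": disk_used_percent(path),
        "dns_latency_ms": dns_latency_ms(hostname),
    }
```

Recording samples like these at a regular interval before an experiment gives a reference point to compare against while a failure is being injected.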
Alerting and On-Call Metrics:
- Total Alert Counts: Understand how many alerts each service generates per week.
- Time to Resolution: Measure how quickly alerts are resolved for each service.
- Noisy Alerts: Identify self-resolving alerts and address noisy ones.
- Top Frequent Alerts: Track the top 20 most frequent alerts per week for each service.
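The alerting metrics above can be derived from a plain export of alert records. The sketch below assumes a hypothetical `Alert` record with a service name, alert name, open and resolve timestamps, and a flag for alerts that cleared without human action. It also assumes the input has already been filtered to a single week.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical record shape; adapt the fields to your alerting tool's export.
    service: str
    name: str
    opened: datetime
    resolved: datetime
    self_resolved: bool  # True if the alert cleared without human action


def weekly_alert_summary(alerts, top_n=20):
    """Summarize one week of alerts per service."""
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert.service].append(alert)

    summary = {}
    for service, items in by_service.items():
        durations = [a.resolved - a.opened for a in items]
        summary[service] = {
            "total_alerts": len(items),
            # Mean time to resolution as a timedelta.
            "mean_time_to_resolution": sum(durations, timedelta()) / len(durations),
            # Self-resolving alerts are candidates for noisy-alert cleanup.
            "self_resolved": sum(a.self_resolved for a in items),
            # Most frequent alert names, as (name, count) pairs.
            "top_alerts": Counter(a.name for a in items).most_common(top_n),
        }
    return summary
```

A service with a high `self_resolved` count relative to `total_alerts` is a likely place to start tuning thresholds.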
High Severity Incident (SEV) Metrics:
- Establish a High Severity Incident Management (SEV) Program:
  - Define SEV levels (e.g., 0, 1, 2, and 3).
  - Measure total incidents per week by SEV level.
  - Track SEVs per week by service.
  - Calculate Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), and Mean Time Between Failures (MTBF) for SEVs by service.
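One possible way to compute MTTD, MTTR, and MTBF per service is sketched below. The `Incident` record and function names are hypothetical, and the definitions are assumptions, since teams differ on them: MTTD is measured from incident start to detection, MTTR from detection to resolution, and MTBF as the average gap between one incident's resolution and the next incident's start.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape for one SEV.
    service: str
    sev: int
    started: datetime
    detected: datetime
    resolved: datetime


def _mean(deltas):
    """Average a list of timedeltas; None when there is nothing to average."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def sev_metrics(incidents):
    """Compute MTTD, MTTR, and MTBF for incidents belonging to one service."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "mttd": _mean([i.detected - i.started for i in ordered]),
        "mttr": _mean([i.resolved - i.detected for i in ordered]),
        "mtbf": _mean(gaps),  # None when there are fewer than two incidents
    }


def sev_metrics_by_service(incidents):
    """Group incidents by service and compute the metrics for each group."""
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident.service].append(incident)
    return {service: sev_metrics(items) for service, items in grouped.items()}
```

Filtering the input by `sev` before calling `sev_metrics_by_service` gives the same breakdown per SEV level.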

Setting Baseline Goals:

- Incident Reduction: Determine an appropriate goal for incident reduction over the next few months. Aim for a concrete factor, such as a 2x or 10x reduction in incidents.
- CPU Spike Causes: Identify the top 3 causes of CPU spikes.
- Downstream/Upstream Effects: Understand the typical effects on downstream and upstream services when the CPU spikes.
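An incident reduction goal expressed as a factor translates directly into a target count. The helper names below are hypothetical, and the sketch assumes the baseline is the average number of incidents per week measured before any experiments.

```python
def target_incidents(baseline_per_week, improvement_factor):
    """Weekly incident target implied by an improvement factor (2 for 2x, 10 for 10x)."""
    if improvement_factor <= 0:
        raise ValueError("improvement_factor must be positive")
    return baseline_per_week / improvement_factor


def achieved_factor(baseline_per_week, current_per_week):
    """Improvement factor reached so far; infinite if incidents dropped to zero."""
    if current_per_week == 0:
        return float("inf")
    return baseline_per_week / current_per_week
```

For a baseline of 20 incidents per week, a 2x goal means a target of 10 per week and a 10x goal means a target of 2 per week.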

Remember, collecting baseline metrics allows you to measure the impact of your chaos engineering experiments and set meaningful goals for improvement.