When it comes to chaos engineering, tracking the right metrics is essential for understanding system behavior and improving resilience. Let’s explore the common metrics and baseline goals:
Baseline Metrics for Chaos Engineering:
- Infrastructure Monitoring Metrics:
  - Resource Metrics: These include CPU utilization, I/O activity, disk space, and memory usage. Monitoring tools like Datadog, New Relic, and SignalFx can help collect these data points (a collection sketch follows this list).
  - State Metrics: Keep an eye on system shutdowns, active processes, and clock time.
  - Network Metrics: Measure DNS latency, packet loss, and overall network health.
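As a rough illustration, the snippet below samples resource metrics locally with Python's psutil library. It's a minimal sketch, not a substitute for an agent like Datadog or New Relic, and it assumes psutil is installed (`pip install psutil`).

```python
# Minimal sketch: snapshot resource metrics locally with psutil.
# In practice, a monitoring agent would collect and ship these for you.
import psutil

def collect_resource_metrics() -> dict:
    """Take one baseline sample of CPU, memory, disk, and I/O counters."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),       # CPU utilization over 1s
        "memory_percent": psutil.virtual_memory().percent,   # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,      # disk space used on /
        "io_counters": psutil.disk_io_counters()._asdict(),  # cumulative I/O activity
    }

if __name__ == "__main__":
    print(collect_resource_metrics())
```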
- Alerting and On-Call Metrics (an aggregation sketch follows this list):
  - Total Alert Counts: Understand how many alerts each service generates per week.
  - Time to Resolution: Measure how quickly alerts are resolved for each service.
  - Noisy Alerts: Identify alerts that resolve on their own, and tune or eliminate the noisiest ones.
  - Top Frequent Alerts: Track the top 20 most frequent alerts per week for each service.
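If your alerting tool can export raw alert events, these numbers fall out of a quick aggregation. The sketch below uses pandas and assumes a hypothetical CSV export with columns `service`, `alert_name`, `opened_at`, and `resolved_at`; adjust the names to whatever your tool actually produces.

```python
# Rough sketch of weekly alert aggregation with pandas.
# The CSV layout (service, alert_name, opened_at, resolved_at) is an assumption.
import pandas as pd

alerts = pd.read_csv("alerts.csv", parse_dates=["opened_at", "resolved_at"])
alerts["week"] = alerts["opened_at"].dt.to_period("W")
alerts["time_to_resolution"] = alerts["resolved_at"] - alerts["opened_at"]

# Total alert counts per service per week
weekly_counts = alerts.groupby(["service", "week"]).size()

# Mean time to resolution per service
mttr_per_service = alerts.groupby("service")["time_to_resolution"].mean()

# Top 20 most frequent alerts per week for each service
top_alerts = (
    alerts.groupby(["service", "week", "alert_name"])
    .size()
    .groupby(level=["service", "week"], group_keys=False)
    .nlargest(20)
)

print(weekly_counts.head(), mttr_per_service.head(), top_alerts.head(), sep="\n\n")
```

Running this weekly (or pointing it at your alerting API instead of a CSV) gives you a repeatable baseline to compare against after each chaos experiment.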
- High Severity Incident (SEV) Metrics:
  - Establish a High Severity Incident Management (SEV) program.
  - Define SEV levels (e.g., 0, 1, 2, and 3).
  - Measure total incidents per week by SEV level.
  - Track SEVs per week by service.
  - Calculate Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), and Mean Time Between Failures (MTBF) for SEVs by service (a calculation sketch follows this list).
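MTTD, MTTR, and MTBF are simple timestamp arithmetic once you have an incident log. The sketch below assumes a hypothetical export with columns `service`, `started_at`, `detected_at`, and `resolved_at`; the exact definitions (for example, whether MTTR is measured from detection or from incident start) should follow your SEV program's conventions.

```python
# Minimal sketch: MTTD, MTTR, and MTBF per service from an incident log.
# Column names are assumptions; map them to your incident tracker's export.
import pandas as pd

incidents = pd.read_csv(
    "sev_incidents.csv",
    parse_dates=["started_at", "detected_at", "resolved_at"],
).sort_values("started_at")

incidents["time_to_detect"] = incidents["detected_at"] - incidents["started_at"]
# Here MTTR is measured from detection to resolution; adjust to your definition.
incidents["time_to_resolve"] = incidents["resolved_at"] - incidents["detected_at"]
# Time between the starts of consecutive incidents within each service
incidents["time_between_failures"] = incidents.groupby("service")["started_at"].diff()

summary = incidents.groupby("service").agg(
    mttd=("time_to_detect", "mean"),
    mttr=("time_to_resolve", "mean"),
    mtbf=("time_between_failures", "mean"),
)
print(summary)
```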
Setting Baseline Goals:
- Incident Reduction: Determine an appropriate goal for incident reduction over the next few months, such as a 2x or 10x improvement (for example, cutting weekly SEV incidents from 20 to 10 would be a 2x reduction).
- CPU Spike Causes: Identify the top 3 main causes of CPU spikes.
- Downstream/Upstream Effects: Understand the typical downstream and upstream effects when CPU spikes occur.
Remember, collecting baseline metrics allows you to measure the impact of your chaos engineering experiments and set meaningful goals for improvement.