From 45-Minute Incident Hunts to 4-Minute Root Cause: OpenTelemetry on a Payments Platform
Eight microservices on EKS, CloudWatch logs with no trace IDs, and a mean time to detect (MTTD) of 45 minutes for payment incidents. We instrumented the full stack with the OpenTelemetry Collector, Grafana Tempo, and auto-instrumentation, and found an N+1 query that had been adding 200ms to every payment for months.