From 45-Minute Incident Hunts to 4-Minute Root Cause: OpenTelemetry on a Payments Platform

[Hero image: eight microservice nodes connected by trace spans, with an N+1 anomaly spike in the payment flow]

When a payment fails at 2 AM, the first thing an engineer does is open CloudWatch. They search for the error, find a log line from the API gateway, note the timestamp, open a second tab and search the payment service logs for the same timestamp window, find something suspicious, open a third tab for the database slow query logs, try to correlate by timestamp, realize the clocks are off by a few seconds across services, and spend the next 45 minutes manually threading a narrative through disconnected log streams.

That was the baseline: 45 minutes mean time to detect, for a payments platform processing real money in real time.

The fix wasn't more logs. They already had CloudWatch. The fix was distributed tracing — a single trace ID that flows through every service in a request chain, so that instead of grepping timestamps across eight services, you open one view and see the entire journey: which service called which, how long each hop took, where the error originated.

We instrumented the full stack with OpenTelemetry, exported traces to Grafana Tempo, and along the way found an N+1 query that had been adding 200ms to every payment for months. Nobody had noticed because nothing connected the app-level latency to the database behavior.

What We Were Working With

Eight microservices on EKS: an API gateway (Node.js), a payment orchestrator (Node.js), a fraud detection service (Node.js), a reconciliation worker (Go), an accounts ledger service (Go), a notification service (Node.js), a currency conversion service (Go), and a webhook delivery service (Node.js). Most traffic flowed through the API gateway → payment orchestrator → accounts ledger path.

The observability stack before we started:

| Component | Tool | Gap |
| --- | --- | --- |
| Logs | CloudWatch Logs | No trace IDs, no correlation |
| Metrics | CloudWatch Metrics | No link to logs or traces |
| APM | None | |
| DB slow queries | RDS Performance Insights | Disconnected from app traces |
| Alerting | CloudWatch Alarms | Fires on symptoms, not causes |
| MTTD (payment incidents) | 45 min (median) | |

The team was competent. The tooling just made correlation impossible. You can't grep your way to root cause when the information is spread across eight separate log streams with no shared identifier.

Phase 1: OpenTelemetry Collector as DaemonSet

The first decision was deployment architecture for the OTEL Collector. We chose DaemonSet — one collector per node, shared across all pods on that node — over the sidecar pattern.

With eight services on a cluster that autoscales to 15–20 nodes (and Karpenter managing the node lifecycle), the sidecar approach would mean 30–60 collector processes, one per application pod, replicas included. A DaemonSet means 15–20. Less CPU overhead, centralized config, one place to update the export pipeline.

The full DaemonSet config:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - "--config=/conf/otel-collector-config.yaml"
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              cpu: 500m
              memory: 500Mi
          ports:
            # hostPort exposes each port on the node IP, which is what
            # lets pods reach the collector at $(NODE_IP):431x
            - containerPort: 4317  # OTLP gRPC
              hostPort: 4317
            - containerPort: 4318  # OTLP HTTP
              hostPort: 4318
            - containerPort: 8888  # Collector metrics
          volumeMounts:
            - name: otel-collector-config
              mountPath: /conf
      volumes:
        - name: otel-collector-config
          configMap:
            name: otel-collector-config

The collector config itself handles trace batching, retry logic, and export to Grafana Tempo. We also wired logs to Loki with the trace ID preserved for correlation:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  resource:
    attributes:
      - key: k8s.cluster.name
        value: "payments-prod"
        action: upsert
      # Hint for the Loki exporter: promote these resource attributes
      # to Loki labels. The trace ID stays in the log body; a trace_id
      # label would have unbounded cardinality and bloat Loki's index.
      - key: loki.resource.labels
        value: service.name, k8s.pod.name
        action: insert

exporters:
  otlp/tempo:
    endpoint: http://grafana-tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true
  loki:
    endpoint: http://grafana-loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
  prometheus:
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true  # Required for exemplar support

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus]

With this in place, every pod had a node-local OTEL endpoint at http://$(NODE_IP):4317. Each service points its OTEL SDK there via a NODE_IP env var injected from the pod spec's fieldRef.

Phase 2: Auto-Instrumentation for Node.js Services

Five of the eight services were Node.js. Rather than modifying each service individually, we used the OpenTelemetry Node.js auto-instrumentation package — zero code changes required, just a change to how the process starts.

The deployment patch for all Node.js services:

# Patch applied to all 5 Node.js deployments
spec:
  template:
    spec:
      containers:
        - name: app
          env:
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://$(NODE_IP):4317"
            - name: OTEL_SERVICE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['app']
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            # Declared before it is referenced below; Kubernetes only
            # expands $(VAR) references to variables defined earlier
            # in the env list
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production,k8s.namespace.name=$(POD_NAMESPACE)"
            - name: NODE_OPTIONS
              value: "--require @opentelemetry/auto-instrumentations-node/register"

The NODE_OPTIONS line is the key — it tells Node.js to load the auto-instrumentation package before any application code runs. This instruments Express routes, HTTP outbound calls, PostgreSQL queries (pg client), Redis calls, and Kafka producers/consumers automatically.

We added the package to the base Node.js Docker image used by all services:

FROM node:20-alpine
# Install the auto-instrumentation bundle once in the shared base
# image (pin the version in your real Dockerfile)
RUN npm install @opentelemetry/auto-instrumentations-node
# ... rest of Dockerfile

No individual service code changed. All five services started emitting spans within an hour.

Phase 3: Manual Spans for Critical Go Payment Flows

The Go services needed a different approach. Go has no auto-instrumentation equivalent — you instrument manually. But for the accounts ledger service (the critical path for every payment), we went further than basic HTTP tracing. We added custom spans around every business-critical operation.

package ledger

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("accounts-ledger")

func (s *LedgerService) ProcessPayment(ctx context.Context, payment Payment) (*Receipt, error) {
    ctx, span := tracer.Start(ctx, "ledger.ProcessPayment",
        trace.WithAttributes(
            attribute.String("payment.id", payment.ID),
            attribute.String("payment.currency", payment.Currency),
            attribute.Int64("payment.amount_cents", payment.AmountCents),
            attribute.String("payment.type", string(payment.Type)),
        ),
    )
    defer span.End()

    // Validate sender balance — creates a child span automatically
    // because we pass ctx through
    balance, err := s.getAccountBalance(ctx, payment.SenderAccountID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "failed to fetch sender balance")
        return nil, err
    }

    if balance < payment.AmountCents {
        span.SetAttributes(attribute.Bool("payment.insufficient_funds", true))
        span.SetStatus(codes.Error, "insufficient funds")
        return nil, ErrInsufficientFunds
    }

    // Debit sender. Start both child spans from the parent span's ctx
    // (not a reassigned one) so DebitSender and CreditRecipient appear
    // as siblings rather than nested spans.
    debitCtx, debitSpan := tracer.Start(ctx, "ledger.DebitSender")
    if err := s.debitAccount(debitCtx, payment.SenderAccountID, payment.AmountCents); err != nil {
        debitSpan.RecordError(err)
        debitSpan.SetStatus(codes.Error, "debit failed")
        debitSpan.End()
        return nil, err
    }
    debitSpan.End()

    // Credit recipient
    creditCtx, creditSpan := tracer.Start(ctx, "ledger.CreditRecipient")
    if err := s.creditAccount(creditCtx, payment.RecipientAccountID, payment.AmountCents); err != nil {
        creditSpan.RecordError(err)
        creditSpan.SetStatus(codes.Error, "credit failed")
        creditSpan.End()
        return nil, err
    }
    creditSpan.End()

    receipt := &Receipt{PaymentID: payment.ID, Status: "settled"}
    span.SetAttributes(attribute.String("payment.status", "settled"))
    span.SetStatus(codes.Ok, "payment settled")
    return receipt, nil
}

The W3C TraceContext header (traceparent) flows automatically through the HTTP client calls thanks to the otelhttp transport wrapper:

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// Replace default http.Client transport with instrumented version
httpClient := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

For the Kafka consumer in the reconciliation service — where trace context comes from message headers, not HTTP headers — we extracted the context explicitly:

// The header handling below assumes a client whose message headers
// expose Key/Value pairs (e.g. confluent-kafka-go); adjust for yours.
import (
    "context"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func (c *ReconciliationConsumer) processMessage(msg *kafka.Message) {
    // Extract trace context from Kafka message headers
    headers := make(map[string]string)
    for _, h := range msg.Headers {
        headers[string(h.Key)] = string(h.Value)
    }
    ctx := otel.GetTextMapPropagator().Extract(
        context.Background(),
        propagation.MapCarrier(headers),
    )

    ctx, span := tracer.Start(ctx, "reconciliation.ProcessMessage")
    defer span.End()
    // ... rest of processing with ctx
}

This was the one place where W3C propagation needed explicit wiring. Every other service boundary (HTTP) propagated automatically.

Phase 4: Exemplars and the N+1 Discovery

With traces flowing, we added Prometheus exemplars — the feature that links a metric spike directly to the trace that caused it.

The Node.js payment orchestrator had been using prom-client for metrics. We moved the payment histogram to the OpenTelemetry metrics API so the SDK attaches the current trace ID to each recording as an exemplar:

const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('payment-orchestrator');

// Histogram with explicit bucket boundaries; the SDK attaches the
// current trace ID as an exemplar whenever a value is recorded inside
// an active span context
const paymentDuration = meter.createHistogram('payment_duration_seconds', {
  description: 'Payment processing duration',
  advice: { explicitBucketBoundaries: [0.05, 0.1, 0.2, 0.5, 1, 2, 5] },
});

// In the payment handler:
async function handlePayment(req, res) {
  const startTime = Date.now();

  try {
    const result = await processPayment(req.body);
    const duration = (Date.now() - startTime) / 1000;

    // Recorded while the request span is active, so the trace ID
    // rides along as an exemplar
    paymentDuration.record(duration, {
      'payment.type': req.body.type,
      'payment.currency': req.body.currency,
    });

    res.json(result);
  } catch (err) {
    // ...
  }
}

With exemplars working, a spike on the Grafana p99 latency panel became a one-click investigation. That's how we found the N+1 query.

The N+1 Discovery

Three weeks after the traces went live, we noticed the payment p99 latency had a persistent 200ms baseline that spiked to 400ms under moderate load. The metric had always looked like this — we just hadn't had the context to explain it.

We clicked the exemplar icon on the spike. The trace opened in Tempo. What we saw:

POST /payments/process                    1.24s
  ├── fraud-detection.Evaluate            45ms
  ├── currency-conversion.Convert         12ms
  └── ledger.ProcessPayment              890ms
        ├── ledger.getAccountBalance       8ms   (account 1)
        ├── ledger.getAccountBalance       9ms   (account 2)
        ├── ledger.getAccountBalance       7ms   (account 3)
        ├── ledger.getAccountBalance       8ms   (account 4)
        ├── ledger.getAccountBalance       9ms   (account 5)
        ├── ledger.getAccountBalance       8ms   (account 6)
        ├── ledger.getAccountBalance       7ms   (account 7)
        ├── ledger.getAccountBalance       9ms   (account 8)
        ├── ledger.getAccountBalance       8ms   (account 9)
        ├── ledger.getAccountBalance       7ms   (account 10)
        ├── ledger.getAccountBalance       9ms   (account 11)
        ├── ledger.getAccountBalance       8ms   (account 12)
        ├── ledger.DebitSender            620ms
        └── ledger.CreditRecipient         15ms

Twelve sequential getAccountBalance calls before the actual debit. The DebitSender span then took 620ms on its own.

We ran EXPLAIN ANALYZE on the debit query:

EXPLAIN ANALYZE
SELECT id, balance, currency, status
FROM accounts
WHERE id = $1 AND status = 'active'
FOR UPDATE;

-- Output:
Seq Scan on accounts  (cost=0.00..8420.00 rows=1 width=48)
                      (actual time=614.231..614.232 rows=1 loops=1)
  Filter: ((id = '3f2a...'::uuid) AND (status = 'active'::account_status))
  Rows Removed by Filter: 284197
Planning Time: 0.087 ms
Execution Time: 614.298 ms

Sequential scan on 284,000 rows every time. The accounts table nominally had an index on id, but the plan shows both conditions in a Filter line with no Index Cond: the planner was not using an index for this query shape at all, and the FOR UPDATE row lock amplified the problem under concurrent load.

The twelve getAccountBalance calls were a logic bug: the reconciliation pre-check was iterating accounts that had already been validated upstream. That was a one-line fix.

The missing index was a two-line fix:

-- Added compound index to cover the query pattern
CREATE INDEX CONCURRENTLY idx_accounts_id_status
ON accounts (id, status)
WHERE status = 'active';

-- Verified with EXPLAIN ANALYZE after index creation:
-- Index Scan using idx_accounts_id_status on accounts
-- (actual time=0.041..0.042 rows=1 loops=1)
-- Execution Time: 0.089 ms

614ms → 0.089ms per debit operation. The 200ms baseline on payment p99 disappeared.

None of this was visible in CloudWatch. The RDS Performance Insights showed slow queries, but without the trace connecting the API request to the specific DB call, there was no way to know which service was causing it or why it was running 12 times per payment.

The Final Numbers

After four weeks with full tracing in production:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| MTTD (payment incidents) | 45 min | 4 min | -91% |
| Payment p99 latency | 890ms | 340ms | -62% |
| Debit DB query time | 614ms | 0.089ms | -99.9% |
| Incidents requiring DB team escalation | ~70% | ~15% | -78% |
| Observability tooling cost (monthly) | ~$2,100 (CloudWatch) | ~$420 (Tempo + Loki on S3) | -80% |

The cost comparison deserves a note: CloudWatch Logs Insights queries get expensive fast when you're running them at incident frequency across eight log groups. Grafana Tempo on S3 is close to free at this span volume. The FinOps analysis from our cost observability work confirmed the switch paid for the implementation within the first month.

What We'd Do Differently

Instrument the Kafka boundaries on day one, not week three. We deprioritized Kafka consumer instrumentation because it seemed like a minor edge case. It wasn't — the reconciliation service processes every failed payment retry, and missing trace context there meant a full category of incidents still had no correlation. We spent two weeks thinking retries were a separate problem.

Set p99 per-operation alerts from the start. We configured general p99 alerts but not per-span-name alerts. The N+1 query had been adding 200ms to ledger.ProcessPayment for months before we looked at it. Add span.name as a label in your Prometheus histograms and alert at that granularity — a per-operation SLO would have caught this on day one.

Run the OTEL Collector on a dedicated node group. We ran the DaemonSet on shared nodes and had two incidents where collector pods were OOMKilled under heavy telemetry load — trace gaps during exactly the incidents we most needed visibility into. A dedicated node group (one m5.large per AZ) with a NodePool taint and a matching toleration on the collector is worth ~$60/month. The Karpenter NodePool config makes it a five-line change.

Validate W3C propagation in staging before rollout. In production we found two services with a legacy X-Request-ID header they were treating as the trace ID — they appeared connected in logs but generated disconnected traces in Tempo. A single integration test asserting the same trace ID appears in spans from two services would have caught this in minutes.


Running microservices and spending 30+ minutes per incident reconstructing what happened? Distributed tracing is the single highest-leverage observability investment you can make. Let's talk through what instrumentation looks like for your stack.

Frequently Asked Questions

Why DaemonSet for the OTEL Collector instead of a sidecar per pod?

At scale, the sidecar pattern burns you on two fronts: resource overhead (each sidecar runs its own collector process, one per pod, so a 20-service platform with replicas runs dozens of collectors all doing the same batching, retry, and export work) and configuration sprawl (you have to update collector config across every deployment instead of in one place). A DaemonSet gives you one collector per node, shared across all pods on that node. The tradeoff is that you lose per-pod isolation — if the collector on a node has a bad config, all pods on that node lose telemetry. We mitigated this with a separate otel-collector-critical DaemonSet with stricter resource limits and a separate export pipeline for the payment services specifically.

How does W3C TraceContext propagation work across Node.js and Go services?

Both the Node.js auto-instrumentation (@opentelemetry/auto-instrumentations-node) and the Go OTEL SDK default to W3C TraceContext — the traceparent and tracestate HTTP headers. As long as both sides are initialized before the first request, the trace ID flows automatically through every HTTP call. The only manual work is for non-HTTP boundaries: our Kafka consumer in the Go reconciliation service needed an explicit otel.GetTextMapPropagator().Extract() call to pull the trace context from message headers. That's the one place W3C propagation doesn't happen automatically.

Grafana Tempo vs Datadog APM — is it actually cheaper?

Substantially cheaper at scale. Datadog APM pricing is per host per month (around $31/host for the APM add-on) plus ingestion fees beyond the included volume. On an 8-service platform generating ~2 million spans/day, Datadog APM runs to roughly $1,800–2,400/month depending on retention and host count. Grafana Tempo is object-storage-backed (S3 or GCS) — we pay for S3 storage and bandwidth only. At 2M spans/day with 30-day retention, that's around $80–120/month. The operational trade-off: Tempo has less polished UI than Datadog APM and you need Grafana to query it. If your team already runs Grafana for metrics (which they probably do), the marginal cost is near zero.

How do exemplars link Prometheus metrics to traces?

Prometheus exemplars are sample data points attached to a metric observation that carry a trace ID. When your payment service records a histogram observation for request duration, the OTEL SDK can attach the current trace ID as an exemplar. In Grafana, when you click a spike on the p99 latency panel, the exemplar icon appears — click it and Grafana jumps directly to the trace that produced that data point. No copy-pasting trace IDs, no manual correlation. You go from 'p99 spiked at 14:32' to the exact distributed trace in one click. Requires Prometheus with --enable-feature=exemplar-storage and Grafana ≥ 8.4.

What's the easiest way to correlate Loki logs with traces?

Add trace_id as a structured field in your application logs and ship them through the OTEL Collector so it lands in the log line. In Grafana, enable the Loki-to-Tempo derived field: Settings → Datasource → Loki → Derived Fields → add a regex on trace_id that links to the Tempo datasource. From there, every log line with a trace ID gets a clickable link that opens the full distributed trace. Resist the temptation to make trace_id a Loki label: its cardinality is unbounded and it will bloat Loki's index. For Node.js with Pino, it's a few lines of config using pino-opentelemetry-transport. For Go with zerolog or zap, you inject the span context into the logger on each request.