When a payment fails at 2 AM, the first thing an engineer does is open CloudWatch. They search for the error, find a log line from the API gateway, note the timestamp, open a second tab and search the payment service logs for the same timestamp window, find something suspicious, open a third tab for the database slow query logs, try to correlate by timestamp, realize the clocks are off by a few seconds across services, and spend the next 45 minutes manually threading a narrative through disconnected log streams.
That was the baseline: a 45-minute median time to detect, for a payments platform processing real money in real time.
The fix wasn't more logs. They already had CloudWatch. The fix was distributed tracing — a single trace ID that flows through every service in a request chain, so that instead of grepping timestamps across eight services, you open one view and see the entire journey: which service called which, how long each hop took, where the error originated.
We instrumented the full stack with OpenTelemetry, exported traces to Grafana Tempo, and along the way found an N+1 query that had been adding 200ms to every payment for months. Nobody had noticed because nothing connected the app-level latency to the database behavior.
What We Were Working With
Eight microservices on EKS: an API gateway (Node.js), a payment orchestrator (Node.js), a fraud detection service (Node.js), a reconciliation worker (Go), an accounts ledger service (Go), a notification service (Node.js), a currency conversion service (Go), and a webhook delivery service (Node.js). Most traffic flowed through the API gateway → payment orchestrator → accounts ledger path.
The observability stack before we started:
| Component | Tool | Gap |
|---|---|---|
| Logs | CloudWatch Logs | No trace IDs, no correlation |
| Metrics | CloudWatch Metrics | No link to logs or traces |
| APM | None | — |
| DB slow queries | RDS Performance Insights | Disconnected from app traces |
| Alerting | CloudWatch Alarms | Fires on symptoms, not causes |
| MTTD (payment incidents) | 45 min (median) | — |
The team was competent. The tooling just made correlation impossible. You can't grep your way to root cause when the information is spread across eight separate log streams with no shared identifier.
Phase 1: OpenTelemetry Collector as DaemonSet
The first decision was deployment architecture for the OTEL Collector. We chose DaemonSet — one collector per node, shared across all pods on that node — over the sidecar pattern.
With 8 services on a cluster that autoscales to 15–20 nodes (and Karpenter managing the node lifecycle), the sidecar approach would mean 30–60 collector processes. A DaemonSet means 15–20. Less CPU overhead, centralized config, one place to update the export pipeline.
The full DaemonSet config:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
namespace: monitoring
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
serviceAccountName: otel-collector
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
tolerationSeconds: 30
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
tolerationSeconds: 30
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:0.96.0
args:
- "--config=/conf/otel-collector-config.yaml"
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 500m
memory: 500Mi
ports:
- containerPort: 4317 # OTLP gRPC
- containerPort: 4318 # OTLP HTTP
- containerPort: 8888 # Collector metrics
volumeMounts:
- name: otel-collector-config
mountPath: /conf
volumes:
- name: otel-collector-config
configMap:
name: otel-collector-config
The collector config itself handles trace batching, memory protection, and export to Grafana Tempo. We also wired logs to Loki, with the trace ID carried on every log record for correlation:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 400
spike_limit_mib: 100
  resource:
    attributes:
      - key: k8s.cluster.name
        value: "payments-prod"
        action: upsert
      # Hint consumed by the Loki exporter: promote these resource attributes
      # to Loki labels. (Newer collector versions, including 0.96.0, dropped
      # the exporter-level `labels:` block in favor of these hints.)
      - key: loki.resource.labels
        value: service.name, k8s.pod.name
        action: insert
exporters:
  otlp/tempo:
    endpoint: http://grafana-tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true
  loki:
    endpoint: http://grafana-loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
    # Trace IDs stay on each log record; a Grafana derived field turns them
    # into one-click Tempo links
prometheus:
endpoint: "0.0.0.0:8889"
enable_open_metrics: true # Required for exemplar support
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last, per collector processor-ordering guidance
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
With this in place, every pod on every node had a local OTEL endpoint at http://$(NODE_IP):4317. Services point their OTEL SDK there via a NODE_IP env var injected from the pod spec's fieldRef.
Phase 2: Auto-Instrumentation for Node.js Services
Five of the eight services were Node.js. Rather than modifying each service individually, we used the OpenTelemetry Node.js auto-instrumentation package — zero code changes required, just a change to how the process starts.
The deployment patch for all Node.js services:
# Patch applied to all 5 Node.js deployments
spec:
template:
spec:
containers:
- name: app
env:
- name: NODE_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://$(NODE_IP):4317"
- name: OTEL_SERVICE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['app']
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OTEL_RESOURCE_ATTRIBUTES
          # $(POD_NAMESPACE) must be defined earlier in this env list for
          # Kubernetes to expand it here
          value: "deployment.environment=production,k8s.namespace.name=$(POD_NAMESPACE)"
- name: NODE_OPTIONS
value: "--require @opentelemetry/auto-instrumentations-node/register"
The NODE_OPTIONS line is the key — it tells Node.js to load the auto-instrumentation package before any application code runs. This instruments Express routes, HTTP outbound calls, PostgreSQL queries (pg client), Redis calls, and Kafka producers/consumers automatically.
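By default the register hook enables every instrumentation the package ships, which adds startup cost and span noise. Recent versions of @opentelemetry/auto-instrumentations-node let you scope it with an environment variable; a hypothetical addition to the same deployment patch (verify the variable name against your package version) might look like:

```yaml
# Sketch, not from the original rollout: restrict auto-instrumentation to the
# libraries actually used in this stack
- name: OTEL_NODE_ENABLED_INSTRUMENTATIONS
  value: "http,express,pg,ioredis,kafkajs"
```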
We added the package to the base Node.js Docker image used by all services:
FROM node:20-alpine
RUN npm install --save @opentelemetry/auto-instrumentations-node
# ... rest of Dockerfile
No individual service code changed. All five services started emitting spans within an hour.
Phase 3: Manual Spans for Critical Go Payment Flows
The Go services needed a different approach. Go has no auto-instrumentation equivalent — you instrument manually. But for the accounts ledger service (the critical path for every payment), we went further than basic HTTP tracing. We added custom spans around every business-critical operation.
package ledger
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("accounts-ledger")
func (s *LedgerService) ProcessPayment(ctx context.Context, payment Payment) (*Receipt, error) {
ctx, span := tracer.Start(ctx, "ledger.ProcessPayment",
trace.WithAttributes(
attribute.String("payment.id", payment.ID),
attribute.String("payment.currency", payment.Currency),
attribute.Int64("payment.amount_cents", payment.AmountCents),
attribute.String("payment.type", string(payment.Type)),
),
)
defer span.End()
// Validate sender balance — creates a child span automatically
// because we pass ctx through
balance, err := s.getAccountBalance(ctx, payment.SenderAccountID)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "failed to fetch sender balance")
return nil, err
}
if balance < payment.AmountCents {
span.SetAttributes(attribute.Bool("payment.insufficient_funds", true))
span.SetStatus(codes.Error, "insufficient funds")
return nil, ErrInsufficientFunds
}
	// Debit sender. Scoped contexts keep the debit and credit spans siblings;
	// reusing ctx here would nest CreditRecipient under the ended DebitSender span
	debitCtx, debitSpan := tracer.Start(ctx, "ledger.DebitSender")
	if err := s.debitAccount(debitCtx, payment.SenderAccountID, payment.AmountCents); err != nil {
		debitSpan.RecordError(err)
		debitSpan.SetStatus(codes.Error, "debit failed")
		debitSpan.End()
		return nil, err
	}
	debitSpan.End()
	// Credit recipient
	creditCtx, creditSpan := tracer.Start(ctx, "ledger.CreditRecipient")
	if err := s.creditAccount(creditCtx, payment.RecipientAccountID, payment.AmountCents); err != nil {
		creditSpan.RecordError(err)
		creditSpan.SetStatus(codes.Error, "credit failed")
		creditSpan.End()
		return nil, err
	}
	creditSpan.End()
receipt := &Receipt{PaymentID: payment.ID, Status: "settled"}
span.SetAttributes(attribute.String("payment.status", "settled"))
span.SetStatus(codes.Ok, "payment settled")
return receipt, nil
}
The W3C TraceContext header (traceparent) flows automatically through the HTTP client calls thanks to the otelhttp transport wrapper:
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
// Replace default http.Client transport with instrumented version
httpClient := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
For the Kafka consumer in the reconciliation service — where trace context comes from message headers, not HTTP headers — we extracted the context explicitly:
// Imports assume confluent-kafka-go; adjust the kafka import to your client.
import (
	"context"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func (c *ReconciliationConsumer) processMessage(msg *kafka.Message) {
	// Extract trace context from Kafka message headers
	headers := make(map[string]string)
	for _, h := range msg.Headers {
		headers[string(h.Key)] = string(h.Value)
	}
	ctx := otel.GetTextMapPropagator().Extract(
		context.Background(),
		propagation.MapCarrier(headers),
	)
	ctx, span := tracer.Start(ctx, "reconciliation.ProcessMessage")
	defer span.End()
	// ... rest of processing with ctx
}
This was the one place where W3C propagation needed explicit wiring. Every other service boundary (HTTP) propagated automatically.
Phase 4: Exemplars and the N+1 Discovery
With traces flowing, we added Prometheus exemplars — the feature that links a metric spike directly to the trace that caused it.
The Node.js payment orchestrator had been using prom-client for metrics. We moved its payment histogram onto the OpenTelemetry metrics API, so the Prometheus exporter could attach the active trace ID to each recording as an exemplar:
const { metrics } = require('@opentelemetry/api');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

const meterProvider = new MeterProvider({
  readers: [new PrometheusExporter({ port: 8889 })],
});
metrics.setGlobalMeterProvider(meterProvider);
const meter = metrics.getMeter('payment-orchestrator');

// Histogram with exemplar support; bucket boundaries set via metric advice
const paymentDuration = meter.createHistogram('payment_duration_seconds', {
  description: 'Payment processing duration',
  advice: { explicitBucketBoundaries: [0.05, 0.1, 0.2, 0.5, 1, 2, 5] },
});

// In the payment handler:
async function handlePayment(req, res) {
  const startTime = Date.now();
  try {
    const result = await processPayment(req.body);
    const duration = (Date.now() - startTime) / 1000;
    // Record with exemplar: because this runs inside the request's active
    // span, the SDK attaches that span's trace ID automatically
    paymentDuration.record(duration, {
      'payment.type': req.body.type,
      'payment.currency': req.body.currency,
    });
    res.json(result);
  } catch (err) {
    // ...
  }
}
With exemplars working, a spike on the Grafana p99 latency panel became a one-click investigation. That's how we found the N+1 query.
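For reference, this is roughly what an exemplar looks like in the OpenMetrics exposition the collector serves on :8889: a trace ID riding along with a histogram bucket observation (all values here are made up for illustration):

```text
payment_duration_seconds_bucket{payment_type="card",le="0.5"} 10437 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.412 1708423119.318
```

Grafana reads the `trace_id` exemplar label and renders it as a clickable dot on the latency panel, which is the "one-click investigation" described above.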
The N+1 Discovery
Three weeks after the traces went live, we noticed the payment p99 latency had a persistent 200ms baseline that spiked to 400ms under moderate load. The metric had always looked like this — we just hadn't had the context to explain it.
We clicked the exemplar icon on the spike. The trace opened in Tempo. What we saw:
POST /payments/process 1.24s
├── fraud-detection.Evaluate 45ms
├── currency-conversion.Convert 12ms
└── ledger.ProcessPayment 890ms
├── ledger.getAccountBalance 8ms (account 1)
├── ledger.getAccountBalance 9ms (account 2)
├── ledger.getAccountBalance 7ms (account 3)
├── ledger.getAccountBalance 8ms (account 4)
├── ledger.getAccountBalance 9ms (account 5)
├── ledger.getAccountBalance 8ms (account 6)
├── ledger.getAccountBalance 7ms (account 7)
├── ledger.getAccountBalance 9ms (account 8)
├── ledger.getAccountBalance 8ms (account 9)
├── ledger.getAccountBalance 7ms (account 10)
├── ledger.getAccountBalance 9ms (account 11)
├── ledger.getAccountBalance 8ms (account 12)
├── ledger.DebitSender 620ms
└── ledger.CreditRecipient 15ms
Twelve sequential getAccountBalance calls before the actual debit. The DebitSender span then took 620ms on its own.
We ran EXPLAIN ANALYZE on the debit query:
EXPLAIN ANALYZE
SELECT id, balance, currency, status
FROM accounts
WHERE id = $1 AND status = 'active'
FOR UPDATE;
-- Output:
Seq Scan on accounts (cost=0.00..8420.00 rows=1 width=48)
(actual time=614.231..614.232 rows=1 loops=1)
Filter: ((id = '3f2a...'::uuid) AND (status = 'active'::account_status))
Rows Removed by Filter: 284197
Planning Time: 0.087 ms
Execution Time: 614.298 ms
Sequential scan on 284,000 rows every time. The accounts table had an index on id but the compound condition WHERE id = $1 AND status = 'active' was not covered — and the FOR UPDATE lock was amplifying the problem under concurrent load.
The twelve getAccountBalance calls were a logic bug: the reconciliation pre-check was iterating accounts that had already been validated upstream. That was a one-line fix.
The missing index was a two-line fix:
-- Added compound index to cover the query pattern
CREATE INDEX CONCURRENTLY idx_accounts_id_status
ON accounts (id, status)
WHERE status = 'active';
-- Verified with EXPLAIN ANALYZE after index creation:
-- Index Scan using idx_accounts_id_status on accounts
-- (actual time=0.041..0.042 rows=1 loops=1)
-- Execution Time: 0.089 ms
614ms → 0.089ms per debit operation. The 200ms baseline on payment p99 disappeared.
None of this was visible in CloudWatch. RDS Performance Insights showed the slow query, but without a trace connecting the API request to the specific DB call, there was no way to know which service was responsible or why the balance lookup ran 12 times per payment.
The Final Numbers
After four weeks with full tracing in production:
| Metric | Before | After | Change |
|---|---|---|---|
| MTTD (payment incidents) | 45 min | 4 min | -91% |
| Payment p99 latency | 890ms | 340ms | -62% |
| Debit DB query time | 614ms | 0.089ms | -99.9% |
| Incidents requiring DB team escalation | ~70% | ~15% | -78% |
| Observability tooling cost (monthly) | ~$2,100 (CloudWatch) | ~$420 (Tempo + Loki on S3) | -80% |
The cost comparison deserves a note: CloudWatch Logs Insights queries get expensive fast when you're running them at incident frequency across eight log groups. Grafana Tempo on S3 is close to free at this span volume. The FinOps analysis from our cost observability work confirmed the switch paid for the implementation within the first month.
What We'd Do Differently
Instrument the Kafka boundaries on day one, not week three. We deprioritized Kafka consumer instrumentation because it seemed like a minor edge case. It wasn't — the reconciliation service processes every failed payment retry, and missing trace context there meant a full category of incidents still had no correlation. We spent two weeks thinking retries were a separate problem.
Set p99 per-operation alerts from the start. We configured general p99 alerts but not per-span-name alerts. The N+1 query had been adding 200ms to ledger.ProcessPayment for months before we looked at it. Add span.name as a label in your Prometheus histograms and alert at that granularity — a per-operation SLO would have caught this on day one.
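A per-operation alert along those lines might look like the following sketch. The metric and label names are assumptions (they presuppose a histogram recorded with the span name as a label, which the code in this post does not show):

```yaml
# Hypothetical Prometheus rule: p99 per span name, paging when any single
# operation's p99 stays above 300ms for 15 minutes
groups:
  - name: payment-latency
    rules:
      - alert: OperationP99High
        expr: |
          histogram_quantile(0.99,
            sum by (span_name, le) (rate(payment_duration_seconds_bucket[5m]))
          ) > 0.3
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "p99 for {{ $labels.span_name }} above 300ms"
```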
Run the OTEL Collector on a dedicated node group. We ran the DaemonSet on shared nodes and had two incidents where collector pods were OOMKilled under heavy telemetry load — trace gaps during exactly the incidents we most needed visibility into. A dedicated node group (one m5.large per AZ), tainted so that only the collector schedules there, is worth the ~$60/month. The Karpenter NodePool config makes it a five-line change.
Validate W3C propagation in staging before rollout. In production we found two services with a legacy X-Request-ID header they were treating as the trace ID — they appeared connected in logs but generated disconnected traces in Tempo. A single integration test asserting the same trace ID appears in spans from two services would have caught this in minutes.
Running microservices and spending 30+ minutes per incident reconstructing what happened? Distributed tracing is the single highest-leverage observability investment you can make. Let's talk through what instrumentation looks like for your stack.