A Datadog monitor fires at 2:14 AM. RDS CPU is at 95%. The on-call engineer opens their laptop, checks CloudWatch, eyeballs the Kubernetes pod logs, digs through Slack for the last time this happened, vaguely remembers a post-mortem somewhere in Confluence, can't find it, starts from scratch. Forty-five minutes later they have a root cause.
The agent we built takes 8 seconds.
Not because it's faster at reading metrics — any dashboard can read metrics. Because it simultaneously knows the company's architecture, retrieves the exact runbook for that RDS cluster, pulls up the post-mortem from the last time this happened, checks the Terraform config to understand why max_connections might be misconfigured, reads the recent Kubernetes rollout history, and cross-references the Cloudflare edge latency. It doesn't just see the fire. It understands the building.
That's the difference between an AI that reads your infrastructure and one that knows it. The technology that makes this possible is Retrieval-Augmented Generation — and it's not a minor enhancement. It's the whole game.
What We Were Working With
The client runs a multiplayer gaming platform: hundreds of thousands of concurrent players during peak events, sub-100ms latency requirements across the stack, and the unique threat model that comes with competitive gaming — DDoS attacks are a feature of the landscape, not an exception. Their infrastructure:
- Kubernetes (EKS) for all application workloads
- Datadog for APM, logs, metrics, monitors, and synthetics
- AWS for RDS (PostgreSQL), EC2 nodes, ELB load balancers, CloudWatch alarms
- Cloudflare for WAF, DDoS mitigation, CDN caching, and edge routing
- A Confluence wiki with years of architecture docs, runbooks, on-call playbooks, and post-mortems
- A Go backend and React frontend living in Git
- Terraform managing the full infrastructure
The on-call situation before we started:
| Metric | Baseline |
|---|---|
| Mean time to resolution (MTTR) | 45 minutes |
| Incidents requiring escalation | ~68% |
| Incidents correctly diagnosed in first 10 min | ~22% |
| On-call pages per week | ~31 |
| Engineer hours/week on incident response | ~18 hours |
The team wasn't slow. Incident response is genuinely hard when context is scattered across a dozen systems. We set out to centralize that context — and then make it queryable in real time.
The Architecture: Five Integrations, One RAG Brain
The agent has two layers: live data collection and indexed knowledge retrieval. Live data tells it what is happening right now. RAG tells it what it means.
Layer 1 — Live Integrations (Read-Only)
Datadog integration uses the Datadog API (v1 metrics query plus v2 endpoints for logs and events) with a scoped API/App key pair. The agent can query any metric, read any log, inspect any APM trace, and check monitor status and SLO burn rates in real time.
```python
import os
import time

import requests

DD_API_KEY = os.environ["DD_API_KEY"]
DD_APP_KEY = os.environ["DD_APP_KEY"]

def query_datadog_metrics(query: str, from_time: int, to_time: int) -> dict:
    """Query the Datadog metrics API for a time series."""
    url = "https://api.datadoghq.com/api/v1/query"
    params = {
        "from": from_time,
        "to": to_time,
        "query": query,
    }
    headers = {
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
    }
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: RDS CPU over the last 30 minutes
rds_cpu = query_datadog_metrics(
    query="avg:aws.rds.cpuutilization{dbinstanceidentifier:prod-postgres-primary}",
    from_time=int(time.time()) - 1800,
    to_time=int(time.time()),
)
```
Kubernetes integration uses the official Python client with a read-only ClusterRole — get/list/watch on pods, events, deployments, HPAs, nodes, and replicasets. No write verbs anywhere.
AWS integration uses boto3 with a read-only IAM role: cloudwatch:GetMetricData, rds:Describe*, ec2:Describe*, elasticloadbalancing:Describe*. The agent can check RDS replication lag, connection counts, CloudWatch alarms, and ELB health in one call.
Cloudflare integration uses the Analytics API with a read-only token scoped to Zone Analytics, Firewall Analytics, and Cache Analytics. The agent checks WAF event counts, bot scores, DDoS mitigation status, cache hit ratios, and edge latency by region.
Layer 2 — The RAG Knowledge Base (The Hero)
This is where the agent goes from "useful dashboard" to "the engineer who's been here five years and remembers everything."
We indexed four corpora into Pinecone using OpenAI's text-embedding-3-large model (3,072 dimensions, consistently better recall than ada-002 on technical content):
- Confluence wiki — every architecture doc, runbook, on-call playbook, incident timeline, and decision record. Chunked at 512 tokens with 64-token overlap. Metadata: page title, last modified, section path, author.
- Source code — the Go backend and React frontend, chunked by function and module. Metadata includes file path, function name, package, and surrounding context. The agent can look up the exact implementation of an API endpoint, find all the places a particular database table is queried, or retrieve error handling patterns for a specific service.
- Terraform/IaC — every resource definition, module, and variable file. When the agent sees an RDS alarm, it can retrieve the Terraform resource that defines that instance: the instance class, `max_connections` parameter group values, backup retention, multi-AZ setting, and which security groups allow access. This closes the gap between "the alarm is firing" and "the configuration that's causing it."
- Incident RCAs and post-mortems — every incident report the team ever wrote. Embedded with the incident title, affected services, root cause summary, and resolution steps. The agent has the full pattern library of historical failures.
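The 512-token / 64-token-overlap chunking used for the wiki corpus is a plain sliding window. A minimal sketch — whitespace splitting stands in for a real tokenizer (e.g. tiktoken) here:

```python
def chunk_text(text: str, chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Sliding-window chunker: each chunk shares `overlap` tokens with the next,
    so no sentence is stranded at a hard chunk boundary."""
    tokens = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```

Each chunk then gets its metadata (page title, section path, last modified) attached before embedding, so retrieval can cite its source.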
The retrieval function runs at query time with the full incident context as the query:
```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pinecone_index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("sre-knowledge-base")

def retrieve_relevant_context(
    query: str,
    namespaces: list[str],
    top_k: int = 8,
) -> list[dict]:
    """
    Retrieve relevant chunks from Pinecone across multiple namespaces.
    namespaces: ["runbooks", "source-code", "terraform", "incidents"]
    """
    query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-large",
    ).data[0].embedding
    results = []
    for namespace in namespaces:
        response = pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            namespace=namespace,
            include_metadata=True,
        )
        for match in response.matches:
            results.append({
                "namespace": namespace,
                "score": match.score,
                "text": match.metadata["text"],
                "source": match.metadata.get("source", "unknown"),
                "title": match.metadata.get("title", ""),
            })
    # Sort by relevance score across all namespaces
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k * 2]  # Top matches across all namespaces combined
```
Phase 1 — Parallel Context Collection
When a Datadog webhook fires, the agent collects live data from all four integrations simultaneously. Gaming infrastructure moves fast — a DDoS can go from zero to full traffic in under 30 seconds — so sequential data collection is not an option.
```python
import asyncio
from dataclasses import dataclass

@dataclass
class IncidentContext:
    alert: dict
    datadog: dict
    kubernetes: dict
    aws: dict
    cloudflare: dict
    rag_chunks: list[dict]

async def collect_incident_context(alert: dict) -> IncidentContext:
    """Collect all live context in parallel, then query RAG."""
    # Stage 1: all live integrations fire simultaneously
    dd_task = asyncio.create_task(collect_datadog_context(alert))
    k8s_task = asyncio.create_task(collect_kubernetes_context(alert))
    aws_task = asyncio.create_task(collect_aws_context(alert))
    cf_task = asyncio.create_task(collect_cloudflare_context(alert))
    dd_ctx, k8s_ctx, aws_ctx, cf_ctx = await asyncio.gather(
        dd_task, k8s_task, aws_task, cf_task,
        return_exceptions=True,  # One failing source doesn't block the others
    )
    # Stage 2: build a rich query from the live context for RAG retrieval
    rag_query = build_rag_query(alert, dd_ctx, k8s_ctx, aws_ctx, cf_ctx)
    rag_chunks = await asyncio.to_thread(
        retrieve_relevant_context,
        rag_query,
        ["runbooks", "source-code", "terraform", "incidents"],
    )
    return IncidentContext(
        alert=alert,
        datadog=dd_ctx,
        kubernetes=k8s_ctx,
        aws=aws_ctx,
        cloudflare=cf_ctx,
        rag_chunks=rag_chunks,
    )
```
Total collection time: 4–7 seconds for live data, plus ~1.2 seconds for the RAG embedding and retrieval. The engineer gets a complete picture before they've unlocked their laptop.
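The `build_rag_query` helper referenced above isn't shown in full; a minimal sketch, assuming each collector returns a dict with a `summary` field (and that failed collectors surface as exceptions via `return_exceptions=True`):

```python
def build_rag_query(alert: dict, *contexts) -> str:
    """Flatten the alert plus live-context summaries into one dense
    natural-language query for embedding and retrieval."""
    parts = [alert.get("title", ""), alert.get("message", "")]
    for ctx in contexts:
        if not isinstance(ctx, dict):
            continue  # a failed collector shouldn't poison the query
        parts.append(str(ctx.get("summary", "")))
    return " | ".join(p for p in parts if p)
```

The richer this query is, the better the retrieval: an alert title alone matches generic runbooks, but the deploy timestamps and connection counts in the live summaries are what pull up the *specific* past incident.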
Phase 2 — Correlation and Analysis with Claude
The assembled context goes to Claude Sonnet for deep analysis. We use Sonnet (not Haiku) here because multi-source correlation — five live data streams plus eight or more RAG chunks — requires the kind of structured reasoning where the extra inference time (6–9 seconds) is worth it. For simple single-monitor events, we gate to Haiku first and only escalate if the confidence is MEDIUM or below.
```python
from anthropic import Anthropic

anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def analyze_incident(context: IncidentContext) -> dict:
    """Run full incident analysis with Claude Sonnet."""
    rag_context_block = format_rag_chunks(context.rag_chunks)
    live_context_block = format_live_context(context)
    system_prompt = """You are an SRE analyst with deep expertise in this specific gaming platform.
You have access to real-time metrics, logs, Kubernetes state, AWS infrastructure, and
Cloudflare edge data. You also have retrieved relevant runbooks, past incident reports,
Terraform configurations, and source code from the team's knowledge base.

Your job: given all this context, identify the most likely root cause, assess confidence,
and suggest the safest corrective action.

Rules:
- Correlate timing precisely: if a K8s rollout happened 90 seconds before the alert, flag it
- Cross-reference RAG findings explicitly: "per the RDS runbook retrieved (score 0.91)..."
- Distinguish DDoS from legitimate traffic spikes using Cloudflare WAF data and event calendar
- Rate confidence HIGH / MEDIUM / LOW with explicit reasoning
- Always suggest rollback before config changes; config changes before restarts
- If a past incident matches this pattern, reference it by incident ID"""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=3000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"LIVE CONTEXT:\n{live_context_block}\n\nRAG RETRIEVED:\n{rag_context_block}",
        }],
    )
    return parse_analysis(response.content[0].text)
```
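`parse_analysis` is referenced but not shown; a minimal sketch that pulls the confidence rating out of the model's text — assuming the response follows the prompt's HIGH/MEDIUM/LOW convention:

```python
import re

def parse_analysis(text: str) -> dict:
    """Extract a HIGH/MEDIUM/LOW confidence rating from the analysis text.
    Falls back to LOW so an unparseable response is never over-trusted."""
    match = re.search(r"\b(HIGH|MEDIUM|LOW)\b", text)
    return {
        "confidence": match.group(1) if match else "LOW",
        "analysis": text.strip(),
    }
```

A production version would more likely ask the model for structured (JSON) output; the defensive fallback is the important part either way.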
Phase 3 — Slack Incident Brief
The agent posts a structured brief to the team's incident channel. The format evolved over several months of iteration — this is what the team actually uses to make decisions:
```
🔴 INCIDENT — prod-postgres-primary: CPU 95% (threshold: 80%)
Triggered: 02:14 UTC | Duration: 6 min

📊 LIVE STATE
• RDS CPU: 95% | Connections: 487/500 | Replication lag: 2.1s
• K8s: game-api deployment rolled out v3.41.0 @ 02:11 UTC (2 min before spike)
• Cloudflare: WAF clean, no DDoS signal | Cache hit: 71% (normal)
• ELB: healthy, no 5xx spike at edge

🔍 ROOT CAUSE — HIGH confidence (0.87)
New code in game-api v3.41.0 introduced a leaderboard query that performs
a full table scan on `player_scores` (retrieved from source code:
leaderboard_service.go:247, commit a3f91b2). Under concurrent load during
the 02:00 UTC daily tournament start, this query saturates RDS connections.

📚 KNOWLEDGE BASE MATCH
• Runbook: "RDS Connection Exhaustion — prod-postgres-primary" (score 0.93)
  → Recommends adding index on (game_id, score DESC) + connection pool cap at 400
• Past incident: INC-2024-047 (score 0.89) — identical pattern, same tournament window
  → Resolution: rolled back deploy + added index. RDS CPU normalized in 4 min.
• Terraform config: rds_prod.tf line 34 — max_connections via parameter group: 500
  → Runbook recommends 400 max to leave headroom for admin connections

💡 RECOMMENDED ACTIONS
1. ROLLBACK game-api to v3.40.9 (immediate, lowest risk)
   → kubectl rollout undo deployment/game-api -n production
2. After rollback stabilizes, add DB index (from INC-2024-047 resolution):
   → CREATE INDEX CONCURRENTLY idx_player_scores_game_score
     ON player_scores (game_id, score DESC);
3. Cap connection pool in game-api config: DB_POOL_MAX=80 (per runbook)

📋 CONFIDENCE BREAKDOWN
• Deploy correlation: v3.41.0 deployed 2 min before spike ✓
• Source code match: full table scan in new leaderboard query ✓
• Historical match: INC-2024-047 identical pattern ✓
• Cloudflare clean: rules out DDoS ✓
```
The engineer reads this, verifies the rollout timestamp matches, and runs the kubectl command. No digging. No searching Confluence at 2 AM.
What the Agent Found That Humans Consistently Missed
Three categories of finding emerged over the first four months in production.
Slow degradation patterns. A memory leak in a Go microservice was climbing 0.3% per hour — only visible during 8-hour gaming sessions and invisible on any dashboard without the right time window. The agent correlated Datadog container memory metrics with Kubernetes pod restart history across three weeks of data and surfaced the trend before the next major tournament. The team had been attributing the restarts to "OOMKilled intermittently" without connecting them to session length.
Cross-system configuration drift. An RDS read replica was running 45 seconds behind the primary during peak hours. The agent retrieved the Terraform config via RAG (rds_replica.tf, max_connections = 100 — the default) and compared it to the runbook recommendation (400 for this instance class). It also found the AWS RDS best practices doc in the wiki noting that low max_connections causes queue buildup under write-heavy replication. No one on the team had ever connected those three documents in the same mental model.
Silent WAF false positives. A Cloudflare WAF rule added after a DDoS event was blocking legitimate WebSocket connection upgrades from specific regions — players in those areas reported random disconnects, which support tagged as "client-side issues." The agent correlated WAF block event logs (Cloudflare API) with player complaint tickets embedded in the support wiki (RAG), matched them by region and timestamp, and raised the finding with a specific WAF rule ID and the recommended exception pattern. This had been invisible for six weeks.
The Final Numbers
After four months in production:
| Metric | Before | After | Change |
|---|---|---|---|
| Mean time to resolution (MTTR) | 45 min | 8 min | -82% |
| Incidents auto-diagnosed (HIGH confidence) | — | 71% | — |
| False positive rate | — | 6% | — |
| On-call escalations requiring wake-up | ~68% | ~27% | -60% |
| Engineer hours/week on incident response | ~18 hrs | ~6 hrs | -67% |
| Incidents resolved before engineer acts | 0% | 43% | — |
The 43% "resolved before engineer acts" figure covers L1 incidents where the agent's diagnosis was HIGH confidence, the action was a rollback or known-safe config change, and the auto-remediation gate (second Claude instance validation + action allowlist) cleared it automatically. No human woken up.
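The allowlist half of that gate can be sketched as a strict prefix match — the commands and names here are illustrative, not the client's actual allowlist:

```python
SAFE_ACTION_PREFIXES = (
    "kubectl rollout undo deployment/",
    "kubectl scale deployment/",
)

def action_allowed(command: str, confidence: str) -> bool:
    """Auto-remediation requires HIGH confidence AND an allowlisted command.
    Anything else pages a human."""
    if confidence != "HIGH":
        return False
    return any(command.startswith(prefix) for prefix in SAFE_ACTION_PREFIXES)
```

The prefix match is deliberately rigid: an allowlist that tries to be clever about argument parsing is an allowlist with bypass bugs.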
What We'd Do Differently
Index the metrics schema first, not last. We spent two weeks after the initial deploy building mappings between Datadog metric names and their business meaning — which aws.rds.* metric corresponds to which specific RDS cluster, what the normal range is for each service's p99. Embedding this schema into the RAG index as a structured document (not trying to infer it at query time) would have improved accuracy from day one. Do this before indexing anything else.
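Concretely, "embedding the schema as a structured document" can mean one self-contained chunk per metric mapping — values here are illustrative:

```python
METRIC_SCHEMA = [
    {
        "metric": "aws.rds.cpuutilization",
        "maps_to": "prod-postgres-primary",
        "meaning": "CPU on the primary game-state Postgres instance",
        "normal_range": "15-55%",
        "alert_threshold": "80%",
    },
]

def schema_to_documents(schema: list[dict]) -> list[str]:
    """Render each schema entry as its own text chunk for embedding, so a
    retrieval hit carries the metric's business meaning and normal range."""
    return ["\n".join(f"{k}: {v}" for k, v in entry.items()) for entry in schema]
```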
Chunk Terraform by resource, not by file. We initially chunked Terraform files by token count, which sometimes split a resource block across two chunks, losing the connection between a resource's attributes. Switching to resource-aware chunking — each resource {} block as an atomic unit, always together — noticeably improved the agent's ability to understand infrastructure configuration. The right chunk boundary is semantic, not positional.
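Resource-aware chunking is mostly brace counting. A sketch of the idea — a production version would use a real HCL parser (e.g. python-hcl2) rather than counting braces in strings:

```python
def chunk_terraform(source: str) -> list[str]:
    """Split a .tf file into one chunk per top-level block (resource,
    module, variable, ...) by tracking brace depth, so a resource's
    attributes always land in the same chunk."""
    chunks, current, depth = [], [], 0
    for line in source.splitlines():
        current.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0 and "{" in "".join(current):
            chunks.append("\n".join(current).strip())
            current = []
    return chunks
```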
Build the RAG update pipeline before the agent, not after. We shipped the agent talking to a static index, then scrambled to build CI/CD webhook integrations to keep it current. A stale knowledge base is worse than no knowledge base — the agent confidently retrieves outdated runbooks. The re-indexing pipeline should be the first thing you build, not the last.
Start with incidents-only RAG before adding source code. The post-mortems and runbooks corpus delivered immediate, measurable accuracy improvements. The source code corpus added significant value but also introduced retrieval noise — Go function signatures sometimes matched on syntax rather than semantic relevance. We should have tuned retrieval quality on the simpler corpus first, then added the noisier one.
Building an SRE agent for a complex multi-stack environment? The integrations are the easy part — getting the RAG knowledge base right is where most teams underinvest. Let's talk through the architecture before you start indexing.