RAG-Powered SRE Agent: Building Total Situational Awareness for a Gaming Platform


A Datadog monitor fires at 2:14 AM. RDS CPU is at 95%. The on-call engineer opens their laptop, checks CloudWatch, eyeballs the Kubernetes pod logs, digs through Slack for the last time this happened, vaguely remembers a post-mortem somewhere in Confluence, can't find it, starts from scratch. Forty-five minutes later they have a root cause.

The agent we built takes 8 seconds.

Not because it's faster at reading metrics — any dashboard can read metrics. Because it simultaneously knows the company's architecture, retrieves the exact runbook for that RDS cluster, pulls up the post-mortem from the last time this happened, checks the Terraform config to understand why max_connections might be misconfigured, reads the recent Kubernetes rollout history, and cross-references the Cloudflare edge latency. It doesn't just see the fire. It understands the building.

That's the difference between an AI that reads your infrastructure and one that knows it. The technology that makes this possible is Retrieval-Augmented Generation — and it's not a minor enhancement. It's the whole game.

What We Were Working With

The client runs a multiplayer gaming platform: hundreds of thousands of concurrent players during peak events, sub-100ms latency requirements across the stack, and the unique threat model that comes with competitive gaming — DDoS attacks are a feature of the landscape, not an exception. Their infrastructure:

  • Kubernetes (EKS) for all application workloads
  • Datadog for APM, logs, metrics, monitors, and synthetics
  • AWS for RDS (PostgreSQL), EC2 nodes, ELB load balancers, CloudWatch alarms
  • Cloudflare for WAF, DDoS mitigation, CDN caching, and edge routing
  • A Confluence wiki with years of architecture docs, runbooks, on-call playbooks, and post-mortems
  • A Go backend and React frontend living in Git
  • Terraform managing the full infrastructure

The on-call situation before we started:

Metric                                          Baseline
Mean time to resolution (MTTR)                  45 minutes
Incidents requiring escalation                  ~68%
Incidents correctly diagnosed in first 10 min   ~22%
On-call pages per week                          ~31
Engineer hours/week on incident response        ~18 hours

The team wasn't slow. Incident response is genuinely hard when context is scattered across a dozen systems. We set out to centralize that context — and then make it queryable in real time.

The Architecture: Five Integrations, One RAG Brain

The agent has two layers: live data collection and indexed knowledge retrieval. Live data tells it what is happening right now. RAG tells it what it means.

Layer 1 — Live Integrations (Read-Only)

Datadog integration uses the Datadog API v2 with a scoped API/App key pair. The agent can query any metric, read any log, inspect any APM trace, and check monitor status and SLO burn rates in real time.

import os
import time

import requests

DD_API_KEY = os.environ["DD_API_KEY"]
DD_APP_KEY = os.environ["DD_APP_KEY"]

def query_datadog_metrics(query: str, from_time: int, to_time: int) -> dict:
    """Query the Datadog metrics API for a time series."""
    url = "https://api.datadoghq.com/api/v1/query"
    params = {
        "from": from_time,
        "to": to_time,
        "query": query,
    }
    headers = {
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
    }
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: RDS CPU over the last 30 minutes
rds_cpu = query_datadog_metrics(
    query="avg:aws.rds.cpuutilization{dbinstanceidentifier:prod-postgres-primary}",
    from_time=int(time.time()) - 1800,
    to_time=int(time.time()),
)

Kubernetes integration uses the official Python client with a read-only ClusterRole: get/list/watch on pods, events, deployments, HPAs, nodes, and ReplicaSets. No write verbs anywhere.

AWS integration uses boto3 with a read-only IAM role: cloudwatch:GetMetricData, rds:Describe*, ec2:Describe*, elasticloadbalancing:Describe*. The agent can check RDS replication lag, connection counts, CloudWatch alarms, and ELB health in one call.
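As a sketch of what one of those read-only calls looks like, here is a helper that builds the cloudwatch:GetMetricData request for RDS CPU. The instance identifier and period are illustrative:

```python
from datetime import datetime, timedelta, timezone

def build_rds_cpu_request(db_instance: str, minutes: int = 30) -> dict:
    """Build a read-only cloudwatch:GetMetricData request for RDS CPU.
    Pass the result as **kwargs to boto3's cloudwatch.get_metric_data()."""
    now = datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [{
            "Id": "rds_cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": db_instance},
                    ],
                },
                "Period": 60,  # one datapoint per minute
                "Stat": "Average",
            },
        }],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
    }

# Usage (requires read-only IAM credentials):
#   cloudwatch = boto3.client("cloudwatch")
#   data = cloudwatch.get_metric_data(**build_rds_cpu_request("prod-postgres-primary"))
```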

Cloudflare integration uses the Analytics API with a read-only token scoped to Zone Analytics, Firewall Analytics, and Cache Analytics. The agent checks WAF event counts, bot scores, DDoS mitigation status, cache hit ratios, and edge latency by region.

Layer 2 — The RAG Knowledge Base (The Hero)

This is where the agent goes from "useful dashboard" to "the engineer who's been here five years and remembers everything."

We indexed four corpora into Pinecone using OpenAI's text-embedding-3-large model (3,072 dimensions, consistently better recall than ada-002 on technical content):

  1. Confluence wiki — every architecture doc, runbook, on-call playbook, incident timeline, and decision record. Chunked at 512 tokens with 64-token overlap. Metadata: page title, last modified, section path, author.

  2. Source code — the Go backend and React frontend, chunked by function and module. Metadata includes file path, function name, package, and surrounding context. The agent can look up the exact implementation of an API endpoint, find all the places a particular database table is queried, or retrieve error handling patterns for a specific service.

  3. Terraform/IaC — every resource definition, module, and variable file. When the agent sees an RDS alarm, it can retrieve the Terraform resource that defines that instance: the instance class, max_connections parameter group values, backup retention, multi-AZ setting, and which security groups allow access. This closes the gap between "the alarm is firing" and "the configuration that's causing it."

  4. Incident RCAs and post-mortems — every incident report the team ever wrote. Embedded with the incident title, affected services, root cause summary, and resolution steps. The agent has the full pattern library of historical failures.
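The 512-token chunking with 64-token overlap mentioned above is a simple sliding window; a sketch over a pre-tokenized list (production code would use the embedding model's actual tokenizer, not whitespace splitting):

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Sliding-window chunking: each chunk shares `overlap` tokens with the
    previous one, so content split at a boundary still appears whole in at
    least one chunk."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks
```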

The retrieval function runs at query time with the full incident context as the query:

import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pinecone_index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("sre-knowledge")  # index name illustrative

def retrieve_relevant_context(
    query: str,
    namespaces: list[str],
    top_k: int = 8,
) -> list[dict]:
    """
    Retrieve relevant chunks from Pinecone across multiple namespaces.
    namespaces: ["runbooks", "source-code", "terraform", "incidents"]
    """
    query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-large",
    ).data[0].embedding

    results = []
    for namespace in namespaces:
        response = pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            namespace=namespace,
            include_metadata=True,
        )
        for match in response.matches:
            results.append({
                "namespace": namespace,
                "score": match.score,
                "text": match.metadata["text"],
                "source": match.metadata.get("source", "unknown"),
                "title": match.metadata.get("title", ""),
            })

    # Sort by relevance score across all namespaces
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k * 2]  # Return top matches across all namespaces

Phase 1 — Parallel Context Collection

When a Datadog webhook fires, the agent collects live data from all four integrations simultaneously. Gaming infrastructure moves fast — a DDoS can go from zero to full traffic in under 30 seconds — so sequential data collection is not an option.

import asyncio
from dataclasses import dataclass

@dataclass
class IncidentContext:
    alert: dict
    datadog: dict
    kubernetes: dict
    aws: dict
    cloudflare: dict
    rag_chunks: list[dict]

async def collect_incident_context(alert: dict) -> IncidentContext:
    """Collect all live context in parallel, then query RAG."""

    # Stage 1: all live integrations fire simultaneously
    dd_task = asyncio.create_task(collect_datadog_context(alert))
    k8s_task = asyncio.create_task(collect_kubernetes_context(alert))
    aws_task = asyncio.create_task(collect_aws_context(alert))
    cf_task = asyncio.create_task(collect_cloudflare_context(alert))

    dd_ctx, k8s_ctx, aws_ctx, cf_ctx = await asyncio.gather(
        dd_task, k8s_task, aws_task, cf_task,
        return_exceptions=True,  # One failing source doesn't block the others
    )

    # Stage 2: build a rich query from the live context for RAG retrieval
    rag_query = build_rag_query(alert, dd_ctx, k8s_ctx, aws_ctx, cf_ctx)
    rag_chunks = await asyncio.to_thread(
        retrieve_relevant_context,
        rag_query,
        ["runbooks", "source-code", "terraform", "incidents"],
    )

    return IncidentContext(
        alert=alert,
        datadog=dd_ctx,
        kubernetes=k8s_ctx,
        aws=aws_ctx,
        cloudflare=cf_ctx,
        rag_chunks=rag_chunks,
    )

Total collection time: 4–7 seconds for live data, plus ~1.2 seconds for the RAG embedding and retrieval. The engineer gets a complete picture before they've unlocked their laptop.
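build_rag_query, referenced in the collector above, is not shown in full; the idea is to flatten the most salient live signals into a single natural-language query for embedding. A minimal sketch with illustrative field names, tolerating the exceptions that gather may return in place of a context:

```python
def build_rag_query(alert: dict, dd, k8s, aws, cf) -> str:
    """Flatten live incident signals into one query string for embedding.
    Sources that failed to collect (exceptions from asyncio.gather with
    return_exceptions=True) are silently skipped."""
    parts = [f"Alert: {alert.get('title', 'unknown')}"]
    for label, ctx in (("Datadog", dd), ("Kubernetes", k8s), ("AWS", aws), ("Cloudflare", cf)):
        if isinstance(ctx, dict) and ctx.get("summary"):
            parts.append(f"{label}: {ctx['summary']}")
    return " | ".join(parts)
```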

Phase 2 — Correlation and Analysis with Claude

The assembled context goes to Claude Sonnet for deep analysis. We use Sonnet (not Haiku) here because multi-source correlation — five live data streams plus eight or more RAG chunks — requires the kind of structured reasoning where the extra inference time (6–9 seconds) is worth it. For simple single-monitor events, we gate to Haiku first and only escalate if the confidence is MEDIUM or below.
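The Haiku-to-Sonnet gate described above is plain routing logic; a minimal sketch, assuming triage returns HIGH/MEDIUM/LOW confidence labels:

```python
def needs_sonnet_escalation(is_single_monitor: bool, haiku_confidence: str) -> bool:
    """Decide whether an event escalates to Sonnet after Haiku triage.
    Multi-source events always go to Sonnet; single-monitor events escalate
    only when Haiku's confidence is MEDIUM or below."""
    if not is_single_monitor:
        return True
    return haiku_confidence in ("MEDIUM", "LOW")
```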

def analyze_incident(context: IncidentContext) -> dict:
    """Run full incident analysis with Claude Sonnet."""

    rag_context_block = format_rag_chunks(context.rag_chunks)
    live_context_block = format_live_context(context)

    system_prompt = """You are an SRE analyst with deep expertise in this specific gaming platform.
You have access to real-time metrics, logs, Kubernetes state, AWS infrastructure, and
Cloudflare edge data. You also have retrieved relevant runbooks, past incident reports,
Terraform configurations, and source code from the team's knowledge base.

Your job: given all this context, identify the most likely root cause, assess confidence,
and suggest the safest corrective action.

Rules:
- Correlate timing precisely: if a K8s rollout happened 90 seconds before the alert, flag it
- Cross-reference RAG findings explicitly: "per the RDS runbook retrieved (score 0.91)..."
- Distinguish DDoS from legitimate traffic spikes using Cloudflare WAF data and event calendar
- Rate confidence HIGH / MEDIUM / LOW with explicit reasoning
- Always suggest rollback before config changes; config changes before restarts
- If a past incident matches this pattern, reference it by incident ID"""

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=3000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"LIVE CONTEXT:\n{live_context_block}\n\nRAG RETRIEVED:\n{rag_context_block}",
        }],
    )

    return parse_analysis(response.content[0].text)

Phase 3 — Slack Incident Brief

The agent posts a structured brief to the team's incident channel. The format evolved over several months of iteration — this is what the team actually uses to make decisions:

🔴 INCIDENT — prod-postgres-primary: CPU 95% (threshold: 80%)
Triggered: 02:14 UTC | Duration: 6 min

📊 LIVE STATE
• RDS CPU: 95% | Connections: 487/500 | Replication lag: 2.1s
• K8s: game-api deployment rolled out v3.41.0 @ 02:11 UTC (2 min before spike)
• Cloudflare: WAF clean, no DDoS signal | Cache hit: 71% (normal)
• ELB: healthy, no 5xx spike at edge

🔍 ROOT CAUSE — HIGH confidence (0.87)
New code in game-api v3.41.0 introduced a leaderboard query that performs
a full table scan on `player_scores` (retrieved from source code: 
leaderboard_service.go:247, commit a3f91b2). Under concurrent load during
the 02:00 UTC daily tournament start, this query saturates RDS connections.

📚 KNOWLEDGE BASE MATCH
• Runbook: "RDS Connection Exhaustion — prod-postgres-primary" (score 0.93)
  → Recommends adding index on (game_id, score DESC) + connection pool cap at 400
• Past incident: INC-2024-047 (score 0.89) — identical pattern, same tournament window
  → Resolution: rolled back deploy + added index. RDS CPU normalized in 4 min.
• Terraform config: rds_prod.tf line 34 — max_connections via parameter group: 500
  → Runbook recommends 400 max to leave headroom for admin connections

💡 RECOMMENDED ACTIONS
1. ROLLBACK game-api to v3.40.9 (immediate, lowest risk)
   → kubectl rollout undo deployment/game-api -n production
2. After rollback stabilizes, add DB index (from INC-2024-047 resolution):
   → CREATE INDEX CONCURRENTLY idx_player_scores_game_score 
      ON player_scores (game_id, score DESC);
3. Cap connection pool in game-api config: DB_POOL_MAX=80 (per runbook)

📋 CONFIDENCE BREAKDOWN
• Deploy correlation: v3.41.0 deployed 2 min before spike ✓
• Source code match: full table scan in new leaderboard query ✓  
• Historical match: INC-2024-047 identical pattern ✓
• Cloudflare clean: rules out DDoS ✓

The engineer reads this, verifies the rollout timestamp matches, and runs the kubectl command. No digging. No searching Confluence at 2 AM.
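Rendering that brief is ordinary string templating plus the Slack API; a hedged sketch of the header lines, assuming the analysis step returns a dict with these illustrative fields:

```python
def format_incident_header(analysis: dict) -> str:
    """Render the top lines of the Slack incident brief.
    The severity emoji and layout mirror the format shown above;
    the dict keys here are assumptions, not the production schema."""
    sev = {"critical": "🔴", "warning": "🟡"}.get(analysis["severity"], "⚪")
    lines = [
        f"{sev} INCIDENT — {analysis['resource']}: {analysis['headline']}",
        f"Triggered: {analysis['triggered_at']} | Duration: {analysis['duration_min']} min",
    ]
    return "\n".join(lines)

# Posting with slack_sdk:
#   WebClient(token=...).chat_postMessage(channel="#incidents", text=format_incident_header(a))
```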

What the Agent Found That Humans Consistently Missed

Three categories of finding emerged over the first four months in production.

Slow degradation patterns. A memory leak in a Go microservice was climbing 0.3% per hour — only visible during 8-hour gaming sessions and invisible on any dashboard without the right time window. The agent correlated Datadog container memory metrics with Kubernetes pod restart history across three weeks of data and surfaced the trend before the next major tournament. The team had been attributing the restarts to "OOMKilled intermittently" without connecting them to session length.
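A climb that slow is invisible on a default dashboard window but trivial to detect with a fitted trend line. A minimal sketch using a least-squares slope over hourly memory samples; the 0.25%/hour threshold is illustrative:

```python
def memory_trend_pct_per_hour(samples: list[float]) -> float:
    """Ordinary least-squares slope of memory-usage-% samples taken hourly."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def leaking(samples: list[float], threshold: float = 0.25) -> bool:
    """Flag a sustained climb, e.g. the 0.3%/hour leak described above."""
    return memory_trend_pct_per_hour(samples) >= threshold
```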

Cross-system configuration drift. An RDS read replica was running 45 seconds behind primary during peak hours. The agent retrieved the Terraform config via RAG (rds_replica.tf, max_connections = 100 — the default) and compared it to the runbook recommendation (400 for this instance class). It also found the AWS RDS best practices doc in the wiki noting that low max_connections causes queue buildup under write-heavy replication. The team's humans had never connected those three documents in the same mental model.

Silent WAF false positives. A Cloudflare WAF rule added after a DDoS event was blocking legitimate WebSocket connection upgrades from specific regions — players in those areas reported random disconnects, which support tagged as "client-side issues." The agent correlated WAF block event logs (Cloudflare API) with player complaint tickets embedded in the support wiki (RAG), matched them by region and timestamp, and raised the finding with a specific WAF rule ID and the recommended exception pattern. This had been invisible for six weeks.

The Final Numbers

After four months in production:

Metric                                       Before    After    Change
Mean time to resolution (MTTR)               45 min    8 min    -82%
Incidents auto-diagnosed (HIGH confidence)   n/a       71%      n/a
False positive rate                          n/a       6%       n/a
On-call escalations requiring wake-up        ~68%      ~27%     -60%
Engineer hours/week on incident response     ~18 hrs   ~6 hrs   -67%
Incidents resolved before engineer acts      0%        43%      n/a

The 43% "resolved before engineer acts" figure covers L1 incidents where the agent's diagnosis was HIGH confidence, the action was a rollback or known-safe config change, and the auto-remediation gate (second Claude instance validation + action allowlist) cleared it automatically. No human woken up.
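The allowlist half of that gate is deliberately simple: an exact-match set of pre-approved action templates, checked before anything executes. A sketch with example entries, not the client's actual list:

```python
# Example allowlist entries; the real list is maintained by the SRE team.
ALLOWED_ACTIONS = {
    ("rollback", "deployment/game-api"),
    ("rollback", "deployment/matchmaker"),
    ("config_change", "DB_POOL_MAX"),
}

def auto_remediation_cleared(action: dict, confidence: str, validator_approved: bool) -> bool:
    """All three gates must pass: HIGH confidence from the analysis,
    independent approval from the second Claude instance, and an exact
    match against the action allowlist."""
    return (
        confidence == "HIGH"
        and validator_approved
        and (action["type"], action["target"]) in ALLOWED_ACTIONS
    )
```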

What We'd Do Differently

Index the metrics schema first, not last. We spent two weeks after the initial deploy building mappings between Datadog metric names and their business meaning — which aws.rds.* metric corresponds to which specific RDS cluster, what the normal range is for each service's p99. Embedding this schema into the RAG index as a structured document (not trying to infer it at query time) would have improved accuracy from day one. Do this before indexing anything else.

Chunk Terraform by resource, not by file. We initially chunked Terraform files by token count, which sometimes split a resource block across two chunks, losing the connection between a resource's attributes. Switching to resource-aware chunking — each resource {} block as an atomic unit, always together — noticeably improved the agent's ability to understand infrastructure configuration. The right chunk boundary is semantic, not positional.

Build the RAG update pipeline before the agent, not after. We shipped the agent talking to a static index, then scrambled to build CI/CD webhook integrations to keep it current. A stale knowledge base is worse than no knowledge base — the agent confidently retrieves outdated runbooks. The re-indexing pipeline should be the first thing you build, not the last.

Start with incidents-only RAG before adding source code. The post-mortems and runbooks corpus delivered immediate, measurable accuracy improvements. The source code corpus added significant value but also introduced retrieval noise — Go function signatures sometimes matched on syntax rather than semantic relevance. We should have tuned retrieval quality on the simpler corpus first, then added the noisier one.


Building an SRE agent for a complex multi-stack environment? The integrations are the easy part — getting the RAG knowledge base right is where most teams underinvest. Let's talk through the architecture before you start indexing.

Frequently Asked Questions

Why does RAG matter here — can't you just put the runbooks in the system prompt?

No — context window limits make static prompts impractical at real knowledge base scale. A gaming company's full Confluence wiki, all past post-mortems, every Terraform module, and the source code of a Go backend adds up to tens of millions of tokens. You cannot fit that in any context window. RAG solves this by retrieving only the relevant chunks at query time — the runbook for this specific RDS instance, the post-mortem where this exact pattern appeared, the Terraform config that shows why max_connections is set too low. The agent gets surgical precision instead of noise.

What gets indexed in the RAG knowledge base and how do you keep it current?

We indexed four corpora: (1) Confluence/wiki — architecture docs, runbooks, on-call playbooks; (2) source code — Go backend and React frontend, chunked by function/module with full file path metadata; (3) Terraform/IaC — every resource definition, so the agent knows your actual infrastructure topology; (4) incident RCAs and post-mortems — the historical pattern library. Re-indexing is triggered by CI/CD pipeline webhooks (source code changes) and Confluence webhooks (doc changes). The Pinecone index stays current within minutes of any update.

How does the agent distinguish a DDoS from a legitimate traffic spike?

Correlation across three data sources simultaneously: Cloudflare WAF (bot scores, rate-limit hits, challenge rates — a DDoS shows distinct WAF signature patterns), Datadog APM (latency distribution — organic traffic spikes correlate with higher p99 but not the same request diversity drop that DDoS shows), and the game event calendar retrieved via RAG from the company wiki (a scheduled tournament explains a 10x traffic spike with no WAF anomaly; a surprise spike with WAF anomalies does not). Gaming platforms see both constantly — the agent learned the difference after ingesting the team's historical incident data.

Why Claude Sonnet for analysis instead of GPT-4 or Gemini?

We use Claude Sonnet for complex multi-source correlation (the full incident analysis) and Claude Haiku for fast triage (quick checks, health pings, simple monitor evaluations). Sonnet's reasoning quality on long, structured technical contexts — correlating 5 data sources plus retrieved RAG chunks simultaneously — was measurably better in our benchmarks against the alternative models. Haiku's sub-2-second latency keeps quick checks from adding friction to the pipeline. The two-model architecture lets us optimize for quality where it matters and speed where it doesn't.

What read-only permissions does the agent actually need?

Datadog: an API/App key pair scoped to metrics_read, logs_read, apm_read, monitors_read. Kubernetes: a ClusterRole with get/list/watch on pods, events, deployments, HPAs, nodes — no write verbs. AWS: an IAM role with cloudwatch:GetMetricData, ec2:Describe*, rds:Describe*, elasticloadbalancing:Describe* — no mutating actions. Cloudflare: a read-only API token scoped to Zone Analytics, Firewall Analytics, and Cache Analytics. Total blast radius if a credential leaks: zero changes to production.

What did the agent catch that humans consistently missed?

Three categories: slow patterns that unfold over days or weeks — a memory leak that only materialized during 8-hour gaming sessions, visible only when correlating Datadog memory metrics with K8s pod restart history across 3 weeks of data; cross-system configuration drift — a Terraform-defined max_connections default that conflicted with the runbook recommendation and only became a problem under peak load; and false-positive WAF blocks — a rule silently blocking legitimate WebSocket connections from certain regions, caught only by correlating WAF block logs with player support tickets from the wiki RAG. Humans notice the dramatic failures; the agent notices the quiet ones.