Incident Response Playbook for Outages Impacting CDN, DNS and Cloud Providers
A concise incident runbook for SREs to triage, mitigate, and communicate during wide-scale CDN/DNS/cloud outages like the January 2026 incidents.
Immediate triage when the internet breaks: a concise runbook for SREs
When a CDN, DNS, or cloud provider outage wipes out traffic and your on-call rotation is scrambling, you don’t have time for long diagnostics. You need a short, battle-tested runbook that restores your key services, keeps stakeholders informed, and preserves evidence for a clear postmortem.
This playbook is targeted at SREs and DevOps teams in 2026. It synthesizes real-world lessons from recent multi-provider incidents (including the Jan 2026 X/Cloudflare/AWS disruption), plus current trends — RPKI adoption, wider DNS-over-HTTPS usage, edge eBPF observability, and multi-provider FaaS strategies. Follow it verbatim during an outage and adapt it into your automation and incident drills.
TL;DR — What to do in the first 60 minutes
- 0–5 minutes: Confirm scope, declare incident, establish incident command, open an incident channel, notify stakeholders.
- 5–15 minutes: Run quick network and DNS checks (dig, traceroute, mtr), check provider status pages, verify BGP and RPKI signals, and test alternate resolvers.
- 15–60 minutes: Apply mitigations: failover DNS, bypass CDN to origin, switch to backup providers, enable static fallbacks, and throttle features via feature flags.
- Ongoing: Keep cadence updates every 15–30 minutes. Capture logs, traces, and reproduce exact failing requests for postmortem.
Incident command — roles and first actions
Use a lightweight incident command structure (adapted from ICS). Assign clear roles in the first 5 minutes:
- Incident Commander (IC) — owns the incident, decides priority, and signs off communications.
- Communications Lead — writes status updates, liaises with product and legal, posts to status pages.
- Tech Lead / Triage Engineer — runs diagnostics, coordinates mitigations.
- Edge/CDN SME — handles CDN toggles, cache policies, edge worker logic.
- DNS/Network SME — runs DNS failovers, BGP checks, and DNS provider changes (e.g., Route53).
Open an incident channel — template
Create a dedicated Slack/Teams channel and pin a simple incident header:
/incident: CDN/DNS/Cloud outage — IC: @alice — Channel: #incident-cdn-2026-01-16 — Priority: P1
First message example (post to channel and to rotation): “Declared P1 outage — IC @alice. Scope: partial global HTTP failures; suspected provider-level CDN/DNS degradation reported since 10:30 ET. Triage in progress.”
0–15 minutes: fast diagnostics (what to run now)
Start with reproducible checks and collect authoritative evidence. Run these in parallel — use at least two engineers.
Check provider status and public telemetry
- Provider status pages (Cloudflare/AWS/etc) and official Twitter/X posts.
- DownDetector, ThousandEyes, Grafana Cloud synthetic tests (if subscribed), and public BGP monitors (e.g., RIPE, BGPStream).
Network and DNS checks — commands
# DNS resolution (query specific record types; ANY often returns minimal answers per RFC 8482)
dig @8.8.8.8 example.com A +noall +answer
# Trace path to CDN / origin
mtr -r -c 30 example.com
# Check HTTP directly against a known edge or origin IP (198.51.100.10 is a placeholder)
curl -v --resolve example.com:443:198.51.100.10 https://example.com/health
# BGP prefix visibility (use bgp.tools, RIPEstat, or your BGP monitoring provider's API)
# From an engineer machine, check alternate resolver
dig @1.1.1.1 example.com A +short
What to look for: NXDOMAIN or SERVFAIL across resolvers (DNS problem), AS path churn or withdrawn prefixes (BGP problem), consistent 5xx from edge nodes but 200 from origin (CDN problem), or API gateway timeouts pointing to upstream cloud provider issues.
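To script that scope check, here is a minimal sketch (assumes bash with dig and curl on the path; example.com, the resolver list, and the 198.51.100.10 origin IP are placeholders to replace with your own):
# Compare answers across public resolvers (SERVFAIL/NXDOMAIN everywhere suggests a DNS problem)
for r in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "resolver $r: $(dig @$r example.com A +short +time=2 +tries=1 | tr '\n' ' ')"
done
# Compare edge vs origin (5xx via the CDN but 200 direct suggests a CDN/edge problem)
curl -s -o /dev/null -w "via CDN: %{http_code}\n" https://example.com/health
curl -s -o /dev/null -w "direct origin: %{http_code}\n" --resolve example.com:443:198.51.100.10 https://example.com/health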
15–60 minutes: mitigation playbook (ordered, auditable actions)
Do not change many systems at once. Prioritize reversible, low-risk moves first and document each change with timestamp and author.
1. If DNS resolution is failing
- Switch resolvers in diagnostics to confirm global vs localized issues.
- If your primary DNS provider is down, activate secondary authoritative DNS (pre-provisioned). Keep TTLs low for critical records (60–300s) in normal ops to enable fast switchover.
- Example: AWS Route53 failover. Use the CLI to set a standby record set (ensure you have an IaC template ready):
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://failover.json
# failover.json contains the alternate A/CNAME pointing to backup IPs or a static landing page.
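A minimal failover.json sketch (the record name, TTL, and 203.0.113.20 backup IP are placeholders; UPSERT overwrites the live record, so keep the original value handy for rollback):
cat > failover.json <<'EOF'
{
  "Comment": "Emergency failover to backup origin / static landing page",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "example.com.",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "203.0.113.20" }]
    }
  }]
}
EOF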
2. If CDN edges are returning errors
- Confirm whether the CDN control plane is reachable (can you toggle dashboard values?).
- Fallback to origin: update DNS or bypass the CDN using a temporary host record that points directly to origin or LB IPs.
- Reduce complexity: disable edge workers or custom rules that might be failing. Switch to cached content or serve a static maintenance page.
# Example: temporary bypass - update CNAME to origin-lb.example.net
# Use DNS change batch or your DNS provider CLI/console
3. If cloud provider APIs / control plane is impacted
- Check whether the problem is control-plane only or also data-plane (compute/storage). If control-plane calls are slow but VMs/containers still serve traffic, avoid redeploys.
- Escalate with provider support and use public status APIs to track incident keys and correlation IDs (see the sketch after this list).
- Prepare to run critical services from alternate regions or providers if resilient cloud-native architectures or multi-cloud failover is in place.
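A sketch of that status-API check (Cloudflare's status page is hosted on Atlassian Statuspage and exposes its standard JSON endpoints; the AWS Health API requires a Business or Enterprise support plan. Verify both against your own providers before relying on them):
# Unresolved incidents from a Statuspage-hosted provider status site
curl -s https://www.cloudflarestatus.com/api/v2/incidents/unresolved.json | jq -r '.incidents[].name'
# Account-scoped AWS Health events (the Health API endpoint lives in us-east-1)
aws health describe-events --region us-east-1 --filter eventStatusCodes=open \
  --query 'events[].{service:service,region:region,status:statusCode}' --output table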
4. Edge function / FaaS fallbacks
For short-lived functions and edge compute, route traffic to a fallback static host or prewarmed provider. Use feature flags to disable non-essential edge compute to reduce cold-start chaos.
// Example (inside an edge worker's fetch handler): forward or mint a trace id so failures can be correlated
const headers = new Headers(request.headers);
headers.set('x-trace-id', request.headers.get('x-trace-id') || crypto.randomUUID());
return fetch(new Request(request, { headers }));
Observability and debugging during ephemeral, wide-impact outages
Observability is the most valuable artifact for post-incident analysis. Capture traces, logs, and metrics even if noisy. In 2026, teams commonly use OpenTelemetry, eBPF-based packet captures at the edge, and aggregated RUM to correlate client symptoms with edge events.
Quick observability checklist
- Enable debug-level logs on the minimal set of services required to reproduce the error.
- Capture distributed traces: ensure the edge adds trace headers that are propagated to the backend. If they are missing, instrument a short-lived middleware to inject and log a correlation id.
- Take packet captures on an edge or gateway host using eBPF tools (if accessible) to catch TCP resets or TLS handshake failures — this has become standard for diagnosing edge packet issues.
- Export synthetic test results and RUM sessions around the outage window; they help show client impact.
Commands and snippets
# kubectl logs for recent pods
kubectl logs -n production --since=1h -l app=frontend --tail=200
# Query traces from your tracing backend (pseudo-OTLP query)
curl -s "https://tracing.example/api/v1/traces?service=edge&start=2026-01-16T10:20:00Z&end=2026-01-16T10:40:00Z" | jq .
# eBPF example (Linux host, requires bpftrace): count outbound TCP connection attempts per process
sudo bpftrace -e 'kprobe:tcp_v4_connect { @connects[comm] = count(); }'
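If you suspect connection resets rather than failed connects, a complementary sketch (assumes a kernel new enough to expose the tcp:* tracepoints; the first command lists what is available):
# List available tcp tracepoints, then count resets sent/received host-wide
sudo bpftrace -l 'tracepoint:tcp:*'
sudo bpftrace -e 'tracepoint:tcp:tcp_send_reset { @sent = count(); } tracepoint:tcp:tcp_receive_reset { @received = count(); }'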
Communication: internal and external
Clear, honest, and frequent communication reduces organizational friction. Keep messages short, timestamped, and status-focused. The cadence depends on severity but aim for 15–30 minute updates while the incident is active.
External status page templates
Initial public status page post (concise):
We are investigating reports of increased errors and degraded performance affecting website access and API responses since 10:30 ET (Jan 16, 2026). Engineers have identified an issue affecting CDN/DNS infrastructure and are working with providers. We will post an update at 10:45 ET.
Ongoing updates (15–30m cadence)
- What changed since last update
- Systems impacted
- Mitigation actions taken
- Next update time
Internal update example
10:42 ET - Update 1
IC: @alice
Scope: ~40% API errors for US customers; static assets failing for some CDNs
Action: Bypassed CDN -> origin for /api and enabled static maintenance page for non-critical assets
Next update: 11:00 ET
Postmortem: evidence, RCA, and preventive actions
Start the postmortem within 48–72 hours. Your deliverable should be actionable; avoid vague conclusions. Use a blameless format and include timeline, evidence, and prevention tasks mapped to owners and due dates.
Essential postmortem sections
- Summary of outage and customer impact (quantified: requests failed, revenue impact if known)
- Timeline of detection, mitigation, and resolution with timestamps and commands executed
- Root cause(s) and contributing factors
- Corrective actions and owners with deadlines
- Follow-up experiments: runbooks to codify, tests to add
Metrics to include: MTTD (mean time to detect), MTTR, % of traffic affected, errors/sec before/after, and any SLO burn measured in the incident window.
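A rough way to express that SLO burn (illustrative numbers only: a 99.9% monthly availability SLO and a 40-minute window with 40% of requests failing):
awk 'BEGIN {
  budget = 30*24*60*(1-0.999)   # ~43.2 minutes of error budget per 30-day month
  burned = 40*0.40              # 40-min window at 40% failed requests ~= 16 full-outage minutes
  printf "budget=%.1f min, burned=%.1f min (%.0f%% of budget)\n", budget, burned, 100*burned/budget
}'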
Example corrective actions (from the Jan 2026 incidents)
- Preconfigure secondary authoritative DNS with automated failover and CI-validated record sets (a validation sketch follows this list).
- Lower critical DNS TTLs to 60–300s for high-risk services (balanced with cache efficiency).
- Adopt multi-CDN routing with active health checks to reduce single-provider exposure.
- Instrument edge workers to log trace IDs and HTTP status breakdowns for 5xx spikes.
- Run yearly BGP/RPKI validation drills and maintain BGP prefix monitoring subscriptions.
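A minimal CI check for that record-set validation (a sketch assuming BIND-format zone files and the named-checkzone tool from bind9utils; file paths and record names are placeholders):
# Fail the CI job if the failover zone file does not parse
named-checkzone example.com zones/example.com.failover.db || exit 1
# Spot-check that critical names survived the edit (adjust the pattern to your zone file's naming)
grep -Eq '^api(\.example\.com\.)?[[:space:]]' zones/example.com.failover.db || { echo "missing api record"; exit 1; }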
Testing and drills — make this repeatable
Outages will happen — plan and rehearse. In 2026, mature teams run DNS failover drills, CDN-bypass fallback tests, and multi-cloud failover simulations as part of their SRE calendar.
Checklist for an annual drill
- Kick off a simulated CDN outage: apply a blackhole rule in a staging CDN and send traffic through it to confirm failover executes.
- Test a DNS provider switch: confirm the change propagates within the expected TTL and that rollback is reliable (a propagation-watch sketch follows this checklist).
- Verify edge function fallbacks: ensure static landing pages are served from secondary storage (S3, GitHub Pages, etc.).
- Review incident communications: time to first status page, stakeholder notification, and cadence.
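A propagation-watch sketch for the DNS switch step (assumes watch, dig, and tr are available; example.com is a placeholder):
# Re-query the authoritative servers and two public resolvers every 10s until they agree on the new record
watch -n 10 'for r in $(dig +short NS example.com) 8.8.8.8 1.1.1.1; do
  echo "$r -> $(dig @$r example.com A +short | tr "\n" " ")"
done'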
Automation and IaC: remove toil before the outage
Manual changes in high-pressure incidents increase risk. Codify playbook steps in advance: IaC templates for DNS failover, runbook automation for CDN toggles, and pre-approved IAM roles for emergency changes.
# Example GitOps Route53 failover (CloudFormation-style sketch). Pair with a matching
# SECONDARY record set that points at the backup origin; OriginHealthCheck is an
# AWS::Route53::HealthCheck defined elsewhere in the template.
Resources:
  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: example.com.
      Type: A
      TTL: "60"
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref OriginHealthCheck
      ResourceRecords:
        - 203.0.113.10
Advanced strategies and 2026 trends to reduce future impact
- Multi-provider edge routing: Use steering DNS + health checks to spread risk across CDNs.
- Edge observability: eBPF plus OpenTelemetry at the network edge for packet-level and trace-level correlation; affordable edge bundles make this more accessible.
- RPKI monitoring: Validate BGP origin authorization to detect prefix hijacks quickly (a quick check is sketched after this list).
- DoH/DoT awareness: Adopt monitoring that covers DoH/DoT resolver behavior — visibility changed as browsers defaulted to encrypted DNS (also covered in the sketch below).
- Prewarmed provisioned concurrency for critical FaaS: Avoid cold-start tail latency for essential endpoints; evaluate your provider's free-tier and edge options when planning multi-provider FaaS strategies.
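Quick checks for the RPKI and DoH items above, as a sketch assuming RIPEstat's rpki-validation endpoint and Cloudflare's DNS-over-HTTPS JSON API (AS64500 and 203.0.113.0/24 are documentation placeholders; substitute your own ASN and prefix):
# RPKI: is the announced origin covered by a valid ROA?
curl -s 'https://stat.ripe.net/data/rpki-validation/data.json?resource=AS64500&prefix=203.0.113.0/24' | jq '.data.status'
# DoH: confirm encrypted-DNS resolvers return the expected answer
curl -s -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=example.com&type=A' | jq '.Answer'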
A final concise runbook checklist (printable)
- Declare incident, assign IC and channel (0–5m)
- Collect evidence: status pages, dig, curl, mtr (0–10m)
- Confirm scope and impact; notify leadership and comms (0–15m)
- Apply reversible mitigations (DNS failover, bypass CDN, enable static fallback) (15–60m)
- Maintain 15–30m status cadence, escalate to providers (ongoing)
- Record all commands and actions; capture logs/traces for postmortem (ongoing)
- Postmortem within 72 hours; schedule follow-ups and drills (post-incident)
Always keep one engineer focused on communication and one focused on evidence collection — combining both jobs in one person means important traces get missed.
Closing: practical takeaways
- Prepare for provider-level outages with pre-provisioned DNS/backup records and IaC-driven failovers.
- Instrument edge and origin with trace propagation and eBPF telemetry to shorten RCA time.
- Automate reversible mitigations and keep an up-to-date incident playbook that maps actions to owners.
- Run failover drills and update the playbook after every incident — learning is the primary remediation.
Call to action
Use this runbook as your baseline. Convert the steps into automated scripts and add them to your on-call runbook repo. If you want, copy the examples above into your incident runbook, run one DNS failover drill this quarter, and schedule a postmortem template review within 48 hours of any outage. Stay ready — the next major provider disruption will test how repeatable your recovery is.
Want a downloadable incident runbook checklist and status templates? Export this article into your runbook repo today and schedule your first simulated CDN/DNS outage drill within 30 days.