Outage Forensics Playbook

A minimum viable lab for reproducing carrier-scale network outages. Three devices + a packet-crafting endpoint — the template carrier engineering teams start from when running a post-mortem.

The Prompt

Copy this into NetPilot:

Build a 3-node outage reproduction lab: Cisco IOL router, Juniper cRPD router, and Arista cEOS switch, peering via eBGP (Cisco AS 65001, Juniper AS 65002, Arista AS 65003) with BGP EVPN address family enabled for cross-vendor Type-2 route exchange. Place a route reflector role on Juniper. Add a Linux control node connected to all three devices with tc, netem, and Scapy pre-installed for link impairment and malformed packet generation. Advertise 10.100.0.0/24 from Cisco and Arista. I want to reproduce a specific production outage trigger — the lab should be a minimum viable topology that can exhibit cross-vendor EVPN bugs, BGP session flaps, and BUM replication issues.

What You'll Build

1 Cisco IOL router (production-equivalent role)
1 Juniper cRPD router (with route-reflector configuration)
1 Arista cEOS switch
1 Linux control node (tc, netem, Scapy, hping3, iperf3)
Full eBGP underlay + BGP EVPN overlay
Baseline verification commands

Concepts Demonstrated

Minimum viable topology for carrier outage reproduction
Multi-vendor BGP EVPN with a route reflector
Scripted trigger injection via Linux control node
Post-mortem playbook Phase 2 (minimum viable topology) + Phase 3 (prompt)

Vendors Used

Cisco IOL
Juniper cRPD
Arista cEOS
Linux control node

Difficulty

⭐⭐⭐ Advanced — assumes familiarity with BGP EVPN and outage post-mortem workflow

Variations to Try

"Swap Arista for Nokia SR Linux to test SP-specific interop"
"Add a second Juniper route reflector to test iBGP mesh scale"
"Pin vendor versions to match production firmware exactly (via enterprise BYOI)"
"Add firewall (Palo Alto or Fortinet) between two leaves to test policy-driven outage scenarios"
"Add sustained link impairment on one spine-leaf link to test convergence under stress"

Why This Matters

Carrier outages take weeks to post-mortem when reproduction requires physical hardware. This template collapses that to hours: stand up a minimum viable topology, match production config, script the trigger, iterate on fixes. Each repro lab is a saved artifact — when the same bug class reappears, the lab is already waiting.

Use cases:

Incident post-mortem — reproduce and validate before closing the ticket
Vendor TAC engagement — ship vendors a prompt, not a packet capture
Pre-deployment validation — test a candidate config change against historical bug repros
Team training — new engineers debug historical incidents as ramp-up

Related Resources

Try It

Open in NetPilot →

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Outage Forensics Playbook

The Prompt

What You'll Build

Concepts Demonstrated

Vendors Used

Difficulty

Variations to Try

Why This Matters

Related Resources

Try It

Uh oh!

FilesExpand file tree

outage-forensics-playbook.md

Latest commit

History

outage-forensics-playbook.md

File metadata and controls

Outage Forensics Playbook

The Prompt

What You'll Build

Concepts Demonstrated

Vendors Used

Difficulty

Variations to Try

Why This Matters

Related Resources

Try It