Commit f29de68

gHashTag and Perplexity Computer authored
docs(dr): add Path D — MCP chat recovery (refs openai#17, openai#143) (openai#39)
Anchor: phi^2 + phi^-2 = 3.

Adds the 4th disaster-recovery trigger path to docs/DISASTER_RECOVERY.md in anticipation of the railway_dr_snapshot / railway_dr_restore MCP tools landing in openai#17. The contract is frozen up-front so the doc and the implementation can ship in parallel.

What landed:

docs/DISASTER_RECOVERY.md
- 'Trigger paths' intro flips three -> four, ordered chat -> web -> CI -> shell ('increasing operator friction').
- New section 'D. MCP chat' explains the two tools, gives natural-language usage, lists five safety invariants enforced server-side (confirm == 'PHI', acc1 forbidden, TRIOS_REPO_PAT required, 600s hard timeout, R7 triplet seal on success), and a 'Why path D is path D' rationale clarifying that the ordering reflects agency, not speed.
- TL;DR adds a one-sentence chat alternative ('restore the fleet on acc3, I confirm PHI').

README.md
- DR list flips three -> four with the same MCP-chat option.

No code changes, no workflow changes. The tool names, signatures, and safety invariants are documented exactly as spec'd in the openai#17 plan; they will land in src/tools/railway_dr_*.rs as commits c1..c4.

Refs openai#17, refs trios#143.

Co-authored-by: Perplexity Computer <computer@perplexity.ai>
1 parent f98097e commit f29de68

2 files changed

Lines changed: 70 additions & 3 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ repo. **Three trigger paths**, all documented in
 1. **Web button (above)** — published from [`railway-template.json`](railway-template.json). Provisions all 6 control-plane services (1 MCP + 3 champion seeds + dwagent + Neon backup-to-R2 sidecar).
 2. **GitHub Actions** — `Actions → DR Deploy from template → account_alias=accN, confirm=PHI`. Workflow: [`deploy-from-template.yml`](.github/workflows/deploy-from-template.yml).
 3. **CLI** — `tri-railway service deploy …` for each service in [`disaster-recovery/fleet-snapshot.json`](disaster-recovery/fleet-snapshot.json) (refreshed hourly by [`fleet-snapshot.yml`](.github/workflows/fleet-snapshot.yml)).
+4. **MCP chat** — say “restore the fleet on acc3, I confirm PHI” to any client connected to the `trios-railway-mcp` server. Tools: `railway_dr_snapshot`, `railway_dr_restore` (issue [#17](https://github.com/gHashTag/trios-railway/issues/17)).

 Fleet shape, audit ledger, and champion BPB rows survive any single-account ban
 — see the survives-table in [`docs/DISASTER_RECOVERY.md`](docs/DISASTER_RECOVERY.md).

docs/DISASTER_RECOVERY.md

Lines changed: 69 additions & 3 deletions
@@ -6,7 +6,7 @@ This runbook lets you bring the IGLA training fleet back online **in
 under 15 minutes** after a Railway-account ban, payment lapse, or
 catastrophic project deletion.

-## TL;DR — three commands
+## TL;DR — three commands (or one chat sentence)

 ```bash
 # 1. Provision new Railway account, generate a fresh Personal API token.
@@ -18,6 +18,15 @@ gh secret set RAILWAY_TOKEN_ACC3 --repo gHashTag/trios-railway # paste token
 ./target/release/tri-railway audit migrate-sql | psql "$NEON_DATABASE_URL"
 ```

+Or if `trios-railway-mcp` is reachable from your chat:
+
+```
+“restore the fleet on acc3, I confirm PHI”
+```
+
+The agent calls `railway_dr_restore` (path D below) and reports back
+when all 6 services are live.
+
 You are back online.

 ## What survives, what does not
@@ -37,8 +46,9 @@ You are back online.

 ## Trigger paths

-You have **three** ways to recover. Use whichever is fastest given the
-state of your access.
+You have **four** ways to recover. Use whichever is fastest given the
+state of your access (chat → web → CI → shell, in increasing operator
+friction).

 ### A. Railway template button (web UI · ~5 min)

@@ -84,6 +94,62 @@ The workflow reads `railway-template.json`, calls Railway GraphQL to
 create one project + N services, and writes the new IDs back to
 `disaster-recovery/last-restore.json`.

+### D. MCP chat (one sentence to any agent, ~5 min including build)
+
+The `trios-railway-mcp` server (deployed at `trios-mcp-public`) exposes
+two disaster-recovery tools that drive paths A–C above without you
+having to leave the chat or open the GitHub UI:
+
+| MCP tool | Effect |
+|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
+| `railway_dr_snapshot` | Triggers `fleet-snapshot.yml`, polls until completion, returns the diff of `disaster-recovery/fleet-snapshot.json` between two SHAs. |
+| `railway_dr_restore` | Triggers `deploy-from-template.yml` with the chosen `target_account` and `confirm: "PHI"`. Streams workflow logs back through MCP. |
+
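Editor's sketch (not part of the commit): the "triggers a workflow, polls until completion" behaviour in the table, combined with the 600-second limit listed under the safety invariants, could look roughly like the loop below. Everything here is an assumption for illustration: `RunStatus`, `PollError`, `poll_until_done`, and the injected `fetch_status` closure are hypothetical names, and the real tools in `src/tools/railway_dr_*.rs` may be structured differently.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical status of a workflow run; a real tool would read this from GitHub.
#[derive(Debug, PartialEq)]
enum RunStatus {
    InProgress,
    Completed { success: bool },
}

/// Hypothetical error type; the runbook only names `ToolError::Timeout`.
#[derive(Debug, PartialEq)]
enum PollError {
    /// The deadline passed; the run id is kept so the operator can keep watching.
    Timeout { run_id: u64 },
}

/// Calls `fetch_status` until the run completes or `deadline` has elapsed.
/// Returns the run's success flag, or `Timeout` carrying the live run id.
fn poll_until_done<F>(
    run_id: u64,
    deadline: Duration,
    interval: Duration,
    mut fetch_status: F,
) -> Result<bool, PollError>
where
    F: FnMut(u64) -> RunStatus,
{
    let started = Instant::now();
    loop {
        if let RunStatus::Completed { success } = fetch_status(run_id) {
            return Ok(success);
        }
        if started.elapsed() >= deadline {
            return Err(PollError::Timeout { run_id });
        }
        sleep(interval);
    }
}

fn main() {
    // Fake status source for the demo: reports completion on the third poll.
    let mut calls = 0;
    let result = poll_until_done(
        42,
        Duration::from_secs(600), // the 600-second limit from the runbook
        Duration::from_millis(1),
        |_| {
            calls += 1;
            if calls >= 3 {
                RunStatus::Completed { success: true }
            } else {
                RunStatus::InProgress
            }
        },
    );
    println!("{:?} after {} polls", result, calls);
}
```

Injecting the status source as a closure keeps the loop testable without network access; a production version would presumably call the GitHub Actions API at that point.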
+In natural-language form (any MCP-aware client, e.g. the
+`trios-perplexity` endpoint):
+
+```
+operator: "take a snapshot of the fleet"
+agent:    → calls railway_dr_snapshot
+          ← returns { services: 29, drift: [...], run_id, commit_sha, html_url }
+
+operator: "restore the fleet on acc3, I confirm PHI"
+agent:    → calls railway_dr_restore { target_account: "acc3", confirm: "PHI" }
+          ← returns { deployed_services: […], template_url, run_id, html_url }
+```
+
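Editor's sketch (not part of the commit): the diff does not specify what the `drift` field in the snapshot response contains. One plausible reading is a list of services that were added, removed, or changed between two versions of `fleet-snapshot.json`. The code below illustrates that idea only. The `Drift` enum, the `snapshot_drift` and `snap` helpers, the reduction of a snapshot to a `service name -> fingerprint` map, and the service names in `main` are all hypothetical.

```rust
use std::collections::BTreeMap;

/// One drift entry between two fleet snapshots (hypothetical shape).
#[derive(Debug, PartialEq)]
enum Drift {
    Added(String),
    Removed(String),
    Changed(String),
}

/// Compares two snapshots, each reduced to `service name -> fingerprint`.
/// Removed and changed services come first (sorted by name), then added ones.
fn snapshot_drift(
    old: &BTreeMap<String, String>,
    new: &BTreeMap<String, String>,
) -> Vec<Drift> {
    let mut drift = Vec::new();
    for (name, old_fp) in old {
        match new.get(name) {
            None => drift.push(Drift::Removed(name.clone())),
            Some(new_fp) if new_fp != old_fp => drift.push(Drift::Changed(name.clone())),
            Some(_) => {} // unchanged service, no drift entry
        }
    }
    for name in new.keys() {
        if !old.contains_key(name) {
            drift.push(Drift::Added(name.clone()));
        }
    }
    drift
}

/// Small helper that builds a snapshot map from string pairs.
fn snap(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    // Illustrative service names and fingerprints only.
    let old = snap(&[("mcp", "v1"), ("dwagent", "v1")]);
    let new = snap(&[("mcp", "v2"), ("seed-1", "v1")]);
    for entry in snapshot_drift(&old, &new) {
        println!("{:?}", entry);
    }
}
```

Using `BTreeMap` keeps the output order deterministic, which matters if the drift list is shown to an operator or stored in an audit trail.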
+#### Safety invariants enforced server-side
+
+1. **`confirm` must equal exactly `"PHI"`** — any other string returns
+   `ToolError::SafetyGate` immediately, no fallback.
+2. **`target_account: "acc1"` is rejected.** DR may target `acc2` or
+   `acc3` only — prevents accidentally redeploying over the live IGLA
+   project. The error message tells you to use the dedicated
+   `railway_service_*` tools if a single-service redeploy on `acc1` is
+   what you actually wanted.
+3. **`TRIOS_REPO_PAT` must be set** in the MCP server's environment.
+   When missing, both tools fail fast with a one-line error pointing at
+   <https://github.com/settings/personal-access-tokens> and the
+   required scope (`actions:write` on `gHashTag/trios-railway`).
+4. **600-second hard timeout** on workflow polling. If a cold cargo
+   build pushes past 10 minutes, the tool returns `ToolError::Timeout`
+   with the live `run_id` so you can keep watching it on the GitHub
+   Actions UI without re-running.
+5. **Every successful tool call seals an R7 triplet** to
+   `.trinity/experience/<YYYYMMDD>.trinity` via the existing experience
+   writer, identical to the `railway_service_deploy` audit trail.
+
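Editor's sketch (not part of the commit): invariants 1 to 3 amount to a pre-flight check that runs before any workflow is triggered. A minimal version, under stated assumptions, is shown below. Only the names `ToolError::SafetyGate`, `TRIOS_REPO_PAT`, `acc1`, `acc2`, and `acc3` come from the runbook. The `MissingPat` variant, the string payloads, the `check_restore_request` function, the order of the checks, and the handling of unknown account names are guesses made for illustration.

```rust
/// `SafetyGate` is named in the runbook; `MissingPat` and both payloads are assumptions.
#[derive(Debug, PartialEq)]
enum ToolError {
    SafetyGate(String),
    MissingPat(String),
}

/// Hypothetical pre-flight check for `railway_dr_restore` arguments.
fn check_restore_request(
    target_account: &str,
    confirm: &str,
    repo_pat: Option<&str>,
) -> Result<(), ToolError> {
    // Invariant 1: exact match only, no trimming and no case folding.
    if confirm != "PHI" {
        return Err(ToolError::SafetyGate(
            "confirm must equal exactly \"PHI\"".to_string(),
        ));
    }

    // Invariant 2: DR may target acc2 or acc3, never the live acc1 project.
    match target_account {
        "acc2" | "acc3" => {}
        "acc1" => {
            return Err(ToolError::SafetyGate(
                "acc1 is the live project; use the railway_service_* tools instead".to_string(),
            ))
        }
        other => {
            // Rejecting unknown aliases is an assumption, not stated in the runbook.
            return Err(ToolError::SafetyGate(format!(
                "unknown target_account: {}",
                other
            )))
        }
    }

    // Invariant 3: the token used to trigger workflows must be present.
    match repo_pat {
        Some(pat) if !pat.is_empty() => Ok(()),
        _ => Err(ToolError::MissingPat(
            "TRIOS_REPO_PAT is not set (needs actions:write on gHashTag/trios-railway)"
                .to_string(),
        )),
    }
}

fn main() {
    let pat = std::env::var("TRIOS_REPO_PAT").ok();
    println!("{:?}", check_restore_request("acc3", "PHI", pat.as_deref()));
    println!("{:?}", check_restore_request("acc1", "PHI", Some("token")));
    println!("{:?}", check_restore_request("acc3", "phi", Some("token")));
}
```

Keeping the check as a pure function of its three inputs means it can be unit-tested without touching GitHub or Railway, and it makes the "no fallback" behaviour of invariant 1 easy to verify.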
+#### Why path D is path D, not path A
+
+Chat is the lowest-friction entry point but also the easiest place to
+fat-finger a destructive command. The safety invariants above
+(especially `confirm == "PHI"` and the `acc1` block) make path D safe
+enough for production; the explicit ordering A→B→C→D in this runbook
+reflects "how much agency you give to the agent", not "speed". For an
+unattended rebuild after a 3 AM ban, you would script path B or C; for
+a quick recovery while you are already chatting with the agent, path D
+is identical in outcome and faster in wall-clock time.
+
 ## Required secrets for full recovery

 Stored in `gHashTag/trios-railway` Actions Secrets (`gh secret list`):
