bluegreen deploy launches new fleet (and never destroys blue) when fly.toml lacks an explicit [processes] block #4886


---

### Summary

When `fly.toml` declares process groups *implicitly* (i.e. via
`http_service.processes` only, with no top-level `[processes]` block),
`fly deploy --strategy bluegreen` falls into the launch-path code
(`internal/command/deploy/plan.go` → `updateProcessGroup`) on every deploy,
launches a fresh fleet, and **never destroys the existing machines** after
the green fleet passes health checks.

Result: every deploy adds a full new fleet on top of the existing one, so
the first affected deploy doubles the fleet size. Fly's public LB rotates
traffic across all started machines regardless of `fly_release_version`,
so production serves a roughly 50/50 mix of old and new SHAs until the
operator (or a pre-deploy reap step) destroys the orphans; in our case that
window was several minutes per deploy.

This bites silently because `fly deploy` exits 0, the green machines do
pass health checks, and the LB does keep responding 200 — the reconciler
just never gets to the cordon-blue + destroy-blue phase.

---

### Repro

`fly.toml` (notice the absence of `[processes]`):

```toml
app = "repro-app"
primary_region = "sea"

[build]

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "off"
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

[deploy]
  strategy = "bluegreen"
```

Run `fly deploy` twice in a row. On the second run, observe:

```
==> Verifying app config
--> Verified app config
==> Building image
...
Process groups have changed. This will:
 * create 2 "app" machines

> Launching new machine
No machines in group app, launching a new machine
...
```

…even though `fly machines list` shows the two existing machines have
`metadata.fly_process_group = "app"`. After the deploy completes:

```sh
$ fly machines list --app repro-app --quiet | wc -l
4    # (2 existing + 2 new; orphans not destroyed)
```

The 4-machine fleet persists indefinitely. Subsequent deploys add 2 more
each time.
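A quick way to spot the leftover blue machines is to count distinct release versions across the fleet; after a clean bluegreen deploy there should be exactly one. The sketch below is a hypothetical helper (not part of flyctl) that avoids a `jq` dependency by grepping the JSON. It assumes `fly machines list --json` exposes each machine's metadata with `fly_release_version` as a quoted string.

```sh
#!/bin/sh
# Hypothetical helper (not a flyctl command): count distinct
# fly_release_version values in the JSON printed by `fly machines list --json`.
# Assumes the metadata value is a quoted string, e.g. "fly_release_version": "12".
count_release_versions() {
  grep -o '"fly_release_version": *"[^"]*"' |
    sed 's/.*: *"\([^"]*\)"/\1/' |  # keep only the version value
    sort -u |                       # one line per distinct version
    wc -l |
    tr -d ' '                       # strip the padding some wc builds add
}

# Usage (requires an authenticated flyctl):
#   fly machines list --app repro-app --json | count_release_versions
# Anything greater than 1 means machines from an older release are still up.
```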

### Workaround (and the suggested fix path)

Adding the block explicitly resolves the issue:

```toml
[processes]
  app = ""
```

The empty command preserves the implicit default behavior (the image's own
`ENTRYPOINT`/`CMD` still runs). With this block present, the same `fly deploy`
goes through the update-path code, cordons the existing machines, and
destroys them after the green fleet is healthy:

```
==> Verifying app config
--> Verified app config
==> Building image
...
==> Updating existing machines in 'repro-app' with bluegreen strategy
...
==> Cordoning blue machines
==> Destroying blue machines
...
```

### Expected vs actual

**Expected:** Fly's docs ([Run multiple process groups][docs1]) say:

> If you don't define any processes, the Machines in a Fly App belong to
> the default `app` process group.

So an explicit `[processes]` block containing `app = ""` should be a no-op:
the resulting `fly_process_group` metadata is `"app"` either way.

**Actual:** flyctl's bluegreen reconciler distinguishes the two cases.
With the implicit form, `updateProcessGroup` decides "no machines in
group app" and falls into the launch-path; with the explicit form, it
correctly identifies the existing machines as the previous release and
follows the bluegreen update-path.

### Proposed fix (one of)

1. **In the reconciler:** when comparing `fly.toml` process groups against
   the live fleet's `metadata.fly_process_group` for bluegreen planning,
   treat the implicit `app` group as equivalent to an explicit
   `[processes]` block containing `app = ""`. The mapping
   `http_service.processes → fly_process_group` is already established at
   machine-create time; the comparison just needs to honor that mapping
   symmetrically.

2. **In `fly config validate`:** if `[processes]` is missing AND
   `[deploy].strategy = "bluegreen"`, warn (or auto-inject the equivalent
   explicit `[processes]` block with `app = ""` into the materialized
   config). This is the cheaper fix: it doesn't change reconciler logic, it
   just nudges configs into the shape that already works.

3. **In docs:** at minimum, [Run multiple process groups][docs1] and
   [App configuration → strategy = "bluegreen"][docs2] should mention
   that bluegreen requires an explicit `[processes]` block. Right now
   the docs imply the two forms are equivalent.
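Until flyctl does something like option 2, a pre-deploy CI step can approximate it locally. This is a hypothetical helper, not a flyctl command, and it is a line-based grep rather than a TOML parser: it assumes `strategy` is written with a double-quoted value and that the `[processes]` header sits on its own line, as in the configs above.

```sh
#!/bin/sh
# Hypothetical pre-deploy guard approximating proposed fix 2: fail when a
# fly.toml uses the bluegreen strategy but has no top-level [processes] table.
# Line-based grep, not a TOML parser, so unusual layouts can fool it.
check_bluegreen_processes() {
  toml="${1:-fly.toml}"
  if grep -Eq '^[[:space:]]*strategy[[:space:]]*=[[:space:]]*"bluegreen"' "$toml" &&
     ! grep -Eq '^[[:space:]]*\[processes\][[:space:]]*(#.*)?$' "$toml"; then
    echo "$toml: bluegreen strategy without an explicit [processes] block" >&2
    return 1
  fi
  return 0
}

# Usage in CI, before `fly deploy`:
#   check_bluegreen_processes fly.toml || exit 1
```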

### Impact

This affects any user running bluegreen on an app whose `fly.toml`
declares processes only via `http_service.processes`. The default Fly
templates currently emit `fly.toml` files in this shape (no top-level
`[processes]`), so this is likely a common config.

The user-visible symptom is hard to attribute: `fly deploy` exits 0,
health checks pass, the new SHA is "deployed" — but production traffic
gets ~50% old code for minutes per deploy. Smoke tests that retry until
*any* response matches (a common idiom) silently mask it; only smoke
tests that require *every* sample to match catch it.

We caught it via an aggressive smoke test ([scripts/smoke-test-prod.sh][repo-smoke])
that takes 20 samples from `/api/health` and asserts that every one matches
the expected `git_sha`. Before the workaround, this smoke test was rolling back
deploys that had actually succeeded (false-positive rollbacks: the new SHA *was*
deployed correctly, it just wasn't the only release serving traffic).
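The strict "every sample must match" rule is the part that matters. A minimal sketch of that check follows; it is not the linked `smoke-test-prod.sh`, the function name is hypothetical, and it assumes the caller has already reduced each `/api/health` response to its `git_sha`, one per line.

```sh
#!/bin/sh
# Hypothetical strict smoke check: read one observed git_sha per line on stdin
# and succeed only if there is at least one sample and every sample equals the
# expected SHA. A retry-until-any-match loop passes on a 50/50 fleet; this
# fails as soon as a single old-SHA sample shows up.
all_samples_match() {
  expected="$1"
  total=0
  bad=0
  # The `|| [ -n "$sha" ]` keeps a final line that lacks a trailing newline.
  while IFS= read -r sha || [ -n "$sha" ]; do
    total=$((total + 1))
    if [ "$sha" != "$expected" ]; then
      bad=$((bad + 1))
      echo "sample $total: got $sha, want $expected" >&2
    fi
  done
  [ "$total" -gt 0 ] && [ "$bad" -eq 0 ]
}

# Usage, assuming curl and jq are installed, the app is served at
# repro-app.fly.dev, and /api/health returns JSON with a top-level git_sha:
#   for i in $(seq 20); do
#     curl -fsS https://repro-app.fly.dev/api/health | jq -r .git_sha
#   done | all_samples_match "$EXPECTED_SHA"
```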

### Environment

- flyctl: `v0.4.42` darwin/arm64 commit `cd20495611543c8e04b01448e819b62907046eac` (build date 2026-04-28)
- Apps v2 (Machines)
- 2 machines, primary region `sea`, bluegreen strategy

### Related

- #1917 — same root concept (implicit vs explicit process groups treated
  inconsistently), different symptom (logs report wrong group during
  destroy). That one's about cosmetic log attribution; this one is about
  the bluegreen reconciler taking the wrong code path entirely.
- #4528 — nil-pointer panic in `updateProcessGroup` (different failure
  mode, but adjacent code: `internal/command/deploy/plan.go:340`).

[docs1]: https://fly.io/docs/launch/processes/
[docs2]: https://fly.io/docs/reference/configuration/#the-deploy-section
[repo-smoke]: https://github.com/Autono-Labs/System2/blob/main/scripts/smoke-test-prod.sh
