bluegreen deploy launches new fleet (and never destroys blue) when fly.toml lacks an explicit [processes] block #4886

@stefanstoll

Summary

When fly.toml declares process groups implicitly (i.e. via
http_service.processes only, with no top-level [processes] block),
fly deploy --strategy bluegreen falls into the launch-path code
(updateProcessGroup in internal/command/deploy/plan.go) on every deploy,
launches a fresh fleet, and never destroys the existing machines after
the green fleet passes health checks.

Result: every deploy adds a complete new fleet alongside the old one
(2 machines become 4 on the first affected deploy). Fly's public LB rotates
traffic across all started machines regardless of fly_release_version,
so production serves a roughly 50/50 mix of old and new SHAs for several
minutes per deploy until the operator (or a pre-deploy reap step) destroys
the orphans.

This bites silently because fly deploy exits 0, the green machines do
pass health checks, and the LB does keep responding 200 — the reconciler
just never gets to the cordon-blue + destroy-blue phase.


Repro

fly.toml (notice the absence of [processes]):

app = "repro-app"
primary_region = "sea"

[build]

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "off"
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

[deploy]
  strategy = "bluegreen"

Run fly deploy twice in a row. On the second run, observe:

==> Verifying app config
--> Verified app config
==> Building image
...
Process groups have changed. This will:
 * create 2 "app" machines

> Launching new machine
No machines in group app, launching a new machine
...

…even though fly machines list shows the two existing machines have
metadata.fly_process_group = "app". After the deploy completes:

$ fly machines list --app repro-app | wc -l
4    # (2 existing + 2 new — orphans not destroyed)

The 4-machine fleet persists indefinitely. Subsequent deploys add 2 more
each time.

Workaround (and the suggested fix path)

Adding the block explicitly resolves the issue:

[processes]
  app = ""

The empty command string preserves the implicit default behavior (the
Dockerfile's CMD still applies). With this block present, the same fly deploy
goes through the update-path code, cordons the existing machines, and
destroys them after the green fleet is healthy:

==> Verifying app config
--> Verified app config
==> Building image
...
==> Updating existing machines in 'repro-app' with bluegreen strategy
...
==> Cordoning blue machines
==> Destroying blue machines
...
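
Until the underlying behavior changes, the affected config shape can be detected before deploying. Below is a minimal sketch in Go of such a guard. It is not part of flyctl: needsExplicitProcesses is a hypothetical helper, and it uses a naive line scan instead of a real TOML parser, so it only illustrates the condition (bluegreen under [deploy] with no top-level [processes] table). It also does not see a strategy passed on the command line via --strategy.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// needsExplicitProcesses (hypothetical helper) reports whether a fly.toml text
// selects strategy = "bluegreen" under [deploy] while lacking a top-level
// [processes] table. Naive scan: comments are stripped at the first '#',
// and multi-line values are not handled.
func needsExplicitProcesses(flyToml string) bool {
	section := ""
	hasProcesses, bluegreen := false, false
	sc := bufio.NewScanner(strings.NewReader(flyToml))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		// A table header such as [deploy] or [[services]].
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = strings.Trim(line, "[] ")
			if section == "processes" {
				hasProcesses = true
			}
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		if section == "deploy" && strings.TrimSpace(key) == "strategy" &&
			strings.Trim(strings.TrimSpace(val), `"'`) == "bluegreen" {
			bluegreen = true
		}
	}
	return bluegreen && !hasProcesses
}

func main() {
	repro := `app = "repro-app"

[http_service]
  internal_port = 8080
  processes = ["app"]

[deploy]
  strategy = "bluegreen"
`
	fmt.Println(needsExplicitProcesses(repro))                                   // true
	fmt.Println(needsExplicitProcesses(repro + "\n[processes]\n  app = \"\"\n")) // false
}
```

A CI step could run a check like this against fly.toml and fail the pipeline (or print a warning) before fly deploy is invoked.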

Expected vs actual

Expected: Fly's docs (Run multiple process groups) say:

If you don't define any processes, the Machines in a Fly App belong to
the default app process group.

So [processes]\n app = "" should be a no-op — the resulting
fly_process_group metadata is "app" either way.

Actual: flyctl's bluegreen reconciler distinguishes the two cases.
With the implicit form, updateProcessGroup decides "no machines in
group app" and falls into the launch-path; with the explicit form, it
correctly identifies the existing machines as the previous release and
follows the bluegreen update-path.

Proposed fix (one of)

  1. In the reconciler: when comparing fly.toml process groups against
    the live fleet's metadata.fly_process_group for bluegreen planning,
    treat the implicit app group as equivalent to an explicit
    [processes]\n app = "" block. The mapping
    http_service.processes → fly_process_group is already established
    at machine-create time; the comparison just needs to honor that
    mapping symmetrically.

  2. In fly config validate: if [processes] is missing AND
    [deploy].strategy = "bluegreen", warn (or auto-inject the equivalent
    explicit [processes] block into the materialized config). This is the
    cheaper fix: it doesn't change reconciler logic, it just nudges
    configs into the shape that already works.

  3. In docs: at minimum, Run multiple process groups and
    App configuration → strategy = "bluegreen" should mention
    that bluegreen requires an explicit [processes] block. Right now
    the docs imply the two forms are equivalent.
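
To make option 1 concrete, here is a rough sketch in Go of what "honoring the mapping symmetrically" could look like. Everything in it is an assumption for illustration: Config, Machine, effectiveGroups, and existingInGroup are hypothetical names, not flyctl's actual types or functions, and the real planner may be structured quite differently. The idea is only that the config is normalized to a set of process groups before it is compared with the live fleet's metadata.fly_process_group.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down view of fly.toml (not flyctl's type).
type Config struct {
	Processes            map[string]string // top-level [processes]; nil when absent
	HTTPServiceProcesses []string          // http_service.processes
}

// Machine is a hypothetical stand-in for a live machine.
type Machine struct {
	ID           string
	ProcessGroup string // value of metadata.fly_process_group
}

// effectiveGroups normalizes a config to the process groups it implies.
// With no explicit [processes] block, the groups named by
// http_service.processes are used; with neither, the default "app" group.
func effectiveGroups(c Config) map[string]string {
	groups := make(map[string]string)
	for name, cmd := range c.Processes {
		groups[name] = cmd
	}
	if len(groups) == 0 {
		for _, name := range c.HTTPServiceProcesses {
			groups[name] = "" // empty command: keep the image's default
		}
	}
	if len(groups) == 0 {
		groups["app"] = ""
	}
	return groups
}

// existingInGroup returns the live machines that belong to a group and would
// therefore be the "blue" side of a bluegreen update.
func existingInGroup(fleet []Machine, group string) []Machine {
	var out []Machine
	for _, m := range fleet {
		if m.ProcessGroup == group {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	fleet := []Machine{{"m1", "app"}, {"m2", "app"}}
	implicit := Config{HTTPServiceProcesses: []string{"app"}}
	explicit := Config{
		Processes:            map[string]string{"app": ""},
		HTTPServiceProcesses: []string{"app"},
	}
	for _, c := range []Config{implicit, explicit} {
		for group := range effectiveGroups(c) {
			fmt.Printf("group %q: %d existing machines to replace\n",
				group, len(existingInGroup(fleet, group)))
		}
	}
}
```

Under this normalization both config forms resolve to the same "app" group with two existing machines, so neither would be planned as a fresh launch.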

Impact

This affects any user running bluegreen on an app whose fly.toml
declares processes only via http_service.processes. The default Fly
templates currently emit fly.tomls in this shape (no top-level
[processes]), so it's likely a common config.

The user-visible symptom is hard to attribute: fly deploy exits 0,
health checks pass, the new SHA is "deployed" — but production traffic
gets ~50% old code for minutes per deploy. Smoke tests that retry until
any response matches (a common idiom) silently mask it; only smoke
tests that require every sample to match catch it.

We caught it via an aggressive smoke test (scripts/smoke-test-prod.sh)
that takes 20 samples against /api/health and asserts every one matches
the expected git_sha. Before adding the workaround we were rolling back
deploys unnecessarily: the new SHA had deployed correctly, it just was
not the only version serving traffic.
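
The difference between the two smoke-test idioms can be shown in a few lines. This is an illustrative Go sketch, not the contents of scripts/smoke-test-prod.sh: anyMatch and allMatch are hypothetical helpers, and the samples are simulated git_sha values, not real responses from /api/health.

```go
package main

import "fmt"

// anyMatch models the "retry until one response matches" idiom.
func anyMatch(samples []string, want string) bool {
	for _, s := range samples {
		if s == want {
			return true
		}
	}
	return false
}

// allMatch models the strict idiom: every sample must report the expected SHA.
// An empty sample set is treated as a failure rather than a vacuous pass.
func allMatch(samples []string, want string) bool {
	if len(samples) == 0 {
		return false
	}
	for _, s := range samples {
		if s != want {
			return false
		}
	}
	return true
}

func main() {
	// Simulate 20 responses from a fleet split evenly between two releases.
	var samples []string
	for i := 0; i < 20; i++ {
		if i%2 == 0 {
			samples = append(samples, "new-sha")
		} else {
			samples = append(samples, "old-sha")
		}
	}
	fmt.Println("any-match passes:", anyMatch(samples, "new-sha")) // true
	fmt.Println("all-match passes:", allMatch(samples, "new-sha")) // false
}
```

With a mixed fleet the lenient check passes and hides the problem, while the strict check fails, which matches how the issue surfaced here.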

Environment

  • flyctl: v0.4.42 darwin/arm64 commit cd20495611543c8e04b01448e819b62907046eac (build date 2026-04-28)
  • Apps v2 (Machines)
  • 2 machines, primary region sea, bluegreen strategy
