Summary
When fly.toml declares process groups implicitly (i.e. via
http_service.processes only, with no top-level [processes] block),
fly deploy --strategy bluegreen falls into the launch-path code
(internal/command/deploy/plan.go → updateProcessGroup) on every deploy,
launches a fresh fleet, and never destroys the existing machines after
the green fleet passes health checks.
Result: every deploy doubles the fleet size. Fly's public LB rotates
traffic across all started machines regardless of fly_release_version,
so production serves a roughly 50/50 mix of old and new SHAs for several
minutes per deploy until the operator (or a pre-deploy reap step) destroys
the orphans.
This bites silently because fly deploy exits 0, the green machines do
pass health checks, and the LB does keep responding 200 — the reconciler
just never gets to the cordon-blue + destroy-blue phase.
Repro
fly.toml (notice the absence of [processes]):
app = "repro-app"
primary_region = "sea"
[build]
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "off"
auto_start_machines = true
min_machines_running = 2
processes = ["app"]
[deploy]
strategy = "bluegreen"
Run fly deploy twice in a row. On the second run, observe:
==> Verifying app config
--> Verified app config
==> Building image
...
Process groups have changed. This will:
* create 2 "app" machines
> Launching new machine
No machines in group app, launching a new machine
...
…even though fly machines list shows the two existing machines have
metadata.fly_process_group = "app". After the deploy completes:
$ fly machines list --app repro-app | wc -l
4 # (2 existing + 2 new — orphans not destroyed)
The 4-machine fleet persists indefinitely. Subsequent deploys add 2 more
each time.
Workaround (and the suggested fix path)
Adding the block explicitly resolves the issue:
[processes]
app = ""
The empty CMD preserves the implicit default behavior (Dockerfile's CMD
remains the entrypoint). With this block present, the same fly deploy
goes through the update-path code, cordons the existing machines, and
destroys them after the green fleet is healthy:
==> Verifying app config
--> Verified app config
==> Building image
...
==> Updating existing machines in 'repro-app' with bluegreen strategy
...
==> Cordoning blue machines
==> Destroying blue machines
...
Expected vs actual
Expected: Fly's docs (Run multiple process groups) say:
If you don't define any processes, the Machines in a Fly App belong to
the default app process group.
So [processes]\n app = "" should be a no-op — the resulting
fly_process_group metadata is "app" either way.
Actual: flyctl's bluegreen reconciler distinguishes the two cases.
With the implicit form, updateProcessGroup decides "no machines in
group app" and falls into the launch-path; with the explicit form, it
correctly identifies the existing machines as the previous release and
follows the bluegreen update-path.
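The equivalence the docs describe can be illustrated as a normalization step applied before any comparison against the live fleet. This is a hedged sketch of the expected behavior only; it is not flyctl's code, and the function names and dict-based config representation are hypothetical.

```python
DEFAULT_GROUP = "app"  # the default process group named in the docs


def effective_groups(config):
    """Return the process groups a parsed fly.toml declares.

    An explicit [processes] table wins. Otherwise the groups referenced by
    http_service.processes are used, falling back to the default group.
    """
    explicit = config.get("processes")
    if explicit:
        return set(explicit)
    http_service = config.get("http_service") or {}
    implicit = set(http_service.get("processes", []))
    return implicit or {DEFAULT_GROUP}


def groups_on_update_path(config, live_groups):
    """Groups that exist in both the config and the live fleet.

    Under the expected behavior these should be updated (cordon + destroy
    blue) rather than treated as newly created groups.
    """
    return effective_groups(config) & set(live_groups)


# Hypothetical parsed configs: the implicit and explicit forms of the repro.
implicit_cfg = {"http_service": {"processes": ["app"]},
                "deploy": {"strategy": "bluegreen"}}
explicit_cfg = dict(implicit_cfg, processes={"app": ""})

# fly_process_group values read from the two existing machines.
live = ["app", "app"]

print(effective_groups(implicit_cfg) == effective_groups(explicit_cfg))  # True
print(groups_on_update_path(implicit_cfg, live))                         # {'app'}
```

With this normalization both forms resolve to the same group set, so the existing machines land on the update path in either case.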
Proposed fix (one of)
- In the reconciler: when comparing fly.toml process groups against
the live fleet's metadata.fly_process_group for bluegreen planning,
treat the implicit app group as equivalent to an explicit
[processes]\n app = "" block. The mapping
http_service.processes → fly_process_group is already established at
machine-create time; the comparison just needs to honor that mapping
symmetrically.
- In fly config validate: if [processes] is missing AND
[deploy].strategy = "bluegreen", warn (or auto-inject the implicit
form into the materialized config). This is the cheaper fix: it
doesn't change reconciler logic, it just nudges configs into the
shape that already works.
- In docs: at minimum, Run multiple process groups and
App configuration → strategy = "bluegreen" should mention
that bluegreen requires an explicit [processes] block. Right now
the docs imply the two forms are equivalent.
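Until one of these lands, the validation idea can be approximated in CI as a pre-deploy lint. The sketch below is an assumption-laden illustration, not a flyctl feature: the function names and warning text are invented here, and it only inspects fly.toml, so it cannot see a strategy passed as --strategy bluegreen on the command line.

```python
def bluegreen_processes_warning(config):
    """Return a warning string if a parsed fly.toml has the affected shape.

    Affected shape: [deploy].strategy is "bluegreen" and there is no
    non-empty top-level [processes] table. Returns None otherwise.
    """
    strategy = (config.get("deploy") or {}).get("strategy")
    if strategy != "bluegreen":
        return None
    if config.get("processes"):
        return None
    return ('bluegreen strategy without an explicit [processes] block; '
            'consider adding [processes] with app = ""')


def check_file(path="fly.toml"):
    """Parse a fly.toml from disk and run the check (needs Python 3.11+)."""
    import tomllib  # stdlib TOML parser, available from Python 3.11

    with open(path, "rb") as handle:
        return bluegreen_processes_warning(tomllib.load(handle))


# Hypothetical parsed forms of the repro config, before and after the workaround.
affected = {"http_service": {"processes": ["app"]},
            "deploy": {"strategy": "bluegreen"}}
fixed = dict(affected, processes={"app": ""})

print(bluegreen_processes_warning(affected) is not None)  # True
print(bluegreen_processes_warning(fixed))                 # None
```

A CI step could call check_file() and fail the pipeline when it returns a warning.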
Impact
This affects any user running bluegreen on an app whose fly.toml
declares processes only via http_service.processes. The default Fly
templates currently emit fly.tomls in this shape (no top-level
[processes]), so it's likely a common config.
The user-visible symptom is hard to attribute: fly deploy exits 0,
health checks pass, the new SHA is "deployed" — but production traffic
gets ~50% old code for minutes per deploy. Smoke tests that retry until
any response matches (a common idiom) silently mask it; only smoke
tests that require every sample to match catch it.
We caught it via an aggressive smoke test (scripts/smoke-test-prod.sh)
that takes 20 samples against /api/health and asserts every one matches
the expected git_sha. Before the workaround, that test was triggering
rollbacks that were false positives: the new SHA had deployed correctly,
it just wasn't serving traffic exclusively.
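The difference between the two smoke-test idioms comes down to any versus all over the sampled responses. The following is a minimal sketch of that distinction, not the contents of scripts/smoke-test-prod.sh. The helper names are hypothetical, and the assumption that /api/health returns JSON with a git_sha field is inferred from the description above.

```python
import itertools
import json
import urllib.request


def fetch_git_sha(url):
    """Fetch one health response and return its git_sha (ASSUMED JSON shape)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response).get("git_sha")


def sample_shas(fetch, n=20):
    """Call `fetch` n times and collect the reported SHAs."""
    return [fetch() for _ in range(n)]


def lenient_pass(samples, expected_sha):
    """Retry-until-match idiom: passes if ANY sample is the new SHA."""
    return any(sha == expected_sha for sha in samples)


def strict_pass(samples, expected_sha):
    """Passes only if there is at least one sample and EVERY one matches."""
    return bool(samples) and all(sha == expected_sha for sha in samples)


# Simulate a load balancer alternating between old and new machines.
rotation = itertools.cycle(["old-sha", "new-sha"])
mixed = sample_shas(lambda: next(rotation), n=20)

print(lenient_pass(mixed, "new-sha"))  # True  (masks the mixed fleet)
print(strict_pass(mixed, "new-sha"))   # False (catches it)
```

Against a real deployment, the fetch callable would be something like lambda: fetch_git_sha(url) for the app's health URL. Assuming an even split and independent samples, the chance that all 20 samples happen to hit new machines is 0.5 ** 20, roughly one in a million, which is why the strict form reliably flags the mixed fleet.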
Environment
- flyctl: v0.4.42 darwin/arm64 commit cd20495611543c8e04b01448e819b62907046eac (build date 2026-04-28)
- Apps v2 (Machines)
- 2 machines, primary region sea, bluegreen strategy
Related
- "fly deploy doesn't list the process groups of the machines it's destroying" #1917: same root concept (implicit vs explicit process groups treated
inconsistently), different symptom (logs report wrong group during
destroy). That one's about cosmetic log attribution; this one is about
the bluegreen reconciler taking the wrong code path entirely.
- updateProcessGroup (different failure mode, but adjacent code:
internal/command/deploy/plan.go:340).