TOOLS-2592 Tooling for shipping Triton service images from monitor-reef#21

Merged
nshalman merged 248 commits into main from how-to-ship
May 4, 2026
Conversation

@nshalman
Collaborator

@nshalman nshalman commented Mar 31, 2026

Portions generated by: Claude Opus 4.5 and 4.6 <noreply@anthropic.com>
Reviewed By: Travis Paul <tpaul@edgecast.io>

Add tritonadm CLI, SAPI/IMGAPI/NAPI/PAPI API conversions, and zone image build infrastructure

Summary

  • New API trait conversions: SAPI, IMGAPI, NAPI, and PAPI, each with a full 5-phase conversion (plan → API trait → client → CLI → validation)
  • tritonadm CLI: New operator administration tool with subcommands for post-setup (grafana, portal, common-external-nics), image management (list, import, import-remote, delete),
    dc-maint status, and dev teardown helpers
  • Zone image build infrastructure: Design doc, Makefile-based build system (images/), and a triton-api service with SMF manifests and SAPI metadata
  • triton-tls crate: Portable TLS cert loading that works on both illumos and other platforms
  • Auto-discovery: tritonadm discovers SAPI/VMAPI URLs from Triton headnode config files
  • Client generator improvements: Error schema patches for all Node.js Triton API clients, new client registrations for SAPI/IMGAPI/NAPI/PAPI
  • Restify conversion skill improvements: Updated guidance based on lessons from SAPI conversion

Test plan

  • make package-build PACKAGE=tritonadm builds successfully
  • make package-test passes for PACKAGE=sapi-cli, imgapi-cli, napi-cli, and papi-cli
  • make openapi-check confirms generated specs are up-to-date
  • make clients-check confirms generated client code is up-to-date
  • make audit passes (with known pre-existing exceptions)
  • Verify tritonadm post-setup commands work against a Triton headnode
  • Verify tritonadm image import/list/delete against IMGAPI

nshalman and others added 3 commits March 26, 2026 12:44
Outlines the images/ directory approach for building multiple Triton
zone images from a single Rust monorepo, including per-service
Makefiles, SAPI integration, and a jenkins-joylib enhancement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document which services don't ship as images (bugview, jira-stub),
list the reference repos needed to understand the design, and add a
prerequisites checklist for the jenkins-joylib change and SmartOS
testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce triton-api, a Dropshot API service that will eventually
replace cloudapi. For now it has a single /ping endpoint. This also
establishes the images/ directory structure for building zone images
from the monorepo.

- apis/triton-api: API trait with /ping endpoint
- services/triton-api-server: service implementation
- images/triton-api: zone image Makefile, SMF manifests, SAPI
  manifests, and boot script
- images/image.defs.mk: shared image build definitions, sets
  ENGBLD_REPO_ROOT for eng Makefile compatibility
- deps/eng: updated to include ENGBLD_REPO_ROOT monorepo support
- .gitignore: add image build artifact patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setup.sh was committed without the execute bit, which would cause
SMF postboot to fail to start. Also move smf_include.sh source
before the first-boot marker check so $SMF_EXIT_OK is available
for the early exit path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nshalman and others added 24 commits April 6, 2026 16:46
$(shell) swallows exit codes, so git rev-parse and git submodule
update failures would leave ENGBLD_REPO_ROOT empty and eng includes
broken with confusing errors. Add explicit guards with clear messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update all REPO_ROOT references in code blocks to ENGBLD_REPO_ROOT
to match the actual implementation. Renumber open questions (was
1,3,4,5; now 1,2,3,4). Reframe eng Makefile compatibility question to reflect
that ENGBLD_REPO_ROOT already addresses the root issue. Remove local
filesystem path from TODO.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Files were untracked artifacts, not committed to the branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add status and healthy fields to PingResponse matching VMAPI pattern.
Move types to types/ module for consistency with other API crates.
Add Clone derive and crate-level doc comment. Update server to
return populated response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add triton-api dependency and ManagedApiConfig entry so make
openapi-generate and openapi-check cover the new API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document that the bind address should come from the SAPI-generated
config file once this service is ready for production deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
*.tar.gz, bits/, proto/, make_stamps/ were repo-wide but only
needed for image builds. Scope to images/*/ to avoid accidentally
hiding legitimate files elsewhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rust successor to sdcadm for Triton datacenter administration.
All 16 top-level commands and 47 subcommands scaffolded as stubs
returning "not yet implemented". Shell completion works. Design doc
covers architecture, API client strategy, and first target
(post-setup portal).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal Triton APIs get the full trait-based pipeline (API trait →
OpenAPI spec → Progenitor client), not hand-written minimal clients.
Builds toward correct specs from day one and means the trait is ready
when we rewrite the Node.js services. jira-client is the sole
exception as a large external API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 5 API clients needed for post-setup portal also unlock services,
instances, avail, check-config, and check-health as low-hanging fruit.
Reordered priority list to reflect this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grafana has a known-working sdcadm implementation to validate against.
Same APIs needed, but we can compare results on a real DC before
applying the pattern to a brand-new service (portal).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three patches applied to sapi-api.json:
- GET /mode: returns plain string, not ModeResponse JSON object
- POST /mode: returns 204 no content, not 200 with JSON body
- POST /loglevel: returns empty 200, not JSON body

Updated client-generator to use patched spec, regenerated client,
and fixed CLI to handle the new response types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trait changes (canonical type fixes):
- Create endpoints return 200 (matching Node.js Restify default), not 201
- LogLevelResponse.level is serde_json::Value (Bunyan returns integer)
- SetLogLevelBody.level is serde_json::Value (accepts string or integer)
- Add uuid and master fields to all create body types

Patch additions:
- GET /ping 500: documented as known limitation (Node.js returns
  PingResponse on 500, Progenitor can't handle multiple response types)
- Create status code safety net patch (no-op since trait already fixed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused ModeResponse type (GET /mode is patched to return string)
- Add StorageType enum for PingResponse.stor_type field
- Change PingResponse.mode from String to SapiMode enum
- Change get_mode trait to return SapiMode (patched to string in spec)
- Change set_mode trait to HttpResponseUpdatedNoContent (native 204)
- Remove dead UpdateAttributesBody re-export from sapi-client
- Simplify post_mode patch to no-op (trait now generates 204 natively)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1: Add sections for enum identification, Restify response
pattern cataloging, patch requirements, and hidden request fields.

Phase 2: Add guidance on using Phase 1 enums, matching Restify
response patterns to Dropshot types (200 not 201 for creates),
and avoiding dead wrapper types.

Phase 5: Add enum wire-value verification, status code checking,
dead schema detection, and remaining String→enum scan.

Reference: Add Restify response pattern table, Progenitor
limitations section (multiple body types, text/plain, empty bodies).

Orchestrator: Add Step 2b for applying OpenAPI spec patches
between API generation and client generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add sapi-client and vmapi-client dependencies to tritonadm. Convert
main to async with tokio. Implement `services` (alias `svcs`) and
`instances` (alias `insts`) as the first real commands, replacing their
stubs. Services output matches sdcadm columns (type, uuid, name, image,
insts). Instances enriches SAPI data with VM alias, state, and image
from VMAPI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nshalman and others added 27 commits April 23, 2026 13:19
Four crates (cloudapi-client, vmapi-client, triton-gateway-client,
bugview-service) were overriding the workspace's `rand = "0.8"` to
`rand = "0.9"` individually, pulling both major versions into the
graph. Bump the workspace to 0.9 and drop the per-crate overrides.

triton-auth's crypto stack (p256, p384, rsa, ed25519-dalek) is pinned
to rand_core 0.6 by their `*::random(&mut _)` APIs — rand 0.9 bundles
rand_core 0.9, whose `OsRng` cannot satisfy the 0.6 trait bounds.
Rather than split the workspace's rand version, depend directly on
`rand_core = "0.6"` for the OsRng value, which stays in the rand_core
0.6 ecosystem the crypto crates require. Migrates the three call
sites in triton-auth and one in triton-auth-session.

Enable rand's `os_rng` feature in the workspace so future callers that
do want rand's own OsRng (the 0.9 one, unrelated to the crypto stack)
can still reach it without a per-crate override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add async-trait, base64, bytes, clap_complete, dirs, futures-util,
rcgen, url, and urlencoding to [workspace.dependencies] and migrate
all consumer crates to `{ workspace = true }`. Each of these deps was
declared directly in two or more crates with either matching or
trivially-compatible versions; hoisting them makes version bumps a
single-file edit.

url gains `features = ["serde"]` in the workspace definition,
matching the shape triton-auth-session and triton-api-server were
already using. bugview-service's plain `url = "2.5"` picks up serde
as a no-op (the crate wasn't using it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three CLI crates carried identical `assert_cmd = "2.0"` and
`predicates = "3.0"` dev-deps. Hoist both to the workspace so new
CLIs pick them up by name and version drift stays impossible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoist every external dep that was still declared directly in a single
crate. The bulk is triton-auth's 18-crate crypto stack (ssh-key,
ssh-encoding, md-5, rsa, p384, ed25519-dalek, dsa, sha1, sha2, pkcs8,
sec1, pem-rfc7468, signature, aes, cbc, des, der, time, plus the
rand_core 0.6 pin and serial_test dev-dep), triton-tls's three TLS
helpers (rustls-native-certs, rustls-pemfile, webpki-roots),
triton-cli's TUI/test stack (serde_yaml, comfy-table, dialoguer,
indicatif, getrandom, rpassword, test-case, pretty_assertions,
hostname, regex), and the genuinely one-off deps indexmap
(bugview-service), libc (tritonadm), syn (client-generator).

Also mops up two stray `http = "1"` direct declarations in
cloudapi-client and triton-gateway-client that the D2 sweep missed
(pattern only matched "1.1") and triton-cli's `http = "1.0"`.

After this commit, every workspace member's [dependencies],
[dev-dependencies], and [build-dependencies] declares every external
dep via `{ workspace = true }`. New crates pick up versions and
features for free; future bumps touch one file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumped versions of every external dep where an upgrade was available
(per `cargo upgrades`) and the compiler accepts it. Held-back versions
are documented in the workspace Cargo.toml comments.

Bumped:
- build-data    0.2 → 0.3 (set_GIT_COMMIT_SHORT / no_debug_rebuilds now return Result; handled in triton-cli/build.rs)
- getrandom     0.3 → 0.4
- jsonwebtoken  9.3 → 10.3 (now requires explicit CryptoProvider; enable `rust_crypto` feature)
- ldap3         0.11 → 0.12 (feature rename: "tls-rustls" → "tls-rustls-ring")
- progenitor    0.12 → 0.13
- progenitor-client 0.12 → 0.13
- progenitor-impl 0.12 → 0.13
- strum         0.27 → 0.28
- tokio-tungstenite 0.28 → 0.29

Held back:
- dropshot (0.16) and dropshot-api-manager (0.3): dropshot 0.17 emits
  richer WebSocket response schemas (101 / 4XX / 5XX instead of
  `default`) which triggers a progenitor 0.13 panic (`assertion
  failed: response_types.len() <= 1` in method.rs). Progenitor 0.13
  runs fine against dropshot 0.16's older spec shape, so keep dropshot
  at 0.16 until both upstreams cut a coordinated release.
- schemars (0.8): still pinned by dropshot 0.16.
- crypto stack (aes 0.8, cbc 0.1, der 0.7, des 0.8, md-5 0.10,
  sec1 0.7, sha1 0.10, sha2 0.10): all locked by rsa 0.9 /
  ssh-key 0.6, which still use digest 0.10 traits. Bumping the
  stack without bumping rsa/ssh-key produced trait-bound failures
  (BlockSizeUser, FixedOutput, HashMarker, etc.).
- rand (0.9), rand_core (0.6): rand 0.10 / rand_core 0.10 renamed
  `os_rng` → `sys_rng` on rand and removed OsRng from rand_core
  without a feature; the crypto stack above also holds rand_core 0.6.

make check (excluding openapi-check's stale-fix guidance, which was
fixed by regenerating) passes 1344 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move every `rustls::crypto::ring::default_provider().install_default()`
call in the workspace behind `triton_tls::install_default_crypto_provider`
so a future backend switch is a single-file edit. Introduce a
`selected_crypto_provider()` helper that returns the active
`CryptoProvider`, and use it both for the install and for
`NoCertVerifier::supported_verify_schemes`. Services that had copy-pasted
the install helper (bugview-service main + tests, triton-api-server) now
depend on triton-tls and drop their local helpers; triton-gateway's
inline install collapses into one line too. Drop the now-unused direct
`rustls` deps from triton-api-server and bugview-service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `tritonadm post-setup tritonadmin` subcommand symmetric with
`portal` and `tritonapi`: TRITONADMIN_CONFIG ServiceConfig + a small
build_tritonadmin_metadata helper that seeds TRITON_ADMIN_JWT_SECRET. The
generic cmd_add_service path handles image fetch / SAPI service creation
/ instance provisioning unchanged. Default image source is "current"
(local IMGAPI) since triton-admin builds aren't on the updates server
yet.

Also flips PORTAL_CONFIG.delegate_dataset from false to true. The
mariana-trench user-portal image now generates a self-signed haproxy
cert at /data/tls on first boot and expects the dataset to persist
across reprovision; matches the rationale already in TRITONAPI_CONFIG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip PORTAL_CONFIG and TRITONADMIN_CONFIG to include_external_primary:
false, matching the adminui/imgapi/triton-api convention. The zone now
provisions on the admin network only; the operator must run
`post-setup common-external-nics` to attach the external NIC.

Avoids a foot-gun where running `tritonadm post-setup portal` (or
tritonadmin) on a real cluster would put a freshly-provisioned web UI
on the public network in a single step, before an operator has had a
chance to inspect it. The zone's haproxy still terminates TLS, but
testing/staging deployments shouldn't auto-expose anyway.

Also extend cmd_common_external_nics's svc_names to include the two
new services and update the surrounding doc comments and final
"nothing to do" message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the SAPI v2 accept-version header, services responses omit the
`type` field and exclude agent-typed rows entirely, so `tritonadm
services` rendered an empty TYPE column and silently dropped every
agent service. Match sdcadm by setting the header on the SAPI client.

Adds `triton_tls::build_http_client_with_headers` and a new
`sapi_client::build_client` helper that wires the header in once;
tritonadm's call sites switch to the helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in main's "newtype schema naming + HEAD endpoint fixes"
(a0c9430) and the Go CloudAPI client subtree (5c9940b). Reconciles
several auto-merged conflicts and refactors triton-gateway-client to
mirror cloudapi-client's improved shape (newtype schemas + action body
parameters).

Conflict resolutions:
- .gitignore / Makefile: combined our image+tritonadm targets with
  main's Go-toolchain and coverage rules.
- Cargo.lock: regenerated.
- cli/triton-cli/src/main.rs: kept HEAD's triton_tls::build_http_client.
- cli/triton-cli/src/commands/instance/{create,list,migration}.rs +
  rbac/{role,role_tags}.rs: kept the gateway-client direction.
- cloudapi-client/src/generated.rs and openapi-manager/src/transforms.rs:
  regenerated post-merge; dropped dead patch_cloudapi_error_schema in
  favour of HEAD's patch_node_triton_error_schema.

Refactor to align gateway-client with cloudapi-client's improvements:
- client-generator: added with_replacement patches for triton-gateway
  (Tags, MetadataObject->Metadata, RoleTags, ProvisioningLimits,
  Resolvers, PolicyRules, ImageAcl, AffinityRules, NetworkIds) plus a
  VmBrand value_enum patch.
- triton-gateway-client/src/lib.rs: split re-exports — action body
  types via `pub use types::*`, everything else via `pub use
  cloudapi_api::*`. Surface AffinityRules / ImageAcl / NetworkIds /
  Resolvers / RoleTags / PolicyRef / PolicyRules / ProvisioningLimit.
- TypedClient action methods (start/stop/reboot/resize/rename_machine,
  enable/disable_firewall, enable/disable_deletion_protection,
  export_image) take `&Request` body types instead of
  `Option<String> origin`.
- ListMachinesFilter.brand: Brand -> VmBrand to match the API's
  list_machines builder.
- Added chrono dep; fake_response_body's fake_ts is DateTime<Utc>.

CLI: global cloudapi_client:: -> triton_gateway_client:: rename,
Brand2 -> Brand, list.rs imports VmBrand for state comparisons,
migration.rs wraps affinity in AffinityRules::from, snapshot.rs test
client wraps AuthConfig in GatewayAuthConfig::ssh_key.

Plus: bump libs/triton-auth/src/signature.rs copyright to 2026.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The custom bits-upload recipe was passing `-d $(ENGBLD_DEST_OUT_PATH)`
where eng's standard target uses `-d $(ENGBLD_DEST_OUT_PATH)/$(NAME)`,
so tritonadm builds were landing in /public/builds/<timestamp>/ rather
than /public/builds/tritonadm/<timestamp>/. Match the convention so the
artifacts are grouped with the other tritonadm builds and the
tritonadm-latest symlink lands where consumers expect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was commented out while the pipeline was being validated. make check
now passes (rust check + tests + clippy + openapi-check + clients-check
+ go-vet + go-test), so gate builds on it before image upload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wait_for_snapshot_deleted tests build a TypedClient (and thus a
reqwest/rustls client) but ran without a process-global CryptoProvider.
Production binaries install one in main(); under nextest each test runs
in its own process, so the install must happen in the test helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several tritonadm code paths need to distinguish "the resource does not
exist" (404) from "the API call itself failed" (5xx, transport, auth).
The latter must not be silently downgraded to a default value, which is
how an admin tool ends up reporting "Done." while production state is
half-wired. Add a small `commands::errors` helper plus the
`progenitor-client` dep so the next few commits can apply the
distinction at each suspect call site.
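
A minimal sketch of the 404-versus-everything-else distinction, using only
the standard library. `ApiError`, `is_not_found`, and `not_found_as_none` are
hypothetical names for illustration; they are not the actual
`commands::errors` helper or the `progenitor-client` error type.

```rust
use std::fmt;

/// Hypothetical stand-in for a generated client's error type.
#[derive(Debug, PartialEq)]
enum ApiError {
    /// The server answered with an HTTP status code.
    Status(u16),
    /// The request never produced a response (DNS, TLS, timeout, ...).
    Transport(String),
}

impl fmt::Display for ApiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ApiError::Status(code) => write!(f, "HTTP {code}"),
            ApiError::Transport(msg) => write!(f, "transport error: {msg}"),
        }
    }
}

/// True only when the API explicitly said the resource does not exist.
fn is_not_found(err: &ApiError) -> bool {
    matches!(err, ApiError::Status(404))
}

/// Map 404 to `Ok(None)`; pass every other failure through to the caller.
fn not_found_as_none<T>(result: Result<T, ApiError>) -> Result<Option<T>, ApiError> {
    match result {
        Ok(value) => Ok(Some(value)),
        Err(err) if is_not_found(&err) => Ok(None),
        Err(err) => Err(err),
    }
}
```

With this shape a caller writes `not_found_as_none(lookup)?` and gets an
`Option` for the legitimate "absent" case, while a 5xx or transport failure
still aborts the command.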

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `cmd_dc_maint_status` matched `Err(_) => false` on the SAPI
service lookups for cloudapi and docker, so a transient SAPI outage,
auth failure, or 5xx caused `tritonadm dc-maint status` to confidently
report "DC maintenance: off" even when maintenance was actually on and
we just couldn't read the state. An operator running this command to
confirm whether traffic is being shed could reach the wrong conclusion.

Propagate SAPI errors with `.context()` instead. The `metadata`
sub-lookup still falls back to `false` because an empty/missing field
genuinely means "not in maintenance"; only the API call itself failing
is a real failure.
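
The split between "lookup failed" and "field missing" might look roughly like
this. `Service`, `LookupError`, and `in_maintenance` are hypothetical and
simplified; the real command uses the SAPI client and `.context()`.

```rust
/// Hypothetical error for a failed SAPI lookup.
#[derive(Debug, PartialEq)]
struct LookupError(String);

/// Hypothetical, trimmed-down view of a SAPI service record.
struct Service {
    /// `None` models an empty or missing maintenance field in the metadata.
    maintenance_flag: Option<bool>,
}

/// Report the maintenance state for one service.
fn in_maintenance(lookup: Result<Service, LookupError>) -> Result<bool, LookupError> {
    // A failed lookup means the state is unknown, so it propagates with context.
    let service = lookup
        .map_err(|e| LookupError(format!("looking up service in SAPI: {}", e.0)))?;
    // A missing field genuinely means "not in maintenance".
    Ok(service.maintenance_flag.unwrap_or(false))
}
```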

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `cmd_avail` and `cmd_instances` previously matched `Err(_)` on
their per-service / per-instance API lookups, silently dropping rows
or substituting "-". For an operator-facing inventory tool this is
exactly backwards: a partial outage masquerades as a fleet that's
healthy and up-to-date, or a fleet that's entirely unknown.

cmd_avail now distinguishes 404 (image truly absent in IMGAPI — keep
the silent skip) from any other error (collected and printed as a
warning summary so the operator knows the table is incomplete and
why). A non-UUID `image_uuid` in SAPI is also treated as a real signal
rather than a routine miss, since it indicates data corruption.

cmd_instances now shows "missing" on a 404 (the SAPI instance points
at a VM VMAPI doesn't know about — legitimate stale state) and "?ERR"
on any other VMAPI error, with a per-instance error summary printed
after the table. A wholesale VMAPI outage is now distinguishable from
a fleet that genuinely contains many unknown instances.
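
As an illustration of the `cmd_instances` behaviour, a sketch of how one
table cell could be rendered. `VmLookup` and `state_cell` are hypothetical
names; only the "missing" and "?ERR" strings come from the description above.

```rust
/// Hypothetical outcome of a per-instance VMAPI lookup.
enum VmLookup {
    /// VMAPI returned the VM.
    Found { state: String },
    /// VMAPI answered 404: SAPI points at a VM that VMAPI does not know.
    NotFound,
    /// Any other failure (5xx, transport, auth).
    Failed(String),
}

/// Render the state column and record non-404 failures for the summary
/// that is printed after the table.
fn state_cell(uuid: &str, lookup: VmLookup, errors: &mut Vec<String>) -> String {
    match lookup {
        VmLookup::Found { state } => state,
        VmLookup::NotFound => "missing".to_string(),
        VmLookup::Failed(reason) => {
            errors.push(format!("{uuid}: {reason}"));
            "?ERR".to_string()
        }
    }
}
```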

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three call sites in this module previously collapsed every IMGAPI
error into "the resource is absent, retry":

* `ensure_origin_imported` matched `Err(_)` on the local manifest
  lookup, so a 5xx from local IMGAPI would launch an import action
  against a sick IMGAPI and produce a confusing chained error
  instead of pointing at the real problem.

* `import_remote_with_channel_fallback` retried channel-less on any
  error from the channel-scoped call. Network blips, auth, and TLS
  failures all got misread as "origin not on this channel," and the
  user-facing remediation hint suggested a workaround that wouldn't
  fix the actual cause.

* `wait_for_image_active` swallowed every `Err` for 4 minutes and
  then bailed with "timed out", masking a broken local IMGAPI as a
  slow import.

All three now match on 404 explicitly and propagate other errors
with `.context()`. `wait_for_image_active` tolerates up to 3
consecutive non-404 errors before bailing so a single transient
blip during a long import doesn't fail the whole operation.

The `Err(e) if action_is_404(&e)` arm in
`import_remote_with_channel_fallback` deliberately falls through to
the default-channel retry below, so it's tagged
`arch-lint: allow(no-error-swallowing)` with a reason; non-404s are
propagated by the next match arm.
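
A sketch of the bounded-tolerance polling loop, under two assumptions: a 404
is treated as "not there yet", and "tolerates up to 3" means the fourth
consecutive non-404 error aborts. `Poll` and `wait_for_active` are
hypothetical names, and the sleep between attempts is omitted.

```rust
/// Hypothetical result of one poll of the image state.
enum Poll {
    /// The image is active.
    Active,
    /// The image is not active yet (assumed to include a 404).
    NotYet,
    /// Any non-404 failure.
    Error(String),
}

const MAX_CONSECUTIVE_ERRORS: u32 = 3;

/// Poll until active; give up on timeout or after too many errors in a row.
fn wait_for_active(mut poll: impl FnMut() -> Poll, max_attempts: u32) -> Result<(), String> {
    let mut consecutive_errors = 0;
    for _ in 0..max_attempts {
        match poll() {
            Poll::Active => return Ok(()),
            // Any successful read resets the error streak.
            Poll::NotYet => consecutive_errors = 0,
            Poll::Error(reason) => {
                consecutive_errors += 1;
                if consecutive_errors > MAX_CONSECUTIVE_ERRORS {
                    return Err(format!("image lookup kept failing: {reason}"));
                }
            }
        }
        // A real loop would sleep between attempts here.
    }
    Err("timed out waiting for image to become active".to_string())
}
```

Resetting the counter on every successful read is what makes the limit apply
to consecutive errors rather than to the total over a long import.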

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four hazards in `cmd_add_service` and its helpers previously turned
real production-state failures into operator-perceived success:

* `existing_vm` lookup matched `Err(_) => None`, so a transient VMAPI
  failure made the reprovision-decision logic conclude "instance is
  up-to-date, nothing to do" without ever reading the actual VM
  state. Now 404 still maps to None (legitimate stale SAPI ->
  missing-VM state), but other errors propagate.

* `ensure_manta_nic` failures at both the post-create and post-update
  call sites were downgraded to a `Warning:` log and the command
  exited 0 with "Done." printed. The service was created/updated but
  not fully wired, and the operator had no signal to investigate.
  Both call sites now propagate via `.with_context()`.

* `find_image` used `is_err()` to decide "needs download" for the
  `latest` flow, collapsing 404 (correct download trigger) and 503
  (local IMGAPI is down — should bail, not start an import). The
  explicit-UUID flow had the same `Err(_) => try updates server`
  pattern. Both now match on 404 explicitly and propagate other
  errors with `.context()`.

* `wait_for_image_active` (a copy of the helper in `imgapi_util.rs`)
  printed dots for 4 minutes on any IMGAPI error and then claimed a
  timeout. Now it matches 404 explicitly, tolerates up to 3
  consecutive non-404 errors as transient, and propagates the real
  error after that — an operator no longer chases imaginary slow
  imports when local IMGAPI is broken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds with_patch entries in configure_sapi for ServiceType,
UpdateAction, and SapiMode so the Progenitor-generated copies of these
enums get clap::ValueEnum, matching the pattern already used by
configure_imgapi / configure_papi / configure_napi. Pulls clap into
sapi-client's [dependencies] (it was the only generated client without
it) so the new derives compile.

Mechanical follow-up: regenerated clients/internal/sapi-client/src/
generated.rs via 'make clients-generate'. The next commit replaces
hand-rolled parse_service_type / parse_action / set-mode parser in
tritonadm with these typed enums and deletes the helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous Subcommand definition declared service_type, instance_type,
action, and the set-mode positional as Option<String> / String, with
hand-rolled parse_service_type / parse_action helpers and an inline
match in SetMode that re-implemented what clap::ValueEnum would
generate. This violated CLAUDE.md type-safety rules #2 (ValueEnum on
the canonical type) and #4 (no duplicate enum definitions): a typo in
the Rust match arms could disagree with the wire format with no
compile-time signal, and the parser strings would drift from the API
contract any time the OpenAPI spec changed.

With clap::ValueEnum now on the Progenitor copies (previous commit),
the args declare types::ServiceType / types::UpdateAction /
types::SapiMode directly with #[arg(value_enum)]. clap handles parsing
and `--help` autopopulates valid values. Deletes parse_service_type,
parse_action, and the inline mode match. Body construction loses the
.as_deref().map(parse_*).transpose()? boilerplate too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verify-only module behind `verify_totp(secret_base32, code)`. Parameters
fixed at SHA-1 / 30s step / 6 digits / +/-1 step skew to match piranha's
defaults so existing UFDS enrollments verify unchanged. Tests cover the
RFC 6238 SHA-1 vectors, the skew-window edges, and the malformed-secret
error path. Enrollment is intentionally left to piranha for v1.
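
For readers unfamiliar with the parameters, a self-contained sketch of TOTP
verification with SHA-1, a 30-second step, 6 digits, and a +/-1 step window.
It is not the module's code: it takes raw key bytes and a numeric code
instead of `verify_totp(secret_base32, code)`, skips base32 decoding,
inlines SHA-1 to avoid dependencies, and does not compare in constant time.
`verify_totp_at` and the other function names are hypothetical.

```rust
/// SHA-1 digest (FIPS 180-4), written out so the sketch needs no crates.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
    // Pad with 0x80, zeros up to 56 mod 64, then the bit length (big-endian).
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());
    for block in msg.chunks(64) {
        let mut w = [0u32; 80];
        for i in 0..16 {
            let j = 4 * i;
            w[i] = u32::from_be_bytes([block[j], block[j + 1], block[j + 2], block[j + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }
        let [mut a, mut b, mut c, mut d, mut e] = h;
        for (i, &word) in w.iter().enumerate() {
            let (f, k) = match i / 20 {
                0 => ((b & c) | (!b & d), 0x5A827999u32),
                1 => (b ^ c ^ d, 0x6ED9EBA1),
                2 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(word);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }
        for (state, v) in h.iter_mut().zip([a, b, c, d, e]) {
            *state = state.wrapping_add(v);
        }
    }
    let mut out = [0u8; 20];
    for (i, v) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&v.to_be_bytes());
    }
    out
}

/// HMAC-SHA1 (RFC 2104) with a 64-byte block size.
fn hmac_sha1(key: &[u8], message: &[u8]) -> [u8; 20] {
    let mut block = [0u8; 64];
    if key.len() > 64 {
        block[..20].copy_from_slice(&sha1(key));
    } else {
        block[..key.len()].copy_from_slice(key);
    }
    let mut inner: Vec<u8> = block.iter().map(|b| b ^ 0x36).collect();
    inner.extend_from_slice(message);
    let mut outer: Vec<u8> = block.iter().map(|b| b ^ 0x5c).collect();
    outer.extend_from_slice(&sha1(&inner));
    sha1(&outer)
}

/// HOTP (RFC 4226): dynamic truncation down to 6 decimal digits.
fn hotp(key: &[u8], counter: u64) -> u32 {
    let mac = hmac_sha1(key, &counter.to_be_bytes());
    let o = (mac[19] & 0x0f) as usize;
    let bin = u32::from_be_bytes([mac[o] & 0x7f, mac[o + 1], mac[o + 2], mac[o + 3]]);
    bin % 1_000_000
}

/// Accept the code for the previous, current, or next 30-second step.
fn verify_totp_at(key: &[u8], code: u32, unix_time: u64) -> bool {
    let step = unix_time / 30;
    [step.saturating_sub(1), step, step + 1]
        .iter()
        .any(|&counter| hotp(key, counter) == code)
}
```

With the RFC 6238 SHA-1 key `12345678901234567890` and time 59, the 8-digit
vector 94287082 truncates to the 6-digit code 287082.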

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `LdapService::read_user_metadata_value` (general capimetadata
reader, base-scoped search at `metadata=<ns>, uuid=<uuid>, <base>`)
and `LdapService::read_totp_secret`, which targets the piranha
schema (`portal` / `usemoresecurity`, JSON `{"secretkey": "..."}`).
The shared namespace lets existing piranha enrollments verify
unchanged. `noSuchObject`, missing `secretkey`, and empty
`secretkey` (the piranha-disable state) all collapse to `Ok(None)`
so callers can treat them uniformly as "not enrolled."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `JwtService::create_challenge_token` /
`verify_challenge_token` (with parallel methods on `JwtVerifier`).
Challenge tokens reuse the access-token signing key but carry a
distinct claim shape: `sub` + `username` only, plus a literal
`purpose: "2fa-pending"` and a 5-minute TTL. Neither `roles` nor
`is_admin` is carried; those come from mahi only after TOTP
succeeds, so a leaked challenge can never elevate.

Cross-decoding fails by construction: an access token is missing
`purpose` (required by `ChallengeClaims`); a challenge is missing
`roles` and `is_admin` (required by `Claims`). Tests cover the
round trip, both cross-decoding paths, expired and wrong-purpose
challenges, and tokens signed by a different issuer.
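
The "fails by construction" property can be illustrated without a JWT
library by treating an already-verified payload as a map of claims.
This is a sketch only: signature and expiry checks are omitted, values are
flattened to strings, and `Payload`, `required`, `decode_access`, and
`decode_challenge` are hypothetical names. The struct names and claim names
follow the description above.

```rust
use std::collections::HashMap;

/// Decoded token payload, flattened to strings for this sketch.
type Payload = HashMap<&'static str, &'static str>;

/// Access-token claims: `roles` and `is_admin` are required.
#[derive(Debug)]
struct Claims {
    sub: String,
    username: String,
    roles: String,
    is_admin: bool,
}

/// Challenge-token claims: identity only, no privileges.
#[derive(Debug)]
struct ChallengeClaims {
    sub: String,
    username: String,
}

/// Fetch a claim or fail, mimicking a required field in a typed decoder.
fn required(payload: &Payload, field: &str) -> Result<String, String> {
    payload
        .get(field)
        .map(|value| value.to_string())
        .ok_or_else(|| format!("missing required claim `{field}`"))
}

fn decode_access(payload: &Payload) -> Result<Claims, String> {
    Ok(Claims {
        sub: required(payload, "sub")?,
        username: required(payload, "username")?,
        roles: required(payload, "roles")?,
        is_admin: required(payload, "is_admin")? == "true",
    })
}

fn decode_challenge(payload: &Payload) -> Result<ChallengeClaims, String> {
    // An access token has no `purpose`, so it is rejected here.
    let purpose = required(payload, "purpose")?;
    if purpose != "2fa-pending" {
        return Err(format!("unexpected purpose `{purpose}`"));
    }
    Ok(ChallengeClaims {
        sub: required(payload, "sub")?,
        username: required(payload, "username")?,
    })
}
```

Because each decoder requires a claim the other token type never carries,
neither token can be accepted in the other's place even though both are
signed with the same key.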

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`POST /v1/auth/login` now returns a tagged `LoginOutcome`:
`complete` (the historical `LoginResponse` shape) when the user has
no second factor, or `challenge_required` carrying a 5-minute
challenge token plus the offered methods (currently `[totp]`) when
the user has a TOTP secret in UFDS `metadata=portal,
usemoresecurity`. The client posts the challenge token plus a code
to the new `POST /v1/auth/login/verify`, which re-reads the secret
server-side, runs `verify_totp`, and finishes the session with the
same `LoginResponse` + `Set-Cookie` the non-2FA path produces.

The challenge token never carries the secret. If 2FA is disabled
between login and verify the secret read returns `None` and verify
fails closed. SSH-key login (`/v1/auth/login-ssh`) is unchanged —
key possession already covers the second-factor role and matches
piranha parity.
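
The two-step flow can be sketched as follows. This is a simplified,
hypothetical model: token and session values are placeholder strings, the
password check is assumed to have already passed, and the TOTP check is
passed in as a closure. Only the `LoginOutcome` variant names mirror the
description above.

```rust
/// Hypothetical, simplified mirror of the tagged login result.
#[derive(Debug, PartialEq)]
enum LoginOutcome {
    Complete { session: String },
    ChallengeRequired { challenge_token: String, methods: Vec<String> },
}

/// Step 1: decide whether a second factor is needed.
fn login(username: &str, totp_secret: Option<&str>) -> LoginOutcome {
    match totp_secret {
        None => LoginOutcome::Complete { session: format!("session-for-{username}") },
        Some(_) => LoginOutcome::ChallengeRequired {
            // The token identifies the pending login; it never carries the secret.
            challenge_token: format!("challenge-for-{username}"),
            methods: vec!["totp".to_string()],
        },
    }
}

/// Step 2: re-read the secret at verify time and fail closed if it is gone.
fn login_verify(
    username: &str,
    secret_now: Option<&str>,
    code: &str,
    check: impl Fn(&str, &str) -> bool,
) -> Result<String, String> {
    let secret = secret_now.ok_or("second factor is no longer enrolled")?;
    if check(secret, code) {
        Ok(format!("session-for-{username}"))
    } else {
        Err("invalid code".to_string())
    }
}
```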

Regenerated `openapi-specs/generated/triton-api.json` and the
merged `openapi-specs/patched/triton-gateway-api.json`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerates `triton-gateway-client/src/generated.rs` against the
new tritonapi spec (the LoginOutcome enum, LoginVerifyRequest,
ChallengeMethod, and the auth_login_verify operation), and teaches
`triton login --user <name>` to handle the
`LoginOutcome::ChallengeRequired` branch by prompting for an
authenticator code (or reading `TRITON_TOTP_CODE` for non-tty
flows) and exchanging it via `/v1/auth/login/verify` for the
`LoginResponse` the rest of the login pipeline expects.

If the server offers only second-factor methods this CLI does not
recognise (i.e. all entries reduce to `ChallengeMethod::Unknown`),
we refuse before prompting rather than collecting a code we
cannot use.
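
A sketch of the "refuse before prompting" check. The enum here is a
hypothetical two-variant stand-in for the generated `ChallengeMethod`, and
`parse_method` and `pick_method` are illustrative names, not the CLI's
functions.

```rust
/// Hypothetical stand-in for the generated enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChallengeMethod {
    Totp,
    /// Anything this client version does not recognise.
    Unknown,
}

/// Map a wire value to a known method, or to `Unknown`.
fn parse_method(wire: &str) -> ChallengeMethod {
    match wire {
        "totp" => ChallengeMethod::Totp,
        _ => ChallengeMethod::Unknown,
    }
}

/// Pick the first supported method, or refuse before prompting for a code.
fn pick_method(offered: &[&str]) -> Result<ChallengeMethod, String> {
    offered
        .iter()
        .map(|wire| parse_method(wire))
        .find(|method| *method != ChallengeMethod::Unknown)
        .ok_or_else(|| {
            format!(
                "server offered no supported second-factor method: [{}]",
                offered.join(", ")
            )
        })
}
```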

SSH-key login is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lost-authenticator recovery story belongs near the verify
handler -- that's where someone debugging the path will look.
Plain `//` rather than `///` keeps it off the OpenAPI spec and
out of the generated client docs, since "ssh into a headnode
and run sdc-ufds" is an ops concern, not part of the API
contract clients consume.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nshalman nshalman merged commit 6b52b8e into main May 4, 2026
5 checks passed
@nshalman nshalman deleted the how-to-ship branch May 4, 2026 14:34

2 participants