MANTA-5523 Authorization layer must be aware of bucket scoped accesskeys #30
Conversation
Extend the accesskey replicator transform to handle the new accesskeyscope UFDS attribute. Scoped permanent credentials are stored in Redis as objects with secret and scope fields, while unscoped credentials retain the legacy string format. The reverse lookup key also stores scope metadata for scoped credentials. The modify() function preserves scope when a key is re-activated. The del() function needs no changes since it removes keys regardless of format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
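A minimal sketch of the dual-format handling described above, assuming illustrative helper names (the actual transform lives in the replicator's accesskey transform; these functions are not from the mahi codebase). Scoped keys become objects, unscoped keys keep the legacy string, and consumers must branch on the stored format:

```javascript
// Hypothetical sketch of the two Redis value formats for permanent
// credentials: scoped keys are JSON objects, unscoped keys are the
// legacy plain secret string.
function buildCredentialValue(secret, scope) {
    if (scope != null) {
        return (JSON.stringify({ secret: secret, scope: scope }));
    }
    return (secret);                    // legacy/unscoped format
}

// A consumer (e.g. the SigV4 path) must accept both formats. The
// leading-'{' heuristic assumes secrets never start with a brace.
function readCredentialValue(stored) {
    if (stored.length > 0 && stored[0] === '{') {
        var parsed = JSON.parse(stored);
        return ({ secret: parsed.secret, scope: parsed.scope });
    }
    return ({ secret: stored, scope: null });
}
```

The same branch appears wherever a permanent credential is read back, which is why a later commit in this PR centralizes the format into shared builders.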
Handle both string (legacy/unscoped) and object (scoped) formats for permanent credential data in the SigV4 verification flow. Extract bucketScope from the key data and include it in the /aws-verify response so that downstream services can enforce per-bucket access restrictions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When AssumeRole creates temporary credentials, carry the parent key's bucketScope into both the Redis credential data and the UFDS LDAP entry (as accesskeyscope). The UFDS attribute is only stored when the parent key has a non-null scope, avoiding meaningless null values in LDAP. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical fixes:
- sigv4: handle JSON reverse-lookup for scoped permanent keys (C1)
- replicator: fix vals[i] → vals[0] index bug in modify() (C2)
- sigv4: return bucketScope for temporary credentials (C3)
High fixes:
- sts: getSessionToken() now inherits parent key scope (H1)
- replicator: scope-only changes now replicate to Redis (H2)
Low fix:
- sigv4: add debug logging when keyData.secret missing from object
Added test/accesskey-scope.test.js: 19 tests, 85 assertions covering replicator transforms, sigv4 verification, and STS builder functions for scoped access keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix jsstyle violations caught by make check:
- typeof expr -> typeof (expr)
- return expr; -> return (expr);
- Block comment formatting (PART headers)
- /** @type -> /* (block comment style)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /aws-auth/:accesskeyid endpoint (used by IAM/STS routing) calls getUserByAccessKey in redislib.js which assumes permanent credential reverse lookups are plain UUID strings. For scoped permanent keys, the replicator stores JSON with userUuid and scope fields — the same C1 type-confusion bug that was fixed in sigv4.js but in a different code path. Without this fix, scoped keys cannot call IAM operations (create-role, put-role-policy, etc.) because the JSON lookup value is used as a UUID, causing the user lookup to fail with "Invalid credentials". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /aws-auth endpoint (used by STS/IAM routing in buckets-api) calls getUserByAccessKey which returns account/user/roles but never included bucketScope. When AssumeRole is called with a scoped parent key, sts-client.js reads opts.caller.bucketScope which was always undefined because the STS routing path never set it. Extract bucketScope from: - Scoped permanent keys (scopedData.scope from JSON reverse lookup) - Temporary credentials with inherited scope (tempCredData.bucketScope) Include in the response object so sts-iam-routing.js sets it on req.caller and sts-client.js forwards it to mahi. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ial verify
The handleTemporaryCredential function (UFDS fallback for when Redis misses) was missing bucketScope in its result object. The Redis path at line 972 correctly returned tempCredData.bucketScope, but the UFDS path at line 652 omitted it — causing the entire scope enforcement chain to be skipped for AssumeRole temp credentials. Add credData.accesskeyscope to the UFDS result, export the function for testing, and add two unit tests covering scoped and unscoped UFDS fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the replicator syncs temporary credentials from UFDS to Redis, the reverse-lookup entry (/accesskey/MSARXXX) was missing the bucketScope field. This caused the scope to be lost when sts.js wrote the entry with bucketScope but the replicator later overwrote it without. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… 2s poll
Unify permanent key Redis storage to always use { secret, scope }
object format, eliminating typeof branching in sigv4.js consumers.
Add POST /cache-push/:accesskeyid and POST /scope-revoke/:accesskeyid
operator endpoints for direct Redis writes. Parameterize replicator
poll interval via SAPI UFDS_POLL_INTERVAL (authcache: 2s, mahi: 10s).
Update sigv4.js to check object format first with legacy string
fallback for rolling upgrades.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a permanent key is not in Redis (replication lag), fall through to a direct UFDS LDAP search instead of returning AccessKeyNotFound. Eliminates the propagation window so the first request after key creation succeeds without client retry. Add verifySigV4Signature() shared helper to deduplicate the signature verification logic. Add negative cache (30s TTL, 10k max entries) to prevent garbage key IDs from hammering UFDS. UFDS search errors return 503 (ReplicatorNotReadyError) instead of masking the outage as "key not found." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. Wrap unguarded JSON.parse(userRes) in try/catch at both the permanent credential path and temporary credential path. During Redis format migration (string -> object), corrupt values caused uncaught SyntaxError (500/503) instead of 403.
2. Return InvalidSignatureError instead of AccessKeyNotFoundError when the key is not in Redis and no UFDS fallback is available. Matches S3 behavior (403 Forbidden for unknown keys).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add isNegativelyCached check before the Redis GET in the permanent credential lookup. Add addToNegativeCache when the key is not found and UFDS is unavailable. Prevents repeated Redis lookups for the same invalid key ID under brute-force attack. Reuses the existing negativeKeyCache (30s TTL, 10k max entries) from the UFDS read-through path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only check the negative cache when UFDS is unavailable. When UFDS is present, a key missing from Redis may still be found via UFDS read-through (newly created key during replication lag). Add debug log on negative cache hit for operational visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace O(n) Object.keys(negativeKeyCache).length with a simple counter variable. Avoids allocating a temporary 10k-element string array on every cache insert under sustained brute-force attack. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
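The counter-based bookkeeping in the last two commits could be sketched as follows (a minimal sketch with assumed names and constants taken from the commit messages — 30s TTL, 10k max entries — not the actual sigv4.js code):

```javascript
// Negative cache sketch: bounded size tracked by an O(1) counter
// instead of Object.keys(cache).length on every insert.
var NEG_TTL_MS = 30 * 1000;     // per the commit message: 30s TTL
var NEG_MAX = 10000;            // per the commit message: 10k entries
var negativeKeyCache = {};
var negativeKeyCount = 0;

function addToNegativeCache(keyId) {
    if (negativeKeyCount >= NEG_MAX) {
        return (false);         // full: drop the new entry
    }
    if (!negativeKeyCache.hasOwnProperty(keyId)) {
        negativeKeyCount++;     // counter, not Object.keys().length
    }
    negativeKeyCache[keyId] = Date.now() + NEG_TTL_MS;
    return (true);
}

function isNegativelyCached(keyId) {
    var expires = negativeKeyCache[keyId];
    // Expired entries are left for a periodic sweep to delete, so the
    // counter is only ever decremented in one place.
    return (expires !== undefined && Date.now() < expires);
}
```

Under a brute-force attack every insert is on the hot path, so avoiding the temporary 10k-element array per insert is the point of the counter.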
Fix intermittent test failures caused by 10s replicator poll interval being longer than the test REPL_WAIT. Three layers of defense: 1. setup.sh: set UFDS_POLL_INTERVAL=2000 in SAPI for authcache zones 2. mahi2/template: Mustache default to 2000 if variable unset 3. main.js: runtime fallback to 2000ms if config interval invalid Also: document two-writer model (CloudAPI cache-push + replicator) in accesskey.js transform, and fix pre-existing sigv4.js debug log typo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Mustache conditional syntax for UFDS_POLL_INTERVAL with a plain 2000ms default. The Mustache syntax broke json --validate in make check because the template isn't valid JSON until SAPI renders it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mahi UFDS connection pool uses ldapjs via the ufds module. ufds sets idleTimeout to opts.idleTimeout || 90000, so connections in the pool silently unbind after 90 seconds of idle. generic-pool has no testOnBorrow and hands these dead clients to callers. The subsequent LDAP add() in sts/assume-role then hangs until the pool times out and retries with a reconnect, pushing the total past requestTimeout:10000 and returning RequestTimeout to the caller. In the pool create factory, build a shallow copy of ufdsConfig with idleTimeout overridden to 0. The ufds module is patched separately to treat 0 as "disabled" (opts.idleTimeout != null) rather than falling back to the default. Pool-level lifecycle (idleTimeoutMillis) remains the sole mechanism controlling when connections are destroyed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…illis
The pool validate() used client.connected !== false, but ufds does not expose a connected property. undefined !== false is always true, so validate() never rejected dead connections dropped by the UFDS server closing the TCP socket (888 such drops observed in logs). Add a _poolAlive flag on each client, set to false on the close event. Update validate() to check _poolAlive so pool.acquire() skips dead clients and creates a fresh connection instead of handing a broken client to the caller. Raise acquireTimeoutMillis from 3000 to 10000. When multiple pool connections die simultaneously, generic-pool must create replacements before any acquire can succeed. TCP connect + LDAP bind regularly exceeded 3s, causing spurious acquire timeouts under load. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, document negative cache and scope-revoke
Item 1: Parse bucketScope in credential-provider.js loadCaller() and store pre-parsed object as req.caller.parsedBucketScope, eliminating the per-request memoization in enforceBucketScope. The raw string is kept for sts-client.js which forwards it to mahi as-is.
Item 2: Add PART 6 cross-path round-trip tests to accesskey-scope.test.js:
- replicator transform.add() → sigv4.verifySigV4() — tests the UFDS replication pipeline end-to-end with scope preservation
- cachePush format → sigv4.verifySigV4() — tests the write-through cache path; guards against format divergence between the two paths
Test count: 109 assertions (was 99).
Item 3: Document negative-cache false-negative in sigv4.js — keys created during replicator lag without cache-push will be rejected for up to 30s after they arrive in Redis. Mitigation: always call POST /cache-push after key creation (CloudAPI already does this).
Item 4: Add replicationWarning field to POST /scope-revoke response body explaining that the key will return from UFDS replication unless the UFDS record is also deleted via CloudAPI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifySigV4Handler passed { ufdsPool: req.ufdsPool } but sigv4.js
read opts.ufds — a different property name. Result: the UFDS
read-through path for permanent credentials was dead in production.
All Redis misses went directly to the negative cache with no UFDS
fallback.
Fix: create a thin proxy object in verifySigV4Handler that wraps the
UFDS connection pool with acquire/search/release semantics and pass it
as opts.ufds. sigv4.js already expects a client with .search(); the
proxy satisfies that interface using the pool.
Also updates the negative cache comment to accurately describe when
the cache fires (opts.ufds is null) vs when UFDS read-through is
used (opts.ufds is present).
Test: adds PART 7 UFDS read-through test — a permanent key absent
from Redis but found via a mock UFDS proxy. 114 assertions (was 109).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cachePushHandler (server.js) and the replicator transforms (add/modify)
independently constructed identical Redis structures for permanent keys.
The invariant was enforced by convention and a cross-path test, not by
shared code.
New module redis-accesskey-format.js provides:
- buildPermanentKeyEntry(secret, scope) → { secret, scope }
- buildPermanentKeyLookup(keyId, uuid, scope) → reverse lookup object
Both cachePushHandler and transforms/accesskey.js now call these
builders instead of constructing objects inline. The cross-path
tests (PART 6) confirm the output is identical.
Test: 125 assertions (was 114, +11 for builder unit tests).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three hardening fixes from nine-reviewer panel analysis:
Fix 1 — sigv4.js decomposition:
Extract handlePermanentCredentialRedis(), handleTemporaryCredentialRedis(),
buildPermanentResult(), buildTemporaryResult() from the 612-line
verifySigV4() monolith. All three critical bugs (C1, C2, C3) lived in
the inline code. Shared result builders prevent future divergence.
verifySigV4() is now a ~120-line dispatcher.
Fix 2 — Durable scope-revoke with tombstones:
POST /scope-revoke now writes a /revoked/{keyid} tombstone in Redis
with 24-hour TTL (same MULTI batch as the delete). The replicator
checks for tombstones before writing in add() and modify(), skipping
revoked keys. Repeated scope-revoke calls renew the TTL. Closes the
2-second revocation window where the replicator could re-add a
revoked key.
Fix 3 — Write versioning to prevent scope regression:
Redis entries now include a version field (changenumber for replicator,
Date.now() for cachePush). Writers skip if existing version is newer.
Prevents stale replicator writes from overwriting cachePush data
(silent scope regression from restricted to unrestricted).
Also: CHG-072 acquireTimeoutMillis 10s → 30s for UFDS pool recovery
after simultaneous connection death. Pool log defaults corrected.
Tests: PARTs 9-11 added (191 total assertions). make check clean.
Integration: 15/16 suites PASS (bulk-delete apostrophe pre-existing).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
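The tombstone check in Fix 2 could be sketched as follows (a sketch with assumed key names taken from the commit message — /revoked/{keyid} and /accesskey/{keyid} — and a callback-style Redis client; not the actual replicator code):

```javascript
// Before writing a key entry, check for a revocation tombstone and
// skip the write if one exists. cb(err, wrote) — wrote is false when
// the key was tombstoned.
function writeKeyIfNotRevoked(redis, keyId, value, cb) {
    redis.get('/revoked/' + keyId, function (err, tomb) {
        if (err) {
            cb(err);
            return;
        }
        if (tomb !== null) {
            cb(null, false);    // tombstoned: replicator must not re-add
            return;
        }
        redis.set('/accesskey/' + keyId, value, function (err2) {
            cb(err2, err2 ? undefined : true);
        });
    });
}
```

Because the tombstone is written in the same MULTI batch as the delete and carries a 24-hour TTL, the replicator's add()/modify() paths see it for the whole revocation window, closing the 2-second race described above.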
When a temporary credential is created via AssumeRole or GetSessionToken from an unscoped parent key, store bucketScope: "none" instead of null. This allows enforceBucketScope in manta-buckets-api to distinguish "parent was explicitly unscoped" from "scope was lost in transit", closing a fail-open path on the STS auth chain. Permanent keys are unaffected — the sentinel applies only to the STS temporary credential creation path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
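The three-way distinction the "none" sentinel enables could be sketched like this (a hypothetical downstream classifier, not the actual enforceBucketScope in manta-buckets-api; the return shape is illustrative):

```javascript
// Distinguish the three bucketScope states on the STS temp-credential
// path: explicitly unscoped ("none"), scoped (JSON), or lost in
// transit (null/undefined — fail closed).
function classifyBucketScope(bucketScope) {
    if (bucketScope === 'none') {
        // Parent key was explicitly unscoped: unrestricted access.
        return ({ restricted: false, scope: null });
    }
    if (bucketScope == null) {
        // Scope was lost somewhere in the auth chain: deny.
        return (null);
    }
    // Scoped credential: parse the JSON scope document.
    return ({ restricted: true, scope: JSON.parse(bucketScope) });
}
```

Without the sentinel, null would have to mean both "unscoped parent" and "scope lost in transit", and the only safe choice that isn't fail-open is to deny unscoped parents too.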
Change scope || null to (scope != null) ? scope : null in
buildPermanentKeyEntry and buildPermanentKeyLookup. The falsy
coalescing converted empty string "" to null (unrestricted),
which is the fail-open direction. With the explicit check,
empty string is preserved and downstream parseScope returns
null, causing fail-closed deny.
Update module header to document scope field values for both
permanent keys (null, JSON, "") and STS temporary credentials
("none" sentinel, JSON, null legacy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
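The difference between the two coalescing forms is small but directional, and can be shown directly (builder signature from the earlier redis-accesskey-format.js commit; the body is a sketch):

```javascript
// (scope != null) rather than (scope || null): an empty-string scope
// must be preserved so downstream parseScope fails and the request is
// denied (fail closed). `scope || null` would silently convert "" to
// null, i.e. unrestricted — the fail-open direction.
function buildPermanentKeyEntry(secret, scope) {
    return ({
        secret: secret,
        scope: (scope != null) ? scope : null
    });
}
```

Note that `undefined != null` is false in JavaScript (loose equality treats undefined and null alike), so both null and undefined still normalize to null; only genuinely falsy-but-present values like "" are preserved.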
Wrap three unguarded JSON.parse(userRes) calls with try/catch. If the user record in Redis is corrupt, these would throw an uncaught exception and crash the mahi process. Now they return 500 with a descriptive error instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consistent with the scope field fix — use (version != null) instead of version || 0 to prevent falsy coalescing issues if the default value ever changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
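The version gate from the earlier write-versioning fix, with this commit's null check, could be sketched as (assumed function name; the comparison rule — skip when the stored entry is strictly newer — is from the commit message):

```javascript
// Last-writer-wins gate: a write is applied only if its version is at
// least the stored one. Versions are assumed comparable numbers — a
// UFDS changenumber for the replicator, Date.now() for cache-push.
function shouldApplyWrite(existingVersion, incomingVersion) {
    // (existingVersion != null) rather than (existingVersion || 0),
    // so a legitimate falsy version would not be silently replaced.
    var existing = (existingVersion != null) ? existingVersion : 0;
    return (incomingVersion >= existing);
}
```

Because Date.now() values dwarf changenumbers, a stale replicator write can never outrank a cache-push write under this rule — which is also why a later commit adds a runtime check that changenumber stays below 1e9.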
cache sweep, rename key-revoke, vals[0] regression test
- Add runtime check that changenumber < 1e9 in replicator add() and modify(). Logs error if the version comparison invariant with cachePush (Date.now()) is at risk.
- Validate scope JSON in buildPermanentKeyEntry. Malformed scope strings are rejected at the write boundary rather than stored in Redis where every consumer must handle them.
- Replace cliff-reset negative cache with periodic sweep. Expired entries are evicted every 60s; new entries are dropped when the cache is full instead of clearing everything.
- Rename /scope-revoke to /key-revoke to match actual semantics (revokes the entire key, not just the scope).
- Add regression test for the vals[i] -> vals[0] fix in modify(). Status change at index > 0 in the changes array must still be extracted correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cache saturation, log-not-throw on malformed scope
- Rename route name and handler function from scopeRevoke to keyRevoke to match the /key-revoke endpoint path.
- Add rate-limited console.error when the negative cache is full and dropping new entries (possible garbage-key attack).
- Change buildPermanentKeyEntry scope validation from throw to console.error + store-as-is. Malformed scope is stored in Redis where downstream parseScope fails and enforceBucketScope denies (fail-closed). Throwing would crash the process since callers lack try/catch guards.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pool error handler, negative cache counter drift
- Replace remaining bucketScope || null with != null checks
in buildPermanentResult, buildTemporaryResult, cache-push
handler (body.scope), and replicator modify (vals[0]).
- Sanitize accessKeyId with /^[a-zA-Z0-9_-]+$/ regex before
LDAP filter construction in handlePermanentCredentialUfds
to prevent filter injection from the Authorization header.
- Add client.on('error') handler to UFDS pool clients so
protocol errors mark the connection dead (previously only
close events were handled, leaving zombie connections).
- Fix negative cache counter drift: remove inline eviction
from isNegativelyCached, rely solely on periodic sweep for
delete + decrement to prevent double-decrement.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
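The access-key-id sanitization in the second bullet could be sketched as follows (the regex is from the commit message; the function name and filter attribute are illustrative, not the actual mahi identifiers):

```javascript
// Reject any access key ID containing characters outside the safe set
// before it is interpolated into an LDAP search filter. Characters
// like '(', ')', '*' and '\' have special meaning in LDAP filters and
// would allow filter injection via the Authorization header.
var ACCESS_KEY_ID_RE = /^[a-zA-Z0-9_-]+$/;

function buildAccessKeyFilter(accessKeyId) {
    if (typeof (accessKeyId) !== 'string' ||
        !ACCESS_KEY_ID_RE.test(accessKeyId)) {
        return (null);      // caller treats this as "key not found"
    }
    return ('(accesskeyid=' + accessKeyId + ')');
}
```

An allow-list regex is simpler and safer here than escaping, since legitimate key IDs are known to be alphanumeric.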
The comment referenced a commit hash from this feature branch as if it were already in production. Rewrite to explain the actual situation: the string fallback handles the pre-upgrade Redis format during rolling deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the cache-push and key-revoke internal endpoints including request format, behavior, examples, and the tombstone-based revocation workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
danmcd
left a comment
Lots more too-narrow lines, but I did notice how you're making tombstones happen. I have a bit of distrust of JS math so pardon any questions there.
| (24 hours) if the key still exists in UFDS. To permanently
| revoke a key:
|
| 1. Call `DELETE /:account/accesskeys/:id` via CloudAPI
If I want this key GONE, wouldn't it make more sense to have THIS endpoint, after marking the tombstone, just generate the above DELETE?
Or are there cases people actually want to revoke for 24 hours, figure it out, and maybe let it go?
The only good thing about doing it in two steps is that if you made a mistake, the next UFDS sync will correct it. But I agree that doing this in two steps is more complex, because if you forget to do the second step the keys will be there again.
danmcd
left a comment
Again, some of this clipping at 60-or-fewer columns makes this code harder to read and review.
|       min: ufdsConfig.poolMin || 5,
|       max: ufdsConfig.poolMax || 20,
| -     acquireTimeoutMillis: ufdsConfig.poolTimeout || 3000,
| +     acquireTimeoutMillis: ufdsConfig.poolTimeout || 30000,
That's a big jump in acquisition timeout (3s -> 30s). Have the defaults in config files been 30s and we're just acknowledging that reality?
No, this is the new default fallback. When the UFDS pool was introduced as part of the STS/IAM work, I defaulted this value to 3000. Now, with this new feature, I needed to bump this value up to give the UFDS pool more time to recover after dead connections / simultaneous connection loss. This new default value was obtained empirically in my COAL environment; there is a TODO item on my side to validate a more accurate value, but considering that an underpowered headnode is functioning with this default, I'm assuming it is a safe value to use as a default.
|       var batch = req.redis.multi();
|
|       batch.set(userKey2,
This is an example of the too-squashed code I was talking about. This should be one line.
|           requestTimeMs: requestTime,
|           systemTimeMs: currentTime,
|           timeDiff: timeDiff
|       }, 'Y2038: Request timestamp beyond Y2038 ' +
Heh, thanks for remembering this. Hopefully before a Y2038 window happens we'll have 64-bit-ted this via rewrite.
| *
| * Shared by handleTemporaryCredentialRedis and
| * handleTemporaryCredential (UFDS) to prevent divergence in
| * result construction — the same pattern that caused bugs C1
These changes are required to allow the creation of bucket-scoped access keys for use by manta-buckets-api.
This PR adds bucket-level permission scoping to the mahi/authcache authentication service. Bucket-scoped access keys restrict S3 operations to declared buckets at specific permission levels (read, readwrite, full).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>