feat: multi-tenant auth with per-user workspace isolation #1118

Merged
ilblackdragon merged 8 commits into nearai:staging from standardtoaster:refile/multi-tenant-auth
Mar 24, 2026

@standardtoaster
Contributor

Rebased refile of #351 (closed in backlog triage). Previously reviewed by @serrrfirat and @zmanian — all review feedback was addressed. Rebased onto staging with no functional changes.

This addresses the same class of vulnerability as #760 (thread_id context pollution) architecturally — when every request is scoped to an authenticated user_id via GATEWAY_USER_TOKENS, cross-user pollution can't occur regardless of the attack vector.

This PR includes 39 HTTP-level integration tests for auth, isolation, and ownership checks — including DB-backed job ownership tests using in-memory libSQL. These don't map naturally to the trajectory format (auth happens before the agent loop), but happy to discuss what multi-tenant trajectory coverage should look like.

Broader context — I'm building a multi-user personal AI assistant on IronClaw and want to make sure I'm contributing in a useful direction. Would be great to sync on priorities if there's a good channel for that.

Depends on #1117.

Original PR: #351


Part 3 of 3 for Issue #59 (multi-tenancy). Depends on #1112 and #1117 — merge those first. The diff here includes all three PRs; once the first two merge, the diff shrinks to ~1,300 lines across 21 files.

Summary

Adds token-based multi-user authentication to the web gateway, giving each
user a fully isolated workspace with independent memory layers and
cross-scope read access. Builds on the layered memory (#1112) and
multi-scope reads (#1117) to deliver end-to-end multi-tenant workspace
isolation.

How it works

Single-user mode is the default and behaves identically to today. Multi-user
mode activates only when GATEWAY_USER_TOKENS is set:

GATEWAY_USER_TOKENS='{"tok-alice":{"user_id":"alice","workspace_read_scopes":["shared"]},"tok-bob":{"user_id":"bob","workspace_read_scopes":["shared"]}}'

Each user gets:

Key components

| Component | Purpose |
| --- | --- |
| MultiAuthState | Maps bearer tokens → UserIdentity (user_id, read scopes, memory layers) |
| AuthenticatedUser | Axum extractor that provides the resolved identity to handlers |
| WorkspacePool | Lazily creates and caches per-user workspaces with double-checked locking |
| PerUserRateLimiter | Independent sliding-window rate limits per user_id |
| ScopedEvent | SSE envelope with optional user_id; subscribers filter to their own events |
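
To make the double-checked locking mentioned for WorkspacePool concrete, here is a minimal std-only sketch. It is not the PR's implementation: the `Workspace` struct and its single field are placeholders, and the real pool also applies search config, read scopes, and memory layers before caching.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Placeholder for the real per-user workspace (hypothetical shape).
#[derive(Debug)]
pub struct Workspace {
    pub user_id: String,
}

/// Lazily creates and caches one workspace per user_id.
#[derive(Default)]
pub struct WorkspacePool {
    cache: RwLock<HashMap<String, Arc<Workspace>>>,
}

impl WorkspacePool {
    pub fn get_or_create(&self, user_id: &str) -> Arc<Workspace> {
        // First check: shared read lock, cheap when the workspace is cached.
        {
            let map = self.cache.read().unwrap_or_else(|e| e.into_inner());
            if let Some(ws) = map.get(user_id) {
                return Arc::clone(ws);
            }
        }
        // Second check: under the write lock, look again, because another
        // thread may have inserted the entry between the two lock calls.
        let mut map = self.cache.write().unwrap_or_else(|e| e.into_inner());
        let ws = map.entry(user_id.to_string()).or_insert_with(|| {
            Arc::new(Workspace {
                user_id: user_id.to_string(),
            })
        });
        Arc::clone(&*ws)
    }
}
```

Repeated calls for the same user_id return clones of the same `Arc`, so all handlers for one user share a single workspace instance.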

Security hardening

After the initial implementation, three rounds of AI-assisted code review
identified shared resources that become cross-tenant data leaks once
multiple users share a gateway. These were pre-existing on upstream but
harmless in single-user mode — multi-tenancy is what makes them
exploitable.

Fixed in this PR

| Issue | Fix |
| --- | --- |
| SSE broadcast to all subscribers | ScopedEvent envelope; subscribers filter by user_id |
| Single shared rate limiter | PerUserRateLimiter with independent windows per user |
| Routine handlers had no auth | AuthenticatedUser + routine.user_id ownership check |
| Job prompt handler skipped ownership when store=None | Require store (503) |
| SSE/WS subscribe was unscoped | Pass authenticated user_id to subscribe filter |
| OpenAI compat used default_user_id for rate limiting | Extract AuthenticatedUser, use user.user_id |
| IPv6 WebSocket origin validation | Extract is_local_origin() helper with bracket handling |
| conversation_belongs_to_user returned false on DB error | Propagate error as 500 instead of masking |
| sandbox_job_belongs_to_user returned false on DB error | Same (2 instances) |
| PerUserRateLimiter panicked on lock poisoning | into_inner() recovery |
| WS approval delivery was fire-and-forget | Send error to client on failure |
| jobs_detail_handler swallowed DB errors as 404 | Propagate as 500 |
| jobs_cancel_handler swallowed DB errors as 404 | Same |
| get_sandbox_job_mode silently defaulted to Worker on DB error | Log warning, then default |
| chat_threads_handler silently dropped DB errors | Log error before in-memory fallback |
| send_status silently broadcast globally when user_id missing | Added debug log |
| UserTokenConfig empty user_id | Validation in config parsing |
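
The first row is the core of the isolation story, so a small sketch of the filtering rule may help. This assumes ScopedEvent is a struct wrapping an event plus an optional owner; the `SseEvent` payload type and the `visible_to` / `events_for` helpers are hypothetical names for illustration, not the PR's API.

```rust
/// Placeholder payload; the real SseEvent has many variants.
#[derive(Debug, Clone, PartialEq)]
pub struct SseEvent(pub String);

/// Envelope pairing an event with an optional owning user.
#[derive(Debug, Clone)]
pub struct ScopedEvent {
    pub user_id: Option<String>,
    pub event: SseEvent,
}

impl ScopedEvent {
    /// A subscriber sees unscoped (global) events and events it owns.
    pub fn visible_to(&self, subscriber: &str) -> bool {
        match &self.user_id {
            None => true,
            Some(owner) => owner == subscriber,
        }
    }
}

/// Returns the events a given subscriber is allowed to receive, in order.
pub fn events_for<'a>(all: &'a [ScopedEvent], subscriber: &str) -> Vec<&'a SseEvent> {
    all.iter()
        .filter(|e| e.visible_to(subscriber))
        .map(|e| &e.event)
        .collect()
}
```

Under this rule an event with no user_id still reaches every subscriber, which is why an unscoped broadcast path remains a leak in multi-user mode.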

Known limitations (documented, need broader changes)

| Issue | Required change |
| --- | --- |
| Sandbox job SSE events broadcast to all tenants | Orchestrator needs per-job user_id tracking |
| Process-wide log stream shared across tenants | Needs per-user filtering or RBAC |
| Extension auth/status broadcasts are global | Extensions need user context threaded through |

Changes (this PR only)

| File | What |
| --- | --- |
| auth.rs | MultiAuthState, UserIdentity, AuthenticatedUser extractor, case-insensitive Bearer parsing |
| server.rs | WorkspacePool, PerUserRateLimiter, resolve_workspace(), handler auth/ownership, is_local_origin() |
| sse.rs | ScopedEvent envelope, broadcast_for_user(), user-scoped subscribe()/subscribe_raw() |
| ws.rs | Pass user_id to subscribe, scope auth broadcasts, approval error reporting |
| mod.rs | new_multi_auth(), with_workspace_pool(), scoped channel broadcasts |
| openai_compat.rs | Per-user rate limiter extraction |
| config/channels.rs | GATEWAY_USER_TOKENS parsing, UserTokenConfig, validation |
| extensions/manager.rs | Document user-scoping limitations |
| main.rs | Multi-user auth state + workspace pool wiring |
| test_helpers.rs | TestGatewayBuilder::start_multi() for multi-user server tests |
| tests/multi_tenant_integration.rs | 39 integration tests (see below) |
| tests/openai_compat_integration.rs | Updated for new GatewayState fields |
| tests/ws_gateway_integration.rs | Updated for new GatewayState fields |
| tests/support/gateway_workflow_harness.rs | Updated for new GatewayState fields |
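
For the case-insensitive Bearer parsing listed under auth.rs, a minimal sketch of what such a parser could look like. The function name `parse_bearer` is an assumption for illustration; the PR's actual helper may be shaped differently.

```rust
/// Extracts the token from an `Authorization` header value, accepting any
/// casing of the "Bearer" scheme. Returns None for other schemes or an
/// empty token.
pub fn parse_bearer(header: &str) -> Option<&str> {
    // Split at the first space: scheme on the left, credentials on the right.
    let (scheme, rest) = header.trim().split_once(' ')?;
    if !scheme.eq_ignore_ascii_case("bearer") {
        return None;
    }
    let token = rest.trim();
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}
```

HTTP authentication scheme names are case-insensitive, so `bearer`, `Bearer`, and `BEARER` are all treated the same here.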

Integration test coverage (39 tests)

Unit-level (22): MultiAuthState token→identity mapping, query token auth restrictions, per-user rate limiting isolation, SSE event scoping (user A can't see user B's events), edge cases (empty token, prefix match, first_token/first_identity).

Handler-level without DB (12): Full HTTP stack through real Axum server — protected endpoints reject unauthenticated/unknown tokens, public health endpoint accessible without auth, chat send flows through auth to agent channel, query token accepted on SSE but rejected on non-SSE endpoints, WebSocket per-user event isolation (Alice's scoped events not visible to Bob's WS connection).

Handler-level with DB (5): In-memory libSQL — Alice lists jobs and sees only hers, Bob requests Alice's job by ID and gets 404 (not 403, preventing enumeration), Alice can see her own job detail, Bob lists jobs and sees only his, nonexistent job returns 404.
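
The "query token accepted on SSE but rejected on non-SSE endpoints" rule can be sketched as a small pure function. The name `resolve_token` and the `is_sse_endpoint` flag are hypothetical; how the gateway actually identifies SSE routes is not shown in this PR description.

```rust
/// Picks the credential for a request. The Authorization header is always
/// honored; a `?token=` query parameter is honored only on SSE endpoints.
pub fn resolve_token<'a>(
    auth_header: Option<&'a str>,
    query_token: Option<&'a str>,
    is_sse_endpoint: bool,
) -> Option<&'a str> {
    // Header path: accept "Bearer <token>" with any scheme casing.
    if let Some(header) = auth_header {
        if let Some((scheme, tok)) = header.split_once(' ') {
            let tok = tok.trim();
            if scheme.eq_ignore_ascii_case("bearer") && !tok.is_empty() {
                return Some(tok);
            }
        }
    }
    // Query path: restricted to SSE, since tokens in URLs tend to end up
    // in logs and browser history.
    if is_sse_endpoint {
        query_token.filter(|t| !t.is_empty())
    } else {
        None
    }
}
```

The usual motivation for the SSE exception is that the browser EventSource API cannot set custom request headers, so a query parameter is the only way for it to carry a token.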

Test plan

  • cargo clippy --all --benches --tests --examples — zero warnings
  • cargo test --lib — 3025 tests pass
  • cargo test --test multi_tenant_integration — 39 tests pass
  • Single-user mode unaffected (no GATEWAY_USER_TOKENS → identical behavior)

@github-actions bot added labels on Mar 13, 2026: scope: agent, scope: channel/cli, scope: channel/web, scope: tool, scope: tool/builtin, scope: db, scope: db/postgres, scope: workspace, scope: orchestrator, scope: extensions, size: XL, risk: high, contributor: new
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive multi-tenancy capabilities to the gateway, enabling distinct user experiences with isolated data and resources. It focuses on architecturally preventing cross-user data pollution and enhancing security by ensuring that all interactions and data access are strictly tied to an authenticated user's identity. The changes lay the groundwork for a secure, multi-user personal AI assistant environment.

Highlights

  • Multi-User Authentication: Implemented token-based multi-user authentication for the web gateway, allowing each user a fully isolated workspace. This is configured via GATEWAY_USER_TOKENS and introduces MultiAuthState and UserIdentity.
  • Per-User Workspace Isolation: Introduced WorkspacePool to lazily create and cache per-user workspaces, ensuring independent memory layers and optional cross-scope read access based on user configuration.
  • Scoped Event Broadcasting: Refactored Server-Sent Events (SSE) and WebSocket connections to deliver events scoped by user_id, preventing cross-user data leaks and ensuring users only see their own events.
  • Per-User Rate Limiting: Replaced the global chat rate limiter with a PerUserRateLimiter, ensuring that one user cannot exhaust the rate limit for others in a multi-tenant setup.
  • Security Hardening: Addressed several potential cross-tenant data leaks and vulnerabilities identified through AI-assisted code review, including unscoped SSE broadcasts, shared rate limiters, and missing ownership checks on routine and job handlers.
  • Layered Memory Writes with Privacy: Enhanced memory write operations to support named memory layers, privacy classification, and automatic redirection of sensitive content from shared to private layers, with options to force writes or append content.
  • Extensive Integration Tests: Added 39 HTTP-level integration tests covering authentication, isolation, ownership checks, and DB-backed job ownership using in-memory libSQL, ensuring robust multi-tenant behavior.
Changelog
  • src/agent/job_monitor.rs
    • Updated job event broadcast receiver to include user ID for scoping.
  • src/app.rs
    • Configured workspace initialization to support multi-user settings and memory layers.
  • src/channels/web/auth.rs
    • Refactored AuthState to MultiAuthState for multi-user support.
    • Introduced UserIdentity struct and AuthenticatedUser extractor.
    • Updated auth_middleware to use MultiAuthState and insert UserIdentity.
  • src/channels/web/handlers/chat.rs
    • Adapted chat handlers to utilize authenticated user identities for rate limiting, session management, and event subscriptions.
  • src/channels/web/handlers/jobs.rs
    • Implemented user authentication and job ownership checks across all job-related API handlers.
  • src/channels/web/handlers/memory.rs
    • Removed deprecated memory_write_handler in favor of a new layer-aware implementation in server.rs.
  • src/channels/web/handlers/mod.rs
    • Adjusted module structure by moving jobs handlers out of the dead_code section.
  • src/channels/web/handlers/routines.rs
    • Updated routine trigger handler to use the default user ID.
  • src/channels/web/handlers/settings.rs
    • Modified settings handlers to manage user-specific settings using the default user ID.
  • src/channels/web/mod.rs
    • Refactored GatewayChannel to support multi-user authentication and per-user SSE broadcasting.
  • src/channels/web/openai_compat.rs
    • Updated OpenAI compatibility chat handler to use authenticated user for per-user rate limiting.
  • src/channels/web/server.rs
    • Introduced PerUserRateLimiter and WorkspacePool structs.
    • Updated GatewayState to include workspace_pool and default_user_id.
    • Modified start_server to accept MultiAuthState.
    • Integrated AuthenticatedUser into numerous API handlers for user context and ownership checks.
    • Added is_local_origin helper for WebSocket origin validation and verify_project_ownership for project access control.
  • src/channels/web/sse.rs
    • Introduced ScopedEvent enum to wrap SseEvent with an optional user_id.
    • Modified SseManager to use ScopedEvent for broadcasting and filtering events, allowing per-user delivery.
  • src/channels/web/test_helpers.rs
    • Updated test helpers to support multi-user authentication and per-user rate limiting for gateway testing.
  • src/channels/web/types.rs
    • Extended MemoryWriteRequest with layer, append, and force fields.
    • Extended MemoryWriteResponse with redirected and actual_layer for layered memory writes.
  • src/channels/web/ws.rs
    • Modified handle_ws_connection to accept UserIdentity and subscribe to SSE events with user scoping.
    • Updated clear_auth_mode calls to include user_id.
  • src/cli/oauth_defaults.rs
    • Updated OAuth flow to use the SseManager for broadcasting status events.
  • src/config/channels.rs
    • Added workspace_read_scopes, memory_layers, and user_tokens fields to GatewayConfig.
    • Introduced UserTokenConfig struct and added validation logic for memory layers and user tokens.
  • src/db/mod.rs
    • Extended WorkspaceStore trait with default multi-scope read methods for database backends.
  • src/db/postgres.rs
    • Provided optimized PostgreSQL implementations for multi-scope workspace read operations.
  • src/error.rs
    • Extended WorkspaceError enum with new variants for layered memory and privacy-related failures.
  • src/extensions/manager.rs
    • Updated extension manager to use the SseManager for broadcasting status events.
  • src/main.rs
    • Enhanced main application logic to support multi-user authentication and per-user workspace management.
  • src/orchestrator/api.rs
    • Updated job event broadcasting to include user ID for event scoping in multi-tenant environments.
  • src/orchestrator/mod.rs
    • Modified orchestrator setup to include user ID in job event broadcast type.
  • src/tools/builtin/job.rs
    • Modified job creation tool to include user ID in job event broadcast type.
  • src/tools/builtin/memory.rs
    • Enhanced memory write tool to support layered memory, privacy classification, and improved identity file protection.
  • src/tools/registry.rs
    • Modified tool registry to include user ID in job event broadcast type.
  • src/workspace/document.rs
    • Added utility function to merge and deduplicate workspace entries from multiple sources.
  • src/workspace/layer.rs
    • Added new module for defining and managing memory layers, including MemoryLayer struct and LayerSensitivity enum.
  • src/workspace/mod.rs
    • Introduced layer and privacy modules.
    • Added WriteResult struct.
    • Updated Workspace struct with read_user_ids, memory_layers, and privacy_classifier.
    • Implemented with_memory_layers, with_privacy_classifier, with_additional_read_scopes methods.
    • Modified read, exists, list, list_all, read_or_create, append_memory, and search_with_config to support multi-scope reads and layered writes.
  • src/workspace/privacy.rs
    • Added new module for privacy classification of content, including PrivacyClassifier trait and PatternPrivacyClassifier implementations.
  • src/workspace/repository.rs
    • Implemented optimized PostgreSQL queries for multi-scope search and document retrieval.
  • tests/layered_memory.rs
    • Added integration tests for layered memory features, including privacy and write access controls.
  • tests/multi_scope_functional.rs
    • Added functional tests for multi-scope workspace reads, ensuring data isolation and correct merging behavior.
  • tests/multi_tenant_integration.rs
    • Added comprehensive integration tests for multi-tenant authentication, event isolation, and resource management.
  • tests/openai_compat_integration.rs
    • Updated OpenAI compatibility integration tests to reflect multi-user authentication and rate limiting changes.
  • tests/support/gateway_workflow_harness.rs
    • Updated gateway workflow harness to support multi-user authentication and per-user rate limiting.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces comprehensive multi-user support by refactoring authentication to use MultiAuthState and UserIdentity for token-to-user mapping. This enables per-user scoping for SSE events, WebSocket connections, and chat rate limiting. User ownership and access control are enforced across job management, routines, and settings handlers. A new layered memory system is implemented within the workspace, supporting multi-scope read operations (e.g., from shared and private layers) while maintaining write isolation to a primary user scope. This includes new API endpoints and tool parameters for layer-aware writes, with optional privacy classification and redirection for sensitive content. The review comments suggest adding warning logs for poisoned read and write locks in the PerUserRateLimiter to improve debugging capabilities.

Comment on lines +157 to +160
    let map = match self.limiters.read() {
        Ok(m) => m,
        Err(e) => e.into_inner(),
    };

medium

While recovering from a poisoned lock using into_inner() is a robust approach to prevent the server from crashing, it would be beneficial to log a warning when this occurs. This would help in diagnosing the root cause of the panic that led to the poisoned lock.

Suggested change

    - let map = match self.limiters.read() {
    -     Ok(m) => m,
    -     Err(e) => e.into_inner(),
    - };
    + let map = match self.limiters.read() {
    +     Ok(m) => m,
    +     Err(e) => {
    +         tracing::warn!("PerUserRateLimiter read lock poisoned. Recovering, but the original panic should be investigated.");
    +         e.into_inner()
    +     }
    + };

Comment on lines +166 to +169
    let mut map = match self.limiters.write() {
        Ok(m) => m,
        Err(e) => e.into_inner(),
    };

medium

Similar to the read lock, it would be beneficial to log a warning when recovering from a poisoned write lock. This will aid in debugging the underlying panic.

Suggested change

    - let mut map = match self.limiters.write() {
    -     Ok(m) => m,
    -     Err(e) => e.into_inner(),
    - };
    + let mut map = match self.limiters.write() {
    +     Ok(m) => m,
    +     Err(e) => {
    +         tracing::warn!("PerUserRateLimiter write lock poisoned. Recovering, but the original panic should be investigated.");
    +         e.into_inner()
    +     }
    + };
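
For context, the poison-recovery pattern discussed in these two comments fits into a per-user sliding-window limiter roughly as follows. This is a simplified std-only sketch, not the PR's code: it uses a single `Mutex` instead of the read/write lock pair, takes `now` as a parameter for testability, and uses `eprintln!` where the real code would call `tracing::warn!`.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;
use std::time::{Duration, Instant};

pub struct PerUserRateLimiter {
    max_requests: usize,
    window: Duration,
    // One queue of request timestamps per user_id.
    hits: Mutex<HashMap<String, VecDeque<Instant>>>,
}

impl PerUserRateLimiter {
    pub fn new(max_requests: usize, window: Duration) -> Self {
        Self { max_requests, window, hits: Mutex::new(HashMap::new()) }
    }

    /// Returns true if `user_id` may make a request at time `now`.
    pub fn check(&self, user_id: &str, now: Instant) -> bool {
        let mut map = match self.hits.lock() {
            Ok(m) => m,
            Err(e) => {
                // Stand-in for tracing::warn!; recover instead of panicking.
                eprintln!("rate limiter lock poisoned; recovering");
                e.into_inner()
            }
        };
        let queue = map.entry(user_id.to_string()).or_default();
        // Slide the window: drop timestamps older than `window`.
        while let Some(&oldest) = queue.front() {
            if now.duration_since(oldest) >= self.window {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() < self.max_requests {
            queue.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Because each user_id has its own queue, one user exhausting their budget does not affect another user's requests.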

@standardtoaster standardtoaster force-pushed the refile/multi-tenant-auth branch from 47dd931 to 0ade5ac Compare March 13, 2026 12:29
Collaborator

@zmanian zmanian left a comment

This PR cannot be merged in its current state.

Blocker: Committed merge conflict markers

There are 37+ committed merge conflict markers throughout the source files. The code will not compile.

Scope

This PR stacks layered memory (#1112), multi-scope reads (#1117), and multi-tenant auth into a single 4800-line change across 40 files. Please:

  1. Fix all merge conflict markers
  2. Land #1112 and #1117 first as separate, reviewable PRs
  3. Rebase this PR on top of those merged changes so the diff shows only the multi-tenant auth work

Security note

The PR removes constant-time token comparison (subtle::ConstantTimeEq) in favor of HashMap::get(). This is a timing side-channel for bearer tokens. If this tradeoff is intentional for local-only use, add a startup warning when GATEWAY_HOST is not 127.0.0.1/localhost.

Also

  • Verify list_sandbox_jobs_for_user has both postgres and libsql implementations
  • No real CI ran (fork PR -- classify/scope only)

@standardtoaster standardtoaster force-pushed the refile/multi-tenant-auth branch from 0ade5ac to 4ed95b2 Compare March 13, 2026 21:17
@standardtoaster
Contributor Author

Apologies for the state of the last push — the rebase was badly botched. The code parsed as valid Rust (so cargo check and clippy passed) but had duplicate function parameters, stale struct field references, and broken test constructors from unresolved merge damage. I should have run the full test suite before pushing. Won't happen again.

This version:

  • Clean rebase onto #1117 (feat(workspace): multi-scope workspace reads) → #1112 (feat(workspace): layered memory with sensitivity-based privacy redirect) → staging
  • Diff shows only the 14 multi-tenant commits
  • All merge conflicts properly resolved (13 conflict regions in server.rs)
  • Fixed sse_sender → sse_manager struct rename across server.rs, sse.rs, extensions/manager.rs
  • Restored constant-time token comparison using subtle::ConstantTimeEq — replaces the HashMap::get() that introduced a timing side-channel. O(n) iteration over all tokens with ct_eq; negligible for < 10 users.
  • Build, clippy (zero warnings), and all 3,070 lib tests pass

Re: trajectory-based testing — happy to discuss what that would look like for the multi-tenant feature set. Is there a pointer to the trajectory system you mentioned?

@standardtoaster
Contributor Author

Pushed additional fixes since last comment:

Restored constant-time token comparison: authenticate() was using HashMap::get(), introducing a timing side-channel. Replaced with O(n) iteration using subtle::ConstantTimeEq. The crate was already a dependency but went unused after the multi-user rewrite. Validated by 203 existing auth tests.
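
A rough illustration of the O(n) scan described above, kept std-only. The hand-written `ct_eq` below is only a stand-in for `subtle::ConstantTimeEq`; a plain XOR fold is not guaranteed to stay constant-time under compiler optimization, which is the reason to use the crate in real code. `TokenTable` is a hypothetical name.

```rust
/// XOR-fold comparison that inspects every byte before deciding.
/// Illustrative stand-in for subtle::ConstantTimeEq, not a replacement.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        // Note: the length mismatch itself is still observable here.
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

pub struct TokenTable {
    /// (token, user_id) pairs.
    pub entries: Vec<(String, String)>,
}

impl TokenTable {
    /// Compares the candidate against every configured token with no early
    /// exit, so the position of the matching entry does not affect timing.
    pub fn authenticate(&self, candidate: &str) -> Option<&str> {
        let mut found = None;
        for (token, user_id) in &self.entries {
            if ct_eq(token.as_bytes(), candidate.as_bytes()) {
                found = Some(user_id.as_str());
            }
        }
        found
    }
}
```

The linear scan trades lookup speed for uniform work per request, which is cheap when the token list is small.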

Scoped extension secrets per-user: ExtensionManager had a hardcoded user_id = "default" set once at construction. All secrets operations (OAuth tokens, API keys, extension config) went through this single namespace regardless of which user was authenticated. Removed the field and threaded the authenticated user's ID through all 35+ methods. Web extension handlers now extract AuthenticatedUser and pass the real user_id.

Found this by auditing every self.user_id reference in ExtensionManager and tracing the flow from HTTP request → auth → secrets store. The database layer was correctly scoped (all queries use WHERE user_id = $1), but the application layer was passing the wrong user_id. Tests didn't catch it because the suite only exercises single-user mode — no test authenticates as user A and verifies user B's secrets are inaccessible.

Also fixed handlers/settings.rs and handlers/routines.rs — same pattern, default_user_id instead of authenticated user. However, I suspect these are dead code: server.rs defines its own inline versions of the same handlers (which already use AuthenticatedUser), and the route registrations resolve to the server.rs versions. Fixed them anyway for consistency, but worth confirming whether the handlers/ module definitions are intended to replace the inline ones or should be removed.

Remaining known gap: Slack OAuth callback (~line 936-990 in server.rs) uses state.default_user_id for secrets operations. This can't take AuthenticatedUser since it's a public OAuth callback with no auth header — the user_id needs to come from the stored OAuth flow state instead. Flagging for a follow-up.

Build, clippy (zero warnings), 3,070 lib tests pass.

@standardtoaster
Contributor Author

Rebased onto updated #1117 (which now includes identity scope isolation). All prior review feedback addressed (Mar 13). Ready for re-review. @zmanian

@standardtoaster standardtoaster force-pushed the refile/multi-tenant-auth branch from 95e1d89 to a6aa03d Compare March 16, 2026 13:04
@standardtoaster
Contributor Author

Rebased onto the updated #1117 (which now includes identity isolation and the WorkspaceConfig refactor).

Fixed chat handlers using default_user_id instead of authenticated identity. chat_send_handler, chat_history_handler, chat_threads_handler, chat_new_thread_handler, chat_approval_handler, chat_auth_cancel_handler, and chat_ws_handler were all reading state.default_user_id — meaning in multi-user mode, every user shared the same inbox. All 7 handlers now extract AuthenticatedUser from the middleware and use identity.user_id for message attribution, history scoping, and rate limiting.

Added 11 multi-user auth integration tests that exercise the full middleware chain with AuthenticatedUser extraction: each token resolves to the correct user_id and workspace_read_scopes, unknown tokens are rejected, and query param fallback works in multi-user mode.

@ilblackdragon ilblackdragon force-pushed the refile/multi-tenant-auth branch from 32c5a86 to e188d42 Compare March 23, 2026 23:19
…r handling

Security fixes:
- Hash tokens with SHA-256 at construction time so authentication
  compares fixed-size 32-byte digests, eliminating length-oracle
  timing leaks
- Scope auth SSE broadcasts per-user in chat_auth_token_handler —
  AuthRequired/AuthCompleted events were leaking across tenants
- Propagate DB errors in restart handlers instead of silently
  swallowing via `if let Ok(Some(...))` pattern

Code quality:
- Log SSE serialization failures instead of silently producing empty
  strings via unwrap_or_default()
- Remove dead `pub type AuthState = MultiAuthState` alias
- Replace `.unwrap()` with `Arc::clone(db)` in app.rs multi-tenant
  workspace setup (db is guaranteed Some in context, but unwrap
  violates project convention)
- Fix telegram setup test to inject UserIdentity into request
  extensions (handler now requires AuthenticatedUser)
- Add safety comments on test-only expect/unwrap calls for CI
- Apply cargo fmt to fix pre-existing formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon force-pushed the refile/multi-tenant-auth branch from e188d42 to 6372421 Compare March 24, 2026 00:30
@ilblackdragon
Member

Review fixes pushed (6372421)

Addressed all outstanding review comments. Here's the full list:

@zmanian — constant-time token comparison

Fixed. Tokens are now SHA-256 hashed at construction time (hash_token() in auth.rs). authenticate() compares fixed-size 32-byte digests via subtle::ConstantTimeEq, eliminating the length-oracle timing leak that existed when using raw HashMap::get() with variable-length tokens.

@serrrfirat — 5 inline comments

| Comment | Status | Details |
| --- | --- | --- |
| Token read scopes ignored in WorkspacePool | Already fixed in 6f4050d | get_or_create() applies both global scopes (line 268) and per-token workspace_read_scopes (line 273) |
| Multi-tenant workspaces drop config (search/layers) | Already fixed in 6f4050d | WorkspacePool::get_or_create() applies with_search_config, embeddings, global read scopes, and with_memory_layers before caching |
| Job summary leaks global counts | Already fixed in 6f4050d | Uses sandbox_job_summary_for_user / agent_job_summary_for_user (lines 100, 114) |
| Agent prompts always return 404 | Fixed in this commit | Refactored agent ownership check from 3-way && chain to explicit match: DB errors now propagate as 500, missing jobs return 404, and the ownership check can't be silently bypassed |
| Agent restart missing ownership check | Fixed in this commit | Both sandbox and agent restart paths now use match with user_id ownership verification and proper DB error propagation |
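
The explicit match described in the last two rows can be sketched in isolation. The `Status` enum, `Job` struct, and `check_job_access` function are hypothetical names, and the `String` error type stands in for whatever database error type the store returns.

```rust
#[derive(Debug, PartialEq)]
pub enum Status {
    Ok,            // 200: caller owns the job
    NotFound,      // 404: job missing, or owned by someone else
    InternalError, // 500: the lookup itself failed
}

pub struct Job {
    pub user_id: String,
}

/// Maps a store lookup result to a response status. `lookup` is Err on a
/// database failure and Ok(None) when no job has the requested ID.
pub fn check_job_access(lookup: Result<Option<Job>, String>, caller: &str) -> Status {
    match lookup {
        // A DB error is surfaced as 500 rather than masked as "not found".
        Err(_) => Status::InternalError,
        Ok(None) => Status::NotFound,
        // Another user's job looks identical to a missing one, so job IDs
        // cannot be enumerated by probing for 403 responses.
        Ok(Some(job)) if job.user_id != caller => Status::NotFound,
        Ok(Some(_)) => Status::Ok,
    }
}
```

Compared with an `if let Ok(Some(...))` chain, every arm is spelled out, so a failed lookup cannot silently fall through to the 404 path.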

@gemini-code-assist — lock poisoning logging

Already fixed in 6f4050d: tracing::warn! on both read and write lock poisoning recovery.

Additional fixes in this commit

  • Auth SSE broadcasts scoped per-user: chat_auth_token_handler was using state.sse.broadcast() (global) for AuthRequired/AuthCompleted events, leaking auth flow across tenants. Changed to broadcast_for_user().
  • Restart handlers propagate DB errors: Both sandbox and agent restart paths used if let Ok(Some(...)) which silently swallowed DB errors as 404. Converted to match with explicit Err arms returning 500.
  • SSE serialization logs failures: serde_json::to_string(&event).unwrap_or_default() replaced with filter_map that logs via tracing::warn! on failure.
  • Dead type alias removed: pub type AuthState = MultiAuthState was unused.
  • .unwrap() in app.rs removed: Replaced with Arc::clone(db) since the variable is guaranteed Some in context but .unwrap() violates project convention.
  • Telegram setup test fixed: Injects UserIdentity into request extensions for handler requiring AuthenticatedUser.
  • cargo fmt applied: Pre-existing formatting issues fixed.

All CI checks pass (fmt, clippy x3, no-panics, cargo-deny, regression test enforcement).

ilblackdragon previously approved these changes Mar 24, 2026
Member

@ilblackdragon ilblackdragon left a comment

Review: Approved with fixes applied

Solid, well-tested multi-tenant auth PR. I've applied fixes for the issues identified in the initial review:

Fixes applied

  1. Unified duplicate workspace pool. WorkspacePool now implements WorkspaceResolver, eliminating the near-identical PerUserWorkspaceResolver in memory.rs. app.rs now uses WorkspacePool directly for multi-tenant memory tools.

  2. Fixed sse_tx: None scheduler regression. Changed the scheduler/worker chain from broadcast::Sender<SseEvent> to Arc<SseManager>. The scheduler now receives the SseManager reference and passes it to workers, restoring SSE event broadcasting for scheduled agent jobs.

  3. Added job owner cache in orchestrator. OrchestratorState now has a job_owner_cache: Arc<RwLock<HashMap<Uuid, String>>> that caches job_id → user_id mappings. First event per job still hits the DB (cache miss), subsequent events use the cache.

  4. Deduplicated ext_user_id in main.rs. Extracted the repeated computation to a single let ext_user_id = ... before the two blocks that use it.

  5. Removed unused _gateway_state variable from main.rs.

  6. Fixed pre-existing test bug. multi_auth_state_first_token_returns_any_token was calling .unwrap() on first_token() in multi-user mode, but the implementation intentionally returns None in multi-user mode. Fixed the test to assert is_none().
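
The job owner cache in item 3 follows a read-through pattern that can be sketched with std types only. This is an illustration under assumptions: job IDs are plain strings here rather than Uuid, and `JobOwnerCache`, `owner_of`, and the `load_from_db` callback are hypothetical names standing in for the orchestrator's own lookup.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Caches job_id → user_id so only the first event per job hits the DB.
pub struct JobOwnerCache {
    owners: RwLock<HashMap<String, String>>,
}

impl JobOwnerCache {
    pub fn new() -> Self {
        Self { owners: RwLock::new(HashMap::new()) }
    }

    /// Returns the owner of `job_id`, calling `load_from_db` only on a miss.
    /// A failed lookup (None) is not cached, so it is retried next time.
    pub fn owner_of<F>(&self, job_id: &str, load_from_db: F) -> Option<String>
    where
        F: FnOnce(&str) -> Option<String>,
    {
        // Cache hit: served under the shared read lock.
        {
            let map = self.owners.read().unwrap_or_else(|e| e.into_inner());
            if let Some(owner) = map.get(job_id) {
                return Some(owner.clone());
            }
        }
        // Cache miss: consult the DB, then remember the answer.
        let owner = load_from_db(job_id)?;
        let mut map = self.owners.write().unwrap_or_else(|e| e.into_inner());
        map.insert(job_id.to_string(), owner.clone());
        Some(owner)
    }
}
```

A job's owner does not change after creation in this model, so entries never need invalidation; a real implementation may still want to evict entries for finished jobs to bound memory.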

Verification

  • cargo clippy --all --benches --tests --examples --all-features — zero warnings
  • cargo test --test multi_tenant_integration — 39/39 pass
  • cargo test --test openai_compat_integration --test ws_gateway_integration — 27/27 pass
  • Orchestrator, memory, scheduler, job_monitor, and web multi-tenant unit tests all pass

Note: multi_tenant_system_prompt tests are expected to fail (documented as "expected to FAIL until the bug is fixed" in the test file header).

…on, cache job owners

- Unify WorkspacePool and PerUserWorkspaceResolver: WorkspacePool now
  implements WorkspaceResolver, eliminating duplicate per-user workspace
  construction logic. app.rs uses WorkspacePool directly.

- Fix sse_tx: None scheduler regression: change scheduler/worker SSE
  broadcasting from broadcast::Sender<SseEvent> to Arc<SseManager>,
  restoring SSE event delivery for scheduled agent jobs.

- Cache job owner in orchestrator: add job_owner_cache to
  OrchestratorState so job_event_handler avoids a DB round-trip on
  every event after the first per job.

- Deduplicate ext_user_id computation in main.rs.

- Remove unused _gateway_state variable.

- Fix pre-existing test: first_token() returns None in multi-user mode
  by design; align test assertion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the scope: worker Container worker label Mar 24, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon dismissed zmanian’s stale review March 24, 2026 03:21

Concerns addressed

Move memory API handlers out of server.rs into their own module,
consistent with how jobs, routines, and skills handlers are organized.
The resolve_workspace() helper moves with them since it is only used
by memory handlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon merged commit b441ebe into nearai:staging Mar 24, 2026
14 checks passed
@ilblackdragon ilblackdragon mentioned this pull request Mar 24, 2026
23 tasks
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
* feat: multi-tenant auth with per-user scoping

Multi-user authentication and authorization for IronClaw gateway:
- Token-based auth mapping tokens to user IDs via GATEWAY_USER_TOKENS
- Per-user SSE broadcast scoping
- Per-user rate limiting with poisoned lock recovery
- Handler auth and ownership checks for jobs, settings, routines
- Extension secrets scoped per-user
- Chat handlers use authenticated identity
- Reverse proxy deployment documentation
- Comprehensive integration tests for auth, SSE, rate limiting, and job isolation
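
A minimal sketch of what "per-user rate limiting with poisoned lock recovery" can look like, assuming a fixed-window limiter behind a `std::sync::Mutex`. The type `PerUserRateLimiter`, its fields, and the windowing policy are hypothetical; only `PoisonError::into_inner` is a real standard library call.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical fixed-window limiter keyed by user id.
pub struct PerUserRateLimiter {
    max_requests: u32,
    window: Duration,
    // user_id -> (window start, requests counted in that window)
    windows: Mutex<HashMap<String, (Instant, u32)>>,
}

impl PerUserRateLimiter {
    pub fn new(max_requests: u32, window: Duration) -> Self {
        Self { max_requests, window, windows: Mutex::new(HashMap::new()) }
    }

    /// Returns true if `user_id` may make another request at time `now`.
    pub fn check(&self, user_id: &str, now: Instant) -> bool {
        // If another thread panicked while holding the lock, it is poisoned.
        // The map itself is still usable, so take the guard back instead of
        // panicking on every later request.
        let mut windows = self
            .windows
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        let entry = windows.entry(user_id.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // window expired: start a fresh one
        }
        if entry.1 < self.max_requests {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

Passing `now` as a parameter keeps the limiter deterministic in tests. Each user has an independent counter, so one tenant exhausting their budget does not affect another.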

* fix: scope memory tools per-user in multi-tenant mode

Memory tools (search, write, read, tree) held a single workspace
created at startup with GATEWAY_USER_ID. In multi-tenant mode, all
users' tool calls searched the default user's scope.

Add WorkspaceResolver trait that resolves workspaces per-request using
JobContext.user_id. In single-user mode, returns the startup workspace.
In multi-tenant mode (GATEWAY_USER_TOKENS configured), creates and
caches per-user workspaces on demand.

Includes regression tests for workspace resolution and user isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
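
The resolver split described in this commit can be illustrated as below. This is a simplified sketch under stated assumptions: the `Workspace` struct, the `resolve` signature, and `FixedResolver` are hypothetical, and the real trait takes its user id from `JobContext` and builds fully configured workspaces.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the real workspace type.
#[derive(Debug)]
pub struct Workspace {
    pub user_id: String,
}

/// Resolves the workspace a tool call should operate on.
pub trait WorkspaceResolver: Send + Sync {
    fn resolve(&self, user_id: &str) -> Arc<Workspace>;
}

/// Single-user mode: every request gets the workspace created at startup.
pub struct FixedResolver(pub Arc<Workspace>);

impl WorkspaceResolver for FixedResolver {
    fn resolve(&self, _user_id: &str) -> Arc<Workspace> {
        Arc::clone(&self.0)
    }
}

/// Multi-tenant mode: one lazily created, cached workspace per user.
#[derive(Default)]
pub struct WorkspacePool {
    cache: RwLock<HashMap<String, Arc<Workspace>>>,
}

impl WorkspaceResolver for WorkspacePool {
    fn resolve(&self, user_id: &str) -> Arc<Workspace> {
        // Fast path: the read guard is released at the end of this statement.
        let cached = self
            .cache
            .read()
            .unwrap_or_else(|e| e.into_inner())
            .get(user_id)
            .cloned();
        if let Some(workspace) = cached {
            return workspace;
        }
        let mut cache = self.cache.write().unwrap_or_else(|e| e.into_inner());
        // `entry` keeps the existing workspace if two requests raced to create it.
        let workspace = cache
            .entry(user_id.to_string())
            .or_insert_with(|| Arc::new(Workspace { user_id: user_id.to_string() }));
        Arc::clone(workspace)
    }
}
```

Memory tools can then hold an `Arc<dyn WorkspaceResolver>` and stay unaware of which mode the gateway runs in.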

* fix: comprehensive multi-tenant isolation audit

Address all review findings from @serrrfirat plus 7 additional gaps
found via full security audit:

Reviewer findings (5):
- WorkspacePool now applies search config, memory layers, embedding
  cache, identity read scopes, and global config scopes (was bare)
- jobs_summary_handler uses per-user queries instead of global counters
- jobs_prompt_handler restructured to not 404 agent jobs + ownership check
- jobs_restart_handler agent branch now verifies user ownership
- agent_job_summary_for_user added to Database trait + both backends

Audit findings (7):
- Delete dead handlers/memory.rs (stale copies with no auth)
- Add AuthenticatedUser to logs_events, logs_level_get, logs_level_set
- Add AuthenticatedUser to extensions_tools_handler, gateway_status_handler
- Add auth + ownership checks to all 6 routines handlers
- Add auth to all 4 skills handlers with audit logging on mutations
- Scope extension setup SSE broadcast to user (broadcast_for_user)
- Fix pre-existing test compilation errors in extensions/manager.rs

17 new multi-tenant isolation tests covering:
- WorkspacePool config propagation and scope merging
- Jobs handler per-user isolation (summary, restart, prompt, cancel)
- Routines handler auth enforcement and cross-user rejection
- Auth middleware enforcement on logs, skills, status endpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: second-pass multi-tenant audit — scope SSE broadcasts, DB queries, dead handlers

Second audit pass applying learned patterns across the codebase:

- OAuth callback SSE broadcasts now use broadcast_for_user (lines 773, 912)
- jobs_list_handler uses list_agent_jobs_for_user instead of fetching
  all users' jobs and filtering in Rust
- list_agent_jobs_for_user added to Database trait + postgres + libsql
- Dead handler files (extensions.rs, static_files.rs) hardened with
  AuthenticatedUser to prevent auth regression if migrated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
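
To show why switching to `broadcast_for_user` matters, here is a deliberately simplified model of per-user event delivery. It assumes one standard library channel per connection; the `SseManager` below is a hypothetical stand-in, and `subscribe`, the return value, and the event payload type are illustrative rather than the real API.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for the real `SseManager`.
#[derive(Default)]
pub struct SseManager {
    // One (user_id, sender) pair per connected client.
    subscribers: Mutex<Vec<(String, Sender<String>)>>,
}

impl SseManager {
    /// Registers a connection owned by `user_id` and returns its event stream.
    pub fn subscribe(&self, user_id: &str) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .push((user_id.to_string(), tx));
        rx
    }

    /// Delivers `event` only to connections owned by `user_id` and returns
    /// how many connections received it.
    pub fn broadcast_for_user(&self, user_id: &str, event: &str) -> usize {
        let mut subscribers = self.subscribers.lock().unwrap_or_else(|e| e.into_inner());
        let mut delivered = 0;
        subscribers.retain(|(owner, tx)| {
            if owner.as_str() != user_id {
                return true; // other tenants never see this event
            }
            match tx.send(event.to_string()) {
                Ok(()) => {
                    delivered += 1;
                    true
                }
                Err(_) => false, // receiver dropped: forget the connection
            }
        });
        delivered
    }
}
```

With a global broadcast, an OAuth completion event for one tenant would reach every open connection. Filtering on the owner at send time keeps that decision in one place.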

* fix: address review findings — token hashing, broadcast scoping, error handling

Security fixes:
- Hash tokens with SHA-256 at construction time so authentication
  compares fixed-size 32-byte digests, eliminating length-oracle
  timing leaks
- Scope auth SSE broadcasts per-user in chat_auth_token_handler —
  AuthRequired/AuthCompleted events were leaking across tenants
- Propagate DB errors in restart handlers instead of silently
  swallowing via `if let Ok(Some(...))` pattern
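
The token-hashing fix relies on comparing fixed-size digests without an early exit. The sketch below shows only that comparison and assumes the 32-byte digests were already produced by a SHA-256 implementation (not shown, since the standard library has none). `Digest`, `digests_equal`, and `authenticate` are hypothetical names.

```rust
/// Fixed-size digest, for example the 32-byte output of SHA-256.
pub type Digest = [u8; 32];

/// Compares two digests without an early exit, so the number of loop
/// iterations does not depend on where the first differing byte is.
pub fn digests_equal(a: &Digest, b: &Digest) -> bool {
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Finds the user for a presented token digest. Every entry is checked,
/// so the scan does not stop early on a match.
pub fn authenticate<'a>(known: &'a [(Digest, String)], presented: &Digest) -> Option<&'a str> {
    let mut found = None;
    for (digest, user_id) in known {
        if digests_equal(digest, presented) {
            found = Some(user_id.as_str());
        }
    }
    found
}
```

Because both inputs are always 32 bytes, the comparison cannot reveal the length of a configured token. A compiler is free to optimize a hand-written loop like this, so production code would more likely lean on a vetted constant-time comparison crate.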

Code quality:
- Log SSE serialization failures instead of silently producing empty
  strings via unwrap_or_default()
- Remove dead `pub type AuthState = MultiAuthState` alias
- Replace `.unwrap()` with `Arc::clone(db)` in app.rs multi-tenant
  workspace setup (db is guaranteed Some in context, but unwrap
  violates project convention)
- Fix telegram setup test to inject UserIdentity into request
  extensions (handler now requires AuthenticatedUser)
- Add safety comments on test-only expect/unwrap calls for CI
- Apply cargo fmt to fix pre-existing formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
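
The error-propagation fix in this commit replaces a pattern that treats a database failure like a missing job. A minimal sketch of the difference, with hypothetical types (`Job`, `RestartError`, and the `lookup` callback are not from the PR, and the real handlers may map a non-owner to a different status):

```rust
#[derive(Debug, PartialEq)]
pub enum RestartError {
    Database(String),
    NotFound,
    Forbidden,
}

#[derive(Debug, PartialEq)]
pub struct Job {
    pub owner: String,
}

/// `lookup` represents the database call: `Err` means the query failed,
/// `Ok(None)` means there is no such job.
pub fn check_restart(
    lookup: impl FnOnce() -> Result<Option<Job>, String>,
    user_id: &str,
) -> Result<Job, RestartError> {
    // Writing `if let Ok(Some(job)) = lookup() { ... }` here would fall
    // through on `Err(_)` exactly as it does on `Ok(None)`, hiding outages.
    let job = lookup()
        .map_err(RestartError::Database)? // surface the DB failure
        .ok_or(RestartError::NotFound)?; // a missing job is a separate case
    if job.owner != user_id {
        return Err(RestartError::Forbidden);
    }
    Ok(job)
}
```

Keeping the three outcomes distinct lets the handler return a server error for a failed query instead of a misleading "not found".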

* fix: address review findings — unify workspace pool, fix SSE regression, cache job owners

- Unify WorkspacePool and PerUserWorkspaceResolver: WorkspacePool now
  implements WorkspaceResolver, eliminating duplicate per-user workspace
  construction logic. app.rs uses WorkspacePool directly.

- Fix sse_tx: None scheduler regression: change scheduler/worker SSE
  broadcasting from broadcast::Sender<SseEvent> to Arc<SseManager>,
  restoring SSE event delivery for scheduled agent jobs.

- Cache job owner in orchestrator: add job_owner_cache to
  OrchestratorState so job_event_handler avoids a DB round-trip on
  every event after the first per job.

- Deduplicate ext_user_id computation in main.rs.

- Remove unused _gateway_state variable.

- Fix pre-existing test: first_token() returns None in multi-user mode
  by design; align test assertion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix formatting in app.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract memory handlers back into handlers/memory.rs

Move memory API handlers out of server.rs into their own module,
consistent with how jobs, routines, and skills handlers are organized.
The resolve_workspace() helper moves with them since it is only used
by memory handlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ilblackdragon@gmail.com <ilblackdragon@gmail.com>
Labels

  • contributor: regular (2-5 merged PRs)
  • risk: high (Safety, secrets, auth, or critical infrastructure)
  • scope: agent (Agent core: agent loop, router, scheduler)
  • scope: channel/cli (TUI / CLI channel)
  • scope: channel/web (Web gateway channel)
  • scope: db/postgres (PostgreSQL backend)
  • scope: db (Database trait / abstraction)
  • scope: docs (Documentation)
  • scope: extensions (Extension management)
  • scope: orchestrator (Container orchestrator)
  • scope: tool/builtin (Built-in tools)
  • scope: tool (Tool infrastructure)
  • scope: worker (Container worker)
  • scope: workspace (Persistent memory / workspace)
  • size: XL (500+ changed lines)

4 participants