feat: multi-tenant auth with per-user workspace isolation#1118
feat: multi-tenant auth with per-user workspace isolation#1118ilblackdragon merged 8 commits intonearai:stagingfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive multi-tenancy capabilities to the gateway, enabling distinct user experiences with isolated data and resources. It focuses on architecturally preventing cross-user data pollution and enhancing security by ensuring that all interactions and data access are strictly tied to an authenticated user's identity. The changes lay the groundwork for a secure, multi-user personal AI assistant environment. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive multi-user support by refactoring authentication to use MultiAuthState and UserIdentity for token-to-user mapping. This enables per-user scoping for SSE events, WebSocket connections, and chat rate limiting. User ownership and access control are enforced across job management, routines, and settings handlers. A new layered memory system is implemented within the workspace, supporting multi-scope read operations (e.g., from shared and private layers) while maintaining write isolation to a primary user scope. This includes new API endpoints and tool parameters for layer-aware writes, with optional privacy classification and redirection for sensitive content. The review comments suggest adding warning logs for poisoned read and write locks in the PerUserRateLimiter to improve debugging capabilities.
| let map = match self.limiters.read() { | ||
| Ok(m) => m, | ||
| Err(e) => e.into_inner(), | ||
| }; |
There was a problem hiding this comment.
While recovering from a poisoned lock using into_inner() is a robust approach to prevent the server from crashing, it would be beneficial to log a warning when this occurs. This would help in diagnosing the root cause of the panic that led to the poisoned lock.
| let map = match self.limiters.read() { | |
| Ok(m) => m, | |
| Err(e) => e.into_inner(), | |
| }; | |
| let map = match self.limiters.read() { | |
| Ok(m) => m, | |
| Err(e) => { | |
| tracing::warn!("PerUserRateLimiter read lock poisoned. Recovering, but the original panic should be investigated."); | |
| e.into_inner() | |
| } | |
| }; |
| let mut map = match self.limiters.write() { | ||
| Ok(m) => m, | ||
| Err(e) => e.into_inner(), | ||
| }; |
There was a problem hiding this comment.
Similar to the read lock, it would be beneficial to log a warning when recovering from a poisoned write lock. This will aid in debugging the underlying panic.
| let mut map = match self.limiters.write() { | |
| Ok(m) => m, | |
| Err(e) => e.into_inner(), | |
| }; | |
| let mut map = match self.limiters.write() { | |
| Ok(m) => m, | |
| Err(e) => { | |
| tracing::warn!("PerUserRateLimiter write lock poisoned. Recovering, but the original panic should be investigated."); | |
| e.into_inner() | |
| } | |
| }; |
47dd931 to
0ade5ac
Compare
zmanian
left a comment
There was a problem hiding this comment.
This PR cannot be merged in its current state.
Blocker: Committed merge conflict markers
There are 37+ committed merge conflict markers throughout the source files. The code will not compile.
Scope
This PR stacks layered memory (#1112), multi-scope reads (#1117), and multi-tenant auth into a single 4800-line change across 40 files. Please:
- Fix all merge conflict markers
- Land #1112 and #1117 first as separate, reviewable PRs
- Rebase this PR on top of those merged changes so the diff shows only the multi-tenant auth work
Security note
The PR removes constant-time token comparison (subtle::ConstantTimeEq) in favor of HashMap::get(). This is a timing side-channel for bearer tokens. If this tradeoff is intentional for local-only use, add a startup warning when GATEWAY_HOST is not 127.0.0.1/localhost.
Also
- Verify list_sandbox_jobs_for_user has both postgres and libsql implementations
- No real CI ran (fork PR -- classify/scope only)
0ade5ac to
4ed95b2
Compare
|
Apologies for the state of the last push — the rebase was badly botched. The code parsed as valid Rust (so This version:
Re: trajectory-based testing — happy to discuss what that would look like for the multi-tenant feature set. Is there a pointer to the trajectory system you mentioned? |
|
Pushed additional fixes since last comment: Restored constant-time token comparison — Scoped extension secrets per-user — Found this by auditing every Also fixed Remaining known gap: Slack OAuth callback (~line 936-990 in Build, clippy (zero warnings), 3,070 lib tests pass. |
ba72739 to
95e1d89
Compare
95e1d89 to
a6aa03d
Compare
|
Rebased onto the updated #1117 (which now includes identity isolation and the Fixed chat handlers using Added 11 multi-user auth integration tests that exercise the full middleware chain with |
e42fe90 to
1fbb0dd
Compare
32c5a86 to
e188d42
Compare
…r handling Security fixes: - Hash tokens with SHA-256 at construction time so authentication compares fixed-size 32-byte digests, eliminating length-oracle timing leaks - Scope auth SSE broadcasts per-user in chat_auth_token_handler — AuthRequired/AuthCompleted events were leaking across tenants - Propagate DB errors in restart handlers instead of silently swallowing via `if let Ok(Some(...))` pattern Code quality: - Log SSE serialization failures instead of silently producing empty strings via unwrap_or_default() - Remove dead `pub type AuthState = MultiAuthState` alias - Replace `.unwrap()` with `Arc::clone(db)` in app.rs multi-tenant workspace setup (db is guaranteed Some in context, but unwrap violates project convention) - Fix telegram setup test to inject UserIdentity into request extensions (handler now requires AuthenticatedUser) - Add safety comments on test-only expect/unwrap calls for CI - Apply cargo fmt to fix pre-existing formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e188d42 to
6372421
Compare
Review fixes pushed (
|
| Comment | Status | Details |
|---|---|---|
Token read scopes ignored in WorkspacePool |
Already fixed in 6f4050d |
get_or_create() applies both global scopes (line 268) and per-token workspace_read_scopes (line 273) |
| Multi-tenant workspaces drop config (search/layers) | Already fixed in 6f4050d |
WorkspacePool::get_or_create() applies with_search_config, embeddings, global read scopes, and with_memory_layers before caching |
| Job summary leaks global counts | Already fixed in 6f4050d |
Uses sandbox_job_summary_for_user / agent_job_summary_for_user (lines 100, 114) |
| Agent prompts always return 404 | Fixed in this commit | Refactored agent ownership check from 3-way && chain to explicit match — DB errors now propagate as 500, missing jobs return 404, and the ownership check can't be silently bypassed |
| Agent restart missing ownership check | Fixed in this commit | Both sandbox and agent restart paths now use match with user_id ownership verification and proper DB error propagation |
@gemini-code-assist — lock poisoning logging
Already fixed in 6f4050d — tracing::warn! on both read and write lock poisoning recovery.
Additional fixes in this commit
- Auth SSE broadcasts scoped per-user —
chat_auth_token_handlerwas usingstate.sse.broadcast()(global) forAuthRequired/AuthCompletedevents, leaking auth flow across tenants. Changed tobroadcast_for_user(). - Restart handlers propagate DB errors — Both sandbox and agent restart paths used
if let Ok(Some(...))which silently swallowed DB errors as 404. Converted tomatchwith explicitErrarms returning 500. - SSE serialization logs failures —
serde_json::to_string(&event).unwrap_or_default()replaced withfilter_mapthat logs viatracing::warn!on failure. - Dead type alias removed —
pub type AuthState = MultiAuthStatewas unused. .unwrap()inapp.rsremoved — Replaced withArc::clone(db)since the variable is guaranteedSomein context but.unwrap()violates project convention.- Telegram setup test fixed — Injects
UserIdentityinto request extensions for handler requiringAuthenticatedUser. cargo fmtapplied — Pre-existing formatting issues fixed.
All CI checks pass (fmt, clippy x3, no-panics, cargo-deny, regression test enforcement).
ilblackdragon
left a comment
There was a problem hiding this comment.
Review: Approved with fixes applied
Solid, well-tested multi-tenant auth PR. I've applied fixes for the issues identified in the initial review:
Fixes applied
-
Unified duplicate workspace pool —
WorkspacePoolnow implementsWorkspaceResolver, eliminating the near-identicalPerUserWorkspaceResolverinmemory.rs.app.rsnow usesWorkspacePooldirectly for multi-tenant memory tools. -
Fixed
sse_tx: Nonescheduler regression — Changed the scheduler/worker chain frombroadcast::Sender<SseEvent>toArc<SseManager>. The scheduler now receives theSseManagerreference and passes it to workers, restoring SSE event broadcasting for scheduled agent jobs. -
Added job owner cache in orchestrator —
OrchestratorStatenow has ajob_owner_cache: Arc<RwLock<HashMap<Uuid, String>>>that cachesjob_id → user_idmappings. First event per job still hits the DB (cache miss), subsequent events use the cache. -
Deduplicated
ext_user_idinmain.rs— Extracted the repeated computation to a singlelet ext_user_id = ...before the two blocks that use it. -
Removed unused
_gateway_statevariable frommain.rs. -
Fixed pre-existing test bug —
multi_auth_state_first_token_returns_any_tokenwas calling.unwrap()onfirst_token()in multi-user mode, but the implementation intentionally returnsNonein multi-user mode. Fixed the test to assertis_none().
Verification
cargo clippy --all --benches --tests --examples --all-features— zero warningscargo test --test multi_tenant_integration— 39/39 passcargo test --test openai_compat_integration --test ws_gateway_integration— 27/27 pass- Orchestrator, memory, scheduler, job_monitor, and web multi-tenant unit tests all pass
Note: multi_tenant_system_prompt tests are expected to fail (documented as "expected to FAIL until the bug is fixed" in the test file header).
…on, cache job owners - Unify WorkspacePool and PerUserWorkspaceResolver: WorkspacePool now implements WorkspaceResolver, eliminating duplicate per-user workspace construction logic. app.rs uses WorkspacePool directly. - Fix sse_tx: None scheduler regression: change scheduler/worker SSE broadcasting from broadcast::Sender<SseEvent> to Arc<SseManager>, restoring SSE event delivery for scheduled agent jobs. - Cache job owner in orchestrator: add job_owner_cache to OrchestratorState so job_event_handler avoids a DB round-trip on every event after the first per job. - Deduplicate ext_user_id computation in main.rs. - Remove unused _gateway_state variable. - Fix pre-existing test: first_token() returns None in multi-user mode by design; align test assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move memory API handlers out of server.rs into their own module, consistent with how jobs, routines, and skills handlers are organized. The resolve_workspace() helper moves with them since it is only used by memory handlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: multi-tenant auth with per-user scoping Multi-user authentication and authorization for IronClaw gateway: - Token-based auth mapping tokens to user IDs via GATEWAY_USER_TOKENS - Per-user SSE broadcast scoping - Per-user rate limiting with poisoned lock recovery - Handler auth and ownership checks for jobs, settings, routines - Extension secrets scoped per-user - Chat handlers use authenticated identity - Reverse proxy deployment documentation - Comprehensive integration tests for auth, SSE, rate limiting, and job isolation * fix: scope memory tools per-user in multi-tenant mode Memory tools (search, write, read, tree) held a single workspace created at startup with GATEWAY_USER_ID. In multi-tenant mode, all users' tool calls searched the default user's scope. Add WorkspaceResolver trait that resolves workspaces per-request using JobContext.user_id. In single-user mode, returns the startup workspace. In multi-tenant mode (GATEWAY_USER_TOKENS configured), creates and caches per-user workspaces on demand. Includes regression tests for workspace resolution and user isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: comprehensive multi-tenant isolation audit Address all review findings from @serrrfirat plus 7 additional gaps found via full security audit: Reviewer findings (5): - WorkspacePool now applies search config, memory layers, embedding cache, identity read scopes, and global config scopes (was bare) - jobs_summary_handler uses per-user queries instead of global counters - jobs_prompt_handler restructured to not 404 agent jobs + ownership check - jobs_restart_handler agent branch now verifies user ownership - agent_job_summary_for_user added to Database trait + both backends Audit findings (7): - Delete dead handlers/memory.rs (stale copies with no auth) - Add AuthenticatedUser to logs_events, logs_level_get, logs_level_set - Add AuthenticatedUser to extensions_tools_handler, gateway_status_handler - Add auth + ownership checks to all 6 routines handlers - Add auth to all 4 skills handlers with audit logging on mutations - Scope extension setup SSE broadcast to user (broadcast_for_user) - Fix pre-existing test compilation errors in extensions/manager.rs 17 new multi-tenant isolation tests covering: - WorkspacePool config propagation and scope merging - Jobs handler per-user isolation (summary, restart, prompt, cancel) - Routines handler auth enforcement and cross-user rejection - Auth middleware enforcement on logs, skills, status endpoints Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: second-pass multi-tenant audit — scope SSE broadcasts, DB queries, dead handlers Second audit pass applying learned patterns across the codebase: - OAuth callback SSE broadcasts now use broadcast_for_user (lines 773, 912) - jobs_list_handler uses list_agent_jobs_for_user instead of fetching all users' jobs and filtering in Rust - list_agent_jobs_for_user added to Database trait + postgres + libsql - Dead handler files (extensions.rs, static_files.rs) hardened with AuthenticatedUser to prevent auth regression if migrated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address review findings — token hashing, broadcast scoping, error handling Security fixes: - Hash tokens with SHA-256 at construction time so authentication compares fixed-size 32-byte digests, eliminating length-oracle timing leaks - Scope auth SSE broadcasts per-user in chat_auth_token_handler — AuthRequired/AuthCompleted events were leaking across tenants - Propagate DB errors in restart handlers instead of silently swallowing via `if let Ok(Some(...))` pattern Code quality: - Log SSE serialization failures instead of silently producing empty strings via unwrap_or_default() - Remove dead `pub type AuthState = MultiAuthState` alias - Replace `.unwrap()` with `Arc::clone(db)` in app.rs multi-tenant workspace setup (db is guaranteed Some in context, but unwrap violates project convention) - Fix telegram setup test to inject UserIdentity into request extensions (handler now requires AuthenticatedUser) - Add safety comments on test-only expect/unwrap calls for CI - Apply cargo fmt to fix pre-existing formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address review findings — unify workspace pool, fix SSE regression, cache job owners - Unify WorkspacePool and PerUserWorkspaceResolver: WorkspacePool now implements WorkspaceResolver, eliminating duplicate per-user workspace construction logic. app.rs uses WorkspacePool directly. - Fix sse_tx: None scheduler regression: change scheduler/worker SSE broadcasting from broadcast::Sender<SseEvent> to Arc<SseManager>, restoring SSE event delivery for scheduled agent jobs. - Cache job owner in orchestrator: add job_owner_cache to OrchestratorState so job_event_handler avoids a DB round-trip on every event after the first per job. - Deduplicate ext_user_id computation in main.rs. - Remove unused _gateway_state variable. - Fix pre-existing test: first_token() returns None in multi-user mode by design; align test assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: fix formatting in app.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: extract memory handlers back into handlers/memory.rs Move memory API handlers out of server.rs into their own module, consistent with how jobs, routines, and skills handlers are organized. The resolve_workspace() helper moves with them since it is only used by memory handlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: ilblackdragon@gmail.com <ilblackdragon@gmail.com>
Rebased refile of #351 (closed in backlog triage). Previously reviewed by @serrrfirat and @zmanian — all review feedback was addressed. Rebased onto
stagingwith no functional changes.This addresses the same class of vulnerability as #760 (thread_id context pollution) architecturally — when every request is scoped to an authenticated user_id via
GATEWAY_USER_TOKENS, cross-user pollution can't occur regardless of the attack vector.This PR includes 39 HTTP-level integration tests for auth, isolation, and ownership checks — including DB-backed job ownership tests using in-memory libSQL. These don't map naturally to the trajectory format (auth happens before the agent loop), but happy to discuss what multi-tenant trajectory coverage should look like.
Broader context — I'm building a multi-user personal AI assistant on IronClaw and want to make sure I'm contributing in a useful direction. Would be great to sync on priorities if there's a good channel for that.
Depends on #1117.
Original PR: #351
Summary
Adds token-based multi-user authentication to the web gateway, giving each
user a fully isolated workspace with independent memory layers and
cross-scope read access. Builds on the layered memory (#1112) and
multi-scope reads (#1117) to deliver end-to-end multi-tenant workspace
isolation.
How it works
Single-user mode is the default and behaves identically to today. Multi-user
mode activates only when
GATEWAY_USER_TOKENSis set:Each user gets:
user_idat the DB levelKey components
MultiAuthStateUserIdentity(user_id, read scopes, memory layers)AuthenticatedUserWorkspacePoolPerUserRateLimiterScopedEventuser_id; subscribers filter to their own eventsSecurity hardening
After the initial implementation, three rounds of AI-assisted code review
identified shared resources that become cross-tenant data leaks once
multiple users share a gateway. These were pre-existing on upstream but
harmless in single-user mode — multi-tenancy is what makes them
exploitable.
Fixed in this PR
ScopedEventenvelope; subscribers filter by user_idPerUserRateLimiterwith independent windows per userAuthenticatedUser+routine.user_idownership checkdefault_user_idfor rate limitingAuthenticatedUser, useuser.user_idis_local_origin()helper with bracket handlingconversation_belongs_to_userreturned false on DB errorsandbox_job_belongs_to_userreturned false on DB errorPerUserRateLimiterpanicked on lock poisoninginto_inner()recoveryjobs_detail_handlerswallowed DB errors as 404jobs_cancel_handlerswallowed DB errors as 404get_sandbox_job_modesilently defaulted to Worker on DB errorchat_threads_handlersilently dropped DB errorssend_statussilently broadcast globally when user_id missingUserTokenConfigempty user_idKnown limitations (documented, need broader changes)
Changes (this PR only)
auth.rsMultiAuthState,UserIdentity,AuthenticatedUserextractor, case-insensitive Bearer parsingserver.rsWorkspacePool,PerUserRateLimiter,resolve_workspace(), handler auth/ownership,is_local_origin()sse.rsScopedEventenvelope,broadcast_for_user(), user-scopedsubscribe()/subscribe_raw()ws.rsmod.rsnew_multi_auth(),with_workspace_pool(), scoped channel broadcastsopenai_compat.rsconfig/channels.rsGATEWAY_USER_TOKENSparsing,UserTokenConfig, validationextensions/manager.rsmain.rstest_helpers.rsTestGatewayBuilder::start_multi()for multi-user server teststests/multi_tenant_integration.rstests/openai_compat_integration.rsGatewayStatefieldstests/ws_gateway_integration.rsGatewayStatefieldstests/support/gateway_workflow_harness.rsGatewayStatefieldsIntegration test coverage (39 tests)
Unit-level (22): MultiAuthState token→identity mapping, query token auth restrictions, per-user rate limiting isolation, SSE event scoping (user A can't see user B's events), edge cases (empty token, prefix match, first_token/first_identity).
Handler-level without DB (12): Full HTTP stack through real Axum server — protected endpoints reject unauthenticated/unknown tokens, public health endpoint accessible without auth, chat send flows through auth to agent channel, query token accepted on SSE but rejected on non-SSE endpoints, WebSocket per-user event isolation (Alice's scoped events not visible to Bob's WS connection).
Handler-level with DB (5): In-memory libSQL — Alice lists jobs and sees only hers, Bob requests Alice's job by ID and gets 404 (not 403, preventing enumeration), Alice can see her own job detail, Bob lists jobs and sees only his, nonexistent job returns 404.
Test plan
cargo clippy --all --benches --tests --examples— zero warningscargo test --lib— 3025 tests passcargo test --test multi_tenant_integration— 39 tests passGATEWAY_USER_TOKENS→ identical behavior)