refactor(rust): add scaffolding for mcpgateway_rust initial components #3029
Closed
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* optimize response_cache_by_prompt lookup with inverted index
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
* fix type hint
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
* flake8 fixes
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
* test: add unit tests for response_cache_by_prompt inverted index
Add comprehensive test coverage for the inverted index optimization:
- Tokenization and vectorization functions
- Basic cache store and hit functionality
- Inverted index population and candidate filtering
- Eviction and index rebuild scenarios
- Max entries cap with index consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
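To illustrate the inverted-index idea behind the cache lookup commits above, here is a minimal sketch. It is not the plugin's code: the `PromptCache` class, its method names, and the use of Jaccard similarity (standing in for whatever vector similarity the plugin computes) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Optional


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


class PromptCache:
    """Hypothetical prompt cache with an inverted index (token -> entry ids)."""

    def __init__(self) -> None:
        self.entries = {}              # entry id -> (token set, response)
        self.index = defaultdict(set)  # token -> ids of entries containing it
        self._next_id = 0

    def store(self, prompt: str, response: str) -> int:
        entry_id = self._next_id
        self._next_id += 1
        tokens = tokenize(prompt)
        self.entries[entry_id] = (tokens, response)
        for tok in tokens:
            self.index[tok].add(entry_id)
        return entry_id

    def candidates(self, prompt: str) -> set:
        """Return only the entries that share at least one token with the prompt."""
        found = set()
        for tok in tokenize(prompt):
            found |= self.index.get(tok, set())
        return found

    def evict(self, entry_id: int) -> None:
        """Remove an entry and keep the index consistent with it."""
        tokens, _ = self.entries.pop(entry_id)
        for tok in tokens:
            self.index[tok].discard(entry_id)
            if not self.index[tok]:
                del self.index[tok]

    def lookup(self, prompt: str, threshold: float = 0.8) -> Optional[str]:
        """Score candidates only (Jaccard similarity, an assumed stand-in)."""
        query = tokenize(prompt)
        best_score, best_response = 0.0, None
        for entry_id in self.candidates(prompt):
            tokens, response = self.entries[entry_id]
            score = len(query & tokens) / len(query | tokens)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= threshold else None
```

The point of the index is that `lookup` scores only entries sharing a token with the query instead of scanning every cached prompt, and that `evict` has to update the index as well, which is what the eviction and consistency tests in the commit cover.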
* feat: Add Gateway permission constants Add GATEWAYS_CREATE, GATEWAYS_READ, GATEWAYS_UPDATE, and GATEWAYS_DELETE permission constants to the Permissions class for consistency with other resource types (tools, resources, prompts, servers). Note: The original PR #2186 attempted to fix issue #2185 by modifying the visibility query logic, but that change was incorrect. The team filter should only show resources BELONGING to the filtered team, not all public resources globally. See todo/rbac.md for documentation. Issue #2185 needs further investigation - the reported bug may have a different root cause. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat: Add gateway permission patterns to token scoping middleware Add gateway routes to token scoping middleware for consistent permission enforcement: - Add gateway pattern to _RESOURCE_PATTERNS for ID extraction - Add gateway CRUD patterns to _PERMISSION_PATTERNS: - POST /gateways (exact) -> gateways.create - POST /gateways/{id}/... (sub-resources) -> gateways.update - PUT/DELETE -> gateways.update/delete - Add gateway handling in _check_resource_team_ownership: - Public: accessible by all - Team: accessible by team members - Private: owner-only access (per RBAC doc) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: Enforce owner-only access for private visibility across all resources Per RBAC doc, private visibility means "owner only" - not "team members". Fixed private visibility checks for all resource types to validate owner_email == requester instead of team membership: - Servers - Tools - Resources - Prompts - Gateways (already correct from previous commit) This aligns token scoping middleware with the documented RBAC model. 
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: Add tests for gateway permissions and visibility RBAC Add unit tests covering: - Gateway permission patterns (POST create vs POST update sub-resources) - Private visibility enforces owner-only access - Team visibility allows team members only - Public visibility allows all authenticated users These tests validate the RBAC fixes in token scoping middleware. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
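The visibility rules these commits describe (public: everyone, team: team members, private: owner only) can be sketched as a single check. This is a simplified illustration, not the token scoping middleware itself; the `Resource` dataclass and `can_access` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Resource:
    """Hypothetical stand-in for a server, tool, resource, prompt, or gateway."""
    visibility: str                 # "public", "team", or "private"
    owner_email: str
    team_id: Optional[str] = None


def can_access(resource: Resource, user_email: str, user_teams: Set[str]) -> bool:
    """Apply the visibility rules described in the commit messages."""
    if resource.visibility == "public":
        return True
    if resource.visibility == "team":
        return resource.team_id is not None and resource.team_id in user_teams
    if resource.visibility == "private":
        # Owner only: team membership is deliberately not consulted here.
        return resource.owner_email == user_email
    # Unknown visibility values are denied (an assumption of this sketch).
    return False
```

The private branch is the part the fix changed: a teammate of the owner is rejected even though both belong to the resource's team.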
* feat-2187: add additional default roles while bootstrap
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fix lint issues
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fixing review comments
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fixing review comments
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: test fix
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* fix: Improve bootstrap roles validation and documentation
Fixes identified by code review:
1. Path resolution: Fixed parent.parent.parent -> parent.parent to correctly
resolve project root from mcpgateway/bootstrap_db.py
2. JSON validation: Added validation that loaded JSON is a list of dicts with
required keys (name, scope, permissions). Invalid entries are skipped with
warnings instead of crashing bootstrap.
3. Improved logging: Log all attempted paths when file not found
Added tests:
- test_bootstrap_roles_with_dict_instead_of_list: Validates error when JSON is
a dict instead of array
- test_bootstrap_roles_with_missing_required_keys: Validates warning when roles
are missing required fields
Added documentation:
- docs/docs/manage/rbac.md: New "Bootstrap Custom Roles" section with
configuration examples for Docker Compose and Kubernetes
- docs/docs/architecture/adr/036-bootstrap-custom-roles.md: ADR documenting
the feature design, error handling, and security considerations
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: Make description and is_system_role optional for bootstrap roles
ChatGPT review identified that description and is_system_role were accessed
unconditionally via role_def["key"], causing KeyError for minimal roles.
Fix:
- Use role_def.get("description", "") with empty string default
- Use role_def.get("is_system_role", False) with False default
Added test:
- test_bootstrap_roles_with_minimal_valid_role: Verifies a role with only
required fields (name, scope, permissions) is created successfully with
correct defaults for optional fields
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Nithin Katta <Nithin.Katta@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
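The validation rules from the bootstrap role commits above can be sketched roughly as follows. The function name `load_role_definitions` is hypothetical, and raising `ValueError` for a non-list document is an assumption; the commits only say that this case is an error and that invalid entries are skipped with warnings.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

REQUIRED_KEYS = ("name", "scope", "permissions")


def load_role_definitions(raw: str) -> List[Dict[str, Any]]:
    """Parse role JSON, skipping invalid entries instead of crashing."""
    data = json.loads(raw)
    if not isinstance(data, list):
        # Assumption: a top-level dict is rejected outright.
        raise ValueError("expected a JSON array of role objects")

    roles = []
    for position, role_def in enumerate(data):
        if not isinstance(role_def, dict):
            logger.warning("Skipping role #%d: not an object", position)
            continue
        missing = [key for key in REQUIRED_KEYS if key not in role_def]
        if missing:
            logger.warning("Skipping role #%d: missing %s", position, missing)
            continue
        roles.append(
            {
                "name": role_def["name"],
                "scope": role_def["scope"],
                "permissions": role_def["permissions"],
                # Optional fields use .get() with defaults to avoid KeyError.
                "description": role_def.get("description", ""),
                "is_system_role": role_def.get("is_system_role", False),
            }
        )
    return roles
```

A role carrying only `name`, `scope`, and `permissions` comes back with an empty description and `is_system_role` set to `False`, matching the defaults named in the last commit.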
…y blockers (#2394) * Remove last 2 security issues from Sonarqube Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> * Remove 5 of 8 blocker maintainability issues Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> * Correct linting errors Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> --------- Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
…ad (#2157) * perf(crypto): offload Argon2/Fernet to threadpool via asyncio.to_thread Add async wrappers (hash_password_async, verify_password_async, encrypt_secret_async, decrypt_secret_async) and update all call sites to use them, preventing event loop blocking during CPU-intensive crypto operations. Closes #1836 Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com> * fix(tests): update tests for async crypto operations Update test mocks to use async versions of password service and encryption service methods (hash_password_async, verify_password_async, encrypt_secret_async) following the changes in the crypto offload PR. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sso): add missing await for async create/update provider methods The crypto offload PR made SSOService.create_provider() and update_provider() async, but forgot to update call sites: - mcpgateway/routers/sso.py: add await in admin endpoints - mcpgateway/utils/sso_bootstrap.py: convert to async, add awaits - mcpgateway/main.py: make attempt_to_bootstrap_sso_providers async Without this fix, the router endpoints would return coroutine objects instead of provider objects, causing runtime errors (500) when accessing provider.id. The bootstrap would silently skip provider creation with "coroutine was never awaited" warnings. 
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(crypto): add tests for async crypto wrappers and SSO bootstrap Add test coverage for the async crypto operations introduced by the crypto offload PR: - test_async_crypto_wrappers.py: Tests for hash_password_async, verify_password_async, encrypt_secret_async, decrypt_secret_async including roundtrip verification and sync/async compatibility - test_sso_bootstrap.py: Tests for async SSO bootstrap ensuring create_provider and update_provider are properly awaited Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
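As a rough illustration of the offloading pattern in the crypto commits above, the sketch below wraps a blocking hash in `asyncio.to_thread`. It uses the standard library's PBKDF2 as a stand-in for Argon2, and the function signatures are assumptions; only the `*_async` naming comes from the commit messages.

```python
import asyncio
import hashlib
import hmac


def hash_password(password: str, salt: bytes) -> bytes:
    """Blocking, CPU-heavy hash (PBKDF2 here, standing in for Argon2)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


async def hash_password_async(password: str, salt: bytes) -> bytes:
    """Run the blocking hash in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(hash_password, password, salt)


async def verify_password_async(password: str, salt: bytes, expected: bytes) -> bool:
    """Hash off the event loop, then compare in constant time."""
    digest = await hash_password_async(password, salt)
    return hmac.compare_digest(digest, expected)
```

Every caller of an `*_async` wrapper has to `await` it. Forgetting the `await` hands back a coroutine object instead of the result, which is the class of bug the follow-up SSO fix addresses.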
* chore-2193: add Rocky Linux setup script Add setup script for Rocky Linux and RHEL-compatible distributions. Adapts the Ubuntu setup script with the following changes: - Use dnf package manager instead of apt - Docker CE installation via RHEL repository - OS detection for Rocky, RHEL, CentOS, and AlmaLinux - Support for x86_64 and aarch64 architectures Closes #2193 Signed-off-by: Jonathan Springer <jps@s390x.com> * chore-2193: add Docker login check before compose-up Check if Docker is logged in before running docker-compose to avoid image pull failures. If not logged in, prompt user with options: - Interactive login (username/password prompts) - Username with password from stdin (for automation) - Skip login (continue without authentication) Supports custom registry URLs for non-Docker Hub registries. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: add non-interactive mode and git repo check to setup scripts Apply to both Rocky and Ubuntu setup scripts: - Add -y/--yes flag for fully non-interactive operation - Check for .git directory before running git pull - Fail fast with clear error if directory exists but isn't a git repo - Auto-confirm prompts in non-interactive mode - Exit with error on unsupported OS in non-interactive mode Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker. Changes: - Add _respond_tasks dict to track respond tasks by session_id - Cancel respond tasks explicitly before session cleanup in remove_session() - Cancel all respond tasks during shutdown() - Pass disconnect callback to SSE transport for defensive cleanup - Convert database backend from fire-and-forget to structured concurrency The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect, and awaited to completion, preventing orphaned tasks from spinning the event loop. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: additional fixes for CPU spin loop after SSE disconnect Follow-up fixes based on testing and review: 1. Cancellation timeout escalation (Finding 1): - _cancel_respond_task() now escalates on timeout by calling transport.disconnect() - Retries cancellation after escalation - Always removes task from tracking to prevent buildup 2. Redis respond loop exit path (Finding 2): - Changed from infinite pubsub.listen() to timeout-based get_message() polling - Added session existence check - loop exits if session removed - Allows loop to exit even without cancellation 3. Generator finally block cleanup (Finding 3): - Added on_disconnect_callback() in event_generator() finally block - Covers: CancelledError, GeneratorExit, exceptions, and normal completion - Idempotent - safe if callback already ran from on_client_close 4. 
Added load-test-spin-detector make target: - Spike/drop pattern to stress test session cleanup - Docker stats monitoring at each phase - Color-coded output with pass/fail indicators - Log file output to /tmp Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: fix race condition in sse_endpoint and add stuck task tracking Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task was created AFTER create_sse_response(). If client disconnected during response setup, the disconnect callback ran before the task existed, leaving it orphaned. Now matches utility_sse_endpoint ordering: 1. Compute user_with_token 2. Create and register respond task 3. Call create_sse_response() Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't be cancelled after escalation. Previously these were dropped from tracking entirely, losing visibility. Now they're moved to _stuck_tasks for monitoring and final cleanup during shutdown(). Updated tests to verify escalation behavior. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add SSE failure cleanup, stuck task reaper, and full load test Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response() fails. Added try/except around create_sse_response() in both sse_endpoint and utility_sse_endpoint - on failure, calls remove_session() to clean up the task and session before re-raising. Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to: - Remove completed tasks from _stuck_tasks - Retry cancellation for still-stuck tasks - Prevent memory leaks from tasks that eventually complete Finding 3 (LOW): Added test for escalation path with fake transport to verify transport.disconnect() is called during escalation. Also added tests for the stuck task reaper lifecycle. 
Also updated load-test-spin-detector to be a full-featured test matching load-test-ui with JWT auth, all user classes, entity ID fetching, and the same 4000-user baseline. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: improve load-test-spin-detector output and reduce cycle sizes - Reduce logging level to WARNING to suppress noisy worker messages - Only run entity fetching and cleanup on master/standalone nodes - Reduce cycle sizes from 4000 to 1000 peak users for faster iteration - Update banner to reflect new cycle pattern (500 -> 750 -> 1000) - Remove verbose JWT token generation log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: address remaining CPU spin loop findings Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE endpoints. In Python 3.8+, CancelledError inherits from BaseException, not Exception, so the previous except block wouldn't catch it. Now cleanup runs even when requests are cancelled during SSE handshake. Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None to prevent tight loop. The loop now has guaranteed minimum sleep even when Redis returns immediately in certain states. Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops to exit early. remove_session() now marks the session as closing BEFORE attempting task cancellation, so the respond loop (Redis and DB backends) can exit immediately without waiting for the full cancellation timeout. Finding 4 (LOW): Already addressed in previous commit with test test_cancel_respond_task_escalation_calls_transport_disconnect. 
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: make load-test-spin-detector run unlimited cycles - Cycles now repeat indefinitely instead of stopping after 5 - Fixed log file path to /tmp/spin_detector.log for easy monitoring - Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts - Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)") - Banner shows monitoring command: tail -f /tmp/spin_detector.log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: remove redundant asyncio.CancelledError handlers CancelledError inherits from BaseException in Python 3.8+, so it won't be caught by 'except Exception' handlers. The explicit handlers were unnecessary and triggered pylint W0706 (try-except-raise). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add sleep on non-message Redis pubsub types to prevent spin Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(pubsub): replace blocking listen() with timeout-based get_message() The blocking `async for message in pubsub.listen()` pattern doesn't respond to asyncio cancellation properly. When anyio's cancel scope tries to cancel tasks using this pattern, the tasks don't respond because the async iterator is blocked waiting for Redis messages. This causes anyio's `_deliver_cancellation` to continuously reschedule itself with `call_soon()`, creating a CPU spin loop that consumes 100% CPU per affected worker. 
Changed to timeout-based polling pattern: - Use `get_message(timeout=1.0)` with `asyncio.wait_for()` - Loop allows cancellation check every ~1 second - Added sleep on None/non-message responses to prevent edge case spins Files fixed: - mcpgateway/services/cancellation_service.py - mcpgateway/services/event_service.py Closes #2360 (partial - additional spin sources may exist) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops The MCP session/transport __aexit__ methods can block indefinitely when internal tasks don't respond to cancellation. This causes anyio's _deliver_cancellation to spin in a tight loop, consuming ~800% CPU. Root cause: When calling session.__aexit__() or transport.__aexit__(), they attempt to cancel internal tasks (like post_writer waiting on memory streams). If these tasks don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop. Changes: - Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py - Wrap all __aexit__ calls in asyncio.wait_for() with timeout - Add timeout to pubsub cleanup in session_registry.py and registry_cache.py - Add timeout to streamable HTTP context cleanup in translate.py This is a continuation of the fix for issue #2360. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(config): make session cleanup timeout configurable Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to control how long cleanup operations wait for session/transport __aexit__ calls to complete. Clarification: This timeout does NOT affect tool execution time (which uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions to prevent CPU spin loops when internal tasks don't respond to cancel. 
Changes: - Add mcp_session_pool_cleanup_timeout to config.py - Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs - Add to charts/mcp-stack/values.yaml - Update mcp_session_pool.py to use _get_cleanup_timeout() helper - Update session_registry.py and registry_cache.py to use config - Update translate.py to use config with fallback When to adjust: - Increase if you see frequent "cleanup timed out" warnings in logs - Decrease for faster shutdown (at risk of resource leaks) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add deadline to cancel scope to prevent CPU spin loop Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at 100% CPU when SSE task group tasks don't respond to cancellation. Root cause: When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps rescheduling itself in a tight loop. Fix: Override EventSourceResponse.__call__ to set a deadline on the cancel scope when cancellation starts. This ensures that if tasks don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds), the scope times out instead of spinning indefinitely. References: - agronholm/anyio#695 - anthropics/claude-agent-sdk-python#378 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(translate): use patched EventSourceResponse to prevent CPU spin translate.py was importing EventSourceResponse directly from sse_starlette, bypassing the patched version in sse_transport.py that prevents the anyio _deliver_cancellation CPU spin loop (anyio#695). This change ensures all SSE connections in the translate module (stdio-to-SSE bridge) also benefit from the cancel scope deadline fix. 
Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(cleanup): reduce cleanup timeouts from 5s to 0.5s With many concurrent connections (691 TCP sockets observed), each cancelled SSE task group spinning for up to 5 seconds caused sustained high CPU usage. Reducing the timeout to 0.5s minimizes CPU waste during spin loops while still allowing normal cleanup to complete. The cleanup timeout only affects cleanup of cancelled/released connections, not normal operation or tool execution time. Changes: - SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds - mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds - Updated .env.example and charts/mcp-stack/values.yaml Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor(cleanup): make SSE cleanup timeout configurable with safe defaults - Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s) - Make sse_transport.py read timeout from config via lazy loader - Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default - Override both to 0.5s in docker-compose.yml for testing The 5.0s default is safe for production. The 0.5s override in docker-compose.yml allows testing aggressive cleanup to verify it doesn't affect normal operation. Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(gunicorn): reduce max_requests to recycle stuck workers The MCP SDK's internal anyio task groups don't respond to cancellation properly, causing CPU spin loops in _deliver_cancellation. This spin happens inside the MCP SDK (streamablehttp_client, sse_client) which we cannot patch. Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are recycled frequently, cleaning up any accumulated stuck task groups. Root cause chain observed: 1. PostgreSQL idle transaction timeout 2. Gateway state change failures 3. SSE connections terminated 4. MCP SDK task groups spin (anyio#695) This is a workaround until the MCP SDK properly handles cancellation. 
Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(anyio): monkey-patch _deliver_cancellation to prevent CPU spin Root cause: anyio's _deliver_cancellation has no iteration limit. When tasks don't respond to CancelledError, it schedules call_soon() callbacks indefinitely, causing 100% CPU spin (anyio#695). Solution: - Monkey-patch CancelScope._deliver_cancellation to track iterations - Give up after 100 iterations and log warning - Clear _cancel_handle to stop further call_soon() callbacks Also switched from asyncio.wait_for() to anyio.move_on_after() for MCP session cleanup, which better propagates cancellation through anyio's cancel scope system. Trade-off: If cancellation gives up after 100 iterations, some tasks may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000 worker recycling will eventually clean up orphaned tasks. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor(anyio): make _deliver_cancellation patch optional and disabled by default The anyio monkey-patch is now feature-flagged and disabled by default: - ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default) - ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100 This allows testing performance with and without the patch, and easy rollback if upstream anyio/MCP SDK fixes the issue. Added: - Config settings for enabling/disabling the patch - apply_anyio_cancel_delivery_patch() function for explicit control - remove_anyio_cancel_delivery_patch() to restore original behavior - Documentation in .env.example and docker-compose.yml To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add comprehensive CPU spin loop mitigation documentation (#2360) Add multi-layered documentation for CPU spin loop mitigation settings across all configuration files. This ensures operators understand and can tune the workarounds for anyio#695. 
Changes: - .env.example: Add Layer 1/2/3 headers with cross-references to docs and issue #2360, document all 6 mitigation variables - README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers, configuration tables, and tuning tips - docker-compose.yml: Consolidate all mitigation variables into one section with SSE protection (Layer 1), cleanup timeouts (Layer 2), and experimental anyio patch (Layer 3) - charts/mcp-stack/values.yaml: Add comprehensive mitigation section with layer documentation and cross-references - docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide with root cause analysis, 4-layer defense diagram, configuration tables, diagnostic commands, and tuning recommendations - docs/docs/.pages: Add Operations section to navigation - docs/docs/operations/.pages: Add nav for operations docs Mitigation variables documented: - Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX - Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT - Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS Related: #2360, anyio#695, claude-agent-sdk#378 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(loadtest): aggressive spin detector with configurable timings Update spin detector load test for faster issue reproduction: - Increase user counts: 4000 → 4000 → 10000 pattern - Fast spawn rate: 1000 users/s - Shorter wait times: 0.01-0.1s between requests - Reduced connection timeouts: 5s (fail fast) Related: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * compose mitigation Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * load test Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Defaults Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Defaults Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstring to cancel_on_finish for interrogate coverage Add docstring to nested cancel_on_finish function in 
EventSourceResponse.__call__ to achieve 100% interrogate coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
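The timeout-based polling and closing-session check described in the commits above can be sketched without Redis. In this illustration an `asyncio.Queue` stands in for the pubsub channel, and `respond_loop`, `closing_sessions`, and `demo` are hypothetical names rather than the gateway's actual code.

```python
import asyncio
from typing import List, Set


async def respond_loop(
    queue: asyncio.Queue,
    session_id: str,
    closing_sessions: Set[str],
    handled: List[str],
    poll_timeout: float = 0.05,
) -> None:
    """Poll with a timeout so the loop can notice shutdown between messages."""
    while session_id not in closing_sessions:
        try:
            # Bounded wait instead of blocking indefinitely on the next message.
            message = await asyncio.wait_for(queue.get(), timeout=poll_timeout)
        except asyncio.TimeoutError:
            continue  # nothing arrived; re-check the closing flag
        handled.append(message)


async def demo() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    closing: Set[str] = set()
    handled: List[str] = []
    task = asyncio.create_task(respond_loop(queue, "s1", closing, handled))
    await queue.put("ping")
    await asyncio.sleep(0.1)   # let the loop process the message
    closing.add("s1")          # mark the session as closing first
    await asyncio.wait_for(task, timeout=1.0)  # loop exits without cancellation
    return handled
```

Because the loop wakes at least once per `poll_timeout`, it can exit on its own once the session is marked as closing, rather than depending solely on cancellation being delivered to a blocked iterator.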
#2507) Updates unique constraints for Resources and Prompts tables to support Gateway-level namespacing. Previously, these entities enforced uniqueness globally per Team/Owner (team_id, owner_email, uri/name). This prevented users from registering the same Gateway multiple times with different names. Changes: - Add gateway_id to unique constraints for resources and prompts - Add partial unique indexes for local items (where gateway_id IS NULL) - Make migration idempotent with proper existence checks Closes #2352 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…2517) * fix(transport): support mixed content types from MCP server tool call response Closes #2512 This fix addresses tool invocation failures for tools that return complex content types (like ResourceLink, ImageContent, AudioContent) or contain Pydantic-specific types like AnyUrl. Root causes fixed: 1. tool_service.py: Usage of model_dump() without mode='json' preserved pydantic.AnyUrl objects, violating internal model's str type constraints. 2. streamablehttp_transport.py: Code blindly assumed types.TextContent, accessing .text on every item, which crashed for ResourceLink or ImageContent. Changes: - Updated tool_service.py to use model_dump(by_alias=True, mode='json'), forcing conversion of AnyUrl to JSON-compatible strings. - Refactored streamablehttp_transport.py to inspect content.type and correctly map to proper MCP SDK types (TextContent, ImageContent, AudioContent, ResourceLink, EmbeddedResource) ensuring full protocol compatibility. - Updated return type annotation to include all MCP content types. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(transport): preserve metadata in mixed content type conversion Addresses dropped metadata fields identified in PR #2517 review: - Preserve annotations and _meta for TextContent, ImageContent, AudioContent - Preserve size and _meta for ResourceLink (critical for file metadata) - Handle EmbeddedResource via model_validate Add comprehensive regression tests for: - Mixed content types (text, image, audio, resource_link, embedded) - Metadata preservation (annotations, _meta, size) - Unknown content type fallback - Missing optional metadata handling Closes #2512 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(transport): convert gateway Annotations to dict for MCP SDK compatibility mcpgateway.common.models.Annotations is a different Pydantic class from mcp.types.Annotations. 
Passing gateway Annotations directly to MCP SDK types causes ValidationError at runtime when real MCP responses include annotations. Fix: - Add _convert_annotations() helper to convert gateway Annotations to dict - Add _convert_meta() helper for consistent meta handling - Apply conversion to all content types (text, image, audio, resource_link) Add regression tests using actual gateway model types: - test_call_tool_with_gateway_model_annotations - test_call_tool_with_gateway_model_image_annotations These tests use mcpgateway.common.models.TextContent/ImageContent with mcpgateway.common.models.Annotations to verify the conversion works. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(tool_service): add AnyUrl serialization tests for mode='json' fix Add explicit tests for the AnyUrl serialization fix (Issue #2512 root cause): - test_anyurl_serialization_without_mode_json - demonstrates the problem - test_anyurl_serialization_with_mode_json - verifies the fix - test_resource_link_anyurl_serialization - ResourceLink uri field - test_tool_result_with_resource_link_serialization - ToolResult with ResourceLink - test_mixed_content_with_anyurl_serialization - mixed content types These tests verify that mode='json' in model_dump() correctly serializes AnyUrl objects to strings, preventing validation errors when content is passed to MCP SDK types. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs(transport): add docstrings to _convert_annotations and _convert_meta Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs(transport): add Args/Returns to helper function docstrings Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
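The type-based dispatch described in the transport fix above can be sketched with plain dataclasses. These classes are simplified stand-ins, not the MCP SDK types, and falling back to text for unknown types is an assumption; the commits mention an unknown-type fallback without specifying it.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TextContent:
    text: str
    annotations: Optional[dict] = None
    meta: Optional[dict] = None


@dataclass
class ImageContent:
    data: str
    mime_type: str
    annotations: Optional[dict] = None
    meta: Optional[dict] = None


@dataclass
class ResourceLink:
    uri: str
    name: str
    size: Optional[int] = None
    meta: Optional[dict] = None


def convert_content(item: Dict[str, Any]):
    """Map a JSON-style content dict to a typed object based on its 'type'."""
    kind = item.get("type")
    annotations, meta = item.get("annotations"), item.get("_meta")
    if kind == "text":
        return TextContent(item["text"], annotations, meta)
    if kind == "image":
        return ImageContent(item["data"], item["mimeType"], annotations, meta)
    if kind == "resource_link":
        # Keep size and _meta: dropping them loses file metadata.
        return ResourceLink(item["uri"], item["name"], item.get("size"), meta)
    # Assumed fallback: render unknown content as text rather than failing.
    return TextContent(str(item))
```

The bug being fixed was the equivalent of reading `item["text"]` unconditionally, which fails as soon as a tool returns an image or a resource link.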
Add user information (email, full_name, is_admin) to the plugin global context, enabling plugins like Cedar RBAC to make access control decisions based on user attributes beyond just email. Changes: - Add _inject_userinfo_instate() function to auth.py that populates global_context.user as a dictionary when include_user_info is enabled - Update GlobalContext.user type to Union[str, dict] for backward compat - Add include_user_info config option to plugin_settings (default: false) - Prevent tool_service from overwriting user dict with string email The feature is disabled by default to maintain backward compatibility with existing plugins that expect global_context.user to be a string. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
* Add profiling tools, memray Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Add profiling tools, memray Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(db): release DB sessions before external HTTP calls to prevent pool exhaustion This commit addresses issue #2518 where DB connection pool exhaustion occurred during A2A and RPC tool calls due to sessions being held during slow upstream HTTP requests. Changes: - tool_service.py: Extract A2A agent data to local variables before calling db.commit(), allowing HTTP calls to proceed without holding the DB session. The A2A tool invocation logic now uses pre-extracted data instead of querying during the HTTP call phase. - rbac.py: Add db.commit() and db.close() calls before returning user context in all authentication paths (proxy, anonymous, disabled auth). This ensures DB sessions are released early and not held during subsequent request processing. - test_rbac.py: Update test to provide mock db parameter and verify that db.commit() and db.close() are called for proper session cleanup. The fix follows the pattern established in other services: extract all needed data from ORM objects, call db.commit() to release the transaction, then proceed with external HTTP calls. This prevents "idle in transaction" states that exhaust PgBouncer's connection pool under high load. 
Load test results (4000 concurrent users, 1M+ requests): - Success rate: 99.81% - 502 errors reduced to 0.02% (edge cases with very slow upstreams) - P50: 450ms, P95: 4300ms Closes #2518 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * perf(config): tune connection pools for high concurrency Based on profiling with 4000 concurrent users (~2000 RPS): - MCP_SESSION_POOL_MAX_PER_KEY: 50 → 200 (reduce session creation) - IDLE_TRANSACTION_TIMEOUT: 120s → 300s (handle slow MCP calls) - CLIENT_IDLE_TIMEOUT: 120s → 300s (align with transaction timeout) - HTTPX_MAX_CONNECTIONS: 200 → 500 (more outbound capacity) - HTTPX_MAX_KEEPALIVE_CONNECTIONS: 100 → 300 - REDIS_MAX_CONNECTIONS: 150 → 100 (stay under maxclients) Results: - Failure rate: 0.446% → 0.102% (4.4x improvement) - RPC latency: 3,014ms → 1,740ms (42% faster) - CRUD latency: 1,207ms → 508ms (58% faster) See: todo/profile-full.md for detailed analysis Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
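The "extract, commit, then call" pattern described in the fix(db) commit above can be sketched as follows. This is not the code from tool_service.py or rbac.py: `RecordingSession`, `call_agent`, and the `endpoint_url`/`name` attributes are stand-ins invented for illustration, and calling `close()` right after `commit()` is an assumption borrowed from the rbac.py description.

```python
from types import SimpleNamespace


class RecordingSession:
    """Stand-in for a DB session that records the order of calls."""

    def __init__(self, events):
        self.events = events

    def commit(self):
        self.events.append("commit")

    def close(self):
        self.events.append("close")


def call_agent(db, agent, http_post):
    # 1. Copy everything needed from the ORM object into plain locals,
    #    so nothing touches the session after it is released.
    endpoint_url = agent.endpoint_url
    agent_name = agent.name

    # 2. End the transaction and hand the connection back to the pool.
    db.commit()
    db.close()

    # 3. Only now perform the (potentially slow) upstream call.
    return http_post(endpoint_url, {"agent": agent_name})
```

The point of the ordering is that the slow upstream request never runs while a pooled connection sits idle in a transaction.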
* fix(helm): stabilize chart templates and configs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(helm): align migration job with bootstrap Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs(helm): refresh chart README Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* docs: sync env defaults and references Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: sync env templates and performance tuning Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: stabilize coverage target Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: reduce test warnings Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: reduce test startup costs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: resolve bandit warning Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test(playwright): handle admin password change Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(playwright): stabilize admin UI flows Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…2534) The MCP specification does not mandate that tool names must start with a letter - tool names are simply strings without pattern restrictions. This fix updates the validation pattern to align with SEP-986. Changes: - Update VALIDATION_TOOL_NAME_PATTERN from ^[a-zA-Z][a-zA-Z0-9._-]*$ to ^[a-zA-Z0-9_][a-zA-Z0-9._/-]*$ per SEP-986 - Allow leading underscore/number and slashes in tool names - Remove / from HTML special characters regex (not XSS-relevant) - Update all error messages, docstrings, and documentation - Update tests to verify new valid cases Tool names like `_5gpt_query_by_market_id` and `namespace/tool` are now accepted. Closes #2528 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
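The effect of the new pattern can be checked directly with Python's `re` module. The regular expression below is the one quoted in the commit message; the function name `is_valid_tool_name` is hypothetical, and using `fullmatch` (so that a trailing newline is rejected) is a choice made for this sketch rather than something the commit states.

```python
import re

# New pattern quoted in the commit message (aligned with SEP-986).
TOOL_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9._/-]*$")


def is_valid_tool_name(name: str) -> bool:
    """Return True when the whole name matches the pattern."""
    return TOOL_NAME_PATTERN.fullmatch(name) is not None
```

Names may now start with a digit or underscore and contain slashes, while a leading dot, hyphen, or slash is still rejected.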
…figuration (#2515) - Add passphrase-protected key support for Granian via --ssl-keyfile-password - Add KEY_FILE_PASSWORD and CERT_PASSPHRASE compatibility in run-granian.sh - Export KEY_FILE in run-gunicorn.sh for Python SSL manager access - Improve Makefile cert targets with proper permissions (640) and group 0 - Split certs-passphrase into two-step generation (genrsa + req) for AES-256 - Add SSL configuration templates to nginx.conf for client and backend TLS - Expose port 443 in NGINX Dockerfile for HTTPS support - Update docker-compose.yml with TLS-related comments and correct cert paths - Add comprehensive TLS configuration documentation Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…2537) During gateway activation with OAuth Authorization Code flow, `_initialize_gateway` returns empty lists because the user hasn't completed authorization yet. Health checks then treat these empty responses as legitimate and delete all existing tools/resources/prompts. This change adds an `oauth_auto_fetch_tool_flag` parameter to `_initialize_gateway` that: - When False (default): Returns empty lists for auth_code gateways during health checks, preserving existing tools - When True (activation): Skips the early return for auth_code gateways, allowing activation to proceed The existing check in `_refresh_gateway_tools_resources_prompts` at lines 4724-4729 prevents stale deletion for auth_code gateways with empty responses. Fixed issues from original PR: - Corrected typo: oath -> oauth in parameter name - Removed duplicate docstring entry - Fixed logic bug that incorrectly skipped token fetch for client_credentials flow when flag was True Closes #2272 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(auth): add token revocation and proxy auth to admin middleware - Support token revocation checks in AdminAuthMiddleware - Enable proxy authentication for admin routes - Filter session listings by user ownership - Validate team membership for OAuth operations - Add configurable public registration setting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(config): change token validation defaults to secure-by-default - Set require_token_expiration default to true (was false) - Set require_jti default to true (was false) - Update .env.example to reflect new secure defaults Tokens without expiration or JTI claims will now be rejected by default. Set REQUIRE_TOKEN_EXPIRATION=false or REQUIRE_JTI=false to restore previous behavior if needed for backward compatibility. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs(security): expand securing guide with token lifecycle and access controls Add documentation for: - Token lifecycle management (revocation, validation settings) - Admin route authentication requirements - Session management access controls - User registration configuration - Updated production checklist with new settings Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(auth): address SSO redirect validation and admin middleware gaps - SSO redirect_uri validation now uses server-side allowlist only (allowed_origins, app_domain) instead of trusting Host header - Full origin comparison including scheme and port to prevent cross-port or HTTP downgrade redirects - AdminAuthMiddleware now supports API token authentication - AdminAuthMiddleware now honors platform admin bootstrap when REQUIRE_USER_IN_DB=false for fresh deployments Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(auth): add basic auth support to AdminAuthMiddleware Align AdminAuthMiddleware with require_admin_auth by supporting: - HTTP Basic authentication for legacy deployments - Basic auth users are treated as admin (consistent with existing 
behavior) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(auth): finalize secure defaults and update changelog for RC1 - Move hashlib/base64 imports to top-level in main.py (pylint C0415) - Add CHANGELOG entry for 1.0.0-RC1 secure defaults release - Add Security Defaults section to .env.example - Update test helpers to include JTI by default for REQUIRE_JTI=true Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor(auth): streamline authentication model and update documentation - Simplify Admin UI to use session-based email/password authentication - Add API_ALLOW_BASIC_AUTH setting for granular API auth control - Scope gateway credentials to prevent unintended forwarding - Update 25+ documentation files for auth model clarity - Add comprehensive test coverage for auth settings - Fix REQUIRE_TOKEN_EXPIRATION and REQUIRE_JTI defaults in docs - Remove BASIC_AUTH_* from Docker examples (not needed by default) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: update changelog with neutral language and ignore coverage.svg - Reword RC1 changelog entries to use neutral language - Add coverage.svg to .gitignore (generated by make coverage) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
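The "full origin comparison including scheme and port" mentioned in the SSO redirect commit above might look roughly like the sketch below. The function names and the default-port table are assumptions made for illustration; the actual implementation in the repository may differ.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}

Origin = Tuple[str, str, Optional[int]]


def origin_of(url: str) -> Optional[Origin]:
    """Return (scheme, host, port) with default ports filled in, or None if unusable."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if scheme not in _DEFAULT_PORTS or not host:
        return None
    try:
        port = parts.port
    except ValueError:  # malformed or out-of-range port
        return None
    if port is None:
        port = _DEFAULT_PORTS[scheme]
    return (scheme, host, port)


def is_allowed_redirect(redirect_uri: str, allowed_origins: Iterable[str]) -> bool:
    """Accept the redirect only if its origin is in the server-side allowlist."""
    origin = origin_of(redirect_uri)
    if origin is None:
        return False
    allowed = {origin_of(item) for item in allowed_origins}
    allowed.discard(None)
    return origin in allowed
```

Because the comparison uses the whole (scheme, host, port) tuple and never consults the request's Host header, a cross-port or HTTP-downgrade redirect does not match an allowlisted HTTPS origin.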
#2514) Refactors GatewayService, ExportService, ImportService, and A2AService to use globally-initialized service singletons (ToolService, PromptService, ResourceService, ServerService, RootService, GatewayService) instead of creating private, uninitialized instances. Uses lazy singleton pattern with __getattr__ to avoid import-time instantiation when only exception classes are imported. This ensures services are created after logging/plugin setup is complete. By importing the module-level services, all gateway operations now share the same EventService/Redis client. This ensures events such as activate/deactivate propagate correctly across workers and reach Redis subscribers. Changes: - Add lazy singleton pattern using __getattr__ to service modules - Update main.py to import singletons instead of instantiating services - Update GatewayService.__init__ to use lazy imports of singletons - Update ExportService.__init__ to use lazy imports of singletons - Update ImportService.__init__ to use lazy imports of singletons - Update A2AService methods to use tool_service singleton - Update tests to patch singleton methods instead of class instantiation - Add pylint disables for no-name-in-module (due to __getattr__) The fix resolves silent event drops caused by missing initialize() calls on locally constructed services. Cross-worker UI updates and subscriber notifications now behave as intended. Closes #2256 Signed-off-by: NAYANAR <nayana.r7813@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
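The lazy-singleton idea can be illustrated with a module-level `__getattr__` (PEP 562). To keep the sketch runnable on its own it builds a throwaway module object instead of editing a real service module; `ToolService` here is a stand-in class and `make_lazy_service_module` is a hypothetical helper, not project code.

```python
import types


class ToolService:
    """Stand-in for a real service class (illustration only)."""

    instances_created = 0

    def __init__(self) -> None:
        ToolService.instances_created += 1


def make_lazy_service_module(name: str) -> types.ModuleType:
    """Build a module whose `tool_service` attribute is created on first access."""
    module = types.ModuleType(name)
    cache = {}

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails.
        if attr == "tool_service":
            if attr not in cache:
                cache[attr] = ToolService()
            return cache[attr]
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Module-level __getattr__ is looked up in the module's own namespace.
    module.__getattr__ = __getattr__
    return module
```

Creating the module does not instantiate the service, so code that only imports exception classes pays no start-up cost; every later access returns the same shared instance.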
* feat: support external plugin stdio launch options Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat: add streamable http uds support Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: tidy streamable http shutdown Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * style: fix docstring line length in client.py Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(security): harden UDS and cwd path validation - Add canonical path resolution (.resolve()) to cwd validation to prevent path traversal via symlinks or relative path escapes - Add UDS security validation: - Require absolute paths for Unix domain sockets - Verify parent directory exists - Warn if parent directory is world-writable (potential socket hijacking) - Return canonical resolved paths instead of raw input - Update tests to use tmp_path fixture for secure temp directories Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * style: fix pylint warnings in models.py Move logging import to top level and fix implicit string concatenation. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
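The Unix domain socket checks listed in the fix(security) commit above could be sketched as follows. The function name `validate_uds_path` and the exact error messages are assumptions; only the four checks themselves (absolute path, existing parent, world-writable warning, canonical result) come from the commit message.

```python
import stat
import warnings
from pathlib import Path


def validate_uds_path(raw: str) -> str:
    """Validate a Unix domain socket path and return its canonical form."""
    path = Path(raw)
    if not path.is_absolute():
        raise ValueError(f"UDS path must be absolute: {raw!r}")

    # Resolve symlinks and '..' segments before inspecting the parent.
    resolved = path.resolve()
    parent = resolved.parent
    if not parent.is_dir():
        raise ValueError(f"UDS parent directory does not exist: {parent}")

    # A world-writable parent lets other local users replace the socket.
    if parent.stat().st_mode & stat.S_IWOTH:
        warnings.warn(f"UDS parent directory is world-writable: {parent}")

    return str(resolved)
```

Returning the resolved path rather than the raw input means later code operates on the same location that was actually checked.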
) * feat(infra): add zero-config TLS for nginx via Docker Compose profile Add a new `--profile tls` Docker Compose profile that enables HTTPS with zero configuration. Certificates are auto-generated on first run or users can provide their own CA-signed certificates. Features: - One command TLS: `make compose-tls` starts with HTTPS on port 8443 - Auto-generates self-signed certs if ./certs/ is empty - Custom certs: place cert.pem/key.pem in ./certs/ before starting - Optional HTTP->HTTPS redirect via `make compose-tls-https` - Environment variable NGINX_FORCE_HTTPS=true for redirect mode - Works alongside other profiles (monitoring, benchmark) New files: - infra/nginx/nginx-tls.conf: TLS-enabled nginx configuration - infra/nginx/docker-entrypoint.sh: Handles NGINX_FORCE_HTTPS env var New Makefile targets: - compose-tls: Start with HTTP:8080 + HTTPS:8443 - compose-tls-https: Force HTTPS redirect (HTTP->HTTPS) - compose-tls-down: Stop TLS stack - compose-tls-logs: Tail TLS service logs - compose-tls-ps: Show TLS stack status Docker Compose additions: - cert_init service: Auto-generates certs using alpine/openssl - nginx_tls service: TLS-enabled nginx reverse proxy Documentation: - Updated tls-configuration.md with Quick Start section - Updated compose.md with TLS section - Added to deployment navigation - Updated README.md quick start Closes #2571 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(nginx): use smart port detection for HTTPS redirect Fix hard-coded :8443 port in HTTPS redirect that broke internal container-to-container calls. 
Problem: - External access via port 8080 correctly redirected to :8443 - Internal container calls (no port) also redirected to :8443 - But nginx_tls only listens on 443 internally, so internal redirects failed Solution: Add a map directive that detects request origin based on Host header: - Requests with :8080 in Host → redirect to :8443 (external) - Requests without port → redirect without port, defaults to 443 (internal) Tested: - External: curl http://localhost:8080/health → https://localhost:8443/health ✓ - Internal: curl http://nginx_tls/health → https://nginx_tls/health (443) ✓ Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
) * fix: resolve LLM admin router db session and add favicon redirect - Fix LLM admin router endpoints that failed with 500 errors due to db session being None from RBAC middleware (intentionally closed to prevent idle-in-transaction). Added explicit db: Session = Depends(get_db) to all 11 affected endpoints. - Add /favicon.ico redirect to /static/favicon.ico for browser compatibility (browsers request favicon at root path). - Update README.md Running section with clear table documenting the three running modes (make dev, make serve, docker-compose) with their respective ports, servers, and databases. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(llm-admin): pass kwargs to fetch_provider_models for permission check The require_permission decorator only searches kwargs for user context. sync_provider_models was calling fetch_provider_models with positional args, causing the decorator to raise 401 Unauthorized. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(testing): add JMeter performance testing baseline Add comprehensive JMeter test plans for industry-standard performance baseline measurements and CI/CD integration. Test Plans (10 .jmx files): - rest_api_baseline: REST API endpoints (1,000 RPS, 10min) - mcp_jsonrpc_baseline: MCP JSON-RPC protocol (1,000 RPS, 15min) - mcp_test_servers_baseline: Direct MCP server testing (2,000 RPS) - load_test: Production load simulation (4,000 RPS, 30min) - stress_test: Progressive stress to breaking point (10,000 RPS) - spike_test: Traffic spike recovery (1K→10K→1K) - soak_test: 24-hour memory leak detection (2,000 RPS) - sse_streaming_baseline: SSE connection stability (1,000 conn) - websocket_baseline: WebSocket performance (500 conn) - admin_ui_baseline: Admin UI user simulation (50 users) Infrastructure: - 12 Makefile targets for running tests and generating reports - Properties files for production and CI environments - CSV test data for parameterized testing - Performance SLAs documentation (P50/P95/P99 latencies) Closes #2541 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(testing): improve JMeter testing setup and fix test issues - Add jmeter-install target to download JMeter 5.6.3 locally - Add jmeter-ui target to launch JMeter GUI - Add jmeter-check to verify JMeter 5.x+ (required for -e -o flags) - Add jmeter-clean target to clean results directory - Fix jmeter-report to handle empty results gracefully - Fix load_test.jmx JEXL3 thread count expressions - Fix admin_ui_baseline.jmx HTMX endpoint paths - Add HTTPS/TLS testing documentation and configuration - Add .jmeter/ to .gitignore for local installation Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(testing): fix JMeter JWT auth and add linter fixes - Fix JMETER_TOKEN generation: use python3 instead of python - Add JMETER_JWT_SECRET with default value (my-test-key) - Add encoding headers and fix import formatting from linter Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * 
feat(testing): add jmeter-quick target for fast test verification Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Hardening: safer CORS + localhost bind defaults Signed-off-by: Theodor N. Engøy <theodornengoy@Mac.home> * langchain agent: DRY env parsing + Makefile HOST override Signed-off-by: Theodor N. Engøy <theodornengoy@eduroam-193-157-246-146.wlan.uio.no> * fix(security): harden CORS wildcard guard, validate LOG_LEVEL, add tests - Fix CORS wildcard bypass: check parsed origin list instead of raw string so '*,https://example.com' is caught - Validate LOG_LEVEL against allowed uvicorn levels with fallback - Add 44 differential tests for env_utils and CORS configuration - Remove unused pytest import (Ruff F401) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Theodor N. Engøy <theodornengoy@Mac.home> Signed-off-by: Theodor N. Engøy <theodornengoy@eduroam-193-157-246-146.wlan.uio.no> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Theodor N. Engøy <theodornengoy@Mac.home> Co-authored-by: Theodor N. Engøy <theodornengoy@eduroam-193-157-246-146.wlan.uio.no> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
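The wildcard bypass fixed above comes from comparing the raw setting to `"*"` instead of inspecting each parsed origin. A minimal sketch of the difference, with hypothetical function names:

```python
from typing import List


def parse_origins(raw: str) -> List[str]:
    """Split a comma-separated origin list, dropping blanks and whitespace."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def naive_has_wildcard(raw: str) -> bool:
    # Flawed check: only catches the setting when it is exactly "*".
    return raw == "*"


def has_wildcard_origin(raw: str) -> bool:
    # Hardened check: look at every parsed entry.
    return "*" in parse_origins(raw)
```

With the value `*,https://example.com` from the commit message, the naive check lets the wildcard through while the parsed-list check catches it.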
…y paths (#3090) * feat: improve log hygiene across auth and gateway flows - streamline auth/team/sso/gateway log messages for consistency - remove token-derived value details from routine debug/error logs - add regression tests for logging behavior (unit + AST-based checks) - cover oversized SSO callback branch behavior - make middleware overhead timing test more deterministic Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(ci): make helm-unittest install compatible with plugin verification defaults Use --verify=false when installing helm-unittest in linting-helm-unittest to avoid CI failures with plugin sources that do not provide verification metadata. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* docs: rebrand to ContextForge AI Gateway consistently across the project Unify all product naming from the inconsistent mix of "MCP Gateway", "Context Forge", "MCP Context Forge", and "ContextForge MCP Gateway" to consistently use "ContextForge" (or "ContextForge AI Gateway" for the full product name). Updated positioning to reflect all supported gateway patterns: - Tools Gateway (MCP, REST, gRPC, TOON) - Agent Gateway (A2A, OpenAI, Anthropic) - Model Gateway (LLM proxy, OpenAI API spec, 8+ providers) - API Gateway (rate limiting, auth, retries, reverse proxy) - Plugin Extensibility (40+ plugins) - Observability (OpenTelemetry) Preserved all code identifiers: mcpgateway (Python module), mcp-contextforge-gateway (PyPI), mcp-context-forge (GitHub/Docker), MCPGATEWAY_* (env vars), mcpContextForge (Helm), mcp.db (database). Closes #2714 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address review feedback on ContextForge rebrand - Fix failing playwright test (MCP_Gateway → ContextForge in Swagger title assertion) - Fix "ContextForge (ContextForge)" redundant parenthetical (6 locations) - Fix "ContextForges" wrong plural → "Gateways" or "ContextForge instances" - Fix missed "MCP Context-Forge" hyphenated variant (7 locations) - Fix missed "MCP CONTEXT FORGE" in Makefile header - Fix missed lowercase "MCP context forge" / "Context forge" in toolops, plugins - Drop article "the" before ContextForge (brand names don't take articles) - Fix "the ContextForge's" → "ContextForge's" - Update APP_NAME defaults in run.sh, .env.example, Helm schema, config schema Closes #2714 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat: enable LLMCHAT_ENABLED by default Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: update LLMCHAT_ENABLED default to true in docs and charts Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Replace the manual `mcpgateway-dev` CE secret with an automated flow that builds `.env.deploy` from `.env.example` + GitHub Secrets (CF_* prefix) on every push to main. Secrets are never logged. Key changes: - Add early validation step (fail-fast before build/push) - Generate .env.deploy via Python (safe for special chars in secrets) - Reject secrets containing embedded newlines - Assert all expected keys were replaced in the template - Update-or-create pattern for CE secret (atomic, no data loss) - Cleanup .env.deploy via trap on exit Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…3093) PR #3091 mechanically replaced "MCP Gateway" with "ContextForge" in several places, creating nonsensical text like "Enterprise ContextForge" and fake product names like "Apigee ContextForge". Fixes: - Title: "Enterprise ContextForge" → "Enterprise AI, Agent and MCP Gateway" - Heading: "Why an ContextForge?" → "Why ContextForge?" - Heading: "ContextForge Landscape" → "MCP Gateway Landscape" - Vendor names: restore Apigee MCP Hub, Azure API Management, Docker MCP Toolkit - Mermaid diagram: "ContextForge Options" → "MCP Gateway Options" - Nav label: "Why use an ContextForge" → "Why use ContextForge" - Roadmap #2272: restore original issue title with "MCP Gateway" - Copilot docs: "An ContextForge running" → "ContextForge running" Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…3096) The --from-env-file flag on `ibmcloud ce secret update` silently skips updates when the env file contains comments, blank lines, or section headers. The previous implementation passed the full .env.example (800+ lines of comments) through to .env.deploy, causing CE to report success while leaving stale placeholder values ("-") in the secret. Three fixes: - Strip .env.deploy to clean KEY=VALUE lines only (29 lines vs 800+) - Add explicit --from-literal overrides for the 5 critical secrets - Add post-update verification that fails the workflow if any secret still holds a placeholder value Also includes AGENTS.md updates (project overview, structure, env var defaults, issue/labeling guidelines, maintenance guardrails). Closes #3092 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
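Reducing an env template to clean KEY=VALUE lines, as the first fix describes, might be done along these lines. This is a sketch under assumptions: the key syntax accepted by the regular expression and the function name `strip_env_lines` are illustrative, not taken from the workflow.

```python
import re
from typing import List

# Assumed key syntax: a letter or underscore followed by letters, digits, underscores.
_KEY_VALUE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def strip_env_lines(text: str) -> List[str]:
    """Keep only KEY=VALUE lines; drop comments, blank lines, and section headers."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if _KEY_VALUE.match(stripped):
            kept.append(stripped)
    return kept
```

Anything that is not a plain assignment is discarded, which matches the goal of giving `--from-env-file` only lines it will actually process.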
The validation step only checked for empty secrets, allowing placeholder values like '-' to pass. This caused the workflow to build and push the Docker image (~3 min) before failing at the verification step. Now rejects '-', 'changeme', and 'CHANGE_ME' early with a helpful error pointing to GitHub Settings. Also sets all 5 CF_* secrets in the production environment to strong random values (they were previously set to '-'). Closes #3096 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
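An early check of this kind could be sketched as below. The placeholder set is the one listed in the commit message; the function name and the secret names used in the usage example are hypothetical.

```python
from typing import Dict, List

# Placeholder values named in the commit message.
PLACEHOLDER_VALUES = frozenset({"-", "changeme", "CHANGE_ME"})


def find_invalid_secrets(secrets: Dict[str, str]) -> List[str]:
    """Return the names of secrets that are empty or still hold a placeholder."""
    invalid = []
    for name, value in secrets.items():
        cleaned = value.strip()
        if not cleaned or cleaned in PLACEHOLDER_VALUES:
            invalid.append(name)
    return sorted(invalid)


# Usage with made-up secret names:
problems = find_invalid_secrets({"CF_EXAMPLE_A": "-", "CF_EXAMPLE_B": "s3cr3t-value"})
if problems:
    print("Set real values in GitHub Settings for:", ", ".join(problems))
```

Running this before the image build is what turns a late verification failure into a fail-fast error.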
* feat: implement Subresource Integrity (SRI) for CDN resources Implements comprehensive SRI protection for all external CDN resources to prevent tampering and ensure content integrity. Changes: - Pin 15 CDN resources to specific versions (HTMX 1.9.10, Alpine.js 3.14.1, etc.) - Generate SHA-384 integrity hashes for all CDN resources - Add integrity and crossorigin attributes to all CDN script/link tags - Create shared scripts/cdn_resources.py module for DRY configuration - Implement hash generation (scripts/generate-sri-hashes.py) - Implement hash verification (scripts/verify-sri-hashes.py) - Add CI pipeline verification (make sri-verify) - Document SRI in security guide and ADR-0014 - Add Tailwind JIT exclusion comments (dynamic content incompatible with SRI) - Use @lru_cache for efficient hash loading (no global variables) Templates updated: - admin.html: HTMX with canonical /dist/htmx.min.js URL - login.html: Font Awesome with corrected closing tag - change-password-required.html: Font Awesome All 15 CDN resources now protected with cryptographic integrity verification. 
Closes #2558 Signed-off-by: SuciuDaniel <Daniel.Vasile.Suciu@ibm.com> * style: add blank line before lru_cache decorator for PEP 8 compliance Signed-off-by: SuciuDaniel <Daniel.Vasile.Suciu@ibm.com> * test: add comprehensive test coverage for load_sri_hashes() - Add 8 test cases covering all code paths in load_sri_hashes() - Test success case, file not found, invalid JSON, permission errors - Test lru_cache behavior and integration with admin endpoints - Test edge cases: empty file, unicode content - Remove shebang from scripts/cdn_resources.py (library module) Closes #2558 Signed-off-by: SuciuDaniel <Daniel.Vasile.Suciu@ibm.com> * fix: review fixes for SRI implementation - Fix docs referencing wrong file (generate-sri-hashes.py -> cdn_resources.py) - Add *.json to pyproject.toml package-data for wheel builds - Add sri_hashes context to forgot-password and reset-password endpoints - Fix HTMX script tag indentation in admin.html - Use %s logger format instead of f-string in load_sri_hashes - Remove unused monkeypatch parameter from SRI tests - Add Tailwind JIT exclusion comments to forgot-password and reset-password Closes #2558 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Makefile update Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(test): remove Authorization header from cookie injection to fix CORS The SRI PR added crossorigin="anonymous" to CDN resources (required by the W3C SRI spec for cross-origin integrity checks). This changed CDN fetch mode from no-cors to cors. Playwright's set_extra_http_headers sends the Authorization header on ALL requests including cross-origin CDN fetches, triggering CORS preflight failures on CDNs that don't whitelist Authorization — blocking Alpine.js, CodeMirror, and other scripts from loading. Fix: use cookie-only auth in _inject_jwt_cookie and _set_admin_jwt_cookie. The JWT cookie alone is sufficient for same-origin page navigation and HTMX requests. 
Closes #2558 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: SuciuDaniel <Daniel.Vasile.Suciu@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
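For readers unfamiliar with SRI, an integrity value is the base64 encoding of the resource's digest, prefixed with the algorithm name. The sketch below shows that computation for SHA-384 on bytes already in hand; it is not the project's scripts/generate-sri-hashes.py, and the function names are hypothetical.

```python
import base64
import hashlib


def sri_sha384(content: bytes) -> str:
    """Return an SRI integrity value ("sha384-<base64 digest>") for the given bytes."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def verify_sri(content: bytes, expected: str) -> bool:
    """Check downloaded bytes against a previously recorded integrity value."""
    return sri_sha384(content) == expected
```

The resulting string is what goes into the `integrity` attribute, alongside `crossorigin="anonymous"`, on the script or link tag.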
…c, and ssrf (#3101) * fix: harden websocket auth and gate ws/reverse-proxy by default Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: token scoping hardening for unmatched paths (C-15) Default deny for unmatched scoped paths in token scoping middleware Preserve public/wildcard behavior and add regression coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: bearer scheme parsing consistency in token scoping (C-03) Accept case-insensitive Bearer authorization scheme Add regression coverage for mixed-case bearer and empty-token handling Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: token scope regression coverage for MCP/RPC paths (C-09) Assert scoped tokens are denied on /rpc and require servers.use on server MCP endpoints Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: cancellation notification authorization checks (C-10) Require run ownership or admin context before honoring notifications/cancelled Store run owner metadata at registration and add regression coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: enforce admin access on DCR management endpoints (O-05) Require admin context for registered OAuth client list/get/delete endpoints Add regression tests for non-admin denial Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: streamable auth revocation and user status checks (U-05) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: oidc id_token verification in sso callback flow (O-01) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: strict ssrf defaults with explicit cidr allowlist (S-01) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: harden cancellation auth paths and align ws/oidc security behavior Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add full regression coverage for new security paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: align locust expectations with feature flags and auth hardening 
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix testing on github runner Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Update docs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
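The case-insensitive Bearer parsing from the C-03 fix above can be sketched as a small helper. The name `extract_bearer_token` and the choice to return `None` for every rejected header are assumptions for illustration, not the middleware's actual interface.

```python
from typing import Optional


def extract_bearer_token(header: Optional[str]) -> Optional[str]:
    """Return the token from an Authorization header, or None if it is not a usable Bearer value."""
    if not header:
        return None
    # Split into scheme and credentials on the first run of whitespace.
    parts = header.split(None, 1)
    if len(parts) != 2:
        return None  # scheme only, e.g. "Bearer" or "Bearer   "
    scheme, token = parts
    if scheme.lower() != "bearer":
        return None
    token = token.strip()
    return token or None
```

Auth scheme names are case-insensitive in HTTP, so `bearer`, `Bearer`, and `BEARER` are all treated alike, while an empty token is rejected rather than passed on.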
* fix: security health auth validation hardening and behavior consistency Aligns /health/security token checks with JWT verification flow for consistent authenticated access behavior. Ref: C-38 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: rpc logging authorization hardening and behavior consistency Adds admin.system_config permission checks for JSON-RPC logging/setLevel to align RPC and HTTP behavior. Ref: C-30 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: utility transport permission consistency hardening and behavior consistency Aligns utility SSE/message permission checks with the canonical tools.execute action used by role defaults. Ref: C-13 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: auth dependency validation hardening and behavior consistency Adds token revocation and active-user enforcement to require_auth while preserving existing auth source precedence. Ref: C-33 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: admin auth validation hardening and behavior consistency Enforces revocation and active-user checks in require_admin_auth before admin access is granted. Ref: C-16 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: auth hardening consistency and documentation updates Aligns auth dependency behavior and updates RC2 changelog plus RBAC/security docs to reflect current permission and validation flows. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: cover auth hardening diff branches Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
… h-batch-2 (#3106) * fix: session and resource access hardening and behavior consistency Align websocket token handling, session ownership checks, resource visibility enforcement, and roots permission consistency for C-04 C-07 C-11 C-14 C-28 C-29. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: rc2 changelog hardening and migration clarity Document C-04 C-07 C-11 C-14 C-28 C-29 behavior changes and breaking-change migration guidance under 1.0.0-RC2. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: docstring lint compliance and coverage consistency Add missing Args/Returns/Raises docstrings for helper methods and nested search/session owner helpers to satisfy flake8 DAR rules and interrogate coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: session ownership claim hardening and auth semantics Use atomic owner claim for initialize, distinguish missing session from unverifiable owner metadata on message ingress, add defensive team-access guard, and extend regression coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Update pylint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: distributed owner-claim fail-closed semantics Return unverifiable state when Redis ownership backend is unavailable and extend backend coverage for owner claim/session existence behavior. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand hardening regression and diff coverage Add backend-specific session owner claim/existence tests and helper-path regressions to reach 100% diff coverage for new hardening code. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: access control hardening and behavior consistency
- C-05: require tools.execute for both tools/call and legacy JSON-RPC tool invocation paths
- C-18: enforce scoped access on GET /resources/{resource_id}/info and maintain fail-closed ID ownership checks
- C-19: align root management endpoints with admin.system_config authorization requirements
- C-20: harden OAuth fetch-tools scope resolution and ownership checks with normalized token-team semantics
- C-35: validate server existence and scoped access before SSE setup, preserving deterministic 404/403 behavior
- C-39: sanitize imported scoped fields (team_id, owner_email, visibility, team) before persistence
- C-18: harden JWT rich-token teams semantics by distinguishing omitted teams from explicit teams=null
- add/update regression tests for allow/deny coverage across RPC, OAuth, resource info, import sanitization, and token helpers
- update CHANGELOG and local issue evidence/index entries for the hardening follow-up
Refs: C-05 C-18 C-19 C-20 C-35 C-39
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Update tests
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
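The "omitted teams versus explicit teams=null" distinction mentioned in the commit above can be sketched as follows. This is a minimal illustration only: `classify_teams_claim` and the state labels are hypothetical names, and the commit message does not say what access each state grants, so the sketch classifies the claim without attaching any policy to it.

```python
from typing import Any, Mapping

# Sentinel used to tell "key absent" apart from "key present with value None".
_MISSING = object()


def classify_teams_claim(payload: Mapping[str, Any]) -> str:
    """Classify the 'teams' claim of an already-verified JWT payload.

    Hypothetical helper: the returned labels are illustrative and do not
    correspond to names in the gateway code base.
    """
    teams = payload.get("teams", _MISSING)
    if teams is _MISSING:
        return "omitted"          # the token carries no 'teams' key at all
    if teams is None:
        return "explicit-null"    # the token carries "teams": null
    if isinstance(teams, list):
        return "scoped"           # the token lists teams (possibly an empty list)
    return "invalid"              # any other type is treated as malformed


if __name__ == "__main__":
    print(classify_teams_claim({"sub": "user@example.com"}))
    print(classify_teams_claim({"sub": "user@example.com", "teams": None}))
```

A plain `payload.get("teams")` would return `None` in both of the first two cases, which is why a sentinel default is used here.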
…3111)
* fix: visibility and admin scope hardening and behavior consistency (C-22 C-24 C-27 C-32 C-23)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: docstring completeness hardening and behavior consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test: scope regression coverage hardening and behavior consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: streamable completion scope hardening and behavior consistency (C-24)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test: completion scope branch coverage hardening in rpc and protocol paths
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
)
* fix: oauth grant handling hardening and behavior consistency (O-11)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: sso flow validation hardening and behavior consistency (O-03 O-04 O-06 O-14)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: oauth access enforcement hardening and behavior consistency (O-02 O-15 O-16)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: auth lint compliance hardening and behavior consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* docs: rc2 changelog and sso approval flow consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: oauth status request-context hardening (O-16)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test: expand oauth and sso hardening regression coverage
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: github sso email-claim handling and regression coverage
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: oauth fetch-tools access hardening (O-15)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Update tests
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…utbound URL validation (h-batch-6) (#3115)
* fix: oauth config hardening and behavior consistency
Refs: A-02, A-05, O-10, O-17
- centralize oauth secret protection for service-layer CRUD
- add server oauth masking parity for read/list responses
- keep oauth secret decrypt to runtime token exchange paths
- expand regression coverage for encryption and masking behavior
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: email auth timing hardening and behavior consistency
Refs: A-06
- add dummy password verification on early login failures
- enforce configurable minimum failed-login response duration
- add focused regression tests for timing guard paths
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: outbound url validation hardening and behavior consistency
Refs: S-02, S-03
- validate admin gateway test base URL before outbound requests
- validate llmchat connect server URL before session setup
- add regression tests for strict local/private URL rejection
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test: regression coverage hardening and behavior consistency
Refs: A-02, A-05, A-06, O-10, O-17
- add branch-focused regression tests for oauth secret handling and runtime decrypt guards
- add legacy update-object coverage for server oauth update path
- align helper docstrings with linting policy requirements
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Update tests
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: hardening consistency for oauth storage, auth timing, and SSRF validation (A-02 A-05 A-06 O-10 O-17 S-02 S-03)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: harden admin endpoints and align load-test payloads
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
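The A-06 timing guard described above (dummy password verification on early failures plus a minimum failed-login duration) might look roughly like the sketch below. Everything here is an assumption made for illustration: `verify_login`, `hash_password`, the in-memory `users` mapping, the PBKDF2 parameters, and the 0.25 s default are hypothetical and are not taken from the gateway code.

```python
import hashlib
import hmac
import time
from typing import Callable, Dict, Tuple

_ITERATIONS = 10_000                      # illustrative work factor only
_DUMMY_SALT = b"illustrative-dummy-salt"  # hypothetical constant


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2-HMAC-SHA256 (stdlib only)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, _ITERATIONS)


# Precomputed hash that unknown-user logins are compared against.
_DUMMY_HASH = hash_password("not-a-real-password", _DUMMY_SALT)


def verify_login(
    users: Dict[str, Tuple[bytes, bytes]],  # email -> (salt, stored hash)
    email: str,
    password: str,
    min_failure_seconds: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on success; pad failed attempts to a minimum duration."""
    start = time.monotonic()
    record = users.get(email)
    if record is None:
        # Early failure: still do the hashing work so that an unknown email
        # costs about as much as a known email with a wrong password.
        hmac.compare_digest(hash_password(password, _DUMMY_SALT), _DUMMY_HASH)
        ok = False
    else:
        salt, stored = record
        ok = hmac.compare_digest(hash_password(password, salt), stored)
    if not ok:
        # Enforce the configurable floor on failed-login response time.
        remaining = min_failure_seconds - (time.monotonic() - start)
        if remaining > 0:
            sleep(remaining)
    return ok
```

The `sleep` parameter is injected only so the padding can be observed in tests without actually waiting; an async service would await a non-blocking sleep instead.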
)
* fix: llm proxy hardening and behavior consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Update AGENTS.md
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: lint docstring hardening and behavior consistency
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: harden alembic sqlite migration compatibility
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: tighten llm token scoping and update rbac docs
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
d8969d6 to f48aef7
Dependency Review
The following issues were found:
Signed-off-by: lucarlig <luca.carlig@ibm.com> Signed-off-by: Luca <lucarlig@protonmail.com>
f48aef7 to edc482e
Reopened as #3161. CI/CD will re-run on the new PR. You are still credited as the author.
TBD