
[BUG][NETWORK]: mTLS and Certificate Rotation Issues #3579

@bogdanmariusc10


🐞 Bug Summary

The networking validation suite identified two issues that keep the Gateway from operating in high-security, high-availability environments: it cannot federate with servers that require Mutual TLS (mTLS), and it cannot rotate certificates without downtime.


🧩 Affected Component

Select the area of the project impacted:

  • mcpgateway - API
  • mcpgateway - UI (admin panel)
  • mcpgateway.wrapper - stdio wrapper
  • Federation or Transports
  • CLI, Makefiles, or shell scripts
  • Container setup (Docker/Podman/Compose)
  • Other (explain below)

🔁 Steps to Reproduce

NET-01: mTLS Integration

The Issue: The Gateway fails to attach the required client certificates during its automated internal health checks.

Impact: When a target MCP server is set to Strict mTLS mode (CERT_REQUIRED), it rejects any connection that lacks a valid client certificate.

Consequence: Even if the server is healthy, the Gateway's "ping" fails, causing the system to incorrectly flag the server as Offline. This prevents any tool execution, as the Gateway refuses to route traffic to a server it perceives as dead.

Step 1.1: Generate the mTLS Trust Chain

# Work in /tmp/certs, the directory the later steps reference
mkdir -p /tmp/certs && cd /tmp/certs

# Create CA
openssl req -x509 -newkey rsa:4096 -keyout ca.key -out ca.crt -days 365 -nodes -subj "/CN=Test CA"

# Create Server Cert (CN must match the hostname in the registration URL)
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr -subj "/CN=localhost"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 365

# Create Client Cert
openssl genrsa -out client.key 2048
openssl req -new -key client.key -out client.csr -subj "/CN=Client"
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out client.crt -days 365

Step 1.2: Start the Target MCP Server in Strict Mode
Ensure the target server (port 9000) is configured with ssl_context.verify_mode = ssl.CERT_REQUIRED.
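
For reference, a minimal sketch of what strict mode means on the target server, using only Python's standard ssl module. The function names new_strict_context and load_server_files are hypothetical and are not taken from the Gateway or any MCP server codebase; the file paths are assumed to be the ones generated in Step 1.1.

```python
import ssl


def new_strict_context() -> ssl.SSLContext:
    """Return a server-side context that rejects clients without a certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the handshake fail unless the client presents
    # a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_server_files(ctx: ssl.SSLContext, certfile: str, keyfile: str,
                      client_ca_file: str) -> None:
    """Attach the server identity and the CA used to verify client certs."""
    # Hypothetical helper: paths are assumed to be server.crt, server.key
    # and ca.crt from Step 1.1.
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    ctx.load_verify_locations(cafile=client_ca_file)
```

The resulting context would then be handed to whatever server listens on port 9000. With it in place, any probe that omits a client certificate fails during the TLS handshake, before an HTTP request is ever sent.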

Step 1.3: Register the Gateway via the API

curl -v --cacert ca.crt -X POST "https://localhost:8443/gateways" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mtls-test-server",
    "url": "https://localhost:9000/sse",
    "tls_config": {
      "ca_cert": "/tmp/certs/ca.crt",
      "client_cert": "/tmp/certs/client.crt",
      "client_key": "/tmp/certs/client.key"
    }
  }'

Step 1.4: Attempt a tool request (tools/list) through the Gateway

curl -v --cacert ca.crt -X POST "https://localhost:8443/mcp/http" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}'

NET-02: Certificate Rotation

The Issue: The Gateway process has no signal handler for SIGHUP, so it terminates instead of reloading its configuration.

Impact: Certificates cannot be rotated in place. Sending SIGHUP causes a hard exit (Hangup: 1) rather than a refresh of the in-memory SSL context.

Consequence: The service goes down immediately, all active SSE and WebSocket sessions are dropped, and a manual restart is required.

Step 2.1: Start the Gateway and a monitor loop

# Terminal 1: Monitor loop
# -f makes curl exit non-zero on HTTP errors, so 4xx/5xx responses count as FAIL
while true; do
  curl -s -f -o /dev/null -w "%{http_code}\n" --cacert /tmp/certs/ca.crt https://localhost:8443/health \
  && echo "$(date +%H:%M:%S) - OK" || echo "$(date +%H:%M:%S) - FAIL"
  sleep 0.5
done

Step 2.2: Generate a new "Rotation" certificate

openssl genrsa -out /tmp/certs/server-new.key 2048
openssl req -new -key /tmp/certs/server-new.key -out /tmp/certs/server-new.csr -subj "/CN=localhost"
openssl x509 -req -in /tmp/certs/server-new.csr -CA /tmp/certs/ca.crt -CAkey /tmp/certs/ca.key \
  -CAcreateserial -out /tmp/certs/server-new.crt -days 365

Step 2.3: Perform the hot swap on disk

cp /tmp/certs/server-new.crt /tmp/certs/server.crt
cp /tmp/certs/server-new.key /tmp/certs/server.key

Step 2.4: Send the reload signal

kill -HUP $(pgrep -f mcpgateway)


🤔 Expected Behavior

NET-01: The Gateway should successfully pass its internal health check by attaching the provided client_cert to the request, allowing the tool call to proceed.
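
A minimal sketch of the expected client-side behavior, assuming the health check can be given an ssl.SSLContext. The function build_client_context is hypothetical and uses only the standard library; it is not the Gateway's actual code path. The parameter names mirror the tls_config fields from Step 1.3.

```python
import ssl
from typing import Optional


def build_client_context(ca_cert: Optional[str] = None,
                         client_cert: Optional[str] = None,
                         client_key: Optional[str] = None) -> ssl.SSLContext:
    """Build a client context that verifies the server and presents an identity."""
    # create_default_context enables certificate verification and
    # hostname checking; ca_cert=None falls back to the system trust store.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_cert)
    if client_cert:
        # Without this call the context has no client identity, and a
        # CERT_REQUIRED server rejects the handshake.
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

The same context would need to be used for both the health check and regular tool traffic, for example via urllib.request.urlopen(url, context=ctx), so that the probe sees the server the way real requests do.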

NET-02: The monitor loop should continue to show OK without interruption. The server should catch the signal and re-initialize its SSLContext without dropping active connections or exiting.


📓 Logs / Error Output

NET-01: mTLS Initialization Failure

{"message":"Failed to initialize gateway at https://localhost:9000/sse: All connection attempts failed"}

NET-02: SIGHUP Termination

{"asctime": "2026-03-10T10:27:29", "levelname": "ERROR", "message": "WebSocket error: (<CloseCode.ABNORMAL_CLOSURE: 1006>, '')"}
make: *** [dev] Hangup: 1

🧠 Environment Info

Key                  Value
Version or commit    1.0.0-RC2
Runtime              Python 3.14, Uvicorn
Platform / OS        macOS
Container            none

🧩 Additional Context

mcpgateway/utils/ssl_context_cache.py: The cert_hash is calculated solely from the ca_certificate, which makes the cache "client-blind": even if a client cert is provided, the cache returns a generic context that lacks the client identity files.
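
One possible direction, sketched under assumptions: derive the cache key from the client identity as well as the CA. The function tls_cache_key and its parameters are hypothetical and do not reflect the current signature in ssl_context_cache.py.

```python
import hashlib
from typing import Optional


def tls_cache_key(ca_cert: str,
                  client_cert: Optional[str] = None,
                  client_key: Optional[str] = None) -> str:
    """Return a cache key that changes when any TLS input changes."""
    digest = hashlib.sha256()
    for part in (ca_cert, client_cert, client_key):
        data = (part or "").encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc")
        # cannot collapse into the same byte stream.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

With a key like this, a gateway registered with a client certificate no longer shares a cached context with one that only supplies a CA.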

mcpgateway/plugins/framework/external/mcp/tls_utils.py: Hostname verification is enabled by default (check_hostname = True). We observed that connecting to 127.0.0.1 with a CN=localhost certificate causes an immediate SSL abort, which is why the reproduction steps use localhost in every URL.

Signal Handling: There is currently no SIGHUP handler implemented in the primary lifecycle management of the Gateway, leading to the default OS behavior of process termination.
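
A rough sketch of the kind of handler that is missing, using only the standard library. All names here (reload_requested, request_reload, install_sighup_handler, reload_cert_chain) are hypothetical. It assumes the Gateway holds a reference to the server-side SSLContext and that calling load_cert_chain again on that live context affects only new handshakes; both assumptions would need to be verified under Uvicorn.

```python
import signal
import ssl
import threading

# Set by the signal handler, consumed by a worker that performs the reload.
reload_requested = threading.Event()


def request_reload(signum, frame) -> None:
    """Signal handler: do the minimum and defer the real work."""
    reload_requested.set()


def install_sighup_handler() -> None:
    """Replace the default SIGHUP action (terminate) with a reload request."""
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGHUP, request_reload)


def reload_cert_chain(ctx: ssl.SSLContext, certfile: str, keyfile: str) -> bool:
    """Reload the certificate and key from disk; return False on failure."""
    try:
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    except (OSError, ssl.SSLError):
        # Keep serving with the previously loaded certificate.
        return False
    return True
```

A background task would wait on reload_requested, call reload_cert_chain with the paths from Step 2.3, and clear the event. Keeping the old certificate when the reload fails means a half-written file on disk does not take the service down.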

Labels

  • MUST (P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe)
  • api (REST API Related item)
  • bug (Something isn't working)
  • performance (Performance related items)
  • python (Python / backend development (FastAPI))
  • security (Improves security)
