test(e2e): unblock Live Canary auth lanes after engine-v2 contract change by ilblackdragon · Pull Request #3235 · nearai/ironclaw

ilblackdragon · 2026-05-03T16:45:17Z

Summary

The Live Canary scheduled job (Auth Smoke, Auth Full, Auth Live Seeded) has failed every run for 3+ days, since the canary cut over to main on 2026-05-01. Three tests in test_v2_auth_oauth_matrix.py drive the failures, all rooted in the engine-v2 callable-only contract change from #2868 (merged 2026-04-25) that the tests pre-date.

What was broken

test_mcp_same_server_multi_user_via_browser — engine v2 opens an approval pending_gate on the first MCP tool call. The browser fixture has no auto-approve UI, so the chat sat in pending_gate for the full 5-minute Playwright timeout. The expected_text_contains=\"Mock MCP search result\" predicate could never flip because the assistant bubble never received any content. (Failed in Auth Smoke + Auth Full.)
test_wasm_tool_oauth_refresh_on_demand — same shape: the gmail call gates on approval before reaching the http credential-injection layer that triggers the OAuth refresh. Without approving, refresh_count never went above 0. (Failed in Auth Full.)
test_wasm_tool_first_chat_auth_attempt_emits_auth_url — tests OLD engine v2 behavior (auto-emit gate_required Authentication on first call to an unauthed extension). After engine-v2: make available_actions callable-only for blocked providers #2868, the engine returns \"action 'gmail' is not callable in this execution context\" instead, and tool_activate(name=...) is the new model-facing enablement path. The canned mock LLM emits tool calls directly, so this scenario isn't reproducible from a scripted LLM. (Failed in Auth Full.)

Fix

_wait_for_tool_call accepts a token= kwarg so multi-user tests can poll/approve as a per-user identity (backwards-compatible default).
test_mcp_same_server_multi_user_via_browser — drive approval through the per-user API while polling for the tool call. Drop the broken predicate and tied text assertions; the bearer-token isolation assertion (what the test actually exists to verify) is retained.
test_wasm_tool_oauth_refresh_on_demand — insert a _wait_for_tool_call approval step between _send_chat and _wait_for_refresh_request so the http credential layer actually runs.
test_wasm_tool_first_chat_auth_attempt_emits_auth_url — marked xfail with an inline reason pointing at engine-v2: make available_actions callable-only for blocked providers #2868 and the replacement coverage (test_v2_tool_activate_surface.py, test_settings_first_gmail_auth_then_chat_runs).
tests/e2e/conftest.py — set SECRETS_MASTER_KEY in ironclaw_server's spawned env. Without it, macOS local dev hangs on a Keychain authorization prompt during auto_generate_and_persist, so the fixture kills the process with SIGKILL after 60s and no browser-dependent test can run locally. CI Linux was unaffected (the keychain backend errors fast and the fallback writes to .env). Matches the pattern already used in auth_matrix_server, test_v2_engine_auth_cancel, test_v2_tool_activate_surface, etc.

The actual Auth Live Seeded failure (Timed out waiting for extension state: github) is a separate issue (real OAuth tokens / seeded credentials) and is not addressed here.

Verification

Reproduced each failure locally (HTTP-only repros for the non-browser tests, server-side log inspection of the multi-user test). Confirmed pending_gate=approval was the load-bearing symptom in both auto-callable scenarios.

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_oauth_refresh_on_demand
PASSED in 6.74s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_first_chat_auth_attempt_emits_auth_url
XFAIL in 93s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py -v --timeout=120
12 passed, 2 skipped, 3 xfailed (3 browser tests errored locally;
they'll run cleanly in CI)

Test plan

Live Canary Auth Smoke and Auth Full jobs go green (next scheduled run after merge)
Existing passing tests in test_v2_auth_oauth_matrix.py stay green

The v2-engine E2E group still listed test_v2_kernel_auth_preflight.py, but that file was removed in #2868 (engine-v2: callable-only available actions) and replaced with test_v2_tool_activate_surface.py for the new tool_activate / Activatable Integrations contract. The Web E2E Full job is skipped on PR-level CI but runs in the merge queue, so the bad path filter dequeued #3197 and #3203 with "file or directory not found: test_v2_kernel_auth_preflight.py".

…ange The Live Canary "Auth Smoke", "Auth Full", and "Auth Live Seeded" jobs have failed every scheduled run since 2026-05-01 (when the canary cut over to main). Three tests in test_v2_auth_oauth_matrix.py drive the failures, all rooted in the engine-v2 callable-only contract from #2868 that didn't exist when these tests were written. ## What was broken `test_mcp_same_server_multi_user_via_browser` After OAuth completes, sending "check mock mcp search" through each user's browser opens an `approval` pending_gate on the first MCP tool call (engine v2 default). The browser fixture has no auto-approve UI, so the chat sat in `pending_gate` for the full 5-min Playwright timeout — `expected_text_contains="Mock MCP search result"` could never match because the assistant bubble never received any text. `test_wasm_tool_oauth_refresh_on_demand` Same shape: gmail call gates on `approval` before reaching the http credential-injection layer that performs the OAuth refresh. Without approving, refresh_count never went above 0, so the test failed with "Timed out waiting for OAuth refresh request". `test_wasm_tool_first_chat_auth_attempt_emits_auth_url` Tested OLD engine-v2 behavior — that an LLM-emitted call to a not-yet- authed extension would surface a `gate_required` Authentication event with an auth URL. After #2868, the engine returns "action 'gmail' is not callable in this execution context" instead, and `tool_activate` became the model-facing enablement path. The mock LLM is canned to emit tool calls directly, so this scenario can't be reproduced from a scripted LLM until the canned response is updated. ## Fixes - `_wait_for_tool_call`: accept a `token` kwarg so multi-user tests can poll/approve through a per-user identity. Backwards-compatible. - `test_mcp_same_server_multi_user_via_browser`: drive approval through the per-user API while waiting for the tool to land. Drop the broken `expected_text_contains` predicate and the tied "Mock MCP search result" text assertions; the bearer-token isolation assertion (what this test actually exists to prove) is retained and unaffected. - `test_wasm_tool_oauth_refresh_on_demand`: insert a `_wait_for_tool_call` approval step between `_send_chat` and `_wait_for_refresh_request` so the http credential layer actually runs. - `test_wasm_tool_first_chat_auth_attempt_emits_auth_url`: marked xfail with an inline reason pointing at #2868 and the replacement coverage (`test_v2_tool_activate_surface.py`, `test_settings_first_gmail_auth_then_chat_runs`). - Drop the now-unused `send_chat_and_wait_for_terminal_message` import. ## conftest fix `ironclaw_server` now sets `SECRETS_MASTER_KEY` in the spawned env. On macOS without it, `auto_generate_and_persist` blocks on a Keychain authorization prompt that no one's home to click, so `wait_for_ready` times out at 60s and the fixture kills the process with SIGKILL — making any session-scoped browser test impossible to run locally. On Linux, the keychain backend errors fast and the auto-generate fallback writes to `.env`, so CI was unaffected. Setting the key explicitly matches the pattern already used in `auth_matrix_server`, `test_v2_engine_auth_cancel`, `test_v2_tool_activate_surface`, etc. ## Verification Local repro confirmed each failure mode (HTTP-only repro for the non-browser tests, server-side log inspection for the multi-user test). Reproduced the exact pending_gate=approval pattern, fixed it, verified the assertion semantics still hold: ``` $ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_oauth_refresh_on_demand PASSED in 6.74s $ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_first_chat_auth_attempt_emits_auth_url XFAIL in 93s $ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py -v --timeout=120 12 passed, 2 skipped, 3 xfailed (browser tests errored locally; they'll run cleanly in CI) ``` The browser-driven `test_mcp_same_server_multi_user_via_browser` couldn't be exercised locally (chromium can't launch under this shell sandbox), but the API + auto-approve flow it now relies on is exercised by an HTTP-equivalent repro and matches the pattern used in `test_settings_first_gmail_auth_then_chat_runs`.

gemini-code-assist

Code Review

This pull request updates end-to-end tests to align with engine-v2 changes, primarily focusing on the new approval gating mechanism for tool calls. Key changes include adding a master key to the test environment, updating _wait_for_tool_call to support authentication tokens, and refactoring tests to explicitly handle tool approvals. Feedback suggests using asyncio.gather in test_mcp_same_server_multi_user_via_browser to poll and approve tool calls for multiple users concurrently, which would improve test performance.

gemini-code-assist · 2026-05-03T16:48:31Z

+        await _wait_for_tool_call(
+            server["base_url"],
+            owner_thread,
+            "mock_mcp_mock_search",
+            timeout=60.0,
+            token=AUTH_TOKEN,
        )
-        member_result = await send_chat_and_wait_for_terminal_message(
-            member_page,
-            "check mock mcp search",
-            timeout=300000,
-            expected_text_contains="Mock MCP search result",
+        await _wait_for_tool_call(
+            server["base_url"],
+            member_thread,
+            "mock_mcp_mock_search",
+            timeout=60.0,
+            token=member["token"],
        )


For a minor performance improvement in this test, you could run these two _wait_for_tool_call awaitables concurrently using asyncio.gather. This would allow both users' tool calls to be polled and approved in parallel, potentially reducing the test execution time.

Suggested change

await _wait_for_tool_call(

server["base_url"],

owner_thread,

"mock_mcp_mock_search",

timeout=60.0,

token=AUTH_TOKEN,

)

member_result = await send_chat_and_wait_for_terminal_message(

member_page,

"check mock mcp search",

timeout=300000,

expected_text_contains="Mock MCP search result",

await _wait_for_tool_call(

server["base_url"],

member_thread,

"mock_mcp_mock_search",

timeout=60.0,

token=member["token"],

)

await asyncio.gather(

_wait_for_tool_call(

server["base_url"],

owner_thread,

"mock_mcp_mock_search",

timeout=60.0,

token=AUTH_TOKEN,

),

_wait_for_tool_call(

server["base_url"],

member_thread,

"mock_mcp_mock_search",

timeout=60.0,

token=member["token"],

),

)

Copilot

Pull request overview

Updates E2E auth/OAuth coverage to align with the engine-v2 callable-only tool contract (post-#2868) and unblocks Live Canary auth lanes by handling approval gating and deprecating an obsolete scenario.

Changes:

Add per-user token support to _wait_for_tool_call and use API-driven approval to avoid browser fixtures hanging on pending_gate=approval.
Mark an obsolete auth-on-first-call test as xfail under the new callable-only contract.
Ensure SECRETS_MASTER_KEY is set for the spawned E2E server env and update the E2E workflow matrix to run the replacement v2 surface test.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
tests/e2e/scenarios/test_v2_auth_oauth_matrix.py	Adds per-user approval polling, updates multi-user MCP + Gmail refresh tests, and xfails an obsolete scenario.
tests/e2e/conftest.py	Sets `SECRETS_MASTER_KEY` in the spawned `ironclaw_server` env to prevent local macOS keychain prompts.
.github/workflows/e2e.yml	Replaces the removed/obsolete v2 auth preflight test with `test_v2_tool_activate_surface.py` in the matrix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    env["SECRETS_MASTER_KEY"] = (
+        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
+    )


`test_auth_required_sse_without_duplicate_response` was failing in the merge queue on every PR (most recently bouncing #3197 and #3203 from the queue) because the `auth_sse_server` fixture never set `LLM_API_KEY` in the spawned ironclaw env. After #2572 added a missing- API-key check to the openai_compatible config validator (Apr 22), ironclaw rejected the env-supplied openai_compatible config, fell back to the NearAI default, hit "missing session token", and failed the turn before the github skill could even fire its 401. The chat thus reached `state: Failed` with no tool calls and no `onboarding_state/auth_required` event — which is exactly what the test asserted on, hence the consistent failure. Adding `LLM_API_KEY=mock-api-key` matches the value already used in every other e2e fixture (auth_matrix, conftest's ironclaw_server, v2_engine, etc.) and unblocks the assertion. Local run: PASSED in 8s.

…onse [skip-regression-check] The SSE `response` handler unconditionally called addMessage('assistant', data.content) even when stream_chunks had already populated and finalized a bubble for the same response. This stayed invisible in the common case but surfaced as a hard test failure under the path test_switching_back_preserves_in_progress_turn: 1. Send "What is 2+2?" on thread A — stream chunks start filling an assistant bubble with "data-streaming". 2. Switch to thread B mid-stream — container clears (history reload). 3. Switch back to thread A — history rehydration shows the in-progress turn with no response yet, so 0 assistant bubbles in DOM. 4. Stream chunks continue to fire for A — appendToLastAssistant creates a new bubble and accumulates the response into it. 5. response event fires — flushes any remaining buffer, removes the data-streaming flag (good) — then addMessage('assistant', content) creates a SECOND identical bubble. Result: locator(".message.assistant").filter(has_text="4") matches two elements, Playwright strict mode rejects the wait_for, the test fails. Outside the test, two identical bubbles render to the user. Fix: only call addMessage in the response handler when there was no in-flight streaming bubble. If one existed, the streamed content is already correct (chunks accumulate `data.content` verbatim) and the data-streaming flag has just been cleared. Non-streaming responses (no chunks fired) still take the addMessage branch. Regression coverage: tests/e2e/scenarios/test_message_persistence.py:: test_switching_back_preserves_in_progress_turn already reproduces this exact scenario and was failing in the merge queue. With this fix it passes; skip-regression-check used because the existing E2E test is the regression test, and the gateway doesn't have a JS unit test harness for SSE handler state.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/e2e/scenarios/test_v2_auth_oauth_matrix.py:1104

_wait_for_tool_call currently POSTs /api/chat/approval for any pending_gate/pending_approval it sees. If the pending gate is something other than approval (e.g. gate_name == "authentication"), this will send an invalid approval request and fail with a confusing assertion. Consider checking pending.get("gate_name") == "approval" (or treating missing gate_name as approval) before auto-approving, and otherwise fail fast with a clear error that the tool call can’t proceed until the non-approval gate is resolved.

        pending = history.get("pending_gate") or history.get("pending_approval")
        if pending and pending["request_id"] not in approved_request_ids:
            approve = await api_post(
                base_url,
                "/api/chat/approval",
                token=token,
                json={
                    "request_id": pending["request_id"],
                    "action": "approve",
                    "thread_id": thread_id,
                },
                timeout=15,
            )
            assert approve.status_code == 202, approve.text
            approved_request_ids.add(pending["request_id"])

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

serrrfirat

Approved to unblock merge queue after validating deterministic v2-engine E2E fixes. Full manual E2E workflow passed: https://github.com/nearai/ironclaw/actions/runs/25376351067

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

serrrfirat

Re-approving after E2E helper-only follow-up for pending-message race; local web-regressions slice passed.

…arai#3589) Both xfails in tests/e2e/scenarios/test_v2_auth_oauth_matrix.py are stale and pass against current code: - test_wasm_tool_first_chat_auth_attempt_emits_auth_url Marked xfail in nearai#3235 because the engine-v2 callable-only contract (nearai#2868) stopped emitting an auth gate on direct LLM-driven tool calls. PR nearai#3157 (auth-preflight + inline-await) restored the behavior the test asserts: when the LLM emits a direct call to a not-yet-authed extension, the bridge raises an Authentication gate with auth_url populated (src/bridge/effect_adapter.rs:1356-1392). Marker removed; test passes. - test_settings_first_custom_mcp_auth_then_chat_runs The xfail reason claimed post-auth tool-output propagation was broken. Real cause: engine-v2 gates the first MCP tool call on `approval` and the browser fixture has no auto-approve UI, so the chat sat in pending_gate forever. Same shape as the bugs fixed in nearai#3235 for test_wasm_tool_oauth_refresh_on_demand and test_mcp_same_server_multi_user_via_browser. Inserted _wait_for_tool_call between _send_chat and _wait_for_response_contains to drive approval through the API; test passes. Verified locally: both tests pass back-to-back in 27s on a fresh auth_matrix_server. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ilblackdragon added 2 commits May 4, 2026 00:49

Copilot AI review requested due to automatic review settings May 3, 2026 16:45

github-actions Bot added size: XS < 10 changed lines (excluding docs) scope: ci CI/CD workflows risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels May 3, 2026

Copilot started reviewing on behalf of ilblackdragon May 3, 2026 16:45 View session

gemini-code-assist Bot reviewed May 3, 2026

View reviewed changes

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread tests/e2e/conftest.py

Comment on lines +508 to +510

env["SECRETS_MASTER_KEY"] = (

"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

)

serrrfirat previously approved these changes May 3, 2026

View reviewed changes

serrrfirat added this pull request to the merge queue May 3, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 3, 2026

Copilot AI review requested due to automatic review settings May 3, 2026 18:41

ilblackdragon dismissed serrrfirat’s stale review via bf02418 May 3, 2026 18:41

github-actions Bot added size: S 10-49 changed lines and removed size: XS < 10 changed lines (excluding docs) labels May 3, 2026

Copilot started reviewing on behalf of ilblackdragon May 3, 2026 18:42 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

serrrfirat previously approved these changes May 4, 2026

View reviewed changes

serrrfirat added this pull request to the merge queue May 4, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026

fix(gateway): dedupe history-rendered SSE responses

aa45b7b

serrrfirat dismissed their stale review via aa45b7b May 4, 2026 18:52

serrrfirat enabled auto-merge May 4, 2026 19:02

serrrfirat previously approved these changes May 4, 2026

View reviewed changes

serrrfirat added this pull request to the merge queue May 4, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026

Copilot AI reviewed May 5, 2026

View reviewed changes

fix(e2e): make v2 approval tests deterministic

716f9e1

github-actions Bot added size: M 50-199 changed lines scope: tool/builtin Built-in tools and removed size: S 10-49 changed lines labels May 5, 2026

test(e2e): stabilize duplicate skill install assertion

53f3805

Copilot AI review requested due to automatic review settings May 5, 2026 11:38

Copilot AI reviewed May 5, 2026

View reviewed changes

serrrfirat added 2 commits May 5, 2026 13:52

test(e2e): assert duplicate install stays ungated

e0f6f8f

fix(skills): skip approval for disk-installed duplicates

8d53c79

Copilot AI review requested due to automatic review settings May 5, 2026 12:14

Copilot AI reviewed May 5, 2026

View reviewed changes

github-actions Bot added size: L 200-499 changed lines and removed size: M 50-199 changed lines labels May 5, 2026

fix(v2): honor no-op skill installs without approval

2e40d37

serrrfirat previously approved these changes May 5, 2026

View reviewed changes

serrrfirat added this pull request to the merge queue May 5, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 5, 2026

test(e2e): wait for pending send marker to clear

77b4815

Copilot AI review requested due to automatic review settings May 5, 2026 19:31

serrrfirat dismissed their stale review via 77b4815 May 5, 2026 19:31

Copilot AI reviewed May 5, 2026

View reviewed changes

serrrfirat approved these changes May 5, 2026

View reviewed changes

serrrfirat enabled auto-merge May 5, 2026 19:47

serrrfirat added this pull request to the merge queue May 5, 2026

serrrfirat removed this pull request from the merge queue due to a manual request May 5, 2026

serrrfirat merged commit 3d40671 into main May 5, 2026
43 checks passed

serrrfirat deleted the fix/canary-auth-tests-engine-v2 branch May 5, 2026 20:07

ilblackdragon mentioned this pull request May 13, 2026

test(e2e): unxfail two auth-matrix tests now that contracts match #3589

Merged

4 tasks

Conversation

ilblackdragon commented May 3, 2026

Summary

What was broken

Fix

Verification

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

serrrfirat left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

serrrfirat left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants