Skip to content

test(e2e): unblock Live Canary auth lanes after engine-v2 contract change#3235

Merged
serrrfirat merged 12 commits into
mainfrom
fix/canary-auth-tests-engine-v2
May 5, 2026
Merged

test(e2e): unblock Live Canary auth lanes after engine-v2 contract change#3235
serrrfirat merged 12 commits into
mainfrom
fix/canary-auth-tests-engine-v2

Conversation

@ilblackdragon
Copy link
Copy Markdown
Member

Summary

The Live Canary scheduled job (Auth Smoke, Auth Full, Auth Live Seeded) has failed every run for 3+ days, since the canary cut over to main on 2026-05-01. Three tests in test_v2_auth_oauth_matrix.py drive the failures, all rooted in the engine-v2 callable-only contract change from #2868 (merged 2026-04-25) that the tests pre-date.

What was broken

  • test_mcp_same_server_multi_user_via_browser — engine v2 opens an approval pending_gate on the first MCP tool call. The browser fixture has no auto-approve UI, so the chat sat in pending_gate for the full 5-minute Playwright timeout. The expected_text_contains=\"Mock MCP search result\" predicate could never flip because the assistant bubble never received any content. (Failed in Auth Smoke + Auth Full.)
  • test_wasm_tool_oauth_refresh_on_demand — same shape: the gmail call gates on approval before reaching the http credential-injection layer that triggers the OAuth refresh. Without approving, refresh_count never went above 0. (Failed in Auth Full.)
  • test_wasm_tool_first_chat_auth_attempt_emits_auth_url — tests OLD engine v2 behavior (auto-emit gate_required Authentication on first call to an unauthed extension). After engine-v2: make available_actions callable-only for blocked providers #2868, the engine returns \"action 'gmail' is not callable in this execution context\" instead, and tool_activate(name=...) is the new model-facing enablement path. The canned mock LLM emits tool calls directly, so this scenario isn't reproducible from a scripted LLM. (Failed in Auth Full.)

Fix

  • _wait_for_tool_call accepts a token= kwarg so multi-user tests can poll/approve as a per-user identity (backwards-compatible default).
  • test_mcp_same_server_multi_user_via_browser — drive approval through the per-user API while polling for the tool call. Drop the broken predicate and tied text assertions; the bearer-token isolation assertion (what the test actually exists to verify) is retained.
  • test_wasm_tool_oauth_refresh_on_demand — insert a _wait_for_tool_call approval step between _send_chat and _wait_for_refresh_request so the http credential layer actually runs.
  • test_wasm_tool_first_chat_auth_attempt_emits_auth_url — marked xfail with an inline reason pointing at engine-v2: make available_actions callable-only for blocked providers #2868 and the replacement coverage (test_v2_tool_activate_surface.py, test_settings_first_gmail_auth_then_chat_runs).
  • tests/e2e/conftest.py — set SECRETS_MASTER_KEY in ironclaw_server's spawned env. Without it, macOS local dev hangs on a Keychain authorization prompt during auto_generate_and_persist, so the fixture kills the process with SIGKILL after 60s and no browser-dependent test can run locally. CI Linux was unaffected (the keychain backend errors fast and the fallback writes to .env). Matches the pattern already used in auth_matrix_server, test_v2_engine_auth_cancel, test_v2_tool_activate_surface, etc.

The actual Auth Live Seeded failure (Timed out waiting for extension state: github) is a separate issue (real OAuth tokens / seeded credentials) and is not addressed here.

Verification

Reproduced each failure locally (HTTP-only repros for the non-browser tests, server-side log inspection of the multi-user test). Confirmed pending_gate=approval was the load-bearing symptom in both auto-callable scenarios.

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_oauth_refresh_on_demand
PASSED in 6.74s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_first_chat_auth_attempt_emits_auth_url
XFAIL in 93s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py -v --timeout=120
12 passed, 2 skipped, 3 xfailed (3 browser tests errored locally;
they'll run cleanly in CI)

Test plan

  • Live Canary Auth Smoke and Auth Full jobs go green (next scheduled run after merge)
  • Existing passing tests in test_v2_auth_oauth_matrix.py stay green

The v2-engine E2E group still listed test_v2_kernel_auth_preflight.py,
but that file was removed in #2868 (engine-v2: callable-only available
actions) and replaced with test_v2_tool_activate_surface.py for the new
tool_activate / Activatable Integrations contract.

The Web E2E Full job is skipped on PR-level CI but runs in the merge
queue, so the bad path filter dequeued #3197 and #3203 with
"file or directory not found: test_v2_kernel_auth_preflight.py".
…ange

The Live Canary "Auth Smoke", "Auth Full", and "Auth Live Seeded" jobs
have failed every scheduled run since 2026-05-01 (when the canary cut
over to main). Three tests in test_v2_auth_oauth_matrix.py drive the
failures, all rooted in the engine-v2 callable-only contract from #2868
that didn't exist when these tests were written.

## What was broken

`test_mcp_same_server_multi_user_via_browser`
After OAuth completes, sending "check mock mcp search" through each
user's browser opens an `approval` pending_gate on the first MCP tool
call (engine v2 default). The browser fixture has no auto-approve UI,
so the chat sat in `pending_gate` for the full 5-min Playwright
timeout — `expected_text_contains="Mock MCP search result"` could
never match because the assistant bubble never received any text.

`test_wasm_tool_oauth_refresh_on_demand`
Same shape: gmail call gates on `approval` before reaching the http
credential-injection layer that performs the OAuth refresh. Without
approving, refresh_count never went above 0, so the test failed with
"Timed out waiting for OAuth refresh request".

`test_wasm_tool_first_chat_auth_attempt_emits_auth_url`
Tested OLD engine-v2 behavior — that an LLM-emitted call to a not-yet-
authed extension would surface a `gate_required` Authentication event
with an auth URL. After #2868, the engine returns "action 'gmail' is
not callable in this execution context" instead, and `tool_activate`
became the model-facing enablement path. The mock LLM is canned to
emit tool calls directly, so this scenario can't be reproduced from a
scripted LLM until the canned response is updated.

## Fixes

- `_wait_for_tool_call`: accept a `token` kwarg so multi-user tests can
  poll/approve through a per-user identity. Backwards-compatible.
- `test_mcp_same_server_multi_user_via_browser`: drive approval through
  the per-user API while waiting for the tool to land. Drop the broken
  `expected_text_contains` predicate and the tied "Mock MCP search
  result" text assertions; the bearer-token isolation assertion (what
  this test actually exists to prove) is retained and unaffected.
- `test_wasm_tool_oauth_refresh_on_demand`: insert a `_wait_for_tool_call`
  approval step between `_send_chat` and `_wait_for_refresh_request`
  so the http credential layer actually runs.
- `test_wasm_tool_first_chat_auth_attempt_emits_auth_url`: marked xfail
  with an inline reason pointing at #2868 and the replacement coverage
  (`test_v2_tool_activate_surface.py`,
  `test_settings_first_gmail_auth_then_chat_runs`).
- Drop the now-unused `send_chat_and_wait_for_terminal_message` import.

## conftest fix

`ironclaw_server` now sets `SECRETS_MASTER_KEY` in the spawned env.
On macOS without it, `auto_generate_and_persist` blocks on a Keychain
authorization prompt that no one's home to click, so `wait_for_ready`
times out at 60s and the fixture kills the process with SIGKILL —
making any session-scoped browser test impossible to run locally.
On Linux, the keychain backend errors fast and the auto-generate
fallback writes to `.env`, so CI was unaffected. Setting the key
explicitly matches the pattern already used in
`auth_matrix_server`, `test_v2_engine_auth_cancel`,
`test_v2_tool_activate_surface`, etc.

## Verification

Local repro confirmed each failure mode (HTTP-only repro for the
non-browser tests, server-side log inspection for the multi-user
test). Reproduced the exact pending_gate=approval pattern, fixed it,
verified the assertion semantics still hold:

```
$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_oauth_refresh_on_demand
PASSED in 6.74s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py::test_wasm_tool_first_chat_auth_attempt_emits_auth_url
XFAIL in 93s

$ pytest tests/e2e/scenarios/test_v2_auth_oauth_matrix.py -v --timeout=120
12 passed, 2 skipped, 3 xfailed (browser tests errored locally;
they'll run cleanly in CI)
```

The browser-driven `test_mcp_same_server_multi_user_via_browser`
couldn't be exercised locally (chromium can't launch under this
shell sandbox), but the API + auto-approve flow it now relies on is
exercised by an HTTP-equivalent repro and matches the pattern used
in `test_settings_first_gmail_auth_then_chat_runs`.
Copilot AI review requested due to automatic review settings May 3, 2026 16:45
@github-actions github-actions Bot added size: XS < 10 changed lines (excluding docs) scope: ci CI/CD workflows risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels May 3, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates end-to-end tests to align with engine-v2 changes, primarily focusing on the new approval gating mechanism for tool calls. Key changes include adding a master key to the test environment, updating _wait_for_tool_call to support authentication tokens, and refactoring tests to explicitly handle tool approvals. Feedback suggests using asyncio.gather in test_mcp_same_server_multi_user_via_browser to poll and approve tool calls for multiple users concurrently, which would improve test performance.

Comment on lines +1586 to 1599
await _wait_for_tool_call(
server["base_url"],
owner_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=AUTH_TOKEN,
)
member_result = await send_chat_and_wait_for_terminal_message(
member_page,
"check mock mcp search",
timeout=300000,
expected_text_contains="Mock MCP search result",
await _wait_for_tool_call(
server["base_url"],
member_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=member["token"],
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For a minor performance improvement in this test, you could run these two _wait_for_tool_call awaitables concurrently using asyncio.gather. This would allow both users' tool calls to be polled and approved in parallel, potentially reducing the test execution time.

Suggested change
await _wait_for_tool_call(
server["base_url"],
owner_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=AUTH_TOKEN,
)
member_result = await send_chat_and_wait_for_terminal_message(
member_page,
"check mock mcp search",
timeout=300000,
expected_text_contains="Mock MCP search result",
await _wait_for_tool_call(
server["base_url"],
member_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=member["token"],
)
await asyncio.gather(
_wait_for_tool_call(
server["base_url"],
owner_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=AUTH_TOKEN,
),
_wait_for_tool_call(
server["base_url"],
member_thread,
"mock_mcp_mock_search",
timeout=60.0,
token=member["token"],
),
)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates E2E auth/OAuth coverage to align with the engine-v2 callable-only tool contract (post-#2868) and unblocks Live Canary auth lanes by handling approval gating and deprecating an obsolete scenario.

Changes:

  • Add per-user token support to _wait_for_tool_call and use API-driven approval to avoid browser fixtures hanging on pending_gate=approval.
  • Mark an obsolete auth-on-first-call test as xfail under the new callable-only contract.
  • Ensure SECRETS_MASTER_KEY is set for the spawned E2E server env and update the E2E workflow matrix to run the replacement v2 surface test.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/e2e/scenarios/test_v2_auth_oauth_matrix.py Adds per-user approval polling, updates multi-user MCP + Gmail refresh tests, and xfails an obsolete scenario.
tests/e2e/conftest.py Sets SECRETS_MASTER_KEY in the spawned ironclaw_server env to prevent local macOS keychain prompts.
.github/workflows/e2e.yml Replaces the removed/obsolete v2 auth preflight test with test_v2_tool_activate_surface.py in the matrix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/e2e/conftest.py
Comment on lines +508 to +510
env["SECRETS_MASTER_KEY"] = (
"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
)
`test_auth_required_sse_without_duplicate_response` was failing in the
merge queue on every PR (most recently bouncing #3197 and #3203 from
the queue) because the `auth_sse_server` fixture never set
`LLM_API_KEY` in the spawned ironclaw env. After #2572 added a missing-
API-key check to the openai_compatible config validator (Apr 22),
ironclaw rejected the env-supplied openai_compatible config, fell back
to the NearAI default, hit "missing session token", and failed the
turn before the github skill could even fire its 401.

The chat thus reached `state: Failed` with no tool calls and no
`onboarding_state/auth_required` event — which is exactly what the
test asserted on, hence the consistent failure.

Adding `LLM_API_KEY=mock-api-key` matches the value already used in
every other e2e fixture (auth_matrix, conftest's ironclaw_server,
v2_engine, etc.) and unblocks the assertion. Local run: PASSED in 8s.
serrrfirat
serrrfirat previously approved these changes May 3, 2026
@serrrfirat serrrfirat added this pull request to the merge queue May 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 3, 2026
…onse

[skip-regression-check]

The SSE `response` handler unconditionally called addMessage('assistant',
data.content) even when stream_chunks had already populated and
finalized a bubble for the same response. This stayed invisible in the
common case but surfaced as a hard test failure under the path
test_switching_back_preserves_in_progress_turn:

1. Send "What is 2+2?" on thread A — stream chunks start filling an
   assistant bubble with "data-streaming".
2. Switch to thread B mid-stream — container clears (history reload).
3. Switch back to thread A — history rehydration shows the in-progress
   turn with no response yet, so 0 assistant bubbles in DOM.
4. Stream chunks continue to fire for A — appendToLastAssistant creates
   a new bubble and accumulates the response into it.
5. response event fires — flushes any remaining buffer, removes the
   data-streaming flag (good) — then addMessage('assistant', content)
   creates a SECOND identical bubble.

Result: locator(".message.assistant").filter(has_text="4") matches
two elements, Playwright strict mode rejects the wait_for, the test
fails. Outside the test, two identical bubbles render to the user.

Fix: only call addMessage in the response handler when there was no
in-flight streaming bubble. If one existed, the streamed content is
already correct (chunks accumulate `data.content` verbatim) and the
data-streaming flag has just been cleared. Non-streaming responses
(no chunks fired) still take the addMessage branch.

Regression coverage: tests/e2e/scenarios/test_message_persistence.py::
test_switching_back_preserves_in_progress_turn already reproduces this
exact scenario and was failing in the merge queue. With this fix it
passes; skip-regression-check used because the existing E2E test is
the regression test, and the gateway doesn't have a JS unit test
harness for SSE handler state.
Copilot AI review requested due to automatic review settings May 3, 2026 18:41
@github-actions github-actions Bot added size: S 10-49 changed lines and removed size: XS < 10 changed lines (excluding docs) labels May 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/e2e/scenarios/test_v2_auth_oauth_matrix.py:1104

  • _wait_for_tool_call currently POSTs /api/chat/approval for any pending_gate/pending_approval it sees. If the pending gate is something other than approval (e.g. gate_name == "authentication"), this will send an invalid approval request and fail with a confusing assertion. Consider checking pending.get("gate_name") == "approval" (or treating missing gate_name as approval) before auto-approving, and otherwise fail fast with a clear error that the tool call can’t proceed until the non-approval gate is resolved.
        pending = history.get("pending_gate") or history.get("pending_approval")
        if pending and pending["request_id"] not in approved_request_ids:
            approve = await api_post(
                base_url,
                "/api/chat/approval",
                token=token,
                json={
                    "request_id": pending["request_id"],
                    "action": "approve",
                    "thread_id": thread_id,
                },
                timeout=15,
            )
            assert approve.status_code == 202, approve.text
            approved_request_ids.add(pending["request_id"])

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

serrrfirat
serrrfirat previously approved these changes May 4, 2026
@serrrfirat serrrfirat added this pull request to the merge queue May 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026
serrrfirat
serrrfirat previously approved these changes May 4, 2026
@serrrfirat serrrfirat added this pull request to the merge queue May 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@github-actions github-actions Bot added size: M 50-199 changed lines scope: tool/builtin Built-in tools and removed size: S 10-49 changed lines labels May 5, 2026
Copilot AI review requested due to automatic review settings May 5, 2026 11:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Copilot AI review requested due to automatic review settings May 5, 2026 12:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@github-actions github-actions Bot added size: L 200-499 changed lines and removed size: M 50-199 changed lines labels May 5, 2026
serrrfirat
serrrfirat previously approved these changes May 5, 2026
Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved to unblock merge queue after validating deterministic v2-engine E2E fixes. Full manual E2E workflow passed: https://github.com/nearai/ironclaw/actions/runs/25376351067

@serrrfirat serrrfirat added this pull request to the merge queue May 5, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 5, 2026
Copilot AI review requested due to automatic review settings May 5, 2026 19:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after E2E helper-only follow-up for pending-message race; local web-regressions slice passed.

@serrrfirat serrrfirat enabled auto-merge May 5, 2026 19:47
@serrrfirat serrrfirat added this pull request to the merge queue May 5, 2026
@serrrfirat serrrfirat removed this pull request from the merge queue due to a manual request May 5, 2026
@serrrfirat serrrfirat merged commit 3d40671 into main May 5, 2026
43 checks passed
@serrrfirat serrrfirat deleted the fix/canary-auth-tests-engine-v2 branch May 5, 2026 20:07
pull Bot pushed a commit to soitun/ironclaw that referenced this pull request May 13, 2026
…arai#3589)

Both xfails in tests/e2e/scenarios/test_v2_auth_oauth_matrix.py are
stale and pass against current code:

- test_wasm_tool_first_chat_auth_attempt_emits_auth_url
  Marked xfail in nearai#3235 because the engine-v2 callable-only contract
  (nearai#2868) stopped emitting an auth gate on direct LLM-driven tool
  calls. PR nearai#3157 (auth-preflight + inline-await) restored the
  behavior the test asserts: when the LLM emits a direct call to a
  not-yet-authed extension, the bridge raises an Authentication gate
  with auth_url populated (src/bridge/effect_adapter.rs:1356-1392).
  Marker removed; test passes.

- test_settings_first_custom_mcp_auth_then_chat_runs
  The xfail reason claimed post-auth tool-output propagation was
  broken. Real cause: engine-v2 gates the first MCP tool call on
  `approval` and the browser fixture has no auto-approve UI, so the
  chat sat in pending_gate forever. Same shape as the bugs fixed in
  nearai#3235 for test_wasm_tool_oauth_refresh_on_demand and
  test_mcp_same_server_multi_user_via_browser. Inserted
  _wait_for_tool_call between _send_chat and _wait_for_response_contains
  to drive approval through the API; test passes.

Verified locally: both tests pass back-to-back in 27s on a fresh
auth_matrix_server.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: ci CI/CD workflows scope: tool/builtin Built-in tools size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants