Skip to content

feat(sandboxes): Improve code evaluator editor ux#12690

Merged
cephalization merged 17 commits intoversion-sandboxesfrom
cephalization/improve-code-evaluator-editor-ux
Apr 20, 2026
Merged

feat(sandboxes): Improve code evaluator editor ux#12690
cephalization merged 17 commits intoversion-sandboxesfrom
cephalization/improve-code-evaluator-editor-ux

Conversation

@cephalization
Copy link
Copy Markdown
Contributor

No description provided.

@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix Apr 15, 2026
@cephalization cephalization changed the title Cephalization/improve code evaluator editor ux feat(sandboxes): Improve code evaluator editor ux Apr 15, 2026
Comment thread app/package.json Outdated
@cephalization cephalization marked this pull request as ready for review April 17, 2026 19:12
@cephalization cephalization requested review from a team as code owners April 17, 2026 19:12
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Apr 17, 2026
Comment on lines +304 to +306
type_name, sandbox_config_db_id = from_global_id(
inline_code_evaluator.sandbox_config_id
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type_name is extracted here but never validated — it's silently discarded. A client could pass a GlobalID with any type prefix (e.g., GlobalID("SomeOtherType", "5")) and the code would still look up SandboxConfig row with id=5, bypassing the type safety that GlobalIDs are meant to provide.

This is inconsistent with the same PR's handling in evaluator_mutations.py, which correctly uses from_global_id_with_expected_type:

output_configs=output_configs,
input_mapping=input.input_mapping.to_orm()
if input.input_mapping is not None
else InputMapping(literal_mapping={}, path_mapping={}),

The fix is to replace from_global_id with from_global_id_with_expected_type and validate against SandboxConfig.__name__ (you'll also need to ensure from_global_id_with_expected_type and SandboxConfig are imported in this file):

sandbox_config_db_id = from_global_id_with_expected_type(
    inline_code_evaluator.sandbox_config_id, SandboxConfig.__name__
)

Copy link
Copy Markdown
Contributor

@anticorrelator anticorrelator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error Handling & User Feedback

  • app/src/components/evaluators/CodeEvaluatorLanguageSandboxFields.tsx:118 — silent null save unlinks the sandbox in DB; next scheduled run fails with an error the user cannot explain
  • src/phoenix/server/api/mutations/chat_mutations.py:4318 — five distinct misconfigurations produce the same 'please select' message; users whose provider was disabled cannot self-diagnose
  • src/phoenix/server/api/mutations/chat_mutations.py:4286 — misrouted GlobalIDs produce less-specific errors and skip type validation used consistently in every other evaluator mutation

Schema & Validation Consistency

  • src/phoenix/server/api/evaluators.py:2422 — dataset examples with partial fields fail apply_input_mapping even though JS destructuring handles missing keys gracefully; schema and TS footer contradict each other
  • app/src/components/evaluators/codeEvaluatorUtils.ts:52 — code the editor's autocomplete promotes is rejected at runtime; editor and backend accept different function-parameter grammar

UI Regression & Accessibility

  • app/src/components/evaluators/EditCodeEvaluatorDialogContent.tsx:412 — existing descriptions survive Save but cannot be edited; field was silently dropped from the new header layout
  • app/src/components/evaluators/EditCodeEvaluatorDialogContent.tsx:720 — screen readers announce the remove button as just 'button'; it is the only unlabeled destructive control in the editor (same gap in CodeEvaluatorTestSection.tsx)
  • app/src/components/dataset/CreateCodeDatasetEvaluatorSlideover.tsx:268 — fresh-install users cannot draft any evaluator code; the empty-sandbox gate hides the entire editor form

Test Coverage

  • src/phoenix/server/api/mutations/chat_mutations.py:281 — five guard paths and the happy path have no resolver tests; regressions in inline_code_evaluator ship without any CI signal

Additional comments

  • src/phoenix/server/api/mutations/chat_mutations.py:4286 — The inline branch decodes the sandbox GlobalID with from_global_id(...) and discards type_name, diverging from every other branch of evaluator_previews and from the sister create_code_evaluator / update_code_evaluator mutations edited in this same PR — all of which use from_global_id_with_expected_type(..., SandboxConfig.__name__). No security boundary is crossed today (SandboxConfig has no owner column, mutation is auth-gated) but the inconsistency is a defense-in-depth gap and produces less-specific errors for misrouted IDs. Replace with the expected-type helper.
  • src/phoenix/server/api/mutations/chat_mutations.py:4318sandbox_backend is None funnels five distinct misconfigurations (none selected, config missing, config disabled, provider missing, provider disabled) into the same BadRequest("No sandbox backend configured for language 'X'..."). Users who did select a sandbox and whose provider was disabled out from under them see 'please select a sandbox configuration' even though they already have. The language-mismatch case gets a dedicated error mid-block — extend the same treatment to the other branches so users can self-diagnose.
  • app/src/components/evaluators/codeEvaluatorUtils.ts:52 — The editor-side extractTypeScriptVariables accepts non-destructured shapes like function evaluate(ctx) { return ctx.output... } and auto-populates mappings from property access. The backend _infer_typescript_evaluate_input_schema in src/phoenix/server/api/evaluators.py:~2445 rejects that same signature with 'Use a destructured object parameter'. The editor now guides users into writing code the runtime refuses. Either share a grammar between the two, or have the UI consume the server's schema/validation result.

return (
<View>
<ComboBox
label="Sandbox"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When mapSandboxConfigOptions drops the provider/config for a saved evaluator (provider disabled, backend unavailable), the Edit dialog falls through to selectedSandboxConfigId = null with no user-visible signal. On Save, { sandboxConfigId: null } is sent and evaluator_mutations.py treats the null as an explicit clear — silently unlinking the sandbox in the DB. The next scheduled run will fail with 'Please select a sandbox configuration'. Surface the invalidated selection (inline warning + keep the ID in a disabled/informational state) or omit sandboxConfigId from the mutation when the user didn't touch it.

/**
* Compact header bar with inline name, language, and sandbox fields
*/
const CompactHeaderBar = ({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new CompactHeaderBar drops EvaluatorDescriptionInput. The store and GraphQL mutations still carry description (edit hydrates it, create initializes to ""), so existing values round-trip on Save but cannot be edited from the UI — a regression relative to the prior layout, which rendered a description field next to the name. Restore the description input (inline in the header or inside a disclosure in the sidebar).

updateValues(
values.filter((_, valueIndex) => valueIndex !== index)
);
if (values.length <= 2) return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The categorical-choice remove button renders only <Icon svg={<Icons.TrashOutline />} /> with no aria-label or visually-hidden label. Screen readers announce 'button' with no action hint — and the destructive remove is now the only unlabeled control in the editor while + Add choice retains its label. Add aria-label="Remove choice". The result-card close IconButton in CodeEvaluatorTestSection.tsx has the same problem.

_TYPESCRIPT_ARROW_SIGNATURE_RE = re.compile(r"(?:const|let|var)\s+evaluate\s*=\s*\(([^)]*)\)\s*=>")


def _extract_typescript_object_parameter_keys(params: str) -> tuple[list[str], list[str]]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_extract_typescript_object_parameter_keys marks every destructured key required unless it has = or ?. The default template function evaluate({ output, reference, input, metadata }: EvaluatorParams) (shipped in DEFAULT_CODE_EVALUATOR_SOURCE) therefore produces an input schema where all four are required, while the generated type footer renders them as optional (output?: Output; reference?: Reference; …). A dataset example containing only output + reference can fail apply_input_mapping validation even though JS destructuring of a missing key just yields undefined. Treat bare destructured names as optional at the schema level, or honor TS ? / default annotations only.

for eval_result in eval_results:
all_results.append(_to_evaluation_result(eval_result, eval_result["name"]))

elif inline_code_evaluator := evaluator_input.inline_code_evaluator:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ~80-LOC inline_code_evaluator branch has no resolver tests. Five BadRequest paths (sandbox_config_id None, config missing, disabled config, disabled provider, language mismatch) plus the happy path are unverified — test_code_evaluator_sandbox_mutations.py covers create/update/stored-preview flows, test_code_evaluator_runner.py covers the runner in isolation, and Playwright only hits one happy path indirectly. Add a resolver test per guard plus a success-path test; a regression that swallows any guard into a 500 or a silent pass currently ships without CI signal.

});
};

// Show empty state if no sandbox configs are available
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sandboxConfigs.length === 0 short-circuit renders only a 'Configure Sandbox' CTA — the user cannot see the editor, write code, or configure outputs. The backend still accepts a null sandbox_config_id on create (evaluator_mutations.py:~1187), so this is a UI-only gate. First-time users on a fresh install cannot draft any work. Either render the form with a disabled sandbox selector and inline banner, or explicitly confirm this is an intentional guard-rail (then mirror the same behavior in the edit flow for consistency).

@github-project-automation github-project-automation Bot moved this from 📘 Todo to 👍 Approved in phoenix Apr 17, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@cephalization cephalization merged commit 23e4726 into version-sandboxes Apr 20, 2026
39 checks passed
@cephalization cephalization deleted the cephalization/improve-code-evaluator-editor-ux branch April 20, 2026 19:00
@github-project-automation github-project-automation Bot moved this from 👍 Approved to ✅ Done in phoenix Apr 20, 2026
anticorrelator added a commit that referenced this pull request Apr 20, 2026
…sync factory

Base PR #12690 added _resolve_inline_code_evaluator_backend with a sync
call to get_or_create_backend. Our branch made that factory async with
required session/decrypt params for DB-backed credential resolution, so
the call site needs the same conversion applied at the other two sites:
await, session+decrypt, MissingSecretError/UnsupportedOperation/
ValidationError mapping to BadRequest, and admin-provider-wins merge order.
cephalization pushed a commit that referenced this pull request Apr 22, 2026
…sync factory

Base PR #12690 added _resolve_inline_code_evaluator_backend with a sync
call to get_or_create_backend. Our branch made that factory async with
required session/decrypt params for DB-backed credential resolution, so
the call site needs the same conversion applied at the other two sites:
await, session+decrypt, MissingSecretError/UnsupportedOperation/
ValidationError mapping to BadRequest, and admin-provider-wins merge order.
anticorrelator added a commit that referenced this pull request Apr 23, 2026
…12655)

* feat(sandbox): inject user-defined env vars and secret refs into sandbox execution

Extends per-adapter config schemas with env_vars (literal values or references to
the encrypted Secret table), internet_access, and dependencies fields. Wires
resolution through get_or_create_backend and forwards merged env to all seven
adapters. Adds SandboxConfigDialog UI for managing entries with a secret picker,
capability advertisement per adapter, and end-to-end tests.

* fix(sandbox): align test adapter signatures and ruff formatting

Add user_env parameter to test-only SandboxAdapter subclasses to match
the updated base signature, and apply ruff formatting to env_var_end_to_end.

* fix(sandbox): unpack packages list in daytona pip install command

Generated subprocess.run args were embedding self._packages as a nested
list literal, which raises TypeError at runtime. Use * unpacking so each
package becomes its own argv element.

Reported by Claude Code Review on PR #12655.

* feat(sandbox): resolve provider credentials via DB secrets + env fallback with GQL mutations

Adds 2-tier credential resolution (DB Secret → env var) for sandbox
provider authentication, centralized in `get_or_create_backend()` so
adapters stay credential-agnostic. Introduces `EnvVarSpec` as the
declarative contract between the factory and each adapter, and exposes
`setSandboxCredential` / `deleteSandboxCredential` GraphQL mutations
(with `env_var_specs` validation and `_BACKEND_CACHE` invalidation) so
admins can rotate sandbox credentials via the UI without a server
restart.

* feat(sandbox): unified config contract — fail-closed runtime + preserve-on-save UI

Lock the unified env_vars/dependencies/internet_access contract across all
sandbox adapters. AdapterMetadata is the canonical source of capability
advertisement; build_backend MUST raise UnsupportedOperation for any
capability-gated input the adapter cannot fulfill (no silent drops).

Backend reconciliation:
- MODAL: flip supports_env_vars=True (runtime already worked); make
  ModalSandboxBackend.execute() raise UnsupportedOperation for per-call
  env overrides (was a silent warning); session-level user_env preserved
- E2B/Vercel-Py/Vercel-TS/Modal: reject non-empty `dependencies` in
  build_backend
- E2B/Daytona/Modal/Vercel-Py/Vercel-TS/Deno: reject non-"none"
  `internet_access` in build_backend
- VercelPython/VercelTypescript/Deno configs: add Optional
  internet_access field for shape uniformity

Frontend (SandboxConfigDialog):
- formValuesToConfigPatch (extracted to utils.tsx for testability):
  deep-merge form state into a clone of the raw stored config; preserves
  capability-gated keys when flag is false or activeBackend is undefined,
  closing the False→True→False data-loss path
- Advanced Config JSON editor explicitly strips capability-gated keys on
  submit; structured editors are the only authors
- Hidden capability sections render a single muted "Not supported by the
  selected backend." placeholder instead of being omitted
- ALLOWLIST internet_access mode marked reserved-for-future

Tests:
- New parametrized test_unified_config_contract.py iterates over
  SANDBOX_ADAPTER_METADATA.keys() and enforces flag/config-model/runtime
  agreement for all three capabilities; new adapters must opt-in to pass
- test_capability_advertisement.py and test_queries.py refactored to
  parametrize from the same registry source — eliminates duplicate
  hard-coded matrices
- test_user_env_forwarding.py covers MODAL's new execute-time policy
- New formValuesToConfigPatch.test.ts covers the three preserve-on-save
  data-loss scenarios

Documents the contract on AdapterMetadata with cross-references from
SandboxAdapter.build_backend and validate_config so future adapters
inherit the rules from source.

* fix(sandbox): enforce dependencies/internet_access guards for DENO and WASM

The unified config contract tests expect adapters with dependencies_language=None
or internet_access="none" in SANDBOX_ADAPTER_METADATA to raise UnsupportedOperation
when build_backend is called with a conflicting config. DENO was missing the
packages guard; WASM was missing both packages and internet_access guards.
Mirrors the existing pattern in e2b/modal/vercel backends.

* feat(sandbox): enforce capability gates and reserved credential names in unified config

- Rename EnvVarSpec → ProviderCredentialSpec and
  AdapterMetadata.internet_access → internet_access_capability to
  disambiguate adapter capability flags from the runtime config block.
- validate_config now enforces capability gates (env_vars,
  internet_access, dependencies) and unique env_var names at write time
  rather than relying on per-adapter build_backend guards alone.
- Execute-code path: drop per-call env override; user env is carried by
  the adapter from build_backend() for the session lifetime.
- Reserved-name enforcement is case-insensitive and now covers both
  config.env_vars entries and top-level config keys, preventing
  credential shadowing via config's extra="allow" schema.
- setSecrets invalidates backend caches for any SandboxConfig whose
  env_vars reference a rotated or deleted secret_key.
- Add tests for reserved credential names and sandbox cache
  invalidation on secret rotation.

* fix(sandbox): convert rebased inline-evaluator preview call site to async factory

Base PR #12690 added _resolve_inline_code_evaluator_backend with a sync
call to get_or_create_backend. Our branch made that factory async with
required session/decrypt params for DB-backed credential resolution, so
the call site needs the same conversion applied at the other two sites:
await, session+decrypt, MissingSecretError/UnsupportedOperation/
ValidationError mapping to BadRequest, and admin-provider-wins merge order.

* fix(sandbox): map invalid sandbox_config_id to BadRequest in preview mutation

from_global_id_with_expected_type raises ValueError on type mismatch;
strawberry swallows that as "an unexpected error occurred", which fails
test_rejects_wrong_global_id_type's message assertion. Wrap the call at
the preview site to surface the real error to the client.

* fix(sandbox): PR #12655 review followups — capability contract polish

B1 (blocker): widen update_sandbox_config to catch ValidationError →
BadRequest, matching create_sandbox_config; capability-gate violations
no longer escape as 500s.

I1: relax update-path capability gate to semantically-changed sections
only via stored_config baseline; preserve-on-save no longer blocks
unrelated edits on flag-false adapters. Runtime build_backend guard
unchanged (fail-closed preserved).

I2: two-layer reserved-credential guard — _check_env_var_collision
rejects secret_ref.secret_key collisions at mutation layer;
_resolve_user_env pre-check rejects at runtime before DB lookup,
closing the pre-existing-row exploit path. Long-term namespace split
deferred to associate-secret-binder-identity-sandbox-config-pe.

I4: DaytonaSandboxBackend.execute now installs packages in the
ephemeral branch before code_run, honoring the advertised
dependencies_language='PYTHON' on every execution path.

I5/I6/I7: drop vestigial credential_specs from ModalAdapter and
DenoAdapter (Modal reads env directly per D6; Deno is local-only);
refresh AdapterMetadata docstring to describe constructor-time user_env
injection; extend _PHOENIX_SANDBOX_FALLBACK_CREDENTIAL_KEYS with
MODAL_TOKEN_ID/SECRET to preserve reserved-name coverage.

I8: rewrite getBackendDescription on actual registry keys
(DAYTONA_PYTHON, VERCEL_PYTHON, VERCEL_TYPESCRIPT, MODAL); align
ENV_PHOENIX_SANDBOX_PROVIDER docstring. Note Vercel
internet_access_capability as scaffolding pending runtime wiring.

* fix(sandbox): align test expectations with no-per-call-env design

- drop test_execute_raises_unsupported_operation_when_env_passed:
  execute() signature takes no env kwarg; env is passed at
  build_backend() time per types.SandboxBackend.execute docstring
- fake E2BAdapter now raises ValueError (caught by
  sandboxBackends resolver) instead of NotImplementedError, so the
  backend-availability probe downgrades to UNAVAILABLE rather than
  bubbling an unhandled 500

* fix(sandbox): PR #12655 simplification — correctness fixes + test consolidation

Phase 1 (correctness): replace hand-rolled equality helpers with
pydantic-canonicalized `_normalize_section` (via `model_dump`); future field
additions to `InternetAccessConfig` / `PythonDependenciesConfig` now track
automatically. `_env_vars_equal` switches from `frozenset` to `Counter` so
duplicate multiplicity is preserved. Extend runtime `_resolve_user_env` to
reject reserved `EnvVarLiteral.name` symmetrically with the existing
`EnvVarSecretRef.secret_key` check, closing the pre-mutation-guard-row
exposure D7 was meant to close.

Phase 2 (test consolidation): drop three integration-layer duplicate files
(`test_env_var_end_to_end.py`, `test_credential_resolution_integration.py`,
`test_reserved_credential_names.py`) after folding unique scenarios into
their unit-level counterparts; prune pydantic-framework-rechecking tests
(`test_unknown_keys_preserved`, `test_empty_config_returns_defaults`, trivial
round-trips) from `test_sandbox_config_validation.py`; drop `backend._user_env`
internal-attribute assertions from `test_user_env_forwarding.py`; replace
`_BACKEND_CACHE` dict-shape peeks in `test_sandbox_cache_invalidation.py` with
identity-based observable checks; parametrize per-adapter shape assertions in
`test_capability_advertisement.py` over `SANDBOX_ADAPTER_METADATA.keys()`; add
module-level docstrings to `test_daytona_backend.py` / `test_vercel_backend.py`
/ `test_modal_backend.py` stating the per-adapter-files-only-for-unique-lifecycle
principle; move `test_sandbox_config_validation.py` from `tests/unit/server/api/`
to `tests/unit/server/sandbox/` (tests sandbox pydantic models, not API).

Phase 3 (clarity): rename `_PHOENIX_SANDBOX_FALLBACK_CREDENTIAL_KEYS` →
`_PHOENIX_RESERVED_CREDENTIAL_ONLY_KEYS` with an inline docstring distinguishing
it from `SandboxAdapter.credential_specs` and `RESERVED_CREDENTIAL_NAMES`.

Net: +599 / -718. All 429 sandbox-relevant tests pass.

* test(sandbox): re-scope test surface to durable interfaces — −2,013 lines

PR #12655 Phase 2 re-scope. Prior Phase 2 (commit 68d130c) netted only
124 lines of reduction vs a target of 1,100–1,900 because "fold unique
scenarios" was interpreted too conservatively. This pass recalibrates on a
sharper thesis: keep the **mutation boundary** (gql_client tests) and the
**cross-adapter conformance** parametrized over
`SANDBOX_ADAPTER_METADATA.keys()`. Delete per-adapter SDK mocks,
internal-factory scenario walks, and pydantic framework round-trips —
they re-verify library behavior or couple to implementation details a
reasonable refactor would change.

Phase 1 — delete per-adapter SDK test files (−1,181 lines):
- tests/unit/server/sandbox/test_daytona_backend.py (111 → 0)
- tests/unit/server/sandbox/test_vercel_backend.py (425 → 0)
- tests/unit/server/sandbox/test_modal_backend.py (246 → 0)
- tests/unit/server/sandbox/test_user_env_forwarding.py (399 → 0)
Cross-adapter conformance (test_unified_config_contract.py +
test_capability_advertisement.py) already parametrizes over
SANDBOX_ADAPTER_METADATA.keys(); per-adapter SDK forwarding mocks add no
observable contract coverage.

Phase 2 — shrink test_env_var_resolution.py (598 → 155, −443):
Keep the reserved-name defense (TestResolveUserEnvReservedSecretKey),
is_reserved_credential_name helper contract
(TestReservedCredentialNameHelper),
_enforce_unique_env_var_names validator
(TestValidateConfigRejectsDuplicateEnvVarNames, collapsed), and one
_resolve_user_env shape happy path. Delete TestEndToEndResolution
(5 tests absorbed from prior integration fold — scenario walks; covered
by mutation-layer cache-invalidation suite) and all per-kind/per-scenario
resolution walks.

Phase 3 — shrink test_credential_resolution.py (286 → 66, −220):
Keep two direct _resolve_sandbox_credentials invariants — DB wins over
env, and missing-key is omitted (not None). Delete
TestCredentialResolutionIntegration (3 tests absorbed from prior
integration fold — factory-chain walks covered by mutation-layer
cache-invalidation tests) and the `_CapturingAdapter` scaffolding used
only by them.

Phase 4 — shrink test_sandbox_config_validation.py (495 → 353, −142):
Keep Phase 1 equality-helper classes byte-identical
(TestEnvVarsEqual, TestInternetAccessEqual, TestPackagesEqual),
TestValidateConfigErrorPath (ValueError/ValidationError wrapping),
TestEnvVarEntryDiscriminatedUnion (extra="forbid" + secret_ref shape),
TestConfigFieldSpecsFromModel (reduced to one representative adapter
plus the distinct schema-branch cases), and one forbid-extra assertion
per nested leaf model. Delete the seven per-adapter pydantic-schema
round-trip classes — pydantic schema re-verification is framework
behavior, not authored invariant.

Phase 5 — light prune mutation-layer sandbox tests (−30 lines,
conservative per plan D4):
- test_sandbox_config_mutations.py (509 → 479): drop
  test_reserved_modal_token_id_rejected. The MODAL_TOKEN_ID reserved-name
  case is covered parametrically in
  tests/unit/server/api/mutations/test_reserved_credential_names.py for
  the env_var[].name position; the secret_ref.secret_key mechanism is
  already tested via VERCEL_TOKEN in the same class. No observable
  coverage lost.
No other mutation-layer files modified — plan D4 mandates light prune
only; these are the durable GQL boundary tests.

Housekeeping — test_unified_config_contract.py docstring updated: prior
text pointed positive-path coverage at the now-deleted
test_user_env_forwarding.py and per-adapter SDK files. New docstring
states the contract honestly: structural registry + unsupported-capability
rejection are covered here; positive SDK-forwarding mock coverage is
intentionally absent.

Verification:
- Full sandbox + mutation-layer suite: 126 passed, 53 skipped (postgres
  --db sqlite), 0 failed.
- Cross-adapter conformance
  (test_sandbox_backends_capability_flags_per_adapter,
  test_env_vars_false_raises_for_non_empty_user_env, etc.) still
  collects one case per adapter key in SANDBOX_ADAPTER_METADATA.keys()
  — all 7 adapters (WASM, E2B, DAYTONA_PYTHON, VERCEL_PYTHON,
  VERCEL_TYPESCRIPT, DENO, MODAL) present.
- pytest --collect-only -q delta (target file list):
  297 → 179 tests collected (−118 tests).
- Line count delta (target file list):
  4,498 → 2,485 lines (−2,013 lines, 105 insertions / 2,118 deletions).

Note on outcome vs target: plan target was ~2,000–2,100 total; final is
2,485. The gap is in Phase 4 (ended at 353 vs ~200) because the plan
also mandates "Kept classes are byte-identical to the originals (no
accidental edits to custom-validator logic)" for the equality-helper
classes — those alone are ~75 lines. Phase 4's keep-set was fully
satisfied; the ~200 target was aspirational. Phase 5's −30 is under the
−50 floor but within the spirit of the plan's "conservative — these are
the boundary tests we are deliberately keeping."

Commands used:
- uv run pytest tests/unit/server/sandbox/ \
    tests/unit/server/api/mutations/test_sandbox_*.py \
    tests/unit/server/api/mutations/test_reserved_credential_names.py \
    tests/unit/server/api/test_capability_advertisement.py
- uv run pytest --collect-only -q <same paths>

* Fix auth guard, improve config dialog styles

* Filter sandboxes settings tab when permissions are not satisfied

* fix(sandbox): align rebased branch with required EvaluatorInputMappingInput fields

PR #12727 on main made literal_mapping / path_mapping required on
EvaluatorInputMappingInput to fix GraphQL introspection. After rebase,
our handlers that relied on the removed default-empty fallback no
longer type-check, and tests that omitted inputMapping failed schema
validation.

Apply option (b) consistently: raise BadRequest when input_mapping is
None across all six evaluator create/update handlers rather than
silently defaulting to empty mappings. Update tests to supply explicit
inputMapping values. Also fix two surfaced typing issues in sandbox
config: dict(JSON) casts and JSON return-type casts.

* fix(integration-tests): supply inputMapping to dataset evaluator fixture

Commit d725531 made inputMapping a required field on
createDatasetLlmEvaluator after the EvaluatorInputMappingInput schema
tightened in PR #12727. Unit tests were updated there; the integration
fixture was missed, so test_experiment_with_evaluators fails with
"input_mapping is required". Pass empty literal/path mappings — the
per-evaluator inputMapping used at subscription time is supplied
separately in the subscription variables.

---------

Co-authored-by: Tony Powell <apowell@arize.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

2 participants