Skip to content

feat: Add property tests for evaluation safety and determinism (#133)#134

Merged
andreasronge merged 2 commits intomainfrom
claude/issue-133-20251209-100305
Dec 9, 2025
Merged

feat: Add property tests for evaluation safety and determinism (#133)#134
andreasronge merged 2 commits intomainfrom
claude/issue-133-20251209-100305

Conversation

@andreasronge
Copy link
Copy Markdown
Owner

Summary

Implements property tests for evaluation safety and determinism as specified in issue #133.

Changes:

  • Added "evaluation safety" property test that validates generated PTC-Lisp expressions evaluate without crashing the interpreter
  • Added "determinism" property test that verifies identical inputs produce identical outputs across multiple runs
  • Both properties dynamically extract tool names from generated source code and provide appropriate mock implementations

Test Results

  • All 30 property tests pass (including 2 new properties)
  • All 872 existing tests pass
  • mix precommit passes all quality checks

Implementation Notes

  • Tool functions are properly mocked to handle dynamically generated tool names extracted via regex from source code
  • Both properties gracefully handle runtime errors during evaluation using try-catch blocks
  • The properties generate expressions with depth=2 to keep generation tractable while exploring the DSL space

Fixes #133

Adds two new describe blocks to test/support/lisp_generators_test.exs:

1. "evaluation safety" - Property test that verifies generated PTC-Lisp
   expressions evaluate without crashing the interpreter. Handles both
   successful evaluation and expected runtime errors gracefully.

2. "determinism" - Property test that verifies the same input always
   produces the same output across multiple runs with identical context
   and tools.

Both properties dynamically extract tool names from generated source code
and provide matching mock tool implementations to ensure evaluation can
complete.

Fixes #133

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 9, 2025

PR Review: Add property tests for evaluation safety and determinism

Summary

This PR adds two new property tests ("evaluation safety" and "determinism") as specified in issue #133. The implementation correctly follows the spec from docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4, with one deviation to dynamically handle tool names extracted from generated source code.

What's Good

  • Correctly implements both property tests in the existing test/support/lisp_generators_test.exs file
  • Tool mock signature uses single _args argument, matching the actual API
  • Dynamic tool extraction via regex is a reasonable approach to handle tool names generated by gen_tool_call
  • Both properties use depth=2 as recommended in the spec for tractable generation
  • Proper use of try/rescue to handle runtime errors gracefully

Issues (Must Fix)

  1. Code duplication: tool setup logic - test/support/lisp_generators_test.exs:270-278 and test/support/lisp_generators_test.exs:302-309

    • Problem: The tool extraction logic is duplicated between the two property tests:
      tools =
        Regex.scan(~r/\(call "([^"]+)"/, source)
        |> Enum.reduce(tools, fn [_full, tool_name], acc ->
          Map.put_new(acc, tool_name, fn _args -> :result end)
        end)
    • Impact: Violates DRY principle per development-guidelines.md
    • Fix: Extract to a helper function:
      defp build_tools_for_source(source, default_result \\ :result) do
        base_tools = %{"test_tool" => fn _args -> default_result end}
      
        Regex.scan(~r/\(call "([^"]+)"/, source)
        |> Enum.reduce(base_tools, fn [_full, tool_name], acc ->
          Map.put_new(acc, tool_name, fn _args -> default_result end)
        end)
      end
    • Complexity: Mechanical (simple extraction)
  2. Code duplication: try/rescue evaluation pattern - test/support/lisp_generators_test.exs:280-286 and test/support/lisp_generators_test.exs:311-316, 318-323

    • Problem: The try/rescue pattern for safe evaluation is duplicated 3 times:
      result =
        try do
          PtcRunner.Lisp.run(source, context: ctx, tools: tools)
        rescue
          _e -> {:error, :runtime_error}
        end
    • Impact: Violates DRY principle
    • Fix: Extract to a helper function:
      defp safe_run(source, opts) do
        try do
          PtcRunner.Lisp.run(source, opts)
        rescue
          _e -> {:error, :runtime_error}
        end
      end
    • Complexity: Mechanical (simple extraction)

Suggestions (Optional)

  1. OPTIONAL: Consider unifying default tool names - Nice-to-have
    • The evaluation safety test uses "test_tool" as base while determinism uses "tool". This inconsistency is harmless but could be unified for clarity.
    • Complexity: Trivial

Security

No concerns. Property tests run in test environment only.

Documentation

No updates needed. These are internal test additions that do not affect public API.

Verdict

Request Changes - The duplicated code should be extracted into helper functions before merge. Both extractions are mechanical and follow the same pattern the codebase uses elsewhere (the file already has valid_ast?/1 and ast_equivalent?/2 helpers).

@github-actions github-actions bot added the auto-triage-pending Auto-triage in progress label Dec 9, 2025
@andreasronge
Copy link
Copy Markdown
Owner Author

@claude please fix these issues from PR review:

Issue 1: Tool setup logic duplication

Location: test/support/lisp_generators_test.exs:270-278 and test/support/lisp_generators_test.exs:302-309
Problem: The tool extraction logic using regex is duplicated between the two property tests
Fix: Extract to a private helper function. The reviewer suggested:

defp build_tools_for_source(source, default_result \\ :result) do
  base_tools = %{"test_tool" => fn _args -> default_result end}

  Regex.scan(~r/\(call "([^"]+)"/, source)
  |> Enum.reduce(base_tools, fn [_full, tool_name], acc ->
    Map.put_new(acc, tool_name, fn _args -> default_result end)
  end)
end

Then replace both duplicated blocks with calls to this helper.

Issue 2: Try/rescue evaluation pattern duplication

Location: test/support/lisp_generators_test.exs:280-286, 311-316, and 318-323
Problem: The try/rescue pattern for safe evaluation is duplicated 3 times across the two property tests
Fix: Extract to a private helper function. The reviewer suggested:

defp safe_run(source, opts) do
  try do
    PtcRunner.Lisp.run(source, opts)
  rescue
    _e -> {:error, :runtime_error}
  end
end

Then replace all three duplicated try/rescue blocks with calls to this helper.

Please fix ALL issues above in a SINGLE commit. Make minimal changes to resolve each issue.

@andreasronge
Copy link
Copy Markdown
Owner Author

Auto-Triage Summary

Decisions Made

# Issue Decision Action
1 Tool setup logic duplication (lines 270-278, 302-309) FIX_NOW Included in fix request above
2 Try/rescue evaluation pattern duplication (lines 280-286, 311-316, 318-323) FIX_NOW Included in fix request above
3 Consider unifying default tool names ("test_tool" vs "tool") IGNORE Reviewer marked as OPTIONAL/nice-to-have; harmless inconsistency

Reasoning

FIX_NOW decisions:

  • Both issues are mechanical refactorings requiring no design decisions
  • The file already follows this pattern with existing helpers (valid_ast?/1, ast_equivalent?/2)
  • Both duplications are in-scope (within the PR's new property tests)
  • Extracting common logic aligns with development-guidelines.md DRY principle (line 91-93)

IGNORE decision:

  • Reviewer explicitly marked the tool name suggestion as "OPTIONAL: Consider"
  • The inconsistency is harmless and doesn't affect functionality
  • Not worth the cognitive overhead for this minor stylistic difference

Status

  • FIX_NOW items: 2 (batched in single fix comment above)
  • Issues created: 0
  • Items ignored: 1 (optional suggestion)

@github-actions github-actions bot added auto-triage-cycle-1 Auto-triage cycle 1 auto-triage-complete Auto-triage completed and removed auto-triage-pending Auto-triage in progress labels Dec 9, 2025
…nerators_test.exs

- Extract tool extraction logic into `build_tools_for_source/2` helper
  (previously duplicated in evaluation safety and determinism tests)
- Extract try/rescue pattern into `safe_run/2` helper
  (previously duplicated 3 times across both tests)

Resolves PR review issues from #134.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@andreasronge
Copy link
Copy Markdown
Owner Author

Fixed PR Review Issues

All requested issues from the PR review have been fixed in commit 379154b:

Issue 1: Tool setup logic duplication ✅

  • Extracted build_tools_for_source/2 helper function
  • Replaced both duplicated blocks (lines 270-278 and 302-309)
  • Default argument :result is used by the evaluation safety test

Issue 2: Try/rescue evaluation pattern duplication ✅

  • Extracted safe_run/2 helper function using implicit try syntax
  • Replaced all 3 duplicated try/rescue blocks (lines 280-286, 311-316, and 318-323)
  • Cleaned up the code from 40 lines down to 21 lines while maintaining functionality

Test Results

  • All precommit checks pass
  • 872 tests pass, 0 failures
  • Credo code quality checks pass
  • Code formatting verified

The changes follow DRY principles and maintain consistency with existing helper functions in the file.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 9, 2025

PR Review: Add property tests for evaluation safety and determinism

Summary

This PR adds two property tests for evaluation safety and determinism as specified in issue #133. The implementation correctly follows the spec from docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4. The code duplication issues from the previous review have been properly addressed.

What's Good

  • Both property tests correctly implemented in the existing test/support/lisp_generators_test.exs file
  • Tool mock signature uses single _args argument, matching the actual API (fixes the spec's incorrect 2-arg signature)
  • Dynamic tool extraction via regex handles dynamically generated tool names from gen_tool_call
  • Depth=2 used as recommended for tractable generation
  • Helper functions build_tools_for_source/2 and safe_run/2 properly extracted (no code duplication)
  • Context maps provide realistic test data (items, user, x)
  • Implicit rescue syntax is clean and idiomatic

Issues (Must Fix)

None.

Suggestions (Optional)

None.

Security

No concerns. Property tests run in test environment only.

Documentation

No updates needed. These are internal test additions that do not affect public API.

Verdict

Approve - The implementation matches the spec requirements from issue #133 and docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4. Previous review issues have been addressed. The code is clean, well-structured, and follows existing patterns in the test file.

@github-actions github-actions bot added the auto-triage-pending Auto-triage in progress label Dec 9, 2025
@andreasronge
Copy link
Copy Markdown
Owner Author

Auto-Triage Summary (Cycle 1)

Review Status

Latest review verdict: APPROVE

The PR has been successfully reviewed and all previous issues have been resolved:

  • Issue 1 (tool setup duplication) was fixed in commit 379154b
  • Issue 2 (try/rescue pattern duplication) was fixed in commit 379154b
  • All tests pass (872 tests, 0 failures)
  • All CI checks pass

Decisions Made

# Issue Decision Action
N/A No issues in latest review N/A PR is approved and ready to merge

Status

  • FIX_NOW items: 0
  • Issues created: 0
  • Items ignored: 0

Next Steps

Adding ready-to-merge label. Auto-merge will handle merging once all required checks pass.

@andreasronge andreasronge added the ready-to-merge PR is ready to be merged label Dec 9, 2025
@github-actions github-actions bot added auto-triage-cycle-2 Auto-triage cycle 2 and removed auto-triage-cycle-1 Auto-triage cycle 1 auto-triage-pending Auto-triage in progress labels Dec 9, 2025
@andreasronge andreasronge enabled auto-merge (squash) December 9, 2025 10:11
@andreasronge andreasronge merged commit 910ec7d into main Dec 9, 2025
4 checks passed
@andreasronge andreasronge deleted the claude/issue-133-20251209-100305 branch December 9, 2025 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-triage-complete Auto-triage completed auto-triage-cycle-2 Auto-triage cycle 2 ready-to-merge PR is ready to be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Property Testing] Add basic property tests for roundtrip, safety, and determinism

1 participant