feat: Add property tests for evaluation safety and determinism (#133) by andreasronge · Pull Request #134 · andreasronge/ptc_runner

andreasronge · 2025-12-09T10:03:25Z

Summary

Implements property tests for evaluation safety and determinism as specified in issue #133.

Changes:

Added "evaluation safety" property test that validates generated PTC-Lisp expressions evaluate without crashing the interpreter
Added "determinism" property test that verifies identical inputs produce identical outputs across multiple runs
Both properties dynamically extract tool names from generated source code and provide appropriate mock implementations

Test Results

All 30 property tests pass (including 2 new properties)
All 872 existing tests pass
mix precommit passes all quality checks

Implementation Notes

Tool functions are properly mocked to handle dynamically generated tool names extracted via regex from source code
Both properties gracefully handle runtime errors during evaluation using try-catch blocks
The properties generate expressions with depth=2 to keep generation tractable while exploring the DSL space

Fixes #133

Adds two new describe blocks to test/support/lisp_generators_test.exs: 1. "evaluation safety" - Property test that verifies generated PTC-Lisp expressions evaluate without crashing the interpreter. Handles both successful evaluation and expected runtime errors gracefully. 2. "determinism" - Property test that verifies the same input always produces the same output across multiple runs with identical context and tools. Both properties dynamically extract tool names from generated source code and provide matching mock tool implementations to ensure evaluation can complete. Fixes #133 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2025-12-09T10:05:13Z

PR Review: Add property tests for evaluation safety and determinism

Summary

This PR adds two new property tests ("evaluation safety" and "determinism") as specified in issue #133. The implementation correctly follows the spec from docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4, with one deviation to dynamically handle tool names extracted from generated source code.

What's Good

Correctly implements both property tests in the existing test/support/lisp_generators_test.exs file
Tool mock signature uses single _args argument, matching the actual API
Dynamic tool extraction via regex is a reasonable approach to handle tool names generated by gen_tool_call
Both properties use depth=2 as recommended in the spec for tractable generation
Proper use of try/rescue to handle runtime errors gracefully

Issues (Must Fix)

Code duplication: tool setup logic - test/support/lisp_generators_test.exs:270-278 and test/support/lisp_generators_test.exs:302-309

Problem: The tool extraction logic is duplicated between the two property tests:

tools =
  Regex.scan(~r/\(call "([^"]+)"/, source)
  |> Enum.reduce(tools, fn [_full, tool_name], acc ->
    Map.put_new(acc, tool_name, fn _args -> :result end)
  end)

Impact: Violates DRY principle per development-guidelines.md

Fix: Extract to a helper function:

defp build_tools_for_source(source, default_result \\ :result) do
  base_tools = %{"test_tool" => fn _args -> default_result end}

  Regex.scan(~r/\(call "([^"]+)"/, source)
  |> Enum.reduce(base_tools, fn [_full, tool_name], acc ->
    Map.put_new(acc, tool_name, fn _args -> default_result end)
  end)
end

Complexity: Mechanical (simple extraction)

Code duplication: try/rescue evaluation pattern - test/support/lisp_generators_test.exs:280-286 and test/support/lisp_generators_test.exs:311-316, 318-323
- Problem: The try/rescue pattern for safe evaluation is duplicated 3 times:
```
result =
  try do
    PtcRunner.Lisp.run(source, context: ctx, tools: tools)
  rescue
    _e -> {:error, :runtime_error}
  end
```
- Impact: Violates DRY principle
- Fix: Extract to a helper function:
```
defp safe_run(source, opts) do
  try do
    PtcRunner.Lisp.run(source, opts)
  rescue
    _e -> {:error, :runtime_error}
  end
end
```
- Complexity: Mechanical (simple extraction)

Suggestions (Optional)

OPTIONAL: Consider unifying default tool names - Nice-to-have
- The evaluation safety test uses "test_tool" as base while determinism uses "tool". This inconsistency is harmless but could be unified for clarity.
- Complexity: Trivial

Security

No concerns. Property tests run in test environment only.

Documentation

No updates needed. These are internal test additions that do not affect public API.

Verdict

Request Changes - The duplicated code should be extracted into helper functions before merge. Both extractions are mechanical and follow the same pattern the codebase uses elsewhere (the file already has valid_ast?/1 and ast_equivalent?/2 helpers).

andreasronge · 2025-12-09T10:06:35Z

@claude please fix these issues from PR review:

Issue 1: Tool setup logic duplication

Location: test/support/lisp_generators_test.exs:270-278 and test/support/lisp_generators_test.exs:302-309
Problem: The tool extraction logic using regex is duplicated between the two property tests
Fix: Extract to a private helper function. The reviewer suggested:

defp build_tools_for_source(source, default_result \\ :result) do
  base_tools = %{"test_tool" => fn _args -> default_result end}

  Regex.scan(~r/\(call "([^"]+)"/, source)
  |> Enum.reduce(base_tools, fn [_full, tool_name], acc ->
    Map.put_new(acc, tool_name, fn _args -> default_result end)
  end)
end

Then replace both duplicated blocks with calls to this helper.

Issue 2: Try/rescue evaluation pattern duplication

Location: test/support/lisp_generators_test.exs:280-286, 311-316, and 318-323
Problem: The try/rescue pattern for safe evaluation is duplicated 3 times across the two property tests
Fix: Extract to a private helper function. The reviewer suggested:

defp safe_run(source, opts) do
  try do
    PtcRunner.Lisp.run(source, opts)
  rescue
    _e -> {:error, :runtime_error}
  end
end

Then replace all three duplicated try/rescue blocks with calls to this helper.

Please fix ALL issues above in a SINGLE commit. Make minimal changes to resolve each issue.

andreasronge · 2025-12-09T10:06:48Z

Auto-Triage Summary

Decisions Made

#	Issue	Decision	Action
1	Tool setup logic duplication (lines 270-278, 302-309)	FIX_NOW	Included in fix request above
2	Try/rescue evaluation pattern duplication (lines 280-286, 311-316, 318-323)	FIX_NOW	Included in fix request above
3	Consider unifying default tool names ("test_tool" vs "tool")	IGNORE	Reviewer marked as OPTIONAL/nice-to-have; harmless inconsistency

Reasoning

FIX_NOW decisions:

Both issues are mechanical refactorings requiring no design decisions
The file already follows this pattern with existing helpers (valid_ast?/1, ast_equivalent?/2)
Both duplications are in-scope (within the PR's new property tests)
Extracting common logic aligns with development-guidelines.md DRY principle (line 91-93)

IGNORE decision:

Reviewer explicitly marked the tool name suggestion as "OPTIONAL: Consider"
The inconsistency is harmless and doesn't affect functionality
Not worth the cognitive overhead for this minor stylistic difference

Status

FIX_NOW items: 2 (batched in single fix comment above)
Issues created: 0
Items ignored: 1 (optional suggestion)

…nerators_test.exs - Extract tool extraction logic into `build_tools_for_source/2` helper (previously duplicated in evaluation safety and determinism tests) - Extract try/rescue pattern into `safe_run/2` helper (previously duplicated 3 times across both tests) Resolves PR review issues from #134. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

andreasronge · 2025-12-09T10:08:54Z

Fixed PR Review Issues

All requested issues from the PR review have been fixed in commit 379154b:

Issue 1: Tool setup logic duplication ✅

Extracted build_tools_for_source/2 helper function
Replaced both duplicated blocks (lines 270-278 and 302-309)
Default argument :result is used by the evaluation safety test

Issue 2: Try/rescue evaluation pattern duplication ✅

Extracted safe_run/2 helper function using implicit try syntax
Replaced all 3 duplicated try/rescue blocks (lines 280-286, 311-316, and 318-323)
Cleaned up the code from 40 lines down to 21 lines while maintaining functionality

Test Results

All precommit checks pass
872 tests pass, 0 failures
Credo code quality checks pass
Code formatting verified

The changes follow DRY principles and maintain consistency with existing helper functions in the file.

github-actions · 2025-12-09T10:10:19Z

PR Review: Add property tests for evaluation safety and determinism

Summary

This PR adds two property tests for evaluation safety and determinism as specified in issue #133. The implementation correctly follows the spec from docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4. The code duplication issues from the previous review have been properly addressed.

What's Good

Both property tests correctly implemented in the existing test/support/lisp_generators_test.exs file
Tool mock signature uses single _args argument, matching the actual API (fixes the spec's incorrect 2-arg signature)
Dynamic tool extraction via regex handles dynamically generated tool names from gen_tool_call
Depth=2 used as recommended for tractable generation
Helper functions build_tools_for_source/2 and safe_run/2 properly extracted (no code duplication)
Context maps provide realistic test data (items, user, x)
Implicit rescue syntax is clean and idiomatic

Issues (Must Fix)

None.

Suggestions (Optional)

None.

Security

No concerns. Property tests run in test environment only.

Documentation

No updates needed. These are internal test additions that do not affect public API.

Verdict

Approve - The implementation matches the spec requirements from issue #133 and docs/ptc-lisp-property-testing-plan.md sections 4.3-4.4. Previous review issues have been addressed. The code is clean, well-structured, and follows existing patterns in the test file.

andreasronge · 2025-12-09T10:11:18Z

Auto-Triage Summary (Cycle 1)

Review Status

✅ Latest review verdict: APPROVE

The PR has been successfully reviewed and all previous issues have been resolved:

Issue 1 (tool setup duplication) was fixed in commit 379154b
Issue 2 (try/rescue pattern duplication) was fixed in commit 379154b
All tests pass (872 tests, 0 failures)
All CI checks pass

Decisions Made

#	Issue	Decision	Action
N/A	No issues in latest review	N/A	PR is approved and ready to merge

Status

FIX_NOW items: 0
Issues created: 0
Items ignored: 0

Next Steps

Adding ready-to-merge label. Auto-merge will handle merging once all required checks pass.

andreasronge mentioned this pull request Dec 9, 2025

[Property Testing] Add basic property tests for roundtrip, safety, and determinism #133

Closed

4 tasks

github-actions bot added the auto-triage-pending Auto-triage in progress label Dec 9, 2025

github-actions bot added auto-triage-cycle-1 Auto-triage cycle 1 auto-triage-complete Auto-triage completed and removed auto-triage-pending Auto-triage in progress labels Dec 9, 2025

github-actions bot added the auto-triage-pending Auto-triage in progress label Dec 9, 2025

andreasronge added the ready-to-merge PR is ready to be merged label Dec 9, 2025

github-actions bot added auto-triage-cycle-2 Auto-triage cycle 2 and removed auto-triage-cycle-1 Auto-triage cycle 1 auto-triage-pending Auto-triage in progress labels Dec 9, 2025

andreasronge enabled auto-merge (squash) December 9, 2025 10:11

andreasronge merged commit 910ec7d into main Dec 9, 2025
4 checks passed

andreasronge deleted the claude/issue-133-20251209-100305 branch December 9, 2025 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add property tests for evaluation safety and determinism (#133)#134

feat: Add property tests for evaluation safety and determinism (#133)#134
andreasronge merged 2 commits intomainfrom
claude/issue-133-20251209-100305

andreasronge commented Dec 9, 2025

Uh oh!

github-actions bot commented Dec 9, 2025

Uh oh!

andreasronge commented Dec 9, 2025

Uh oh!

andreasronge commented Dec 9, 2025

Uh oh!

andreasronge commented Dec 9, 2025

Uh oh!

github-actions bot commented Dec 9, 2025

Uh oh!

andreasronge commented Dec 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andreasronge commented Dec 9, 2025

Summary

Test Results

Implementation Notes

Uh oh!

github-actions bot commented Dec 9, 2025

PR Review: Add property tests for evaluation safety and determinism

Summary

What's Good

Issues (Must Fix)

Suggestions (Optional)

Security

Documentation

Verdict

Uh oh!

andreasronge commented Dec 9, 2025

Issue 1: Tool setup logic duplication

Issue 2: Try/rescue evaluation pattern duplication

Uh oh!

andreasronge commented Dec 9, 2025

Auto-Triage Summary

Decisions Made

Reasoning

Status

Uh oh!

andreasronge commented Dec 9, 2025

Fixed PR Review Issues

Issue 1: Tool setup logic duplication ✅

Issue 2: Try/rescue evaluation pattern duplication ✅

Test Results

Uh oh!

github-actions bot commented Dec 9, 2025

PR Review: Add property tests for evaluation safety and determinism

Summary

What's Good

Issues (Must Fix)

Suggestions (Optional)

Security

Documentation

Verdict

Uh oh!

andreasronge commented Dec 9, 2025

Auto-Triage Summary (Cycle 1)

Review Status

Decisions Made

Status

Next Steps

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant