
fix: Fix ZMQ context termination deadlock issue #469

Merged
ajcasagrande merged 4 commits into main from ajc/zmq-fix
Nov 14, 2025

Conversation

@ajcasagrande
Contributor

@ajcasagrande ajcasagrande commented Nov 14, 2025

Problem: aiperf hung indefinitely after completing benchmarks due to a ZeroMQ context termination call that blocks in uninterruptible C code, waiting for network sockets that never fully close.

Root Cause: Python's timeout mechanisms cannot interrupt system-level blocking calls, and the singleton context architecture means termination is attempted while other components still hold socket references, creating a deadlock condition.

Solution: Remove the explicit context termination call and delegate cleanup to the operating system on process exit. This is a standard practice for short-lived processes, recommended by the ZeroMQ maintainers. It eliminates the hang, and the kernel still reclaims every socket when the process exits.
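
The root cause described above can be illustrated with a minimal, standard-library sketch. It assumes the timeout was applied with something like `asyncio.wait_for` around a call running in a worker thread (the removed code is not shown in this PR), and it uses `time.sleep` as a hypothetical stand-in for the blocking context termination call. The timeout fires, but the blocked thread keeps running:

```python
import asyncio
import threading
import time


def blocking_call(done: threading.Event) -> None:
    # Hypothetical stand-in for a blocking C-level call such as
    # ZMQ context termination; Python cannot interrupt it midway.
    time.sleep(0.5)
    done.set()


async def try_with_timeout():
    done = threading.Event()
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, blocking_call, done)

    timed_out = False
    try:
        # The timeout only stops *waiting*; it does not stop the thread.
        await asyncio.wait_for(future, timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True

    # The worker thread is still inside blocking_call at this point.
    still_running = not done.is_set()
    return timed_out, still_running


result = asyncio.run(try_with_timeout())
print(result)
```

In this sketch the stand-in returns after half a second. A call that never returns would keep the worker thread alive, and normal interpreter shutdown waits for such threads, which matches the hang described in the PR.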

Summary by CodeRabbit

Release Notes

  • Refactor

    • Improved process termination and cleanup mechanisms for enhanced application stability.
  • Tests

    • Added lifecycle validation tests for the proxy manager to ensure reliable shutdown behavior.

@github-actions

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@ajc/zmq-fix

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@ajc/zmq-fix

@github-actions github-actions Bot added the fix label Nov 14, 2025
@codecov

codecov Bot commented Nov 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@coderabbitai

coderabbitai Bot commented Nov 14, 2025

Walkthrough

The pull request delegates ZMQ context cleanup responsibility from explicit in-process termination to process exit, removes unused imports (asyncio, zmq.asyncio, sys, Environment), and changes program termination from Python-level sys.exit(0) to OS-level os._exit(0). A new lifecycle test validates that ZMQ context termination is not called during ProxyManager shutdown.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Lifecycle and Resource Cleanup: src/aiperf/controller/proxy_manager.py, src/aiperf/controller/system_controller.py | Removed unused imports (asyncio, zmq.asyncio, sys, Environment); replaced ZMQ context termination logic with debug log delegating cleanup to process exit; changed program termination from sys.exit(0) to os._exit(0) for immediate OS-level exit. |
| Test Coverage: tests/unit/controller/test_proxy_manager.py | Added TestProxyManagerLifecycle unit test class validating ProxyManager initialization, start, and stop lifecycle; asserts ZMQ global context termination is never called during shutdown. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

  • Specific areas requiring attention:
    • The change from sys.exit(0) to os._exit(0) bypasses Python-level cleanup (finalizers, atexit handlers) and should be validated for any required shutdown sequences
    • Verify that delegating ZMQ context termination to process exit doesn't introduce resource leaks or hanging connections in edge cases
    • Confirm test assertions accurately reflect intended behavior regarding ZMQ context lifecycle
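
The first point above can be checked with a small standard-library experiment. This is a generic sketch, not aiperf code: it launches two child interpreters that each register an `atexit` handler, then exits one with `sys.exit(0)` and the other with `os._exit(0)`. The `CHILD` script and `run_child` helper are hypothetical names introduced for illustration.

```python
import subprocess
import sys

# Child script: register a handler, then exit in the requested way.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("atexit handler ran"))
{exit_call}
"""


def run_child(exit_call: str) -> str:
    """Run the child script with the given exit call and return its stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout.strip()


normal = run_child("sys.exit(0)")  # normal interpreter shutdown
hard = run_child("os._exit(0)")    # immediate exit, no Python-level cleanup

print(repr(normal))
print(repr(hard))
```

Only the `sys.exit(0)` child prints the handler message. Any shutdown step that must run (flushing logs, writing results) therefore has to complete before `os._exit(0)` is reached.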

Poem

🐰 Context cleanup now rests with the OS,
No tangled threads or untimely moss,
Imports trimmed neat, exit paths bright,
Process exit handles the cleanup right!

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely describes the main issue being fixed: preventing ZMQ context termination deadlock. It directly relates to the core change across modified files. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ecce27 and 0b72f85.

📒 Files selected for processing (3)
  • src/aiperf/controller/proxy_manager.py (1 hunks)
  • src/aiperf/controller/system_controller.py (1 hunks)
  • tests/unit/controller/test_proxy_manager.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/controller/test_proxy_manager.py (3)
src/aiperf/common/config/service_config.py (1)
  • ServiceConfig (28-173)
src/aiperf/controller/proxy_manager.py (1)
  • ProxyManager (12-67)
tests/unit/conftest.py (1)
  • mock_zmq_globally (39-75)
src/aiperf/controller/proxy_manager.py (1)
src/aiperf/common/protocols.py (1)
  • debug (64-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: integration-tests (macos-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.10)
  • GitHub Check: integration-tests (macos-latest, 3.13)
  • GitHub Check: integration-tests (ubuntu-latest, 3.10)
  • GitHub Check: integration-tests (ubuntu-latest, 3.13)
  • GitHub Check: integration-tests (ubuntu-latest, 3.12)
  • GitHub Check: integration-tests (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.12)
  • GitHub Check: build (macos-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.12)
  • GitHub Check: build (macos-latest, 3.13)
  • GitHub Check: build (ubuntu-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.13)
  • GitHub Check: build (ubuntu-latest, 3.11)
  • GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (2)
tests/unit/controller/test_proxy_manager.py (1)

15-40: Well-structured test validating the core requirement.

This test clearly validates that context.term() is not called during ProxyManager shutdown, which is the key behavioral change in this PR. The test setup is appropriate, uses existing fixtures correctly, and the assertion directly verifies the requirement.
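
The shape of such a test can be sketched with `unittest.mock`. The actual test in tests/unit/controller/test_proxy_manager.py is not reproduced in this conversation, so `FakeProxyManager` below is a hypothetical stand-in for the real ProxyManager, and the mocked context replaces the project's `mock_zmq_globally` fixture.

```python
import asyncio
from unittest.mock import MagicMock


class FakeProxyManager:
    """Hypothetical stand-in; not the real aiperf ProxyManager."""

    def __init__(self, context) -> None:
        self._context = context
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        # Deliberately no self._context.term() here: context cleanup
        # is left to process exit, as the PR describes.
        self.running = False


def run_lifecycle():
    """Drive start/stop once and return the mocked context and manager."""
    context = MagicMock(name="zmq_context")
    manager = FakeProxyManager(context)

    async def lifecycle() -> None:
        await manager.start()
        await manager.stop()

    asyncio.run(lifecycle())
    return context, manager


context, manager = run_lifecycle()

# The key assertion: shutdown never terminated the context.
context.term.assert_not_called()
```

Asserting on the mock rather than on process behaviour keeps the test fast and avoids reproducing the hang itself.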

src/aiperf/controller/proxy_manager.py (1)

52-67: The PyZMQ documentation claim in the code comment is inaccurate and should be corrected.

PyZMQ's documented best practices for short-lived processes recommend using context managers or ctx.destroy(linger=0) to close leftover sockets and avoid hangs, not avoiding context cleanup entirely as the comment suggests.

While the pragmatic decision to skip context.term() may be justified by your production experience with indefinite hangs, the documented approach for short-lived processes is ctx.destroy(linger=0) or context managers, which provide proper cleanup without the risk of blocking.

Recommended fixes:

  • Remove the misleading reference to PyZMQ documentation supporting the no-cleanup approach
  • Either: (1) consider using ctx.destroy(linger=0) instead of relying on process exit, or (2) clarify the comment to explain this is a workaround for production hangs, not a documented best practice

Likely an incorrect or invalid review comment.

Comment thread src/aiperf/controller/system_controller.py Outdated
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Contributor

@debermudez debermudez left a comment

Nice job. Especially like the documenting comments.

@ajcasagrande ajcasagrande merged commit 8bef273 into main Nov 14, 2025
21 checks passed
@ajcasagrande ajcasagrande deleted the ajc/zmq-fix branch November 14, 2025 21:14
ajcasagrande added a commit that referenced this pull request Nov 17, 2025
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
vinhngx pushed a commit to vinhngx/aiperf that referenced this pull request Jan 12, 2026
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>