fix: Fix ZMQ context termination deadlock issue#469
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@ajc/zmq-fixRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@ajc/zmq-fix |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
WalkthroughThe pull request delegates ZMQ context cleanup responsibility from explicit in-process termination to process exit, removes unused imports (asyncio, zmq.asyncio, sys, Environment), and changes program termination from Python-level Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~18 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/aiperf/controller/proxy_manager.py(1 hunks)src/aiperf/controller/system_controller.py(1 hunks)tests/unit/controller/test_proxy_manager.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/controller/test_proxy_manager.py (3)
src/aiperf/common/config/service_config.py (1)
ServiceConfig(28-173)src/aiperf/controller/proxy_manager.py (1)
ProxyManager(12-67)tests/unit/conftest.py (1)
mock_zmq_globally(39-75)
src/aiperf/controller/proxy_manager.py (1)
src/aiperf/common/protocols.py (1)
debug(64-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: integration-tests (macos-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.10)
- GitHub Check: integration-tests (macos-latest, 3.13)
- GitHub Check: integration-tests (ubuntu-latest, 3.10)
- GitHub Check: integration-tests (ubuntu-latest, 3.13)
- GitHub Check: integration-tests (ubuntu-latest, 3.12)
- GitHub Check: integration-tests (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.12)
- GitHub Check: build (macos-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.12)
- GitHub Check: build (macos-latest, 3.13)
- GitHub Check: build (ubuntu-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.13)
- GitHub Check: build (ubuntu-latest, 3.11)
- GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (2)
tests/unit/controller/test_proxy_manager.py (1)
15-40: Well-structured test validating the core requirement.This test clearly validates that
context.term()is not called during ProxyManager shutdown, which is the key behavioral change in this PR. The test setup is appropriate, uses existing fixtures correctly, and the assertion directly verifies the requirement.src/aiperf/controller/proxy_manager.py (1)
52-67: The PyZMQ documentation claim in the code comment is inaccurate and should be corrected.PyZMQ's documented best practices for short-lived processes recommend using context managers or
ctx.destroy(linger=0)to close leftover sockets and avoid hangs, not avoiding context cleanup entirely as the comment suggests.While the pragmatic decision to skip
context.term()may be justified by your production experience with indefinite hangs, the documented approach for short-lived processes isctx.destroy(linger=0)or context managers, which provide proper cleanup without the risk of blocking.Recommended fixes:
- Remove the misleading reference to PyZMQ documentation supporting the no-cleanup approach
- Either: (1) consider using
ctx.destroy(linger=0)instead of relying on process exit, or (2) clarify the comment to explain this is a workaround for production hangs, not a documented best practiceLikely an incorrect or invalid review comment.
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
debermudez
left a comment
There was a problem hiding this comment.
Nice job. Especially like the documenting comments.
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com> Signed-off-by: vinhn <vinhn@nvidia.com>
Problem: aiperf hung indefinitely after completing benchmarks due to a ZeroMQ context termination call that blocks in uninterruptible C code, waiting for network sockets that never fully close.
Root Cause: Python's timeout mechanisms cannot interrupt system-level blocking calls, and the singleton context architecture means termination is attempted while other components still hold socket references, creating a deadlock condition.
Solution: Remove the explicit context termination call and delegate cleanup to the operating system on process exit—a standard practice for short-lived processes recommended by the ZeroMQ maintainers that eliminates the hang while maintaining clean resource management through kernel-level cleanup.
Summary by CodeRabbit
Release Notes
Refactor
Tests