Bump zenoh to 1.8.0 - 2nd attempt #964

Merged
sloretz merged 8 commits into ros2:rolling from
ZettaScaleLabs:fix/subscriber-termination
Apr 13, 2026

Conversation

@YuanYuYuan
Contributor

@YuanYuYuan YuanYuYuan commented Apr 10, 2026

Summary

Key Changes

Root Cause (hang)

eclipse-zenoh/zenoh@e5db0ce changed Session::close() to call wait_callbacks() internally, blocking until all in-flight callbacks finish. The old teardown order let session_.reset() run while rmw entities (nodes, subscriptions) still held shared_ptr references. The session was only destroyed later inside ~Data() during nodes_.clear() — at which point callback handlers were being torn down simultaneously, causing a deadlock or STATUS_STACK_BUFFER_OVERRUN on Windows.

The fix calls session_->close() explicitly in shutdown(), at which point rclcpp::shutdown() has already exited the spin loop so no callbacks are in-flight. wait_callbacks() returns immediately, and the subsequent destructor path finds is_closed() == true and skips the blocking call.
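
The ordering described above can be illustrated with a small, self-contained sketch. It does not use the zenoh-cpp API: `MockSession`, `MockEntity`, `MockContext` and `run_teardown` are hypothetical names invented for this illustration, and the "blocking" close is only simulated by recording events in a log. The assumption is that `close()` is idempotent and that the destructor skips the close when `is_closed()` is already true, as the description states.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a Zenoh session; NOT the zenoh-cpp API.
class MockSession {
public:
  explicit MockSession(std::vector<std::string> & log) : log_(log) {}

  // Idempotent close: only the first call does the (potentially blocking) work.
  void close() {
    if (closed_) {
      return;
    }
    log_.push_back("close");
    closed_ = true;
  }

  bool is_closed() const { return closed_; }

  ~MockSession() {
    // Skip the blocking close when the session was already closed explicitly.
    if (!closed_) {
      log_.push_back("close-in-destructor");
    }
    log_.push_back("destroyed");
  }

private:
  std::vector<std::string> & log_;
  bool closed_ = false;
};

// Hypothetical entity (node, subscription, ...) that keeps the session alive.
struct MockEntity {
  std::shared_ptr<MockSession> session;
};

// Hypothetical context owning the session, loosely modelled on shutdown().
class MockContext {
public:
  explicit MockContext(std::vector<std::string> & log)
  : session_(std::make_shared<MockSession>(log)) {}

  std::shared_ptr<MockSession> session() const { return session_; }

  void shutdown() {
    if (session_) {
      session_->close();  // close explicitly while no callbacks are in flight
      session_.reset();   // drop our reference; entities may still hold theirs
    }
  }

private:
  std::shared_ptr<MockSession> session_;
};

// Runs the scenario and returns the recorded order of events.
std::vector<std::string> run_teardown() {
  std::vector<std::string> log;
  {
    MockContext ctx(log);
    MockEntity entity{ctx.session()};
    ctx.shutdown();
    log.push_back("shutdown-returned");
  }  // entity releases the last reference here, destroying the session
  return log;
}
```

In this sketch `run_teardown()` records `close`, then `shutdown-returned`, then `destroyed`: the close happens inside `shutdown()`, and the later destructor has nothing blocking left to do even though an entity outlived the context's own reference.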

Root Cause (ANSI codes, #951)

Zenoh 1.8.0 emits a new error log at Session shutdown: when a TCP link closes at the same time, Zenoh fails to send the link event to a callback that has already been removed.
The Rust logger (env_logger) emits ANSI color escape sequences by default. These leaked into the captured output of ros2 param commands, causing yaml.reader.ReaderError when that output was parsed as YAML.
ros2topic.ros2topic.test.test_cli.test_cli also parses the command output and fails on this error log.

The fix is in Zenoh (commit eclipse-zenoh/zenoh@2687c51), which removes those logs.
This PR makes rmw_zenoh use this commit.
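
To make the failure mode concrete, here is a minimal sketch of why a colored log line is a problem for a consumer that expects plain text. It is illustrative only and is not what this PR does (the PR removes the offending logs upstream). `contains_escape` and `strip_ansi_csi` are hypothetical helpers, and the sample log line used in the note below is made up.

```cpp
#include <cstddef>
#include <string>

// Returns true if `text` contains the ESC control character (0x1B),
// which introduces ANSI escape sequences.
bool contains_escape(const std::string & text) {
  return text.find('\x1b') != std::string::npos;
}

// Removes ANSI CSI sequences of the form: ESC '[' <parameter bytes> <final byte>.
// Hypothetical helper for illustration; other escape types are left untouched.
std::string strip_ansi_csi(const std::string & text) {
  std::string out;
  std::size_t i = 0;
  while (i < text.size()) {
    if (text[i] == '\x1b' && i + 1 < text.size() && text[i + 1] == '[') {
      i += 2;
      // Parameter and intermediate bytes lie in 0x20..0x3F.
      while (i < text.size() && text[i] >= 0x20 && text[i] <= 0x3F) {
        ++i;
      }
      // The final byte lies in 0x40..0x7E and ends the sequence.
      if (i < text.size() && text[i] >= 0x40 && text[i] <= 0x7E) {
        ++i;
      }
      continue;
    }
    out.push_back(text[i]);
    ++i;
  }
  return out;
}
```

For a made-up input such as `"\x1b[31mERROR\x1b[0m value: 1"`, `contains_escape` reports the ESC byte and `strip_ansi_csi` returns `"ERROR value: 1"`. Stripping colors would still leave the unexpected error line in the output, which is consistent with the later finding in this conversation that disabling colors alone was not sufficient.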

Related

Breaking Changes

None


Did you use Generative AI?

Yes. Claude (claude-sonnet-4-6) via Claude Code was used to assist with root cause analysis, reproducing the bug on Windows, and creating an initial prototype of the changes in this PR.

YuanYuYuan and others added 6 commits April 11, 2026 01:44
* chore(zenoh_cpp_vendor): bump to latest zenoh-c and zenoh-cpp

- zenoh-c main: 102df1a3 (2026-04-10)
- zenoh-c ROS/rust-1.75: 0193595c (2026-04-07)
- zenoh-cpp main: af381b42 (2026-04-10)

* fix: close session explicitly in shutdown() to prevent hang on Windows

zenoh commit e5db0ce changed session.close() to call wait_callbacks(),
which blocks until all in-flight callbacks finish. With the older
teardown order, session_.reset() was called while node-level entities
(publishers, subscriptions, etc.) still held shared_ptr<Session> refs,
so the session wasn't actually destroyed until ~Data() called
nodes_.clear(), at which point wait_callbacks() would deadlock against
callbacks being concurrently destroyed on Windows.

Fix: call session_->close() explicitly in shutdown() before
session_.reset(). At shutdown time the spin loop has already exited,
so no callbacks are in-flight and wait_callbacks() returns immediately.
The session is then marked closed; when the shared_ptr refcount
eventually drops to zero during normal rcl teardown, the session
destructor finds is_closed()==true and skips the blocking close().

* chore(zenoh_cpp_vendor): restore get_cargo_version.cmake from ros2#945

Extract cargo version detection into a reusable CMake function instead
of inlining execute_process, matching the approach from PR ros2#945.

* fix: disable ANSI color codes in Zenoh log output (#951)

Set RUST_LOG_STYLE=never before initializing the Zenoh logger so that
color escape sequences do not leak into captured command output. This
fixes YAML parsing failures in ros2param tests where the ESC character
was treated as an unacceptable character.

The env var is set with overwrite=0 so callers can still override it.

* Use zenoh-c commits for Zenoh 1.8.0 + #2493

* Fix synchronization due to changes in undeclare in zenoh 1.8.0

This commit re-applies changes made in ros2#935, while keeping the explicit call to session_.close() added in rmw_context_impl_s::shutdown()
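
The `overwrite=0` behaviour mentioned in the RUST_LOG_STYLE commit above can be sketched as follows. This is a minimal illustration assuming a POSIX environment where `setenv` is available; `apply_default_log_style` is a hypothetical name, not a function from rmw_zenoh. Note that this change was later reverted in this PR in favour of removing the logs in Zenoh.

```cpp
#include <stdlib.h>  // POSIX setenv
#include <cstdlib>   // std::getenv
#include <string>

// Sets RUST_LOG_STYLE=never only if the variable is not already set
// (overwrite = 0), then returns the effective value.
// Hypothetical helper for illustration; assumes a POSIX platform.
std::string apply_default_log_style() {
  ::setenv("RUST_LOG_STYLE", "never", 0);
  const char * value = std::getenv("RUST_LOG_STYLE");
  return value ? std::string(value) : std::string();
}
```

With the variable unset, the helper yields `never`; if a caller has already exported `RUST_LOG_STYLE=always`, that value is preserved, which is the override path the commit message refers to.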
@JEnoch
Contributor

JEnoch commented Apr 11, 2026

Pulls: #964
Gist: https://gist.githubusercontent.com/JEnoch/c0e58f7787fc7125b36f872ca1555087/raw/79633048c6c15bd3a763f4dde0f58d572657cbec/ros2.repos
BUILD args: "--continue-on-error" --packages-above-and-dependencies zenoh_cpp_vendor zenoh_security_tools rmw_zenoh_cpp
TEST args: --packages-above zenoh_cpp_vendor zenoh_security_tools rmw_zenoh_cpp
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18928

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@JEnoch
Contributor

JEnoch commented Apr 11, 2026

Pulls: #964
Gist: https://gist.githubusercontent.com/JEnoch/ca315444cd2b2446bdf12a325099cd35/raw/79633048c6c15bd3a763f4dde0f58d572657cbec/ros2.repos
BUILD args: "--continue-on-error" --packages-above-and-dependencies zenoh_cpp_vendor zenoh_security_tools rmw_zenoh_cpp
TEST args: --packages-above zenoh_cpp_vendor zenoh_security_tools rmw_zenoh_cpp
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18929

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@JEnoch
Contributor

JEnoch commented Apr 12, 2026

The fix for ANSI color escape was not working and not sufficient for ros2topic.ros2topic.test.test_cli.test_cli which was failing trying to parse the Zenoh error log. See https://ci.ros2.org/job/ci_linux/28672/testReport/junit/ros2topic.ros2topic.test/test_cli/test_cli/

Those error logs in Zenoh are not legit anyway, since they occur at Session closure when it tries to call an already removed callback on a link closure event. In a branch based on version 1.8.0, I removed those error logs in this commit:
eclipse-zenoh/zenoh@2687c51

In distinct branches, zenoh-c is using this commit:

This PR is now using those branches.

@JEnoch JEnoch changed the title from "fix: close session explicitly in shutdown() to prevent hang on Windows" to "Bump zenoh to 1.8.0 - 2nd attempt" Apr 12, 2026
@JEnoch
Contributor

JEnoch commented Apr 12, 2026

This CI is now green for all Linux.
On Windows, there are no more timeouts and the total duration is back to ~5 hours.
As far as I can see the 24 failing tests are not with rmw_zenoh. Most are with rmw_connextdds, and some are related to ImportError: Could not find Qt binding.

@sloretz OK to merge before the RMW freeze?

@mjcarroll
Member

Will leave the final call here to @sloretz, but this is much better than where we were a week ago.

Contributor

@sloretz sloretz left a comment

Thank you for resolving those two issues! The Windows CI results LGTM

@sloretz sloretz merged commit ba1ab30 into ros2:rolling Apr 13, 2026
5 of 6 checks passed
@JEnoch
Contributor

JEnoch commented Apr 13, 2026

@Mergifyio backport kilted jazzy humble

@mergify

mergify Bot commented Apr 13, 2026

backport kilted jazzy humble

✅ Backports have been created

mergify Bot pushed a commit that referenced this pull request Apr 13, 2026
* chore(zenoh_cpp_vendor): bump to latest zenoh-c and zenoh-cpp

- zenoh-c main: 102df1a3 (2026-04-10)
- zenoh-c ROS/rust-1.75: 0193595c (2026-04-07)
- zenoh-cpp main: af381b42 (2026-04-10)

* fix: close session explicitly in shutdown() to prevent hang on Windows

zenoh commit e5db0ce changed session.close() to call wait_callbacks(),
which blocks until all in-flight callbacks finish. With the older
teardown order, session_.reset() was called while node-level entities
(publishers, subscriptions, etc.) still held shared_ptr<Session> refs,
so the session wasn't actually destroyed until ~Data() called
nodes_.clear() — at which point wait_callbacks() would deadlock against
callbacks being concurrently destroyed on Windows.

Fix: call session_->close() explicitly in shutdown() before
session_.reset(). At shutdown time the spin loop has already exited,
so no callbacks are in-flight and wait_callbacks() returns immediately.
The session is then marked closed; when the shared_ptr refcount
eventually drops to zero during normal rcl teardown, the session
destructor finds is_closed()==true and skips the blocking close().

* chore(zenoh_cpp_vendor): restore get_cargo_version.cmake from #945

Extract cargo version detection into a reusable CMake function instead
of inlining execute_process, matching the approach from PR #945.

* fix: disable ANSI color codes in Zenoh log output (#951)

Set RUST_LOG_STYLE=never before initializing the Zenoh logger so that
color escape sequences do not leak into captured command output. This
fixes YAML parsing failures in ros2param tests where the ESC character
was treated as an unacceptable character.

The env var is set with overwrite=0 so callers can still override it.

* Use zenoh-c commits for Zenoh 1.8.0 + #2493

* Fix synchronization due to changes in undeclare in zenoh 1.8.0

This commit re-applies changes made in #935, while keeping the explicit call to session_.close() added in rmw_context_impl_s::shutdown()

* Use zenoh 2687c5135

eclipse-zenoh/zenoh@2687c51

from branch https://github.com/eclipse-zenoh/zenoh/tree/suppress-admin-err-message-on-session-close

based on 1.8.0 plus a few fixes, including removal of an error log at closure that caused a ros2cli test to fail

* revert disable ANSI color codes in Zenoh log output

---------

Co-authored-by: Julien Enoch <julien.e@zettascale.tech>
(cherry picked from commit ba1ab30)
mergify Bot pushed a commit that referenced this pull request Apr 13, 2026
mergify Bot pushed a commit that referenced this pull request Apr 13, 2026
JEnoch added a commit that referenced this pull request Apr 13, 2026
(cherry picked from commit ba1ab30)

Co-authored-by: Yuyuan Yuan <az6980522@gmail.com>
Co-authored-by: Julien Enoch <julien.e@zettascale.tech>
JEnoch added a commit that referenced this pull request Apr 13, 2026
* Bump zenoh to 1.8.0 - 2nd attempt (#964)

Co-authored-by: Julien Enoch <julien.e@zettascale.tech>
(cherry picked from commit ba1ab30)

* Make uncrustify happy

---------

Co-authored-by: Yuyuan Yuan <az6980522@gmail.com>
Co-authored-by: Julien Enoch <julien.e@zettascale.tech>
JEnoch added a commit that referenced this pull request Apr 13, 2026
(cherry picked from commit ba1ab30)

Co-authored-by: Yuyuan Yuan <az6980522@gmail.com>
Co-authored-by: Julien Enoch <julien.e@zettascale.tech>
Development

Successfully merging this pull request may close these issues.

Rust bump to >= 1.75 affects downstream ros2param test

4 participants