Summary
Every call to `Executor::spin()` spawns a new OS thread for the `WaitSetRunner`, even for short-lived spins like `spin(timeout=0)`. This makes the common pattern of calling `spin()` in a tight loop (equivalent to rclcpp's `spin_some`) extremely expensive: roughly 9x slower than running the wait set on the calling thread.
Minimal reproduction
    use rclrs::{Context, CreateBasicExecutor, IntoPrimitiveOptions, QoSProfile, SpinOptions};
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::time::{Duration, Instant};

    fn main() {
        let context = Context::default();
        let mut executor = context.create_basic_executor();
        let node = executor.create_node("bench").unwrap();

        let qos = QoSProfile::default().reliable().keep_last(1000);
        let recv_count = Arc::new(AtomicU64::new(0));

        let pub_handle = node
            .create_publisher::<std_msgs::msg::UInt8MultiArray>("bench_topic".qos(qos.clone()))
            .unwrap();

        let count = Arc::clone(&recv_count);
        let _sub = node
            .create_subscription::<std_msgs::msg::UInt8MultiArray, _>(
                "bench_topic".qos(qos.clone()),
                move |_msg: std_msgs::msg::UInt8MultiArray| {
                    count.fetch_add(1, Ordering::Relaxed);
                },
            )
            .unwrap();

        // Publish-then-spin loop (same pattern as the rclcpp benchmark below)
        let mut sent: u64 = 0;
        let start = Instant::now();
        let duration = Duration::from_secs(5);
        while start.elapsed() < duration {
            let msg = std_msgs::msg::UInt8MultiArray::default();
            pub_handle.publish(msg).ok();
            sent += 1;
            // Process available callbacks (this is the bottleneck)
            executor.spin(SpinOptions::new().timeout(Duration::ZERO));
        }

        let received = recv_count.load(Ordering::Relaxed);
        let elapsed = start.elapsed().as_secs_f64();
        eprintln!("Sent: {sent}");
        eprintln!("Received: {received}");
        eprintln!("Throughput: {:.0} msg/s", received as f64 / elapsed);
    }
Equivalent rclcpp benchmark
    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/u_int8_multi_array.hpp>

    #include <atomic>
    #include <chrono>
    #include <cinttypes>
    #include <cstdio>

    int main(int argc, char** argv) {
      rclcpp::init(argc, argv);
      auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
      auto node = std::make_shared<rclcpp::Node>("spin_bench");
      executor->add_node(node);

      auto qos = rclcpp::QoS(1000).reliable();
      std::atomic<uint64_t> recv_count{0};

      auto pub_handle = node->create_publisher<std_msgs::msg::UInt8MultiArray>("bench_topic", qos);
      auto sub = node->create_subscription<std_msgs::msg::UInt8MultiArray>(
          "bench_topic", qos,
          [&recv_count](const std_msgs::msg::UInt8MultiArray::SharedPtr) {
            recv_count.fetch_add(1, std::memory_order_relaxed);
          });

      uint64_t sent = 0;
      auto start = std::chrono::steady_clock::now();
      auto duration = std::chrono::seconds(5);
      while (std::chrono::steady_clock::now() - start < duration) {
        std_msgs::msg::UInt8MultiArray msg;
        pub_handle->publish(msg);
        sent++;
        executor->spin_some(std::chrono::milliseconds(0));
      }

      uint64_t received = recv_count.load();
      double elapsed = std::chrono::duration<double>(
          std::chrono::steady_clock::now() - start).count();
      fprintf(stderr, "Sent: %" PRIu64 "\n", sent);
      fprintf(stderr, "Received: %" PRIu64 "\n", received);
      fprintf(stderr, "Throughput: %.0f msg/s\n", received / elapsed);

      rclcpp::shutdown();
      return 0;
    }
Results (release build, 3 runs averaged, same machine)
| Implementation | Throughput |
| --- | --- |
| rclcpp `spin_some` | ~445k msg/s |
| rclrs `spin()` (current) | ~42k msg/s |
| rclrs `spin_some()` (see below) | ~374k msg/s |

The current `spin()` is 10.6x slower than rclcpp.
Root cause
`BasicExecutorRuntime::spin()` does the following on every call:

- Drains all runners and sends them through a `futures::channel::mpsc` channel
- Creates a `manage_workers` async task
- Spawns a new OS thread via `std::thread::spawn()` inside `WaitSetRunner::run()`
- The main thread polls async tasks until the runner thread finishes
- The runner thread sends results back through a blocking channel
- The main thread joins the results and stores the runners back
For a `spin(timeout=0)` call, the runner thread simply calls `rcl_wait(0)` (which returns immediately), processes any ready entities, and exits. The actual work takes microseconds, but the thread lifecycle overhead dominates:

- `std::thread::spawn()`: ~10-50 us on Linux
- Channel send/receive: additional synchronization overhead
- `manage_workers` async task creation and polling: additional overhead
In contrast, rclcpp's `spin_some()` runs everything on the calling thread: direct function calls to `rcl_wait`, readiness checks, and callback execution. No threads, no channels, no async machinery.
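To put a number on the thread-lifecycle cost in isolation, a std-only micro-benchmark (a minimal sketch, independent of rclrs) can time an empty spawn+join against an inline call. Numbers vary by machine and kernel, but the spawn+join round trip typically lands in the tens of microseconds on Linux, consistent with the estimate above:

    use std::time::Instant;

    fn main() {
        const ITERS: u32 = 10_000;

        // Baseline: run a trivial closure inline on the calling thread.
        let start = Instant::now();
        for _ in 0..ITERS {
            std::hint::black_box(|| {})();
        }
        let inline = start.elapsed() / ITERS;

        // Thread-per-call: spawn and join an empty thread each iteration,
        // mirroring the lifecycle BasicExecutorRuntime::spin() pays per call.
        let start = Instant::now();
        for _ in 0..ITERS {
            std::thread::spawn(|| {}).join().unwrap();
        }
        let spawned = start.elapsed() / ITERS;

        eprintln!("inline call:  {inline:?} per iteration");
        eprintln!("spawn + join: {spawned:?} per iteration");
    }

Built with `cargo run --release`, the inline call should be effectively free next to the spawn+join cost, which is the gap the benchmark above keeps paying on every `spin()` call.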
Validation
To confirm that thread-per-spin is the bottleneck, I implemented an experimental `spin_some()` that calls `WaitSetRunner::run_blocking()` directly on the calling thread, bypassing thread creation entirely. With that change, throughput went from ~42k to ~374k msg/s, closing the gap to only 1.2x slower than rclcpp (see table above).
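For reference, the only change to the benchmark loop was the spin call itself. The exact signature of the experimental `spin_some()` is hypothetical here, shown mirroring `SpinOptions` for symmetry:

    // Current API: spawns a WaitSetRunner thread on every call.
    executor.spin(SpinOptions::new().timeout(Duration::ZERO));

    // Experimental (hypothetical signature): runs WaitSetRunner::run_blocking()
    // on the calling thread instead -- no spawn, no channels, no async tasks.
    executor.spin_some(SpinOptions::new().timeout(Duration::ZERO));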
Impact
This affects any application that calls `spin()` repeatedly with short timeouts, which is the standard ROS 2 pattern for non-blocking spinning.
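As an illustration, a typical control loop that interleaves application work with non-blocking callback processing pays the full thread lifecycle on every iteration (`read_sensors` and `publish_commands` are hypothetical placeholders for application code):

    loop {
        read_sensors();      // hypothetical application work
        publish_commands();  // hypothetical application work

        // Non-blocking spin: handle whatever callbacks are ready, then return.
        // With the current implementation, this line spawns and joins an OS
        // thread on every loop iteration.
        executor.spin(SpinOptions::new().timeout(Duration::ZERO));
    }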