Executor::spin() spawns a new OS thread per call, causing severe performance overhead #627

Summary

Every call to Executor::spin() spawns a new OS thread for the WaitSetRunner, even for short-lived spins like spin(timeout=0). This makes the common pattern of calling spin() in a tight loop (the equivalent of rclcpp's spin_some) extremely expensive: roughly 9x slower than running the wait set on the calling thread.

Minimal reproduction

use rclrs::{Context, CreateBasicExecutor, IntoPrimitiveOptions, QoSProfile, SpinOptions};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let context = Context::default();
    let mut executor = context.create_basic_executor();
    let node = executor.create_node("bench").unwrap();

    let qos = QoSProfile::default().reliable().keep_last(1000);
    let recv_count = Arc::new(AtomicU64::new(0));

    let pub_handle = node
        .create_publisher::<std_msgs::msg::UInt8MultiArray>("bench_topic".qos(qos.clone()))
        .unwrap();

    let count = Arc::clone(&recv_count);
    let _sub = node
        .create_subscription::<std_msgs::msg::UInt8MultiArray, _>(
            "bench_topic".qos(qos.clone()),
            move |_msg: std_msgs::msg::UInt8MultiArray| {
                count.fetch_add(1, Ordering::Relaxed);
            },
        )
        .unwrap();

    // Publish-then-spin loop (same pattern as rclcpp benchmarks)
    let mut sent: u64 = 0;
    let start = Instant::now();
    let duration = Duration::from_secs(5);

    while start.elapsed() < duration {
        let msg = std_msgs::msg::UInt8MultiArray::default();
        pub_handle.publish(msg).ok();
        sent += 1;

        // Process available callbacks (this is the bottleneck)
        executor.spin(SpinOptions::new().timeout(Duration::ZERO));
    }

    let received = recv_count.load(Ordering::Relaxed);
    let elapsed = start.elapsed().as_secs_f64();
    eprintln!("Sent:       {sent}");
    eprintln!("Received:   {received}");
    eprintln!("Throughput: {:.0} msg/s", received as f64 / elapsed);
}

Equivalent rclcpp benchmark

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/u_int8_multi_array.hpp>

#include <atomic>
#include <chrono>
#include <cinttypes>
#include <cstdio>

int main(int argc, char** argv) {
    rclcpp::init(argc, argv);

    auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    auto node = std::make_shared<rclcpp::Node>("spin_bench");
    executor->add_node(node);

    auto qos = rclcpp::QoS(1000).reliable();
    std::atomic<uint64_t> recv_count{0};

    auto pub_handle = node->create_publisher<std_msgs::msg::UInt8MultiArray>("bench_topic", qos);

    auto sub = node->create_subscription<std_msgs::msg::UInt8MultiArray>(
        "bench_topic", qos,
        [&recv_count](const std_msgs::msg::UInt8MultiArray::SharedPtr) {
            recv_count.fetch_add(1, std::memory_order_relaxed);
        });

    uint64_t sent = 0;
    auto start = std::chrono::steady_clock::now();
    auto duration = std::chrono::seconds(5);

    while (std::chrono::steady_clock::now() - start < duration) {
        std_msgs::msg::UInt8MultiArray msg;
        pub_handle->publish(msg);
        sent++;

        executor->spin_some(std::chrono::milliseconds(0));
    }

    uint64_t received = recv_count.load();
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    fprintf(stderr, "Sent:       %lu\n", sent);
    fprintf(stderr, "Received:   %lu\n", received);
    fprintf(stderr, "Throughput: %.0f msg/s\n", received / elapsed);

    rclcpp::shutdown();
    return 0;
}

Results (release build, 3 runs averaged, same machine)

Implementation                  Throughput
rclcpp spin_some                ~445k msg/s
rclrs spin() (current)          ~42k msg/s
rclrs spin_some() (see below)   ~374k msg/s

The current spin() is therefore roughly 10.6x slower than rclcpp's spin_some.

Root cause

BasicExecutorRuntime::spin() does the following on every call (a toy model of the flow follows the list):

  1. Drains all runners and sends them through a futures::channel::mpsc channel
  2. Creates a manage_workers async task
  3. Spawns a new OS thread via std::thread::spawn() inside WaitSetRunner::run()
  4. The main thread polls async tasks until the runner thread finishes
  5. The runner thread sends results back through a blocking channel
  6. The main thread joins the results and stores runners back
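
To make the overhead concrete, here is a toy model of that flow. This is illustrative only, not the actual rclrs source: the type and function names mirror the steps above, and the real wait/execute work is elided. The point is that every spin() call pays for a full thread lifecycle plus a channel round trip before the runner is handed back.

use std::sync::mpsc;

// Toy stand-in for rclrs's WaitSetRunner; the real type wraps the rcl
// wait set and the entities to execute.
struct WaitSetRunner;

impl WaitSetRunner {
    // Step 3: run() spawns a fresh OS thread, does the wait/execute work
    // there, and sends the runner back over a blocking channel (steps 5-6).
    fn run(self, results: mpsc::Sender<WaitSetRunner>) {
        std::thread::spawn(move || {
            // rcl_wait + callback execution would happen here.
            results.send(self).unwrap();
        });
    }
}

fn spin_once(runner: WaitSetRunner) -> WaitSetRunner {
    let (tx, rx) = mpsc::channel();
    runner.run(tx); // spawn a brand-new thread for this one spin
    rx.recv().unwrap() // block until the runner comes back
}

fn main() {
    let mut runner = WaitSetRunner;
    for _ in 0..3 {
        runner = spin_once(runner); // full thread lifecycle on every iteration
    }
}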

For a spin(timeout=0) call, the runner thread simply calls rcl_wait(0) (which returns immediately), processes any ready entities, and exits. The actual work takes microseconds, but the thread lifecycle overhead dominates:

  • std::thread::spawn(): ~10-50 us on Linux (see the micro-benchmark after this list)
  • Channel send/receive: additional synchronization overhead
  • manage_workers async task creation and polling: additional overhead
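
The spawn figure is easy to verify independently of rclrs with a minimal spawn-and-join loop; the exact number is machine-dependent, but it bounds the per-spin() overhead from below:

use std::time::Instant;

fn main() {
    const N: u32 = 10_000;
    let start = Instant::now();
    for _ in 0..N {
        // One spawn-and-join round trip, mimicking the thread lifecycle
        // that every spin() call currently pays for.
        std::thread::spawn(|| {}).join().unwrap();
    }
    eprintln!("spawn+join: {:?} per call", start.elapsed() / N);
}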

In contrast, rclcpp's spin_some() runs everything on the calling thread: direct function calls to rcl_wait, readiness checks, and callback execution. No threads, no channels, no async machinery.

Validation

To confirm that thread-per-spin is the bottleneck, I implemented an experimental spin_some() that calls WaitSetRunner::run_blocking() directly on the calling thread, bypassing thread creation entirely. With that change, throughput went from ~42k to ~374k msg/s, closing the gap to only 1.2x slower than rclcpp (see table above).
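
Structurally, the experiment amounts to the following. This is a hypothetical sketch: run_blocking is taken from the description above and may not match the final API, and the types are toy stand-ins for the real ones.

// Toy model, not the actual rclrs code.
struct WaitSetRunner;

impl WaitSetRunner {
    // Stand-in for rcl_wait(timeout) plus executing whatever is ready,
    // all on the calling thread.
    fn run_blocking(&mut self) {}
}

struct BasicExecutorRuntime {
    runner: WaitSetRunner,
}

impl BasicExecutorRuntime {
    // No spawn, no channels, no async tasks: just run the wait set here.
    fn spin_some(&mut self) {
        self.runner.run_blocking();
    }
}

fn main() {
    let mut executor = BasicExecutorRuntime { runner: WaitSetRunner };
    executor.spin_some();
}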

Impact

This affects any application that calls spin() repeatedly with short timeouts, which is the standard ROS 2 pattern for non-blocking spinning.
