Executor::spin() spawns a new OS thread per call, causing severe performance overhead #627

Summary

Every call to Executor::spin() spawns a new OS thread for the WaitSetRunner, even for short-lived spins like spin(timeout=0). This makes the common pattern of calling spin() in a tight loop (the equivalent of rclcpp's spin_some) extremely expensive: roughly 9x slower than running the wait set on the calling thread.

Minimal reproduction

use rclrs::{Context, CreateBasicExecutor, IntoPrimitiveOptions, QoSProfile, SpinOptions};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let context = Context::default();
    let mut executor = context.create_basic_executor();
    let node = executor.create_node("bench").unwrap();

    let qos = QoSProfile::default().reliable().keep_last(1000);
    let recv_count = Arc::new(AtomicU64::new(0));

    let pub_handle = node
        .create_publisher::<std_msgs::msg::UInt8MultiArray>("bench_topic".qos(qos.clone()))
        .unwrap();

    let count = Arc::clone(&recv_count);
    let _sub = node
        .create_subscription::<std_msgs::msg::UInt8MultiArray, _>(
            "bench_topic".qos(qos.clone()),
            move |_msg: std_msgs::msg::UInt8MultiArray| {
                count.fetch_add(1, Ordering::Relaxed);
            },
        )
        .unwrap();

    // Publish-then-spin loop (same pattern as rclcpp benchmarks)
    let mut sent: u64 = 0;
    let start = Instant::now();
    let duration = Duration::from_secs(5);

    while start.elapsed() < duration {
        let msg = std_msgs::msg::UInt8MultiArray::default();
        pub_handle.publish(msg).ok();
        sent += 1;

        // Process available callbacks (this is the bottleneck)
        executor.spin(SpinOptions::new().timeout(Duration::ZERO));
    }

    let received = recv_count.load(Ordering::Relaxed);
    let elapsed = start.elapsed().as_secs_f64();
    eprintln!("Sent:       {sent}");
    eprintln!("Received:   {received}");
    eprintln!("Throughput: {:.0} msg/s", received as f64 / elapsed);
}

Equivalent rclcpp benchmark

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/u_int8_multi_array.hpp>

#include <atomic>
#include <chrono>
#include <cinttypes>
#include <cstdio>

int main(int argc, char** argv) {
    rclcpp::init(argc, argv);

    auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    auto node = std::make_shared<rclcpp::Node>("spin_bench");
    executor->add_node(node);

    auto qos = rclcpp::QoS(1000).reliable();
    std::atomic<uint64_t> recv_count{0};

    auto pub_handle = node->create_publisher<std_msgs::msg::UInt8MultiArray>("bench_topic", qos);

    auto sub = node->create_subscription<std_msgs::msg::UInt8MultiArray>(
        "bench_topic", qos,
        [&recv_count](const std_msgs::msg::UInt8MultiArray::SharedPtr) {
            recv_count.fetch_add(1, std::memory_order_relaxed);
        });

    uint64_t sent = 0;
    auto start = std::chrono::steady_clock::now();
    auto duration = std::chrono::seconds(5);

    while (std::chrono::steady_clock::now() - start < duration) {
        std_msgs::msg::UInt8MultiArray msg;
        pub_handle->publish(msg);
        sent++;

        executor->spin_some(std::chrono::milliseconds(0));
    }

    uint64_t received = recv_count.load();
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    fprintf(stderr, "Sent:       %lu\n", sent);
    fprintf(stderr, "Received:   %lu\n", received);
    fprintf(stderr, "Throughput: %.0f msg/s\n", received / elapsed);

    rclcpp::shutdown();
    return 0;
}

Results (release build, 3 runs averaged, same machine)

Implementation                  Throughput
rclcpp spin_some                ~445k msg/s
rclrs spin() (current)          ~42k msg/s
rclrs spin_some() (see below)   ~374k msg/s

The current spin() is therefore roughly 10.6x slower than rclcpp's spin_some.

Root cause

BasicExecutorRuntime::spin() does the following on every call (a toy model of the flow follows the list):

  1. Drains all runners and sends them through a futures::channel::mpsc channel
  2. Creates a manage_workers async task
  3. Spawns a new OS thread via std::thread::spawn() inside WaitSetRunner::run()
  4. The main thread polls async tasks until the runner thread finishes
  5. The runner thread sends results back through a blocking channel
  6. The main thread joins the results and stores runners back
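
To make the overhead concrete, here is a toy model of that flow. This is illustrative only, not the actual rclrs source: the type and function names mirror the steps above, and the real wait/execute work is elided. The point is that every spin() call pays for a full thread lifecycle plus a channel round trip before the runner is handed back.

use std::sync::mpsc;

// Toy stand-in for rclrs's WaitSetRunner; the real type wraps the rcl
// wait set and the entities to execute.
struct WaitSetRunner;

impl WaitSetRunner {
    // Step 3: run() spawns a fresh OS thread, does the wait/execute work
    // there, and sends the runner back over a blocking channel (steps 5-6).
    fn run(self, results: mpsc::Sender<WaitSetRunner>) {
        std::thread::spawn(move || {
            // rcl_wait + callback execution would happen here.
            results.send(self).unwrap();
        });
    }
}

fn spin_once(runner: WaitSetRunner) -> WaitSetRunner {
    let (tx, rx) = mpsc::channel();
    runner.run(tx); // spawn a brand-new thread for this one spin
    rx.recv().unwrap() // block until the runner comes back
}

fn main() {
    let mut runner = WaitSetRunner;
    for _ in 0..3 {
        runner = spin_once(runner); // full thread lifecycle on every iteration
    }
}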

For a spin(timeout=0) call, the runner thread simply calls rcl_wait(0) (which returns immediately), processes any ready entities, and exits. The actual work takes microseconds, but the thread lifecycle overhead dominates:

  • std::thread::spawn(): ~10-50 us on Linux (see the micro-benchmark after this list)
  • Channel send/receive: additional synchronization overhead
  • manage_workers async task creation and polling: additional overhead
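
The spawn figure is easy to verify independently of rclrs with a minimal spawn-and-join loop; the exact number is machine-dependent, but it bounds the per-spin() overhead from below:

use std::time::Instant;

fn main() {
    const N: u32 = 10_000;
    let start = Instant::now();
    for _ in 0..N {
        // One spawn-and-join round trip, mimicking the thread lifecycle
        // that every spin() call currently pays for.
        std::thread::spawn(|| {}).join().unwrap();
    }
    eprintln!("spawn+join: {:?} per call", start.elapsed() / N);
}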

In contrast, rclcpp's spin_some() runs everything on the calling thread: direct function calls to rcl_wait, readiness checks, and callback execution. No threads, no channels, no async machinery.

Validation

To confirm that thread-per-spin is the bottleneck, I implemented an experimental spin_some() that calls WaitSetRunner::run_blocking() directly on the calling thread, bypassing thread creation entirely. With that change, throughput went from ~42k to ~374k msg/s, closing the gap to only 1.2x slower than rclcpp (see table above).
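
Structurally, the experiment amounts to the following. This is a hypothetical sketch: run_blocking is taken from the description above and may not match the final API, and the types are toy stand-ins for the real ones.

// Toy model, not the actual rclrs code.
struct WaitSetRunner;

impl WaitSetRunner {
    // Stand-in for rcl_wait(timeout) plus executing whatever is ready,
    // all on the calling thread.
    fn run_blocking(&mut self) {}
}

struct BasicExecutorRuntime {
    runner: WaitSetRunner,
}

impl BasicExecutorRuntime {
    // No spawn, no channels, no async tasks: just run the wait set here.
    fn spin_some(&mut self) {
        self.runner.run_blocking();
    }
}

fn main() {
    let mut executor = BasicExecutorRuntime { runner: WaitSetRunner };
    executor.spin_some();
}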

Impact

This affects any application that calls spin() repeatedly with short timeouts, which is the standard ROS 2 pattern for non-blocking spinning.
