
v1.3.0 - the hwloc edition


@tzcnt tzcnt released this 08 Jan 05:55
· 67 commits to main since this release

This release dramatically enhances the runtime hardware detection and thread configuration capabilities of TooManyCooks. This makes it possible to write applications that will scale effortlessly on a variety of systems, including bare-metal monolithic, hybrid, or chiplet architecture CPUs, many-core/NUMA machines, or containers/virtualized environments.

There are several new examples demonstrating these capabilities, located here:

>> Examples Link <<

Enhancements to hwloc integration (with TMC_USE_HWLOC)

Prior State in v1.2 - ex_cpu Work Stealing Groups (Automatic)

The following was used internally to optimize work stealing, but was not directly visible to the user:
The number of shared L3 caches on the system is detected and thread groups are created according to those L3 caches. If an executor contains multiple such groups, threads prefer to steal work from other threads in their group before looking outside their group to steal. This is most effective on AMD Zen chiplet architectures, which may have many such caches (one per CCD/chiplet) and high latency for inter-chiplet access.

Thread affinity is set so that each thread may run on any core inside its cache group. This prevents expensive cross-cache thread migrations while still giving the OS some scheduling flexibility when other threads are running on the same system.

(Docs Link) ex_cpu Work Stealing Groups Improvements

  • Different CPU kinds (Performance or Efficiency cores on hybrid CPUs) are detected and will also be treated as independent groups. If an executor contains multiple CPU kinds, threads prefer to steal from other threads with the same CPU kind before stealing from threads running on a different CPU kind.
  • Caches of any level that are shared among multiple cores can create a group. For example, Apple M processors only expose L2 caches.
  • Irregular cache hierarchies are handled as well. For example, 13th-gen Intel CPUs have an L3 cache group for the P-cores and an L2 cache group for each cluster of four E-cores.

(Docs Link) New Header tmc/topology.hpp

  • tmc::topology::query() can be called to query the CPU topology and return a view optimized for TMC usage that includes NUMA nodes, cache groups and core counts. It also exposes information about CPU kinds, the number of CPUs of each kind, and the SMT level of each cache group (since often only P-cores have SMT). This disambiguates between P-cores, E-cores, and Low-Power E-cores (as seen on latest gen Intel laptop chips).
  • New types cpu_topology, core_group, cpu_kind exposed by the topology object
  • New types thread_pinning_level, thread_packing_strategy, thread_info used by ex_cpu
  • New type topology_filter used by multiple executors to control where threads are allocated
  • New function pin_thread to allow users to match external thread affinity to executor affinity

(Docs Link) New Method on ex_cpu/ex_cpu_st/ex_asio

  • add_partition() allows you to specify which physical cores an executor is allowed to run on (like the taskset command for a single executor). The input to this function is a tmc::topology::topology_filter which can be constructed with information retrieved from the topology query, to specify a specific set of cores, cache groups, or NUMA nodes.

(Docs Link) New Methods on ex_cpu

  • fill_thread_occupancy() fills SMT levels individually for all cores, with awareness of their CPU kind
  • set_thread_init_hook() / set_thread_teardown_hook() have new overloads that receive a tmc::topology::thread_info struct with info about the thread's group and CPU kind.
  • set_thread_pinning_level() defaults to GROUP, but allows pinning to CORE (for benchmarks)
  • set_thread_packing_strategy() controls how threads should be allocated when set_thread_count() is less than the whole system
  • set_work_stealing_strategy() controls the work stealing matrix type

(Docs Link) New Killer Feature: Hybrid Work Steering

For ex_cpu only, add_partition() can be called multiple times to split work between multiple partitions at different priority levels. This can be called with any partition type, but is probably most useful when used to split P- and E-cores. These priority ranges can be overlapping (as shown below) or non-overlapping.

    tmc::topology::topology_filter p_cores;
    p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
    tmc::topology::topology_filter e_cores;
    e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);

    // P-cores handle high (priority 0) and medium (priority 1) work
    // E-cores handle medium (priority 1) and low (priority 2) work
    // Work stealing between core types can happen for priority 1 work
    tmc::cpu_executor()
      .add_partition(p_cores, 0, 2)
      .add_partition(e_cores, 1, 3)
      .set_priority_count(3)
      .init();

(Docs Link) New Debug Compile Flag

If you define the preprocessor macro TMC_DEBUG_THREAD_CREATION, executors will print information about thread groups, affinities, and work stealing matrices when init() is called.

(Docs Link) Container (cgroups) CPU Quota Detection for ex_cpu

  • ex_cpu will automatically detect if Linux cgroups (v1 or v2) CPU quotas have been configured for the application. If so, it will create a default number of threads equal to the quota, rounded down, with a minimum of 1. This means that if you run with docker run --cpus=2, then 2 threads will be created. This feature does not require TMC_USE_HWLOC and is always active.
  • If TMC_USE_HWLOC is enabled, hwloc can also detect whether a specific cpuset was allocated. That is, if you run with docker run --cpuset-cpus=0,1, then 2 threads will be created, and the usual optimizations based on CPU cache groupings will apply.
  • These features also work with Kubernetes, as long as the underlying containerization is implemented using Linux cgroups.
  • This can be overridden by calling set_thread_count().

(Docs Link) Unlimited Threads Support

By default, ex_cpu uses machine-word-sized bitmaps for thread state tracking. This is highly efficient, but imposes a limit of 32 or 64 threads, based on system word size.

In v1.3, if you define the preprocessor macro TMC_MORE_THREADS, ex_cpu will support an unlimited number of threads. This uses a dynamic bitmap, which does have a small additional performance cost.

(Docs Link) New Executor: ex_manual_st

This is an executor that doesn't own any threads. Work posted to this executor will be queued, but not executed until it is polled by calling a run_*() function, during which the calling thread will execute work on behalf of the executor. This can be used to integrate with an external event loop, e.g. a game engine's main loop, and to poll for continuations at a specific time, without needing synchronization with other elements of the loop. Although any number of threads can post work to it simultaneously, only one thread should call run_*() at any given time.

(Docs Link) New Method on ex_cpu/ex_cpu_st

  • set_spins() controls how many times executor threads spin looking for work before going to sleep

Optimizations

  • Optimized ex_cpu thread sleep/wake logic for the most common scenario - when all threads are working and submitting work within their own priority group.
  • Optimized ex_cpu enqueue/dequeue logic for the most common scenario - when a thread is pushing and popping to its own highest priority queue.
  • Simplified thread sleep/wake logic overall by preferring to wake threads starting at index 0. In addition to reducing the latency of the thread waking calculation, this also has the effect of reducing data migrations, and making it easier for the OS to schedule external threads efficiently.
  • Dynamically size the std::atomic::wait() type used to notify threads that there is more work, based on the target platform. On Linux this type should optimally be 4 bytes, but on Windows it should be 8. On macOS it doesn't matter.

Removed Footguns

  • Made tmc::task::operator bool() explicit. When it was implicit, a task could be silently converted to an int.
  • Removed tmc::task::done(). This was originally provided for compatibility with the std::coroutine_handle API, but was never useful. Since tmc::task destroys itself on completion, this could never safely return true. The only way (currently) to check if a tmc::task is done is by making use of an awaitable.

Breaking Changes

Executor member functions named task_enter_context() and the traits concept requirement tmc::executor_traits::task_enter_context() have been renamed to dispatch() instead. This makes the naming consistent with asio::dispatch which has the same functionality - resume the work inline if running on the same executor, or post it if coming from a different executor.

This is only a breaking change for users that have implemented a custom executor (probably nobody at this point).