rfcs/proposed/core_types/README.md: 72 additions & 6 deletions

### Motivation

By default, oneTBB includes all available core types in a task arena unless explicitly constrained.
The current oneTBB API allows users to constrain task execution to a single core type using
`task_arena::constraints::set_core_type(core_type_id)`. While this provides control, it creates limitations for
real-world applications running on processors with more than two core types (e.g., on a system with Performance (P),
Efficient (E), and Low Power Efficient (LP E) cores):

#### 1. **Flexibility and Resource Utilization**

While it is often best to let the OS use all core types and schedule threads flexibly, some advanced users may find it
necessary to constrain scheduling. When there are more than two core types, it may be desirable to constrain execution
to a subset of the core types rather than to a single one.
Many parallel workloads can execute efficiently on such a subset of the available core types. For example:
- A parallel algorithm with good scalability works well on both P-cores and E-cores
- Background processing can run on E-cores or LP E-cores depending on availability
- Mixed workloads benefit from utilizing any available performance-class cores (P or E)
Applications often have workloads that don't fit neatly into a single core type

#### 3. **Avoiding Inappropriate Core Selection**

Without the ability to specify "P-cores OR E-cores (but not LP E-cores)" or
"LP E-cores OR E-cores (but not P-cores)", applications face dilemmas.
For example, without being able to specify "P-cores OR E-cores (but not LP E-cores)":
- **No constraint**: Work might be scheduled on LP E-cores, causing significant performance degradation
- **P-cores only**: Leaves E-cores idle, reducing parallelism
- **E-cores only**: Misses opportunities to use faster P-cores when available
This forces applications to choose one of these suboptimal strategies:

| Strategy | Pros | Cons |
|----------|------|------|
| **P-cores only** | Maximum single-threaded performance | Leaves E-cores idle; limited parallelism; higher power |
| **E-cores only** | Good for parallel workloads | Doesn't utilize P-core performance; excludes LP E-cores |
| **LP E-cores only** | Minimal power consumption | Severe performance impact for some workloads that require large, shared caches |
| **No constraint** | Maximum flexibility | May schedule on inappropriate cores (e.g., LP E-cores for compute) |

None of these options provides the desired behavior: **"Use P-cores or E-cores, but avoid LP E-cores"** or **"Use any
efficiency cores (E-cores or LP E-cores)"**.

### Compatibility Requirements

This proposal must maintain compatibility with previous oneTBB library versions:
- **API and Backward Compatibility (Old Application + New Library)**: Existing code using the current
`set_core_type(core_type_id)` API must compile and behave identically with newer oneTBB binaries.
- **Binary Compatibility (ABI)**: The `task_arena::constraints` struct layout must remain unchanged.
- **Forward Compatibility (New Application + Old Library)**: Applications compiled with the proposed new functionality
must be able to handle execution against older oneTBB binaries gracefully, without crashes or undefined behavior.
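
For reference, a minimal sketch of existing usage of the current single core type API; under the first two
requirements, code like this must keep compiling and behaving identically against a newer library:

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

int main() {
    auto core_types = tbb::info::core_types();  // sorted from least to most performant
    // Current API: constrain the arena to the single most performant core type.
    tbb::task_arena arena(
        tbb::task_arena::constraints{}.set_core_type(core_types.back()));
    arena.execute([] { /* parallel work */ });
    return 0;
}
```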

## Proposal

We propose extending the `task_arena::constraints` API to support specifying multiple acceptable core types, enabling
The design ensures full backward compatibility:

| Aspect | Guarantee |
|--------|-----------|
| **Behavior** | All existing code paths would preserve exact semantics |
| **ABI** | No changes to struct size or layout |

### Forward Compatibility

Because the `constraints` API is header-only, the ABI is unchanged, and no new library entry points are introduced,
applications compiled with the proposed functionality can handle execution against older oneTBB binaries through
runtime detection and fallback. Runtime detection uses `TBB_runtime_interface_version()`, which lets an application
verify that the loaded oneTBB binary supports the new API before attempting to use it. When the check indicates an
older library version, the application can gracefully fall back to an alternative strategy: either using all available
core types (no constraint) or constraining to a single core type with the existing `set_core_type()` API, as sketched
below. This approach satisfies the forward compatibility requirement stated in the "Compatibility Requirements" section.
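
A minimal sketch of this fallback, assuming a hypothetical interface-version threshold
(`multi_core_type_interface_version`), a helper function name (`make_compute_arena`), and the proposed plural
`set_core_types` setter, none of which exist in current oneTBB releases:

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/version.h>  // declares TBB_runtime_interface_version()

// Hypothetical value: the interface version that would first ship the plural API.
constexpr int multi_core_type_interface_version = 12170;

tbb::task_arena make_compute_arena() {
    auto core_types = tbb::info::core_types();  // sorted from least to most performant
    tbb::task_arena::constraints c;
    if (TBB_runtime_interface_version() >= multi_core_type_interface_version
        && core_types.size() > 2) {
        // New enough runtime: allow everything except the least performant (LP E) type.
        // set_core_types is the proposed plural setter; its exact signature is an open question.
        c.set_core_types({core_types[1], core_types[2]});
    } else if (!core_types.empty()) {
        // Older runtime: fall back to the existing single core type constraint.
        c.set_core_type(core_types.back());
    }
    return tbb::task_arena{c};
}
```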

### Usage Examples

Core type capabilities vary by hardware platform, and the benefits of constraining execution are highly
application-dependent. In most cases, systems with a hybrid CPU architecture show reasonable performance without any
additional API calls. However, in some exceptional scenarios, performance may be tuned by specifying preferred core
types. The following examples demonstrate these advanced use cases.

#### Example 1: Performance-Class Cores (P or E, not LP E)

In rare cases, compute-intensive tasks may be scheduled onto LP E-cores. To fully prevent this, execution can be
constrained to P-cores and E-cores. The example below sketches how multiple preferred core types could be set:

```cpp
auto core_types = tbb::info::core_types();  // sorted from least to most performant

// Sketch: allow P-cores and E-cores, exclude the least performant (LP E) type.
// set_core_types is the proposed plural setter; its exact signature is still an open question.
tbb::task_arena::constraints c;
c.set_core_types({core_types[1], core_types[2]});

tbb::task_arena arena(c);
arena.execute([] {
    // compute-intensive parallel work now runs only on P-cores or E-cores
});
```

#### Example 2: Adaptive Core Selection

For applications with well-understood workload characteristics, different arenas may be configured with different core
type constraints. The example below sketches how arenas for different workload priorities could be created:

```cpp
auto core_types = tbb::info::core_types();  // sorted from least to most performant

// Sketch on a three-core-type system; set_core_types is the proposed plural setter
// and its exact signature is still an open question.
tbb::task_arena foreground_arena(
    tbb::task_arena::constraints{}.set_core_types({core_types[1], core_types[2]}));  // P or E
tbb::task_arena background_arena(
    tbb::task_arena::constraints{}.set_core_types({core_types[0], core_types[1]}));  // E or LP E
```

The test infrastructure could generate all possible core type combinations using

| Aspect | Impact |
|--------|--------|
| **Runtime impact** | Negligible compared to task scheduling overhead |
| **Affinity operations** | Linear in number of core types, performed once at arena creation |

## Alternatives Considered

### Alternative 1: Accept Multiple Constraints Instances

Instead of modifying the `constraints` struct, introduce a new `task_arena` constructor that accepts a vector of
`constraints` instances. The arena would compute the union of affinity masks from all provided constraints, enabling
specification of multiple NUMA nodes and core types in a single arena.

```cpp
// Example usage
tbb::task_arena arena({
tbb::task_arena::constraints{}.set_core_type(core_types[1]),
tbb::task_arena::constraints{}.set_core_type(core_types[2])
});
```

**Pros:**
- More scalable: can extend to any other constraint type and specify multiple platform portions as a unified constraint
- Reuses existing `constraints` struct without modification
- Avoids bit-packing, format markers, and special value handling
- No risk of misinterpretation of existing single core type constraints

**Cons:**
- Requires creating multiple `constraints` objects for simple core type combinations
- A vector of `constraints` instances creates memory overhead compared with a single integer field that uses bit-packing
- Unclear how to handle conflicting `max_concurrency` or `max_threads_per_core` across instances

Review comments on this item:

- **Contributor:** A possible way of handling conflicting `max_concurrency` or `max_threads_per_core` is to interpret each `constraints` instance as an independent constraint with its own `max_concurrency` and `max_threads_per_core`. Once the affinity mask is determined for each constraint, the union of the masks is created, and the sum of the `max_concurrency` values can be used as the overall `max_concurrency` for the resulting `task_arena` instance. `max_threads_per_core` does not need any special handling, as it is directly reflected in the mask of the constraint.
- **vossmjp (Contributor), Dec 11, 2025:** I asked a similar question in PR1926, but it applies here too. Assume there is a system with three core types, indexes 0, 1, and 2, and there are 4 cores of each kind. What if the user wants 6 slots and wants to use the most performant cores? From what you describe, I think they would create a constraint for index 2 with a `max_concurrency` of 4 slots, and then another for index 1 with a `max_concurrency` of 2. Is that right? The result would be a `max_concurrency` of six and a mask that includes both index 2 and index 1 core types? So if all core types are idle, the arena may populate with 4 index 1 types and 2 index 2 types, or 2 index 1 types and 4 index 2 types, right? Would one be expected over the other?
- **Contributor Author:** I have removed this from the cons. If we adopt this alternative, the new behavior will be part of the proposal.
- **aleksei-fedotov (Contributor), Dec 12, 2025:** @vossmjp I think in this case the only possible way to populate the arena would be with 4 index 2 and 2 index 1 threads, since the user has specified constraints that result in a mask that maps threads joining the arena to 4 P-cores and 2 E-cores.
- **Contributor:** But you would need to pick which E-cores are in the mask.
- **aleksei-fedotov (Contributor), Dec 23, 2025:** Yeah, I understand your question now. It seems we can try applying the same approach we are using right now, which is "let the OS decide". Currently, if the user specifies 2 threads as `max_concurrency` for an arena constrained to E-cores on a platform that has, let's say, 10 of them, the mask is created for all these 10 E-cores, right? But only two threads will join the arena and have their affinity set to that mask. I guess with multiple constraints, the arena should apply the mask per constraint. For the example you described above, the first four threads joining the arena would have the affinity of the P-cores, while the last two threads would be assigned the E-cores mask. Or simply extend the current logic and make no preference for which cores to populate first; that is, use the united mask that includes all P-cores and all E-cores, and let the system decide on thread migration.


- Library entry points `constraints_default_concurrency()` and `constraints_threads_per_core()` accept a single
  `constraints` instance; new overloads or replacement APIs would be required, affecting the ABI
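
The aggregation rule suggested in the first review comment above can be illustrated with a minimal sketch. This is not
oneTBB code; the mask representation and the `resolved_constraint`/`combine` names are assumptions made purely for
illustration:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical representation: one flag per logical CPU plus the constraint's concurrency.
struct resolved_constraint {
    std::vector<bool> mask;
    int max_concurrency;
};

// Union of the per-constraint affinity masks; sum of the per-constraint concurrencies.
// Assumes parts is non-empty and all masks have the same length.
resolved_constraint combine(const std::vector<resolved_constraint>& parts) {
    resolved_constraint result{std::vector<bool>(parts.front().mask.size(), false), 0};
    for (const auto& part : parts) {
        for (std::size_t i = 0; i < part.mask.size(); ++i)
            result.mask[i] = result.mask[i] || part.mask[i];
        result.max_concurrency += part.max_concurrency;
    }
    return result;
}
```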

**Future Extensibility Consideration:** This approach naturally extends to other constraint types—if `set_core_types`
is added, a corresponding `set_numa_ids` function would likely follow. The choice between a vector of `constraints`
instances versus dedicated multi-value setters affects API consistency and usability: the former provides a unified
pattern for combining any constraints, while the latter offers more intuitive, type-specific methods.
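
For illustration, a hedged sketch of how the two directions might look side by side. Neither API exists today:
`set_core_types` and `set_numa_ids` are hypothetical multi-value setters, and the vector-of-constraints constructor is
the Alternative 1 described above:

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

void sketch_api_directions() {
    auto numa_nodes = tbb::info::numa_nodes();
    auto core_types = tbb::info::core_types();

    // (a) Dedicated multi-value setters on a single constraints instance.
    tbb::task_arena arena_a(tbb::task_arena::constraints{}
                                .set_numa_ids({numa_nodes.front()})
                                .set_core_types({core_types[1], core_types[2]}));

    // (b) A vector of single-value constraints combined by the arena (Alternative 1).
    tbb::task_arena arena_b({
        tbb::task_arena::constraints{}.set_numa_id(numa_nodes.front()).set_core_type(core_types[1]),
        tbb::task_arena::constraints{}.set_numa_id(numa_nodes.front()).set_core_type(core_types[2])
    });
}
```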

## Open Questions

1. **API Naming**: Is `set_core_types` (plural) sufficiently distinct from `set_core_type` (singular)?
Expand Down Expand Up @@ -282,4 +347,5 @@ void clear_core_types();
- Pros: Simpler logic, easier to extend
- Cons: Increases struct size, breaks ABI compatibility

6. **Info API**: Should `info::core_types()` be augmented with a function that returns a count instead of a vector,
   e.g., `info::num_core_types()`?