
Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance #367

Merged
shijieliu merged 5 commits into NVIDIA:main from ShaobinChen-AH:fix-issue-350
Apr 27, 2026
Conversation

@ShaobinChen-AH
Contributor

@ShaobinChen-AH ShaobinChen-AH commented Apr 17, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary

This PR adds hash_roundrobin as a new routing mode for DynamicEmb row-wise (RW) planning.

The goal is narrow: fix load imbalance caused by pathological raw-key patterns that can break plain modulo-based roundrobin. The new mode hashes the raw key first, then assigns the owner rank from the hashed key.
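The hash-then-modulo routing can be sketched in a few lines. This is a minimal illustration, not the PR's device code: the review summary below mentions a MurmurHash3 finalizer, so the standard fmix64 constants are assumed here; the CUDA implementation may differ in detail.

```python
def murmur3_fmix64(key: int) -> int:
    """MurmurHash3 64-bit finalizer (fmix64): scrambles a raw 64-bit key."""
    mask = (1 << 64) - 1
    k = key & mask
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & mask
    k ^= k >> 33
    k = (k * 0xC4CEB9F3574870C9) & mask
    k ^= k >> 33
    return k

def owner_rank(key: int, world_size: int, hashed: bool = True) -> int:
    """roundrobin assigns key % world_size; hash_roundrobin hashes first."""
    k = murmur3_fmix64(key) if hashed else key
    return k % world_size
```

On a stride-aliased key set (e.g. all keys multiples of `world_size`), the hashed variant spreads keys across ranks that plain modulo would collapse onto a single rank.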

This PR does not claim to solve general hot-key or Zipf-skew load balancing.

Changes

  • add hash_roundrobin support in the DynamicEmb input-distribution path
  • map hash_roundrobin to dist_type = 2 for the CUDA extension path
  • add hash_roundrobin as an opt-in RW routing mode; the default remains roundrobin
  • add a regression test file covering:
    • CPU reference owner mapping
    • adversarial modulo-aliasing patterns
    • same logical workload with different raw-key encodings
    • CUDA kernel parity against CPU reference
    • comparative real/Zipf-style robustness checks
  • add the new regression test to the regular test/unit_test.sh flow
  • document supported dist_type values and clarify the intended scope of hash_roundrobin

Why

Issue #350 points out that plain roundrobin can become imbalanced when raw keys follow special patterns. Hashing the raw key before RW rank assignment makes the routing much less sensitive to those patterns while preserving the existing overall bucketization flow.
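The imbalance is easy to reproduce in plain Python. The stride-16 key pattern below is a hypothetical example chosen for illustration, not taken from issue #350:

```python
from collections import Counter

world_size = 8
# Hypothetical pathological pattern: every raw key is a multiple of 16.
# Because 16 shares the factor 8 with world_size, key % world_size is
# always 0, so plain roundrobin routes every row to rank 0.
keys = [uid * 16 for uid in range(2000)]
load = Counter(k % world_size for k in keys)
print(load)  # Counter({0: 2000}) -- all 2000 keys land on one rank
```

Hashing the raw key first breaks up the arithmetic structure that causes this aliasing, without changing the overall bucketization flow.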

Validation

Validated on a clean rebuild in the target Ubuntu Docker environment.

  • python3 -m pytest -svv test/unit_tests/test_hash_roundrobin_kuairand.py
    • 16 passed
  • CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 ./test/unit_tests/test_sequence_embedding_fw.py --print_sharding_plan --optimizer_type "adam" --use_index_dedup True
    • passed on NVIDIA RTX A6000

Notes

@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR adds hash_roundrobin as a new DynamicEmb row-wise routing mode (mapped to dist_type=2) that applies a MurmurHash3 finalizer before modulo-rank assignment, mitigating load imbalance caused by pathological raw-key patterns. It also saves dist_type in checkpoint metadata and validates it on load (with a "roundrobin" legacy fallback), ensuring checkpoint compatibility is guarded by a clear error rather than silent mismatch.
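The checkpoint guard described above can be sketched as follows. The function name and metadata layout are assumptions for illustration; only the behavior (validate `dist_type` on load, with a legacy fallback to "roundrobin" when the key is absent) follows the summary:

```python
import json

def validate_dist_type(meta_json: str, configured: str) -> str:
    """Reject a checkpoint whose saved dist_type differs from the configured one."""
    meta = json.loads(meta_json)
    # Legacy checkpoints predate the dist_type key; they were written with
    # the old roundrobin routing, so fall back to that.
    saved = meta.get("dist_type", "roundrobin")
    if saved != configured:
        raise ValueError(
            f"checkpoint dist_type {saved!r} does not match configured {configured!r}"
        )
    return saved
```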

  • P1 (example.py lines 68–72): The module-level rank_print closure still references local_rank, which was removed from module scope and moved inside init_runtime(). builtins.print is set to this broken function at import time, so any print() call before init_runtime() completes raises NameError: name 'local_rank' is not defined.

Confidence Score: 4/5

Safe to merge after fixing the stale local_rank reference in the module-level rank_print; core routing and checkpoint logic is sound.

One P1 defect in example.py: the module-level rank_print closure references local_rank which is no longer in module scope, making builtins.print broken between import and init_runtime(). All other changes — CUDA kernel, checkpoint metadata, dist_type validation, and tests — are correct and well-tested.

corelib/dynamicemb/example/example.py (stale local_rank reference in module-level rank_print)

Important Files Changed

Filename Overview
corelib/dynamicemb/example/example.py Major refactoring to RuntimeContext dataclass; introduces --dist_type arg; stale local_rank reference in module-level rank_print causes NameError if print is called before init_runtime()
corelib/dynamicemb/dynamicemb/dynamicemb_config.py Adds SUPPORTED_DIST_TYPES constant, dist_type field to DynamicEmbTableOptions with validation and grouped_key inclusion; clean and correct
corelib/dynamicemb/src/sparse_block_bucketize_features.cu Adds MurmurHash3-finalizer hash_key device function, changes dist_type parameter from bool to int32_t in both kernels, adds hash_roundrobin branch; mixed tabs/spaces in new if/else blocks
corelib/dynamicemb/dynamicemb/key_value_table.py Saves dist_type to checkpoint metadata and validates it on load with legacy fallback to roundrobin; well-handled
corelib/dynamicemb/test/unit_tests/test_hash_roundrobin_kuairand.py New test file covering CPU reference, adversarial modulo-aliasing, CUDA kernel parity, and Zipf robustness; thorough and well-structured
corelib/dynamicemb/test/test_batched_dynamic_embedding_tables_v2.py Adds dist_type round-trip dump/load tests and mismatch rejection test; correctly tests legacy metadata fallback

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[bucketize_kjt_before_all2all] --> B{dist_type_str per feature}
    B -->|continuous| C[dist_type = 0]
    B -->|roundrobin| D[dist_type = 1]
    B -->|hash_roundrobin| E[dist_type = 2]
    C --> F[torch.int32 tensor]
    D --> F
    E --> F
    F --> G[CUDA: block_bucketize_sparse_features]
    G --> H{dist_type per index}
    H -->|0| I[p = idx / blk_size or idx % my_size]
    H -->|1| J[p = idx % my_size]
    H -->|2| K[p = hash_key(idx) % my_size]
    I --> L[new_lengths / new_indices]
    J --> L
    K --> L

    subgraph Checkpoint Safety
        M[DynamicEmbDump] -->|writes dist_type to meta.json| N[checkpoint]
        N -->|load| O{dist_type match?}
        O -->|yes| P[load succeeds]
        O -->|no| Q[ValueError raised]
        O -->|key absent - legacy| R[default to roundrobin]
    end

Comments Outside Diff (1)

  1. corelib/dynamicemb/example/example.py, line 68-72 (link)

    P1 Module-level rank_print references undefined local_rank

    The module-level rank_print (line 68–69) was left unchanged from the original code, but the local_rank = dist.get_rank() assignment it depends on was removed from module scope and moved inside init_runtime(). Line 72 then installs this broken function as builtins.print. Any call to print() between module import and init_runtime() completing — including from argparse error paths or third-party library code — will raise NameError: name 'local_rank' is not defined.

    The fix is to either remove the stale module-level rank_print / builtins.print assignment (since init_runtime() now handles it), or replace local_rank with a safe fallback:

Reviews (6): Last reviewed commit: "update example"

Comment thread corelib/dynamicemb/dynamicemb/planner/planner.py Outdated
Comment thread corelib/dynamicemb/src/sparse_block_bucketize_features.cu
@ShaobinChen-AH ShaobinChen-AH changed the title from fix-issue-350 to Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance on Apr 17, 2026
@shijieliu
Collaborator

shijieliu commented Apr 20, 2026

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

@shijieliu shijieliu requested a review from jiashuy April 21, 2026 01:20
Collaborator

@jiashuy jiashuy left a comment


We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choices:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin the default value
  2. use hash_roundrobin as the default here, and adjust our tests that treat dist_type as roundrobin

@ShaobinChen-AH
Contributor Author

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

dist_type is now exposed via DynamicEmbTableOptions, with roundrobin kept as the default for compatibility. The planner now reads opts.dist_type instead of hardcoding the routing mode, so hash_roundrobin is an explicit opt-in path.

@ShaobinChen-AH
Contributor Author

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choices:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin the default value
  2. use hash_roundrobin as the default here, and adjust our tests that treat dist_type as roundrobin

I updated dump/load to persist dist_type in checkpoint metadata and validate it at load time, so mismatched input-distribution settings now fail loudly instead of silently loading. I also added end-to-end validation through the user-facing path (DynamicEmbTableOptions(dist_type="hash_roundrobin") -> planner/sharding/input-dist -> dump/load smoke), in addition to the existing kernel/parity benchmark.

@shijieliu
Collaborator

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

@shijieliu
Collaborator

/build

@ShaobinChen-AH
Contributor Author

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

I updated the DynamicEmb MovieLens example to demonstrate different input distribution policies via a new --dist_type CLI option (continuous / roundrobin / hash_roundrobin, default roundrobin).
I also made the example entrypoint import-safe/help-safe, so python3 example.py --help now works without distributed-init errors.
Validation:
python3 example.py --help
python3 -c "import example"
torchrun --standalone --nproc_per_node=2 example.py --train --dist_type hash_roundrobin
torchrun --standalone --nproc_per_node=2 example.py --train

ALL PASSED.

@jiashuy
Collaborator

jiashuy commented Apr 27, 2026

/build

@jiashuy
Collaborator

jiashuy commented Apr 27, 2026

CI

Collaborator

@jiashuy jiashuy left a comment


LGTM

@shijieliu shijieliu merged commit e4fc876 into NVIDIA:main Apr 27, 2026
1 check failed
@JacoCheung
Collaborator

/build

@JacoCheung
Collaborator

JacoCheung commented Apr 28, 2026

Pipeline #49661872 -- failed

Job Status
pre_check ❌ failed
train_build ✅ success
inference_build ✅ success
tritonserver_build ✅ success
build_whl ✅ success
dynamicemb_test_fwd_bwd_8gpus ✅ success
dynamicemb_test_load_dump_8gpus ✅ success
unit_test_1gpu_a100 ✅ success
unit_test_1gpu_h100 ❌ failed
unit_test_4gpu ✅ success
unit_test_tp_4gpu ❌ failed
L20_unit_test_1gpu ✅ success
inference_unit_test_1gpu ✅ success
inference_test_1gpu ✅ success
inference_test_1gpu ✅ success view

Result: 11/14 jobs passed

