
Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance #367

Merged
shijieliu merged 5 commits into NVIDIA:main from ShaobinChen-AH:fix-issue-350
Apr 27, 2026
Conversation

@ShaobinChen-AH
Contributor

@ShaobinChen-AH ShaobinChen-AH commented Apr 17, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary

This PR adds hash_roundrobin as a new routing mode for DynamicEmb row-wise (RW) planning.

The goal is narrow: fix load imbalance caused by pathological raw-key patterns that can break plain modulo-based roundrobin. The new mode hashes the raw key first, then assigns the owner rank from the hashed key.
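The hash-then-modulo routing can be sketched in a few lines. This is a minimal illustration, not the PR's device code: the review summary below mentions a MurmurHash3 finalizer, so the standard fmix64 constants are assumed here; the CUDA implementation may differ in detail.

```python
def murmur3_fmix64(key: int) -> int:
    """MurmurHash3 64-bit finalizer (fmix64): scrambles a raw 64-bit key."""
    mask = (1 << 64) - 1
    k = key & mask
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & mask
    k ^= k >> 33
    k = (k * 0xC4CEB9F3574870C9) & mask
    k ^= k >> 33
    return k

def owner_rank(key: int, world_size: int, hashed: bool = True) -> int:
    """roundrobin assigns key % world_size; hash_roundrobin hashes first."""
    k = murmur3_fmix64(key) if hashed else key
    return k % world_size
```

On a stride-aliased key set (e.g. all keys multiples of `world_size`), the hashed variant spreads keys across ranks that plain modulo would collapse onto a single rank.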

This PR does not claim to solve general hot-key or Zipf-skew load balancing.

Changes

  • add hash_roundrobin support in the DynamicEmb input-distribution path
  • map hash_roundrobin to dist_type = 2 for the CUDA extension path
  • add hash_roundrobin as an opt-in RW routing mode; the default remains roundrobin
  • add a regression test file covering:
    • CPU reference owner mapping
    • adversarial modulo-aliasing patterns
    • same logical workload with different raw-key encodings
    • CUDA kernel parity against CPU reference
    • comparative real/Zipf-style robustness checks
  • add the new regression test to the regular test/unit_test.sh flow
  • document supported dist_type values and clarify the intended scope of hash_roundrobin

Why

Issue #350 points out that plain roundrobin can become imbalanced when raw keys follow special patterns. Hashing the raw key before RW rank assignment makes the routing much less sensitive to those patterns while preserving the existing overall bucketization flow.
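The imbalance is easy to reproduce in plain Python. The stride-16 key pattern below is a hypothetical example chosen for illustration, not taken from issue #350:

```python
from collections import Counter

world_size = 8
# Hypothetical pathological pattern: every raw key is a multiple of 16.
# Because 16 shares the factor 8 with world_size, key % world_size is
# always 0, so plain roundrobin routes every row to rank 0.
keys = [uid * 16 for uid in range(2000)]
load = Counter(k % world_size for k in keys)
print(load)  # Counter({0: 2000}) -- all 2000 keys land on one rank
```

Hashing the raw key first breaks up the arithmetic structure that causes this aliasing, without changing the overall bucketization flow.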

Validation

Validated on a clean rebuild in the target Ubuntu Docker environment.

  • python3 -m pytest -svv test/unit_tests/test_hash_roundrobin_kuairand.py
    • 16 passed
  • CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 ./test/unit_tests/test_sequence_embedding_fw.py --print_sharding_plan --optimizer_type "adam" --use_index_dedup True
    • passed on NVIDIA RTX A6000

Notes

@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR adds hash_roundrobin as a new DynamicEmb row-wise routing mode (mapped to dist_type=2) that applies a MurmurHash3 finalizer before modulo-rank assignment, mitigating load imbalance caused by pathological raw-key patterns. It also saves dist_type in checkpoint metadata and validates it on load (with a "roundrobin" legacy fallback), ensuring checkpoint compatibility is guarded by a clear error rather than silent mismatch.
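The checkpoint guard described above can be sketched as follows. The function name and metadata layout are assumptions for illustration; only the behavior (validate `dist_type` on load, with a legacy fallback to "roundrobin" when the key is absent) follows the summary:

```python
import json

def validate_dist_type(meta_json: str, configured: str) -> str:
    """Reject a checkpoint whose saved dist_type differs from the configured one."""
    meta = json.loads(meta_json)
    # Legacy checkpoints predate the dist_type key; they were written with
    # the old roundrobin routing, so fall back to that.
    saved = meta.get("dist_type", "roundrobin")
    if saved != configured:
        raise ValueError(
            f"checkpoint dist_type {saved!r} does not match configured {configured!r}"
        )
    return saved
```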

  • P1 (example.py lines 68–72): The module-level rank_print closure still references local_rank, which was removed from module scope and moved inside init_runtime(). builtins.print is set to this broken function at import time, so any print() call before init_runtime() completes raises NameError: name 'local_rank' is not defined.

Confidence Score: 4/5

Safe to merge after fixing the stale local_rank reference in the module-level rank_print; core routing and checkpoint logic is sound.

One P1 defect in example.py: the module-level rank_print closure references local_rank which is no longer in module scope, making builtins.print broken between import and init_runtime(). All other changes — CUDA kernel, checkpoint metadata, dist_type validation, and tests — are correct and well-tested.

corelib/dynamicemb/example/example.py (stale local_rank reference in module-level rank_print)

Important Files Changed

Filename Overview
corelib/dynamicemb/example/example.py Major refactoring to RuntimeContext dataclass; introduces --dist_type arg; stale local_rank reference in module-level rank_print causes NameError if print is called before init_runtime()
corelib/dynamicemb/dynamicemb/dynamicemb_config.py Adds SUPPORTED_DIST_TYPES constant, dist_type field to DynamicEmbTableOptions with validation and grouped_key inclusion; clean and correct
corelib/dynamicemb/src/sparse_block_bucketize_features.cu Adds MurmurHash3-finalizer hash_key device function, changes dist_type parameter from bool to int32_t in both kernels, adds hash_roundrobin branch; mixed tabs/spaces in new if/else blocks
corelib/dynamicemb/dynamicemb/key_value_table.py Saves dist_type to checkpoint metadata and validates it on load with legacy fallback to roundrobin; well-handled
corelib/dynamicemb/test/unit_tests/test_hash_roundrobin_kuairand.py New test file covering CPU reference, adversarial modulo-aliasing, CUDA kernel parity, and Zipf robustness; thorough and well-structured
corelib/dynamicemb/test/test_batched_dynamic_embedding_tables_v2.py Adds dist_type round-trip dump/load tests and mismatch rejection test; correctly tests legacy metadata fallback

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[bucketize_kjt_before_all2all] --> B{dist_type_str per feature}
    B -->|continuous| C[dist_type = 0]
    B -->|roundrobin| D[dist_type = 1]
    B -->|hash_roundrobin| E[dist_type = 2]
    C --> F[torch.int32 tensor]
    D --> F
    E --> F
    F --> G[CUDA: block_bucketize_sparse_features]
    G --> H{dist_type per index}
    H -->|0| I[p = idx / blk_size or idx % my_size]
    H -->|1| J[p = idx % my_size]
    H -->|2| K[p = hash_key(idx) % my_size]
    I --> L[new_lengths / new_indices]
    J --> L
    K --> L

    subgraph Checkpoint Safety
        M[DynamicEmbDump] -->|writes dist_type to meta.json| N[checkpoint]
        N -->|load| O{dist_type match?}
        O -->|yes| P[load succeeds]
        O -->|no| Q[ValueError raised]
        O -->|key absent - legacy| R[default to roundrobin]
    end

Comments Outside Diff (1)

  1. corelib/dynamicemb/example/example.py, line 68-72 (link)

    P1 Module-level rank_print references undefined local_rank

    The module-level rank_print (line 68–69) was left unchanged from the original code, but the local_rank = dist.get_rank() assignment it depends on was removed from module scope and moved inside init_runtime(). Line 72 then installs this broken function as builtins.print. Any call to print() between module import and init_runtime() completing — including from argparse error paths or third-party library code — will raise NameError: name 'local_rank' is not defined.

    The fix is to either remove the stale module-level rank_print / builtins.print assignment (since init_runtime() now handles it), or replace local_rank with a safe fallback:

Reviews (6): Last reviewed commit: "update example"

Comment thread corelib/dynamicemb/dynamicemb/planner/planner.py Outdated
Comment thread corelib/dynamicemb/src/sparse_block_bucketize_features.cu
@ShaobinChen-AH ShaobinChen-AH changed the title from fix-issue-350 to Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance on Apr 17, 2026
@shijieliu
Collaborator

shijieliu commented Apr 20, 2026

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

@shijieliu shijieliu requested a review from jiashuy April 21, 2026 01:20
Collaborator

@jiashuy jiashuy left a comment


We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choices:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin the default value
  2. use hash_roundrobin as the default here, and adjust our tests that treat dist_type as roundrobin

@ShaobinChen-AH
Contributor Author

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

dist_type is now exposed via DynamicEmbTableOptions, with roundrobin kept as the default for compatibility. The planner now reads opts.dist_type instead of hardcoding the routing mode, so hash_roundrobin is an explicit opt-in path.

@ShaobinChen-AH
Contributor Author

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choices:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin the default value
  2. use hash_roundrobin as the default here, and adjust our tests that treat dist_type as roundrobin

I updated dump/load to persist dist_type in checkpoint metadata and validate it at load time, so mismatched input-distribution settings now fail loudly instead of silently loading. I also added end-to-end validation through the user-facing path (DynamicEmbTableOptions(dist_type="hash_roundrobin") -> planner/sharding/input-dist -> dump/load smoke), in addition to the existing kernel/parity benchmark.

@shijieliu
Collaborator

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

@shijieliu
Collaborator

/build

@ShaobinChen-AH
Contributor Author

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

I updated the DynamicEmb MovieLens example to demonstrate different input distribution policies via a new --dist_type CLI option (continuous / roundrobin / hash_roundrobin, default roundrobin).
I also made the example entrypoint import-safe/help-safe, so python3 example.py --help now works without distributed-init errors.
Validation:
python3 example.py --help
python3 -c "import example"
torchrun --standalone --nproc_per_node=2 example.py --train --dist_type hash_roundrobin
torchrun --standalone --nproc_per_node=2 example.py --train

ALL PASSED.

@jiashuy
Collaborator

jiashuy commented Apr 27, 2026

/build

@jiashuy
Collaborator

jiashuy commented Apr 27, 2026

CI

Collaborator

@jiashuy jiashuy left a comment


LGTM

@shijieliu shijieliu merged commit e4fc876 into NVIDIA:main Apr 27, 2026
1 check failed
@JacoCheung
Collaborator

/build

@JacoCheung
Collaborator

JacoCheung commented Apr 28, 2026

Pipeline #49661872 -- failed

Job Status
pre_check ❌ failed
train_build ✅ success
inference_build ✅ success
tritonserver_build ✅ success
build_whl ✅ success
dynamicemb_test_fwd_bwd_8gpus ✅ success
dynamicemb_test_load_dump_8gpus ✅ success
unit_test_1gpu_a100 ✅ success
unit_test_1gpu_h100 ❌ failed
unit_test_4gpu ✅ success
unit_test_tp_4gpu ❌ failed
L20_unit_test_1gpu ✅ success
inference_unit_test_1gpu ✅ success
inference_test_1gpu ✅ success
inference_test_1gpu ✅ success view

Result: 11/14 jobs passed

