
Add deduplication pass for initializer tensors #67


Merged

Conversation

AbhishekHerbertSamuel
Contributor

Summary

This PR adds a new graph transformation pass: DeduplicateInitializersPass.

It removes duplicate initializer tensors (typically model weights) based on a unique fingerprint derived from:

  • Tensor byte content (tobytes())
  • Data type (dtype)
  • Shape

All redundant initializers are removed, and nodes referencing them are updated to use the canonical (first-seen) tensor.


Implementation Details

  • Fingerprints are tracked using a dictionary: (tobytes, dtype, shape) → name
  • Redundant initializers are removed using graph.initializers.pop(...)
  • Node inputs are updated via node.replace_input_with(...) for correctness and safety
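
The core of the pass can be sketched with plain dictionaries (an illustrative sketch only — the actual pass operates on onnx_ir Graph objects, and the function and parameter names here are hypothetical):

```python
import numpy as np

def deduplicate_initializers(initializers, node_inputs):
    """Collapse duplicate initializer arrays onto the first-seen name.

    initializers: dict of name -> np.ndarray
    node_inputs: list of per-node input-name lists
    """
    seen = {}   # (bytes, dtype, shape) -> canonical name
    remap = {}  # duplicate name -> canonical name
    for name, arr in list(initializers.items()):
        key = (arr.tobytes(), str(arr.dtype), arr.shape)
        if key in seen:
            remap[name] = seen[key]
            initializers.pop(name)  # drop the redundant tensor
        else:
            seen[key] = name
    # rewrite node inputs to point at the canonical tensors
    new_inputs = [[remap.get(n, n) for n in ins] for ins in node_inputs]
    return initializers, new_inputs
```

Two initializers with identical bytes, dtype, and shape collapse to one entry, and every node input that referenced the duplicate is rewritten to the surviving name.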

Benefits

  • Reduces memory and file size by eliminating duplicated weight tensors
  • Simplifies graph structure for downstream optimization and export

File Added

  • src/onnx_ir/passes/common/deduplicate_initializers.py

Closes

Closes #66

@AbhishekHerbertSamuel force-pushed the add-deduplicate-initializers-pass branch from f99fa0c to ae8f078 on June 5, 2025 at 07:57
Member

@justinchuby left a comment

It’s fine to use AI for contributions. Please ensure, however, that the code actually works.

@AbhishekHerbertSamuel
Contributor Author

Thank you for the feedback, Justin. I'll verify that it works before sending it here.

…bgraph traversal

Address reviewer feedback:
- Optimized memory by grouping by dtype and shape before comparing values
- Used iterate_graph to handle subgraphs
- Validated on normal and subgraph models; deduplication works as expected

Signed-off-by: Abhishek Herbert Samuel <[email protected]>
@AbhishekHerbertSamuel
Contributor Author

AbhishekHerbertSamuel commented Jun 6, 2025

Hi Justin,

Thanks again for your feedback! I've verified that the updated implementation works as intended. Here's the test setup and output (I ran the test locally and didn't push it here):

Local file path for the test: /Users/abhishekherbertsamuel/ir-py/src/test_local_dedup.py

Test code:

import numpy as np
from onnx_ir._core import Graph, Node, Tensor, Value
from onnx_ir.passes.common.deduplicate_initializers import DeduplicateInitializersPass

def test_normal_and_subgraph_dedup():
    print("\n=== TEST: Normal Graph and Subgraph Deduplication ===")

    # Shared tensor content
    arr = np.array([1, 2, 3])
    t1 = Tensor(arr)
    t2 = Tensor(arr.copy())  # clone with same content

    # Main graph values
    v1 = Value(name="w1", const_value=t1)
    v2 = Value(name="w2", const_value=t2)

    # Subgraph has its own separate Value object (same tensor, new graph-safe instance)
    sub_tensor = Tensor(arr.copy())
    sub_val = Value(name="w3", const_value=sub_tensor)

    # Subgraph node and graph
    sub_node = Node("", "Conv", inputs=[sub_val], outputs=[])
    subgraph = Graph(
        inputs=[],
        outputs=[],
        nodes=[sub_node],
        initializers=[sub_val],
        name="subgraph",
    )

    # Main graph node
    main_node = Node("", "Add", inputs=[v1, v2], outputs=[])

    # Attach subgraph manually to the node (mimics nested block structure)
    main_node.blocks = [subgraph]

    # Construct main graph
    parent_graph = Graph(
        inputs=[],
        outputs=[],
        nodes=[main_node],
        initializers=[v1, v2],
        name="main_graph",
    )

    print("Before Deduplication:")
    print("Main Graph Initializers:", list(parent_graph.initializers.keys()))
    print("Main Node inputs:", [v.name for v in main_node.inputs])
    print("Subgraph Initializers:", list(subgraph.initializers.keys()))
    print("Subgraph Node inputs:", [v.name for v in sub_node.inputs])

    # Apply deduplication
    DeduplicateInitializersPass().apply(parent_graph)

    print("\nAfter Deduplication:")
    print("Main Graph Initializers:", list(parent_graph.initializers.keys()))
    print("Main Node inputs:", [v.name for v in main_node.inputs])
    print("Subgraph Initializers:", list(subgraph.initializers.keys()))
    print("Subgraph Node inputs:", [v.name for v in sub_node.inputs])

if __name__ == "__main__":
    test_normal_and_subgraph_dedup()

Test screenshot: [attachment: Screenshot 2025-06-06 at 11:58:10 AM]

If I have missed out on anything, please let me know.

With regards,
Abhishek Herbert Samuel

@AbhishekHerbertSamuel
Contributor Author

Hi @justinchuby,

I've pushed the finalized implementation and test as separate, signed commits. The following have been addressed:

DeduplicateInitializersPass: Added under passes/common, follows repo conventions, uses (dtype, shape) → {tobytes: name} grouping for memory efficiency, and traverses all subgraphs via RecursiveGraphIterator.

Test coverage: A dedicated unittest verifies correct deduplication in the main graph and ensures subgraphs remain isolated.

Coding standards: Followed the structure and documentation style of other passes (e.g., topological_sort.py).

Commit signed: Used -s with a clean message summarizing the functionality.

I have also attached a screenshot of the unit test which passed successfully on my local copy of this repository.

Please let me know if any final changes are needed. Thanks again for your guidance and mentorship throughout this PR!

Best,
Abhishek Herbert Samuel

@justinchuby
Member

Please feel free to ask questions when you are going through the code base or need help understanding parts of the code. It would be helpful to take a look at other existing passes and usages to ensure they are implemented in a similar style.

@justinchuby
Member

My concern with this pass in particular is that we are using the full bytes in the lookup table, which is memory-intensive. I wonder if there is a good (efficient) hash method that can be applied to the bytes content, so we can use the hash value in the lookup table. Only when the hash matches would we compare the actual bytes.

@AbhishekHerbertSamuel
Contributor Author

Hi @justinchuby,
Thanks a lot for your detailed feedback :)

I’ll update the class to inherit from ir.passes.InPlacePass as suggested and move the main logic into the call method, following the repo’s conventions (like in constant_manipulation.py).
I’ll also change the test imports to follow the module-only import guideline — thanks for pointing me to the correct example!

Regarding the memory concern:
You're absolutely right — using tobytes() directly is memory-intensive. I’ll switch to using sha256 to hash the tensor bytes first, which helps group potential duplicates quickly. Then, to avoid any risk of false positives from rare hash collisions, I’ll still compare the full bytes only when the hashes match. This keeps things memory-efficient while still being safe and accurate. Thanks again for the suggestion!
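
The hash-then-verify scheme described above can be sketched as follows (hypothetical standalone code over numpy arrays, not the pass's actual implementation):

```python
import hashlib
import numpy as np

def find_duplicates(initializers):
    """Map each duplicate initializer name to its canonical (first-seen) name.

    Buckets tensors by (dtype, shape, sha256 digest) so the lookup table holds
    a 32-byte digest instead of the full tensor bytes; raw bytes are compared
    only when digests collide, guarding against false positives.
    """
    buckets = {}  # (dtype, shape, digest) -> list of (name, bytes)
    remap = {}    # duplicate name -> canonical name
    for name, arr in initializers.items():
        data = arr.tobytes()
        digest = hashlib.sha256(data).digest()
        key = (str(arr.dtype), arr.shape, digest)
        for other_name, other_data in buckets.setdefault(key, []):
            if data == other_data:  # confirm byte-for-byte equality
                remap[name] = other_name
                break
        else:
            buckets[key].append((name, data))
    return remap
```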

Will push the changes shortly. Please let me know if I missed anything else. Appreciate your guidance!

Warm regards,
Abhishek Herbert Samuel

- Implemented DeduplicateInitializersPass to remove redundant initializers
  with identical shape, dtype, and values within individual graphs.
- Ensured deduplication is confined to the same graph scope (no cross-subgraph merging).
- Added unit tests covering:
  - Exact duplicates
  - Different shapes/dtypes
  - Scalars
  - Multiple duplicates
  - Non-deduplicable distinct values
- Removed subgraph-related tests due to ONNX serialization behavior omitting their initializers.

Signed-off-by: Abhishek Herbert Samuel <[email protected]>
@AbhishekHerbertSamuel
Contributor Author

Hi @justinchuby,
I've pushed the finalized version of DeduplicateInitializersPass along with a focused set of unit tests. The current tests comprehensively validate deduplication behavior across various scenarios—shape, dtype, scalar, and value uniqueness.

Tests involving subgraph initializers were removed, as ONNX drops those during serialization, making them unreliable to assert against. Let me know if you'd like a different strategy for subgraph coverage.

Thanks again for your guidance throughout!

Warm regards,
Abhishek Herbert Samuel


codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.44%. Comparing base (d41327e) to head (a039526).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...onnx_ir/passes/common/initializer_deduplication.py 84.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   74.39%   74.44%   +0.05%     
==========================================
  Files          37       38       +1     
  Lines        4648     4673      +25     
  Branches      950      954       +4     
==========================================
+ Hits         3458     3479      +21     
- Misses        839      841       +2     
- Partials      351      353       +2     


@justinchuby self-assigned this Jun 9, 2025
@AbhishekHerbertSamuel force-pushed the add-deduplicate-initializers-pass branch from a00be10 to 6b3e0b7 on June 11, 2025 at 08:40
@AbhishekHerbertSamuel
Contributor Author

Sure @justinchuby, will fix it and maintain code consistency :)

@justinchuby requested a review from Copilot June 13, 2025 03:08
Copilot

This comment was marked as outdated.

@AbhishekHerbertSamuel
Contributor Author

Thank you @xadupre @inisis @justinchuby for the feedback. Will make the requested changes and ensure that the PR is ready to be merged.

…nd size limit

- Avoids comparing large tensors >1024 elements to reduce performance overhead
- Compares shape and dtype before accessing tensor content
- Adds test coverage for subgraph deduplication (If node branches)
- Passes all linters: ruff, mypy, editorconfig

Signed-off-by: Abhishek Herbert Samuel <[email protected]>
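
The size-limit and ordering ideas in this commit can be sketched as a fingerprint helper (illustrative only; the constant and function names are hypothetical):

```python
import numpy as np

MAX_ELEMENTS = 1024  # skip tensors larger than this to bound the cost of dedup

def fingerprint(arr):
    """Return a dedup key for a small tensor, or None for one too large.

    Checks the cheap metadata (element count, dtype, shape) before touching
    the tensor's bytes, so large tensors are never serialized at all.
    """
    if arr.size > MAX_ELEMENTS:
        return None  # too large: leave it out of the dedup table
    return (str(arr.dtype), arr.shape, arr.tobytes())
```

Two small tensors deduplicate iff their fingerprints compare equal; tensors above the limit are simply left alone.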
@AbhishekHerbertSamuel
Contributor Author

@xadupre @justinchuby @inisis I have made the requested changes. Please check and let me know if it's ready for merging or if other changes need to be made prior to that. Thank you once again :)

@AbhishekHerbertSamuel
Contributor Author

Hi @justinchuby, is the code I submitted fine? Please let me know if there are any issues so that I can resolve them. As of now, 20/21 checks have passed (with 1 skipped).

@justinchuby
Member

Will take a look soon, thanks!

@justinchuby
Member

Thanks for your contribution. I updated your code to simplify some of the logic and moved to a simple byte comparison for now, because we have a small enough size limit.

@justinchuby changed the title Add deduplication pass for initializer tensors (#66) → Add deduplication pass for initializer tensors Jun 19, 2025
@justinchuby requested a review from Copilot June 19, 2025 20:22
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This pull request adds a new graph transformation pass to deduplicate initializer tensors based on their content, data type, and shape.

  • Introduces DeduplicateInitializersPass to remove redundant initializer tensors.
  • Adds unit tests to verify deduplication behavior across various scenarios.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/onnx_ir/passes/common/initializer_deduplication_test.py Adds unit tests for deduplication behavior
src/onnx_ir/passes/common/initializer_deduplication.py Implements the deduplication pass for initializer tensors

@justinchuby merged commit d8fa011 into onnx:main Jun 19, 2025
21 checks passed
@AbhishekHerbertSamuel
Contributor Author

@justinchuby thank you for the mentorship and support throughout this PR. This was my first time contributing to an open-source repository and I learnt a lot through this process. @xadupre @inisis @titaiwangms thank you for the constructive suggestions on this PR and related PRs (#98, #99), which helped bring this to completion. Looking forward to learning and building more in the ONNX community :)

Warm regards,
Abhishek Herbert Samuel


Successfully merging this pull request may close these issues.

Create a tensor de-duplication pass
4 participants