[Cadence] add reference quantized fully connected out #9018
Closed: zonglinpeng wants to merge 49 commits into cadence/add-quant-relu-per-tensor-x86 from cadence/add-reference-quantized_fully_connected_out
Conversation
Differential Revision: D70503708 Pull Request resolved: #8888
Differential Revision: D70528853 Pull Request resolved: #8908
Differential Revision: D70538475 Pull Request resolved: #8918
…8616) * Qualcomm AI Engine Direct - Meta CI for Mobilebert and W2L * variable update
The TOSA compiler previously had a bug that caused a segmentation fault when loading the Wav2Letter model on Ethos-U85. This issue has now been fixed. Enable the test that previously failed due to this bug. Co-authored-by: Martin Lindström <[email protected]>
* Update using-executorch-building-from-source.md * Update using-executorch-building-from-source.md
…8411) * [ARM backend] Update fuse_batchnorm_pass to create new placeholders
- This allows fusing bn+convs with multiple users of the same weights.
- Adds new util functions create/delete_const_placeholders to take care of updating the GraphSignature and state_dict/constants dict when handling constant placeholders.
- Adds and updates related tests.
Change-Id: I8e550614d9741de840786d9dca9f30af9eb95a64
* Move create/delete_constant_node utils to shared folder
Change-Id: I3a82f58f9796e421bd205f030f7d79d72a2f7ed9
* Add buck dependency
* Fix bazel build
Co-authored-by: Digant Desai <[email protected]>
Currently the result has a large variance caused by outliers, so use only the middle 80% of samples (a trimmed mean with 20% total cut, i.e. trimmean 0.2).
* up * up * up * up * up * up * up * up * up * up * up * up * up
) This is not supported, so we shouldn't partition it. Add an expectedFailure test to indicate that this is not supported Differential Revision: [D70343584](https://our.internmc.facebook.com/intern/diff/D70343584/) ghstack-source-id: 269356867 Pull Request resolved: #8891 Co-authored-by: Digant Desai <[email protected]>
Doesn't seem to be any reason not to allow optimized ops for this one.
[Tensor.cpp] add algorithm include to get stable_sort and iter_swap on windows
Differential Revision: D69947096 Pull Request resolved: #8896
Pull Request resolved: #8892 Differential Revision: [D70372220](https://our.internmc.facebook.com/intern/diff/D70372220/) ghstack-source-id: 269599293 Co-authored-by: Digant Desai <[email protected]>
We don't need a second isnan; see code comment. (This is a small optimization.)
Differential Revision: D70528015 Pull Request resolved: #8906
Differential Revision: D70540908 Pull Request resolved: #8943
Update [ghstack-poisoned]
See class comment. In brief, this adds an iterable range to make broadcasting ops convenient and efficient to implement.
#8941) Fix Timing Adapter settings depending on the memory mode & placement in the linker script
Pull Request resolved: #8887
PteDataMap is the NamedDataMap that will live in the runtime. It is used to give delegates access to opaque named data stored in the PTE file. Open to alternative naming suggestions, maybe 'PTEDataMap' or 'ProgramDataMap'?
**Usage**
The PteDataMap is owned by the program and instantiated at program load time if named_data exists in the PTE file. We introduce usage of 'std::optional' here; could we also use executorch::aten::optional to avoid pulling in the standard library? When initializing delegates, the PteDataMap is given to delegate_init. Delegates can retrieve opaque delegate data by key using 'get_data'. This gives them a FreeableBuffer that they can free later.
**Testing**
The test uses the C++ flatbuffer API to build a fake program containing named data. We also create a temp file with sample data that the data loader can wrap around. TODO: e2e test once delegate AOT is ready and we can generate a file with named data.
**Note**
As the PteDataMap wraps around flatbuffer constructs, the Program must outlive the PteDataMap. PteDataMap does not implement:
- get_metadata; currently, all data stored is opaque. Later, we can implement get_metadata if a backend stores plain tensor data.
- load_into; this is mostly used for the training case and isn't used by delegates, at least not at the moment.
ghstack-source-id: 269779453
Differential Revision: [D70213646](https://our.internmc.facebook.com/intern/diff/D70213646/)
Co-authored-by: lucylq <[email protected]>
Differential Revision: D70541550 Pull Request resolved: #8921
Differential Revision: D70597013 Pull Request resolved: #8952
Differential Revision: D70577129 Pull Request resolved: #8940
* fix windows build issue * revert xnnpack cmakelist and fix file_data_loader * fix lint warning * undo non-relevant changes * linter error - extra newline --------- Co-authored-by: Chao Zhang <[email protected]>
partner engineers are calling ET via LlamaModule: https://github.com/pytorch/executorch/blob/main/extension/android/src/main/java/org/pytorch/executorch/LlamaModule.java This is a wrapper around the runner: https://www.internalfb.com/code/fbsource/[90d251fc01a84871b679406d6dc855eb5ded82fd]/fbcode/executorch/examples/models/llama/runner/runner.cpp?lines=47 Differential Revision: [D70596210](https://our.internmc.facebook.com/intern/diff/D70596210/) ghstack-source-id: 269741205 Pull Request resolved: #8953 Co-authored-by: lucylq <[email protected]>
Don't use invalid flags on Windows.
Differential Revision: D70334196 Pull Request resolved: #8478
Previously it was copied in several places per executorch_srcs.cmake. Needed for #8932. Test Plan: Compare cmake-out/executorch_srcs.cmake before/after for my usual testing cmake config with "all the CPU stuff" on; found that thread_parallel.cpp is now duplicated in only one place instead of multiple (it's in llama_runner, which needs a general fixup because it's duplicating several extensions).
Summary - Remove redundant directory Co-authored-by: DannyYuyang-quic <[email protected]>
* [executorch][runtime] Introduce PteDataMap for weight sharing
Pull Request resolved: #8887 (commit message identical to the PteDataMap entry above)
Differential Revision: [D70213646](https://our.internmc.facebook.com/intern/diff/D70213646/) ghstack-source-id: 269691307
* [executorch][runtime] Add get_named_data_map to Program
Pull Request resolved: #8853
Add this to the Program interface to allow users to retrieve the NamedDataMap.
Differential Revision: [D70276106](https://our.internmc.facebook.com/intern/diff/D70276106/) ghstack-source-id: 269693108
Co-authored-by: lucylq <[email protected]>
IMO, Buck visibility is just inverse deps. We should trust that people have a good reason to add deps rather than attempt to police them and require double entry in both deps and visibility, especially since we seem to be committed to APIs by default in OSS anyway. Specific motivation is that #8712 would otherwise have to ad-hoc slap ExecuTorch-wide visibility on a lot of targets, but I've held this view for a long time. Differential Revision: D70647462
Fix java build
* [Windows Build] Implement MMAP for mmap_data_loader.cpp
There is no sys/mman.h or POSIX-compatible mmap() implementation on Windows. The extension data loaders use it to map in data files, so this adds an implementation.
Test: ran .\cmake-out\extension\data_loader\test\Debug\extension_data_loader_test.exe --gtest_brief=1 --gtest_filter=MmapDataLoader*
Running main() from ...\executorch\third-party\googletest\googletest\src\gtest_main.cc
[==========] 8 tests from 1 test suite ran. (50 ms total)
[ PASSED ] 8 tests.
* apply code suggestions
* fix src -> srcs typo
* try to fix build
* fix src/headers brackets
Co-authored-by: Jeff Whiteside <[email protected]>
* Fix phi4mini test model * Remove pull trigger since done testing
Program schema should stay a private dep.
…ass_manager (#8997) Add FuseViewCopyTransform and FuseConstantsPass in arm_pass_manager
Both passes remove redundant ops from the graph:
- The FuseViewCopyTransform pass is added from backends/transforms to merge sequential view ops.
- FuseConstantOpsPass is created to compute ops with constant inputs AOT.
  - This is not done in cases where the result is a larger tensor, to avoid increasing the constant memory size.
  - For BI, ops are quantized with the q/dq-ops so as not to change the behaviour of the graph.
  - Pass order is important: the pass must be placed after all passes which may add constant ops, but before the InsertTableOpsPass, since it doesn't handle TOSA _table-ops.
Signed-off-by: Adrian Lundell <[email protected]>
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9018
This was referenced Mar 17, 2025
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
topic: not user facing
add reference quantized fully connected out
Test plan:
python3 -m examples.cadence.models.rnnt_joiner