Arm backend: Add evaluate_model.py#18199

Open
martinlsm wants to merge 2 commits into pytorch:main from martinlsm:marlin-evaluate-model

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 16, 2026

Arm backend: Add evaluate_model.py

This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The program is
evaluate_model.py and it imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code that is
focused on evaluating a model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:

  • TOSA reference models (INT, FP).
  • Evaluating a model that is quantized and/or lowered.
    I.e., it is possible to evaluate a model that is quantized but not
    lowered, lowered but not quantized, or both at the same time.
  • The program can cast the model with the --dtype flag to evaluate a
    model in e.g., bf16 or fp16 format.
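
The flag combinations listed above could be checked roughly as follows. This is a minimal sketch, not the code in this patch: the flag names `--quant_mode`, `--delegate`, and `--dtype` and the two error messages appear in this PR, while `build_parser`, `validate`, the example values, and the assumption that `--delegate` is a boolean switch are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag names are taken from this PR.
    parser = argparse.ArgumentParser(description="Flag-combination sketch")
    parser.add_argument("--quant_mode", default=None)  # value such as "int8" is assumed
    parser.add_argument("--delegate", action="store_true")  # assumed to be a switch
    parser.add_argument("--dtype", default=None)  # e.g. "bf16" or "fp16"
    return parser


def validate(args: argparse.Namespace) -> None:
    # Casting and quantization are treated as mutually exclusive.
    if args.quant_mode is not None and args.dtype is not None:
        raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")
    # At least one of the two transformations must be requested.
    if args.quant_mode is None and not args.delegate:
        raise ValueError(
            "The model to test must be either quantized or delegated "
            "(--quant_mode or --delegate)."
        )


if __name__ == "__main__":
    parser = build_parser()
    # Lowered but not quantized, cast to bf16: accepted by this sketch.
    validate(parser.parse_args(["--delegate", "--dtype", "bf16"]))
    # Quantized and lowered: accepted by this sketch.
    validate(parser.parse_args(["--quant_mode", "int8", "--delegate"]))
```

Under these assumptions, a run with neither flag, or with both `--quant_mode` and `--dtype`, is rejected before any compilation work starts.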

Also add tests that exercise evaluate_model.py with different command
line arguments.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

@martinlsm martinlsm requested a review from digantdesai as a code owner March 16, 2026 15:18
Copilot AI review requested due to automatic review settings March 16, 2026 15:18
@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit d7219c7 with merge base c81126e:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Mar 16, 2026
@martinlsm martinlsm changed the title from "Marlin evaluate model" to "Arm backend: Add evaluate_model.py" Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm label Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: arm"

@pytorch-bot pytorch-bot bot added the release notes: arm label Mar 16, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.

Changes:

  • Add backends/arm/scripts/evaluate_model.py to compile + (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
  • Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
  • Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Files and descriptions:

  • examples/arm/aot_arm_compiler.py: Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py.
  • backends/arm/scripts/evaluate_model.py: Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output.
  • backends/arm/test/misc/test_evaluate_model.py: Adds integration-style tests invoking the new script with representative CLI flags and checking output structure.

if args.quant_mode is not None and args.dtype is not None:
    raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

evaluators: list[Evaluator] = [
Comment on lines +180 to +182
"The model to test must be either quantized or delegated (--quant_mode or --delegate)."
)


# Add evaluator for compression ratio of TOSA file
intermediates_path = Path(args.intermediates)
tosa_paths = list(intermediates_path.glob("*.tosa"))
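
For context on the hunk above, a compression-ratio evaluator over the `*.tosa` files in the intermediates directory might look roughly like this. It is a sketch under stated assumptions, not the evaluator in backends/arm/util/arm_model_evaluator.py: the ratio is assumed to be uncompressed size divided by compressed size, `zlib` stands in for whatever compressor the real evaluator uses, and `compression_ratio` and `tosa_compression_ratios` are hypothetical names.

```python
import zlib
from pathlib import Path


def compression_ratio(path: Path) -> float:
    # Assumed definition: original byte count divided by deflated byte count.
    data = path.read_bytes()
    if not data:
        raise ValueError(f"{path} is empty")
    compressed = zlib.compress(data, level=9)
    return len(data) / len(compressed)


def tosa_compression_ratios(intermediates: str) -> dict[str, float]:
    # Mirrors the glob in the hunk: only top-level *.tosa files are considered.
    intermediates_path = Path(intermediates)
    tosa_paths = sorted(intermediates_path.glob("*.tosa"))
    if not tosa_paths:
        raise FileNotFoundError(f"No .tosa files found in {intermediates_path}")
    return {p.name: compression_ratio(p) for p in tosa_paths}
```

Raising when the glob matches nothing is a design choice of this sketch; it makes a missing or wrong intermediates directory visible instead of silently reporting no metric.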
Comment on lines +17 to +22
# Add Executorch root to path so this script can be run from anywhere
_EXECUTORCH_DIR = Path(__file__).resolve().parents[3]
_EXECUTORCH_DIR_STR = str(_EXECUTORCH_DIR)
if _EXECUTORCH_DIR_STR not in sys.path:
    sys.path.insert(0, _EXECUTORCH_DIR_STR)

Comment on lines +70 to +72
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
Comment on lines +70 to +72
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
"provided, up to 1000 samples are used for calibration. "
"Supported files: Common image formats (e.g., .png or .jpg) if "
"using imagenet evaluator, otherwise .pt/.pth files. If not provided,"
"quantized models are calibrated on their example inputs."
Collaborator

@zingo zingo left a comment

OK to merge. This adds a new file, but since the file is its own test script, the buck2 files should not need updates.

Martin Lindström added 2 commits March 17, 2026 10:54
This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The program is
evaluate_model.py and it imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code that is
focused on evaluating a model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:
- TOSA reference models (INT, FP).
- Evaluating a model that is quantized and/or lowered.
  I.e., it is possible to evaluate a model that is quantized but not
  lowered, lowered but not quantized, or both at the same time.
- The program can cast the model with the --dtype flag to evaluate a
  model in e.g., bf16 or fp16 format.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46

Add two tests that exercise evaluate_model.py with different command
line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
@martinlsm martinlsm force-pushed the marlin-evaluate-model branch from 87a2dfd to d7219c7 March 17, 2026 09:55

Labels

  • ciflow/trunk
  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
  • release notes: arm (Changes to the ARM backend delegate)

3 participants