
Arm backend: Add evaluate_model.py#18199

Open
martinlsm wants to merge 2 commits intopytorch:mainfrom
martinlsm:marlin-evaluate-model

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 16, 2026

Arm backend: Add evaluate_model.py

This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The new program,
evaluate_model.py, imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code focused on
evaluating a model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:

  • TOSA reference models (INT, FP).
  • Evaluating a model that is quantized and/or lowered,
    i.e. quantized but not lowered, lowered but not quantized,
    or both at the same time.
  • Casting the model with the --dtype flag to evaluate it in,
    e.g., bf16 or fp16 format.

Also add tests that exercise evaluate_model.py with different command
line arguments.
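The quantize/lower/cast combinations above imply a small CLI surface with some mutual-exclusion rules. The following is an illustrative argparse sketch only: the flag names --quant_mode, --delegate, and --dtype appear in the PR's diff, but the defaults, choices, and helper names here are assumptions, not the actual evaluate_model.py implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative sketch; the real evaluate_model.py defines more flags.
    parser = argparse.ArgumentParser(
        description="Evaluate a model quantized and/or delegated for the Arm backend."
    )
    parser.add_argument("--quant_mode", default=None,
                        help="Quantization mode (flag name from the diff).")
    parser.add_argument("--delegate", action="store_true",
                        help="Lower (delegate) the model.")
    parser.add_argument("--dtype", default=None, choices=["bf16", "fp16"],
                        help="Cast the model before evaluation (choices assumed).")
    return parser


def validate(args: argparse.Namespace) -> None:
    # Mirrors the check quoted in the review threads below:
    # --dtype and --quant_mode are mutually exclusive.
    if args.quant_mode is not None and args.dtype is not None:
        raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")
    # A model must be quantized and/or delegated to be worth evaluating.
    if args.quant_mode is None and not args.delegate:
        raise ValueError(
            "The model to test must be either quantized or delegated "
            "(--quant_mode or --delegate)."
        )


if __name__ == "__main__":
    args = build_parser().parse_args(["--delegate", "--dtype", "bf16"])
    validate(args)
```

A delegated fp-cast run (`--delegate --dtype bf16`) passes validation, while combining `--quant_mode` with `--dtype` raises, matching the ValueError quoted in the review.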

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

Martin Lindström added 2 commits March 16, 2026 15:50

Commit 1: Arm backend: Add evaluate_model.py
(commit message matches the PR description above)

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46

Commit 2: Add two tests that exercise evaluate_model.py with different
command line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
@martinlsm martinlsm requested a review from digantdesai as a code owner March 16, 2026 15:18
Copilot AI review requested due to automatic review settings March 16, 2026 15:18
@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Awaiting Approval, 8 New Failures, 1 Cancelled Job

As of commit 87a2dfd with merge base 76df414 (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 16, 2026
@martinlsm martinlsm changed the title Marlin evaluate model Arm backend: Add evaluate_model.py Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: arm"

@pytorch-bot pytorch-bot bot added the release notes: arm Changes to the ARM backend delegate label Mar 16, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.

Changes:

  • Add backends/arm/scripts/evaluate_model.py to compile + (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
  • Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
  • Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Files changed:

  • examples/arm/aot_arm_compiler.py: Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py.
  • backends/arm/scripts/evaluate_model.py: Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output.
  • backends/arm/test/misc/test_evaluate_model.py: Adds integration-style tests invoking the new script with representative CLI flags and checking output structure.
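The tests described above validate the emitted metrics JSON. The actual schema is not shown on this page, so the following is a purely hypothetical sketch of such a structural check; the key names "top1" and "top5" are assumptions based on the top-1/top-5 accuracy mentioned in the script's description.

```python
import json


def check_metrics(raw: str, required_keys=("top1", "top5")) -> dict:
    # Hypothetical structural check: parse the metrics JSON emitted by an
    # evaluation run and require a few accuracy keys (names are assumed).
    metrics = json.loads(raw)
    missing = [k for k in required_keys if k not in metrics]
    if missing:
        raise KeyError(f"metrics JSON missing keys: {missing}")
    return metrics
```

A test in this style would parse the script's output file and assert the expected keys are present before comparing values.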


Quoted diff context from the review threads (the comment bodies are collapsed in this view):

    if args.quant_mode is not None and args.dtype is not None:
        raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

    evaluators: list[Evaluator] = [

Comment on lines +180 to +182:

            "The model to test must be either quantized or delegated (--quant_mode or --delegate)."
        )

    # Add evaluator for compression ratio of TOSA file
    intermediates_path = Path(args.intermediates)
    tosa_paths = list(intermediates_path.glob("*.tosa"))

Comment on lines +17 to +22:

    # Add Executorch root to path so this script can be run from anywhere
    _EXECUTORCH_DIR = Path(__file__).resolve().parents[3]
    _EXECUTORCH_DIR_STR = str(_EXECUTORCH_DIR)
    if _EXECUTORCH_DIR_STR not in sys.path:
        sys.path.insert(0, _EXECUTORCH_DIR_STR)

Comment on lines +70 to +72:

        "Evaluate a model quantized and/or delegated for the Arm backend."
        " Evaluations include numerical comparison to the original model"
        "and/or top-1/top-5 accuracy if applicable."

        "provided, up to 1000 samples are used for calibration. "
        "Supported files: Common image formats (e.g., .png or .jpg) if "
        "using imagenet evaluator, otherwise .pt/.pth files. If not provided,"
        "quantized models are calibrated on their example inputs."
Collaborator

@zingo zingo left a comment


OK to merge. This adds a new file, but since the file is its own test script, the buck2 files should not need updates.


Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm release notes: arm Changes to the ARM backend delegate


3 participants