
Arm backend: Add evaluate_model.py#18199

Open
martinlsm wants to merge 2 commits intopytorch:mainfrom
martinlsm:marlin-evaluate-model

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 16, 2026

Arm backend: Add evaluate_model.py

This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The new program,
evaluate_model.py, imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code focused on
evaluating a model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:

  • TOSA reference models (INT, FP).
  • Evaluating a model that is quantized and/or lowered,
    i.e. quantized but not lowered, lowered but not quantized,
    or both at the same time.
  • Casting the model with the --dtype flag to evaluate it in,
    e.g., bf16 or fp16 format.

Also add tests that exercise evaluate_model.py with different command
line arguments.
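The quantize/lower/cast combinations above imply a small CLI surface with some mutual-exclusion rules. The following is an illustrative argparse sketch only: the flag names --quant_mode, --delegate, and --dtype appear in the PR's diff, but the defaults, choices, and helper names here are assumptions, not the actual evaluate_model.py implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative sketch; the real evaluate_model.py defines more flags.
    parser = argparse.ArgumentParser(
        description="Evaluate a model quantized and/or delegated for the Arm backend."
    )
    parser.add_argument("--quant_mode", default=None,
                        help="Quantization mode (flag name from the diff).")
    parser.add_argument("--delegate", action="store_true",
                        help="Lower (delegate) the model.")
    parser.add_argument("--dtype", default=None, choices=["bf16", "fp16"],
                        help="Cast the model before evaluation (choices assumed).")
    return parser


def validate(args: argparse.Namespace) -> None:
    # Mirrors the check quoted in the review threads below:
    # --dtype and --quant_mode are mutually exclusive.
    if args.quant_mode is not None and args.dtype is not None:
        raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")
    # A model must be quantized and/or delegated to be worth evaluating.
    if args.quant_mode is None and not args.delegate:
        raise ValueError(
            "The model to test must be either quantized or delegated "
            "(--quant_mode or --delegate)."
        )


if __name__ == "__main__":
    args = build_parser().parse_args(["--delegate", "--dtype", "bf16"])
    validate(args)
```

A delegated fp-cast run (`--delegate --dtype bf16`) passes validation, while combining `--quant_mode` with `--dtype` raises, matching the ValueError quoted in the review.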

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

Martin Lindström added 2 commits March 16, 2026 15:50

Commit 1: Arm backend: Add evaluate_model.py
(commit message matches the PR description above)

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46

Commit 2: Add two tests that exercise evaluate_model.py with different
command line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
@martinlsm martinlsm requested a review from digantdesai as a code owner March 16, 2026 15:18
Copilot AI review requested due to automatic review settings March 16, 2026 15:18
@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Awaiting Approval, 8 New Failures, 1 Cancelled Job

As of commit 87a2dfd with merge base 76df414 (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 16, 2026
@martinlsm martinlsm changed the title Marlin evaluate model Arm backend: Add evaluate_model.py Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: arm"

@pytorch-bot pytorch-bot bot added the release notes: arm Changes to the ARM backend delegate label Mar 16, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.

Changes:

  • Add backends/arm/scripts/evaluate_model.py to compile + (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
  • Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
  • Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Files changed:

  • examples/arm/aot_arm_compiler.py: Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py.
  • backends/arm/scripts/evaluate_model.py: Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output.
  • backends/arm/test/misc/test_evaluate_model.py: Adds integration-style tests invoking the new script with representative CLI flags and checking output structure.
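The tests described above validate the emitted metrics JSON. The actual schema is not shown on this page, so the following is a purely hypothetical sketch of such a structural check; the key names "top1" and "top5" are assumptions based on the top-1/top-5 accuracy mentioned in the script's description.

```python
import json


def check_metrics(raw: str, required_keys=("top1", "top5")) -> dict:
    # Hypothetical structural check: parse the metrics JSON emitted by an
    # evaluation run and require a few accuracy keys (names are assumed).
    metrics = json.loads(raw)
    missing = [k for k in required_keys if k not in metrics]
    if missing:
        raise KeyError(f"metrics JSON missing keys: {missing}")
    return metrics
```

A test in this style would parse the script's output file and assert the expected keys are present before comparing values.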


Quoted diff context from the review threads (the comment bodies are collapsed in this view):

    if args.quant_mode is not None and args.dtype is not None:
        raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

    evaluators: list[Evaluator] = [

Comment on lines +180 to +182:

            "The model to test must be either quantized or delegated (--quant_mode or --delegate)."
        )

    # Add evaluator for compression ratio of TOSA file
    intermediates_path = Path(args.intermediates)
    tosa_paths = list(intermediates_path.glob("*.tosa"))

Comment on lines +17 to +22:

    # Add Executorch root to path so this script can be run from anywhere
    _EXECUTORCH_DIR = Path(__file__).resolve().parents[3]
    _EXECUTORCH_DIR_STR = str(_EXECUTORCH_DIR)
    if _EXECUTORCH_DIR_STR not in sys.path:
        sys.path.insert(0, _EXECUTORCH_DIR_STR)

Comment on lines +70 to +72:

        "Evaluate a model quantized and/or delegated for the Arm backend."
        " Evaluations include numerical comparison to the original model"
        "and/or top-1/top-5 accuracy if applicable."

        "provided, up to 1000 samples are used for calibration. "
        "Supported files: Common image formats (e.g., .png or .jpg) if "
        "using imagenet evaluator, otherwise .pt/.pth files. If not provided,"
        "quantized models are calibrated on their example inputs."
Collaborator

@zingo zingo left a comment


OK to merge. This adds a new file, but since the file is its own test script, the buck2 files should not need updates.


Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm release notes: arm Changes to the ARM backend delegate


3 participants