
Qwen3-1.7B to SA8295 fails in dockerized build: num_sharding=1 hits QNN graph memory limit, num_sharding=2 causes host OOM #17782

@FouadSakr

Description


I’m converting qwen3-1_7b for SA8295 in a dockerized ExecuTorch environment.

I see two different failures depending on num_sharding:

  1. num_sharding=1: Fails during QNN graph finalization/serialization with a graph-memory limit error:
    graph requires estimated allocation of 2309576 KB, limit is 2097152 KB
    Failed to finalize Qnn Graph with error: 1002
    AssertionError: Failed to generate Qnn context binary.

  2. num_sharding=2: The compilation process runs out of host memory (OOM), even on a large machine.

Environment:
Deployment style: Dockerized
Host OS: Linux (KVM VM)
CPU: Intel Xeon Platinum 8375C, 128 vCPUs
RAM: ~495 GiB total, no swap
Target SoC: SA8295
Model: Qwen3-1.7B
ExecuTorch version: v1.1.0

Local modifications (static_llm_quant_recipe.py, lines 575-609):

  • Set default_quant_dtype = QuantDtype.use_16a8w
  • Updated the conv2d quantization to: QuantDtype.use_16a8w
  • Updated granularity=QuantGranularity.PER_CHANNEL (instead of per-block)

Command used (also tried --max_seq_len 128, 512, and 1024):

python examples/qualcomm/oss_scripts/llama/llama.py \
  -b build-android \
  -m SA8295 \
  --decoder_model qwen3-1_7b \
  --model_mode hybrid \
  --prefill_ar_len 128 \
  --max_seq_len 256 \
  --temperature 0 \
  --prompt "I would like to learn python, could you teach me with a simple example?" \
  --tasks wikitext \
  --limit 1 \
  --compile_only

My current hypothesis is that the num_sharding=1 failure is likely a platform limitation on SA8295 (DSP arch v68), given the explicit QNN graph serialization cap (2309576 KB required vs 2097152 KB limit).
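For scale, the numbers in the error message work out as follows (simple arithmetic on the logged values; nothing here queries the device):

```python
# Arithmetic on the values from the QNN finalization error above.
required_kb = 2309576   # estimated allocation reported by QNN
limit_kb = 2097152      # serialization cap reported by QNN

kib_per_gib = 1024 * 1024
print(limit_kb / kib_per_gib)                                # 2.0 (exactly 2 GiB)
print(round(required_kb / kib_per_gib, 2))                   # 2.2 GiB required
print(round((required_kb - limit_kb) / limit_kb * 100, 1))   # 10.1 % over the cap
```

So the cap is exactly 2 GiB and the single-shard graph overshoots it by only about 10%, which is why sharding looked like the natural next step.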

However, the num_sharding=2 behavior is confusing: instead of resolving the graph-size issue, it leads to host-side OOM even on a machine with ~495 GiB of RAM. I’m not sure whether this is expected compile-time memory behavior, a tooling inefficiency, or a potential memory leak. Other models with num_sharding > 1 show the same OOM behavior (e.g., Llama3.2-3B).
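To help distinguish expected compile-time growth from a runaway leak, one option is to run the conversion under a small stdlib-only wrapper that reports the child process tree's peak resident set size. This is a hypothetical helper (not part of ExecuTorch), sketched for Linux:

```python
# Hypothetical wrapper (not part of ExecuTorch): run the conversion command
# as a child process and report its peak resident set size, so peak
# compile-time memory can be compared across num_sharding / max_seq_len.
import resource
import subprocess
import sys

def run_with_peak_rss(cmd):
    """Run cmd to completion; return (exit_code, peak_rss_in_KiB).

    RUSAGE_CHILDREN reflects the maximum RSS over all waited-for child
    processes; on Linux ru_maxrss is reported in KiB.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g.: python peak_rss.py python examples/qualcomm/oss_scripts/llama/llama.py ...
    code, peak = run_with_peak_rss(sys.argv[1:])
    print(f"exit={code} peak_rss={peak / 1024:.1f} MiB", file=sys.stderr)
```

Running the llama.py command under such a wrapper for num_sharding=1 vs. 2 would show whether the sharded flow's peak memory merely doubles or grows far out of proportion, which would point toward a leak or tooling inefficiency.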

convert_qwen3-1_7b_SA8295_20260302_082259.txt
convert_qwen3-1_7b_SA8295_20260302_065355.txt

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

Labels: module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm), partner: qualcomm (For backend delegation, kernels, demos, etc. from the 3rd-party partner, Qualcomm)
