
[quantization] Implement llama wrappers for decoding#528

Merged
dayo09 merged 1 commit into Samsung:main from mhs4670go:decode
Mar 5, 2026
Conversation


@mhs4670go mhs4670go commented Feb 27, 2026

This commit implements llama wrappers for decoding.

TICO-DCO-1.0-Signed-off-by: seongwoo mhs4670go@naver.com

@mhs4670go
Contributor Author

┌──────────── Quantization Error Summary (Decode DecoderLayer / Random) ────────────
│ Mean |diff| (hidden): 0.037206
│ PEIR        (hidden): 3.549647 %
│ Mean |diff| (new_k) : 0.003758
│ PEIR        (new_k) : 1.324077 %
│ Mean |diff| (new_v) : 0.001980
│ PEIR        (new_v) : 1.693285 %
└──────────────────────────────────────────────────────────────────────────────────
    ┌────────────────────────────────────────────┐
 4.2┤                                            │
    │                                         •  │
    │                                      ••    │
 2.8┤                                    •••     │
    │                                  •••       │
    │                               ••••         │
    │                             ••••           │
 1.5┤                           ••••             │
    │                         ••••               │
    │                       ••••                 │
 0.2┤                     ••••                   │
    │                   ••••                     │
    │                 ••••                       │
    │               ••••                         │
-1.1┤             ••••                           │
    │           ••••                             │
    │         •••                                │
-2.5┤       ••••                                 │
    │     •••                                    │
    │   •••                                      │
    │  •                                         │
-3.8┤                                            │
    └┬──────────┬──────────┬─────────┬──────────┬┘
   -3.8       -1.8        0.2       2.2       4.2

):
    super().__init__(qcfg)
    wrapped_cls = lookup(type(module))
    variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"
Contributor


Suggested change
variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"
variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else None

A quantized nn module may not belong to either the prefill or the decode stage.

Contributor Author


I designed it to use "prefill" as the default value. Do you think an exact default value is needed?

Resolution Order
----------------
1. Exact variant match
2. "prefill" fallback
3. Any registered variant (last-resort compatibility)
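The resolution order above can be sketched as a small variant-aware registry. This is only an illustrative sketch; the names `WRAPPER_REGISTRY`, `register`, and `lookup` are hypothetical and not the actual TICO API:

```python
# Hypothetical sketch of the three-step resolution order; names are
# illustrative, not the actual TICO implementation.
WRAPPER_REGISTRY: dict = {}  # maps (fp_cls, variant) -> wrapper class


def register(fp_cls, variant="prefill"):
    """Decorator registering a wrapper class for (fp_cls, variant)."""
    def deco(wrapper_cls):
        WRAPPER_REGISTRY[(fp_cls, variant)] = wrapper_cls
        return wrapper_cls
    return deco


def lookup(fp_cls, variant="prefill"):
    # 1. Exact variant match
    if (fp_cls, variant) in WRAPPER_REGISTRY:
        return WRAPPER_REGISTRY[(fp_cls, variant)]
    # 2. "prefill" fallback for variant-agnostic wrappers
    if (fp_cls, "prefill") in WRAPPER_REGISTRY:
        return WRAPPER_REGISTRY[(fp_cls, "prefill")]
    # 3. Any registered variant (last-resort compatibility)
    for (cls, _), wrapper in WRAPPER_REGISTRY.items():
        if cls is fp_cls:
            return wrapper
    raise KeyError(f"No wrapper registered for {fp_cls!r}")
```

With this shape, requesting `variant="decode"` for a module that only registered a "prefill" wrapper silently reuses that wrapper, which is the fallback behavior being debated in this thread.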

Contributor Author


Previously, lookup(fp_cls) was used without any notion of variants. Even after introducing variants, most wrappers will likely remain variant-agnostic, meaning only a single implementation exists, as you said.

In this situation, requesting variant="decode" does not necessarily mean that every module must provide a dedicated decode-specific wrapper. Many modules (e.g., Linear, LayerNorm) have identical behavior between prefill and decode execution and therefore do not require separate implementations.

With fallback enabled, these variant-agnostic modules can be safely reused while still constructing a decode-specialized graph. This allows decode graph construction to proceed without forcing unnecessary wrapper duplication for modules whose behavior does not actually differ across variants.

Contributor


Thanks for the detailed explanation!

As you mentioned, most of them are irrelevant to the 'variant' field.
So I assumed the default value should be None for most cases, and decode/prefill could be selected when they are relevant.

Contributor


In technical terms, it's a leaky abstraction. Let's make variance 'opt-in': let modules have 'None' or 'invariant' if they don't belong to either 'prefill' or 'decode'.
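The 'opt-in' variance proposed here can be sketched minimally: variant-agnostic wrappers register under None, and only stage-aware wrappers name a stage. The registry and all names below are hypothetical illustrations, not the actual TICO code:

```python
# Hypothetical sketch of opt-in variance: wrappers that behave identically
# across stages register with variant=None; only stage-aware wrappers opt in.
REGISTRY = {}


def register(fp_name, variant=None, wrapper=None):
    REGISTRY[(fp_name, variant)] = wrapper


register("Linear", wrapper="QuantLinear")                  # variant-agnostic
register("LlamaAttention", "prefill", "QuantAttnPrefill")  # stage-specific
register("LlamaAttention", "decode", "QuantAttnDecode")


def resolve(fp_name, variant=None):
    # Exact stage match first, then the invariant (None) entry.
    if (fp_name, variant) in REGISTRY:
        return REGISTRY[(fp_name, variant)]
    return REGISTRY[(fp_name, None)]
```

Compared with defaulting everything to "prefill", this keeps the stage notion out of modules that genuinely don't have one, while still letting a decode graph reuse them.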

@mhs4670go
Contributor Author

python tico/quantization/wrapq/examples/llama/quantize_attn_decode.py
┌───────────── Quantization Error Summary (Decode Attn / Random) ─────────────
│ Mean |diff|: 0.087903
│ PEIR       : 10.583212 %
└────────────────────────────────────────────────────────────────────────────
     ┌───────────────────────────────────────────┐
 2.13┤                                           │
     │                                        •  │
     │                                     •     │
 1.21┤                                  ••• •    │
     │                               ••••••      │
     │                             ••••••        │
     │                          ••••••••         │
 0.30┤                         •••••••           │
     │                       •••••••             │
     │                    • •••••••              │
-0.62┤                   •••••••                 │
     │                   •••••                   │
     │                 •••••                     │
     │                •••                        │
-1.53┤              •••                          │
     │                                           │
     │                                           │
-2.45┤                                           │
     │                                           │
     │  •                                        │
     │                                           │
-3.36┤                                           │
     └┬──────────┬─────────┬──────────┬─────────┬┘
    -3.4       -2.0      -0.6        0.8      2.1 

Quantized Circle model saved to /home/seongwoo/TICO/attn_decode.q.circle

@mhs4670go mhs4670go requested a review from dayo09 February 27, 2026 09:29
@mhs4670go mhs4670go changed the title [DRAFT] Implement llama wrappers for decoding [quantization] Implement llama wrappers for decoding Mar 4, 2026
@mhs4670go mhs4670go removed the DRAFT label Mar 4, 2026
@mhs4670go mhs4670go marked this pull request as ready for review March 4, 2026 11:09
This commit implements llama wrappers for decoding.

TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
@mhs4670go
Contributor Author

@dayo09 I've introduced "common" variant. PTAL:)


@dayo09 dayo09 left a comment


LGTM Thanks!

@dayo09 dayo09 merged commit ad6de0f into Samsung:main Mar 5, 2026
7 checks passed
@mhs4670go mhs4670go deleted the decode branch March 5, 2026 06:47