
[quantization] Implement llama wrappers for decoding#528

Merged
dayo09 merged 1 commit into Samsung:main from mhs4670go:decode
Mar 5, 2026
Conversation


@mhs4670go mhs4670go commented Feb 27, 2026

This commit implements llama wrappers for decoding.

TICO-DCO-1.0-Signed-off-by: seongwoo mhs4670go@naver.com

@mhs4670go
Contributor Author

┌──────────── Quantization Error Summary (Decode DecoderLayer / Random) ────────────
│ Mean |diff| (hidden): 0.037206
│ PEIR        (hidden): 3.549647 %
│ Mean |diff| (new_k) : 0.003758
│ PEIR        (new_k) : 1.324077 %
│ Mean |diff| (new_v) : 0.001980
│ PEIR        (new_v) : 1.693285 %
└──────────────────────────────────────────────────────────────────────────────────
    ┌────────────────────────────────────────────┐
 4.2┤                                            │
    │                                         •  │
    │                                      ••    │
 2.8┤                                    •••     │
    │                                  •••       │
    │                               ••••         │
    │                             ••••           │
 1.5┤                           ••••             │
    │                         ••••               │
    │                       ••••                 │
 0.2┤                     ••••                   │
    │                   ••••                     │
    │                 ••••                       │
    │               ••••                         │
-1.1┤             ••••                           │
    │           ••••                             │
    │         •••                                │
-2.5┤       ••••                                 │
    │     •••                                    │
    │   •••                                      │
    │  •                                         │
-3.8┤                                            │
    └┬──────────┬──────────┬─────────┬──────────┬┘
   -3.8       -1.8        0.2       2.2       4.2

):
    super().__init__(qcfg)
    wrapped_cls = lookup(type(module))
    variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"
Contributor


Suggested change
variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"
variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else None

A quantized nn module may not belong to either the prefill or the decode stage.

Contributor Author


I designed it to use "prefill" as the default value. Do you think an exact default value is needed?

Resolution Order
----------------
1. Exact variant match
2. "prefill" fallback
3. Any registered variant (last-resort compatibility)
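The resolution order above can be sketched as a small variant-aware registry. This is only an illustrative sketch; the names `WRAPPER_REGISTRY`, `register`, and `lookup` are hypothetical and not the actual TICO API:

```python
# Hypothetical sketch of the three-step resolution order; names are
# illustrative, not the actual TICO implementation.
WRAPPER_REGISTRY: dict = {}  # maps (fp_cls, variant) -> wrapper class


def register(fp_cls, variant="prefill"):
    """Decorator registering a wrapper class for (fp_cls, variant)."""
    def deco(wrapper_cls):
        WRAPPER_REGISTRY[(fp_cls, variant)] = wrapper_cls
        return wrapper_cls
    return deco


def lookup(fp_cls, variant="prefill"):
    # 1. Exact variant match
    if (fp_cls, variant) in WRAPPER_REGISTRY:
        return WRAPPER_REGISTRY[(fp_cls, variant)]
    # 2. "prefill" fallback for variant-agnostic wrappers
    if (fp_cls, "prefill") in WRAPPER_REGISTRY:
        return WRAPPER_REGISTRY[(fp_cls, "prefill")]
    # 3. Any registered variant (last-resort compatibility)
    for (cls, _), wrapper in WRAPPER_REGISTRY.items():
        if cls is fp_cls:
            return wrapper
    raise KeyError(f"No wrapper registered for {fp_cls!r}")
```

With this shape, requesting `variant="decode"` for a module that only registered a "prefill" wrapper silently reuses that wrapper, which is the fallback behavior being debated in this thread.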

Contributor Author


Previously, lookup(fp_cls) was used without any notion of variants. Even after introducing variants, most wrappers will likely remain variant-agnostic, meaning only a single implementation exists, as you said.

In this situation, requesting variant="decode" does not necessarily mean that every module must provide a dedicated decode-specific wrapper. Many modules (e.g., Linear, LayerNorm) have identical behavior between prefill and decode execution and therefore do not require separate implementations.

With fallback enabled, these variant-agnostic modules can be safely reused while still constructing a decode-specialized graph. This allows decode graph construction to proceed without forcing unnecessary wrapper duplication for modules whose behavior does not actually differ across variants.

Contributor


Thanks for the detailed explanation!

As you mentioned, most of them are irrelevant to the 'variant' field.
So I assumed the default value should be None for most cases, and decode/prefill could be selected when they are relevant.

Contributor


In technical terms, it's a leaky abstraction. Let's make variance 'opt-in': let modules have 'None' or 'invariant' if they don't belong to either 'prefill' or 'decode'.
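The 'opt-in' variance proposed here can be sketched minimally: variant-agnostic wrappers register under None, and only stage-aware wrappers name a stage. The registry and all names below are hypothetical illustrations, not the actual TICO code:

```python
# Hypothetical sketch of opt-in variance: wrappers that behave identically
# across stages register with variant=None; only stage-aware wrappers opt in.
REGISTRY = {}


def register(fp_name, variant=None, wrapper=None):
    REGISTRY[(fp_name, variant)] = wrapper


register("Linear", wrapper="QuantLinear")                  # variant-agnostic
register("LlamaAttention", "prefill", "QuantAttnPrefill")  # stage-specific
register("LlamaAttention", "decode", "QuantAttnDecode")


def resolve(fp_name, variant=None):
    # Exact stage match first, then the invariant (None) entry.
    if (fp_name, variant) in REGISTRY:
        return REGISTRY[(fp_name, variant)]
    return REGISTRY[(fp_name, None)]
```

Compared with defaulting everything to "prefill", this keeps the stage notion out of modules that genuinely don't have one, while still letting a decode graph reuse them.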

@mhs4670go
Contributor Author

python tico/quantization/wrapq/examples/llama/quantize_attn_decode.py
┌───────────── Quantization Error Summary (Decode Attn / Random) ─────────────
│ Mean |diff|: 0.087903
│ PEIR       : 10.583212 %
└────────────────────────────────────────────────────────────────────────────
     ┌───────────────────────────────────────────┐
 2.13┤                                           │
     │                                        •  │
     │                                     •     │
 1.21┤                                  ••• •    │
     │                               ••••••      │
     │                             ••••••        │
     │                          ••••••••         │
 0.30┤                         •••••••           │
     │                       •••••••             │
     │                    • •••••••              │
-0.62┤                   •••••••                 │
     │                   •••••                   │
     │                 •••••                     │
     │                •••                        │
-1.53┤              •••                          │
     │                                           │
     │                                           │
-2.45┤                                           │
     │                                           │
     │  •                                        │
     │                                           │
-3.36┤                                           │
     └┬──────────┬─────────┬──────────┬─────────┬┘
    -3.4       -2.0      -0.6        0.8      2.1 

Quantized Circle model saved to /home/seongwoo/TICO/attn_decode.q.circle

@mhs4670go mhs4670go requested a review from dayo09 February 27, 2026 09:29
@mhs4670go mhs4670go changed the title [DRAFT] Implement llama wrappers for decoding [quantization] Implement llama wrappers for decoding Mar 4, 2026
@mhs4670go mhs4670go removed the DRAFT label Mar 4, 2026
@mhs4670go mhs4670go marked this pull request as ready for review March 4, 2026 11:09
This commit implements llama wrappers for decoding.

TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
@mhs4670go
Contributor Author

@dayo09 I've introduced "common" variant. PTAL:)


@dayo09 dayo09 left a comment


LGTM Thanks!

@dayo09 dayo09 merged commit ad6de0f into Samsung:main Mar 5, 2026
7 checks passed
@mhs4670go mhs4670go deleted the decode branch March 5, 2026 06:47