[quantization] Implement llama wrappers for decoding#528
┌──────────── Quantization Error Summary (Decode DecoderLayer / Random) ────────────
│ Mean |diff| (hidden): 0.037206
│ PEIR (hidden): 3.549647 %
│ Mean |diff| (new_k) : 0.003758
│ PEIR (new_k) : 1.324077 %
│ Mean |diff| (new_v) : 0.001980
│ PEIR (new_v) : 1.693285 %
└──────────────────────────────────────────────────────────────────────────────────
┌────────────────────────────────────────────┐
4.2┤ │
│ • │
│ •• │
2.8┤ ••• │
│ ••• │
│ •••• │
│ •••• │
1.5┤ •••• │
│ •••• │
│ •••• │
0.2┤ •••• │
│ •••• │
│ •••• │
│ •••• │
-1.1┤ •••• │
│ •••• │
│ ••• │
-2.5┤ •••• │
│ ••• │
│ ••• │
│ • │
-3.8┤ │
└┬──────────┬──────────┬─────────┬──────────┬┘
  -3.8       -1.8       0.2       2.2       4.2
    ):
        super().__init__(qcfg)
        wrapped_cls = lookup(type(module))
        variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"

Suggested change:
-        variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else "prefill"
+        variant = getattr(qcfg, "wrapper_variant", "prefill") if qcfg else None
A quantized nn module may not belong to either the prefill or decode stage.
I designed it to use "prefill" as the default value. Do you think an exact default value is needed?
Resolution Order
----------------
1. Exact variant match
2. "prefill" fallback
3. Any registered variant (last-resort compatibility)
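The three-step resolution order above can be sketched as a small registry helper. Note that the names (`register`, `lookup`, `_REGISTRY`) and the dict layout are illustrative assumptions, not TICO's actual implementation:

```python
# Hypothetical sketch of the three-step wrapper resolution order;
# the real registry and lookup in TICO may differ.
_REGISTRY = {}  # fp_cls -> {variant_name: wrapper_cls}


def register(fp_cls, wrapper_cls, variant="prefill"):
    _REGISTRY.setdefault(fp_cls, {})[variant] = wrapper_cls


def lookup(fp_cls, variant="prefill"):
    variants = _REGISTRY.get(fp_cls)
    if variants is None:
        raise KeyError(f"no wrapper registered for {fp_cls.__name__}")
    # 1. Exact variant match
    if variant in variants:
        return variants[variant]
    # 2. "prefill" fallback
    if "prefill" in variants:
        return variants["prefill"]
    # 3. Any registered variant (last-resort compatibility)
    return next(iter(variants.values()))
```

With this scheme, requesting variant="decode" for a module that only registered a "prefill" wrapper still resolves, which is what lets variant-agnostic modules be reused in a decode graph.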
Previously, lookup(fp_cls) was used without any notion of variants. Even after introducing variants, most wrappers are likely to remain variant-agnostic, meaning only a single implementation exists, as you said.
In this situation, requesting variant="decode" does not necessarily mean that every module must provide a dedicated decode-specific wrapper. Many modules (e.g., Linear, LayerNorm) behave identically in prefill and decode execution and therefore do not require separate implementations.
With fallback enabled, these variant-agnostic modules can be safely reused while still constructing a decode-specialized graph. This lets decode graph construction proceed without forcing unnecessary wrapper duplication for modules whose behavior does not actually differ across variants.
Thanks for the detailed explanation!
As you mentioned, most wrappers are unaffected by the 'variant' field.
So I assumed the default value should be None for most cases, and decode/prefill could be selected when they are relevant.
Using a technical term, this is a leaky abstraction. Let's make the variant 'opt-in': let modules have 'None' or 'invariant' if they don't belong to either 'prefill' or 'decode'.
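The opt-in scheme proposed here could look like the following sketch, where None marks a variant-invariant wrapper and lookup falls back to it only when no variant-specific wrapper exists. All names here are hypothetical, not TICO's actual API:

```python
# Sketch of an "opt-in" variant registry (names are illustrative).
_REGISTRY = {}  # fp_cls -> {variant_or_None: wrapper_cls}


def register(fp_cls, wrapper_cls, variant=None):
    # variant=None means "invariant": valid for both prefill and decode.
    _REGISTRY.setdefault(fp_cls, {})[variant] = wrapper_cls


def lookup(fp_cls, variant=None):
    variants = _REGISTRY[fp_cls]
    # Prefer an exact (opted-in) match, then fall back to the invariant entry.
    if variant in variants:
        return variants[variant]
    if None in variants:
        return variants[None]
    raise KeyError(f"{fp_cls.__name__} has no wrapper for variant={variant!r}")
```

The difference from the "prefill"-as-default design is that a module which registered only a "decode" wrapper no longer silently satisfies a "prefill" request; variance must be opted into explicitly.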
python tico/quantization/wrapq/examples/llama/quantize_attn_decode.py
┌───────────── Quantization Error Summary (Decode Attn / Random) ─────────────
│ Mean |diff|: 0.087903
│ PEIR : 10.583212 %
└────────────────────────────────────────────────────────────────────────────
┌───────────────────────────────────────────┐
2.13┤ │
│ • │
│ • │
1.21┤ ••• • │
│ •••••• │
│ •••••• │
│ •••••••• │
0.30┤ ••••••• │
│ ••••••• │
│ • ••••••• │
-0.62┤ ••••••• │
│ ••••• │
│ ••••• │
│ ••• │
-1.53┤ ••• │
│ │
│ │
-2.45┤ │
│ │
│ • │
│ │
-3.36┤ │
└┬──────────┬─────────┬──────────┬─────────┬┘
-3.4 -2.0 -0.6 0.8 2.1
Quantized Circle model saved to /home/seongwoo/TICO/attn_decode.q.circle
This commit implements llama wrappers for decoding.

TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
@dayo09 I've introduced a "common" variant. PTAL:)