[quantization][DRAFT] Disk space consumption improvements for full model quantization #495
stamalakhov wants to merge 1 commit into Samsung:main
Conversation
@mhs4670go
```python
causal_mask = self.layers[0].wrapped.get_attention_mask_for(hidden_states)
causal_mask = self._fq(causal_mask, self.obs_causal_mask)
position_embeddings = self.layers[0].wrapped.get_position_embeddings_for(
```
The current wrappers for a decoder layer and an attention create their own masks because they have self-contained attributes.
How about just creating its own mask and embeddings instead of using the first layer's? It needs some duplicated code, but it removes the dependency on the first layer.
@mhs4670go
Ok. I'll fix it.
Force-pushed from 19aca2b to 5442616.
```python
) -> Union[Tuple, CausalLMOutputWithPast]:
    # fixed input size, due to position_ids fixed
    orig_len = input_ids.shape[-1]
    input_ids = fix_inputs(self, self.tokenizer, input_ids)
```
Now, `fix_inputs` can be removed.
```python
# to prevent introduction of attention_mask as a parameter, let's use the preset attention_mask
L = hidden_states.size(1)
attention_mask = self._slice_causal(L, hidden_states.device)
if attention_mask is None or attention_mask.dtype == torch.bool:
```
Why does this condition appear again?
@mhs4670go
Ahhh. Sorry.
- It was recently removed from `quant_decoder_layer.py` to have a fully quantized model (the `causal_mask` received from `modeling_llama.py` of `transformers` was float, so to get a fully integer model, line 206 was removed).
- This draft uses the quantized `causal_mask` from `quant_model.py`, so the check can be restored; that gives a chance to convert a `decoder_layer` like this, `tico.convert(layer, (inp,))`, without `causal_mask` in the parameters.
- If it's left as it is (no check), all decoder layers will be populated with their own `attention_mask`s, which consumes disk space.
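For illustration, here is a minimal sketch of the fallback the restored check enables (the helper name, shapes, and mask-building details are mine, not the PR's code): when no mask, or only a boolean mask, is supplied, the layer builds the additive float causal mask itself; a pre-quantized float mask is used as-is, so layers don't each store a copy.

```python
import torch

def prepare_attention_mask(attention_mask, L, device, dtype=torch.float32):
    """Hypothetical helper: build the additive float causal mask only when
    the caller did not pass a usable (already float, possibly fake-quantized)
    mask; otherwise pass the shared mask through untouched."""
    if attention_mask is None or attention_mask.dtype == torch.bool:
        # lower-triangular causal mask: 0 where attention is allowed,
        # a large negative value where it is masked out
        full = torch.full((L, L), torch.finfo(dtype).min, device=device)
        causal = torch.triu(full, diagonal=1)
        return causal.unsqueeze(0).unsqueeze(0)  # shape (1, 1, L, L)
    return attention_mask  # shared pre-quantized mask, reused as-is

mask = prepare_attention_mask(None, L=4, device="cpu")
print(mask.shape)  # torch.Size([1, 1, 4, 4])
```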
```python
self._fq(cos.unsqueeze(1), self.obs_cos),
self._fq(sin.unsqueeze(1), self.obs_sin),
```
Is this change for constant folding of unsqueeze?
@mhs4670go
@mhs4670go
Ahh. Actually these lines were introduced to match `quant_model.py` and keep the code consistent. It turned out that the transforms done inside `quant_attn` produced additional constants during tracing. Originally these lines were located at lines 164-165 of `quant_attn.py` (you can find them in this draft):

```python
cos_u = cos.unsqueeze(unsqueeze_dim)
sin_u = sin.unsqueeze(unsqueeze_dim)
```

So I gathered all of the transforms and applied them once to prevent them from multiplying.
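A minimal PyTorch sketch of "apply the transforms once" (the scale and zero-point values are made up, and `fq` is a stand-in for `self._fq`, not TICO's actual implementation): the unsqueeze and fake-quantization happen once at the model level, so tracing each decoder layer does not materialize its own unsqueezed constant.

```python
import torch

seq_len, head_dim = 8, 16
cos = torch.randn(seq_len, head_dim)
sin = torch.randn(seq_len, head_dim)

def fq(t):
    # stand-in for self._fq(...): per-tensor affine fake quantization
    return torch.fake_quantize_per_tensor_affine(
        t, scale=0.01, zero_point=0, quant_min=-128, quant_max=127
    )

# all transforms gathered and applied once, outside the per-layer forward:
cos_q = fq(cos.unsqueeze(1))  # (seq_len, 1, head_dim), broadcastable over heads
sin_q = fq(sin.unsqueeze(1))
```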
@mhs4670go
The latest version of `quant_attn.py` doesn't use `unsqueeze_dim`, so it can be removed from here as well.
```python
if hasattr(cur_layer, "copy_quantizers"):
    cur_layer.copy_quantizers(q_m.wrapped.model.wrapped)
```
Umm.. how about introducing an API that copies observers only when it's really needed? Having wrappers carry a `copy_quantizers` method doesn't seem proper. We can export just the full qmodel here in this script, instead of all the layers.
The script came in because we need static buffers to be shared, so exporting a single decoder can be done in other scripts.
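The static-buffer sharing being discussed could be sketched like this (the class names and structure are hypothetical, not TICO's actual wrappers): the top-level model registers the precomputed mask once, and each layer reads it from the owner, so only one copy lands in the serialized state dict.

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    """Hypothetical decoder-layer stand-in that borrows the owner's mask."""
    def __init__(self, owner):
        super().__init__()
        self._owner = [owner]  # plain list: avoid registering a module cycle
    def forward(self, x):
        L = x.size(1)
        mask = self._owner[0].causal_mask[..., :L, :L]  # shared, not copied
        return x + mask.sum() * 0  # placeholder use of the shared mask

class Model(nn.Module):
    def __init__(self, n_layers=2, max_len=8):
        super().__init__()
        m = torch.triu(torch.full((max_len, max_len), -1e9), diagonal=1)
        self.register_buffer("causal_mask", m)  # serialized exactly once
        self.layers = nn.ModuleList(Layer(self) for _ in range(n_layers))
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = Model()
# only one mask appears in the state dict, regardless of layer count:
print([k for k in model.state_dict() if "mask" in k])  # ['causal_mask']
```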
@mhs4670go
Sure. It was just an attempt to keep all the layers as they are in the fully quantized model. I'll remove it. Thank you!
Force-pushed from 003d8f4 to 5ee8d8c.
Force-pushed from 28a32e2 to 6ac97c0.
…el quantization

This PR quantizes the full `LLama` model and converts it to circle format.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
Force-pushed from 6ac97c0 to 6a2e7fb.
It's now merged, so we can close this one.
This PR fixes the population of static `causal_mask` / `position_embeddings` through the layers to save disk space.

It precomputes the static `causal_mask` / `position_embeddings` for use in `llama/quant_decoder_layer`, so that every quantized decoder layer is not populated with its own copy of these statically computed parameters.

With this PR, the `circle` model for `HuggingFaceTB/SmolLM2-135M-Instruct` is just 105 MiB (vs 300 MiB in #492).

Draft: #436

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>