Conversation
….5 support!)
- Removed deprecated native function `llama_adapter_lora_free` and the related managed method `LoraAdapter.Unload`
Great work on this, Martin! Thank you! I’ve done some testing on Windows with the …
So, it seems the GPU implementation is working fine, but there may be an issue with the CPU implementation or the layer partitioning logic.
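For reference, a minimal C# sketch of how the CPU/GPU layer split can be exercised from LLamaSharp. The model path and the specific layer counts are placeholders, not taken from this report; `ModelParams`, `GpuLayerCount`, `LLamaWeights.LoadFromFile` and `CreateContext` are the usual LLamaSharp APIs, but exact property names can differ between versions:

```csharp
using LLama;
using LLama.Common;

// Placeholder path -- substitute the model actually used for testing.
var modelPath = "model.gguf";

// 0 = everything on CPU, a mid value splits layers between CPU and GPU,
// 99 offloads all layers to the GPU.
foreach (var gpuLayers in new[] { 0, 14, 99 })
{
    var parameters = new ModelParams(modelPath)
    {
        ContextSize = 1024,
        GpuLayerCount = gpuLayers,
    };

    using var weights = LLamaWeights.LoadFromFile(parameters);
    using var context = weights.CreateContext(parameters);

    // Run the same short prompt for each configuration and compare outputs:
    // if the fully offloaded run is correct but the CPU-only or split runs
    // are not, that points at the CPU path or the partitioning logic.
}
```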
GPU RTX 3090, Qwen3.5 model seems to be working.

No, I found a problem! All the code works fine on 0.26.0 for Qwen3-Embedding-0.6B-F16.gguf.

Error: That's the problem, but then how do embedders work?

Full error log:

warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 1
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 0.58 MiB
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 1/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: reserving ...
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: max_nodes = 2488
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: reserving full memory module
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: resolving fused Gated Delta Net support:
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: sched_reserve: fused Gated Delta Net (autoregressive) enabled
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 16, n_seqs = 1, n_outputs = 1
D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

This is the working log for LLamaSharp 0.26.0:

warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_init_from_model: model default pooling_type is [3], but [1] was specified
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: constructing llama_context
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_seq_max = 64
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx = 1024
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ctx_seq = 1024
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_batch = 256
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: n_ubatch = 256
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: causal_attn = 1
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: flash_attn = enabled
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: kv_unified = true
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_base = 1000000.0
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: freq_scale = 1
warn: LLama.LLamaWeights[0] LLama.LLamaWeights: Warning: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: set_abort_callback: call
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host output buffer size = 37.28 MiB
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 0: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 1: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 2: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 3: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 4: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 5: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 6: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 7: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 8: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 9: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 10: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 11: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 12: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 13: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 14: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 15: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 16: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 17: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 18: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 19: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 20: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 21: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 22: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 23: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 24: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 25: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 26: dev = CUDA0
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_kv_cache: layer 27: dev = CUDA0
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 64/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: enumerating backends
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: backend_ptrs.size() = 2
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: max_nodes = 2488
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: reserving full memory module
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: llama_context: worst-case: n_tokens = 256, n_seqs = 64, n_outputs = 64
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 64, n_outputs = 64
dbug: LLama.LLamaWeights[0] LLama.LLamaWeights: Debug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA0 compute buffer size = 150.43 MiB
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: CUDA_Host compute buffer size = 2.07 MiB
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph nodes = 990
info: LLama.LLamaWeights[0] LLama.LLamaWeights: Information: llama_context: graph splits = 2
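For anyone trying to reproduce this, a minimal embedding sketch assuming the usual LLamaSharp `LLamaEmbedder` API and the context settings visible in the logs above (n_ctx = 1024, n_batch = n_ubatch = 256, mean pooling requested). The exact property names (`BatchSize`, `UBatchSize`, `PoolingType`) and the `GetEmbeddings` return type vary slightly between LLamaSharp versions, and some versions also require an `Embeddings = true` flag on the params, so treat this as a sketch rather than a verified repro:

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("Qwen3-Embedding-0.6B-F16.gguf")
{
    ContextSize = 1024,
    BatchSize = 256,
    UBatchSize = 256,
    GpuLayerCount = 99,                  // fully offload, matching the CUDA0-only log
    PoolingType = LLamaPoolingType.Mean, // the log shows pooling_type [1] being requested
};

using var weights = LLamaWeights.LoadFromFile(parameters);

// Works on 0.26.0; on the newer build the context construction hits
// GGML_ASSERT(ggml_can_mul_mat(a, b)) per the error log above.
using var embedder = new LLamaEmbedder(weights, parameters);
var embeddings = await embedder.GetEmbeddings("The quick brown fox jumps over the lazy dog");
```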