Conversation
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This pull request updates the DeepSeek user guide to include instructions for the new DeepSeek-V3.2 model, specifically focusing on indexer training and checkpoint conversion. The updates are timely and provide clear steps for users to leverage the latest sparse attention features.

🔍 General Feedback

- **Consistency:** Ensure that the model names (`deepseek3.2-671b`) and tokenizer paths (`deepseek-ai/DeepSeek-V3.2`) are consistent across all stages of the guide.
- **Syntax:** Be careful with trailing backslashes in shell command examples, as they can cause errors if users copy-paste the last line.
- **Clarity:** Using concrete example values (like `0.1` for scaling factors) is generally more user-friendly than placeholders in curly braces.
```diff
-* DeepSeek V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
+* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
+* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
```
Suggested change:
```diff
-* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
+* DeepSeek-V3.2 replaces vanilla attention (O(L^2) where L is number of tokens) with DeepSeek Sparse Attention (O(L * k) where k is some number of sparsely selected tokens).
```
```sh
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
```

Suggested change:
```diff
-run_name=matmul_pre_training \
+model_name=deepseek3.2-671b \
```
```sh
  ici_fsdp_parallelism=128 \
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
```

Suggested change:
```diff
-async_checkpointing=false \
+tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
```
```sh
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
  attention=flash \
  dtype=bfloat16 \
```

Suggested change:
```diff
-dtype=bfloat16 \
+indexer_loss_scaling_factor=0.1 \
```
```sh
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
  per_device_batch_size=4 \
```

Suggested change:
```diff
-per_device_batch_size=4 \
+model_name=deepseek3.2-671b \
```
```sh
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
```

Suggested change:
```diff
-tokenizer_type=huggingface \
+tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
```
```sh
  dtype=bfloat16 \
  weight_dtype=bfloat16 \
  megablox=False \
  sparse_matmul=False \
```

Suggested change:
```diff
-sparse_matmul=False \
+indexer_loss_scaling_factor=0.1 \
+trainable_parameters_mask=['.*indexer.*']
```
> * DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).

Instead of "replaces vanilla attention", it would be better to say "improves MLA attention". The complexity is still O(L^2), but the indexer is added on top of the MLA attention that DeepSeek uses from V3 onwards.

+1. Let's mention something like the wording below, and please feel free to modify it:

> DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces computational cost while preserving model performance in long-context scenarios.

Let's remove the complexity notation to avoid confusion, since the indexer still performs an O(L^2) selection. We could direct readers to the paper instead.
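To make the point above concrete, here is a toy NumPy-only sketch of indexer-guided sparse attention. This is a hypothetical illustration, not the MaxText or DeepSeek implementation: the indexer scores every (query, key) pair, so selection is still O(L^2), but the attention itself only attends to the top-k selected keys per query.

```python
import numpy as np

def sparse_attention_sketch(q, k, v, idx_scores, top_k):
    """Toy sketch of indexer-guided sparse attention (illustration only,
    NOT the MaxText/DeepSeek implementation). Selection over idx_scores
    is O(L^2), but attention only touches top_k keys per query."""
    L, d = q.shape
    out = np.zeros_like(v)
    for i in range(L):
        # Causal constraint: query i may only select among keys 0..i.
        k_eff = min(top_k, i + 1)
        sel = np.argsort(idx_scores[i, : i + 1])[-k_eff:]
        logits = q[i] @ k[sel].T / np.sqrt(d)   # only k_eff keys scored here
        w = np.exp(logits - logits.max())
        w /= w.sum()                            # softmax over selected keys
        out[i] = w @ v[sel]
    return out

# Toy usage with random inputs.
rng = np.random.default_rng(0)
L, d = 8, 4
q, k, v = rng.normal(size=(3, L, d))
idx_scores = rng.normal(size=(L, L))   # stand-in for lightning indexer scores
out = sparse_attention_sketch(q, k, v, idx_scores, top_k=3)
print(out.shape)  # (8, 4)
```

The sketch makes the review's point visible in code: dropping the attention complexity claim from the doc bullet is reasonable, because only the attention step is O(L·k) while the selection step remains quadratic.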
```sh
  dataset_type=synthetic
```

> ## Indexer training

Highlight that this is only for V3.2 Sparse Attention in the heading itself.
```sh
  sparse_matmul=False \
  dataset_type=synthetic \
  indexer_sparse_training=False \
  indexer_loss_scaling_factor={some non-zero value} \
```

Replace the placeholder with the default value from `base.yml`, and add a comment saying it can be replaced with a non-zero value.

Or we could put a small value, like `0.01`.
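For intuition about what this flag controls, here is a minimal sketch. The flag name comes from the commands above, but the exact loss formulation is an assumption for illustration, not taken from the MaxText source: an auxiliary indexer loss is scaled by the factor and added to the main language-model loss.

```python
def total_loss(lm_loss, indexer_loss, indexer_loss_scaling_factor=0.01):
    """Assumed formulation (illustrative only): the auxiliary indexer
    loss, scaled by a small factor, is added to the main LM loss.
    A factor of 0.0 effectively disables indexer training; 0.01 is the
    small example value floated in this review, not a tuned default."""
    return lm_loss + indexer_loss_scaling_factor * indexer_loss

print(total_loss(2.0, 4.0, 0.5))   # 4.0
print(total_loss(2.5, 4.0, 0.0))   # 2.5 (indexer loss contributes nothing)
```

This is why a concrete small value like `0.01` reads better in the guide than a `{some non-zero value}` placeholder: the factor directly weights how much the indexer objective pulls on training.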
```sh
  sparse_matmul=False \
  dataset_type=synthetic \
  indexer_sparse_training=True \
  indexer_loss_scaling_factor={some non-zero value} \
```

Same as the comment above.
```sh
  megablox=False \
  sparse_matmul=False \
```

We should probably have these set to `True` in the sparse training stage. These flags control which MoE strategy to use.
```sh
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
```

Is there a difference between the V3 and V3.2 tokenizer paths on Hugging Face? If not, then this is fine.

No difference, but let's update it to V3.2 to avoid confusion.
> ## Indexer training
> DeepSeek-V3.2 introduces deepseek sparse attention. Training the lightning indexer to achieve sparsity is a 2 stage process.
>
> 1. **Dense Warmup Stage**

Can you include a comment that in the dense warmup stage, all model weights are frozen except the indexer weights?
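The freezing behaviour requested above can be sketched with a regex-based mask, in the spirit of the `trainable_parameters_mask=['.*indexer.*']` flag from the commands in this guide. This is a hypothetical, self-contained illustration; the parameter names are made up and this is not MaxText's actual masking code.

```python
import re

def trainable_mask(param_names, patterns):
    """Regex-based trainable-parameters mask (illustrative sketch):
    a parameter is trainable only if its full name matches one of the
    patterns; every other parameter is frozen."""
    return {
        name: any(re.fullmatch(p, name) for p in patterns)
        for name in param_names
    }

# Hypothetical parameter names for a dense-warmup run.
params = [
    "decoder.layers.0.attention.indexer.kernel",
    "decoder.layers.0.attention.query.kernel",
    "decoder.layers.0.mlp.wi.kernel",
]
mask = trainable_mask(params, [".*indexer.*"])
for name, trainable in mask.items():
    print(name, "-> trainable" if trainable else "-> frozen")
```

With the `.*indexer.*` pattern, only the indexer weights come out trainable, which is exactly the "freeze everything except the indexer" warmup behaviour the comment asks the guide to call out.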
```sh
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
```

Let's use the V3.2 tokenizer path.
```sh
  attention=flash \
  dtype=bfloat16 \
  weight_dtype=bfloat16 \
  megablox=False \
```

Let's use `sparse_matmul=True` and `megablox=True`.
> * **Target Directory:** `LOCAL_WEIGHTS`
>
> ### 2. Dequantize Weights
> Convert the weights from FP8 to BF16 using the official DeepSeek script.

Could we also add a section on decoding for V3.2?
Updates the user guide for DeepSeek-V3.2: explains the new features and adds instructions for multi-stage lightning indexer training and checkpoint conversion.