
DeepSeek V3.2 user guide update #3565

Open · snehalv2002 wants to merge 1 commit into main from ds3.2-xlml-tests

Conversation

@snehalv2002 (Collaborator)

Updates the user guide for DeepSeek-V3.2: explains the new feature updates and revises the instructions for multi-stage lightning indexer training and checkpoint conversion.

github-actions bot commented Apr 3, 2026

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions bot left a comment

## 📋 Review Summary

This pull request updates the DeepSeek user guide to include instructions for the new DeepSeek-V3.2 model, specifically focusing on indexer training and checkpoint conversion. The updates are timely and provide clear steps for users to leverage the latest sparse attention features.

## 🔍 General Feedback

  • Consistency: Ensure that the model names (deepseek3.2-671b) and tokenizer paths (deepseek-ai/DeepSeek-V3.2) are consistent across all stages of the guide.
  • Syntax: Be careful with trailing backslashes in shell command examples, as they can cause errors if users copy-paste the last line.
  • Clarity: Using concrete example values (like 0.1 for scaling factors) is generally more user-friendly than placeholders in curly braces.
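On the trailing-backslash point, a minimal sketch of the convention (using `echo` as a stand-in for the training command; the flag values are illustrative):

```shell
# Multi-line command: every line except the last ends with a backslash.
# If the last line also ended with "\", the shell would fold whatever the
# user types or pastes next into this same command.
echo \
  run_name=matmul_pre_training \
  steps=5
```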

* DeepSeek V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.

* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).

🟢 Use standard Big O notation with parentheses instead of square brackets.
Suggested change
* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
* DeepSeek-V3.2 replaces vanilla attention (O(L^2) where L is number of tokens) with DeepSeek Sparse Attention (O(L * k) where k is some number of sparsely selected tokens).

```sh
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
run_name=matmul_pre_training \

🟠 Use the `deepseek3.2-671b` model config, as it contains the necessary indexer configuration for this training.
Suggested change
run_name=matmul_pre_training \
model_name=deepseek3.2-671b \

ici_fsdp_parallelism=128 \
steps=5 \
max_target_length=1024 \
async_checkpointing=false \

🟡 Use the correct tokenizer path for DeepSeek-V3.2.
Suggested change
async_checkpointing=false \
tokenizer_path=deepseek-ai/DeepSeek-V3.2 \

tokenizer_type=huggingface \
tokenizer_path=deepseek-ai/DeepSeek-V3 \
attention=flash \
dtype=bfloat16 \

🟡 Provide a concrete example value (like 0.1) instead of a placeholder in curly braces, as placeholders can be confusing in documentation examples.
Suggested change
dtype=bfloat16 \
indexer_loss_scaling_factor=0.1 \

python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
run_name=matmul_pre_training \
per_device_batch_size=4 \

🟠 Use the `deepseek3.2-671b` model config for the sparse training stage as well.
Suggested change
per_device_batch_size=4 \
model_name=deepseek3.2-671b \

steps=5 \
max_target_length=1024 \
async_checkpointing=false \
tokenizer_type=huggingface \

🟡 Same as above, use the V3.2 tokenizer.
Suggested change
tokenizer_type=huggingface \
tokenizer_path=deepseek-ai/DeepSeek-V3.2 \

dtype=bfloat16 \
weight_dtype=bfloat16 \
megablox=False \
sparse_matmul=False \

🔴 The command should not end with a trailing backslash if it is the last line. Additionally, for indexer-only training, the `trainable_parameters_mask` should be present in both stages to isolate the indexer.
Suggested change
sparse_matmul=False \
indexer_loss_scaling_factor=0.1 \
trainable_parameters_mask=['.*indexer.*']
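As an aside on what a mask like `['.*indexer.*']` selects: below is a minimal sketch of regex-based parameter selection (the parameter names and matching logic here are assumptions for illustration, not MaxText's actual implementation):

```python
import re

# Hypothetical parameter names; real MaxText names will differ.
params = [
    "decoder.layers_0.attention.indexer.wq",
    "decoder.layers_0.attention.query_proj",
    "decoder.layers_0.mlp.wi",
]

mask_patterns = [".*indexer.*"]

def is_trainable(name, patterns):
    """A parameter stays trainable if any mask pattern matches its full name."""
    return any(re.fullmatch(p, name) for p in patterns)

trainable = [p for p in params if is_trainable(p, mask_patterns)]
frozen = [p for p in params if not is_trainable(p, mask_patterns)]

print(trainable)  # only the indexer weight is selected; the rest stay frozen
```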

* DeepSeek V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.

* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
Collaborator:

Instead of "replace vanilla attention", it would be better to say "improves MLA attention". The complexity is still O(L^2), but the indexer is added on top of the MLA attention that DeepSeek uses from V3 onwards.

Collaborator:

+1 Let's mention something similar to the below, and please feel free to modify:

DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces computational complexity while preserving model performance in long-context scenarios.

Let's remove the complexity notation to avoid any confusion, as the indexer also has L^2 for selection. We could direct readers to the paper.

dataset_type=synthetic
```

## Indexer training
Collaborator:

Highlight in the heading itself that this is only for V3.2 Sparse Attention.

sparse_matmul=False \
dataset_type=synthetic \
indexer_sparse_training=False \
indexer_loss_scaling_factor={some non-zero value} \
Collaborator:

Replace this with the default value from base.yml, and add a comment saying it can be replaced with a non-zero value.

Collaborator:

Or we could put a small value, like 0.01

sparse_matmul=False \
dataset_type=synthetic \
indexer_sparse_training=True \
indexer_loss_scaling_factor={some non-zero value} \
Collaborator:

Same as comment above

Comment on lines +103 to +104
megablox=False \
sparse_matmul=False \
Collaborator:

We should probably have this set to True in the sparse training stage. These flags control which MoE strategy to use.

max_target_length=1024 \
async_checkpointing=false \
tokenizer_type=huggingface \
tokenizer_path=deepseek-ai/DeepSeek-V3 \
Collaborator:

Is there a difference in V3 vs V3.2 tokenizer path in HF? If not then this is fine.

Collaborator:

No difference, but let's update to v3.2 to avoid confusion

## Indexer training
DeepSeek-V3.2 introduces DeepSeek Sparse Attention. Training the lightning indexer to achieve sparsity is a two-stage process.

1. **Dense Warmup Stage**
Collaborator:

Can you include a comment that in the dense warmup stage, all model weights are frozen except the indexer weights?

@RissyRan (Collaborator) left a comment

Thanks for your 1st PR!!!

One more thing: could you update the PR description to follow our default template? One example: here



max_target_length=1024 \
async_checkpointing=false \
tokenizer_type=huggingface \
tokenizer_path=deepseek-ai/DeepSeek-V3 \
Collaborator:

let's use v3.2 tokenizer path

attention=flash \
dtype=bfloat16 \
weight_dtype=bfloat16 \
megablox=False \
Collaborator:

Let's use sparse_matmul=True and megablox=True


* **Target Directory:** `LOCAL_WEIGHTS`

### 2. Dequantize Weights
Convert the weights from FP8 to BF16 using the official DeepSeek script.
Collaborator:

@shuningjin could you help check this part?

Contributor:

Could we also add a section on decoding for v3.2?
