Fault-tolerant top-k teacher logit saving by ajkv-google · Pull Request #3555 · AI-Hypercomputer/maxtext

ajkv-google · 2026-04-02T19:16:59Z

Description

This PR makes saving top-k teacher logits fault-tolerant. Previously, the top-k teacher logits were written to a file (either local or GCS) only after the teacher model had completed the configured number of steps. If the job crashed or the script was otherwise terminated abruptly, the save had to be re-run from scratch. With this PR, the data is written in chunks to a folder (local or GCS). You now need to specify the command-line argument --steps_per_file, which sets how many steps of logits go into each chunk file. For example, saving top-k teacher logits for 100 steps with --steps_per_file=10 creates 10 chunk files. If the program crashes, the code looks at the output directory on the next run, checks how many chunk files were written, and resumes saving the top-k teacher logits from where it left off. This allows fault-tolerant data collection, which is especially useful for long-running experiments.
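The chunk-and-resume idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `chunk_*.pkl` file naming, the `count_completed_steps` and `save_in_chunks` helpers, the `step_outputs` callable, and the use of `pickle` on a local directory are all assumptions made for the example.

```python
import pickle
from pathlib import Path


def count_completed_steps(output_dir, steps_per_file):
  """Infers the resume step from the number of chunk files already on disk."""
  num_chunks = len(list(Path(output_dir).glob("chunk_*.pkl")))
  return num_chunks * steps_per_file


def save_in_chunks(step_outputs, output_dir, total_steps, steps_per_file):
  """Writes one file per `steps_per_file` steps, skipping finished chunks.

  `step_outputs` is a hypothetical callable mapping a step index to the data
  for that step (a stand-in for running the teacher model).
  """
  # Simplifying assumption for this sketch: no partial trailing chunk.
  if total_steps % steps_per_file != 0:
    raise ValueError("total_steps must be a multiple of steps_per_file")
  out = Path(output_dir)
  out.mkdir(parents=True, exist_ok=True)
  start_step = count_completed_steps(out, steps_per_file)
  buffer = []
  for step in range(start_step, total_steps):
    buffer.append(step_outputs(step))
    if len(buffer) == steps_per_file:
      chunk_idx = step // steps_per_file
      # Write to a temporary name first so an interrupted write is never
      # counted as a completed chunk on the next run.
      tmp_path = out / f"chunk_{chunk_idx:05d}.tmp"
      with open(tmp_path, "wb") as f:
        pickle.dump(buffer, f)
      tmp_path.replace(out / f"chunk_{chunk_idx:05d}.pkl")
      buffer = []
  return start_step
```

In this sketch, steps that were buffered but not yet flushed when the process died are simply recomputed on the next run, so at most `steps_per_file - 1` steps of work are lost per crash.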

Tests

YAML file for testing: YAML

Ran the following command to save top-k teacher logits. The output shows a file being saved every 10 steps when --steps_per_file=10:

Command:
python3 src/maxtext/trainers/post_train/distillation/save_top_k_teacher_logits.py \
  src/maxtext/configs/post_train/distillation.yml \
  --local_tmp_dir=/tmp/save_logits_dir \
  --steps_per_file=10
Output Logs: Logs showing successful saving of top-k teacher logits (every 10 steps in chunks)

I then deliberately interrupted the script with Ctrl+C at 140 steps and ran the top-k logit saving script again to check that saving resumes from the previous point. The output below contains the message "Found existing data, resuming from step 140", which confirms that the fault tolerance works:

Output: Logs for resuming writing logits after abrupt stop
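Resuming at step 140 also means the input data has to be advanced past the batches that were already processed; the commit history for this PR mentions fast-forwarding through the iterator. A hedged sketch of one standard way to do that in Python, where `fast_forward` is a hypothetical helper and a plain iterator stands in for the real data pipeline:

```python
from itertools import islice


def fast_forward(iterator, num_steps):
  """Consumes and discards `num_steps` items, then returns the same iterator."""
  # An islice whose start equals its stop yields nothing, but pulling from it
  # once advances the underlying iterator by `num_steps` items.
  next(islice(iterator, num_steps, num_steps), None)
  return iterator


# Usage: skip the 140 batches that were saved before the interruption.
batches = iter(range(200))  # stand-in for a data iterator
resumed = fast_forward(batches, 140)
first_batch_after_resume = next(resumed)
```

This only works on a true iterator (something that keeps its position), which is why the usage wraps the range in `iter()`.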

Next, I modified the script that verifies the correctness of the saved top-k teacher logits so that it handles chunked files. The output shows that verification succeeds with this change and that the data is written properly:

Command:
python3 src/maxtext/trainers/post_train/distillation/verify_saved_logits.py \
  --output_dir=/tmp/save_logits_dir \
  --expected_steps=140
Output: Verifying top-k teacher logits stored correctly
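A check over chunked output could look roughly like the sketch below. It is not the logic of verify_saved_logits.py: the `verify_chunks` helper, the `chunk_*.pkl` naming, and the assumption that each chunk file holds a pickled list with one entry per step are all made up for illustration.

```python
import pickle
from pathlib import Path


def verify_chunks(output_dir, expected_steps):
  """Checks that the chunk files together hold exactly `expected_steps` steps.

  Returns the number of chunk files found.
  """
  chunk_paths = sorted(Path(output_dir).glob("chunk_*.pkl"))
  if not chunk_paths:
    raise FileNotFoundError(f"no chunk files found in {output_dir}")
  total_steps = 0
  for path in chunk_paths:
    with open(path, "rb") as f:
      steps = pickle.load(f)  # assumed layout: one list entry per step
    total_steps += len(steps)
  if total_steps != expected_steps:
    raise ValueError(f"expected {expected_steps} steps, found {total_steps}")
  return len(chunk_paths)
```

With 140 saved steps and 10 steps per file, a check like this would read 14 chunk files and fail loudly if any chunk were missing or short.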

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-04-02T19:21:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

ajkv-google added 4 commits April 2, 2026 17:44

added implementation for fault-tolerant top-k logit saving

591e363

Added updated command in comments and fixed formatting

4145fbd

updated code to verify teacher logits are saved correctly in chunks

91f735a

Updated code formatting

e862e69

ajkv-google requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, aireenmei, bvandermoon, dipannita08, gagika, gobbleturk, hengtaoguo, igorts-git, jesselu-google, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners April 2, 2026 19:17

vlad-karp reviewed Apr 2, 2026

src/maxtext/trainers/post_train/distillation/save_top_k_teacher_logits.py (outdated, resolved)

src/maxtext/trainers/post_train/distillation/save_top_k_teacher_logits.py (outdated, resolved)

ajkv-google added 2 commits April 3, 2026 17:54

Improved readability of code and made it more efficient when fast-forwarding through the iterator

a5fb4b8

removed un-needed comment

2f2312f

vlad-karp approved these changes Apr 6, 2026

ajkv-google wants to merge 6 commits into main from ajkv/fault-tolerant-save-top-k

2 participants
