
Fault-tolerant top-k teacher logit saving#3555

Open
ajkv-google wants to merge 6 commits into main from ajkv/fault-tolerant-save-top-k


@ajkv-google
Collaborator

Description

This PR introduces a fault-tolerant approach to saving top-k teacher logits. Previously, the top-k teacher logits were written to a file (either local or on GCS) only after the teacher model had completed the configured number of steps. If the job crashed or the script was otherwise interrupted, the top-k teacher logits had to be saved again from scratch.

With this PR, the data is written in chunks to a folder (local or on GCS). You now need to specify a value for the command-line argument --steps_per_file; a file containing the logits for each chunk of that many steps is saved. For example, saving the top-k teacher logits for 100 steps with --steps_per_file=10 creates 10 chunk files. If the program stops abruptly, the next run looks at the output directory, checks how many chunk files were written, and resumes saving the top-k teacher logits from where it left off. This makes data collection fault-tolerant, which is especially useful for long-running experiments.
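The resume logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`completed_steps`, `save_chunk`, `save_top_k_logits`), the `chunk_NNNNN.pkl` file naming, the pickle format, and the `compute_step` callback are all hypothetical, and the sketch assumes a local directory and a total step count that is a multiple of `steps_per_file`.

```python
import os
import pickle


def completed_steps(output_dir, steps_per_file):
    """Return the step to resume from, based on finished chunk files."""
    if not os.path.isdir(output_dir):
        return 0
    # Only count fully written chunks; in-progress files end in ".tmp".
    chunks = [
        name for name in os.listdir(output_dir)
        if name.startswith("chunk_") and name.endswith(".pkl")
    ]
    return len(chunks) * steps_per_file


def save_chunk(output_dir, chunk_index, records):
    """Write one chunk via a temp file so a crash never leaves a partial chunk."""
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"chunk_{chunk_index:05d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(records, f)
    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem


def save_top_k_logits(output_dir, total_steps, steps_per_file, compute_step):
    """Save per-step results in chunks, skipping steps that are already on disk.

    compute_step is a hypothetical stand-in for running the teacher model
    on one step and extracting its top-k logits.
    Assumes total_steps is a multiple of steps_per_file.
    """
    start = completed_steps(output_dir, steps_per_file)
    if start:
        print(f"Found existing data, resuming from step {start}")
    buffer = []
    for step in range(start, total_steps):
        buffer.append(compute_step(step))
        if len(buffer) == steps_per_file:
            save_chunk(output_dir, step // steps_per_file, buffer)
            buffer = []
    return start
```

Counting only completed chunk files means any steps computed after the last finished chunk are recomputed on restart, so at most `steps_per_file - 1` steps of work are lost per interruption.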

Tests

YAML file for testing: YAML

Ran the following command to save top-k teacher logits. The output shows a file being saved every 10 steps when --steps_per_file=10:

Deliberately interrupted the save with Ctrl+C at 140 steps, then ran the top-k logit saving script again to check whether saving resumes from the previous point. The output below contains the message "Found existing data, resuming from step 140", which confirms that the fault tolerance works:

Next, I modified the script that verifies the correctness of the saved top-k teacher logits so that it handles the chunked files. The output shows that verification succeeds with this change and that the data is being written properly.
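For illustration only (this is not the PR's verification script), a check over chunked output might look roughly like the sketch below. It assumes a hypothetical layout in which each chunk is a local file named `chunk_NNNNN.pkl` holding a pickled list of dicts with a `"step"` field; the function name `verify_chunks` is likewise an assumption.

```python
import glob
import os
import pickle


def verify_chunks(output_dir, steps_per_file, expected_steps):
    """Check that chunk files are complete and cover steps 0..expected_steps-1."""
    # Zero-padded names sort in chunk order.
    paths = sorted(glob.glob(os.path.join(output_dir, "chunk_*.pkl")))
    next_step = 0
    for path in paths:
        with open(path, "rb") as f:
            records = pickle.load(f)
        if len(records) != steps_per_file:
            raise ValueError(f"{path}: expected {steps_per_file} records, got {len(records)}")
        for record in records:
            # Steps must be contiguous across chunk boundaries.
            if record["step"] != next_step:
                raise ValueError(f"{path}: expected step {next_step}, got {record['step']}")
            next_step += 1
    if next_step != expected_steps:
        raise ValueError(f"expected {expected_steps} steps in total, found {next_step}")
    return next_step
```

A real check would also compare the stored logits against reference values; this sketch only validates that the chunks are complete and contiguous.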

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



2 participants