Skip to content

Fix: proactive auth refresh to reload guarded disk state first#15357

Open
celia-oai wants to merge 2 commits intomainfrom
dev/cc/fix-auth
Open

Fix: proactive auth refresh to reload guarded disk state first#15357
celia-oai wants to merge 2 commits intomainfrom
dev/cc/fix-auth

Conversation

@celia-oai
Copy link
Collaborator

@celia-oai celia-oai commented Mar 20, 2026

Summary

Fix a managed ChatGPT auth bug where a stale Codex process could proactively refresh using an old in-memory refresh token even after another process had already rotated auth on disk.

This changes the proactive AuthManager::auth() path to reuse the existing guarded refresh_token() flow instead of calling the refresh endpoint directly from cached auth state.

Original Issue

Users reported repeated codexd log lines like:

ERROR codex_core::auth: Failed to refresh token: error sending request for url (https://auth.openai.com/oauth/token)

In practice this showed up most often when multiple codexd processes were left running. Killing the extra processes stopped the noise, which suggested the issue was caused by stale auth state across processes rather than invalid user credentials.

Diagnosis

The bug was in the proactive refresh path used by AuthManager::auth():

  • Process A could refresh successfully, rotate refresh token R0 to R1, and persist the updated auth state plus last_refresh to disk.
  • Process B could keep an older auth snapshot cached in memory, still holding R0 and the old last_refresh.
  • Later, when Process B called auth(), it checked staleness from its cached in-memory auth instead of first reloading from disk.
  • Because that cached last_refresh was stale, Process B would proactively call /oauth/token with stale refresh token R0.
  • On failure, auth() logged the refresh error but kept returning the same stale cached auth, so repeated auth() calls could keep retrying with dead state.

This differed from the existing unauthorized-recovery flow, which already did the safer thing: guarded reload from disk first, then refresh only if the on-disk auth was unchanged.

What Changed

  • Switched proactive refresh in AuthManager::auth() to:
    • do a pure staleness check on cached auth
    • call refresh_token() when stale
    • return the original cached auth on genuine refresh failure, preserving existing outward behavior
  • Removed the direct proactive refresh-from-cached-state path
  • Added regression tests covering:
    • stale cached auth with newer same-account auth already on disk
    • the same scenario even when the refresh endpoint would fail if called

Why This Fix

refresh_token() already contains the right cross-process safety behavior:

  • guarded reload from disk
  • same-account verification
  • skip-refresh when another process already changed auth

Reusing that path makes proactive refresh consistent with unauthorized recovery and prevents stale processes from trying to refresh already-rotated tokens.

Testing

Test shape:

  • create a fresh temp CODEX_HOME from ~/.codex/auth.json
  • force last_refresh to an old timestamp so proactive refresh is required
  • start two long-lived helper processes against the same auth file
  • start B first so it caches stale auth and sleeps
  • start A second so it refreshes first
  • point both at a local mock /oauth/token server
  • inspect whether B makes a second refresh request with the stale in-memory token, or reloads the rotated token from disk

Before the fix

The repro showed the bug clearly: the mock server saw two refreshes with the same stale token, A rotated to a new token, and B still returned the stale token instead of reloading from disk.

POST /oauth/token refresh_token=rt_j6s0...
POST /oauth/token refresh_token=rt_j6s0...

B:cached_before=rt_j6s0...
B:cached_after=rt_j6s0...
B:returned=rt_j6s0...

A:cached_before=rt_j6s0...
A:cached_after=rotated-refresh-token-logged-run-v2
A:returned=rotated-refresh-token-logged-run-v2

After the fix

After the fix, the mock server saw only one refresh request. A refreshed once, and B started with the stale token but reloaded and returned the rotated token.

POST /oauth/token refresh_token=rt_j6s0...

B:cached_before=rt_j6s0...
B:cached_after=rotated-refresh-token-fix-branch
B:returned=rotated-refresh-token-fix-branch

A:cached_before=rt_j6s0...
A:cached_after=rotated-refresh-token-fix-branch
A:returned=rotated-refresh-token-fix-branch

This shows the new behavior: A refreshes once, then B reuses the updated auth from disk instead of making a second refresh request with the stale token.

@celia-oai celia-oai marked this pull request as ready for review March 21, 2026 00:00
@celia-oai
Copy link
Collaborator Author

@codex review

Copy link
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41a653fd81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1097 to +1099
if Self::is_stale_for_proactive_refresh(&auth)
&& let Err(err) = self.refresh_token().await
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep proactive refresh working when account ID is missing

auth() now routes stale ChatGPT auth through refresh_token(), but that path hard-fails when expected_account_id is None. Auth files can legitimately lack account_id (it is optional), so proactive refresh now returns stale cached auth instead of refreshing/reloading. This can leave expired tokens in use and cause repeated auth failures/log spam for those users.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pakrym you added this logic to skip reloading when there's no account_id: https://github.com/openai/codex/pull/8880/changes#diff-c618f196cd6214e554ed7c8845bdfefb3cd013c58274b10f3afe5d792e7d0af3R725-R727

is this expected behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant