fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792
fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792giulio-leone wants to merge 11 commits intostrands-agents:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Prevents process hangs on interpreter shutdown by ensuring MCPClient.stop() doesn’t block indefinitely when joining the background thread (notably when called from Agent.__del__).
Changes:
- Add a timeout (based on
startup_timeout) to_background_thread.join()inMCPClient.stop() - Log a warning when the join times out and continue cleanup
- Update/add tests to assert
join(timeout=...)is used and exercise the timeout path
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/strands/tools/mcp/mcp_client.py |
Adds join(timeout=...) + warning when the background thread doesn’t exit promptly. |
tests/strands/tools/mcp/test_mcp_client.py |
Updates existing stop() tests for the new join timeout behavior and adds a new timeout-path test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
All CI checks pass. Ready for review. |
|
Hi! Gentle ping — this PR is rebased, CI passes, and ready for review. Happy to address any feedback. Thanks! |
|
Hi @lizradway — I have verified the tests locally and all 52 tests in The Could you try re-running the CI? It may have been a transient issue. |
…hangs When an Agent holding an MCPClient goes out of scope inside a function, the Agent.__del__ finalizer calls MCPClient.stop() which calls _background_thread.join(). If the background thread cannot exit promptly (e.g. transport subprocess teardown is slow), join() blocks indefinitely, causing the entire process to hang on exit. Add a timeout (equal to startup_timeout, default 30s) to the join() call. If the thread does not exit within the timeout, log a warning and proceed with cleanup. The thread is already a daemon thread (daemon=True), so it will be cleaned up by the interpreter on process exit. Fixes strands-agents#1732 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…imeout When join() times out and the thread is still alive, return early without closing the event loop or resetting internal state. This prevents RuntimeError from closing a running loop and avoids allowing a second start() while teardown is still in progress. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand the docstring to explicitly describe the expected behavior: skip event loop close and state reset when thread is still alive.
Applied Copilot suggestion: remove duplicate stray ")" and "return" lines that were introduced by a bad suggestion application. The timeout/early-return block in stop() now has exactly one return statement after the logger.warning() call.
…d exits The event_loop.close() call was accidentally dropped during a prior review-fix commit. Restore it on the normal (non-timeout) path so the asyncio event loop is properly closed when the background thread exits cleanly, preventing event loop leaks. The timeout early-return path correctly skips this to avoid closing a loop that may still be running in the background thread. Refs: strands-agents#1792
When stop() is called after start() timeout, the background thread is still alive. Previously stop() returned early without resetting state, which broke start() timeout cleanup (test_mcp_client_state_reset_after_timeout). Now stop() skips only the event loop close (unsafe while thread runs) but still resets all state fields so the client can be reused. Refs: strands-agents#1792
62e7472 to
e2f711a
Compare
|
Hi @lizradway — rebased on latest main (v1.29.0) and confirmed all 52 tests in The warnings are pre-existing unawaited-coroutine warnings from MagicMock interactions in other test functions — unrelated to this PR. Could you try pulling the latest force-push and re-running? Happy to investigate further if you still see failures. |
|
Hi @lizradway — the Root cause: The Fix: Removed the early
All 52 tests in |
Covers the elif branch in stop() where _background_thread is None but _background_thread_event_loop is not None, ensuring the event loop is still properly closed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback: when join() times out and the thread is still alive, return early without resetting internal state. This prevents: 1. RuntimeError from closing event loop while it's still running 2. A second background thread being spawned via start() while the first is still tearing down Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Closing to reduce PR volume. Happy to resubmit if the team finds this fix useful. |
Summary
When an
Agentholding anMCPClientis created inside a function and goes out of scope, theAgent.__del__finalizer callsMCPClient.stop()which invokes_background_thread.join(). If the background thread cannot exit promptly (e.g. transport subprocess teardown is slow),join()blocks indefinitely, causing the entire process to hang on exit.Changes
startup_timeout, default 30s) to thejoin()call instop()daemon=True), so it will be cleaned up by the interpreter on process exitTesting
join()is called with timeouttest_stop_does_not_hang_when_join_times_outcovering the timeout pathFixes #1732