fix: recover stale MCP sessions after server restart #1418
DmarshalTU wants to merge 4 commits into kagent-dev:main from
Conversation
Pull request overview
This PR improves resilience when remote MCP servers restart by validating cached MCP sessions with a ping and by adding an agent health endpoint used for Kubernetes liveness checks to trigger pod restarts when MCP connectivity is degraded.
Changes:
- Add `KAgentMCPSessionManager` to ping-validate cached sessions and transparently recreate stale sessions.
- Add a `/healthz/mcp` endpoint that concurrently checks all configured MCP servers.
- Add a Kubernetes `LivenessProbe` for agent pods that calls `/healthz/mcp`, plus unit tests covering session recovery and the health endpoint.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py | Introduces ping-validated MCP session manager and wires it into the toolset. |
| python/packages/kagent-adk/src/kagent/adk/_a2a.py | Adds /healthz/mcp endpoint and concurrent MCP server checks for liveness. |
| go/internal/controller/translator/agent/adk_api_translator.go | Adds agent pod LivenessProbe hitting /healthz/mcp. |
| python/packages/kagent-adk/tests/unittests/test_mcp_session_recovery.py | Adds unit tests for stale-session recovery and ping behavior. |
| python/packages/kagent-adk/tests/unittests/test_mcp_health_endpoint.py | Adds unit tests for /healthz/mcp success/failure behaviors. |
```python
errors = {url: err for url, err in results if err is not None}

if errors:
    return JSONResponse(
        {"status": "error", "errors": errors},
        status_code=503,
    )
```
The /healthz/mcp endpoint returns raw exception strings keyed by full server URL in the response body. Since this endpoint is callable over HTTP, it can unintentionally expose internal network details and (depending on how MCP URLs are configured) potentially leak secrets embedded in URLs or error messages. Consider returning only a boolean status (and maybe a count), and logging the detailed per-server errors server-side instead (or redact/sanitize URLs and error strings before returning).
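A sanitized variant along the lines of this suggestion could log the full details server-side and return only aggregate counts to the client. A sketch (the `results` list of `(url, error_or_None)` pairs is assumed from the surrounding handler; the helper name is hypothetical):

```python
# Hypothetical sketch: aggregate health summary, details kept server-side.
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def build_health_response(
    results: list[tuple[str, Optional[str]]],
) -> tuple[dict, int]:
    """Summarize (url, error_or_None) pairs without echoing URLs or
    raw exception strings back to the HTTP client."""
    errors = {url: err for url, err in results if err is not None}
    for url, err in errors.items():
        # Full detail stays in the server logs only.
        logger.warning("MCP health check failed for %s: %s", url, err)
    if errors:
        return {"status": "error", "failed": len(errors), "total": len(results)}, 503
    return {"status": "ok", "total": len(results)}, 200
```

The body and status code would then be handed to `JSONResponse` as in the handler above.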
```go
LivenessProbe: &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{Path: "/healthz/mcp", Port: intstr.FromString("http")},
	},
	InitialDelaySeconds: 30,
	PeriodSeconds:       30,
	TimeoutSeconds:      10,
	FailureThreshold:    3,
},
```
Configuring the agent's Kubernetes liveness probe to depend on external MCP server availability can cause restart loops when MCP is temporarily down or undergoing maintenance, even though restarting the pod may not help. Consider making this check a readiness probe (to stop routing traffic) and/or making the liveness probe more focused on detecting the specific "stale session" failure mode (e.g., only fail liveness on repeated session-terminated/404 patterns, or add a separate internal liveness that doesn't depend on external services).
When the MCP server restarts, the agent can end up in a wedged state (e.g. anyio cancel scope corruption, see #1276) where it is still serving HTTP but cannot recover without a restart. Readiness would only stop routing traffic; it would not fix the wedged process. Liveness ensures Kubernetes restarts the pod so it can recover.
If MCP is down for maintenance, the agent is effectively unusable anyway; restarting the pod keeps state clean and allows a fresh start when MCP comes back.
my only concern with this approach is that it doesn't actually fix the issue if the underlying mcp server is down or unresponsive and causes unnecessary churn in the environment. I think the subclassed mcp session manager fix makes a lot of sense and the approach of focusing on ensuring that we prune stale mcp sessions and restart using new mcp sessions is the direction we should probably pursue.
For now I can't point to all the edge cases in that scenario, but I know one fix that solves it, and that's an agent pod restart.
I agree with you that it's an "ugly" one, but for the moment I'm treating it as the necessary fallback.
I'll leave the final call to you.
Yeah, I understand your point, and I agree that an agent that is ready but can't give responses is not a useful agent, so I can see the argument for restarting the agent pod. However, I'm leaning towards relying on your `MCPSessionManager` subclass fix as the primary line of defense and emitting the appropriate warnings so the error is clear in the logs. The reason is that even if `retry_on_error` fails because the MCP server is not back up yet, with your change the request will eventually recover once the MCP server is online. Restarting the agent itself will not actually resolve the issue when the dependent MCP server is unhealthy, so in the worst case we either have an agent in CrashLoopBackOff because of this liveness probe, or logs that indicate this is a dependent MCP server issue. I feel like the latter gives the user a clearer picture of the underlying error in this scenario. What do you think?
In my personal view, I prefer that the agent pod status not be Running if it can't communicate with the MCP server, because that is reflected visually in the UI and the agent gets labeled as not ready. The same goes for a remote MCP: it needs to be accepted in the CRD to work. The problem is when everything looks OK but the MCP gets a rollout: the static session ID prevents the agent from reconnecting, yet to the user everything seems legit. The agent is ready, the MCP is accepted, and the only indication of failure is when they try to chat. I don't see a reason for the agent to be ready if one of the MCPs it uses isn't alive. For this PR I will remove the liveness probe and we can bring this discussion back later on.
```python
async def _check_one(params: Any) -> tuple[str, Optional[str]]:
    """Return (url, error_string | None) for a single MCP server."""
    url = getattr(params, "url", "unknown")
    mgr = KAgentMCPSessionManager(connection_params=params)
    try:
        await asyncio.wait_for(
            mgr.create_session(), timeout=_MCP_HEALTH_TIMEOUT_SECONDS
        )
```
This liveness handler establishes and tears down new MCP sessions on every probe interval (one per configured server). That can be relatively expensive (extra handshakes, auth, server load) and can amplify load during outages when probes retry. Consider a cheaper reachability check (if available) or caching results briefly within the probe window to reduce repeated session creation.
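A brief result cache of the kind suggested here could look like the following sketch (a hypothetical helper, not part of the PR; the wrapped `check` coroutine stands in for the real session-based probe, and the TTL value is illustrative):

```python
import time
from typing import Awaitable, Callable, Optional

class CachedProbe:
    """Hypothetical helper: remember the last health verdict for a short
    TTL so back-to-back probe hits don't each open a fresh MCP session."""

    def __init__(self, check: Callable[[], Awaitable[bool]], ttl_seconds: float = 20.0) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._cached_at = 0.0
        self._cached_ok: Optional[bool] = None

    async def healthy(self) -> bool:
        now = time.monotonic()
        if self._cached_ok is not None and (now - self._cached_at) < self._ttl:
            return self._cached_ok  # serve the cached verdict
        self._cached_ok = await self._check()  # run the expensive real check
        self._cached_at = time.monotonic()
        return self._cached_ok
```

With a TTL below the probe period this only bounds load during retry storms; within one probe window repeated calls reuse the cached verdict instead of creating sessions.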
The cost is acceptable for the benefit of reliable liveness detection in my opinion.
| async def create_session( | ||
| self, headers: Optional[Dict[str, str]] = None | ||
| ) -> ClientSession: |
Type hints in this module already use built-in generics (e.g., `list[BaseTool]`, `set[int]`). For consistency, consider using `dict[str, str] | None` (or `Optional[dict[str, str]]`) here instead of importing `Dict` from `typing`.
97a30cd to d5516bb
```python
try:
    await asyncio.wait_for(session.send_ping(), timeout=_PING_TIMEOUT_SECONDS)
    return session
except Exception as exc:
    if _is_server_alive_error(exc):
        return session
```
do you think it makes sense to expand the scope here so that once we validate the mcp server is ready, we revalidate the session to ensure it doesn't get a 404? And if it does, prune the session from the cache and recreate a new one?
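One shape the suggested prune-and-recreate could take, as a sketch with stand-in types (the real manager's cache layout and session API may differ; the timeout constant is illustrative):

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

_PING_TIMEOUT_SECONDS = 5.0  # illustrative value

async def get_validated_session(
    cache: Dict[str, Any],
    key: str,
    factory: Callable[[], Awaitable[Any]],
) -> Any:
    """Return the cached session only if it still answers a ping;
    otherwise prune it from the cache and create a fresh one."""
    session = cache.get(key)
    if session is not None:
        try:
            await asyncio.wait_for(session.send_ping(), timeout=_PING_TIMEOUT_SECONDS)
            return session
        except Exception:
            # Stale (e.g. the restarted server 404s the old session id):
            # drop it from the cache and fall through to recreate.
            cache.pop(key, None)
    session = await factory()
    cache[key] = session
    return session
```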
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Supports confirmation flow for HITL (single and parallel tool use). Flow documented in `docs/architecture/human-in-the-loop.md`.
Todo: custom payload so the agent can invoke tools that require user input.
3e4ac86 to 4b09c09
Problem
When a remote MCP server restarts (same URL, new process), agents using Streamable HTTP or SSE transport become permanently unresponsive with "Session terminated" errors. The agent pod must be manually restarted to recover.
kagent-dev#1280
Root Cause
The upstream `MCPSessionManager` caches MCP sessions and checks `_read_stream._closed` / `_write_stream._closed` to decide if a cached session is still usable. These are in-memory anyio channels that stay open even when the remote server restarts, so the check always passes. The stale `mcp-session-id` is sent to the new server, which replies with HTTP 404, translated to "Session terminated" by the MCP SDK. The `@retry_on_errors` decorator retries, but hits the same stale cache.

No liveness probe exists to detect this state, so Kubernetes never restarts the pod.
Upstream issues: python-sdk#1811, python-sdk#2150
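The failure mode can be illustrated with a toy version of the two checks (the session object and both check functions below are stand-ins that only mirror the description above, not the real SDK types):

```python
class StaleSession:
    """Toy session whose local streams look open even though the remote
    server has restarted and forgotten the session id."""

    class _Stream:
        _closed = False  # in-memory anyio channel: never closes locally

    _read_stream = _Stream()
    _write_stream = _Stream()

    async def send_ping(self) -> None:
        # The new server process rejects our stale mcp-session-id.
        raise RuntimeError("Session terminated")

def stream_closed_check(session) -> bool:
    """Mirrors the upstream test: passes even for stale sessions."""
    return not session._read_stream._closed and not session._write_stream._closed

async def ping_check(session) -> bool:
    """An end-to-end ping actually exercises the remote server."""
    try:
        await session.send_ping()
        return True
    except Exception:
        return False
```

The stream-closed check reports the stale session as healthy while the ping check correctly flags it, which is the gap the PR closes.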
Fix
Layer 1: Ping-validated session cache
`KAgentMCPSessionManager` subclasses the upstream `MCPSessionManager` and overrides `create_session()` to call `session.send_ping()` after the parent returns a cached session. If the ping fails, the stale session is torn down and a fresh one is created transparently. Combined with `@retry_on_errors`, the next request after an MCP server restart recovers automatically, without a pod restart.

MCP servers that don't implement the optional `ping` method reply with JSON-RPC `-32601` (Method not found). This is treated as proof of liveness: the server is reachable and responding.

The `close()` call during stale-session teardown is wrapped in try/except to ensure recovery proceeds even if cleanup raises (e.g. known anyio cancel scope errors).

Layer 2: Liveness probe
A `/healthz/mcp` endpoint on the agent's FastAPI app creates a throwaway session manager for each configured MCP server, establishes a session, and validates it via ping. It returns 200 when all servers respond, 503 otherwise. Checks run concurrently via `asyncio.gather` to avoid exceeding the probe timeout when multiple MCP servers are configured.

The agent pod spec now includes a `LivenessProbe` hitting `/healthz/mcp` every 30s with a failure threshold of 3 (90s of MCP unreachability before a pod restart).

I tested this against my setup: controller and agents in one cluster and MCP servers in a few others, over HTTP/SSE connections.
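The concurrent fan-out described for the endpoint can be sketched like this (the per-server `check_one` coroutine stands in for the real throwaway-session-manager check, and the timeout value is illustrative):

```python
import asyncio
from typing import Awaitable, Callable, Optional

_MCP_HEALTH_TIMEOUT_SECONDS = 8.0  # illustrative value

async def check_all(
    urls: list[str],
    check_one: Callable[[str], Awaitable[None]],
) -> tuple[bool, dict[str, str]]:
    """Probe every configured server concurrently; return
    (all_healthy, {url: error}) for the servers that failed."""

    async def _guarded(url: str) -> tuple[str, Optional[str]]:
        try:
            await asyncio.wait_for(check_one(url), timeout=_MCP_HEALTH_TIMEOUT_SECONDS)
            return url, None
        except Exception as exc:
            # One slow or broken server must not sink the whole gather.
            return url, str(exc) or type(exc).__name__

    results = await asyncio.gather(*(_guarded(u) for u in urls))
    errors = {url: err for url, err in results if err is not None}
    return not errors, errors
```

Because every `_guarded` call swallows its own exception, `asyncio.gather` never raises and the total probe time is bounded by the single per-server timeout rather than their sum.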
Images I tested with:
docker.io/dmarshaltu/kagent-app:gh-fix
docker.io/dmarshaltu/kagent-controller:gh-fix
docker.io/dmarshaltu/kagent-adk:gh-fix
Important
Changes

- `python/.../kagent/adk/_mcp_toolset.py`: `KAgentMCPSessionManager` with ping-validated cache, `_is_server_alive_error` helper
- `python/.../kagent/adk/_a2a.py`: `/healthz/mcp` endpoint with concurrent server checks
- `go/.../translator/agent/adk_api_translator.go`: `LivenessProbe` on `/healthz/mcp` in agent pod spec
- `python/.../tests/unittests/test_mcp_session_recovery.py`
- `python/.../tests/unittests/test_mcp_health_endpoint.py`