
fix: recover stale MCP sessions after server restart#1418

Open
DmarshalTU wants to merge 4 commits into kagent-dev:main from DmarshalTU:fix/graceful_handling

Conversation

@DmarshalTU

Problem

When a remote MCP server restarts (same URL, new process), agents using Streamable HTTP or SSE transport become permanently unresponsive with "Session terminated" errors. The agent pod must be manually restarted to recover.

kagent-dev#1280

Root Cause

The upstream MCPSessionManager caches MCP sessions and checks _read_stream._closed / _write_stream._closed to decide if a cached session is still usable. These are in-memory anyio channels that stay open even when the remote server restarts, so the check always passes. The stale mcp-session-id is sent to the new server, which replies with HTTP 404, translated to "Session terminated" by the MCP SDK. The @retry_on_errors decorator retries, but hits the same stale cache.

Agent request → MCPSessionManager.create_session()
  → cache hit (in-memory streams appear open)
  → returns stale session with old mcp-session-id
  → POST /mcp with stale ID → HTTP 404 → "Session terminated"
  → @retry_on_errors retries → same cache hit → same failure
  → error propagates to user
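
The failure mode follows from the fact that the freshness check never touches the network. A minimal, self-contained sketch of why the check always passes (hypothetical `FakeStream`/`CachedSession` names for illustration, not the SDK's actual classes):

```python
# Hypothetical illustration of the stale-cache check described above.
# The upstream manager inspects only local stream state, so a remote
# server restart is invisible to it.

class FakeStream:
    """Stands in for an in-memory anyio channel; a remote restart
    never closes it, so _closed stays False."""
    _closed = False

class CachedSession:
    def __init__(self) -> None:
        self._read_stream = FakeStream()
        self._write_stream = FakeStream()
        self.session_id = "mcp-session-id-from-old-process"

    def looks_alive(self) -> bool:
        # Mirrors the upstream check: local stream flags only, no I/O.
        return not (self._read_stream._closed or self._write_stream._closed)

session = CachedSession()
# The remote MCP process has restarted, yet the check still passes,
# so the stale session_id is sent and the new server answers 404.
print(session.looks_alive())  # True
```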

No liveness probe exists to detect this state, so Kubernetes never restarts the pod.

Upstream issues: python-sdk#1811, python-sdk#2150

Fix

Layer 1: Ping-validated session cache

KAgentMCPSessionManager subclasses the upstream MCPSessionManager and overrides create_session() to call session.send_ping() after the parent returns a cached session. If the ping fails, the stale session is torn down and a fresh one is created transparently. Combined with @retry_on_errors, the next request after an MCP server restart recovers automatically without a pod restart.

MCP servers that don't implement the optional ping method reply with JSON-RPC -32601 (Method not found). This is treated as proof of liveness — the server is reachable and responding.

The close() call during stale session teardown is wrapped in try/except to ensure recovery proceeds even if cleanup raises (e.g. known anyio cancel scope errors).
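
A condensed, self-contained sketch of the Layer 1 logic (here `StubSessionManager` stands in for the upstream MCPSessionManager, and names and details are approximate rather than the PR's exact code):

```python
import asyncio

_PING_TIMEOUT_SECONDS = 5.0

class StubSessionManager:
    """Hypothetical stand-in for the upstream MCPSessionManager's
    session cache: create once, reuse until closed."""
    def __init__(self, session_factory):
        self._factory = session_factory
        self._cached = None

    async def create_session(self):
        if self._cached is None:
            self._cached = self._factory()
        return self._cached

    async def close(self):
        self._cached = None

def _is_server_alive_error(exc: Exception) -> bool:
    # JSON-RPC -32601 (Method not found) means the server answered but
    # does not implement the optional ping -- treat as proof of liveness.
    return "-32601" in str(exc) or "Method not found" in str(exc)

class PingValidatedSessionManager(StubSessionManager):
    async def create_session(self):
        session = await super().create_session()
        try:
            await asyncio.wait_for(session.send_ping(), _PING_TIMEOUT_SECONDS)
            return session
        except Exception as exc:
            if _is_server_alive_error(exc):
                return session
            # Stale session: prune it from the cache (best effort) and
            # transparently build a fresh one.
            try:
                await self.close()
            except Exception:
                pass  # proceed even if cleanup raises
            return await super().create_session()
```

Usage: if the cached session's ping fails with a connection-level error, the manager discards it and hands the caller a brand-new session, so the next request after an MCP restart succeeds without a pod restart.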

Layer 2: Liveness probe

A /healthz/mcp endpoint on the agent's FastAPI app creates a throwaway session manager for each configured MCP server, establishes a session, and validates via ping. Returns 200 when all servers respond, 503 otherwise. Checks run concurrently via asyncio.gather to avoid exceeding the probe timeout with multiple MCP servers.

The agent pod spec now includes a LivenessProbe hitting /healthz/mcp every 30s with a failure threshold of 3 (90s of MCP unreachable before pod restart).
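
The Layer 2 probe logic can be sketched as follows (a simplified, self-contained version; `check_mcp_servers` and its signature are illustrative, not the PR's exact handler):

```python
import asyncio

_MCP_HEALTH_TIMEOUT_SECONDS = 5.0

async def check_mcp_servers(checks):
    """checks: mapping of server URL -> async callable that pings it.
    Returns (http_status, per-server error strings) for a /healthz/mcp
    style handler."""
    async def _check_one(url, probe):
        try:
            await asyncio.wait_for(probe(), _MCP_HEALTH_TIMEOUT_SECONDS)
            return url, None
        except Exception as exc:
            return url, str(exc)

    # Run all probes concurrently so N servers cost one timeout, not N,
    # keeping the check inside the kubelet's probe timeout.
    results = await asyncio.gather(
        *(_check_one(url, probe) for url, probe in checks.items())
    )
    errors = {url: err for url, err in results if err is not None}
    return (503, errors) if errors else (200, {})
```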

I tested this against my setup: controller and agents in one cluster and MCP servers in a few others, over HTTP/SSE connections.
The images I tested with:
docker.io/dmarshaltu/kagent-app:gh-fix
docker.io/dmarshaltu/kagent-controller:gh-fix
docker.io/dmarshaltu/kagent-adk:gh-fix

Important

  • In-flight requests during MCP restart still fail. Layer 1 ensures the next request recovers.
  • Upstream MCP SDK bugs are mitigated but not eliminated.

Changes

File Change
python/.../kagent/adk/_mcp_toolset.py KAgentMCPSessionManager with ping-validated cache, _is_server_alive_error helper
python/.../kagent/adk/_a2a.py /healthz/mcp endpoint with concurrent server checks
go/.../translator/agent/adk_api_translator.go LivenessProbe on /healthz/mcp in agent pod spec
python/.../tests/unittests/test_mcp_session_recovery.py 6 tests: ping success, stale recovery, server down, timeout, method-not-found, close() failure
python/.../tests/unittests/test_mcp_health_endpoint.py 5 tests: no tools, null config, healthy, unhealthy 503, method-not-found

Copilot AI review requested due to automatic review settings March 3, 2026 12:14
Copy link
Contributor

Copilot AI left a comment


Pull request overview

This PR improves resilience when remote MCP servers restart by validating cached MCP sessions with a ping and by adding an agent health endpoint used for Kubernetes liveness checks to trigger pod restarts when MCP connectivity is degraded.

Changes:

  • Add KAgentMCPSessionManager to ping-validate cached sessions and transparently recreate stale sessions.
  • Add /healthz/mcp endpoint that concurrently checks all configured MCP servers.
  • Add Kubernetes LivenessProbe for agent pods that calls /healthz/mcp, plus unit tests covering session recovery and the health endpoint.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File Description
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py Introduces ping-validated MCP session manager and wires it into the toolset.
python/packages/kagent-adk/src/kagent/adk/_a2a.py Adds /healthz/mcp endpoint and concurrent MCP server checks for liveness.
go/internal/controller/translator/agent/adk_api_translator.go Adds agent pod LivenessProbe hitting /healthz/mcp.
python/packages/kagent-adk/tests/unittests/test_mcp_session_recovery.py Adds unit tests for stale-session recovery and ping behavior.
python/packages/kagent-adk/tests/unittests/test_mcp_health_endpoint.py Adds unit tests for /healthz/mcp success/failure behaviors.


Comment on lines +101 to +107
errors = {url: err for url, err in results if err is not None}

if errors:
    return JSONResponse(
        {"status": "error", "errors": errors},
        status_code=503,
    )
Copilot AI Mar 3, 2026

The /healthz/mcp endpoint returns raw exception strings keyed by full server URL in the response body. Since this endpoint is callable over HTTP, it can unintentionally expose internal network details and (depending on how MCP URLs are configured) potentially leak secrets embedded in URLs or error messages. Consider returning only a boolean status (and maybe a count), and logging the detailed per-server errors server-side instead (or redact/sanitize URLs and error strings before returning).

Author

OK

Comment on lines +527 to +535
LivenessProbe: &corev1.Probe{
    ProbeHandler: corev1.ProbeHandler{
        HTTPGet: &corev1.HTTPGetAction{Path: "/healthz/mcp", Port: intstr.FromString("http")},
    },
    InitialDelaySeconds: 30,
    PeriodSeconds:       30,
    TimeoutSeconds:      10,
    FailureThreshold:    3,
},
Copilot AI Mar 3, 2026

Configuring the agent's Kubernetes liveness probe to depend on external MCP server availability can cause restart loops when MCP is temporarily down or undergoing maintenance, even though restarting the pod may not help. Consider making this check a readiness probe (to stop routing traffic) and/or making the liveness probe more focused on detecting the specific "stale session" failure mode (e.g., only fail liveness on repeated session-terminated/404 patterns, or add a separate internal liveness that doesn't depend on external services).

Author

When the MCP server restarts, the agent can end up in a wedged state
(e.g. anyio cancel scope corruption, see #1276)

where it is still serving HTTP but cannot recover without a restart. Readiness would only stop routing traffic; it would not fix the wedged process.
Liveness ensures Kubernetes restarts the pod so it can recover.
If MCP is down for maintenance, the agent is effectively unusable anyway; restarting the pod keeps state clean and allows a fresh start when MCP comes back.

Contributor

my only concern with this approach is that it doesn't actually fix the issue if the underlying mcp server is down or unresponsive and causes unnecessary churn in the environment. I think the subclassed mcp session manager fix makes a lot of sense and the approach of focusing on ensuring that we prune stale mcp sessions and restart using new mcp sessions is the direction we should probably pursue.

Author

For now I can't enumerate all the edge cases in that scenario, but I know one fix that resolves it: an agent pod restart.
I agree with you that it's an "ugly" one, but for the moment I'm treating it as the necessary fallback.

I'll leave the final call to you.

Contributor

yeah I understand your point, and I also agree that an agent that is ready but can't give responses is not a useful agent, so I can see the argument for restarting the agent pod. However, I'm leaning towards relying on your MCPSessionManager subclass fix as the primary line of defense and emitting appropriate warnings so the error is clear in the logs. The reason is that if retry_on_error fails because the MCP server is not back up yet, with your change the request will eventually recover once the MCP server is online. Restarting the agent itself will not actually resolve the issue when the dependent MCP server is unhealthy, so in the worst case this liveness probe puts the agent in crash loop backoff, versus logs that indicate this is a dependent MCP server issue. I feel like the latter gives the user a clearer picture of the underlying error in this scenario. What do you think?

Author

In my personal view, I'd prefer the agent pod's status not be Running if it can't communicate with the MCP server, because that is visually reflected in the UI and the agent gets labeled as not ready. The same goes for a remote MCP: it needs to be accepted in the CRD to work. The problem is that when everything looks fine but the MCP gets a rollout, the static session ID prevents the agent from reconnecting, yet to the user it all seems legit: the agent is ready, the MCP is accepted, and the only indication of failure is when they try to chat. I don't see a reason for an agent to be ready if one of the MCPs it uses isn't alive. For this PR I will remove the liveness probe and we can bring this discussion back later on.

Comment on lines +75 to +82
async def _check_one(params: Any) -> tuple[str, Optional[str]]:
    """Return (url, error_string | None) for a single MCP server."""
    url = getattr(params, "url", "unknown")
    mgr = KAgentMCPSessionManager(connection_params=params)
    try:
        await asyncio.wait_for(
            mgr.create_session(), timeout=_MCP_HEALTH_TIMEOUT_SECONDS
        )
Copilot AI Mar 3, 2026

This liveness handler establishes and tears down new MCP sessions on every probe interval (one per configured server). That can be relatively expensive (extra handshakes, auth, server load) and can amplify load during outages when probes retry. Consider a cheaper reachability check (if available) or caching results briefly within the probe window to reduce repeated session creation.

Author

The cost is acceptable for the benefit of reliable liveness detection in my opinion.

Comment on lines +46 to +48
async def create_session(
    self, headers: Optional[Dict[str, str]] = None
) -> ClientSession:
Copilot AI Mar 3, 2026

Type hints in this module already use built-in generics (e.g., list[BaseTool], set[int]). For consistency, consider using dict[str, str] | None (or dict[str, str] with Optional) here instead of importing Dict from typing.

Author

OK

@DmarshalTU DmarshalTU force-pushed the fix/graceful_handling branch from 97a30cd to d5516bb Compare March 3, 2026 13:04
Comment on lines +49 to +54
try:
    await asyncio.wait_for(session.send_ping(), timeout=_PING_TIMEOUT_SECONDS)
    return session
except Exception as exc:
    if _is_server_alive_error(exc):
        return session
Contributor

do you think it makes sense to expand the scope here so once we validate the mcp server is ready, we revalidate the session to ensure that it doesn't get a 404? And if so we prune the session from the cache and recreate a new session?

Author

yes, that makes sense

DmarshalTU and others added 4 commits March 4, 2026 00:27
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Supports confirmation flow for HITL (single and parallel tool use). Flow
documented in `docs/architecture/human-in-the-loop.md`

Todo: custom payload so the agent can invoke tools that require user
input

---------

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
@DmarshalTU DmarshalTU force-pushed the fix/graceful_handling branch from 3e4ac86 to 4b09c09 Compare March 4, 2026 08:27