
fix: recover stale MCP sessions after server restart#1418

Open
DmarshalTU wants to merge 4 commits into kagent-dev:main from DmarshalTU:fix/graceful_handling

Conversation

@DmarshalTU

Problem

When a remote MCP server restarts (same URL, new process), agents using Streamable HTTP or SSE transport become permanently unresponsive with "Session terminated" errors. The agent pod must be manually restarted to recover.

kagent-dev#1280

Root Cause

The upstream MCPSessionManager caches MCP sessions and checks _read_stream._closed / _write_stream._closed to decide if a cached session is still usable. These are in-memory anyio channels that stay open even when the remote server restarts, so the check always passes. The stale mcp-session-id is sent to the new server, which replies with HTTP 404, translated to "Session terminated" by the MCP SDK. The @retry_on_errors decorator retries, but hits the same stale cache.

Agent request → MCPSessionManager.create_session()
  → cache hit (in-memory streams appear open)
  → returns stale session with old mcp-session-id
  → POST /mcp with stale ID → HTTP 404 → "Session terminated"
  → @retry_on_errors retries → same cache hit → same failure
  → error propagates to user
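
The failure mode follows from the fact that the freshness check never touches the network. A minimal, self-contained sketch of why the check always passes (hypothetical `FakeStream`/`CachedSession` names for illustration, not the SDK's actual classes):

```python
# Hypothetical illustration of the stale-cache check described above.
# The upstream manager inspects only local stream state, so a remote
# server restart is invisible to it.

class FakeStream:
    """Stands in for an in-memory anyio channel; a remote restart
    never closes it, so _closed stays False."""
    _closed = False

class CachedSession:
    def __init__(self) -> None:
        self._read_stream = FakeStream()
        self._write_stream = FakeStream()
        self.session_id = "mcp-session-id-from-old-process"

    def looks_alive(self) -> bool:
        # Mirrors the upstream check: local stream flags only, no I/O.
        return not (self._read_stream._closed or self._write_stream._closed)

session = CachedSession()
# The remote MCP process has restarted, yet the check still passes,
# so the stale session_id is sent and the new server answers 404.
print(session.looks_alive())  # True
```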

No liveness probe exists to detect this state, so Kubernetes never restarts the pod.

Upstream issues: python-sdk#1811, python-sdk#2150

Fix

Layer 1: Ping-validated session cache

KAgentMCPSessionManager subclasses the upstream MCPSessionManager and overrides create_session() to call session.send_ping() after the parent returns a cached session. If the ping fails, the stale session is torn down and a fresh one is created transparently. Combined with @retry_on_errors, the next request after an MCP server restart recovers automatically without a pod restart.

MCP servers that don't implement the optional ping method reply with JSON-RPC -32601 (Method not found). This is treated as proof of liveness — the server is reachable and responding.

The close() call during stale session teardown is wrapped in try/except to ensure recovery proceeds even if cleanup raises (e.g. known anyio cancel scope errors).
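
A condensed, self-contained sketch of the Layer 1 logic (here `StubSessionManager` stands in for the upstream MCPSessionManager, and names and details are approximate rather than the PR's exact code):

```python
import asyncio

_PING_TIMEOUT_SECONDS = 5.0

class StubSessionManager:
    """Hypothetical stand-in for the upstream MCPSessionManager's
    session cache: create once, reuse until closed."""
    def __init__(self, session_factory):
        self._factory = session_factory
        self._cached = None

    async def create_session(self):
        if self._cached is None:
            self._cached = self._factory()
        return self._cached

    async def close(self):
        self._cached = None

def _is_server_alive_error(exc: Exception) -> bool:
    # JSON-RPC -32601 (Method not found) means the server answered but
    # does not implement the optional ping -- treat as proof of liveness.
    return "-32601" in str(exc) or "Method not found" in str(exc)

class PingValidatedSessionManager(StubSessionManager):
    async def create_session(self):
        session = await super().create_session()
        try:
            await asyncio.wait_for(session.send_ping(), _PING_TIMEOUT_SECONDS)
            return session
        except Exception as exc:
            if _is_server_alive_error(exc):
                return session
            # Stale session: prune it from the cache (best effort) and
            # transparently build a fresh one.
            try:
                await self.close()
            except Exception:
                pass  # proceed even if cleanup raises
            return await super().create_session()
```

Usage: if the cached session's ping fails with a connection-level error, the manager discards it and hands the caller a brand-new session, so the next request after an MCP restart succeeds without a pod restart.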

Layer 2: Liveness probe

A /healthz/mcp endpoint on the agent's FastAPI app creates a throwaway session manager for each configured MCP server, establishes a session, and validates via ping. Returns 200 when all servers respond, 503 otherwise. Checks run concurrently via asyncio.gather to avoid exceeding the probe timeout with multiple MCP servers.

The agent pod spec now includes a LivenessProbe hitting /healthz/mcp every 30s with a failure threshold of 3 (90s of MCP unreachable before pod restart).
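
The Layer 2 probe logic can be sketched as follows (a simplified, self-contained version; `check_mcp_servers` and its signature are illustrative, not the PR's exact handler):

```python
import asyncio

_MCP_HEALTH_TIMEOUT_SECONDS = 5.0

async def check_mcp_servers(checks):
    """checks: mapping of server URL -> async callable that pings it.
    Returns (http_status, per-server error strings) for a /healthz/mcp
    style handler."""
    async def _check_one(url, probe):
        try:
            await asyncio.wait_for(probe(), _MCP_HEALTH_TIMEOUT_SECONDS)
            return url, None
        except Exception as exc:
            return url, str(exc)

    # Run all probes concurrently so N servers cost one timeout, not N,
    # keeping the check inside the kubelet's probe timeout.
    results = await asyncio.gather(
        *(_check_one(url, probe) for url, probe in checks.items())
    )
    errors = {url: err for url, err in results if err is not None}
    return (503, errors) if errors else (200, {})
```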

I tested this against my setup: controller and agents in one cluster and MCP servers in a few others, over HTTP/SSE connections.
The images I tested with:
docker.io/dmarshaltu/kagent-app:gh-fix
docker.io/dmarshaltu/kagent-controller:gh-fix
docker.io/dmarshaltu/kagent-adk:gh-fix

Important

  • In-flight requests during MCP restart still fail. Layer 1 ensures the next request recovers.
  • Upstream MCP SDK bugs are mitigated but not eliminated.

Changes

File Change
python/.../kagent/adk/_mcp_toolset.py KAgentMCPSessionManager with ping-validated cache, _is_server_alive_error helper
python/.../kagent/adk/_a2a.py /healthz/mcp endpoint with concurrent server checks
go/.../translator/agent/adk_api_translator.go LivenessProbe on /healthz/mcp in agent pod spec
python/.../tests/unittests/test_mcp_session_recovery.py 6 tests: ping success, stale recovery, server down, timeout, method-not-found, close() failure
python/.../tests/unittests/test_mcp_health_endpoint.py 5 tests: no tools, null config, healthy, unhealthy 503, method-not-found

Copilot AI review requested due to automatic review settings March 3, 2026 12:14
Copy link
Contributor

Copilot AI left a comment


Pull request overview

This PR improves resilience when remote MCP servers restart by validating cached MCP sessions with a ping and by adding an agent health endpoint used for Kubernetes liveness checks to trigger pod restarts when MCP connectivity is degraded.

Changes:

  • Add KAgentMCPSessionManager to ping-validate cached sessions and transparently recreate stale sessions.
  • Add /healthz/mcp endpoint that concurrently checks all configured MCP servers.
  • Add Kubernetes LivenessProbe for agent pods that calls /healthz/mcp, plus unit tests covering session recovery and the health endpoint.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File Description
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py Introduces ping-validated MCP session manager and wires it into the toolset.
python/packages/kagent-adk/src/kagent/adk/_a2a.py Adds /healthz/mcp endpoint and concurrent MCP server checks for liveness.
go/internal/controller/translator/agent/adk_api_translator.go Adds agent pod LivenessProbe hitting /healthz/mcp.
python/packages/kagent-adk/tests/unittests/test_mcp_session_recovery.py Adds unit tests for stale-session recovery and ping behavior.
python/packages/kagent-adk/tests/unittests/test_mcp_health_endpoint.py Adds unit tests for /healthz/mcp success/failure behaviors.


Comment on lines +101 to +107
errors = {url: err for url, err in results if err is not None}

if errors:
    return JSONResponse(
        {"status": "error", "errors": errors},
        status_code=503,
    )
Copilot AI Mar 3, 2026

The /healthz/mcp endpoint returns raw exception strings keyed by full server URL in the response body. Since this endpoint is callable over HTTP, it can unintentionally expose internal network details and (depending on how MCP URLs are configured) potentially leak secrets embedded in URLs or error messages. Consider returning only a boolean status (and maybe a count), and logging the detailed per-server errors server-side instead (or redact/sanitize URLs and error strings before returning).

Author

OK

Comment on lines +527 to +535
LivenessProbe: &corev1.Probe{
    ProbeHandler: corev1.ProbeHandler{
        HTTPGet: &corev1.HTTPGetAction{Path: "/healthz/mcp", Port: intstr.FromString("http")},
    },
    InitialDelaySeconds: 30,
    PeriodSeconds:       30,
    TimeoutSeconds:      10,
    FailureThreshold:    3,
},
Copilot AI Mar 3, 2026

Configuring the agent's Kubernetes liveness probe to depend on external MCP server availability can cause restart loops when MCP is temporarily down or undergoing maintenance, even though restarting the pod may not help. Consider making this check a readiness probe (to stop routing traffic) and/or making the liveness probe more focused on detecting the specific "stale session" failure mode (e.g., only fail liveness on repeated session-terminated/404 patterns, or add a separate internal liveness that doesn't depend on external services).

Author

When the MCP server restarts, the agent can end up in a wedged state
(e.g. anyio cancel scope corruption, see #1276)

where it is still serving HTTP but cannot recover without a restart. Readiness would only stop routing traffic; it would not fix the wedged process.
Liveness ensures Kubernetes restarts the pod so it can recover.
If MCP is down for maintenance, the agent is effectively unusable anyway; restarting the pod keeps state clean and allows a fresh start when MCP comes back.

Contributor

my only concern with this approach is that it doesn't actually fix the issue if the underlying mcp server is down or unresponsive and causes unnecessary churn in the environment. I think the subclassed mcp session manager fix makes a lot of sense and the approach of focusing on ensuring that we prune stale mcp sessions and restart using new mcp sessions is the direction we should probably pursue.

Author

For now I can't enumerate all the edge cases in that scenario, but I know one fix that resolves it: an agent pod restart.
I agree with you that it's an "ugly" one, but for the moment I'm treating it as the necessary fallback.

I'll leave the final call to you.

Contributor

yeah I understand your point, and I also agree that an agent that is ready but can't give responses is not a useful agent, so I can see the argument for restarting the agent pod. However, I'm leaning towards relying on your MCPSessionManager subclass fix as the primary line of defense and emitting appropriate warnings so the error is clear in the logs. The reason is that if retry_on_error fails because the MCP server is not back up yet, with your change the request will eventually recover once the MCP server is online. Restarting the agent itself will not actually resolve the issue when the dependent MCP server is unhealthy, so in the worst case this liveness probe puts the agent in crash loop backoff, versus logs that indicate this is a dependent MCP server issue. I feel like the latter gives the user a clearer picture of the underlying error in this scenario. What do you think?

Author

In my personal view, I'd prefer the agent pod's status not be Running if it can't communicate with the MCP server, because that is visually reflected in the UI and the agent gets labeled as not ready. The same goes for a remote MCP: it needs to be accepted in the CRD to work. The problem is that when everything looks fine but the MCP gets a rollout, the static session ID prevents the agent from reconnecting, yet to the user it all seems legit: the agent is ready, the MCP is accepted, and the only indication of failure is when they try to chat. I don't see a reason for an agent to be ready if one of the MCPs it uses isn't alive. For this PR I will remove the liveness probe and we can bring this discussion back later on.

Comment on lines +75 to +82
async def _check_one(params: Any) -> tuple[str, Optional[str]]:
    """Return (url, error_string | None) for a single MCP server."""
    url = getattr(params, "url", "unknown")
    mgr = KAgentMCPSessionManager(connection_params=params)
    try:
        await asyncio.wait_for(
            mgr.create_session(), timeout=_MCP_HEALTH_TIMEOUT_SECONDS
        )
Copilot AI Mar 3, 2026

This liveness handler establishes and tears down new MCP sessions on every probe interval (one per configured server). That can be relatively expensive (extra handshakes, auth, server load) and can amplify load during outages when probes retry. Consider a cheaper reachability check (if available) or caching results briefly within the probe window to reduce repeated session creation.

Author

The cost is acceptable for the benefit of reliable liveness detection in my opinion.

Comment on lines +46 to +48
async def create_session(
    self, headers: Optional[Dict[str, str]] = None
) -> ClientSession:
Copilot AI Mar 3, 2026

Type hints in this module already use built-in generics (e.g., list[BaseTool], set[int]). For consistency, consider using dict[str, str] | None (or dict[str, str] with Optional) here instead of importing Dict from typing.

Author

OK

@DmarshalTU DmarshalTU force-pushed the fix/graceful_handling branch from 97a30cd to d5516bb Compare March 3, 2026 13:04
Comment on lines +49 to +54
try:
    await asyncio.wait_for(session.send_ping(), timeout=_PING_TIMEOUT_SECONDS)
    return session
except Exception as exc:
    if _is_server_alive_error(exc):
        return session
Contributor

do you think it makes sense to expand the scope here so once we validate the mcp server is ready, we revalidate the session to ensure that it doesn't get a 404? And if so we prune the session from the cache and recreate a new session?

Author

yes, that makes sense

DmarshalTU and others added 4 commits March 4, 2026 00:27
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Supports confirmation flow for HITL (single and parallel tool use). Flow
documented in `docs/architecture/human-in-the-loop.md`

Todo: custom payload so the agent can invoke tools that require user
input

---------

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
Signed-off-by: Denis Tu <dmarshaltu@gmail.com>
@DmarshalTU DmarshalTU force-pushed the fix/graceful_handling branch from 3e4ac86 to 4b09c09 Compare March 4, 2026 08:27