Port forwarding session dies after hours due to insufficient reconnection retry budget on WebSocket close 1000 #135
Description
Environment:
SSM Agent: 3.3.x (on-prem hybrid instance, Amazon Linux 2023)
EC2 Security Group: Allow all outbound traffic
session-manager-plugin: latest (on EC2 t4g.micro, AL2023)
Region: us-west-2
Document: AWS-StartPortForwardingSession
Target: Hybrid managed instance (mi-*)
Idle timeout: 20m + ResumeSession
Max Session Timeout: Not set
Description:
We're running long-lived SSM port forwarding sessions (12-24h) from an EC2 instance to an on-premise hybrid managed instance.
The tunnel works correctly for hours (we use ResumeSession to handle the idle timeout), but eventually dies silently -- all traffic starts timing out with no response.
Root cause from logs:
The SSM service closes WebSocket connections approximately every 60 minutes, sending websocket: close 1000 (normal): Bye. Both the on-prem agent and the EC2-side session-manager-plugin attempt to reconnect. The on-prem agent consistently reconnects successfully. The session-manager-plugin usually reconnects too, but occasionally fails.
When the plugin fails to reconnect, the on-prem agent receives "Session is already terminated" when trying to recreate the data channel -- meaning the EC2 side has already given up and killed the session.
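To make "retry budget" concrete, here is a minimal Python sketch of a reconnect loop with a configurable number of attempts and capped exponential backoff. It is not the session-manager-plugin's actual code (the plugin is written in Go, and its real retry logic is not shown in this issue); `reconnect_with_budget`, `connect`, and all parameter names and defaults are hypothetical and only illustrate how a small budget turns a transient failure into a terminated session.

```python
import random
import time


def reconnect_with_budget(connect, max_attempts=5, base_delay=1.0,
                          max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds or the retry budget is exhausted.

    Returns the number of attempts used. Raises the last ConnectionError
    if every attempt fails. All names and defaults are illustrative.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise last_error
```

With a `connect` callable that fails six times before succeeding, a budget of 5 attempts gives up, while a budget of 10 recovers on the seventh attempt.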
On-prem agent log -- successful reconnections every ~60 min (same session, same pattern):
2026-03-28 13:27:39 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 14:27:41 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 15:27:43 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:27:44 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
All of the above resulted in successful reconnections -- tunnel continued working.
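The ~60 minute cadence can be checked directly from the timestamps in the four log lines above. The following is a small sketch, assuming the agent log keeps the `YYYY-MM-DD HH:MM:SS` prefix shown in the excerpts; `close_intervals` is a hypothetical helper name.

```python
from datetime import datetime

LOG_LINES = [
    "2026-03-28 13:27:39 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye",
    "2026-03-28 14:27:41 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye",
    "2026-03-28 15:27:43 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye",
    "2026-03-28 16:27:44 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye",
]


def close_intervals(lines):
    """Return the seconds between consecutive 'close 1000' log entries."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if "close 1000" in line
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


print(close_intervals(LOG_LINES))  # [3602.0, 3602.0, 3601.0]
```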
On-prem agent log -- the fatal disconnect (same pattern, but EC2 side failed):
2026-03-28 16:57:51 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:57:51 INFO [pluginName=Port] The session was cancelled
2026-03-28 16:57:51 ERROR [pluginName=Port] Unable to read from connection: use of closed network connection
2026-03-28 16:57:51 ERROR [pluginName=Port] Unable to accept stream: io: read/write on closed pipe
2026-03-28 16:57:52 INFO [pluginName=Port] Setting task to cancelled as session is already terminated
2026-03-28 16:57:52 ERROR [pluginName=Port] CreateDataChannel failed: Session is already terminated
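Because the tunnel dies silently, one way to notice the failure is to scan the on-prem agent log for the messages that appear only in the fatal case. The sketch below uses the two phrases from the excerpt above as markers; whether these strings are stable across SSM Agent versions is an assumption, and `FATAL_MARKERS` and `find_fatal_lines` are hypothetical names.

```python
# Marker phrases copied from the fatal log excerpt; matched case-insensitively
# because the agent logs both "Session is ..." and "session is ...".
FATAL_MARKERS = (
    "session is already terminated",
    "the session was cancelled",
)


def find_fatal_lines(lines):
    """Return the log lines that contain any fatal marker."""
    return [
        line for line in lines
        if any(marker in line.lower() for marker in FATAL_MARKERS)
    ]
```

A routine "close 1000 (normal): Bye" warning on its own does not match, so a supervisor script could restart the port forwarding session only when `find_fatal_lines` returns a non-empty list.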