
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #149

Merged
joeysk2012 merged 4 commits into amazonlinux:main from ddermendzhiev:fix/setup-policy-routes-sysfs-timeout
Apr 3, 2026

Conversation

@ddermendzhiev

Issue #, if available:

#148

Description of changes:

Fixes infinite process accumulation on ECS hosts caused by setup-policy-routes start looping forever when an ENI is detached before its sysfs node appears (can repeatedly occur during rapid ENI attach/detach cycles, i.e. ECS task churn).

Two changes:

  • bin/setup-policy-routes.sh: add a 5-minute timeout to the sysfs wait loop in the start action so stuck processes eventually exit instead of holding the per-ENI lockfile indefinitely
  • lib/lib.sh: add a stale lock check in register_networkd_reloader(). If the lock owner PID is no longer alive, remove the lockfile before spinning

See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.
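A minimal sketch of the bounded wait described above (illustrative only — the function name and the tiny demo budget are assumptions; the actual patch uses max_wait=3000 iterations of `sleep 0.1`, i.e. 300s):

```shell
#!/bin/sh
# Sketch of a bounded sysfs wait. In the real patch the budget is
# max_wait=3000 iterations of `sleep 0.1` (~300s); the demo call below
# uses a tiny budget so the timeout path is reachable quickly.
wait_for_sysfs() {
    iface="$1"
    max_wait="$2"
    counter=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if [ "$counter" -ge "$max_wait" ]; then
            echo "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
            return 1
        fi
        counter=$((counter + 1))
        sleep 0.1
    done
    echo "sysfs node for ${iface} appeared"
}

# Demo: fake interface name from this PR's tests, 5-iteration budget (~0.5s)
wait_for_sysfs ecse00TEST1 5 || echo "exit code: 1"
```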

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@ericsu66888 left a comment


Thanks for the PR

existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Contributor


Could there be a race condition where two PIDs clash and a lockfile is removed by accident?

Author


Yes, good catch. If the PID is reassigned to another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I don't think an atomic operation is possible purely with shell code.

What if we also add a check on the lockfile age? We could reuse the value we set for the sysfs wait timeout (300s) as the stale threshold: only if the lockfile is older than that timeout do we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
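Combining the dead-PID check with the age check might look like this sketch (hypothetical: the temp-file demo stands in for the real lockfile path, and 99999999 is just a PID that almost certainly is not alive):

```shell
#!/bin/sh
# Sketch: treat a lock as stale only when BOTH the recorded PID is dead
# AND the file is older than the sysfs wait timeout. This narrows the
# PID-reuse window: a freshly written lockfile is never removed, even if
# the PID check races. Demo values below are illustrative.
lockfile=$(mktemp)
echo 99999999 > "$lockfile"            # a PID that almost certainly is not alive
touch -d '10 minutes ago' "$lockfile"  # make the lock look old (GNU touch)

stale_after=300  # seconds; keep in sync with max_wait * 0.1

existing_pid=$(cat "$lockfile" 2>/dev/null)
lock_age=$(( $(date +%s) - $(stat -c %Y "$lockfile" 2>/dev/null || echo 0) ))
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null \
        && [ "$lock_age" -gt "$stale_after" ]; then
    rm -f "$lockfile"
    echo "removed stale lock (dead pid ${existing_pid})"
fi
```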

Contributor


I like this approach, but I am also okay with not adding more complexity since the PID space is quite large.

# nonzero exit codes from a redirect without considering them
# fatal errors
set +e
while [ $cnt -lt $max ]; do
Contributor


Can we tune this max down so it doesn't spin for thousands of iterations if we get into this block?

Author


Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three values should be kept in sync:

  • max_wait=3000 i.e. 300s due to sleep 0.1 (setup-policy-routes.sh)
  • max=3000 i.e. 300s due to sleep 0.1 (lib.sh)
  • "$lock_age" -gt 300 (lib.sh)
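The invariant tying these together (each loop iteration sleeps 0.1s, so iterations / 10 = seconds) can be sanity-checked with a trivial sketch; the variable names mirror the list above:

```shell
#!/bin/sh
# Each wait loop sleeps 0.1s per iteration, so iterations / 10 = seconds.
max_wait=3000    # setup-policy-routes.sh sysfs wait iterations
max=3000         # lib.sh lock wait iterations
stale_after=300  # lib.sh lock-age threshold, in seconds

echo "sysfs wait: $((max_wait / 10))s"
echo "lock wait: $((max / 10))s"
echo "stale threshold: ${stale_after}s"
```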

Contributor


yeah, I agree. IMO lock_age is not needed.

Author


Sounds good. I pushed the update to max value in register_networkd_reloader(), and the test results I just posted included this change.

@joeysk2012
Contributor

I ran this new script on a host yesterday.
For the most part I feel good about it.
I am running some more tests to see if there are any other issues.
I would like to get this merged and deployed into AL23 soon.
Please post any test results or logs if you have them.

@ddermendzhiev
Author

Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detection

Host: host-instance
Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1
Patched files:

  • /usr/bin/setup-policy-routes
  • /usr/share/amazon-ec2-net-utils/lib.sh

Setup

# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes

# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh

FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"

Test 1: Sysfs wait timeout

Purpose: start exits after max_wait instead of looping forever when the sysfs node never appears.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

exit code: 1

Journal:

Apr 02 18:17:14 host-instance ec2net[111890]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:17:15 host-instance ec2net[111890]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

Test 2: Stale lock detection

Purpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s.

mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

99999
exit code: 1

Journal:

Apr 02 18:19:37 host-instance ec2net[112210]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:19:38 host-instance ec2net[112210]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

The process got past register_networkd_reloader and entered the sysfs wait loop — proving the stale lock was removed. It then timed out and exited cleanly.


Test 3: Full race (start + concurrent refresh)

Purpose: start acquires the lock and enters the sysfs wait loop. refresh arrives concurrently. With the fix, start times out and exits, refresh acquires the lock, finds the ENI missing from sysfs, and exits — both within ~1 second instead of spinning for 300s.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"

Output:

[1] 126139
[2] 126149
[1]-  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" start
[2]+  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" refresh
both done

Journal:

Apr 02 18:26:29 host-instance ec2net[126139]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:26:30 host-instance ec2net[126139]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

start timed out and exited. refresh acquired the lock, hit [ -e "/sys/class/net/${iface}" ] || exit 0, and exited immediately — no journal output expected for that path.


Restore

sed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE"

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I re-read your issue: #148
Please help me understand the scenario better.
It says that the udev remove event fails to trigger.
Which means the refresh-policy-routes@$name.timer unit will be leaked and will continue to run every 60s.
So we can still end up with potentially hundreds of non-working timers.
This PR will fix the infinite loop and the lockfile spinning, but it does not seem to address this issue?
This means we will still consume more CPU than needed, though not as much as without the PR.

Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy52.timer  refresh-policy-routes@dummy52.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:42 UTC 40s ago refresh-policy-routes@dummy63.timer  refresh-policy-routes@dummy63.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy90.timer  refresh-policy-routes@dummy90.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy20.timer  refresh-policy-routes@dummy20.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy24.timer  refresh-policy-routes@dummy24.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy37.timer  refresh-policy-routes@dummy37.service
...

I am wondering if there is a way to call /usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service if we get into this state outside of udev rules.

@joeysk2012
Contributor

I am still seeing orphaned processes even after exit 1 is executed. setup-policy-routes is coming back up due to
Restart=on-failure

@ddermendzhiev
Author

The reason I said "udev remove event does not fire" is because I observed the accumulation of refresh-policy-routes@$name.timer and policy-routes@$name.service units. As ECS task churn continued to attach and detach new ENIs, the leaked units accumulated, each with a stuck setup-policy-routes %i start proc and a setup-policy-routes %i refresh proc spinning to acquire the lock, exiting, then being respawned by the timer.

With the current PR, the start proc would time out, but you are correct that refresh would continue to be respawned because the systemd timer unit is still active. Good point on Restart=on-failure on the start unit as well.

@ddermendzhiev
Author

I guess we can just add the same remove rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I guess we can just add the same remove rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi

I think we only need to disable the .timer unit, as policy-routes@iface.service should die automatically?
I don't want to risk disabling the service unnecessarily. We also need the same code for the lock timeout issue.

@joeysk2012
Contributor

The above test is good but not sufficient, as everything runs in systemd units.
Here is my test script; even after applying your changes I am still getting leaked processes, although significantly fewer.

for i in $(seq 1 500); do
  sudo systemctl start policy-routes@dummy${i}.service &
  sleep 0.2
  sudo systemctl stop policy-routes@dummy${i}.service &
  sleep 0.1
done

@ddermendzhiev
Author

ddermendzhiev commented Apr 2, 2026

I tried your test, and both with and without the systemctl disable timer commands, the test showed 0 leaked procs.

ps aux | grep setup-policy-routes | grep -v grep | wc -l
0

Is it because the systemctl stop preempts the race condition? What if we test without the systemctl stop, to simulate the case where the udev remove event never fires?

for i in $(seq 1 50); do
  systemctl start policy-routes@dummy${i}.service &
  sleep 0.1
done
sleep 10
ps aux | grep setup-policy-routes | grep -v grep | wc -l

This tests whether the systemctl disable in the timeout block prevents leaked procs from Restart=on-failure when there is no clean stop. I ran it and it had basically no effect on the number of leaked start procs. It is because of the Restart=on-failure on policy-routes@.service. The only way to stop the respawn is to also disable the service. Are you open to that, or do you have another idea?

Without systemctl disable: 53 leaked procs
With systemctl disable (timer only): 50 leaked procs

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

This tests whether the systemctl disable in the timeout block prevents leaked procs from Restart=on-failure when there is no clean stop. I ran it and it had basically no effect on the number of leaked start procs. It is because of the Restart=on-failure on policy-routes@.service. The only way to stop the respawn is to also disable the service. Are you open to that, or do you have another idea?

Without systemctl disable: 53 leaked procs
With systemctl disable (timer only): 50 leaked procs

We can try. I am thinking that if we disable the service after the max count is reached, it will remove the service from the tracked units, which would produce the same result as exit 2 combined with RestartPreventExitStatus=2.
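The RestartPreventExitStatus idea could be expressed as a unit drop-in along these lines (a sketch: the drop-in file name and temp-dir demo are assumptions; the two directives are the mechanism discussed above):

```shell
#!/bin/sh
# Sketch of a systemd drop-in for policy-routes@.service: keep
# Restart=on-failure for transient errors, but let exit status 2
# (sysfs wait timeout) mark the failure as permanent so the unit is
# not respawned. Written to a temp dir here for illustration; a real
# drop-in would live under /etc/systemd/system/policy-routes@.service.d/.
dropin_dir=$(mktemp -d)
cat > "${dropin_dir}/10-no-respawn-on-timeout.conf" <<'EOF'
[Service]
Restart=on-failure
RestartPreventExitStatus=2
EOF
cat "${dropin_dir}/10-no-respawn-on-timeout.conf"
```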

@ddermendzhiev force-pushed the fix/setup-policy-routes-sysfs-timeout branch from 14702d2 to 65fbb03 (April 3, 2026 15:02)
@ddermendzhiev
Author

I pushed the exit 2 + RestartPreventExitStatus=2 change. I kept the disable of the refresh .timer unit, since the assumption is that the ENI doesn't exist if we reach the timeout, so it would be nonsensical to refresh its configuration.

I reran the same test I sent before (50+ leaked start procs without this fix):

for i in $(seq 1 50); do
  systemctl start policy-routes@dummy${i}.service &
  sleep 0.1
done
sleep 15
ps aux | grep setup-policy-routes | grep -v grep | wc -l

With systemctl disable .timer: 50 leaked procs
With systemctl disable .timer + exit 2 logic: 0 leaked procs (had to wait ~1 min for cleanup)

# journalctl -u policy-routes@dummy29.service --no-pager | tail -30

Apr 03 14:50:05 host-instance ec2net[3763924]: Waiting for sysfs node to exist for dummy29 (iteration 0)
Apr 03 14:50:07 host-instance ec2net[3763924]: Timed out waiting for sysfs node for dummy29 after 1 seconds
Apr 03 14:51:03 host-instance ec2net[3763924]: Deferring networkd reload to another process
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Failed with result 'exit-code'.
Apr 03 14:51:03 host-instance systemd[1]: Failed to start policy-routes@dummy29.service - Set up policy routes for dummy29.

Note: The ~1 min delay is probably due to queuing of the many systemd requests. This would not occur in production, where ENI timeouts happen one at a time.

@joeysk2012
Contributor

# journalctl -u policy-routes@dummy29.service --no-pager | tail -30

Apr 03 14:50:05 host-instance ec2net[3763924]: Waiting for sysfs node to exist for dummy29 (iteration 0)
Apr 03 14:50:07 host-instance ec2net[3763924]: Timed out waiting for sysfs node for dummy29 after 1 seconds
Apr 03 14:51:03 host-instance ec2net[3763924]: Deferring networkd reload to another process
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Failed with result 'exit-code'.
Apr 03 14:51:03 host-instance systemd[1]: Failed to start policy-routes@dummy29.service - Set up policy routes for dummy29.

Looks good.

@joeysk2012 merged commit dc2b238 into amazonlinux:main on Apr 3, 2026
4 checks passed