
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #149

Merged
joeysk2012 merged 4 commits into amazonlinux:main from ddermendzhiev:fix/setup-policy-routes-sysfs-timeout
Apr 3, 2026

Conversation

@ddermendzhiev

Issue #, if available:

#148

Description of changes:

Fixes infinite process accumulation on ECS hosts caused by setup-policy-routes start looping forever when an ENI is detached before its sysfs node appears (can repeatedly occur during rapid ENI attach/detach cycles, i.e. ECS task churn).

Two changes:

  • bin/setup-policy-routes.sh: add a 5-minute timeout to the sysfs wait loop in the start action so stuck processes eventually exit instead of holding the per-ENI lockfile indefinitely
  • lib/lib.sh: add a stale lock check in register_networkd_reloader(). If the lock owner PID is no longer alive, remove the lockfile before spinning

See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.
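A minimal sketch of the bounded wait described above (illustrative only — the function name and the tiny demo budget are assumptions; the actual patch uses max_wait=3000 iterations of `sleep 0.1`, i.e. 300s):

```shell
#!/bin/sh
# Sketch of a bounded sysfs wait. In the real patch the budget is
# max_wait=3000 iterations of `sleep 0.1` (~300s); the demo call below
# uses a tiny budget so the timeout path is reachable quickly.
wait_for_sysfs() {
    iface="$1"
    max_wait="$2"
    counter=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if [ "$counter" -ge "$max_wait" ]; then
            echo "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
            return 1
        fi
        counter=$((counter + 1))
        sleep 0.1
    done
    echo "sysfs node for ${iface} appeared"
}

# Demo: fake interface name from this PR's tests, 5-iteration budget (~0.5s)
wait_for_sysfs ecse00TEST1 5 || echo "exit code: 1"
```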

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@ericsu66888 left a comment


Thanks for the PR

existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Contributor


Could there be a race condition where two PIDs clash and a lockfile is removed by accident?

Author


Yes, good catch. If the PID is reassigned to another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I don't think an atomic operation is possible purely with shell code.

What if we also add a check on the lockfile age? We could reuse the value we set for the sysfs wait timeout (300s) as the stale threshold: only if the lockfile is older than that timeout do we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
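Combining the dead-PID check with the age check might look like this sketch (hypothetical: the temp-file demo stands in for the real lockfile path, and 99999999 is just a PID that almost certainly is not alive):

```shell
#!/bin/sh
# Sketch: treat a lock as stale only when BOTH the recorded PID is dead
# AND the file is older than the sysfs wait timeout. This narrows the
# PID-reuse window: a freshly written lockfile is never removed, even if
# the PID check races. Demo values below are illustrative.
lockfile=$(mktemp)
echo 99999999 > "$lockfile"            # a PID that almost certainly is not alive
touch -d '10 minutes ago' "$lockfile"  # make the lock look old (GNU touch)

stale_after=300  # seconds; keep in sync with max_wait * 0.1

existing_pid=$(cat "$lockfile" 2>/dev/null)
lock_age=$(( $(date +%s) - $(stat -c %Y "$lockfile" 2>/dev/null || echo 0) ))
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null \
        && [ "$lock_age" -gt "$stale_after" ]; then
    rm -f "$lockfile"
    echo "removed stale lock (dead pid ${existing_pid})"
fi
```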

Contributor


I like this approach, but I am also okay with not adding more complexity since the PID space is quite large.

# nonzero exit codes from a redirect without considering them
# fatal errors
set +e
while [ $cnt -lt $max ]; do
Contributor


Can we tune this max down so it doesn't spin for thousands of iterations if we get into this block?

Author


Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three values should be kept in sync:

  • max_wait=3000 i.e. 300s due to sleep 0.1 (setup-policy-routes.sh)
  • max=3000 i.e. 300s due to sleep 0.1 (lib.sh)
  • "$lock_age" -gt 300 (lib.sh)
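The invariant tying these together (each loop iteration sleeps 0.1s, so iterations / 10 = seconds) can be sanity-checked with a trivial sketch; the variable names mirror the list above:

```shell
#!/bin/sh
# Each wait loop sleeps 0.1s per iteration, so iterations / 10 = seconds.
max_wait=3000    # setup-policy-routes.sh sysfs wait iterations
max=3000         # lib.sh lock wait iterations
stale_after=300  # lib.sh lock-age threshold, in seconds

echo "sysfs wait: $((max_wait / 10))s"
echo "lock wait: $((max / 10))s"
echo "stale threshold: ${stale_after}s"
```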

Contributor


yeah, I agree. IMO lock_age is not needed.

Author


Sounds good. I pushed the update to max value in register_networkd_reloader(), and the test results I just posted included this change.

@joeysk2012
Contributor

I ran this new script on a host yesterday.
For the most part I feel good about it.
I am running some more tests to see if there are any other issues.
I would like to get this merged and deployed into AL23 soon.
Please post any test results or logs if you have them.

@ddermendzhiev
Author

Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detection

Host: host-instance
Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1
Patched files:

  • /usr/bin/setup-policy-routes
  • /usr/share/amazon-ec2-net-utils/lib.sh

Setup

# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes

# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh

FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"

Test 1: Sysfs wait timeout

Purpose: start exits after max_wait instead of looping forever when the sysfs node never appears.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

exit code: 1

Journal:

Apr 02 18:17:14 host-instance ec2net[111890]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:17:15 host-instance ec2net[111890]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

Test 2: Stale lock detection

Purpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s.

mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

99999
exit code: 1

Journal:

Apr 02 18:19:37 host-instance ec2net[112210]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:19:38 host-instance ec2net[112210]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

The process got past register_networkd_reloader and entered the sysfs wait loop — proving the stale lock was removed. It then timed out and exited cleanly.


Test 3: Full race (start + concurrent refresh)

Purpose: start acquires the lock and enters the sysfs wait loop. refresh arrives concurrently. With the fix, start times out and exits, refresh acquires the lock, finds the ENI missing from sysfs, and exits — both within ~1 second instead of spinning for 300s.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"

Output:

[1] 126139
[2] 126149
[1]-  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" start
[2]+  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" refresh
both done

Journal:

Apr 02 18:26:29 host-instance ec2net[126139]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:26:30 host-instance ec2net[126139]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

start timed out and exited. refresh acquired the lock, hit [ -e "/sys/class/net/${iface}" ] || exit 0, and exited immediately — no journal output expected for that path.


Restore

sed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE"

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I re-read your issue: #148
Please help me understand the scenario better.
It says that the udev remove event fails to trigger.
Which means the refresh-policy-routes@$name.timer unit will be leaked and will continue to run every 60s.
So we can still end up with potentially hundreds of non-working timers.
This PR will fix the infinite loop and the lockfile spinning, but it does not seem to address this issue?
This means we will still consume more CPU than needed, though not as much as without the PR.

Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy52.timer  refresh-policy-routes@dummy52.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:42 UTC 40s ago refresh-policy-routes@dummy63.timer  refresh-policy-routes@dummy63.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy90.timer  refresh-policy-routes@dummy90.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy20.timer  refresh-policy-routes@dummy20.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy24.timer  refresh-policy-routes@dummy24.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy37.timer  refresh-policy-routes@dummy37.service
...

I am wondering if there is a way to call /usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service if we get into this state outside of udev rules.

@joeysk2012
Contributor

I am still seeing orphaned processes even after exit 1 is executed. setup-policy-routes is coming back up due to
Restart=on-failure

@ddermendzhiev
Author

The reason I said "udev remove event does not fire" is because I observed the accumulation of refresh-policy-routes@$name.timer and policy-routes@$name.service units. As ECS task churn continued to attach and detach new ENIs, the leaked units accumulated, each with a stuck setup-policy-routes %i start proc and a setup-policy-routes %i refresh proc spinning to acquire the lock, exiting, then being respawned by the timer.

With the current PR, the start proc would time out, but you are correct that refresh would continue to be respawned because the systemd timer unit is still active. Good point on Restart=on-failure on the start unit as well.

@ddermendzhiev
Author

I guess we can just add the same remove rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I guess we can just add the same remove rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi

I think we only need to disable the .timer unit, as policy-routes@iface.service should die automatically?
I don't want to risk disabling the service unnecessarily. We also need the same code for the lock timeout issue.

@joeysk2012
Contributor

The above test is good but not sufficient, as everything runs in systemd units.
Here is my test script; even after applying your changes I am still getting leaked processes, although significantly fewer.

for i in $(seq 1 500); do
  sudo systemctl start policy-routes@dummy${i}.service &
  sleep 0.2
  sudo systemctl stop policy-routes@dummy${i}.service &
  sleep 0.1
done

@ddermendzhiev
Author

ddermendzhiev commented Apr 2, 2026

I tried your test, and both with and without the systemctl disable timer commands, the test showed 0 leaked procs.

ps aux | grep setup-policy-routes | grep -v grep | wc -l
0

Is it because the systemctl stop preempts the race condition? What if we test without the systemctl stop, to simulate the case where the udev remove event never fires?

for i in $(seq 1 50); do
  systemctl start policy-routes@dummy${i}.service &
  sleep 0.1
done
sleep 10
ps aux | grep setup-policy-routes | grep -v grep | wc -l

This tests whether the systemctl disable in the timeout block prevents leaked procs from Restart=on-failure when there is no clean stop. I ran it and it had basically no effect on the number of leaked start procs. It is because of the Restart=on-failure on policy-routes@.service. The only way to stop the respawn is to also disable the service. Are you open to that, or do you have another idea?

Without systemctl disable: 53 leaked procs
With systemctl disable (timer only): 50 leaked procs

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

This tests whether the systemctl disable in the timeout block prevents leaked procs from Restart=on-failure when there is no clean stop. I ran it and it had basically no effect on the number of leaked start procs. It is because of the Restart=on-failure on policy-routes@.service. The only way to stop the respawn is to also disable the service. Are you open to that, or do you have another idea?

Without systemctl disable: 53 leaked procs
With systemctl disable (timer only): 50 leaked procs

We can try. I am thinking that if we disable the service after the max count is reached, it will remove the service from the tracked units, which would produce the same result as exit 2 combined with RestartPreventExitStatus=2.
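The RestartPreventExitStatus idea could be expressed as a unit drop-in along these lines (a sketch: the drop-in file name and temp-dir demo are assumptions; the two directives are the mechanism discussed above):

```shell
#!/bin/sh
# Sketch of a systemd drop-in for policy-routes@.service: keep
# Restart=on-failure for transient errors, but let exit status 2
# (sysfs wait timeout) mark the failure as permanent so the unit is
# not respawned. Written to a temp dir here for illustration; a real
# drop-in would live under /etc/systemd/system/policy-routes@.service.d/.
dropin_dir=$(mktemp -d)
cat > "${dropin_dir}/10-no-respawn-on-timeout.conf" <<'EOF'
[Service]
Restart=on-failure
RestartPreventExitStatus=2
EOF
cat "${dropin_dir}/10-no-respawn-on-timeout.conf"
```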

@ddermendzhiev force-pushed the fix/setup-policy-routes-sysfs-timeout branch from 14702d2 to 65fbb03 (April 3, 2026 15:02)
@ddermendzhiev
Author

I pushed the exit 2 + RestartPreventExitStatus=2 change. I kept the disable of the refresh .timer unit, since the assumption is that the ENI doesn't exist if we reach the timeout, so it would be nonsensical to refresh its configuration.

I reran the same test I sent before (50+ leaked start procs without this fix):

for i in $(seq 1 50); do
  systemctl start policy-routes@dummy${i}.service &
  sleep 0.1
done
sleep 15
ps aux | grep setup-policy-routes | grep -v grep | wc -l

With systemctl disable .timer: 50 leaked procs
With systemctl disable .timer + exit 2 logic: 0 leaked procs (had to wait ~1 min for cleanup)

# journalctl -u policy-routes@dummy29.service --no-pager | tail -30

Apr 03 14:50:05 host-instance ec2net[3763924]: Waiting for sysfs node to exist for dummy29 (iteration 0)
Apr 03 14:50:07 host-instance ec2net[3763924]: Timed out waiting for sysfs node for dummy29 after 1 seconds
Apr 03 14:51:03 host-instance ec2net[3763924]: Deferring networkd reload to another process
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Failed with result 'exit-code'.
Apr 03 14:51:03 host-instance systemd[1]: Failed to start policy-routes@dummy29.service - Set up policy routes for dummy29.

Note: The ~1 min delay is probably due to queuing of the many systemd requests. This would not occur in production, where ENI timeouts happen one at a time.

@joeysk2012
Contributor

# journalctl -u policy-routes@dummy29.service --no-pager | tail -30

Apr 03 14:50:05 host-instance ec2net[3763924]: Waiting for sysfs node to exist for dummy29 (iteration 0)
Apr 03 14:50:07 host-instance ec2net[3763924]: Timed out waiting for sysfs node for dummy29 after 1 seconds
Apr 03 14:51:03 host-instance ec2net[3763924]: Deferring networkd reload to another process
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 14:51:03 host-instance systemd[1]: policy-routes@dummy29.service: Failed with result 'exit-code'.
Apr 03 14:51:03 host-instance systemd[1]: Failed to start policy-routes@dummy29.service - Set up policy routes for dummy29.

Looks good.

@joeysk2012 merged commit dc2b238 into amazonlinux:main on Apr 3, 2026
4 checks passed