setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #149
```shell
existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
    debug "Removing stale lock from dead process $existing_pid for ${iface}"
    rm -f "${lockfile}"
```
Could there be a race condition when two PIDs clash and a lock file is removed by accident?
Yes, good catch. If the PID is reassigned to another setup-policy-routes process that acquires the lock after the `! kill -0 "$existing_pid"` check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I don't think an atomic operation is possible purely with shell code.

What if we also add a check on the lockfile age? We could use the same value we set for the sysfs wait timeout (originally 300s) as the stale threshold: only if the lockfile is older than that timeout do we consider it stale.
Something like:

```shell
local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
if [ "$lock_age" -gt 300 ]; then
    debug "Removing stale lock from dead process $existing_pid for ${iface}"
    rm -f "${lockfile}"
fi
```

Note: the threshold should stay in sync with `max_wait * 0.1` from the sysfs wait timeout.
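Pulling the pieces together, the combined check might look like the sketch below (`debug` is stubbed here for illustration; the function name `remove_if_stale` and the standalone layout are my own, not from the script). The lock is treated as stale only if the recorded PID is dead *and* the file is older than the 300s sysfs wait budget, which narrows the PID-reuse window discussed above:

```shell
debug() { echo "$*" >&2; }   # stand-in for the real logger in lib.sh

# Return 0 and remove the lock only when it is provably stale:
# the owning PID is dead AND the lockfile is older than 300s.
remove_if_stale() {
    lockfile="$1"
    iface="$2"
    existing_pid=$(cat "${lockfile}" 2>/dev/null)
    [ -n "$existing_pid" ] || return 1
    if ! kill -0 "$existing_pid" 2>/dev/null; then
        lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
            return 0
        fi
    fi
    return 1
}
```

A freshly written lockfile with a dead PID is left alone; only an old one is reaped.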
I like this approach, but I am also okay with not adding more complexity, since the PID space is quite large.
```shell
# nonzero exit codes from a redirect without considering them
# fatal errors
set +e
while [ $cnt -lt $max ]; do
```
Can we tune this max to a lower number, so that it doesn't get stuck looping for thousands of seconds if we get into this block?
Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three values should be kept in sync:

- `max_wait=3000`, i.e. 300s due to `sleep 0.1` (setup-policy-routes.sh)
- `max=3000`, i.e. 300s due to `sleep 0.1` (lib.sh)
- `"$lock_age" -gt 300` (lib.sh)
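To make the arithmetic explicit (assuming the 0.1s sleep interval both loops use), the iteration count maps to a wall-clock budget like this:

```shell
# 3000 iterations x 0.1s sleep per iteration = 300s wall clock,
# so all three knobs describe the same 300-second budget.
max=3000
budget_s=$(( max / 10 ))
echo "${budget_s}s"   # -> 300s
```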
Yeah, I agree. IMO `lock_age` is not needed.
Sounds good. I pushed the update to the max value in `register_networkd_reloader()`, and the test results I just posted include this change.
I ran this new script on a host yesterday.
**Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detection**

Host:

### Setup

```shell
# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes

# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh

FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"
```

### Test 1: Sysfs wait timeout

Purpose:

```shell
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"
```

Output:

Journal:

### Test 2: Stale lock detection

Purpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s.

```shell
mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"
```

Output:

Journal:

The process got past

### Test 3: Full race (start + concurrent refresh)

Purpose:

```shell
/usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"
```

Output:

Journal:

### Restore

```shell
sed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE"
```
I re-read your issue (#148). I am wondering if there is a way to call
I am still seeing orphaned processes even after `exit 1` is executed. setup-policy-routes is coming back up due to `Restart=on-failure`.
The reason I said "udev remove event does not fire" is because I observed the accumulation of

With the current PR, the
I guess we can just add the same

```shell
if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi
```
I think we need to disable only the `.timer` unit, since `policy-routes@iface.service` should die automatically?
The above test is good but not sufficient, since in production everything runs under systemd units.
I tried your test, and both with and without the `systemctl disable` I see:

```shell
ps aux | grep setup-policy-routes | grep -v grep | wc -l
# -> 0
```

Is it because the `systemctl stop` preempts the race condition? What if we test without the `systemctl stop`?

This tests whether the `systemctl disable` in the timeout block prevents leaked procs from `Restart=on-failure` when there is no clean stop. I ran it, and it had basically no effect on the number leaked:

Without systemctl disable: 53 leaked procs
With systemctl disable (timer only): 50 leaked procs
We can try. I am thinking that if we disable the service after the max count is reached, it will remove the service from the tracked units, which would produce the same result as `exit 2` plus `RestartPreventExitStatus=2`.
I pushed the `exit 2` and `RestartPreventExitStatus=2` change. I kept the disable of the refresh `.timer` unit, since the assumption is that the ENI doesn't exist if we reach the timeout, so it would be nonsensical to refresh its configuration.

I reran the same test I sent before (50+ leaked procs):

```shell
for i in $(seq 1 50); do
    systemctl start policy-routes@dummy${i}.service &
    sleep 0.1
done
sleep 15
ps aux | grep setup-policy-routes | grep -v grep | wc -l
```

With systemctl disable .timer: 50 leaked procs

Note: The ~1 min delay is probably due to queuing of the many systemd requests. This would not occur in production, where ENI timeouts happen one at a time.
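For reference, the restart-suppression piece could be expressed as a unit drop-in like the one below. The path and file are hypothetical (the actual unit files shipped by the package may differ); it only illustrates how `RestartPreventExitStatus=2` interacts with `Restart=on-failure`:

```ini
# /etc/systemd/system/policy-routes@.service.d/override.conf  (illustrative)
[Service]
Restart=on-failure
# exit status 2 means "timed out, ENI is gone"; never restart in that case
RestartPreventExitStatus=2
```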
Looks good. |
Issue #, if available:
#148
Description of changes:
Fixes infinite process accumulation on ECS hosts caused by `setup-policy-routes start` looping forever when an ENI is detached before its sysfs node appears. This can occur repeatedly during rapid ENI attach/detach cycles, i.e. ECS task churn.
Two changes:

- A timeout on the sysfs wait loop, exiting with status 2 (paired with `RestartPreventExitStatus=2` so systemd does not restart the unit) and disabling the refresh `.timer`
- Stale lock detection, so a lockfile left behind by a dead process no longer blocks new invocations
See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.