
fix: activate NextReconcileTime guard to prevent reconciliation storms#1657

Merged

gianlucam76 merged 9 commits into projectsveltos:main from muthusk:fix/reconciliation-storm on Mar 19, 2026

Conversation

@muthusk (Contributor) commented Mar 19, 2026

Problem

We've been running into reconciliation storms in production when managing multiple child clusters with 10+ Helm services each. The addon-controller gets stuck in a tight reconciliation loop (~100ms cycles), consuming 2-3 CPU cores and stalling observedGeneration advancement for 30+ minutes.

After digging into it, the core issue is that skipReconciliation() checks Status.NextReconcileTime to throttle reconciliation, but this field is almost never actually set during normal deploy flow — it's only written in one obscure path (HandOverError during undeploy). So the guard exists but is effectively dead code.

This means that whenever scope.Close() patches the status (e.g., changing the FeatureSummary hash from nil to a value), the resulting watch event immediately re-enqueues the ClusterSummary, completely bypassing the RequeueAfter delay that proceedDeployingClusterSummary returns. The controller thinks it's asking for a 60-second delay, but it gets called back in milliseconds.

Related upstream issue in k0rdent: k0rdent/kcm#2552

What this PR does

1. Activates the NextReconcileTime guard — Calls setNextReconcileTime() before every return reconcile.Result{RequeueAfter: X} across 8 call sites in reconcileNormal and proceedDeployingClusterSummary. This is the main fix — it makes the existing skipReconciliation() check actually work.

2. Adds an in-memory cooldown map — NextReconcileTimes map[types.NamespacedName]time.Time on the reconciler struct, following the same pattern as the existing DeletedInstances map. This is needed because when scope.Close() encounters a conflict error (another controller updated the same resource), the NextReconcileTime status field never reaches etcd. The in-memory map survives that.

3. Swallows conflict errors in scope.Close() — Instead of propagating conflict errors to controller-runtime (which immediately requeues, bypassing backoff), we log and swallow them. The watch event from whatever caused the conflict will re-enqueue the resource anyway, and the in-memory cooldown ensures the backoff is honored.

4. Skips redundant status patches — When a feature is already Provisioned with the same hash, we skip calling updateFeatureStatus. This prevents unnecessary status patches that generate watch events and re-trigger reconciliation for no reason.

5. Custom rate limiter — Configures the ClusterSummary controller workqueue with a 1-second base delay (vs the default 5ms) for exponential backoff on failed items.

6. Precise requeue timing — When skipReconciliation() fires, the requeue uses the actual remaining cooldown time instead of the hardcoded normalRequeueAfter (10s).

How the fixes work together

Watch event → Reconcile() → skipReconciliation()
                                    ↓
                         Check in-memory cooldown + status.NextReconcileTime
                                    ↓
                         Too soon? → skip, requeue at exact remaining time
                         Expired?  → proceed normally
                                    ↓
                         On error: set NextReconcileTime (both in-memory + status)
                                    ↓
                         scope.Close() conflict? → swallow (in-memory guard still active)

Testing

  • go build / go vet clean
  • Unit tests pass (11/11 in lib/clusterops, pkg/scope, controllers/dependencymanager)
  • FV tests pass (28/29 — the one failure is the pre-existing flaky PolicyRef Tier test that also fails intermittently on upstream main)

Production results

Deployed to a k0rdent cluster managing 3 child clusters with 11 services each:

  • addon-controller CPU dropped from ~3 cores to ~374m
  • Conflict errors ("the object has been modified") dropped to zero
  • skipReconciliation guard firing correctly (520 skips/minute)

The remaining elevated CPU (~374m vs expected ~50m) is caused by a separate issue in the k0rdent controller manager writing Profile specs in non-deterministic order — tracked in k0rdent/kcm#2552.

muthusk added 9 commits March 17, 2026 17:53
…memory cooldown

      When multiple controllers update the same ClusterSummary, watch events from
      status patches bypass RequeueAfter delays, causing tight reconciliation loops
      (~100ms cycles, 3 CPU cores consumed). The existing NextReconcileTime guard
      in skipReconciliation() was effectively dead code — only set in one obscure
      undeploy path.

      Changes:
      - Set NextReconcileTime before every RequeueAfter return (10 call sites)
      - Add in-memory cooldown map (survives status-patch conflicts)
      - Swallow conflict errors in scope.Close() to prevent bypass of backoff
      - Skip redundant status patches when feature already provisioned with same hash
      - Configure 1s base-delay rate limiter on ClusterSummary workqueue
      - Use precise remaining time in skip path instead of hardcoded 10s
@gianlucam76 (Member) commented

Thank you so much @muthusk for the detailed description and the PR. I will review it today. Thank you again, really appreciated.


DeletedInstances map[types.NamespacedName]time.Time
NextReconcileTimes map[types.NamespacedName]time.Time // in-memory cooldown, survives status-patch conflicts
@gianlucam76 (Member) commented on this diff:

this is minor: in reconcileDelete, we should delete the clusterSummary instance if present in this map

@gianlucam76 (Member) left a comment

This is a thoughtful PR. Thank you.

I added a comment, but will address in a separate PR.

And the test failing in fv-agentless sanity, I will debug it offline.

@gianlucam76 gianlucam76 merged commit 0e6f725 into projectsveltos:main Mar 19, 2026
20 of 24 checks passed