
fix: activate NextReconcileTime guard to prevent reconciliation storms#1657

Merged

gianlucam76 merged 9 commits into projectsveltos:main from muthusk:fix/reconciliation-storm on Mar 19, 2026

Conversation

@muthusk (Contributor) commented Mar 19, 2026

Problem

We've been running into reconciliation storms in production when managing multiple child clusters with 10+ Helm services each. The addon-controller gets stuck in a tight reconciliation loop (~100ms cycles), consuming 2-3 CPU cores and stalling observedGeneration advancement for 30+ minutes.

After digging into it, the core issue is that skipReconciliation() checks Status.NextReconcileTime to throttle reconciliation, but this field is almost never actually set during normal deploy flow — it's only written in one obscure path (HandOverError during undeploy). So the guard exists but is effectively dead code.

This means that whenever scope.Close() patches the status (e.g., changing the FeatureSummary hash from nil to a value), the resulting watch event immediately re-enqueues the ClusterSummary, completely bypassing the RequeueAfter delay that proceedDeployingClusterSummary returns. The controller thinks it's asking for a 60-second delay, but it gets called back in milliseconds.

Related upstream issue in k0rdent: k0rdent/kcm#2552

What this PR does

1. Activates the NextReconcileTime guard — Calls setNextReconcileTime() before every return reconcile.Result{RequeueAfter: X} across 8 call sites in reconcileNormal and proceedDeployingClusterSummary. This is the main fix — it makes the existing skipReconciliation() check actually work.

2. Adds an in-memory cooldown map — NextReconcileTimes map[types.NamespacedName]time.Time on the reconciler struct, following the same pattern as the existing DeletedInstances map. This is needed because when scope.Close() encounters a conflict error (another controller updated the same resource), the NextReconcileTime status field never reaches etcd. The in-memory map survives that.

3. Swallows conflict errors in scope.Close() — Instead of propagating conflict errors to controller-runtime (which immediately requeues, bypassing backoff), we log and swallow them. The watch event from whatever caused the conflict will re-enqueue the resource anyway, and the in-memory cooldown ensures the backoff is honored.

4. Skips redundant status patches — When a feature is already Provisioned with the same hash, we skip calling updateFeatureStatus. This prevents unnecessary status patches that generate watch events and re-trigger reconciliation for no reason.

5. Custom rate limiter — Configures the ClusterSummary controller workqueue with a 1-second base delay (vs the default 5ms) for exponential backoff on failed items.

6. Precise requeue timing — When skipReconciliation() fires, the requeue uses the actual remaining cooldown time instead of the hardcoded normalRequeueAfter (10s).

How the fixes work together

Watch event → Reconcile() → skipReconciliation()
                                    ↓
                         Check in-memory cooldown + status.NextReconcileTime
                                    ↓
                         Too soon? → skip, requeue at exact remaining time
                         Expired?  → proceed normally
                                    ↓
                         On error: set NextReconcileTime (both in-memory + status)
                                    ↓
                         scope.Close() conflict? → swallow (in-memory guard still active)

Testing

  • go build / go vet clean
  • Unit tests pass (11/11 in lib/clusterops, pkg/scope, controllers/dependencymanager)
  • FV tests pass (28/29 — the one failure is the pre-existing flaky PolicyRef Tier test that also fails intermittently on upstream main)

Production results

Deployed to a k0rdent cluster managing 3 child clusters with 11 services each:

  • addon-controller CPU dropped from ~3 cores to ~374m
  • Conflict errors ("the object has been modified") dropped to zero
  • skipReconciliation guard firing correctly (520 skips/minute)

The remaining elevated CPU (~374m vs expected ~50m) is caused by a separate issue in the k0rdent controller manager writing Profile specs in non-deterministic order — tracked in k0rdent/kcm#2552.

muthusk added 9 commits March 17, 2026 17:53
…memory cooldown

      When multiple controllers update the same ClusterSummary, watch events from
      status patches bypass RequeueAfter delays, causing tight reconciliation loops
      (~100ms cycles, 3 CPU cores consumed). The existing NextReconcileTime guard
      in skipReconciliation() was effectively dead code — only set in one obscure
      undeploy path.

      Changes:
      - Set NextReconcileTime before every RequeueAfter return (10 call sites)
      - Add in-memory cooldown map (survives status-patch conflicts)
      - Swallow conflict errors in scope.Close() to prevent bypass of backoff
      - Skip redundant status patches when feature already provisioned with same hash
      - Configure 1s base-delay rate limiter on ClusterSummary workqueue
      - Use precise remaining time in skip path instead of hardcoded 10s
@gianlucam76 (Member) commented

Thank you so much @muthusk for the detailed description and the PR. I will review it today. Thank you again, really appreciated.


DeletedInstances map[types.NamespacedName]time.Time
NextReconcileTimes map[types.NamespacedName]time.Time // in-memory cooldown, survives status-patch conflicts
@gianlucam76 (Member) commented on this diff:

this is minor: in reconcileDelete, we should delete the clusterSummary instance if present in this map

@gianlucam76 (Member) left a comment

This is a thoughtful PR. Thank you.

I added a comment, but will address in a separate PR.

And the test failing in fv-agentless sanity, I will debug it offline.

@gianlucam76 gianlucam76 merged commit 0e6f725 into projectsveltos:main Mar 19, 2026
20 of 24 checks passed