How to Defeat Controller Staleness in Kubernetes v1.36 with AtomicFIFO and Better Observability

Last updated: 2026-05-05 22:03:41 Intermediate

Introduction

Controller staleness—when your Kubernetes controller makes decisions based on outdated cache data—can lead to subtle but serious failures. A controller might delete a Pod that still exists, fail to scale up when needed, or take too long to react. Kubernetes v1.36 introduces powerful tools to fight this: the AtomicFIFO feature in client-go and enhanced observability in kube-controller-manager. This guide walks you through the steps to enable and leverage these improvements to keep your controllers accurate and responsive.

What You Need

  • Kubernetes cluster running v1.36 or later
  • kube-controller-manager with the AtomicFIFO feature gate enabled (if using built-in controllers)
  • client-go updated to v0.36+ (the release line that matches Kubernetes v1.36) in your custom controllers
  • Access to cluster metrics (e.g., via Prometheus or kube-state-metrics) for observability
  • Basic understanding of informer patterns and controller reconciliation loops

Step-by-Step Guide

Step 1: Understand Staleness and Identify Affected Controllers

Before upgrading, review your controllers for staleness symptoms: unexpected actions (e.g., scaling down instead of up), delayed reactions, or duplicate work. Typical causes include restarts (cache rebuilds), API server outages, or out-of-order events. The new features in v1.36 address the root cause: outdated views of the world inside the informer cache.

  • Check controller logs for repeated list-watch errors
  • Monitor metrics like workqueue_depth and workqueue_unfinished_work_seconds
  • Identify controllers that rely on FIFO queues (most client-go based ones)
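As a starting point for the metrics check above, the following Go sketch pulls workqueue_depth samples out of a Prometheus text-format scrape and flags queues that look backed up. It is a minimal illustration, not a full exposition-format parser: the sample queue names, the threshold of 10, and the helper names queueDepths and report are assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// queueDepths extracts workqueue_depth samples from Prometheus text-format
// metrics and returns them keyed by the value of the "name" label.
// Simplification: label values containing commas are not handled.
func queueDepths(metricsText string) map[string]float64 {
	const prefix = "workqueue_depth{"
	depths := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(metricsText))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip comments (# HELP, # TYPE) and unrelated metrics.
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		closeBrace := strings.Index(line, "}")
		if closeBrace < 0 {
			continue
		}
		labels := line[len(prefix):closeBrace]
		fields := strings.Fields(line[closeBrace+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		// Find the name="..." label; other labels are ignored.
		for _, label := range strings.Split(labels, ",") {
			key, val, ok := strings.Cut(strings.TrimSpace(label), "=")
			if ok && key == "name" {
				depths[strings.Trim(val, `"`)] = value
			}
		}
	}
	return depths
}

// report lists queues deeper than threshold, sorted for stable output.
func report(depths map[string]float64, threshold float64) []string {
	var lines []string
	for name, depth := range depths {
		if depth > threshold {
			lines = append(lines, fmt.Sprintf("queue %q is backing up: depth=%v", name, depth))
		}
	}
	sort.Strings(lines)
	return lines
}

func main() {
	// Hypothetical scrape output; real queue names depend on the controller.
	sample := `# TYPE workqueue_depth gauge
workqueue_depth{name="deployment"} 0
workqueue_depth{name="endpoint_slice"} 42
workqueue_unfinished_work_seconds{name="deployment"} 0.5
`
	for _, line := range report(queueDepths(sample), 10) {
		fmt.Println(line)
	}
}
```

A queue whose depth stays high across several scrapes is a reasonable first candidate when deciding which controllers to examine for staleness.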

Step 2: Enable the AtomicFIFO Feature Gate

The AtomicFIFO feature gate changes how new events (especially batch events from initial list operations) are added to the work queue. It ensures atomic processing—either all events from a batch are queued consistently, or none are, preventing partial updates that cause cache inconsistencies.

  1. For kube-controller-manager (if you use built-in controllers like Deployment or ReplicaSet), edit its deployment or static pod manifest and add this flag to the startup arguments:
       --feature-gates=AtomicFIFO=true
  2. If you run custom controllers with a separate binary, enable the same feature gate in your code using the k8s.io/component-base/featuregate package.
  3. Restart the controller process to apply changes.

Note: This feature is available behind a gate in v1.36; it will become default in a future release.

Step 3: Update Custom Controllers to Use Atomic FIFO Processing

client-go v0.36 (the release that matches Kubernetes v1.36) includes the AtomicFIFO queue implementation. If you write custom controllers using cache.NewFIFO or workqueue.New, you should migrate to the atomic variant.

  1. Update your go.mod to use client-go v0.36+:
       require k8s.io/client-go v0.36.0
  2. Replace your FIFO queue creation with:
       import "k8s.io/client-go/tools/cache"

       queue := cache.NewAtomicFIFO(keyFunc)
  3. Adjust your informer’s event handler: instead of adding items directly to a work queue, let the informer push into the AtomicFIFO.
  4. Ensure your controller’s reconciliation loop reads from this queue atomically.

This change guarantees that when an informer performs an initial list, all objects are queued before any individual update events, preventing temporary inconsistencies.
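To make the all-or-nothing and ordering guarantees concrete, here is a toy queue written with only the Go standard library. It is not client-go's AtomicFIFO and makes no claim about how that queue is implemented; the names batchFIFO, AddBatch, and event are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// event is a hypothetical stand-in for an informer delta.
type event struct {
	Key  string
	Kind string // e.g. "Sync" for initial-list items, "Updated" for watch events
}

// String formats an event for printing.
func (e event) String() string { return fmt.Sprintf("%s %s", e.Kind, e.Key) }

// batchFIFO is a toy queue used to illustrate atomic batch insertion.
type batchFIFO struct {
	mu    sync.Mutex
	items []event
}

// AddBatch validates every event first, then appends them all under a single
// lock, so consumers observe either the whole batch or none of it.
func (q *batchFIFO) AddBatch(batch []event) error {
	for _, e := range batch {
		if e.Key == "" {
			return errors.New("batch rejected: event with empty key")
		}
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, batch...)
	return nil
}

// Add enqueues a single event behind anything already queued.
func (q *batchFIFO) Add(e event) error { return q.AddBatch([]event{e}) }

// Pop removes and returns the oldest event; ok is false when the queue is empty.
func (q *batchFIFO) Pop() (e event, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return event{}, false
	}
	e = q.items[0]
	q.items = q.items[1:]
	return e, true
}

func main() {
	q := &batchFIFO{}
	// The initial list arrives as one batch...
	_ = q.AddBatch([]event{{"default/a", "Sync"}, {"default/b", "Sync"}})
	// ...and a later watch event lands behind the complete batch.
	_ = q.Add(event{"default/a", "Updated"})
	for e, ok := q.Pop(); ok; e, ok = q.Pop() {
		fmt.Println(e)
	}
}
```

Because a rejected batch never touches the slice, a consumer calling Pop cannot see half of an initial list, and an update that arrives afterwards is always processed after every item from that list.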

Step 4: Use Cache Introspection to Verify Freshness

With v1.36, client-go exposes the latest resource version known to the cache. You can now check whether your controller’s view is stale before acting.

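  1. From your controller code, call informer.LastSyncResourceVersion() (available on shared informers). It returns the resource version as a string.
  2. Compare this version with the API server’s current version (for example, the metadata.resourceVersion of a fresh list response).
  3. If the difference exceeds a threshold (e.g., missing many events), skip the reconciliation or log a warning.

    // LastSyncResourceVersion returns a string, so parse it before comparing.
    // Numeric comparison assumes integer resource versions; the API treats them as opaque.
    // expectedVersion is a uint64 you obtained separately; needs the "log" and "strconv" imports.
    version, err := strconv.ParseUint(informer.LastSyncResourceVersion(), 10, 64)
    if err == nil && version < expectedVersion {
        log.Printf("Cache is behind by %d versions", expectedVersion-version)
        // Optionally, wait or re-list
    }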

This introspection helps you detect staleness early and avoid taking incorrect actions.
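The threshold check in the last step can be wrapped in a small helper. The sketch below uses only the standard library and takes both resource versions as strings, which is how informers and list responses report them. It assumes the versions parse as decimal integers, which is common in practice but not something the API contract promises, so treat the numeric gap as a heuristic. The names cacheLag, shouldReconcile, and maxLag are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// cacheLag returns how many versions the cache trails the server by.
// Assumption: both resource versions are decimal integers.
func cacheLag(cacheRV, serverRV string) (uint64, error) {
	c, err := strconv.ParseUint(cacheRV, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse cache resource version %q: %w", cacheRV, err)
	}
	s, err := strconv.ParseUint(serverRV, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse server resource version %q: %w", serverRV, err)
	}
	if c >= s {
		return 0, nil // cache is at or ahead of the observed server version
	}
	return s - c, nil
}

// shouldReconcile reports whether the cache is fresh enough to act on.
// An unparsable version (for example an empty string) counts as not fresh.
func shouldReconcile(cacheRV, serverRV string, maxLag uint64) bool {
	lag, err := cacheLag(cacheRV, serverRV)
	return err == nil && lag <= maxLag
}

func main() {
	fmt.Println(shouldReconcile("1050", "1060", 100)) // small gap
	fmt.Println(shouldReconcile("1050", "9000", 100)) // large gap
	fmt.Println(shouldReconcile("", "9000", 100))     // cache not yet synced
}
```

Failing closed on a parse error is a deliberate choice here: a controller that cannot tell how stale it is probably should not take destructive actions.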

Step 5: Leverage Enhanced Observability for Controllers

Kubernetes v1.36 also improves metrics and logs for kube-controller-manager’s highly contended controllers (e.g., endpoints, endpointslices). These metrics reveal when operations are delayed due to stale caches or queue bottlenecks.

  1. Enable the ControllerMetrics feature gate (if not default) to get per-controller staleness metrics.
  2. Monitor controller_staleness_errors_total and controller_cache_lag_seconds in your monitoring system.
  3. Set up alerts for spikes in these metrics—they indicate that a controller is falling behind.
  4. Use the new AtomicFIFOQueueDepth metric to see how many items are waiting for atomic processing.

By combining observability with the AtomicFIFO fix, you can both detect staleness and prevent it from causing harm.

Tips for Successful Implementation

  • Test in a staging environment first—upgrading to new client-go APIs can break existing controllers if not correctly migrated.
  • Monitor controller startup—the initial list operation now blocks until the AtomicFIFO is built. Expect slightly longer startup times, but more consistent behavior.
  • Order of events matters less—with AtomicFIFO, you no longer need to worry about out-of-order updates corrupting your state. Rely on the queue’s consistency.
  • Combine with leader election—if you run multiple replicas, ensure only one works on the queue to avoid duplicate processing.
  • Check resource version introspection regularly in production to catch unexpected API server delays.
  • Upgrade gradually—enable the feature gate first, observe metrics, then update code to use AtomicFIFO.