• How Kubernetes ResourceQuotas Cause Silent Pod Evictions
    Jun 29 2026
    In this episode of DevOps Daily with Fexingo, Lucas and Luna dive into a subtle but destructive Kubernetes behavior: how ResourceQuotas can silently keep evicted or replaced pods from coming back when namespace limits are reached, even when the cluster has ample capacity. They walk through a real incident at a mid-sized e-commerce company where a single namespace's quota misconfiguration caused cascading evictions across 12 microservices during a flash sale. Lucas explains the mechanism: the kube-apiserver's quota admission rejects any pod creation that would exceed the quota, so the rejected pod is never stored, the scheduler never sees it, and its controller keeps retrying. Luna raises the issue of observability gaps, noting that standard dashboards often miss quota-related denials, which typically surface as FailedCreate events on the owning ReplicaSet rather than on any pod. They discuss mitigation strategies: setting explicit deny messages, monitoring quota metrics via Prometheus, and using admission webhooks for early warnings. The episode delivers a concrete lesson for any team running multi-tenant clusters. A brief donation segment highlights listener support for the ad-free show. #Kubernetes #ResourceQuotas #PodEvictions #DevOps #CloudNative #K8sTroubleshooting #ClusterManagement #AdmissionControl #NamespaceQuotas #Observability #Prometheus #IncidentResponse #Ecommerce #FexingoBusiness #BusinessPodcast #Technology #DevOpsDaily #Infrastructure Keep every episode free: buymeacoffee.com/fexingo
    8 mins
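A minimal Python sketch of the quota check described above, assuming a hypothetical `NamespaceQuota` type that tracks only integer CPU millicores and memory MiB (the real ResourceQuota admission plugin handles many more resources and Kubernetes quantities). It shows why a rejected pod never reaches the scheduler: the decision happens at creation time, before any node is considered.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class NamespaceQuota:
    """Hypothetical, simplified ResourceQuota: integer millicores / MiB only."""
    hard: Dict[str, int]                                 # e.g. {"requests.cpu": 2000}
    used: Dict[str, int] = field(default_factory=dict)   # current usage per resource


def admit_pod(quota: NamespaceQuota, pod_requests: Dict[str, int]) -> Tuple[bool, str]:
    """Decide whether a pod create call fits in the quota (sketch, not the real plugin)."""
    for resource, limit in quota.hard.items():
        used = quota.used.get(resource, 0)
        requested = pod_requests.get(resource, 0)
        if used + requested > limit:
            # Rejected at admission: no Pod object is stored, so the scheduler
            # never sees it and free node capacity is never even considered.
            return False, (f"exceeded quota: {resource}, requested={requested}, "
                           f"used={used}, limited={limit}")
    # Admitted: charge the pod's requests against every tracked resource.
    for resource in quota.hard:
        quota.used[resource] = quota.used.get(resource, 0) + pod_requests.get(resource, 0)
    return True, "admitted"
```

For example, with `hard={"requests.cpu": 2000, "requests.memory": 4096}` and 1500m CPU already used, a 500m pod is admitted and a following 250m pod is rejected. In a Deployment, that second call corresponds to the ReplicaSet controller retrying a replacement pod, which is why the denial shows up on the ReplicaSet rather than on a pod.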
  • How Kubernetes Vertical Pod Autoscaler Misallocates Memory
    Jun 28 2026
    Lucas and Luna dig into how the Kubernetes Vertical Pod Autoscaler's recommendations often leave memory over-provisioned and CPU under-provisioned. They examine a case study where, in a production e-commerce cluster, 22% of VPA-recommended memory requests exceeded actual usage by over 40%, while CPU recommendations lagged behind real demand by nearly 30%. The episode explains the recommender's time-decayed usage histograms, its percentile-based target (90th percentile by default, with the 95th used for the upper bound), and why spikes from Java garbage collection or Python memory fragmentation trick VPA into over-allocating. They contrast VPA with the Horizontal Pod Autoscaler and discuss when to pin memory limits manually. Practical takeaway: set a custom memory target percentile through the recommender's `--target-memory-percentile` flag, or use a sidecar that exposes real-time RSS metrics to tune recommendations. No fluff, just a concrete debugging path for anyone running VPA in production. #Kubernetes #VerticalPodAutoscaler #VPA #CloudNative #DevOps #PodAutoscaling #ResourceManagement #MemoryAllocation #CPUAllocation #JavaGC #PythonMemory #KubernetesBestPractices #ClusterOptimization #SRE #ProductionKubernetes #FexingoBusiness #BusinessPodcast #Technology Keep every episode free: buymeacoffee.com/fexingo
    10 mins
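To illustrate how a few memory spikes can move a percentile-based target, here is a simplified Python sketch. It assumes plain samples and a nearest-rank percentile rather than the recommender's decaying histograms, and uses a 0.9 target percentile and a 0.15 safety margin, which we believe match recent VPA defaults; check your version. The function names are hypothetical.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of samples, with p in (0, 1]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


def memory_target_mib(samples_mib: Sequence[float],
                      target_percentile: float = 0.9,
                      margin_fraction: float = 0.15) -> float:
    """Simplified VPA-style target: usage percentile padded by a safety margin.

    Assumptions: plain samples instead of the recommender's decaying histogram,
    and defaults that may differ in your VPA version.
    """
    return nearest_rank_percentile(samples_mib, target_percentile) * (1 + margin_fraction)


# Hypothetical usage: 20 memory samples, mostly 500 MiB with occasional GC spikes.
two_spikes = [500] * 18 + [1500] * 2
three_spikes = [500] * 17 + [1500] * 3
print(round(memory_target_mib(two_spikes)))    # 575
print(round(memory_target_mib(three_spikes)))  # 1725
```

One extra spike in twenty samples triples the target, because the 90th-percentile rank lands on a spike instead of on steady-state usage.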
  • How Kubernetes CPU Manager Pins Cause Node Drain Failures
    Jun 28 2026
    Kubernetes CPU Manager's static policy seems like a performance win: it gives containers in Guaranteed QoS pods with whole-number CPU requests exclusive CPU cores. But when you need to drain a node for maintenance, those pinned pods refuse to move, or worse, they crash on restart. In this episode, Lucas and Luna dissect the tension between CPU pinning and node lifecycle. They walk through a real scenario where a 32-core production node stalled a rolling update for 45 minutes because a CPU Manager pod could not be evicted cleanly. They explain the CPU Manager's topology-aware allocation, the eviction logic gap, and the workarounds: using the descheduler with a custom strategy, setting `cpuManagerPolicy: none` on the nodes that host drain-sensitive workloads (the policy is a per-node kubelet setting, not a per-pod one), and tweaking kubelet eviction thresholds. If your cluster has latency-sensitive apps pinned to cores, this episode will save you from a messy node drain. #Kubernetes #CPUManager #NodeDrain #Kubelet #StaticPolicy #TopologyManager #DevOps #ClusterLifecycle #PodEviction #Descheduler #LatencySensitive #CloudNative #Infrastructure #SRE #Technology #FexingoBusiness #BusinessPodcast #DevOpsDaily Keep every episode free: buymeacoffee.com/fexingo
    12 mins
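As a rough aid for spotting which pods the static policy would pin before a drain, this hypothetical Python helper applies the documented eligibility rule: containers in Guaranteed QoS pods with whole-number CPU requests get exclusive cores. It is a sketch that ignores init containers and other kubelet details.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ContainerResources:
    """Hypothetical container spec: CPU in millicores, memory in MiB."""
    name: str
    cpu_limit_m: Optional[int]
    mem_limit_mi: Optional[int]
    cpu_request_m: Optional[int] = None   # None means "defaulted to the limit"
    mem_request_mi: Optional[int] = None


def is_guaranteed(containers: List[ContainerResources]) -> bool:
    """Guaranteed QoS: every container sets CPU and memory limits equal to requests."""
    for c in containers:
        if c.cpu_limit_m is None or c.mem_limit_mi is None:
            return False
        cpu_req = c.cpu_limit_m if c.cpu_request_m is None else c.cpu_request_m
        mem_req = c.mem_limit_mi if c.mem_request_mi is None else c.mem_request_mi
        if cpu_req != c.cpu_limit_m or mem_req != c.mem_limit_mi:
            return False
    return True


def exclusive_cores(containers: List[ContainerResources]) -> Dict[str, int]:
    """Cores each container would hold exclusively under the static policy (0 = shared pool)."""
    guaranteed = is_guaranteed(containers)
    result = {}
    for c in containers:
        whole = c.cpu_limit_m is not None and c.cpu_limit_m > 0 and c.cpu_limit_m % 1000 == 0
        result[c.name] = c.cpu_limit_m // 1000 if guaranteed and whole else 0
    return result
```

A pod with a 4-core database container and a 500m proxy would report `{"db": 4, "proxy": 0}`: the eligibility is per container, so only the database holds pinned cores that a replacement node must also have free.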
  • How Kubernetes NetworkPolicies Create Latency Jitter
    Jun 27 2026
    Lucas and Luna dig into a surprising side effect of Kubernetes NetworkPolicies: latency jitter. With iptables-based network plugins, fine-grained policies add iptables chains that traffic must traverse, which can add unpredictable delay, especially under load. They walk through a real incident at a mid-size fintech where packet drops from conntrack table exhaustion caused 200 ms tail-latency spikes. They explain the mechanics: how each policy adds iptables chains, how conntrack tracks flows, and why the kernel's connection-tracking table can fill up under high connection rates. They discuss mitigations like eBPF-based Cilium, tuning conntrack parameters, and consolidating policies to minimize rule traversal (standard NetworkPolicies are additive and unordered, so explicit ordering is only available in plugin-specific policy types such as Calico's). No silver bullet, but practical advice for teams deploying microservices with strict network segmentation. #Kubernetes #NetworkPolicy #LatencyJitter #iptables #conntrack #Cilium #eBPF #DevOps #CloudNative #Performance #SRE #Fintech #Microservices #Networking #Technology #FexingoBusiness #BusinessPodcast #DevOpsDaily Keep every episode free: buymeacoffee.com/fexingo
    10 mins
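A back-of-the-envelope way to see how close a node is to conntrack exhaustion is Little's law: steady-state entries are roughly the rate of new connections times how long each entry lives. The Python sketch below assumes a hypothetical traffic profile; `read_sysctl_int` is a helper for reading the real `nf_conntrack_count` and `nf_conntrack_max` values under `/proc/sys/net/netfilter/` on a Linux node and is not exercised in the example.

```python
from pathlib import Path


def estimated_conntrack_entries(new_conns_per_sec: float, avg_entry_lifetime_s: float) -> float:
    """Little's law: steady-state entries = arrival rate x average time an entry lives."""
    return new_conns_per_sec * avg_entry_lifetime_s


def conntrack_utilization(new_conns_per_sec: float, avg_entry_lifetime_s: float,
                          conntrack_max: int) -> float:
    """Fraction of the table in use; once the table is full, new flows get dropped."""
    return estimated_conntrack_entries(new_conns_per_sec, avg_entry_lifetime_s) / conntrack_max


def read_sysctl_int(path: str) -> int:
    """Read an integer sysctl, e.g. /proc/sys/net/netfilter/nf_conntrack_max (Linux only)."""
    return int(Path(path).read_text().strip())


# Hypothetical node: 2,000 short-lived connections/s, each entry lingering ~120 s
# after close (roughly the default TIME_WAIT conntrack timeout), against a
# table limit of 262,144 entries.
print(f"{conntrack_utilization(2000, 120, 262144):.1%}")  # 91.6%
```

Short-lived connections are the dangerous case: the entry outlives the connection, so a modest request rate can fill the table even when few connections are open at any moment.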
  • Why Kubernetes Ingress Controllers Leak Memory Under Load
    Jun 27 2026
    Lucas and Luna dive into a specific production issue that has plagued Kubernetes operators for years: memory leaks in ingress controllers under sustained traffic. They trace the problem to connection tracking and endpoint slice churn, citing a real incident where NGINX Ingress consumed 4 GB of RSS in under 48 hours. The episode walks through diagnosis via pprof, root cause in the upstream connection pool, and practical mitigations like adjusting keepalive timeouts and using endpoint slice filtering. No fluff, just one concrete, actionable debugging story for anyone running Kubernetes in production. #Kubernetes #IngressController #MemoryLeak #NGINXIngress #KubernetesNetworking #EndpointSlice #ConnectionPooling #pprof #ProductionDebugging #ContainerOrchestration #TechPodcast #DevOps #SiteReliabilityEngineering #FexingoBusiness #BusinessPodcast #CloudNative #Microservices #KubernetesSRE Keep every episode free: buymeacoffee.com/fexingo
    10 mins
  • How Kubernetes Pod Overhead Wastes Cluster Capacity
    Jun 26 2026
    Kubernetes cluster administrators often find that their nodes can't fit as many pods as expected, even when resource requests seem reasonable. In Episode 75 of DevOps Daily, Lucas and Luna dive into the hidden cost of Pod Overhead, a feature that reached general availability in Kubernetes 1.24 (it first appeared as alpha in 1.16) to account for the resources a pod's sandbox and runtime consume on top of its containers' requests. They walk through how Pod Overhead, declared through a RuntimeClass, is calculated in practice, why it frequently goes unset, and how a 10-15 percent capacity loss accumulates across large clusters. Using real-world scenarios with gVisor sandboxed pods and Istio sidecars, they explain why monitoring Pod Overhead is critical for efficient bin packing. Lucas also shares a Terraform-based approach to detect and alert on missing Pod Overhead configurations. A must-listen for anyone running Kubernetes at scale. #Kubernetes #PodOverhead #gVisor #Istio #Sidecar #ClusterCapacity #BinPacking #ResourceManagement #DevOps #SRE #ContainerOrchestration #K8sScheduling #KubernetesEfficiency #InfrastructureAsCode #Terraform #CloudNative #Technology #FexingoBusiness Keep every episode free: buymeacoffee.com/fexingo
    11 mins
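The capacity arithmetic can be sketched in a few lines of Python, assuming a hypothetical pod and node and looking at CPU only. The scheduler adds a pod's overhead (set through a RuntimeClass's `overhead.podFixed`) to its container requests when fitting it onto a node, so even a small per-pod overhead changes how many pods fit. Init containers are ignored here for simplicity.

```python
from typing import Sequence


def pod_footprint_m(container_requests_m: Sequence[int], overhead_m: int = 0) -> int:
    """CPU the scheduler accounts for one pod: container requests plus Pod Overhead."""
    return sum(container_requests_m) + overhead_m


def pods_per_node(allocatable_m: int, container_requests_m: Sequence[int],
                  overhead_m: int = 0) -> int:
    """How many identical pods fit on a node by CPU requests alone (whole pods only)."""
    return allocatable_m // pod_footprint_m(container_requests_m, overhead_m)


# Hypothetical pod: 400m app container + 100m sidecar, on a node with 16 allocatable cores.
without_overhead = pods_per_node(16000, [400, 100])              # 32
with_overhead = pods_per_node(16000, [400, 100], overhead_m=60)  # 28
print(f"capacity loss: {1 - with_overhead / without_overhead:.1%}")  # capacity loss: 12.5%
```

With these made-up numbers the loss lands inside the 10-15 percent range mentioned in the episode; the real figure depends on your runtime's overhead and pod sizes.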
  • How Kubernetes Scheduler Threading Causes Node Allocation Delays
    Jun 26 2026
    Lucas and Luna dive into a subtle but painful Kubernetes performance issue: the scheduler's single-threaded scheduling loop. When you have hundreds of nodes and thousands of pods, the scheduler still places pods one at a time (it filters and scores candidate nodes in parallel, with 16 workers by default, but each pod waits its turn in the queue), so the loop can bottleneck under load, causing allocation delays of 30 seconds or more. They walk through a real example from a mid-stage startup that saw pod startup times spike during morning deploys, and explain why the scheduler's "filter then score" logic, combined with default threading, leads to cascading latency. They also touch on the partial remedy of scheduler profiling and why the scheduler framework plugin system doesn't fix the core threading model. This episode is for any engineer who has wondered why their cluster feels sluggish even when CPU and memory headroom is plentiful. #Kubernetes #K8s #Scheduler #Threading #PodAllocation #NodeScoring #ClusterPerformance #DevOps #CloudNative #CNCF #SchedulingLatency #KubernetesScheduler #GoConcurrency #FexingoBusiness #BusinessPodcast #Technology #SoftwareEngineering #ContainerOrchestration Keep every episode free: buymeacoffee.com/fexingo
    6 mins
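One reason large clusters do not score every node is the `percentageOfNodesToScore` cutoff. The Python sketch below re-creates the adaptive default as we understand it from the kube-scheduler documentation and source (50 percent minus one point per 125 nodes, floored at 5 percent, and never fewer than 100 nodes); treat the constants and the function name as assumptions that may change between versions.

```python
MIN_FEASIBLE_NODES_TO_FIND = 100
MIN_FEASIBLE_NODES_PERCENTAGE = 5
BASE_PERCENTAGE = 50


def num_feasible_nodes_to_find(num_all_nodes: int, percentage_of_nodes_to_score: int = 0) -> int:
    """How many feasible nodes the scheduler looks for before scoring (sketch)."""
    if num_all_nodes < MIN_FEASIBLE_NODES_TO_FIND:
        return num_all_nodes  # small clusters: every node is checked
    percentage = percentage_of_nodes_to_score
    if percentage <= 0:
        # Adaptive default: 50% minus 1 point per 125 nodes, floored at 5%.
        percentage = max(BASE_PERCENTAGE - num_all_nodes // 125, MIN_FEASIBLE_NODES_PERCENTAGE)
    num_nodes = num_all_nodes * percentage // 100
    return max(num_nodes, MIN_FEASIBLE_NODES_TO_FIND)


for nodes in (50, 500, 5000, 6000):
    print(nodes, num_feasible_nodes_to_find(nodes))
```

The cutoff trims per-pod work on big clusters, but it does not change the fact that pods are placed one after another, which is the bottleneck the episode focuses on.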
  • How Kubernetes Pod Security Standards Break Legacy Workloads
    Jun 25 2026
    Kubernetes Pod Security Admission, which enforces the Pod Security Standards (PSS), replaced PodSecurityPolicies when they were removed in 1.25, but migrating legacy workloads to the restricted profile often breaks them silently. In this episode, Lucas and Luna dig into why PSS admission checks fail for StatefulSets running on GKE, how the 'privileged' profile leaks capabilities via container runtime defaults, and what the baseline profile actually blocks. They walk through a real cluster audit where a simple NFS-provisioner pod got rejected because of the 'SYS_ADMIN' capability mismatch, and explain why 'kubectl auth can-i' doesn't catch admission-time failures: it checks RBAC permissions, not pod specs. Listeners will learn the three PSS profiles, the gotcha around seccomp profiles in v1.27+, and how to audit your cluster before flipping the admission mode, for example with a server-side dry run of the 'pod-security.kubernetes.io/enforce' namespace label or with the 'audit' and 'warn' modes. If your team is still on PodSecurityPolicies or has hit a wall with PSS, this episode is your debug log. #Kubernetes #PodSecurityStandards #PSS #K8sSecurity #DevOps #CloudNative #ContainerSecurity #GKE #kubectl #Seccomp #PodSecurityPolicy #NFSProvisioner #StatefulSet #AdmissionController #SecurityAudit #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo
    9 mins
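To see why an NFS provisioner that adds SYS_ADMIN gets rejected, here is a hypothetical Python check covering only the capabilities rules of the Baseline and Restricted profiles, using the allowed-capability lists from the Pod Security Standards documentation. It is a sketch, not a substitute for Pod Security Admission, which also checks host namespaces, volumes, seccomp, and more.

```python
from typing import Any, Dict, List

# Capabilities the Baseline profile allows containers to add.
BASELINE_ALLOWED_ADD = frozenset({
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "MKNOD",
    "NET_BIND_SERVICE", "SETFCAP", "SETGID", "SETPCAP", "SETUID", "SYS_CHROOT",
})
# Restricted additionally requires dropping ALL and allows adding only this one.
RESTRICTED_ALLOWED_ADD = frozenset({"NET_BIND_SERVICE"})


def capability_violations(pod_spec: Dict[str, Any], level: str) -> List[str]:
    """Check only the capabilities rules of a PSS level (not a full PSA implementation)."""
    if level == "privileged":
        return []  # Privileged is unrestricted by design.
    if level not in ("baseline", "restricted"):
        raise ValueError(f"unknown level: {level}")
    allowed = BASELINE_ALLOWED_ADD if level == "baseline" else RESTRICTED_ALLOWED_ADD
    containers = (pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
                  + pod_spec.get("ephemeralContainers", []))
    violations = []
    for c in containers:
        caps = (c.get("securityContext") or {}).get("capabilities") or {}
        for cap in sorted(set(caps.get("add") or []) - allowed):
            violations.append(f"{c['name']}: adds {cap}")
        if level == "restricted" and "ALL" not in (caps.get("drop") or []):
            violations.append(f"{c['name']}: does not drop ALL")
    return violations


# Hypothetical NFS provisioner container that adds SYS_ADMIN.
nfs_pod = {"containers": [{"name": "nfs-provisioner",
                           "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}}}]}
print(capability_violations(nfs_pod, "baseline"))  # ['nfs-provisioner: adds SYS_ADMIN']
```

Running such a check over exported manifests gives a quick list of capability problems, but the server-side dry run against real namespaces remains the authoritative test.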