How One Engineer Debugged a Kubernetes Pod Eviction That Wiped 5000 Jobs

In this episode of The Software Engineering Podcast with Fexingo, Lucas and Luna dive into a production nightmare: a Kubernetes cluster that silently evicted more than 5000 batch jobs across three weekends. They walk through how one engineer at a data processing startup traced the root cause to a subtle interaction between the kubelet's resource reservation defaults and a misconfigured eviction threshold. Learn how she used Prometheus metrics, a custom admission webhook, and a prioritization framework to keep it from happening again. A masterclass in debugging distributed systems under pressure.

#Kubernetes #PodEviction #DevOps #SiteReliabilityEngineering #DistributedSystems #BatchProcessing #Prometheus #AdmissionWebhook #DataProcessing #ProductionDebugging #CloudNative #SRE #EngineeringResilience #IncidentResponse #FexingoBusiness #BusinessPodcast #Technology #SoftwareEngineering

Keep every episode free: buymeacoffee.com/fexingo