The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

By: Fexingo

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept (error budgets, toil automation, postmortem culture, capacity planning) and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers (latency percentiles, MTTR trends, SLO burn rates) while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs.

This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your system is already in production? Lucas and Luna don't pretend to have final answers; they build the conversation so you can draw your own. If you've ever argued about whether a page was necessary or whether an SLO should be tightened, this is your show.

#SiteReliabilityEngineering #SRE #Uptime #ProductionEngineering #IncidentResponse #ErrorBudgets #SLOs #Postmortem #ToilAutomation #CapacityPlanning #Observability #DevOps #PlatformEngineering #Resilience #OnCall #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

© 2026 Fexingo. All rights reserved.

Economics
Episodes
  • How SRE Teams Use Cost of Delay to Prioritize Reliability Work
    Jun 17 2026
    Lucas and Luna explore how SRE teams at companies like Spotify and Etsy use 'cost of delay' — a concept borrowed from product management — to quantify the business impact of reliability work. Lucas explains the math behind deferring a reliability project, using a real-world example: a payment-processing team deciding whether to fix a latency issue or build a new feature. Luna pushes back on the difficulty of estimating delay costs, and they discuss a practical framework — weighted shortest job first (WSJF) — that helps teams rank reliability initiatives alongside feature work. The episode includes a concrete example: if deferring an SRE project by one quarter costs $200,000 in incident-related losses, the team can calculate the cost of delay per week and compare it to the effort required. Listeners learn how to present reliability investments in the language executives understand: dollars and time. The conversation closes with a reflection on how cost of delay changes the conversation from 'how reliable should we be?' to 'what happens if we defer this work?' #SiteReliabilityEngineering #CostOfDelay #WSJF #Spotify #Etsy #SREPrioritization #ReliabilityEngineering #IncidentResponse #Technology #BusinessCase #ProductManagement #WeightedShortestJobFirst #SREMetrics #LatencyOptimization #FexingoBusiness #BusinessPodcast #TechPodcast #SREPodcast Keep every episode free: buymeacoffee.com/fexingo
    10 mins
  • How SRE Teams Reduce Incident Noise with Intelligent Alert Routing
    Jun 17 2026
    Episode 56 of The Site Reliability Podcast explores how SRE teams at companies like Airbnb and Etsy use intelligent alert routing to slash incident noise by over 60 percent. Lucas and Luna break down the evolution from on-call pagers to modern event-driven routing, explain how machine learning models classify alerts by severity and team ownership, and discuss the trade-off between routing accuracy and latency. They also touch on the human side: how noise reduction cuts burnout and improves on-call experience. A must-listen for any SRE or platform engineer tired of being woken up for non-critical alerts. #SiteReliabilityPodcast #SRE #AlertRouting #IncidentManagement #OnCall #NoiseReduction #MachineLearning #Airbnb #Etsy #PagerDuty #OpsGenie #Burnout #ReliabilityEngineering #Technology #DevOps #ProductionEngineering #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo
    9 mins
  • How SRE Teams Use Incident Cost Analysis to Prioritize Reliability Investments
    Jun 16 2026
    Episode 55 of The Site Reliability Podcast with Fexingo dives into incident cost analysis — a growing practice at companies like Google and Stripe where SRE teams assign a dollar value to every outage minute. Lucas and Luna break down the methodology: how to quantify direct revenue loss, reputational damage, and opportunity cost from incidents, and how that data helps teams justify automation spend, toil reduction, and architecture changes. They walk through a real example from a mid-size e-commerce platform that cut its annual incident cost by 40 percent after implementing this framework. The episode also covers common pitfalls, like overvaluing rare catastrophic events or ignoring compounding effects of small incidents. By the end, listeners will understand how to build a simple incident cost model and use it to make the case for reliability work in language the business understands. #SiteReliabilityEngineering #IncidentCostAnalysis #SRE #ReliabilityEngineering #ProductionEngineering #Uptime #IncidentResponse #CostOptimization #Automation #ToilReduction #Google #Stripe #BusinessCase #Technology #FexingoBusiness #BusinessPodcast #TechOps #DevOps Keep every episode free: buymeacoffee.com/fexingo
    9 mins
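
The cost-of-delay arithmetic mentioned in the Jun 17 episode description (a deferred project costing $200,000 per quarter, converted to a weekly figure and compared against effort) can be sketched roughly as follows. This is an illustrative sketch, not the hosts' own method: the 13-week quarter, the job sizes, the competing feature's weekly cost, and all function and class names are assumptions made for the example. WSJF is taken here in its common simplified form, cost of delay divided by job size.

```python
from dataclasses import dataclass

WEEKS_PER_QUARTER = 13  # assumption: a quarter is treated as 13 weeks


@dataclass
class Job:
    name: str
    cost_of_delay_per_week: float  # dollars lost per week of deferral
    size_weeks: float              # estimated effort, in weeks


def cost_of_delay_per_week(quarterly_cost: float,
                           weeks: int = WEEKS_PER_QUARTER) -> float:
    """Spread a per-quarter deferral cost evenly across the weeks in a quarter."""
    return quarterly_cost / weeks


def wsjf(job: Job) -> float:
    """Simplified weighted shortest job first: cost of delay / job size."""
    if job.size_weeks <= 0:
        raise ValueError("job size must be positive")
    return job.cost_of_delay_per_week / job.size_weeks


def rank_by_wsjf(jobs: list[Job]) -> list[Job]:
    """Return jobs ordered from highest to lowest WSJF score."""
    return sorted(jobs, key=wsjf, reverse=True)


# Hypothetical backlog: only the $200,000-per-quarter figure comes from the
# episode description; every other number is made up for illustration.
backlog = [
    Job("new feature", cost_of_delay_per_week=10_000.0, size_weeks=6.0),
    Job("latency fix", cost_of_delay_per_week(200_000), size_weeks=4.0),
]

for job in rank_by_wsjf(backlog):
    print(f"{job.name}: WSJF = {wsjf(job):,.2f}")
```

Under these assumed numbers the latency fix ranks first, because its weekly cost of delay is higher and its estimated effort is smaller. Different effort or cost estimates would change the ordering, which is the estimation difficulty the episode description says Luna raises.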