The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

By: Fexingo

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept (error budgets, toil automation, postmortem culture, capacity planning) and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers (latency percentiles, MTTR trends, SLO burn rates), while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your system is already in production? Lucas and Luna don't pretend to have final answers; they build the conversation so you can draw your own. If you've ever argued about whether a page was necessary or whether an SLO should be tightened, this is your show.

#SiteReliabilityEngineering #SRE #Uptime #ProductionEngineering #IncidentResponse #ErrorBudgets #SLOs #Postmortem #ToilAutomation #CapacityPlanning #Observability #DevOps #PlatformEngineering #Resilience #OnCall #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

© 2026 Fexingo. All rights reserved.

Economics

Episodes

How SRE Teams Use Cost of Delay to Prioritize Reliability Work

Jun 17 2026

Lucas and Luna explore how SRE teams at companies like Spotify and Etsy use 'cost of delay', a concept borrowed from product management, to quantify the business impact of reliability work. Lucas explains the math behind deferring a reliability project, using a real-world example: a payment-processing team deciding whether to fix a latency issue or build a new feature. Luna pushes back on the difficulty of estimating delay costs, and they discuss a practical framework, weighted shortest job first (WSJF), that helps teams rank reliability initiatives alongside feature work. The episode includes a concrete example: if deferring an SRE project by one quarter costs $200,000 in incident-related losses, the team can calculate the cost of delay per week and compare it to the effort required. Listeners learn how to present reliability investments in the language executives understand: dollars and time. The conversation closes with a reflection on how cost of delay changes the conversation from 'how reliable should we be?' to 'what happens if we defer this work?'

#SiteReliabilityEngineering #CostOfDelay #WSJF #Spotify #Etsy #SREPrioritization #ReliabilityEngineering #IncidentResponse #Technology #BusinessCase #ProductManagement #WeightedShortestJobFirst #SREMetrics #LatencyOptimization #FexingoBusiness #BusinessPodcast #TechPodcast #SREPodcast

Keep every episode free: buymeacoffee.com/fexingo

10 mins
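
The cost-of-delay arithmetic from the episode description above can be sketched in a few lines of Python. This is a minimal illustration, not the hosts' own worked example: only the $200,000-per-quarter figure comes from the description. The 13-week quarter, the effort estimates, the competing feature's delay cost, and all function and variable names are assumptions made for the sketch. WSJF is computed in its usual form, cost of delay divided by job size.

```python
WEEKS_PER_QUARTER = 13  # assumption: a quarter is treated as 13 weeks


def cost_of_delay_per_week(loss_per_quarter: float) -> float:
    """Convert a quarterly delay cost into a weekly one."""
    return loss_per_quarter / WEEKS_PER_QUARTER


def wsjf(weekly_cost_of_delay: float, effort_weeks: float) -> float:
    """Weighted shortest job first: cost of delay divided by job size."""
    if effort_weeks <= 0:
        raise ValueError("effort_weeks must be positive")
    return weekly_cost_of_delay / effort_weeks


# Hypothetical backlog: (name, delay cost per quarter in dollars, effort in weeks).
# Only the 200_000 figure is taken from the episode description.
backlog = [
    ("latency fix", 200_000, 4),
    ("new feature", 120_000, 6),
]

# Score each item, then rank with the highest WSJF first.
scored = [
    (name, wsjf(cost_of_delay_per_week(loss), effort))
    for name, loss, effort in backlog
]
ranked = sorted(scored, key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name}: WSJF {score:,.0f} dollars per week of delay, per week of effort")
```

With these made-up effort figures the latency fix ranks first, but the ordering is only as good as the delay-cost estimates behind it, which is the difficulty Luna raises in the episode.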

How SRE Teams Reduce Incident Noise with Intelligent Alert Routing

Jun 17 2026

Episode 56 of The Site Reliability Podcast explores how SRE teams at companies like Airbnb and Etsy use intelligent alert routing to slash incident noise by over 60 percent. Lucas and Luna break down the evolution from on-call pagers to modern event-driven routing, explain how machine learning models classify alerts by severity and team ownership, and discuss the trade-off between routing accuracy and latency. They also touch on the human side: how noise reduction cuts burnout and improves on-call experience. A must-listen for any SRE or platform engineer tired of being woken up for non-critical alerts.

#SiteReliabilityPodcast #SRE #AlertRouting #IncidentManagement #OnCall #NoiseReduction #MachineLearning #Airbnb #Etsy #PagerDuty #OpsGenie #Burnout #ReliabilityEngineering #Technology #DevOps #ProductionEngineering #FexingoBusiness #BusinessPodcast

Keep every episode free: buymeacoffee.com/fexingo

9 mins

How SRE Teams Use Incident Cost Analysis to Prioritize Reliability Investments

Jun 16 2026

Episode 55 of The Site Reliability Podcast with Fexingo dives into incident cost analysis, a growing practice at companies like Google and Stripe where SRE teams assign a dollar value to every outage minute. Lucas and Luna break down the methodology: how to quantify direct revenue loss, reputational damage, and opportunity cost from incidents, and how that data helps teams justify automation spend, toil reduction, and architecture changes. They walk through a real example from a mid-size e-commerce platform that cut its annual incident cost by 40 percent after implementing this framework. The episode also covers common pitfalls, like overvaluing rare catastrophic events or ignoring compounding effects of small incidents. By the end, listeners will understand how to build a simple incident cost model and use it to make the case for reliability work in language the business understands.

#SiteReliabilityEngineering #IncidentCostAnalysis #SRE #ReliabilityEngineering #ProductionEngineering #Uptime #IncidentResponse #CostOptimization #Automation #ToilReduction #Google #Stripe #BusinessCase #Technology #FexingoBusiness #BusinessPodcast #TechOps #DevOps

Keep every episode free: buymeacoffee.com/fexingo

9 mins
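
A simple incident cost model of the kind the episode description above mentions might look like the following Python sketch. It is an illustration under stated assumptions, not the framework used by any company named in the description: the `Incident` class, its field names, and every dollar figure are hypothetical. The model adds the three components the description lists (direct revenue loss, reputational damage, opportunity cost) and then compares many small incidents against one large outage.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One incident, priced in dollars. All fields are illustrative."""
    name: str
    minutes: float                   # outage duration
    revenue_loss_per_minute: float   # direct revenue lost per outage minute
    reputational_cost: float = 0.0   # estimated, e.g. churn or support credits
    opportunity_cost: float = 0.0    # estimated, e.g. engineering time diverted

    def cost(self) -> float:
        """Direct loss plus the two estimated components."""
        direct = self.minutes * self.revenue_loss_per_minute
        return direct + self.reputational_cost + self.opportunity_cost


def annual_cost(incidents: list[Incident]) -> float:
    """Total cost across a year's incidents."""
    return sum(incident.cost() for incident in incidents)


# Hypothetical year: forty 5-minute blips and one 90-minute outage.
small_incidents = [
    Incident(f"small-{n}", minutes=5, revenue_loss_per_minute=400)
    for n in range(40)
]
big_incident = Incident(
    "regional outage",
    minutes=90,
    revenue_loss_per_minute=400,
    reputational_cost=20_000,
    opportunity_cost=10_000,
)

print("small incidents:", annual_cost(small_incidents))
print("large outage:   ", big_incident.cost())
print("year total:     ", annual_cost(small_incidents + [big_incident]))
```

In this made-up data set the small incidents together cost more than the single large outage, which illustrates the pitfall the description names: a model that only prices the dramatic events can miss the accumulated cost of frequent minor ones.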
