The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

Episodes

How SRE Teams Use Runbooks to Streamline Incident Response

Jun 29 2026

In episode 80 of The Site Reliability Podcast, Lucas and Luna dive into the practical world of runbooks: the step-by-step guides that SRE teams use to respond to incidents faster and more consistently. They explore how runbooks reduce cognitive load during high-stress outages, why documenting the 'why' behind each step prevents dangerous cargo-culting, and how a major streaming service cut its mean time to recover by 40 percent after implementing standardized runbooks. Lucas shares an anecdote about a junior engineer who resolved a critical database failover using a runbook she'd never seen before, and Luna pushes back by raising the risk of runbooks becoming stale or misleading. They also discuss the tension between automation and manual, runbook-driven processes, and how the best teams treat runbooks as living documents that are tested regularly, tied to specific incident types, and owned by the engineers who write them. The episode doesn't cover postmortems, chaos engineering, or SLOs; it focuses squarely on the unsung backbone of reliable incident response: the humble runbook.

#SiteReliabilityEngineering #SRE #IncidentResponse #Runbooks #DevOps #Uptime #ProductionEngineering #OnCall #TechOps #IncidentManagement #Automation #ReliabilityEngineering #MTTR #KnowledgeManagement #Documentation #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

14 mins

How SRE Teams Use Observability to Reduce Mean Time to Detect

Jun 28 2026

Episode 79 of The Site Reliability Podcast looks at how modern SRE teams are using observability tools to shrink mean time to detect (MTTD), the gap between a system failure and the team knowing about it. Hosts Lucas and Luna break down why observability goes beyond traditional monitoring, using real-world examples like a major e-commerce platform that cut MTTD from 12 minutes to under 90 seconds by shifting from threshold-based alerts to structured logging and distributed tracing. They discuss the three pillars of observability (logs, metrics, and traces) and explain why merging them into a single signal pattern reduces alert fatigue and incident response time. The episode also covers the trade-off between storage costs and retention policies, and how teams justify the investment. No prior SRE experience required, just curiosity about how reliable systems actually stay reliable.

#SiteReliabilityEngineering #Observability #MeanTimeToDetect #SRE #IncidentResponse #DistributedTracing #StructuredLogging #Metrics #AlertFatigue #Monitoring #Uptime #ProductionEngineering #DevOps #Technology #FexingoBusiness #BusinessPodcast #LucasAndLuna #SREPodcast

9 mins

How SRE Teams Use Service Level Agreements to Set Expectations

Jun 28 2026

Lucas and Luna dive into the often-overlooked difference between Service Level Agreements (SLAs) and Service Level Objectives (SLOs) in site reliability engineering. They explore how SLAs are not just legal documents but critical tools for managing stakeholder expectations, using a real-world case from a major cloud provider. The episode explains the 99.9% vs 99.99% uptime debate, the cost implications of tighter SLAs, and how SRE teams can negotiate realistic targets. Listeners learn why SLAs are fundamentally about trust and trade-offs, not just uptime numbers.

#SRE #ServiceLevelAgreement #SLA #SLO #Uptime #ReliabilityEngineering #CloudComputing #IncidentResponse #ExpectationManagement #SiteReliability #TechPodcast #FexingoBusiness #BusinessPodcast #Technology #LucasAndLuna #ProductionEngineering #DevOps #SLABestPractices

9 mins

How SRE Teams Use Canary Deployments to Reduce Risk

Jun 27 2026

Episode 77 of The Site Reliability Podcast dives into canary deployments: rolling out code changes gradually to a small subset of users before a full release. Lucas and Luna explain how companies like Netflix and Etsy use canary analysis to catch regressions early, using real traffic and metrics. They walk through the mechanics: routing a fraction of traffic, comparing key service level indicators (SLIs) like latency and error rates between the canary and the baseline, and deciding whether to roll forward or roll back. The hosts discuss the difference between canary and blue-green deployments, how to choose the right canary size, and what happens when a canary fails. They also cover the human side: developer anxiety during canary windows, the importance of automated rollback triggers, and how mature SRE teams integrate canary results into their deployment pipeline. By the end, listeners will understand why canary releases are a cornerstone of safe, high-velocity deployment.

#CanaryDeployments #SRE #SiteReliability #ProductionEngineering #IncidentResponse #Uptime #DeploymentStrategies #Netflix #Etsy #Spinnaker #ContinuousDelivery #DevOps #Automation #Rollback #Latency #ErrorBudget #FexingoBusiness #Technology

11 mins

How SRE Teams Use DORA Metrics to Measure DevOps Performance

Jun 27 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into DORA metrics: the four key DevOps Research and Assessment measures that elite SRE teams use to quantify software delivery and operational performance. They break down each metric: deployment frequency, lead time for changes, mean time to restore (MTTR), and change failure rate. The hosts explain how Google's 2019 Accelerate State of DevOps report found that elite performers deploy 208 times more frequently than low performers, with lead times 106 times faster. Lucas and Luna discuss why measuring these metrics matters for SRE teams, common pitfalls like vanity metrics and local maxima, and practical steps to start tracking DORA without overhead. They also touch on how DORA complements the more complex SPACE framework for developer productivity, developed by researchers at GitHub, Microsoft, and the University of Victoria. Real examples such as Etsy's continuous deployment and Netflix's Simian Army illustrate the concepts. The conversation is grounded in the tech environment as of June 27, 2026, where platform engineering teams are increasingly adopting DORA dashboards to drive reliability improvements.

#DORA #DevOps #SRE #SiteReliabilityEngineering #GoogleCloud #AccelerateBook #DeploymentFrequency #LeadTime #MTTR #ChangeFailureRate #DevOpsMetrics #SoftwareDelivery #ContinuousDeployment #SPACEFramework #Etsy #Netflix #FexingoBusiness #BusinessPodcast

10 mins

How SRE Teams Use Service Level Objectives to Drive Reliability

Jun 26 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical use of Service Level Objectives (SLOs) in site reliability engineering. They discuss how a major European bank reduced pager fatigue by 40% by shifting from alert-based monitoring to SLO-based error budgets. Lucas explains the difference between SLIs, SLOs, and SLAs, and why measuring user-facing latency is more actionable than measuring CPU utilization. Luna shares a story about a gaming company that used SLOs to prevent a catastrophic launch day outage. They also cover common pitfalls, like setting too many SLOs or targets that are too tight. Tune in for a focused, actionable conversation on making SLOs work in real production environments.

#SRE #SiteReliabilityEngineering #ServiceLevelObjectives #ErrorBudgets #SLI #SLA #Alerting #IncidentResponse #ProductionEngineering #Uptime #ReliabilityEngineering #Monitoring #Observability #TechPodcast #FexingoBusiness #BusinessPodcast #Technology #DevOps

11 mins

How SRE Teams Use Blameless Culture to Improve Incident Response

Jun 26 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into how a blameless culture can actually improve incident response times and reduce recurrence. They explore a real case from a mid-size SaaS company that cut its mean time to resolution by 40 percent after adopting blameless postmortems. Lucas breaks down the psychological safety factors that make engineers more willing to share details, and Luna shares data from a 2025 survey showing that teams with high blamelessness scores detect incidents 30 percent faster. Together, they discuss practical steps to shift from a blame-oriented to a learning-oriented culture, including rewiring incident review templates and leadership modeling. The hosts also touch on common pitfalls, like confusing blameless with accountability-free. This episode offers actionable insights for SREs and engineering leaders looking to build more resilient systems through culture change.

#BlamelessCulture #IncidentResponse #SiteReliabilityEngineering #SRE #BlamelessPostmortems #PsychologicalSafety #IncidentManagement #TechCulture #EngineeringLeadership #MTTR #LearningCulture #ReliabilityEngineering #DevOps #Technology #BusinessPodcast #FexingoBusiness #TheSiteReliabilityPodcast #ProductionEngineering

8 mins

How SRE Teams Use Blameless Postmortems to Build Trust

Jun 25 2026

In Episode 73 of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems transform incident response culture. Using examples from a major e-commerce platform's 2024 database outage, they break down the difference between blame and accountability, explain why 'human error' is a shallow root cause, and share how one team cut repeat incidents by 40% just by rewiring their post-incident process. They also touch on the tension between blameless culture and individual responsibility, and why some engineers push back. Practical tips for running your first blameless postmortem are included.

#BlamelessPostmortems #IncidentResponse #SRE #SiteReliabilityEngineering #DevOps #RootCauseAnalysis #PostIncidentReview #LearningCulture #ReliabilityEngineering #NoBlameCulture #IncidentManagement #SoftwareEngineering #ProductionEngineering #FexingoBusiness #BusinessPodcast #TechnologyPodcast #SREPodcast #ResilienceEngineering

8 mins

