Inside Entra Resilience: Microsoft's Outage War Stories, Backup Secrets and Preventing Global Outages

August 23
1h 15m

Episode Description

In this episode, I sit down with my boss, Tarek Dawoud, to pull back the curtain on what really happens during a major service outage.

Tarek shares some incredible "war stories" from his time in the trenches, from the early days of DirSync, when the team had to edit a sync file with a debugger to prevent an incident, to the massive outages of 2017 and 2018 that changed everything.

We'll give you a peek into the high-stakes, quick-thinking world of a "live site" incident and reveal the groundbreaking engineering principles born from these challenges, such as cell-based architecture and the backup authentication service, that have made Entra more resilient than ever before.

Subscribe with your favorite podcast player or watch on YouTube 👇

About Tarek Dawoud

Tarek Dawoud is a Lead Architect in the Customer Engineering team for Microsoft Entra. With years of experience growing up in Entra engineering, he has been involved in his share of outages and has a deep understanding of what it takes to build and maintain a resilient, hyperscale identity service.

LinkedIn - https://www.linkedin.com/in/tarekdawoud/

🔗 Related Links

* SLA performance for Microsoft Entra ID - aka.ms/entraidsla

* Microsoft Blames "Severe Weather" for Azure Cloud Outage

* Microsoft Probes Cause of Global Web Outage

* Microsoft's Azure AD authentication outage: What went wrong

📗 Chapters

00:57 What is a "Live Site"?

14:15 The Secret to Entra's Uptime: Cell-Based Architecture

18:09 How Entra Routes Your Login Request Globally

24:46 War Story #1: The 2017 Conditional Access Outage

29:52 War Story #2: How a Hurricane & an Office Bug Caused Chaos

43:39 The Backup Auth Service: Entra's Secret Weapon

57:54 Does the Backup Service Kick in Automatically?

01:04:16 Regional Isolation & The Power of Managed Identity

01:08:17 Anatomy of a Near-Outage in 2021

01:12:02 How Microsoft's Culture Learns From Mistakes

Podcast Apps

🎙️ Entra.Chat - https://entra.chat

🎧 Apple Podcast → https://entra.chat/apple

📺 YouTube → https://entra.chat/youtube

📺 Spotify → https://entra.chat/spotify

🎧 Overcast → https://entra.chat/overcast

🎧 Pocket Casts → https://entra.chat/pocketcast

🎧 Others → https://entra.chat/rss

Merill's socials

📺 YouTube → youtube.com/@merillx

👔 LinkedIn → linkedin.com/in/merill

🐤 Twitter → twitter.com/merill

🕺 TikTok → tiktok.com/@merillf

🦋 Bluesky → bsky.app/profile/merill.net

🐘 Mastodon → infosec.exchange/@merill

🧵 Threads → threads.net/@merillf

🤖 GitHub → github.com/merill



Get full access to Entra.News - Your weekly dose of Microsoft Entra at entra.news/subscribe