Episode Transcript
1
00:00:00,000 --> 00:00:07,000
Hey everyone. Justin Garrison here. If this is your first Fork Around and Find Out episode you've been listening to,
2
00:00:07,000 --> 00:00:13,000
you're in for a treat. This is a different show. This is not our typical episode.
3
00:00:13,000 --> 00:00:21,000
I've been wanting to do this for a little while, and I've been inspired on this show from other folks that do podcasts,
4
00:00:21,000 --> 00:00:28,000
from people we've had on the show just in general from how I like to learn about things.
5
00:00:28,000 --> 00:00:33,000
And I can't do this show all the time. We're still going to keep the general format for the show.
6
00:00:33,000 --> 00:00:39,000
We love talking to guests. We love bringing them on, understanding the problems they're working on, and the solutions they come up with.
7
00:00:39,000 --> 00:00:44,000
That's going to be the typical show going forward. Autumn's going to be back with me in January.
8
00:00:44,000 --> 00:00:52,000
But for this episode, the closing of our first year of Fork Around and Find Out, I wanted to try something a little bit special.
9
00:00:52,000 --> 00:00:59,000
And if you like true crime podcasts, if you like a little more narrative in some of your podcast episodes,
10
00:00:59,000 --> 00:01:06,000
this one's for you because one of my favorite things is postmortems, and I love reading a good postmortem.
11
00:01:06,000 --> 00:01:08,000
So if you have any, please send them my way.
12
00:01:08,000 --> 00:01:17,000
But one of the things I really love about postmortems is understanding all of the things that aren't in the actual technical review of what happened.
13
00:01:17,000 --> 00:01:25,000
There's always some series of events, a technical failure that happened or something in the process that didn't go as planned.
14
00:01:25,000 --> 00:01:31,000
But that extends much more than just the individual systems that we work on.
15
00:01:31,000 --> 00:01:34,000
And this is my attempt at bringing some of that to you.
16
00:01:34,000 --> 00:01:44,000
When I read a postmortem in my head, I try to put together the full picture of what this may have looked like for the engineer who was dealing with it,
17
00:01:44,000 --> 00:01:47,000
or maybe even the group of people that caused the problem.
18
00:01:47,000 --> 00:01:50,000
So I would love to hear your feedback about this episode.
19
00:01:50,000 --> 00:01:54,000
There's an email-me link in the show notes, or you can reach me on Bluesky.
20
00:01:54,000 --> 00:01:56,000
Just send us a message, right?
21
00:01:56,000 --> 00:02:01,000
Just let us know what you thought because I want to hear more about not only what you thought of the show,
22
00:02:01,000 --> 00:02:04,000
but also other ideas you might have along this vein.
23
00:02:04,000 --> 00:02:06,000
Again, we're not doing this every time.
24
00:02:06,000 --> 00:02:15,000
This episode took me probably three times as long as our normal episodes do to plan and I just don't have that much time.
25
00:02:15,000 --> 00:02:21,000
But I really wanted to put this out just to put it out in the world because I think something like this should exist.
26
00:02:21,000 --> 00:02:30,000
I want it to welcome new people into technology, to understand the impact of systems and decisions made 10 years ago in a code base
27
00:02:30,000 --> 00:02:34,000
and how that plays out today when something fails.
28
00:02:34,000 --> 00:02:37,000
The music was all created by Dave Eddy, a friend of the show.
29
00:02:37,000 --> 00:02:44,000
You may know him as the You Suck at Programming guy, but he also makes some great music. Links are in the show notes. For Christmas and the holidays,
30
00:02:44,000 --> 00:02:47,000
there's just one gift that I ask of everyone listening to the show.
31
00:02:47,000 --> 00:02:50,000
There's a link to review this in your show notes.
32
00:02:50,000 --> 00:02:55,000
If you haven't already left a review on the podcast, please click one of those links and leave a review.
33
00:02:55,000 --> 00:03:03,000
We'll put a couple of them in there for various platforms that have review systems and all of them help because it just helps someone else find the show.
34
00:03:03,000 --> 00:03:12,000
And it helps someone else understand what we're trying to do here with welcoming more people into technology: understanding that there are humans behind the scenes.
35
00:03:12,000 --> 00:03:19,000
That these systems are built by real people and that our decisions in a code base have real outcomes to other people's lives.
36
00:03:19,000 --> 00:03:27,000
Any of the timestamps in this episode are in UTC, the one true time zone, and I tried to stick to the facts as much as possible.
37
00:03:27,000 --> 00:03:29,000
But there are things that we just don't know.
38
00:03:29,000 --> 00:03:33,000
Even though I worked at Amazon, I have no special insights into this outage.
39
00:03:33,000 --> 00:03:35,000
I wasn't working there at the time.
40
00:03:35,000 --> 00:03:38,000
We hope you enjoy it. Have a happy holidays and a wonderful new year.
41
00:03:38,000 --> 00:03:40,000
We'll see you in 2026.
42
00:04:00,000 --> 00:04:07,000
It's an amazing thing, the power and influence and importance of these problems and scary.
43
00:04:07,000 --> 00:04:09,000
Some more players.
44
00:04:09,000 --> 00:04:13,000
And if you had trouble getting on some internet sites yesterday, here's why.
45
00:04:13,000 --> 00:04:20,000
Amazon's powerful cloud service went down for about four hours from roughly 12:30 to 5:00 p.m.
46
00:04:20,000 --> 00:04:23,000
Thousands of internet services and media outlets were affected.
47
00:04:23,000 --> 00:04:26,000
Amazon has not said what caused the outage, Maria.
48
00:04:29,000 --> 00:04:45,000
It's cold, but not unusually cold.
49
00:04:45,000 --> 00:04:51,000
The temperature was close to freezing, but was starting to warm up by the time they arrived at work.
50
00:04:51,000 --> 00:04:58,000
February 28th, 2017 was an extremely unremarkable Tuesday.
51
00:04:58,000 --> 00:05:05,000
It was the last day of the short month of February, but the majority of the week was still ahead.
52
00:05:05,000 --> 00:05:08,000
The AWS billing system was having a problem.
53
00:05:08,000 --> 00:05:13,000
But not a SEV-1 that required immediate attention or an off-hours page.
54
00:05:13,000 --> 00:05:17,000
So when the engineer arrived at work, they knew they'd have to look into it.
55
00:05:17,000 --> 00:05:22,000
But these types of bugs were notoriously difficult to troubleshoot.
56
00:05:22,000 --> 00:05:27,000
It wasn't that billing was broken. That would have required a SEV-1 page.
57
00:05:27,000 --> 00:05:29,000
But things were slow.
58
00:05:29,000 --> 00:05:31,000
The worst kind of bug.
59
00:05:31,000 --> 00:05:36,000
They had a hunch where the problem might be, not because they were completely familiar with how the billing system worked,
60
00:05:36,000 --> 00:05:43,000
but because they had seen an increase in alerts and slowness on systems billing relied on.
61
00:05:43,000 --> 00:05:48,000
And they had tooling and runbooks to cycle those systems and hope the slowness would go away,
62
00:05:48,000 --> 00:05:53,000
or maybe the team asking for a fix would be satisfied with an attempt.
63
00:05:53,000 --> 00:05:59,000
So after a morning round of office chatter, a stand-up letting people know they'd be looking into the issue,
64
00:05:59,000 --> 00:06:04,000
and a cup of office coffee, they sat down to get started.
65
00:06:04,000 --> 00:06:12,000
They put the free bananas on their desk, wiggled their mouse, and touched their YubiKey to authenticate into their system.
66
00:06:12,000 --> 00:06:17,000
There were a handful of tools that helped them execute common runbooks.
67
00:06:18,000 --> 00:06:25,000
These tools were ancient bash scripts that might as well have been written by the aliens who built the Great Pyramids.
68
00:06:25,000 --> 00:06:32,000
The scripts had evolved from single-line tool wrappers into monstrosities of semi-portable bash.
69
00:06:32,000 --> 00:06:39,000
These executable text files make human evolution from single-celled organisms over millions of years look quaint.
70
00:06:39,000 --> 00:06:42,000
And like humans, they're pretty reliable.
71
00:06:42,000 --> 00:06:46,000
Or at least they are when you give them proper instructions.
72
00:06:46,000 --> 00:06:50,000
Unfortunately, today the instructions were less than proper.
73
00:06:50,000 --> 00:06:56,000
To reduce tool spread and make it easier to stay up to date, the scripts work on a variety of similar systems.
74
00:06:56,000 --> 00:06:59,000
The billing system was the target of today's maintenance.
75
00:06:59,000 --> 00:07:02,000
But like a stormtrooper's aim, it was off the mark.
76
00:07:02,000 --> 00:07:06,000
The key press was so insignificant it didn't need a review.
77
00:07:06,000 --> 00:07:12,000
It was done by authorized employees from an assigned office desk on company property.
78
00:07:12,000 --> 00:07:23,000
It not only took down AWS production services, it took down more than a dozen high-profile Internet darlings who placed their bets on the cloud.
79
00:07:23,000 --> 00:07:32,000
S3 has a terrible, horrible, no good, very bad day on this episode of Fork Around and Find Out.
80
00:07:32,000 --> 00:07:34,000
Amazon breaks the Internet.
81
00:07:34,000 --> 00:07:39,000
How a problem in the cloud triggered the error message sweeping across the East Coast.
82
00:07:39,000 --> 00:07:43,000
The S3 server farm started showing a high rate of errors.
83
00:07:43,000 --> 00:07:46,000
It's not clear yet what caused all these problems.
84
00:07:46,000 --> 00:07:50,000
Cloud computing service, which went down for more than four hours Tuesday.
85
00:07:53,000 --> 00:08:00,000
Welcome to Fork Around and Find Out, the podcast about building, running, and maintaining software and systems.
86
00:08:10,000 --> 00:08:14,000
Lots of people working in tech remember this day.
87
00:08:14,000 --> 00:08:19,000
Maybe not exactly what happened, but we remember how we felt.
88
00:08:19,000 --> 00:08:26,000
We might have been cloud champions at our companies, and a major four-hour outage was not going to help our case.
89
00:08:26,000 --> 00:08:30,000
Especially because Amazon wouldn't even admit it.
90
00:08:30,000 --> 00:08:33,000
The AWS cloud account on Twitter said,
91
00:08:33,000 --> 00:08:43,000
We are continuing to experience high error rates with S3 in US East One, which is impacting some other AWS services.
92
00:08:43,000 --> 00:08:47,000
They couldn't even say they were having an outage.
93
00:08:47,000 --> 00:08:53,000
Cloud naysayers were warning about dependencies that were already creeping into applications.
94
00:08:53,000 --> 00:09:03,000
This outage was the I-told-you-so moment they needed to convince senior leadership that the budget for server refreshes was a good thing to approve.
95
00:09:03,000 --> 00:09:11,000
The simple fact that Amazon's status page went down with the outage, and that updates could only be found via the AWS Twitter account,
96
00:09:11,000 --> 00:09:15,000
didn't reduce the cloud skepticism from late adopters.
97
00:09:15,000 --> 00:09:20,000
How could a single service outage in a single region have such a global impact?
98
00:09:20,000 --> 00:09:24,000
We need to start with what it is before we can talk about what happened.
99
00:09:24,000 --> 00:09:27,000
Sara Jones on Twitter at one Sara Jones said,
100
00:09:27,000 --> 00:09:33,000
Five hours ago, I'd never heard of AWS S3, and yet it has ruined my entire day.
101
00:09:33,000 --> 00:09:41,000
Amazon's S3 server is responsible for providing cloud services to about 150,000 companies around the world.
102
00:09:41,000 --> 00:09:46,000
If you're a longtime listener of this podcast, I'm sure you know what S3 is.
103
00:09:46,000 --> 00:09:49,000
But stay with me for a minute because we're going to go a bit deeper.
104
00:09:49,000 --> 00:09:55,000
S3 stands for Simple Storage Service, and it was one of the groundbreaking innovations of the cloud.
105
00:09:55,000 --> 00:10:00,000
Before S3, there were only two types of storage commonly deployed.
106
00:10:00,000 --> 00:10:03,000
There were blocks and there were files.
107
00:10:03,000 --> 00:10:05,000
Blocks were a necessary evil.
108
00:10:05,000 --> 00:10:09,000
They're the storage that low-level systems need in order to have access to bits.
109
00:10:09,000 --> 00:10:14,000
Your operating system needs blocks, but your application usually doesn't.
110
00:10:15,000 --> 00:10:17,000
Applications generally need files.
111
00:10:17,000 --> 00:10:19,000
Files are great until they're not.
112
00:10:19,000 --> 00:10:24,000
Files have actions applications can perform like read, write, and execute,
113
00:10:24,000 --> 00:10:28,000
but they also have pesky things like ownership, locks, hierarchy,
114
00:10:28,000 --> 00:10:32,000
and requirements to have some form of system to manage the files.
115
00:10:32,000 --> 00:10:34,000
Call it a file system.
116
00:10:34,000 --> 00:10:37,000
This was almost always something locally accessible to the application,
117
00:10:37,000 --> 00:10:40,000
or at least something that appeared local.
118
00:10:40,000 --> 00:10:45,000
NFS, or the network file system, was pivotal for scaling file storage
119
00:10:45,000 --> 00:10:49,000
by trading network latency for large storage systems.
120
00:10:49,000 --> 00:10:54,000
Pooling a bunch of computers to appear like one large resource wasn't new,
121
00:10:54,000 --> 00:10:57,000
but usually they were limited to POSIX constraints
122
00:10:57,000 --> 00:11:02,000
because the systems accessing them had to pretend they were local resources.
123
00:11:02,000 --> 00:11:06,000
Databases were another option for removing the need for files.
124
00:11:06,000 --> 00:11:08,000
Store all the stateful stuff somewhere else.
125
00:11:08,000 --> 00:11:12,000
Remote connectivity and advanced query syntax sounds like a great option,
126
00:11:12,000 --> 00:11:15,000
but those queries aren't always the easiest thing to work with.
127
00:11:15,000 --> 00:11:17,000
Connection pooling becomes a problem,
128
00:11:17,000 --> 00:11:21,000
and you need to understand what data you have and how it's going to be used
129
00:11:21,000 --> 00:11:23,000
before you really start storing it.
130
00:11:23,000 --> 00:11:26,000
Object storage didn't have those trade-offs.
131
00:11:26,000 --> 00:11:28,000
Object storage doesn't have a hierarchy.
132
00:11:28,000 --> 00:11:34,000
It doesn't need a file system, and it's definitely not POSIX constrained.
133
00:11:34,000 --> 00:11:39,000
There's no SQL queries or connection pooling, just verbs like get and put.
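If you've never touched S3, here's roughly what that looks like in practice. This is a minimal sketch using boto3; the bucket name is made up, and it assumes your AWS credentials are already configured.

```python
import boto3

# Minimal sketch only: "example-bucket" is a made-up name, and this assumes
# AWS credentials are already configured in your environment.
s3 = boto3.client("s3")

# PUT: store bytes under a key. No directories, no file system, no locks.
# The slash in the key is just part of the name, not a folder hierarchy.
s3.put_object(Bucket="example-bucket", Key="episodes/finale.mp3", Body=b"not actually an mp3")

# GET: fetch it back by bucket and key.
response = s3.get_object(Bucket="example-bucket", Key="episodes/finale.mp3")
print(response["Body"].read())
```

That really is most of the day-to-day API surface: a bucket, a key, and a blob of bytes.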
134
00:11:39,000 --> 00:11:44,000
It's one of the few trillion-dollar ideas that the cloud has created.
135
00:11:44,000 --> 00:11:46,000
Well, it didn't create the idea of object storage,
136
00:11:46,000 --> 00:11:50,000
but it definitely made it a viable option for millions of users around the world,
137
00:11:50,000 --> 00:11:54,000
and S3 is the standard for object storage.
138
00:11:54,000 --> 00:11:59,000
And on February 28, 2017, it broke.
139
00:11:59,000 --> 00:12:02,000
@netgarun on Twitter said,
140
00:12:02,000 --> 00:12:04,000
Happy AWS Appreciation Day, Internet.
141
00:12:04,000 --> 00:12:07,000
S3 isn't just another AWS service.
142
00:12:07,000 --> 00:12:11,000
It's a foundational piece of the entire AWS empire.
143
00:12:11,000 --> 00:12:14,000
You can usually tell how important services are to AWS
144
00:12:14,000 --> 00:12:19,000
by how many nines they assign to their availability SLAs.
145
00:12:19,000 --> 00:12:26,000
Route 53 is the only AWS service that guarantees a 100% SLA.
146
00:12:26,000 --> 00:12:31,000
As a matter of fact, I think it's the only SaaS that exists with 100% SLA.
147
00:12:31,000 --> 00:12:33,000
It's pretty important.
148
00:12:33,000 --> 00:12:39,000
There's only a handful of Amazon's 200 services that have four nines of availability,
149
00:12:39,000 --> 00:12:41,000
and S3 is one of them.
150
00:12:41,000 --> 00:12:46,000
This level allows for only about four minutes of downtime per month,
151
00:12:46,000 --> 00:12:49,000
or roughly 52 minutes for the entire year.
152
00:12:49,000 --> 00:12:52,000
You can't finish a single episode of Stranger Things
153
00:12:52,000 --> 00:12:56,000
in the amount of time S3 would get for downtime in a year.
154
00:12:56,000 --> 00:13:03,000
S3's availability SLA is often confused with its durability SLA of 11 nines,
155
00:13:03,000 --> 00:13:05,000
a ridiculous number to ponder,
156
00:13:05,000 --> 00:13:11,000
and an odd form of SRE peacocking that somehow sounds more impressive than 100%.
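If you want to put numbers on those nines yourself, the back-of-the-napkin math is simple. This is just the generic calculation, not anything out of Amazon's SLA documents.

```python
# Back-of-the-napkin downtime budget for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines: int) -> tuple[float, float]:
    unavailability = 10 ** -nines               # e.g. 4 nines -> 0.0001
    per_year = MINUTES_PER_YEAR * unavailability
    return per_year / 12, per_year              # (per month, per year)

print(downtime_minutes(3))   # (~43.8, ~525.6) minutes for three nines
print(downtime_minutes(4))   # (~4.4, ~52.6) minutes for four nines
print(downtime_minutes(11))  # effectively zero: about a third of a millisecond per year
```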
157
00:13:11,000 --> 00:13:18,000
But even four nines of availability takes more than a few servers and a load balancer to achieve.
158
00:13:18,000 --> 00:13:24,000
The backend of a storage system that stores this much data this reliably has a lot of components.
159
00:13:24,000 --> 00:13:27,000
The service obviously has a web front-end.
160
00:13:27,000 --> 00:13:30,000
It has authentication, and there's millions of disks to store the data.
161
00:13:30,000 --> 00:13:32,000
But we're not going to focus on those parts.
162
00:13:32,000 --> 00:13:36,000
But we are going to look at how S3 gets and puts objects into the system.
163
00:13:36,000 --> 00:13:41,000
The core idea of making S3 scalable is sharding data across many different hard drives.
164
00:13:41,000 --> 00:13:43,000
Amazon does this with a lot of different services,
165
00:13:43,000 --> 00:13:47,000
and they call their method of sharding data shuffle sharding.
166
00:13:48,000 --> 00:13:50,000
It does make it very difficult for us to scale,
167
00:13:50,000 --> 00:13:53,000
because we have to think about what if customers become unbalanced over time.
168
00:13:53,000 --> 00:13:56,000
So we do something a little bit different on S3.
169
00:13:56,000 --> 00:14:02,000
And it's actually an example of a pattern that we talk about often in AWS called shuffle sharding.
170
00:14:02,000 --> 00:14:04,000
The idea of shuffle sharding is actually pretty simple.
171
00:14:04,000 --> 00:14:07,000
It's that rather than statically assigning a workload to a drive
172
00:14:07,000 --> 00:14:11,000
or to any other kind of resource, a CPU, a GPU, what have you,
173
00:14:11,000 --> 00:14:15,000
we randomly spread workloads across the fleet, across our drives.
174
00:14:15,000 --> 00:14:19,000
So when you do a put, we'll pick a random set of drives to put those bytes on.
175
00:14:19,000 --> 00:14:21,000
Maybe we pick these two.
176
00:14:21,000 --> 00:14:24,000
The next time you do a put, even to the same bucket, even to the same key,
177
00:14:24,000 --> 00:14:26,000
it doesn't even matter, right?
178
00:14:26,000 --> 00:14:27,000
We'll pick a different set of drives.
179
00:14:27,000 --> 00:14:30,000
They might overlap, but we'll make a new random choice
180
00:14:30,000 --> 00:14:32,000
of which drives to use for that put.
181
00:14:33,000 --> 00:14:37,000
Shuffle sharding is basically a predictable way for you to assign resources
182
00:14:37,000 --> 00:14:40,000
to a customer and spread those resources
183
00:14:40,000 --> 00:14:43,000
so two customers don't keep getting grouped together.
184
00:14:43,000 --> 00:14:47,000
If Disney, Netflix, HBO, and Peacock are all customers
185
00:14:47,000 --> 00:14:50,000
and they each get two servers, shuffle sharding would make sure
186
00:14:50,000 --> 00:14:53,000
that Disney and Netflix might be co-located on one server,
187
00:14:53,000 --> 00:14:58,000
but Netflix's second allocated server would be shared with HBO instead.
188
00:14:58,000 --> 00:15:02,000
This allows for things like the Disney and Netflix server to go down,
189
00:15:02,000 --> 00:15:05,000
but Netflix still has some availability.
190
00:15:05,000 --> 00:15:09,000
Or if HBO is a noisy neighbor and consumes all the resources on the second server,
191
00:15:09,000 --> 00:15:13,000
Netflix still has a server that's not overloaded.
192
00:15:13,000 --> 00:15:16,000
This spreads the load between customers.
193
00:15:16,000 --> 00:15:20,000
Assuming those customers don't buy each other in multi-billion dollar acquisitions,
194
00:15:20,000 --> 00:15:22,000
but that's not really Amazon's problem.
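Here's a rough sketch of that idea in code. It's purely illustrative, not Amazon's actual algorithm: every customer gets a random combination of servers, so it's unlikely any two customers share their entire set.

```python
import itertools
import random

# Toy shuffle sharding, purely illustrative: each customer is assigned a
# random combination of servers, so no two customers are likely to share
# their entire set, only individual servers.
SERVERS = [f"server-{i}" for i in range(8)]
SHARD_SIZE = 2

rng = random.Random(2017)
customers = ["Disney", "Netflix", "HBO", "Peacock"]
shards = {c: tuple(sorted(rng.sample(SERVERS, SHARD_SIZE))) for c in customers}

for customer, shard in shards.items():
    print(f"{customer:8} -> {shard}")

# Any two customers usually overlap on at most one server, so a dead server
# or a noisy neighbor hurts a customer's capacity without wiping it out.
for a, b in itertools.combinations(customers, 2):
    overlap = set(shards[a]) & set(shards[b])
    print(f"{a} and {b} share: {sorted(overlap) or 'nothing'}")
```

With 8 servers and shards of 2 there are 28 possible combinations, which is the whole trick: the more servers and the bigger the shards, the less likely two customers ever land on an identical set.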
195
00:15:22,000 --> 00:15:24,000
In order to do this for large storage systems,
196
00:15:24,000 --> 00:15:29,000
you need to have services that store metadata about where blobs are stored.
197
00:15:29,000 --> 00:15:34,000
This service is in the critical path for putting data into the system.
198
00:15:34,000 --> 00:15:37,000
Because if you can't keep track of where the data went,
199
00:15:37,000 --> 00:15:38,000
you might as well just delete it.
200
00:15:38,000 --> 00:15:43,000
Of course, S3 writes data in multiple places throughout multiple data centers.
201
00:15:43,000 --> 00:15:45,000
The details don't matter for this outage,
202
00:15:45,000 --> 00:15:49,000
but what does matter is that there's a critical service within S3
203
00:15:49,000 --> 00:15:53,000
that keeps track of where all this data is stored, called the placement service.
204
00:15:53,000 --> 00:15:56,000
New puts go through the placement service.
205
00:15:56,000 --> 00:16:01,000
All other requests, get, list, put, delete, go through the index service.
206
00:16:01,000 --> 00:16:04,000
People usually put data before they get it,
207
00:16:04,000 --> 00:16:07,000
but not all customers know what they're doing.
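As a mental model, you can think of it something like this sketch. The names and shapes here are my guess at the general structure described in the postmortem, not Amazon's actual implementation.

```python
import random

# A toy model of the two metadata subsystems: a guess at the shape of the
# system for illustration, not how S3 is actually implemented.
DRIVES = [f"drive-{i}" for i in range(100)]

class PlacementService:
    """Decides which drives receive the bytes for a new put."""
    def choose_drives(self, copies: int = 3) -> list[str]:
        return random.sample(DRIVES, copies)

class IndexService:
    """Remembers where every object's bytes ended up."""
    def __init__(self) -> None:
        self._locations: dict[tuple[str, str], list[str]] = {}

    def record(self, bucket: str, key: str, drives: list[str]) -> None:
        self._locations[(bucket, key)] = drives

    def lookup(self, bucket: str, key: str) -> list[str]:
        return self._locations[(bucket, key)]

placement, index = PlacementService(), IndexService()

# PUT: placement picks the drives, the index records the choice. Both are in
# the critical path -- lose the index and you've effectively lost the object.
chosen = placement.choose_drives()
index.record("example-bucket", "song.mp3", chosen)

# GET: only the index is needed to find where the bytes live.
print(index.lookup("example-bucket", "song.mp3"))
```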
208
00:16:08,000 --> 00:16:12,000
But the important thing here is that we chunk up the data that you send us
209
00:16:12,000 --> 00:16:15,000
and we store it redundantly across a set of drives.
210
00:16:15,000 --> 00:16:20,000
Those drives are running a piece of software called Shardstore.
211
00:16:20,000 --> 00:16:25,000
So Shardstore is the file system that we run on our storage node hosts.
212
00:16:25,000 --> 00:16:28,000
Shardstore is something that we've written ourselves.
213
00:16:28,000 --> 00:16:32,000
There's public papers on Shardstore, which James will talk about in his section.
214
00:16:32,000 --> 00:16:36,000
But at its core, it's a log-structured file system.
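A log-structured store, in miniature, is just an append-only log plus an index of where each key's latest value lives. This little sketch is the textbook idea only, not anything taken from the ShardStore paper.

```python
# A miniature log-structured key-value store: writes only ever append to the
# log, and an index maps each key to the position of its latest value.
# Textbook idea only, not anything taken from the ShardStore paper.
class TinyLogStore:
    def __init__(self) -> None:
        self.log: list[bytes] = []           # stand-in for an append-only file
        self.index: dict[str, int] = {}      # key -> position in the log

    def put(self, key: str, value: bytes) -> None:
        self.log.append(value)               # never overwrite, always append
        self.index[key] = len(self.log) - 1

    def get(self, key: str) -> bytes:
        return self.log[self.index[key]]     # stale entries wait for compaction

store = TinyLogStore()
store.put("shard-42", b"version 1")
store.put("shard-42", b"version 2")          # supersedes version 1
print(store.get("shard-42"))                 # b'version 2'
```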
215
00:16:37,000 --> 00:16:41,000
The internal names for S3 services are not as boring as index and placement.
216
00:16:41,000 --> 00:16:44,000
They have, or at least at one time they had,
217
00:16:44,000 --> 00:16:51,000
cool quirky names like R2D2, Death Star, Mr. Biggs, PMS, and Cramps.
218
00:16:51,000 --> 00:16:54,000
But the postmortem would have had a lot more to explain
219
00:16:54,000 --> 00:16:57,000
if they used the internal names to describe these systems.
220
00:16:57,000 --> 00:17:03,000
After the break, we'll talk about what happened on February 28th, 2017.
221
00:17:03,000 --> 00:17:07,000
On the morning of February 28th, an engineer went to scale down
222
00:17:07,000 --> 00:17:10,000
a small set of servers used by the billing system.
223
00:17:10,000 --> 00:17:13,000
They were using a runbook, which included a pre-approved script,
224
00:17:13,000 --> 00:17:16,000
to manage the scale-down process.
225
00:17:16,000 --> 00:17:19,000
Unfortunately, they mistyped the command.
226
00:17:19,000 --> 00:17:23,000
We've all done it, and some of us have even taken down production with our mistype.
227
00:17:23,000 --> 00:17:26,000
This mistype took down a lot of productions.
228
00:17:26,000 --> 00:17:30,000
I envision it like a bash script with a couple of flags
229
00:17:30,000 --> 00:17:32,000
and some arguments that need to go in a certain order.
230
00:17:32,000 --> 00:17:36,000
And if you change that order, it might do something different.
231
00:17:36,000 --> 00:17:39,000
I don't know the details, but we can see how this would happen.
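Just to make that concrete, here's an invented example of the failure mode, not the real tool: when the fleet and the count are both free-form arguments, nothing stops a swapped or fat-fingered value from parsing cleanly and targeting the wrong thing.

```python
# An invented example of the failure mode, not Amazon's actual tooling.
# Both values are free-form positional arguments, so a swapped or mistyped
# argument still "parses" and the script happily does something different.
def remove_capacity(args: list[str]) -> None:
    fleet, count = args[0], args[1]
    print(f"removing {count} hosts from fleet {fleet!r}")

remove_capacity(["billing-frontend", "3"])   # what the runbook intended
remove_capacity(["3", "billing-frontend"])   # same script, very different day
```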
232
00:17:39,000 --> 00:17:42,000
Instead of removing a few servers for the billing service,
233
00:17:42,000 --> 00:17:48,000
they removed more than 50% of the servers for the index and placement services.
234
00:17:48,000 --> 00:17:53,000
Amazon's service disruption report was posted two days after the outage.
235
00:17:53,000 --> 00:17:56,000
It says, removing a significant portion of the capacity
236
00:17:56,000 --> 00:18:01,000
caused each of these systems to require a full restart.
237
00:18:01,000 --> 00:18:05,000
It's unknown why this amount of capacity required a full restart.
238
00:18:05,000 --> 00:18:08,000
You would think the service would just be slow,
239
00:18:08,000 --> 00:18:10,000
or only specific customers would be affected.
240
00:18:10,000 --> 00:18:13,000
After all, Amazon talks about how great shuffle sharding is.
241
00:18:13,000 --> 00:18:17,000
But for some reason, these services were not configured this way,
242
00:18:17,000 --> 00:18:21,000
or at some level of capacity, you had to shut everything down.
243
00:18:21,000 --> 00:18:23,000
They go on to say,
244
00:18:23,000 --> 00:18:28,000
S3 subsystems are designed to support the removal or failure of significant capacity
245
00:18:28,000 --> 00:18:31,000
with little or no customer impact.
246
00:18:31,000 --> 00:18:35,000
We build our systems with the assumptions that things will occasionally fail,
247
00:18:35,000 --> 00:18:38,000
and we rely on the ability to remove and replace capacity
248
00:18:38,000 --> 00:18:41,000
as one of our core operational processes.
249
00:18:41,000 --> 00:18:45,000
While this is an operation that we have relied on to maintain our systems,
250
00:18:45,000 --> 00:18:47,000
since the launch of S3,
251
00:18:47,000 --> 00:18:51,000
we have not completely restarted the index subsystem
252
00:18:51,000 --> 00:18:57,000
or the placement subsystem in our larger regions for many years.
253
00:18:57,000 --> 00:19:00,000
And again, this is 2017.
254
00:19:00,000 --> 00:19:05,000
Many years, in my opinion, makes it more than three,
255
00:19:05,000 --> 00:19:08,000
and S3 launched back in 2006.
256
00:19:08,000 --> 00:19:15,000
So these services had gone at least three years, and possibly much longer, without being completely restarted.
257
00:19:15,000 --> 00:19:17,000
But why does any of this matter?
258
00:19:17,000 --> 00:19:20,000
S3 is just a storage mechanism.
259
00:19:20,000 --> 00:19:21,000
Who cares?
260
00:19:21,000 --> 00:19:27,000
Who cares if some files don't get put into storage and retrieved from that storage?
261
00:19:27,000 --> 00:19:33,000
Well, when a service is this critical, other Amazon services get built on it.
262
00:19:33,000 --> 00:19:38,000
So while S3 was down, other Amazon services also went down.
263
00:19:38,000 --> 00:19:41,000
Things like EC2 instances.
264
00:19:41,000 --> 00:19:46,000
You could not launch a new VM in AWS without S3.
265
00:19:46,000 --> 00:19:52,000
New Lambda invocations couldn't scale up because those Lambda functions were stored in S3.
266
00:19:52,000 --> 00:19:56,000
EBS volumes that needed snapshot restores, those also come from S3.
267
00:19:56,000 --> 00:20:00,000
Different load balancers within AWS, S3.
268
00:20:00,000 --> 00:20:07,000
There were so many services that cascaded down into failure because S3 objects were not available.
269
00:20:07,000 --> 00:20:12,000
NPR called this Amazon's $150 million typo.
270
00:20:12,000 --> 00:20:18,000
Because while S3 was down, it's estimated that of the companies in the S&P 500,
271
00:20:18,000 --> 00:20:22,000
those companies lost $150 million in value.
272
00:20:22,000 --> 00:20:24,000
But that's still not the complete picture here.
273
00:20:24,000 --> 00:20:27,000
So many other companies lost productivity.
274
00:20:27,000 --> 00:20:29,000
This isn't just stock trading.
275
00:20:29,000 --> 00:20:31,000
A bunch of companies couldn't do work.
276
00:20:31,000 --> 00:20:32,000
They hired people.
277
00:20:32,000 --> 00:20:34,000
They sat around for hours.
278
00:20:34,000 --> 00:20:43,000
An internet monitoring company, Apica, found that over half of the top 100 online retailers saw their site performance slow by 20%.
279
00:20:43,000 --> 00:20:48,000
Half of the biggest sellers on the internet had slow websites for the day.
280
00:20:48,000 --> 00:20:57,000
And the CEO of Catchpoint estimated that the total ramifications of this one typo would be in the hundreds of billions of dollars in lost productivity.
281
00:20:57,000 --> 00:21:01,000
People are sitting around at work and they can't do their work.
282
00:21:01,000 --> 00:21:08,000
So many websites were down for four hours in an Amazon outage that the whole world felt the impact.
283
00:21:08,000 --> 00:21:12,000
S3 may just be storing objects at an HTTP endpoint.
284
00:21:12,000 --> 00:21:18,000
But so many things we rely on rely on things being available.
285
00:21:18,000 --> 00:21:24,000
This is why Spotify was down and various companies were down because those music files that you stream,
286
00:21:24,000 --> 00:21:29,000
when you look all the way back where they come from, they come from storage somewhere.
287
00:21:29,000 --> 00:21:34,000
Like it's still a file that exists somewhere on a hard drive in a data center.
288
00:21:34,000 --> 00:21:39,000
It just so happened to go through the most convoluted, complex system to fetch a file.
289
00:21:39,000 --> 00:21:44,000
Yes, there are plenty of other ways to spread files across the internet.
290
00:21:44,000 --> 00:21:48,000
CDNs are widely used, but CDNs are expensive.
291
00:21:48,000 --> 00:21:57,000
You have to balance how much money you're paying for fast, reliable, globally replicated data versus slow things that occasionally get served up.
292
00:21:57,000 --> 00:22:05,000
S3 has a pretty good pricing model that meets that middle ground: it's available, and it has 11 nines of durability.
293
00:22:05,000 --> 00:22:10,000
So it must be good to store some things and trust that it's going to be there for a very long time.
294
00:22:10,000 --> 00:22:19,000
And of course S3 has tiering systems and all this other stuff that make it even easier to just throw data at it and leave it there for who knows how long.
295
00:22:19,000 --> 00:22:26,000
But when all of a sudden those files aren't available, you start realizing that all of your applications need files.
296
00:22:26,000 --> 00:22:33,000
Not just your local applications. The websites you're using have files behind the scenes.
297
00:22:33,000 --> 00:22:37,000
Half of Apple's iCloud services were down.
298
00:22:37,000 --> 00:22:43,000
This includes issues with the App Store and iCloud backup and Apple Music and Apple TV.
299
00:22:43,000 --> 00:22:46,000
All of these services store files at the end of the day.
300
00:22:46,000 --> 00:22:54,000
And that's why something like S3 as an object storage is so critically important to the functioning of the internet.
301
00:22:54,000 --> 00:23:02,000
When it really comes down to it, the internet is basically just a bunch of remote file servers and different ways to access those files.
302
00:23:02,000 --> 00:23:07,000
Webpages are files, music is files, streaming services, all files.
303
00:23:07,000 --> 00:23:14,000
And when the files go away for four hours, turns out there's not a lot of stuff that people are able to do.
304
00:23:14,000 --> 00:23:21,000
And as Amazon's slowly recovering these services, they get to the point where they realize they have to turn things off.
305
00:23:21,000 --> 00:23:27,000
There's too much traffic going to the placement service and the index service.
306
00:23:27,000 --> 00:23:34,000
And there just aren't enough servers there. Even trying to scale them up isn't going to work, because you can add more servers to the pool,
307
00:23:34,000 --> 00:23:43,000
but it's going to be difficult for the load balancer to add them in, health check them, and do everything else, so at some point you just kind of have to reset the whole system.
308
00:23:43,000 --> 00:23:46,000
Remember back in the day with Windows XP?
309
00:23:46,000 --> 00:23:50,000
This was like back in time when we used to restart our computers regularly.
310
00:23:50,000 --> 00:23:54,000
At the end of the day, you would go home from work and you'd actually shut down your computer.
311
00:23:54,000 --> 00:24:02,000
There was a big shift in how you used your computer when Windows Vista and Windows 7 came out and sleep kind of worked reliably.
312
00:24:02,000 --> 00:24:05,000
And we didn't have to shut down our computer every day.
313
00:24:05,000 --> 00:24:13,000
But in the morning when you got to work and you turned on your computer, even though the computer was on and maybe even at the login screen,
314
00:24:13,000 --> 00:24:17,000
it still needed like 10 more minutes to be ready to use.
315
00:24:17,000 --> 00:24:21,000
Imagine that but for a quarter of the internet.
316
00:24:21,000 --> 00:24:25,000
The blast radius of S3 was something unseen before this time.
317
00:24:25,000 --> 00:24:31,000
Casey Newton from The Verge wrote this article which might as well be cloud poetry.
318
00:24:31,000 --> 00:24:32,000
He says,
319
00:24:33,000 --> 00:24:34,000
Is It Down Right Now,
320
00:24:34,000 --> 00:24:37,000
a website that tells you when websites are down,
321
00:24:37,000 --> 00:24:38,000
is down right now.
322
00:24:38,000 --> 00:24:44,000
With Is It Down Right Now Down, you will be unable to learn what other websites are down.
323
00:24:44,000 --> 00:24:46,000
At least until it's back up.
324
00:24:46,000 --> 00:24:51,000
At this time, it's not clear when Is It Down Right Now will be back up.
325
00:24:51,000 --> 00:24:58,000
Like many websites, Is It Down Right Now has been affected by the partial failure of Amazon's S3 hosting platform.
326
00:24:58,000 --> 00:25:00,000
Which is down right now.
327
00:25:00,000 --> 00:25:03,000
While we can't tell you everything that is down right now.
328
00:25:03,000 --> 00:25:10,000
Some things that are down right now include Trello, Quora, If This Then That, and ChurchWeb,
329
00:25:10,000 --> 00:25:12,000
which built your church's website.
330
00:25:12,000 --> 00:25:18,000
For other outages, you would be able to tell that these websites were down by visiting Is It Down Right Now.
331
00:25:18,000 --> 00:25:23,000
But as we mentioned earlier, Is It Down Right Now is down right now.
332
00:25:23,000 --> 00:25:28,000
This post will be updated when Is It Down Right Now is up again.
333
00:25:28,000 --> 00:25:31,000
Third party websites were not the only thing that went down.
334
00:25:31,000 --> 00:25:34,000
From the service disruption report Amazon said,
335
00:25:34,000 --> 00:25:46,000
From the beginning of this event until 19:37 UTC, we were unable to update the individual services' status on the AWS Service Health Dashboard, or SHD,
336
00:25:46,000 --> 00:25:52,000
because of a dependency the SHD administration console has on Amazon S3.
337
00:25:52,000 --> 00:25:59,000
Specifically, the Service Health Dashboard was running from a single S3 bucket hosted in US East One.
338
00:25:59,000 --> 00:26:03,000
Amazon was using the @awscloud Twitter account for updates.
339
00:26:03,000 --> 00:26:05,000
The account said,
340
00:26:05,000 --> 00:26:09,000
The dashboard not changing color is related to the S3 issue.
341
00:26:09,000 --> 00:26:12,000
See the banner at the top of the dashboard for updates.
342
00:26:12,000 --> 00:26:16,000
Granted, this was before most of the bots and Nazis took over Twitter.
343
00:26:16,000 --> 00:26:21,000
But it was still an embarrassment for AWS to have to resort to this for its announcements.
344
00:26:21,000 --> 00:26:23,000
Christopher Hansen said,
345
00:26:23,000 --> 00:26:27,000
Apparently, AWS is pretty important to and for the proper functioning of the web.
346
00:26:27,000 --> 00:26:29,000
John Battelle said,
347
00:26:29,000 --> 00:26:32,000
You never realize how important AWS is until it's down.
348
00:26:32,000 --> 00:26:34,000
Unfortunately, Azure can't say the same.
349
00:26:34,000 --> 00:26:38,000
Cassidy, who we will have on the podcast, said at the time,
350
00:26:38,000 --> 00:26:40,000
Oh man, AWS S3 buckets are down.
351
00:26:40,000 --> 00:26:42,000
Hashtag Amazon.
352
00:26:42,000 --> 00:26:48,000
Let's not forget, AWS had 15 other regions in 2017.
353
00:26:48,000 --> 00:26:53,000
But then, like now, US East One was more used than any other region.
354
00:26:53,000 --> 00:26:56,000
Maybe even all other regions combined.
355
00:26:56,000 --> 00:27:00,000
This was still pretty early cloud days for a lot of companies.
356
00:27:00,000 --> 00:27:04,000
Amazon leadership would still go to large potential customers and conferences.
357
00:27:04,000 --> 00:27:05,000
Remember those?
358
00:27:05,000 --> 00:27:07,000
And share the amazing benefits of the cloud.
359
00:27:07,000 --> 00:27:09,000
It just so happened that at the very moment,
360
00:27:09,000 --> 00:27:12,000
AWS was having one of its largest outages in history.
361
00:27:12,000 --> 00:27:16,000
Adrian Cockcroft, a recent VP hire from Netflix,
362
00:27:16,000 --> 00:27:22,000
was on stage talking about the many benefits of AWS's scale and reliability.
363
00:27:22,000 --> 00:27:25,000
So what happened after the outage?
364
00:27:25,000 --> 00:27:29,000
Websites all over the country were affected, about 148,000 websites.
365
00:27:29,000 --> 00:27:33,000
They are putting out updates from a verified account, and that stock's down half a percent.
366
00:27:33,000 --> 00:27:40,000
They believe they understand the root cause and are working hard at repairing it.
367
00:27:40,000 --> 00:27:43,000
Future updates will all be on the dashboard.
368
00:27:43,000 --> 00:27:48,000
So as we watch that, we will try and keep you updated on any new developments.
369
00:27:48,000 --> 00:27:51,000
But it apparently is affecting millions.
370
00:27:51,000 --> 00:27:55,000
It's important to note that Amazon never calls these things outages.
371
00:27:55,000 --> 00:28:00,000
The official @awscloud Twitter account called it a high rate of errors.
372
00:28:00,000 --> 00:28:02,000
For the rest of us, that just means it's an outage.
373
00:28:02,000 --> 00:28:04,000
It was mocked relentlessly.
374
00:28:04,000 --> 00:28:09,000
But Amazon put their fingers in their ears and said la la la until things blew over.
375
00:28:09,000 --> 00:28:13,000
Loup Ventures called the disruption a temporary black eye for Amazon.
376
00:28:13,000 --> 00:28:18,000
Customers would not go through the hassle of switching to a competing cloud service
377
00:28:18,000 --> 00:28:20,000
because of a one-time event, he said.
378
00:28:20,000 --> 00:28:22,000
Amazon chose to ignore this outage.
379
00:28:22,000 --> 00:28:25,000
Like many outages before it, and many that have come since.
380
00:28:25,000 --> 00:28:28,000
Beyond the couple days it had in the spotlight,
381
00:28:28,000 --> 00:28:30,000
there were lots of private apologies to customers,
382
00:28:30,000 --> 00:28:34,000
a bunch of bill reimbursements for SLAs that were broken, and that's about it.
383
00:28:34,000 --> 00:28:38,000
Two months later, during Amazon's quarterly earnings report,
384
00:28:38,000 --> 00:28:40,000
the outage wasn't even mentioned.
385
00:28:40,000 --> 00:28:43,000
As a matter of fact, Amazon stock barely noticed.
386
00:28:43,000 --> 00:28:48,000
A small dip the day it happened, and then a continued march to record growth.
387
00:28:49,000 --> 00:28:53,000
What a big player Amazon is on the internet with their cloud services.
388
00:28:53,000 --> 00:28:58,000
I think like a third of the world's cloud services is operated by Amazon.
389
00:28:58,000 --> 00:29:02,000
They didn't break the internet, but they certainly brought it to a slow,
390
00:29:02,000 --> 00:29:05,000
not a standstill yesterday, but a large number of people were affected.
391
00:29:05,000 --> 00:29:09,000
So in some perverse way, we see the stock moving higher today.
392
00:29:09,000 --> 00:29:12,000
Maybe this is a recognition of, wow, their web services is really big.
393
00:29:12,000 --> 00:29:16,000
Yes, and how big a company they are, how important they are on the internet.
394
00:29:16,000 --> 00:29:18,000
I think it's a tremendous amount of revenue.
395
00:29:18,000 --> 00:29:20,000
It's not just the online buying site.
396
00:29:20,000 --> 00:29:23,000
This web services division is huge.
397
00:29:24,000 --> 00:29:26,000
Amazon said in the disruption report,
398
00:29:26,000 --> 00:29:30,000
we are making several changes as a result of this operational event.
399
00:29:30,000 --> 00:29:33,000
While removal of capacity is a key operational practice,
400
00:29:33,000 --> 00:29:39,000
in this instance, the tool used allowed too much capacity to be removed too quickly.
401
00:29:39,000 --> 00:29:43,000
We have modified this tool to remove capacity more slowly
402
00:29:43,000 --> 00:29:46,000
and added safeguards to prevent capacity from being removed
403
00:29:46,000 --> 00:29:51,000
when it will take any subsystem below its minimum required capacity.
404
00:29:52,000 --> 00:29:55,000
Meaning, in the past when someone ran this bash script,
405
00:29:55,000 --> 00:29:57,000
it would let you scale to zero if you wanted.
406
00:29:57,000 --> 00:30:01,000
The script trusted that the human had the context to know what they were doing
407
00:30:01,000 --> 00:30:04,000
and not make any mistakes.
408
00:30:04,000 --> 00:30:07,000
But we all know that's not really how this works.
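The kind of guard Amazon describes adding afterwards might look something like this sketch. It's my own illustration, not their tooling: refuse any removal that would drop a fleet below its floor, and take capacity out in small, slow steps.

```python
import time

# My own illustration of the safeguard described above, not Amazon's tooling:
# never drop a fleet below its minimum capacity, and remove hosts in small,
# rate-limited batches so humans and alarms have time to react.
MIN_CAPACITY = {"index": 120, "placement": 40}
BATCH_SIZE = 2
PAUSE_SECONDS = 1  # would be much longer in real life

def remove_capacity(fleet: str, current: int, to_remove: int) -> int:
    floor = MIN_CAPACITY[fleet]
    if current - to_remove < floor:
        raise RuntimeError(
            f"refusing to remove {to_remove} hosts from {fleet}: "
            f"{current - to_remove} would be below the minimum of {floor}"
        )
    while to_remove > 0:
        step = min(BATCH_SIZE, to_remove)
        current -= step
        to_remove -= step
        print(f"{fleet}: removed {step}, {current} hosts remain")
        time.sleep(PAUSE_SECONDS)
    return current

remove_capacity("index", current=150, to_remove=10)    # fine, done gradually
# remove_capacity("index", current=150, to_remove=90)  # raises instead of obeying
```

The two knobs are the floor and the pace: the floor catches the fat-fingered request outright, and the slow batches give someone a chance to hit cancel before the damage is done.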
409
00:30:07,000 --> 00:30:10,000
This is an important lesson for a lot of people to learn.
410
00:30:10,000 --> 00:30:13,000
Just because a mistake was made doesn't mean there's blame.
411
00:30:13,000 --> 00:30:16,000
This script allowed the command to be run.
412
00:30:16,000 --> 00:30:19,000
Was the mistake caused by the person who entered it
413
00:30:19,000 --> 00:30:21,000
or the person who committed the code?
414
00:30:21,000 --> 00:30:23,000
The answer is neither.
415
00:30:23,000 --> 00:30:25,000
There isn't a person to blame.
416
00:30:25,000 --> 00:30:27,000
There's a system and there's consequences.
417
00:30:28,000 --> 00:30:31,000
Everyone is responsible for the safety of people and systems
418
00:30:31,000 --> 00:30:37,000
and failures are only root-caused to the point where a measurement hits a threshold.
419
00:30:37,000 --> 00:30:41,000
But the events that led up to that threshold being agreed upon
420
00:30:41,000 --> 00:30:45,000
or that measurement being tracked at all play a role.
421
00:30:45,000 --> 00:30:50,000
The entire system of people and processes is to blame when things go bad
422
00:30:50,000 --> 00:30:53,000
and is to be praised when it functions as intended.
423
00:30:53,000 --> 00:30:58,000
Amazon further says we are also auditing our other operational tools
424
00:30:58,000 --> 00:31:01,000
to ensure we have similar safety checks.
425
00:31:01,000 --> 00:31:07,000
We will also make changes to improve the recovery time of the key S3 subsystems.
426
00:31:07,000 --> 00:31:13,000
We employ multiple techniques to allow our services to recover from any failure quickly.
427
00:31:13,000 --> 00:31:18,000
One of the most important involves breaking services into small partitions
428
00:31:18,000 --> 00:31:20,000
which we call cells.
429
00:31:21,000 --> 00:31:25,000
We could further reduce the blast radius by creating smaller boundaries
430
00:31:25,000 --> 00:31:31,000
or cells and having a copy of our application in each of those boundaries.
431
00:31:31,000 --> 00:31:34,000
Now, with what we have done here, if there's an issue that happens,
432
00:31:34,000 --> 00:31:38,000
it will only be isolated within this boundary or within this cell
433
00:31:38,000 --> 00:31:43,000
reducing the blast radius and containing the failure to be only within a defined boundary.
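The mechanics behind that idea aren't complicated. Here's a minimal sketch of the general pattern, my illustration rather than AWS internals: every cell is a full, independent copy of the service, and each customer is pinned to one cell.

```python
import hashlib

# Minimal illustration of cell-based routing, not AWS internals: each cell is
# a full, independent copy of the service, and every customer is pinned to
# exactly one cell, so a failure stays inside that cell's boundary.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

def cell_for(customer_id: str) -> str:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

for customer in ("acme-corp", "globex", "initech"):
    print(customer, "->", cell_for(customer))

# If cell-2 has a bad day, only the customers hashed to cell-2 notice.
# The blast radius is one cell instead of the whole region.
```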
434
00:31:44,000 --> 00:31:48,000
Cells are a term to make you feel bad about your architecture.
435
00:31:48,000 --> 00:31:51,000
To make the cloud seem cooler than your data center
436
00:31:51,000 --> 00:31:57,000
and to pretend an architectural improvement will prevent the systemic failure of leadership.
437
00:31:57,000 --> 00:32:00,000
If you enjoyed this episode and want to hear more please let us know.
438
00:32:00,000 --> 00:32:03,000
We'll have our regular interviews starting up again in 2026.
439
00:32:03,000 --> 00:32:08,000
Until then have a happy holiday and may your pagers stay silent.
440
00:32:11,000 --> 00:32:14,000
Thank you for listening to this episode of Fork Around and Find Out.
441
00:32:14,000 --> 00:32:19,000
If you like this show please consider sharing it with a friend, a co-worker, a family member or even an enemy.
442
00:32:19,000 --> 00:32:24,000
However you get the word out about this show, it helps it become sustainable for the long term.
443
00:32:24,000 --> 00:32:30,000
If you want to sponsor this show please go to fafo.fm slash sponsor
444
00:32:30,000 --> 00:32:34,000
and reach out to us there about what you're interested in sponsoring and how we can help.
445
00:32:35,000 --> 00:32:38,000
We hope your systems stay available and your pagers stay quiet.
446
00:32:38,000 --> 00:32:40,000
We'll see you again next time.
447
00:32:44,000 --> 00:32:46,000
Thank you.
