Episode Transcript
1
00:00:00,000 --> 00:00:07,000
Hey everyone. Justin Garrison here. If this is your first Fork Around and Find Out episode you've been listening to,
2
00:00:07,000 --> 00:00:13,000
you're in for a treat. This is a different show. This is not our typical episode.
3
00:00:13,000 --> 00:00:21,000
I've been wanting to do this for a little while, and I've been inspired on this show from other folks that do podcasts,
4
00:00:21,000 --> 00:00:28,000
from people we've had on the show just in general from how I like to learn about things.
5
00:00:28,000 --> 00:00:33,000
And I can't do this show all the time. We're still going to keep the general format for the show.
6
00:00:33,000 --> 00:00:39,000
We love talking to guests. We love bringing them on, understanding the problems they're working on, and the solutions they come up with.
7
00:00:39,000 --> 00:00:44,000
That's going to be the typical show going forward. Autumn's going to be back with me in January.
8
00:00:44,000 --> 00:00:52,000
But for this episode, the closing of our first year of Fork Around and Find Out, I wanted to try something a little bit special.
9
00:00:52,000 --> 00:00:59,000
And if you like true crime podcasts, if you like a little more narrative in some of your podcast episodes,
10
00:00:59,000 --> 00:01:06,000
this one's for you because one of my favorite things is postmortems, and I love reading a good postmortem.
11
00:01:06,000 --> 00:01:08,000
So if you have any, please send them my way.
12
00:01:08,000 --> 00:01:17,000
But one of the things I really love about postmortems is understanding all of the things that aren't in the actual technical review of what happened.
13
00:01:17,000 --> 00:01:25,000
There's always some series of events, a technical failure that happened or something in the process that didn't go as planned.
14
00:01:25,000 --> 00:01:31,000
But that extends much more than just the individual systems that we work on.
15
00:01:31,000 --> 00:01:34,000
And this is my attempt at bringing some of that to you.
16
00:01:34,000 --> 00:01:44,000
When I read a postmortem in my head, I try to put together the full picture of what this may have looked like for the engineer who was dealing with it,
17
00:01:44,000 --> 00:01:47,000
or maybe even the group of people that caused the problem.
18
00:01:47,000 --> 00:01:50,000
So I would love to hear your feedback about this episode.
19
00:01:50,000 --> 00:01:54,000
There's an email-me link in the show notes, or you can reach me on Bluesky.
20
00:01:54,000 --> 00:01:56,000
Just send us a message, right?
21
00:01:56,000 --> 00:02:01,000
Just let us know what you thought because I want to hear more about not only what you thought of the show,
22
00:02:01,000 --> 00:02:04,000
but also other ideas you might have along this vein.
23
00:02:04,000 --> 00:02:06,000
Again, we're not doing this every time.
24
00:02:06,000 --> 00:02:15,000
This episode took me probably three times as long as our normal episodes do to plan and I just don't have that much time.
25
00:02:15,000 --> 00:02:21,000
But I really wanted to put this out just to put it out in the world because I think something like this should exist.
26
00:02:21,000 --> 00:02:30,000
I want it to welcome new people into technology, to understand the impact of systems and decisions made 10 years ago in a code base
27
00:02:30,000 --> 00:02:34,000
and how that plays out today when something fails.
28
00:02:34,000 --> 00:02:37,000
The music was all created by Dave Eddy, a friend of the show.
29
00:02:37,000 --> 00:02:44,000
You may know him as the You Suck at Programming guy, but he also makes some great music. Links are in the show notes. For Christmas and the holidays,
30
00:02:44,000 --> 00:02:47,000
there's just one gift that I ask of everyone listening to the show.
31
00:02:47,000 --> 00:02:50,000
There's a link to review this in your show notes.
32
00:02:50,000 --> 00:02:55,000
If you haven't already left a review on the podcast, please click one of those links and leave a review.
33
00:02:55,000 --> 00:03:03,000
We'll put a couple of them in there for various platforms that have review systems and all of them help because it just helps someone else find the show.
34
00:03:03,000 --> 00:03:12,000
And it helps someone else understand what we're trying to do here with welcoming more people into technology: understanding that there are humans behind the scenes.
35
00:03:12,000 --> 00:03:19,000
That these systems are built by real people and that our decisions in a code base have real outcomes to other people's lives.
36
00:03:19,000 --> 00:03:27,000
Any of the timestamps in this episode are in UTC, the one true time zone, and I tried to stick to the facts as much as possible.
37
00:03:27,000 --> 00:03:29,000
But there are things that we just don't know.
38
00:03:29,000 --> 00:03:33,000
Even though I worked at Amazon, I have no special insights into this outage.
39
00:03:33,000 --> 00:03:35,000
I wasn't working there at the time.
40
00:03:35,000 --> 00:03:38,000
We hope you enjoy it. Have a happy holidays and a wonderful new year.
41
00:03:38,000 --> 00:03:40,000
We'll see you in 2026.
42
00:04:00,000 --> 00:04:07,000
It's an amazing thing, the power and influence and importance of these problems and scary.
43
00:04:07,000 --> 00:04:09,000
Some more players.
44
00:04:09,000 --> 00:04:13,000
And if you had trouble getting on some internet sites yesterday, here's why.
45
00:04:13,000 --> 00:04:20,000
Amazon's powerful cloud service went down for about four hours from roughly 12:30 to 5:00 p.m.
46
00:04:20,000 --> 00:04:23,000
Thousands of internet services and media outlets were affected.
47
00:04:23,000 --> 00:04:26,000
Amazon has not said what caused the outage, Maria.
48
00:04:29,000 --> 00:04:45,000
It's cold, but not unusually cold.
49
00:04:45,000 --> 00:04:51,000
The temperature was close to freezing, but was starting to warm up by the time they arrived at work.
50
00:04:51,000 --> 00:04:58,000
February 28th, 2017 was an extremely unremarkable Tuesday.
51
00:04:58,000 --> 00:05:05,000
It was the last day of the short month of February, but the majority of the week was still ahead.
52
00:05:05,000 --> 00:05:08,000
The AWS billing system was having a problem.
53
00:05:08,000 --> 00:05:13,000
But not a SEV-1 that required immediate attention or an off-hours page.
54
00:05:13,000 --> 00:05:17,000
So when the engineer arrived at work, they knew they'd have to look into it.
55
00:05:17,000 --> 00:05:22,000
But these types of bugs were notoriously difficult to troubleshoot.
56
00:05:22,000 --> 00:05:27,000
It wasn't that billing was broken. That would have required a SEV-1 page.
57
00:05:27,000 --> 00:05:29,000
But things were slow.
58
00:05:29,000 --> 00:05:31,000
The worst kind of bug.
59
00:05:31,000 --> 00:05:36,000
They had a hunch where the problem might be, not because they were completely familiar with how the billing system worked,
60
00:05:36,000 --> 00:05:43,000
but because they had seen an increase in alerts and slowness on systems billing relied on.
61
00:05:43,000 --> 00:05:48,000
And they had tooling and runbooks to cycle those systems and hope the slowness would go away,
62
00:05:48,000 --> 00:05:53,000
or maybe the team asking for a fix would be satisfied with an attempt.
63
00:05:53,000 --> 00:05:59,000
So after a morning round of office chatter, a stand-up letting people know they'd be looking into the issue,
64
00:05:59,000 --> 00:06:04,000
and a cup of office coffee, they sat down to get started.
65
00:06:04,000 --> 00:06:12,000
They put the free bananas on their desk, wiggled their mouse, and touched their YubiKey to authenticate into their system.
66
00:06:12,000 --> 00:06:17,000
There were a handful of tools that helped them execute common runbooks.
67
00:06:18,000 --> 00:06:25,000
These tools were ancient bash scripts that might as well have been written by the aliens who built the Great Pyramids.
68
00:06:25,000 --> 00:06:32,000
The scripts had evolved from single-line tool wrappers into monstrosities of semi-portable bash.
69
00:06:32,000 --> 00:06:39,000
These executable text files make human evolution from single-celled organisms over millions of years look quaint.
70
00:06:39,000 --> 00:06:42,000
And like humans, they're pretty reliable.
71
00:06:42,000 --> 00:06:46,000
Or at least they are when you give them proper instructions.
72
00:06:46,000 --> 00:06:50,000
Unfortunately, today the instructions were less than proper.
73
00:06:50,000 --> 00:06:56,000
To reduce tool spread and make it easier to stay up to date, the scripts work on a variety of similar systems.
74
00:06:56,000 --> 00:06:59,000
The billing system was the target of today's maintenance.
75
00:06:59,000 --> 00:07:02,000
But like a stormtrooper's aim, it was off the mark.
76
00:07:02,000 --> 00:07:06,000
The key press was so insignificant it didn't need a review.
77
00:07:06,000 --> 00:07:12,000
It was done by authorized employees from an assigned office desk on company property.
78
00:07:12,000 --> 00:07:23,000
It not only took down AWS production services, it took down more than a dozen high-profile Internet darlings who placed their bets on the cloud.
79
00:07:23,000 --> 00:07:32,000
S3 has a terrible, horrible, no good, very bad day on this episode of Fork Around and Find Out.
80
00:07:32,000 --> 00:07:34,000
Amazon breaks the Internet.
81
00:07:34,000 --> 00:07:39,000
How a problem in the cloud triggered the error message sweeping across the East Coast.
82
00:07:39,000 --> 00:07:43,000
The S3 server farm started showing a high rate of errors.
83
00:07:43,000 --> 00:07:46,000
It's not clear yet what caused all these problems.
84
00:07:46,000 --> 00:07:50,000
Cloud computing service, which went down for more than four hours Tuesday.
85
00:07:53,000 --> 00:08:00,000
Welcome to Fork Around and Find Out, the podcast about building, running, and maintaining software and systems.
86
00:08:10,000 --> 00:08:14,000
Lots of people working in tech remember this day.
87
00:08:14,000 --> 00:08:19,000
Maybe not exactly what happened, but we remember how we felt.
88
00:08:19,000 --> 00:08:26,000
We might have been cloud champions at our companies, and a major four-hour outage was not going to help our case.
89
00:08:26,000 --> 00:08:30,000
Especially because Amazon wouldn't even admit it.
90
00:08:30,000 --> 00:08:33,000
The AWS cloud account on Twitter said,
91
00:08:33,000 --> 00:08:43,000
We are continuing to experience high error rates with S3 in US East One, which is impacting some other AWS services.
92
00:08:43,000 --> 00:08:47,000
They couldn't even say they were having an outage.
93
00:08:47,000 --> 00:08:53,000
Cloud naysayers were warning about dependencies that were already creeping into applications.
94
00:08:53,000 --> 00:09:03,000
This outage was the I-told-you-so moment they needed to convince senior leadership that the budget for server refreshes was a good thing to approve.
95
00:09:03,000 --> 00:09:11,000
The simple fact that Amazon's status page went down with the outage, and that updates could only be found via the AWS Twitter account,
96
00:09:11,000 --> 00:09:15,000
didn't reduce the cloud skepticism from late adopters.
97
00:09:15,000 --> 00:09:20,000
How could a single service outage in a single region have such a global impact?
98
00:09:20,000 --> 00:09:24,000
We need to start with what it is before we can talk about what happened.
99
00:09:24,000 --> 00:09:27,000
Sara Jones on Twitter at one Sara Jones said,
100
00:09:27,000 --> 00:09:33,000
Five hours ago, I'd never heard of AWS S3, and yet it has ruined my entire day.
101
00:09:33,000 --> 00:09:41,000
Amazon's S3 server is responsible for providing cloud services to about 150,000 companies around the world.
102
00:09:41,000 --> 00:09:46,000
If you're a longtime listener of this podcast, I'm sure you know what S3 is.
103
00:09:46,000 --> 00:09:49,000
But stay with me for a minute because we're going to go a bit deeper.
104
00:09:49,000 --> 00:09:55,000
S3 stands for Simple Storage Service, and it was one of the groundbreaking innovations of the cloud.
105
00:09:55,000 --> 00:10:00,000
Before S3, there were only two types of storage commonly deployed.
106
00:10:00,000 --> 00:10:03,000
There were blocks and there were files.
107
00:10:03,000 --> 00:10:05,000
Blocks were a necessary evil.
108
00:10:05,000 --> 00:10:09,000
They're the storage that low-level systems need in order to have access to bits.
109
00:10:09,000 --> 00:10:14,000
Your operating system needs blocks, but your application usually doesn't.
110
00:10:15,000 --> 00:10:17,000
Applications generally need files.
111
00:10:17,000 --> 00:10:19,000
Files are great until they're not.
112
00:10:19,000 --> 00:10:24,000
Files have actions applications can perform like read, write, and execute,
113
00:10:24,000 --> 00:10:28,000
but they also have pesky things like ownership, locks, hierarchy,
114
00:10:28,000 --> 00:10:32,000
and requirements to have some form of system to manage the files.
115
00:10:32,000 --> 00:10:34,000
Call it a file system.
116
00:10:34,000 --> 00:10:37,000
This was almost always something locally accessible to the application,
117
00:10:37,000 --> 00:10:40,000
or at least something that appeared local.
118
00:10:40,000 --> 00:10:45,000
NFS, or the network file system, was pivotal for scaling file storage
119
00:10:45,000 --> 00:10:49,000
by trading network latency for large storage systems.
120
00:10:49,000 --> 00:10:54,000
Pooling a bunch of computers to appear like one large resource wasn't new,
121
00:10:54,000 --> 00:10:57,000
but usually they were limited to POSIX constraints
122
00:10:57,000 --> 00:11:02,000
because the systems accessing them had to pretend they were local resources.
123
00:11:02,000 --> 00:11:06,000
Databases were another option for removing the need for files.
124
00:11:06,000 --> 00:11:08,000
Store all the stateful stuff somewhere else.
125
00:11:08,000 --> 00:11:12,000
Remote connectivity and advanced query syntax sounds like a great option,
126
00:11:12,000 --> 00:11:15,000
but those queries aren't always the easiest thing to work with.
127
00:11:15,000 --> 00:11:17,000
Connection pooling becomes a problem,
128
00:11:17,000 --> 00:11:21,000
and you need to understand what data you have and how it's going to be used
129
00:11:21,000 --> 00:11:23,000
before you really start storing it.
130
00:11:23,000 --> 00:11:26,000
Object storage didn't have those trade-offs.
131
00:11:26,000 --> 00:11:28,000
Object storage doesn't have a hierarchy.
132
00:11:28,000 --> 00:11:34,000
It doesn't need a file system, and it's definitely not POSIX constrained.
133
00:11:34,000 --> 00:11:39,000
There's no SQL queries or connection pooling, just verbs like get and put.
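If you've never touched S3, here's roughly what that looks like in practice. This is a minimal sketch using boto3; the bucket name is made up, and it assumes your AWS credentials are already configured.

```python
import boto3

# Minimal sketch only: "example-bucket" is a made-up name, and this assumes
# AWS credentials are already configured in your environment.
s3 = boto3.client("s3")

# PUT: store bytes under a key. No directories, no file system, no locks.
# The slash in the key is just part of the name, not a folder hierarchy.
s3.put_object(Bucket="example-bucket", Key="episodes/finale.mp3", Body=b"not actually an mp3")

# GET: fetch it back by bucket and key.
response = s3.get_object(Bucket="example-bucket", Key="episodes/finale.mp3")
print(response["Body"].read())
```

That really is most of the day-to-day API surface: a bucket, a key, and a blob of bytes.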
134
00:11:39,000 --> 00:11:44,000
It's one of the few trillion-dollar ideas that the cloud has created.
135
00:11:44,000 --> 00:11:46,000
Well, it didn't create the idea of object storage,
136
00:11:46,000 --> 00:11:50,000
but it definitely made it a viable option for millions of users around the world,
137
00:11:50,000 --> 00:11:54,000
and S3 is the standard for object storage.
138
00:11:54,000 --> 00:11:59,000
And on February 28, 2017, it broke.
139
00:11:59,000 --> 00:12:02,000
@netgarun on Twitter said,
140
00:12:02,000 --> 00:12:04,000
Happy AWS Appreciation Day, Internet.
141
00:12:04,000 --> 00:12:07,000
S3 isn't just another AWS service.
142
00:12:07,000 --> 00:12:11,000
It's a foundational piece of the entire AWS empire.
143
00:12:11,000 --> 00:12:14,000
You can usually tell how important services are to AWS
144
00:12:14,000 --> 00:12:19,000
by how many nines they assign to their availability SLAs.
145
00:12:19,000 --> 00:12:26,000
Route 53 is the only AWS service that guarantees a 100% SLA.
146
00:12:26,000 --> 00:12:31,000
As a matter of fact, I think it's the only SaaS that exists with 100% SLA.
147
00:12:31,000 --> 00:12:33,000
It's pretty important.
148
00:12:33,000 --> 00:12:39,000
There's only a handful of Amazon's 200 services that have four nines of availability,
149
00:12:39,000 --> 00:12:41,000
and S3 is one of them.
150
00:12:41,000 --> 00:12:46,000
This level allows for only about four minutes of downtime per month,
151
00:12:46,000 --> 00:12:49,000
or roughly 52 minutes for the entire year.
152
00:12:49,000 --> 00:12:52,000
You can't finish a single episode of Stranger Things
153
00:12:52,000 --> 00:12:56,000
in the amount of time S3 would get for downtime in a year.
154
00:12:56,000 --> 00:13:03,000
S3's availability SLA is often confused with its durability SLA of 11 nines,
155
00:13:03,000 --> 00:13:05,000
a ridiculous number to ponder,
156
00:13:05,000 --> 00:13:11,000
and an odd form of SRE peacocking that somehow sounds more impressive than 100%.
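If you want to put numbers on those nines yourself, the back-of-the-napkin math is simple. This is just the generic calculation, not anything out of Amazon's SLA documents.

```python
# Back-of-the-napkin downtime budget for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines: int) -> tuple[float, float]:
    unavailability = 10 ** -nines               # e.g. 4 nines -> 0.0001
    per_year = MINUTES_PER_YEAR * unavailability
    return per_year / 12, per_year              # (per month, per year)

print(downtime_minutes(3))   # (~43.8, ~525.6) minutes for three nines
print(downtime_minutes(4))   # (~4.4, ~52.6) minutes for four nines
print(downtime_minutes(11))  # effectively zero: about a third of a millisecond per year
```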
157
00:13:11,000 --> 00:13:18,000
But even four nines of availability takes more than a few servers and a load balancer to achieve.
158
00:13:18,000 --> 00:13:24,000
The backend of a storage system that stores this much data this reliably has a lot of components.
159
00:13:24,000 --> 00:13:27,000
The service obviously has a web front-end.
160
00:13:27,000 --> 00:13:30,000
It has authentication, and there's millions of disks to store the data.
161
00:13:30,000 --> 00:13:32,000
But we're not going to focus on those parts.
162
00:13:32,000 --> 00:13:36,000
But we are going to look at how S3 gets and puts objects into the system.
163
00:13:36,000 --> 00:13:41,000
The core idea of making S3 scalable is sharding data across many different hard drives.
164
00:13:41,000 --> 00:13:43,000
Amazon does this with a lot of different services,
165
00:13:43,000 --> 00:13:47,000
and they call their method of sharding data shuffle sharding.
166
00:13:48,000 --> 00:13:50,000
It does make it very difficult for us to scale,
167
00:13:50,000 --> 00:13:53,000
because we have to think about what if customers become unbalanced over time.
168
00:13:53,000 --> 00:13:56,000
So we do something a little bit different on S3.
169
00:13:56,000 --> 00:14:02,000
And it's actually an example of a pattern that we talk about often in AWS called shuffle sharding.
170
00:14:02,000 --> 00:14:04,000
The idea of shuffle sharding is actually pretty simple.
171
00:14:04,000 --> 00:14:07,000
It's that rather than statically assigning a workload to a drive
172
00:14:07,000 --> 00:14:11,000
or to any other kind of resource, a CPU, a GPU, what have you,
173
00:14:11,000 --> 00:14:15,000
we randomly spread workloads across the fleet, across our drives.
174
00:14:15,000 --> 00:14:19,000
So when you do a put, we'll pick a random set of drives to put those bytes on.
175
00:14:19,000 --> 00:14:21,000
Maybe we pick these two.
176
00:14:21,000 --> 00:14:24,000
The next time you do a put, even to the same bucket, even to the same key,
177
00:14:24,000 --> 00:14:26,000
it doesn't even matter, right?
178
00:14:26,000 --> 00:14:27,000
We'll pick a different set of drives.
179
00:14:27,000 --> 00:14:30,000
They might overlap, but we'll make a new random choice
180
00:14:30,000 --> 00:14:32,000
of which drives to use for that put.
181
00:14:33,000 --> 00:14:37,000
Shuffle sharding is basically a predictable way for you to assign resources
182
00:14:37,000 --> 00:14:40,000
to a customer and spread those resources
183
00:14:40,000 --> 00:14:43,000
so two customers don't keep getting grouped together.
184
00:14:43,000 --> 00:14:47,000
If Disney, Netflix, HBO, and Peacock are all customers
185
00:14:47,000 --> 00:14:50,000
and they each get two servers, shuffle sharding would make sure
186
00:14:50,000 --> 00:14:53,000
that Disney and Netflix might be co-located on one server,
187
00:14:53,000 --> 00:14:58,000
but Netflix's second allocated server would be shared with HBO instead.
188
00:14:58,000 --> 00:15:02,000
This allows for things like the Disney and Netflix server to go down,
189
00:15:02,000 --> 00:15:05,000
but Netflix still has some availability.
190
00:15:05,000 --> 00:15:09,000
Or if HBO is a noisy neighbor and consumes all the resources on the second server,
191
00:15:09,000 --> 00:15:13,000
Netflix still has a server that's not overloaded.
192
00:15:13,000 --> 00:15:16,000
This spreads the load between customers.
193
00:15:16,000 --> 00:15:20,000
Assuming those customers don't buy each other in multi-billion dollar acquisitions,
194
00:15:20,000 --> 00:15:22,000
but that's not really Amazon's problem.
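Here's a rough sketch of that idea in code. It's purely illustrative, not Amazon's actual algorithm: every customer gets a random combination of servers, so it's unlikely any two customers share their entire set.

```python
import itertools
import random

# Toy shuffle sharding, purely illustrative: each customer is assigned a
# random combination of servers, so no two customers are likely to share
# their entire set, only individual servers.
SERVERS = [f"server-{i}" for i in range(8)]
SHARD_SIZE = 2

rng = random.Random(2017)
customers = ["Disney", "Netflix", "HBO", "Peacock"]
shards = {c: tuple(sorted(rng.sample(SERVERS, SHARD_SIZE))) for c in customers}

for customer, shard in shards.items():
    print(f"{customer:8} -> {shard}")

# Any two customers usually overlap on at most one server, so a dead server
# or a noisy neighbor hurts a customer's capacity without wiping it out.
for a, b in itertools.combinations(customers, 2):
    overlap = set(shards[a]) & set(shards[b])
    print(f"{a} and {b} share: {sorted(overlap) or 'nothing'}")
```

With 8 servers and shards of 2 there are 28 possible combinations, which is the whole trick: the more servers and the bigger the shards, the less likely two customers ever land on an identical set.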
195
00:15:22,000 --> 00:15:24,000
In order to do this for large storage systems,
196
00:15:24,000 --> 00:15:29,000
you need to have services that store metadata about where blobs are stored.
197
00:15:29,000 --> 00:15:34,000
This service is in the critical path for putting data into the system.
198
00:15:34,000 --> 00:15:37,000
Because if you can't keep track of where the data went,
199
00:15:37,000 --> 00:15:38,000
you might as well just delete it.
200
00:15:38,000 --> 00:15:43,000
Of course, S3 writes data in multiple places throughout multiple data centers.
201
00:15:43,000 --> 00:15:45,000
The details don't matter for this outage,
202
00:15:45,000 --> 00:15:49,000
but what does matter is that there's a critical service within S3
203
00:15:49,000 --> 00:15:53,000
that keeps track of where all this data is stored, called the placement service.
204
00:15:53,000 --> 00:15:56,000
New puts go through the placement service.
205
00:15:56,000 --> 00:16:01,000
All other requests, get, list, put, delete, go through the index service.
206
00:16:01,000 --> 00:16:04,000
People usually put data before they get it,
207
00:16:04,000 --> 00:16:07,000
but not all customers know what they're doing.
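As a mental model, you can think of it something like this sketch. The names and shapes here are my guess at the general structure described in the postmortem, not Amazon's actual implementation.

```python
import random

# A toy model of the two metadata subsystems: a guess at the shape of the
# system for illustration, not how S3 is actually implemented.
DRIVES = [f"drive-{i}" for i in range(100)]

class PlacementService:
    """Decides which drives receive the bytes for a new put."""
    def choose_drives(self, copies: int = 3) -> list[str]:
        return random.sample(DRIVES, copies)

class IndexService:
    """Remembers where every object's bytes ended up."""
    def __init__(self) -> None:
        self._locations: dict[tuple[str, str], list[str]] = {}

    def record(self, bucket: str, key: str, drives: list[str]) -> None:
        self._locations[(bucket, key)] = drives

    def lookup(self, bucket: str, key: str) -> list[str]:
        return self._locations[(bucket, key)]

placement, index = PlacementService(), IndexService()

# PUT: placement picks the drives, the index records the choice. Both are in
# the critical path -- lose the index and you've effectively lost the object.
chosen = placement.choose_drives()
index.record("example-bucket", "song.mp3", chosen)

# GET: only the index is needed to find where the bytes live.
print(index.lookup("example-bucket", "song.mp3"))
```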
208
00:16:08,000 --> 00:16:12,000
But the important thing here is that we chunk up the data that you send us
209
00:16:12,000 --> 00:16:15,000
and we store it redundantly across a set of drives.
210
00:16:15,000 --> 00:16:20,000
Those drives are running a piece of software called Shardstore.
211
00:16:20,000 --> 00:16:25,000
So Shardstore is the file system that we run on our storage node hosts.
212
00:16:25,000 --> 00:16:28,000
Shardstore is something that we've written ourselves.
213
00:16:28,000 --> 00:16:32,000
There's public papers on Shardstore, which James will talk about in his section.
214
00:16:32,000 --> 00:16:36,000
But at its core, it's a log-structured file system.
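A log-structured store, in miniature, is just an append-only log plus an index of where each key's latest value lives. This little sketch is the textbook idea only, not anything taken from the ShardStore paper.

```python
# A miniature log-structured key-value store: writes only ever append to the
# log, and an index maps each key to the position of its latest value.
# Textbook idea only, not anything taken from the ShardStore paper.
class TinyLogStore:
    def __init__(self) -> None:
        self.log: list[bytes] = []           # stand-in for an append-only file
        self.index: dict[str, int] = {}      # key -> position in the log

    def put(self, key: str, value: bytes) -> None:
        self.log.append(value)               # never overwrite, always append
        self.index[key] = len(self.log) - 1

    def get(self, key: str) -> bytes:
        return self.log[self.index[key]]     # stale entries wait for compaction

store = TinyLogStore()
store.put("shard-42", b"version 1")
store.put("shard-42", b"version 2")          # supersedes version 1
print(store.get("shard-42"))                 # b'version 2'
```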
215
00:16:37,000 --> 00:16:41,000
The internal names for S3 services are not as boring as index and placement.
216
00:16:41,000 --> 00:16:44,000
They have, or at least at one time they had,
217
00:16:44,000 --> 00:16:51,000
cool quirky names like R2D2, Death Star, Mr. Biggs, PMS, and Cramps.
218
00:16:51,000 --> 00:16:54,000
But the postmortem would have had a lot more to explain
219
00:16:54,000 --> 00:16:57,000
if they used the internal names to describe these systems.
220
00:16:57,000 --> 00:17:03,000
After the break, we'll talk about what happened on February 28th, 2017.
221
00:17:03,000 --> 00:17:07,000
On the morning of February 28th, an engineer went to scale down
222
00:17:07,000 --> 00:17:10,000
a small set of servers used by the billing system.
223
00:17:10,000 --> 00:17:13,000
They were using a runbook, which included a pre-approved script,
224
00:17:13,000 --> 00:17:16,000
to manage the scale-down process.
225
00:17:16,000 --> 00:17:19,000
Unfortunately, they mistyped the command.
226
00:17:19,000 --> 00:17:23,000
We've all done it, and some of us have even taken down production with our mistype.
227
00:17:23,000 --> 00:17:26,000
This mistype took down a lot of productions.
228
00:17:26,000 --> 00:17:30,000
I envision it like a bash script with a couple of flags
229
00:17:30,000 --> 00:17:32,000
and some arguments that need to go in a certain order.
230
00:17:32,000 --> 00:17:36,000
And if you change that order, it might do something different.
231
00:17:36,000 --> 00:17:39,000
I don't know the details, but we can see how this would happen.
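Just to make that concrete, here's an invented example of the failure mode, not the real tool: when the fleet and the count are both free-form arguments, nothing stops a swapped or fat-fingered value from parsing cleanly and targeting the wrong thing.

```python
# An invented example of the failure mode, not Amazon's actual tooling.
# Both values are free-form positional arguments, so a swapped or mistyped
# argument still "parses" and the script happily does something different.
def remove_capacity(args: list[str]) -> None:
    fleet, count = args[0], args[1]
    print(f"removing {count} hosts from fleet {fleet!r}")

remove_capacity(["billing-frontend", "3"])   # what the runbook intended
remove_capacity(["3", "billing-frontend"])   # same script, very different day
```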
232
00:17:39,000 --> 00:17:42,000
Instead of removing a few servers for the billing service,
233
00:17:42,000 --> 00:17:48,000
they removed more than 50% of the servers for the index and placement services.
234
00:17:48,000 --> 00:17:53,000
Amazon's service disruption report was posted two days after the outage.
235
00:17:53,000 --> 00:17:56,000
It says, removing a significant portion of the capacity
236
00:17:56,000 --> 00:18:01,000
caused each of these systems to require a full restart.
237
00:18:01,000 --> 00:18:05,000
It's unknown why this amount of capacity required a full restart.
238
00:18:05,000 --> 00:18:08,000
You would think the service would just be slow,
239
00:18:08,000 --> 00:18:10,000
or only specific customers would be affected.
240
00:18:10,000 --> 00:18:13,000
After all, Amazon talks about how great shuffle sharding is.
241
00:18:13,000 --> 00:18:17,000
But for some reason, these services were not configured this way,
242
00:18:17,000 --> 00:18:21,000
or at some level of capacity, you had to shut everything down.
243
00:18:21,000 --> 00:18:23,000
They go on to say,
244
00:18:23,000 --> 00:18:28,000
S3 subsystems are designed to support the removal or failure of significant capacity
245
00:18:28,000 --> 00:18:31,000
with little or no customer impact.
246
00:18:31,000 --> 00:18:35,000
We build our systems with the assumptions that things will occasionally fail,
247
00:18:35,000 --> 00:18:38,000
and we rely on the ability to remove and replace capacity
248
00:18:38,000 --> 00:18:41,000
as one of our core operational processes.
249
00:18:41,000 --> 00:18:45,000
While this is an operation that we have relied on to maintain our systems,
250
00:18:45,000 --> 00:18:47,000
since the launch of S3,
251
00:18:47,000 --> 00:18:51,000
we have not completely restarted the index subsystem
252
00:18:51,000 --> 00:18:57,000
or the placement subsystem in our larger regions for many years.
253
00:18:57,000 --> 00:19:00,000
And again, this is 2017.
254
00:19:00,000 --> 00:19:05,000
Many years, in my opinion, makes it more than three,
255
00:19:05,000 --> 00:19:08,000
and S3 launched back in 2006.
256
00:19:08,000 --> 00:19:15,000
So these services had gone at least three years, and possibly much longer, without being completely restarted.
257
00:19:15,000 --> 00:19:17,000
But why does any of this matter?
258
00:19:17,000 --> 00:19:20,000
S3 is just a storage mechanism.
259
00:19:20,000 --> 00:19:21,000
Who cares?
260
00:19:21,000 --> 00:19:27,000
Who cares if some files don't get put into storage and retrieved from that storage?
261
00:19:27,000 --> 00:19:33,000
Well, when a service is this critical, other Amazon services get built on it.
262
00:19:33,000 --> 00:19:38,000
So while S3 was down, other Amazon services also went down.
263
00:19:38,000 --> 00:19:41,000
Things like EC2 instances.
264
00:19:41,000 --> 00:19:46,000
You could not launch a new VM in AWS without S3.
265
00:19:46,000 --> 00:19:52,000
New Lambda invocations couldn't scale up because those Lambda functions were stored in S3.
266
00:19:52,000 --> 00:19:56,000
EBS volumes that needed snapshot restores, those also come from S3.
267
00:19:56,000 --> 00:20:00,000
Different load balancers within AWS, S3.
268
00:20:00,000 --> 00:20:07,000
There were so many services that cascaded down into failure because S3 objects were not available.
269
00:20:07,000 --> 00:20:12,000
NPR called this Amazon's $150 million typo.
270
00:20:12,000 --> 00:20:18,000
Because while S3 was down, it's estimated that of the companies in the S&P 500,
271
00:20:18,000 --> 00:20:22,000
those companies lost $150 million in value.
272
00:20:22,000 --> 00:20:24,000
But that's still not the complete picture here.
273
00:20:24,000 --> 00:20:27,000
So many other companies lost productivity.
274
00:20:27,000 --> 00:20:29,000
This isn't just stock trading.
275
00:20:29,000 --> 00:20:31,000
A bunch of companies couldn't do work.
276
00:20:31,000 --> 00:20:32,000
They hired people.
277
00:20:32,000 --> 00:20:34,000
They sat around for hours.
278
00:20:34,000 --> 00:20:43,000
An internet monitoring company, Apica, found that over half of the top 100 online retailers saw their site performance slow by 20%.
279
00:20:43,000 --> 00:20:48,000
Half of the biggest sellers on the internet had slow websites for the day.
280
00:20:48,000 --> 00:20:57,000
And the CEO of Catchpoint estimated that the total ramifications of this one typo would be in the hundreds of billions of dollars in lost productivity.
281
00:20:57,000 --> 00:21:01,000
People are sitting around at work and they can't do their work.
282
00:21:01,000 --> 00:21:08,000
So many websites were down for four hours in an Amazon outage that the whole world felt the impact.
283
00:21:08,000 --> 00:21:12,000
S3 may just be storing objects at an HTTP endpoint.
284
00:21:12,000 --> 00:21:18,000
But so many things we rely on rely on things being available.
285
00:21:18,000 --> 00:21:24,000
This is why Spotify was down and various companies were down because those music files that you stream,
286
00:21:24,000 --> 00:21:29,000
when you look all the way back where they come from, they come from storage somewhere.
287
00:21:29,000 --> 00:21:34,000
Like it's still a file that exists somewhere on a hard drive in a data center.
288
00:21:34,000 --> 00:21:39,000
It just so happened to go through the most convoluted, complex system to fetch a file.
289
00:21:39,000 --> 00:21:44,000
Yes, there are plenty of other ways to spread files across the internet.
290
00:21:44,000 --> 00:21:48,000
CDNs are widely used, but CDNs are expensive.
291
00:21:48,000 --> 00:21:57,000
You have to balance how much money you're paying for fast, reliable, globally replicated data versus slow things that occasionally get served up.
292
00:21:57,000 --> 00:22:05,000
S3 has a pretty good pricing model that meets that middle ground: it's available, and it has 11 nines of durability.
293
00:22:05,000 --> 00:22:10,000
So it must be good to store some things and trust that it's going to be there for a very long time.
294
00:22:10,000 --> 00:22:19,000
And of course S3 has tiering systems and all this other stuff that make it even easier to just throw data at it and leave it there for who knows how long.
295
00:22:19,000 --> 00:22:26,000
But when all of a sudden those files aren't available, you start realizing that all of your applications need files.
296
00:22:26,000 --> 00:22:33,000
Not just your local applications. The websites you're using have files behind the scenes.
297
00:22:33,000 --> 00:22:37,000
Half of Apple's iCloud services were down.
298
00:22:37,000 --> 00:22:43,000
This includes issues with the App Store and iCloud backup and Apple Music and Apple TV.
299
00:22:43,000 --> 00:22:46,000
All of these services store files at the end of the day.
300
00:22:46,000 --> 00:22:54,000
And that's why something like S3 as an object storage is so critically important to the functioning of the internet.
301
00:22:54,000 --> 00:23:02,000
When it really comes down to it, the internet is basically just a bunch of remote file servers and different ways to access those files.
302
00:23:02,000 --> 00:23:07,000
Webpages are files, music is files, streaming services, all files.
303
00:23:07,000 --> 00:23:14,000
And when the files go away for four hours, turns out there's not a lot of stuff that people are able to do.
304
00:23:14,000 --> 00:23:21,000
And as Amazon's slowly recovering these services, they get to the point where they realize they have to turn things off.
305
00:23:21,000 --> 00:23:27,000
There's too much traffic going to the placement service and the index service.
306
00:23:27,000 --> 00:23:34,000
And there just aren't enough servers there. Even trying to scale them up isn't going to work, because you can add more servers to the pool,
307
00:23:34,000 --> 00:23:43,000
but it's going to be difficult for the load balancer to add them in, health check them, and do everything else, so at some point you just kind of have to reset the whole system.
308
00:23:43,000 --> 00:23:46,000
Remember back in the day with Windows XP?
309
00:23:46,000 --> 00:23:50,000
This was like back in time when we used to restart our computers regularly.
310
00:23:50,000 --> 00:23:54,000
At the end of the day, you would go home from work and you'd actually shut down your computer.
311
00:23:54,000 --> 00:24:02,000
There was a big shift in how you used your computer when Windows Vista and Windows 7 came out and sleep kind of worked reliably.
312
00:24:02,000 --> 00:24:05,000
And we didn't have to shut down our computer every day.
313
00:24:05,000 --> 00:24:13,000
But in the morning when you got to work and you turned on your computer, even though the computer was on and maybe even at the login screen,
314
00:24:13,000 --> 00:24:17,000
it still needed like 10 more minutes to be ready to use.
315
00:24:17,000 --> 00:24:21,000
Imagine that but for a quarter of the internet.
316
00:24:21,000 --> 00:24:25,000
The blast radius of S3 was something unseen before this time.
317
00:24:25,000 --> 00:24:31,000
Casey Newton from The Verge wrote this article which might as well be cloud poetry.
318
00:24:31,000 --> 00:24:32,000
He says,
319
00:24:33,000 --> 00:24:34,000
Is It Down Right Now,
320
00:24:34,000 --> 00:24:37,000
a website that tells you when websites are down,
321
00:24:37,000 --> 00:24:38,000
is down right now.
322
00:24:38,000 --> 00:24:44,000
With Is It Down Right Now Down, you will be unable to learn what other websites are down.
323
00:24:44,000 --> 00:24:46,000
At least until it's back up.
324
00:24:46,000 --> 00:24:51,000
At this time, it's not clear when Is It Down Right Now will be back up.
325
00:24:51,000 --> 00:24:58,000
Like many websites, Is It Down Right Now has been affected by the partial failure of Amazon's S3 hosting platform.
326
00:24:58,000 --> 00:25:00,000
Which is down right now.
327
00:25:00,000 --> 00:25:03,000
While we can't tell you everything that is down right now.
328
00:25:03,000 --> 00:25:10,000
Some things that are down right now include Trello, Quora, If This Then That, and ChurchWeb,
329
00:25:10,000 --> 00:25:12,000
which built your church's website.
330
00:25:12,000 --> 00:25:18,000
For other outages, you would be able to tell that these websites were down by visiting Is It Down Right Now.
331
00:25:18,000 --> 00:25:23,000
But as we mentioned earlier, Is It Down Right Now is down right now.
332
00:25:23,000 --> 00:25:28,000
This post will be updated when Is It Down Right Now is up again.
333
00:25:28,000 --> 00:25:31,000
Third party websites were not the only thing that went down.
334
00:25:31,000 --> 00:25:34,000
From the service disruption report Amazon said,
335
00:25:34,000 --> 00:25:46,000
From the beginning of this event until 19:37 UTC, we were unable to update the individual services' status on the AWS Service Health Dashboard, or SHD,
336
00:25:46,000 --> 00:25:52,000
because of a dependency the SHD administration console has on Amazon S3.
337
00:25:52,000 --> 00:25:59,000
Specifically, the Service Health Dashboard was running from a single S3 bucket hosted in US East One.
338
00:25:59,000 --> 00:26:03,000
Amazon was using the @awscloud Twitter account for updates.
339
00:26:03,000 --> 00:26:05,000
The account said,
340
00:26:05,000 --> 00:26:09,000
The dashboard not changing color is related to the S3 issue.
341
00:26:09,000 --> 00:26:12,000
See the banner at the top of the dashboard for updates.
342
00:26:12,000 --> 00:26:16,000
Granted, this was before most of the bots and Nazis took over Twitter.
343
00:26:16,000 --> 00:26:21,000
But it was still an embarrassment for AWS to have to resort to this for its announcements.
344
00:26:21,000 --> 00:26:23,000
Christopher Hansen said,
345
00:26:23,000 --> 00:26:27,000
Apparently, AWS is pretty important to and for the proper functioning of the web.
346
00:26:27,000 --> 00:26:29,000
John Battelle said,
347
00:26:29,000 --> 00:26:32,000
You never realize how important AWS is until it's down.
348
00:26:32,000 --> 00:26:34,000
Unfortunately, Azure can't say the same.
349
00:26:34,000 --> 00:26:38,000
Cassidy, who we will have on the podcast, said at the time,
350
00:26:38,000 --> 00:26:40,000
Oh man, AWS S3 buckets are down.
351
00:26:40,000 --> 00:26:42,000
Hashtag Amazon.
352
00:26:42,000 --> 00:26:48,000
Let's not forget, AWS had 15 other regions in 2017.
353
00:26:48,000 --> 00:26:53,000
But then, like now, US East One was more used than any other region.
354
00:26:53,000 --> 00:26:56,000
Maybe even all other regions combined.
355
00:26:56,000 --> 00:27:00,000
This was still pretty early cloud days for a lot of companies.
356
00:27:00,000 --> 00:27:04,000
Amazon leadership would still go to large potential customers and conferences.
357
00:27:04,000 --> 00:27:05,000
Remember those?
358
00:27:05,000 --> 00:27:07,000
And share the amazing benefits of the cloud.
359
00:27:07,000 --> 00:27:09,000
It just so happened that at the very moment,
360
00:27:09,000 --> 00:27:12,000
AWS was having one of its largest outages in history.
361
00:27:12,000 --> 00:27:16,000
Adrian Cockcroft, a recent VP hire from Netflix,
362
00:27:16,000 --> 00:27:22,000
was on stage talking about the many benefits of AWS's scale and reliability.
363
00:27:22,000 --> 00:27:25,000
So what happened after the outage?
364
00:27:25,000 --> 00:27:29,000
Websites all over the country were affected, about 148,000 websites.
365
00:27:29,000 --> 00:27:33,000
They are putting out updates from a verified account, and that stock's down half a percent.
366
00:27:33,000 --> 00:27:40,000
They believe they understand the root cause and are working hard at repairing it.
367
00:27:40,000 --> 00:27:43,000
Future updates will all be on the dashboard.
368
00:27:43,000 --> 00:27:48,000
So as we watch that, we will try and keep you updated on any new developments.
369
00:27:48,000 --> 00:27:51,000
But it apparently is affecting millions.
370
00:27:51,000 --> 00:27:55,000
It's important to note that Amazon never calls these things outages.
371
00:27:55,000 --> 00:28:00,000
The official @awscloud Twitter account called it a high rate of errors.
372
00:28:00,000 --> 00:28:02,000
For the rest of us, that just means it's an outage.
373
00:28:02,000 --> 00:28:04,000
It was mocked relentlessly.
374
00:28:04,000 --> 00:28:09,000
But Amazon put their fingers in their ears and said la la la until things blew over.
375
00:28:09,000 --> 00:28:13,000
Loup Ventures called the disruption a temporary black eye for Amazon.
376
00:28:13,000 --> 00:28:18,000
Customers would not go through the hassle of switching to a competing cloud service
377
00:28:18,000 --> 00:28:20,000
because of a one-time event, he said.
378
00:28:20,000 --> 00:28:22,000
Amazon chose to ignore this outage.
379
00:28:22,000 --> 00:28:25,000
Like many outages before it, and many that have come since.
380
00:28:25,000 --> 00:28:28,000
Beyond the couple days it had in the spotlight,
381
00:28:28,000 --> 00:28:30,000
there were lots of private apologies to customers,
382
00:28:30,000 --> 00:28:34,000
a bunch of bill reimbursements for SLAs that were broken, and that's about it.
383
00:28:34,000 --> 00:28:38,000
Two months later, during Amazon's quarterly earnings report,
384
00:28:38,000 --> 00:28:40,000
the outage wasn't even mentioned.
385
00:28:40,000 --> 00:28:43,000
As a matter of fact, Amazon stock barely noticed.
386
00:28:43,000 --> 00:28:48,000
A small dip the day it happened, and then a continued march to record growth.
387
00:28:49,000 --> 00:28:53,000
What a big player Amazon is on the internet with their cloud services.
388
00:28:53,000 --> 00:28:58,000
I think like a third of the world's cloud services is operated by Amazon.
389
00:28:58,000 --> 00:29:02,000
They didn't break the internet, but they certainly brought it to a slow,
390
00:29:02,000 --> 00:29:05,000
not a standstill yesterday, but a large number of people were affected.
391
00:29:05,000 --> 00:29:09,000
So in some perverse way, we see the stock moving higher today.
392
00:29:09,000 --> 00:29:12,000
Maybe this is a recognition of, wow, their web services is really big.
393
00:29:12,000 --> 00:29:16,000
Yes, and how big a company they are, how important they are on the internet.
394
00:29:16,000 --> 00:29:18,000
I think it's a tremendous amount of revenue.
395
00:29:18,000 --> 00:29:20,000
It's not just the online buying site.
396
00:29:20,000 --> 00:29:23,000
This web services division is huge.
397
00:29:24,000 --> 00:29:26,000
Amazon said in the disruption report,
398
00:29:26,000 --> 00:29:30,000
we are making several changes as a result of this operational event.
399
00:29:30,000 --> 00:29:33,000
While removal of capacity is a key operational practice,
400
00:29:33,000 --> 00:29:39,000
in this instance, the tool used allowed too much capacity to be removed too quickly.
401
00:29:39,000 --> 00:29:43,000
We have modified this tool to remove capacity more slowly
402
00:29:43,000 --> 00:29:46,000
and added safeguards to prevent capacity from being removed
403
00:29:46,000 --> 00:29:51,000
when it will take any subsystem below its minimum required capacity.
404
00:29:52,000 --> 00:29:55,000
Meaning, in the past when someone ran this bash script,
405
00:29:55,000 --> 00:29:57,000
it would let you scale to zero if you wanted.
406
00:29:57,000 --> 00:30:01,000
The script trusted that the human had the context to know what they were doing
407
00:30:01,000 --> 00:30:04,000
and not make any mistakes.
408
00:30:04,000 --> 00:30:07,000
But we all know that's not really how this works.
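The kind of guard Amazon describes adding afterwards might look something like this sketch. It's my own illustration, not their tooling: refuse any removal that would drop a fleet below its floor, and take capacity out in small, slow steps.

```python
import time

# My own illustration of the safeguard described above, not Amazon's tooling:
# never drop a fleet below its minimum capacity, and remove hosts in small,
# rate-limited batches so humans and alarms have time to react.
MIN_CAPACITY = {"index": 120, "placement": 40}
BATCH_SIZE = 2
PAUSE_SECONDS = 1  # would be much longer in real life

def remove_capacity(fleet: str, current: int, to_remove: int) -> int:
    floor = MIN_CAPACITY[fleet]
    if current - to_remove < floor:
        raise RuntimeError(
            f"refusing to remove {to_remove} hosts from {fleet}: "
            f"{current - to_remove} would be below the minimum of {floor}"
        )
    while to_remove > 0:
        step = min(BATCH_SIZE, to_remove)
        current -= step
        to_remove -= step
        print(f"{fleet}: removed {step}, {current} hosts remain")
        time.sleep(PAUSE_SECONDS)
    return current

remove_capacity("index", current=150, to_remove=10)    # fine, done gradually
# remove_capacity("index", current=150, to_remove=90)  # raises instead of obeying
```

The two knobs are the floor and the pace: the floor catches the fat-fingered request outright, and the slow batches give someone a chance to hit cancel before the damage is done.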
409
00:30:07,000 --> 00:30:10,000
This is an important lesson for a lot of people to learn.
410
00:30:10,000 --> 00:30:13,000
Just because a mistake was made doesn't mean there's blame.
411
00:30:13,000 --> 00:30:16,000
This script allowed the command to be run.
412
00:30:16,000 --> 00:30:19,000
Was the mistake caused by the person who entered it
413
00:30:19,000 --> 00:30:21,000
or the person who committed the code?
414
00:30:21,000 --> 00:30:23,000
The answer is neither.
415
00:30:23,000 --> 00:30:25,000
There isn't a person to blame.
416
00:30:25,000 --> 00:30:27,000
There's a system and there's consequences.
417
00:30:28,000 --> 00:30:31,000
Everyone is responsible for the safety of people and systems
418
00:30:31,000 --> 00:30:37,000
and failures are only root-caused to the point where a measurement hits a threshold.
419
00:30:37,000 --> 00:30:41,000
But the events that led up to that threshold being agreed upon
420
00:30:41,000 --> 00:30:45,000
or that measurement being tracked at all play a role.
421
00:30:45,000 --> 00:30:50,000
The entire system of people and processes is to blame when things go bad
422
00:30:50,000 --> 00:30:53,000
and is to be praised when it functions as intended.
423
00:30:53,000 --> 00:30:58,000
Amazon further says we are also auditing our other operational tools
424
00:30:58,000 --> 00:31:01,000
to ensure we have similar safety checks.
425
00:31:01,000 --> 00:31:07,000
We will also make changes to improve the recovery time of the key S3 subsystems.
426
00:31:07,000 --> 00:31:13,000
We employ multiple techniques to allow our services to recover from any failure quickly.
427
00:31:13,000 --> 00:31:18,000
One of the most important involves breaking services into small partitions
428
00:31:18,000 --> 00:31:20,000
which we call cells.
429
00:31:21,000 --> 00:31:25,000
We could further reduce the blast radius by creating smaller boundaries
430
00:31:25,000 --> 00:31:31,000
or cells and having a copy of our application in each of those boundaries.
431
00:31:31,000 --> 00:31:34,000
Now, with what we have done here, if there's an issue that happens,
432
00:31:34,000 --> 00:31:38,000
it will only be isolated within this boundary or within this cell
433
00:31:38,000 --> 00:31:43,000
reducing the blast radius and containing the failure to be only within a defined boundary.
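The mechanics behind that idea aren't complicated. Here's a minimal sketch of the general pattern, my illustration rather than AWS internals: every cell is a full, independent copy of the service, and each customer is pinned to one cell.

```python
import hashlib

# Minimal illustration of cell-based routing, not AWS internals: each cell is
# a full, independent copy of the service, and every customer is pinned to
# exactly one cell, so a failure stays inside that cell's boundary.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

def cell_for(customer_id: str) -> str:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

for customer in ("acme-corp", "globex", "initech"):
    print(customer, "->", cell_for(customer))

# If cell-2 has a bad day, only the customers hashed to cell-2 notice.
# The blast radius is one cell instead of the whole region.
```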
434
00:31:44,000 --> 00:31:48,000
Cells are a term to make you feel bad about your architecture.
435
00:31:48,000 --> 00:31:51,000
To make the cloud seem cooler than your data center
436
00:31:51,000 --> 00:31:57,000
and to pretend an architectural improvement will prevent the systemic failure of leadership.
437
00:31:57,000 --> 00:32:00,000
If you enjoyed this episode and want to hear more please let us know.
438
00:32:00,000 --> 00:32:03,000
We'll have our regular interviews starting up again in 2026.
439
00:32:03,000 --> 00:32:08,000
Until then have a happy holiday and may your pagers stay silent.
440
00:32:11,000 --> 00:32:14,000
Thank you for listening to this episode of Fork Around and Find Out.
441
00:32:14,000 --> 00:32:19,000
If you like this show please consider sharing it with a friend, a co-worker, a family member or even an enemy.
442
00:32:19,000 --> 00:32:24,000
However you get the word out about this show, it helps it become sustainable for the long term.
443
00:32:24,000 --> 00:32:30,000
If you want to sponsor this show please go to fafo.fm slash sponsor
444
00:32:30,000 --> 00:32:34,000
and reach out to us there about what you're interested in sponsoring and how we can help.
445
00:32:35,000 --> 00:32:38,000
We hope your systems stay available and your pagers stay quiet.
446
00:32:38,000 --> 00:32:40,000
We'll see you again next time.
447
00:32:44,000 --> 00:32:46,000
Thank you.
