
Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Episode Transcript

Ben Hartshorne: For all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks. I'll give you the story: we used LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag defaults in code, and if we can't reach our service, we'll go ahead and use those. And if we can reach our service, great, we'll update them. And if we can update them once, that's great. If we can connect to the streaming service, even better. And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of a service unavailability, things will continue to work made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching, so that it can continue to serve the correct answer, so far as it knew from whenever it lost its connection.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is one of those folks that I am disappointed I have not had on the show until now, just because I assumed I already had. Ben Hartshorne is a principal engineer at Honeycomb, but oh, so much more than that. Ben, thank you for deigning to join us.

Ben Hartshorne: It's lovely to be here this morning.

Corey: This episode is sponsored in part by my day job, The Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things: predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.
To learn more, visit duckbillhq.com. Remember, you can't duck the Duckbill Bill, which my CEO reliably informs me is absolutely not our slogan.

So you gave a talk roughly a month ago at the inaugural FinOps meetup in San Francisco. Give us the high level. What did you talk about?

Ben Hartshorne: Well, I got to talk about two stories. I love telling stories. I got to talk about two stories of how we used Honeycomb and instrumentation to help optimize our cloud spending, a topic near and dear to your heart, which is what brought me there. We've got to look at the overall bill and say, hey, where are some of the big things coming from? Obviously it's people sending us data and people asking us questions about those data.

Corey: And if they would just stop both of those things, your bill would be so much better.

Ben Hartshorne: It would be so much smaller. So would my salary, unfortunately. So we wanted to reduce some of those costs, but it's a problem that's hard to get into just from a general perspective. You need to really get in and look at all the details to find out what you're gonna change. So I got to tell two stories about reducing costs: one by switching from AMD to Arm architecture on Amazon, that's the Graviton chipset, which is fantastic, and the other was about the amazing power of spreadsheets. As much as I love graphs, I also love spreadsheets. I'm sorry. It's a personal failing, perhaps.

Corey: It's wild to me how many tools out there do all kinds of business-adjacent things, but somehow never bother to realize that if you can just export a CSV, suddenly you're speaking kind of the language of your ultimate user. Play with pandas a little bit more and spit out an actual Excel file, and now you're cooking with gas.

Ben Hartshorne: Mm-hmm. So the second story is about doing that with Honeycomb.
Taking a number of different graphs, and looking at five different attributes of our Lambda costs and what was going into them, and making changes across all of them in order to accomplish an overall cost reduction of about 50%. Which is really great. So the story does combine my love of graphs, because we've got to see the three lines go down, and the power of spreadsheets, and also this idea that you can't just look for one answer to find the solution to your problems around, well, anything really, but especially around reducing costs. It's going to be a bunch of small things that you can put together into one place.

Corey: There's a lot that's valuable when we start going down that particular path, of starting to look at things through a lens of a particular kind of data that you otherwise wouldn't think to. I maintain that you remain the only customer we have found so far that uses Honeycomb to completely instrument their AWS bill. We had not seen that before or since. It makes sense for you to do it that way, absolutely. It's a bit of a heavy lift for, shall we say, everyone else.

Ben Hartshorne: And it actually is a bit of a lift for us too. To say we've instrumented the entire bill is a wonderful thing to assert, and, as we've talked about, we use the power of spreadsheets too. So there are some aspects of our AWS spending, and actually really dominant ones, that lend themselves very easily to being instrumented using Honeycomb. The best example is Lambda, because Lambda is charged on a per-millisecond basis, and our instrumentation is collecting spans, traces about your compute, on a per-millisecond basis. There's a very easy translation there, and so we can get really good insight into which customers are spending how much, or rather, which customers are causing us to spend how much
to provide our product to them, and understand how we can balance our development resources to both provide new features and also understand when we need to shift and spend our attention managing costs.

Corey: There's a continuum here, and I think that it tends to follow a lot around company ethos and company culture, where folks have varying degrees of insight into the factors that drive their cloud spend. You are clearly an observability company; you have been observing your AWS bill for, I would argue, longer than it would've made sense to on some level. In the very early days you were doing this, and your AWS bill was not the limiting factor to your company's success back in those days, but you did grow into it. Other folks, even at very large enterprise scale, more or less do this based on vibes. And most folks, I think, tend to fall somewhere in the middle of this, but it's not evenly distributed. Some teams tend to have a very deep insight into what they're doing, and others are, "Amazon bill? You mean the books?" Again, most tend to fall somewhere in the center of that. It's a law of large numbers; everything starts to revert to a mean past a certain point.

Ben Hartshorne: Well, I mean, you wouldn't have a job if they didn't make it a bit of a challenge to do so.

Corey: Or I might have a better job, depending, but we'll see. I do wanna detour a little bit here, because as we record this, it is the day after AWS's big, significant outage. I could really mess with the conspiracy theorists and say it is their first major outage of October of 2025, and then people are like, wait, what do you mean? What do you mean, this is World War I? Like, same type of approach. But these things do tend to cluster. How was your day yesterday?

Ben Hartshorne: Well, it did start very early. Our service has a presence in multiple regions, but we do have our main US instance in Amazon's US East 1. And so as things stopped working, a lot of our service stopped working too. I mean, the outage was significant, but wasn't pervasive.
There were still some things that kept functioning, and amazingly, we actually preserved all of the customer telemetry that made it to our front door successfully. Which is a big deal, because we hate dropping data.

Corey: Yeah, that took some work in engineering, and I have to imagine this was also not an accident.

Ben Hartshorne: It was not an accident. Now, their ability to query that data during the outage suffered.

Corey: I'm gonna push back on you on that for a second there. When AWS's US East 1, where you have a significant workload, is impacted to this degree, how important is observability? I know that when I've dealt with outages in the past, the first thing you try and figure out is, is it my shitty, shitty code, or is it a global issue? That's important. And once you establish it's a global issue, then you can begin the mitigation part of that process. And yes, observability becomes extraordinarily important there for some things, but for others, less so. There's also, at least with the cloud being as big as it is now, some reputational headline risk protection here, in that no one is talking about your site going down in some weird ways yesterday. Everyone's talking about AWS going down, like they own the reputation of this.

Ben Hartshorne: Yeah, that's true. And also, when a business's customers are asking them, "Which parts of your service are working? I know AWS is having a thing; how badly is it affecting you?" you wanna be able to give them a solid answer. So our customers were asking us yesterday, hey, are you dropping our data? And we wanted to be able to give them a reasonable answer, even in the moment. So yes, we were able to deflect a certain amount of the reputational harm. But at the same time, there are people that have come back and said, well, I mean, shouldn't you have done better? It's important for us to be able to rebuild our business and to move region to region, and we need you to help us do that too.

Corey: Oh, absolutely.
And I actually encountered a lot of this yesterday when I, early in the morning, tried to get a, what was it, a Halloween costume, and the Amazon site was not working properly for some strange reason. Now, if I read some of the relatively out-of-touch analyses in the mainstream press, that's billions and billions of dollars lost. Therefore, I either went to go get a Halloween costume from another vendor, or I will never wear a Halloween costume this year. Better luck in 2026. Neither of those is necessarily true.

Ben Hartshorne: And that's really exactly why we were focused on preserving, successfully storing, our customers' data in the moment. Because then, when the time comes afterwards, it's like, okay, we said what we said in the moment. Now they're asking us, okay, what really happened? That data is invaluable in helping our customers piece together which parts of their services were working and which weren't, at what times.

Corey: Did you see a drop in telemetry during the outage?

Ben Hartshorne: Yep, for sure.

Corey: Is that because people's systems were down, or is that because their systems could not communicate out?

Ben Hartshorne: Both.

Corey: Excellent.

Ben Hartshorne: We did get some reports from our customers that, specifically, the OpenTelemetry Collector that was gathering the data from their application was unable to successfully send it to Honeycomb. At the same time, we were not rejecting it. So clearly there were challenges in the path between those two things, whether that was in AWS's network or some other network unable to get to AWS, I dunno. So we definitely saw there were issues of reachability, and undoubtedly there was some data drop there that's completely out of our control. The only part we could speak to is that once the data got to us, we were able to successfully store it. So the question is, was it customers' apps going down? Absolutely, many of our customers were down, and they were unable to send us any telemetry because their app was offline.
But the other side is also true: the ones that were up were having trouble getting to us because of our location in US East.

Corey: Now, to continue reading what the mainstream press had to say about this, does that mean that you are now actively considering evacuating AWS entirely to go to a different provider that can be more reliable, probably building your own data centers?

Ben Hartshorne: Yeah, you know, I've heard people say that's the thing to do these days. Now, I have helped build data centers in the past.

Corey: As have I. There's a reason that both of us have a job that does not involve that.

Ben Hartshorne: There is. The data centers I built were not as reliable as any of the data centers that are available from our big public cloud providers.

Corey: I would've said, unless you worked at one of those companies building the data centers, and even back then, given the time you've been at Honeycomb, I can say with certainty, you are not as good at running data centers as they are, because effectively no one is. This is something that you get to learn about at significant scale. The concern, as I see it, is one of consolidation, but I've seen too many folks try and go multi-cloud for resilience reasons, and all they've done is add a second single point of failure. So now they're exposed to everyone's outage, and when that happens, their site continues to fall down in different ways, as opposed to being more resilient, which is a hell of a lot more than just picking multiple providers.

Ben Hartshorne: But there is something to say, though, of looking at a business and saying, okay, what is the cost for us to be, you know, single-region versus what is the cost to be fully multi-region, where we can fail over in an instant and nobody notices? Those cost differences are huge. And for most businesses...

Corey: Of course, it's a massive investment. At least 10x.

Ben Hartshorne: Yeah. So for most businesses, you're not gonna go that far.

Corey: My newsletter publication is entirely bound within US West 2; that just happened to be for latency purposes, not reliability reasons.
But if the region is hard down and I need to send an email newsletter and it's down for several days, I'm writing that one by hand, 'cause I've got a different story to tell that week. I don't need it to do the business-as-usual thing. And that's a reflection of architecture and investment decisions reflecting the reality of my business.

Ben Hartshorne: Yes. And that's exactly where to start. And there are things you can do within a region to increase a little bit of resilience to certain services within that region suffering. So, as an example, I don't remember how many years ago it was, but Amazon had an outage in KMS, the key management service. And that basically made everything stop. You can probably find out exactly when it happened.

Corey: Yes, I'm pulling that up now. Please continue. I'm curious now.

Ben Hartshorne: They provide a really easy way to replicate all of your keys to another region, and a pretty easy way to fail over accessing those keys from one region to another. So even if you're not gonna be fully multi-region, you can insulate against individual services that might have an incident, and prevent any one service from having an outsized impact on your application. You know, we don't need their keys most of the time, but when you do need them, you kind of need them to start your application. So if you need to scale up or do something like that and it's not available, you're really out of luck. So the thing is, I don't wanna advocate that people try and go fully multi-region, but that's not to say that we abdicate all responsibility for insulating our application from transient outages in our dependencies.

Corey: Yeah. To be clear, they did not do a formal writeup on the KMS issue on their basically kind of not-terrific list of post-event summaries. Things have to be sort of noisy for that to hit. I'm sure yesterday's will wind up on that list; they'll have had that up before this thing publishes. But yeah, they did not put the KMS issue there. You're completely correct.
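
For reference, the key replication Ben describes maps onto KMS multi-Region keys. The sketch below is illustrative only, not Honeycomb's setup: the Regions, key description, and fallback helper are assumptions for the example.

```python
import boto3

# Hypothetical Regions for the example.
PRIMARY_REGION = "us-east-1"
REPLICA_REGION = "us-west-2"

# Create a multi-Region primary key. MultiRegion must be set at creation time;
# an existing single-Region key cannot be converted.
kms_primary = boto3.client("kms", region_name=PRIMARY_REGION)
key = kms_primary.create_key(
    Description="example data key (multi-Region primary)",
    MultiRegion=True,
)
key_id = key["KeyMetadata"]["KeyId"]

# Replicate it into the second Region. Primary and replica share key material
# and key ID, so ciphertext produced in one Region can be decrypted in the other.
kms_primary.replicate_key(KeyId=key_id, ReplicaRegion=REPLICA_REGION)


def decrypt_with_fallback(ciphertext: bytes) -> bytes:
    """Try KMS in the primary Region first; fall back to the replica Region."""
    last_error = None
    for region in (PRIMARY_REGION, REPLICA_REGION):
        try:
            kms = boto3.client("kms", region_name=region)
            return kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
        except Exception as err:  # keep the sketch simple; real code would be pickier
            last_error = err
    raise RuntimeError("KMS unavailable in both Regions") from last_error
```
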
Corey: This is the sort of thing of, what is the blast radius of these issues? And I think that there's this sense that before we went to the cloud, everything was more reliable, but just the opposite is true. The difference is that if we were all building our data centers today, my shitty stuff at Duckbill is down, as it is every random Tuesday, and tomorrow Honeycomb is down because, oops, it turns out you once again forgot to replace a bad hard drive. Cool. But those are not happening at the same time. When you start with the centralization story, suddenly a disproportionate swath of the world is down simultaneously, and that's where things get weird. It gets even harder, though, because you can test your durability and your resilience as much as you want, but it doesn't account for the challenge of third-party providers on your critical path. You obviously need to make sure that, in order for Honeycomb to work, Honeycomb itself has to be up. That's sort of step one. But to do that, AWS itself has to be up in certain places. What other vendors factor into this?

Ben Hartshorne: You know, that was, I think, the most interesting part of yesterday's challenge, bringing the service back up: we do rely on an incredible number of other services. There's some list of all of our vendors that is hundreds long. Now, those are obviously very different parts of the business. They involve, you know, companies we contract with for marketing outreach and for business and for all of that.

Corey: Right. We use Dropbox here, and if Dropbox is down, that doesn't necessarily impact our ability to wind up serving our customers, but it does mean I need to find a different way, for example, to get the recorded file from this podcast over to my editing team.

Ben Hartshorne: Yeah, so there's the very long list, and then there's the much, much shorter list of vendors that are really in the critical path, and we have a bunch of those too.
We use vendors for feature flagging and for sending email and for some other forms of telemetry that are destined for other spots. For the most part, when we get that many vendors all relying on each other and they're all down at once, there's this bootstrapping problem where they're all trying to come back, but they all sort of rely on each other in order to come back successfully. And I think that's part of what made yesterday morning's outage move from roughly, what, midnight to 3:00 AM Pacific all the way through the rest of the day, and still have issues with some companies up until five, six, 7:00 PM.

Corey: This episode is sponsored by my own company, The Duckbill Group. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where The Duckbill Group comes in to help. Remember, you can't duck the Duckbill Bill, which I am reliably informed by my business partner is absolutely not our motto. To learn more, visit duckbillhq.com.

The Google SRE book talked about this, oh geez, when was it, 15 years ago now? Damn near that. That at some point, when a service goes down and then it starts to recover, everything that depends on it will often basically pummel it back into submission, trying to talk to the thing. Like, I remember back when I worked as a senior systems engineer at Media Temple, in the days before GoDaddy bought and then ultimately killed them. I was touring the data center my first week. We had three different facilities; I was in one of them, and I asked, okay, great, say I just trip over things and hit the emergency power off switch and kill the entire data center. There's an order that you have to bring things back up in the event of those catastrophic outages. Is there a runbook? And of course there was. Great, where is it? Oh, it's in Confluence. Terrific. Where's that? Oh, in the rack over there.
And I looked at the data center manager, and she was delightful and incredibly on point, and she knew exactly where I was going: we're gonna print that out right now. Excellent, excellent. Like, that's why you ask. It's someone who has never seen it before, but knows how these things work, going through that, because you build dependency on top of dependency, and you never get the luxury of taking a step back and looking at it with fresh eyes. But that's what our industry has done. But you have your vendors that have their own critical dependencies, that they may or may not have done as good a job as you have of identifying, and so on and so forth. It's the end of a very long chain that does kind of eat itself at some point.

Ben Hartshorne: Yeah. There are two things that that brings to mind. First, we absolutely saw exactly what you're describing yesterday in our traffic patterns, where the volume of incoming traffic would sort of come along, and then it would drop as their services went off, and then it's quiet for a little while, and then we get this huge spike as they're trying to, you know, bring everything back on all at once. Thankfully those were sort of spread out across our customers, so we didn't have just one enormous spike hit all of our servers. But we did see them on a per-customer basis. It's a very real pattern. But the second one: for all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks. I'll give you the story: we used LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag defaults in code, and if we can't reach our service, we'll go ahead and use those. And if we can reach our service, great, we'll update them. And if we can update them once, that's great. If we can connect to the streaming service, even better.
And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of a service unavailability, things will continue to work made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching, so that it can continue to serve the correct answer, so far as it knew from whenever it lost its connection. But it means that if they have a transient outage, our stuff doesn't break. And that kind of design really makes recovering from these interdependent outages feasible, in a way that the strict ordering you were describing just is really difficult.

Corey: At least in my case, I have the luxury of knowing these things just because I'm old and I figured this out before it was SRE common knowledge, or SRE was a widely acknowledged thing, where, okay, you have a job server that runs cron jobs every day, and it turns out that, oh, when you find it missed a cron job, oopsy doozy, that's a problem for some of those things. So now you start building in error checking and the rest, and then you do a restore from a three-day-old backup for that thing, and it suddenly thinks it missed all the cron jobs and runs them all, and then hammers some other system to death when it shouldn't. And you learn iteratively of, oh, that's kind of a failure mode. Like when you start externalizing and hardening APIs you build, you learn very quickly: everything needs a rate limit, and you need a way to make bad actors stop hammering your endpoints. Not just bad actors, naive ones.
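
The fallback chain Ben describes, in-code defaults first, then a one-time fetch, then streaming updates, shows up directly in how a flag gets evaluated. A minimal sketch using the LaunchDarkly Python SDK; the SDK key, flag key, and context here are placeholder assumptions, not Honeycomb's real configuration.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Placeholder SDK key for illustration; in production this comes from config.
ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()

# A hypothetical evaluation context for a backend service.
context = Context.builder("ingest-worker-1").kind("service").build()

# The third argument is the in-code default. If LaunchDarkly is unreachable at
# startup, evaluation falls back to this value; once the SDK has connected at
# least once, it keeps answering from its local cache of the last known flag
# state rather than refusing to answer.
use_new_ingest_path = client.variation("new-ingest-path", context, False)

if use_new_ingest_path:
    ...  # new behavior behind the flag
else:
    ...  # existing behavior
```
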
Ben Hartshorne: And rate limits are a good example, because that is one of the things that did happen yesterday. As people were coming back, we actually wound up needing to rate limit ourselves. We didn't have to rate limit our customers, but, so, brief digression here: Honeycomb uses Honeycomb in order to build Honeycomb. We are our own observability vendor. Now, this leads to some obvious challenges in architecture. You know, how do we know we're right? Well, in the beginning we did have some other services that we'd use to checkpoint our numbers and make sure that they were actually correct. But our production instance sits here and serves our customers, and all of its telemetry goes into the next one down the chain. We call that dogfood, because of, you know, the whole phrase of eating your own dog food; drinking your own champagne is the other, more pleasing version. So from our production, it goes to dogfood. And from dogfood? Well, what's dog food made of? It's made of kibble. So our third environment is called kibble. So the dogfood telemetry goes into this third environment, and that third environment, well, we need to know if it's working too, so it feeds back into our production instance. Each of these instances is emitting telemetry, and we have our rate limiting and, I'm sorry, our tail sampling proxy, called Refinery, that helps us reduce volume so it's not a positively amplifying cycle. But in this incident yesterday, we started emitting logs that we don't normally emit. These are coming from some of our SDKs when they can't reach their services. And so suddenly we started getting two or three or four log entries for every event we were sending, and we did get into this kind of amplifying cycle. So we put a pretty heavy rate limit on the kibble environment in order to squash that traffic and disrupt the cycle. Which made it difficult to ensure that it was working correctly, but it was, and that led us to make sure that the production instance was working all right. But this idea of rate limits being a critical part of maintaining an interconnected stack, in order to suppress these kinds of wavelike oscillations that start growing on each other and amplifying themselves and can take any infrastructure down, and being able to put in, at just the right point, a couple of switches and say, nope, suppress that signal, really made a big difference in our ability to bring back all of the services.
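
A rate limit of the kind Ben describes, clamping the lowest environment in the chain to break the feedback loop, can be as simple as a token bucket in front of the sender. This is an illustrative sketch only, not how Honeycomb or Refinery actually implements it; the numbers and names are made up.

```python
import time


class TokenBucket:
    """Allow roughly `rate` events per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical usage: heavily limit telemetry from the lowest environment.
kibble_limiter = TokenBucket(rate=100.0, capacity=200.0)


def send_event(event: dict) -> None:
    if kibble_limiter.allow():
        ...  # forward to the telemetry backend
    # else: drop it; losing some self-telemetry beats feeding an amplifying cycle
```
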
Corey: I want to pivot to one last topic, but we could talk about this outage for days and hours. There's something that you mentioned you wanted to go into that I wanted to pick a fight with you over, which was how to get people to instrument their applications for observability, so they can understand their applications, their performance, and the rest. And I'm gonna go with the easy answer, because it's a pain in the ass. Ben, have you tried instrumenting an application that already exists without having to spend a week on it?

Ben Hartshorne: You're not wrong. It's a pain in the ass, and it's getting better. There's lots of ways to make it better. There are packages that do auto-instrumentation.

Corey: Oh yeah, absolutely. For my case, yeah, it's Claude Code's problem now. I'm getting another drink.

Ben Hartshorne: You know, you say that in jest, and yet they are actually getting really good.

Corey: No, that's what I've been doing. It works super well. You test it first, obviously, but yeah. YOLO, slam that into production. But yeah.

Ben Hartshorne: The LLMs are actually getting pretty good at understanding where instrumentation can be useful. I say "understanding," I put that in air quotes. They're good at finding code that represents a good place to put instrumentation, and adding it to your code in the right place.

Corey: I need to take another try one of these days.
The last time I played with Honeycomb, I instrumented my home Kubernetes cluster, and I exceeded the limits of the free tier, based on ingest volume, by the second day of every month. And that led to either, you have really unfair limits, which I don't believe to be true, or the more insightful question: what the hell is my Kubernetes cluster doing that's that chatty? So I rebuilt the whole thing from scratch, so it's time for me to go back and figure that out.

Ben Hartshorne: Yeah. So, I will say, a lot of instrumentation is terrible. A lot of instrumentation is based on this idea that every single signal must be published all the time, and that's not relevant to you as a person running the Kubernetes cluster. You know, do you need to know every time a local pod checks in to see whether it needs to be evicted? No, you don't. What you're interested in are the types of activities that are relevant to what you need to do as an operator of that cluster. And the same is true of an application. If you just, you know, put, in the tracing language, a span on every single function call, you will not have useful traces, because it doesn't map to a useful way of representing your user's journey through your product. So there's definitely some nuance to getting the right level of instrumentation, and I think the right level is not a single place; it's a continuously moving spectrum based on what you were trying to understand about what your application is doing. So, at least at Honeycomb, we add instrumentation all the time, and we remove instrumentation all the time, because what's relevant to me now, as I'm building out this feature, is different from what I need to know about that feature once it is fully built and stable and running in a regular workload. Furthermore, as I'm looking at a specific problem or question, and we talked about, you know, pricing for Lambdas at the beginning of this, there was a time when we really wanted to understand pricing for S3. And part of our model, it's a struggle: part of our storage model is that we store our customers' telemetry in S3, in many files.
And we put instrumentation around every single S3 access, in order to understand both the volume and the latency of those, to see, like, okay, should we bundle them up or resize them like this, and how does that influence SLOs, and so on. And it's incredibly expensive to do that kind of experiment. And it's not just expensive in dollars; adding that level of instrumentation does have an impact on the overall performance of the system. When you're making 10,000 calls to S3 and you add a span around every one, it takes a bit more time. So once we understood the system well enough to make the change we wanted to make, we pulled all that back out. So for your Kubernetes cluster, you know, maybe it's interesting at the very beginning to look at every single connection that any process might make. But if it's your home cluster, that's not really what you need to know as an operator. So finding the right balance there, of instrumentation that lets you fulfill the needs of the business, that lets you understand the needs of the operator in order to best be able to provide the service that this business is providing to its customers: it's a place somewhere there in the middle, and you're gonna need some people to find it.

Corey: And that's easier said than done for a lot of folks. But you're right, it is getting easier to instrument these things. It is something that is iteratively getting better all the time, to the point where now, like, this is an area where AI is surprisingly effective. It doesn't take a lot to wrap a function call with a decorator.

Ben Hartshorne: Mm-hmm. It just takes a lot of doing that over and over and over again.
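
Wrapping a function call with a decorator, as Corey puts it, is roughly what the OpenTelemetry Python SDK offers for manual instrumentation. A minimal sketch; the span and attribute names are invented for illustration, and the console exporter stands in for whatever backend the spans would really go to.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal setup: print spans to the console. A real service would configure an
# OTLP exporter pointed at its collector or vendor instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example.instrumentation")


@tracer.start_as_current_span("fetch_customer_usage")  # one decorator, one span
def fetch_customer_usage(customer_id: str) -> dict:
    # Attributes on the span are what make the trace queryable later:
    # group by customer.id, sum durations, and so on.
    trace.get_current_span().set_attribute("customer.id", customer_id)
    return {"customer_id": customer_id, "events": 0}  # placeholder result


if __name__ == "__main__":
    fetch_customer_usage("cust-123")
```
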
Ben Hartshorne: You do a lot of them, and you see what it looks like, and then you see, okay, which ones of these are actually useful to me now. And we want to be open to that changing, and willing to understand that this is an evolving thing. And this does actually tie back to one of the core operating principles of modern SaaS architectures: the ability to deploy your code quickly. Because if you're in this cycle of adding instrumentation, of removing instrumentation, and you see a bug, it has to be easy enough to add a little bit more data to get insight into that bug in order to resolve it. Otherwise nobody's gonna do it, and the whole business suffers for it.

Corey: What is "quickly," to you?

Ben Hartshorne: I need to make this change and have it visible in my test environment: a couple of minutes. I need to make this change and have it visible running in production: it depends on, like, how frequent the bug is, but I'm actually okay with it being about an hour for that kind of turnaround. I know a lot of people say you should have your code running in 15 minutes. That's great. I know that's out of reach for a lot of people in a lot of industries. So I'm not a hardliner on how quickly it has to be, but it can't be a week. It can't even be a day, because you're gonna wanna do this two or three times in the course of resolving a bug. And so if it's something too long, you're just really pushing out any ability to respond quickly to a customer.

Corey: I really wanna thank you for taking the time to speak with me about all this. If people wanna learn more, where's the best place for them to go?

Ben Hartshorne: You know, I have backed off of almost all of the platforms in which people carry on conversations on the internet.

Corey: Everyone seems to have done this.

Ben Hartshorne: I did work for Facebook for two and a half years, and...

Corey: Someday I might forgive you.

Ben Hartshorne: Someday I might forgive myself. It was a really different environment, and I could see the allure of the world they're trying to create, and it doesn't match. Oh, I interviewed there in 2009; it was incredibly compelling. It doesn't match the view that I see of the world we're in. And so, I have a presence at Honeycomb. I do have accounts on all of the major platforms, so you can find me there.
There will be links afterwards, I'm sure, but: LinkedIn, Bluesky, I dunno. GitHub, is that a social media platform now?

Corey: They wish. We'll put all this in the show notes. Problem solved for us. Thank you so much for taking the time to speak with me. I appreciate it.

Ben Hartshorne: It's a real pleasure. Thank you.

Corey: Ben Hartshorne is a principal engineer at Honeycomb. One of them; they possibly have more than one. Seems to be something you can scale, unlike my nonsense as Chief Cloud Economist at The Duckbill Group. And this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that won't work, because that platform is down and not accepting comments at this moment.
