
Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Episode Transcript

Ben Hartshorne: For all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks. I'll give you the story: we used LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag defaults in code, and if we can't reach our service, we'll go ahead and use those. And if we can reach our service, great, we'll update them. And if we can update them once, that's great. If we can connect to the streaming service, even better. And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of a service unavailability, things will continue to work made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching, so that it can continue to serve the correct answer, so far as it knew from whenever it lost its connection.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is one of those folks that I am disappointed I have not had on the show until now, just because I assumed I already had. Ben Hartshorne is a principal engineer at Honeycomb, but oh, so much more than that. Ben, thank you for deigning to join us.

Ben Hartshorne: It's lovely to be here this morning.

Corey: This episode is sponsored in part by my day job, The Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things: predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.
To learn more, visit duckbillhq.com. Remember, you can't duck the Duckbill Bill, which my CEO reliably informs me is absolutely not our slogan.

So you gave a talk roughly a month ago at the inaugural FinOps meetup in San Francisco. Give us the high level. What did you talk about?

Ben Hartshorne: Well, I got to talk about two stories. I love telling stories. I got to talk about two stories of how we used Honeycomb and instrumentation to help optimize our cloud spending, a topic near and dear to your heart, which is what brought me there. We've got to look at the overall bill and say, hey, where are some of the big things coming from? Obviously it's people sending us data and people asking us questions about those data.

Corey: And if they would just stop both of those things, your bill would be so much better.

Ben Hartshorne: It would be so much smaller. So would my salary, unfortunately. So we wanted to reduce some of those costs, but it's a problem that's hard to get into just from a general perspective. You need to really get in and look at all the details to find out what you're gonna change. So I got to tell two stories about reducing costs: one by switching from AMD to Arm architecture on Amazon, that's the Graviton chipset, which is fantastic, and the other was about the amazing power of spreadsheets. As much as I love graphs, I also love spreadsheets. I'm sorry. It's a personal failing, perhaps.

Corey: It's wild to me how many tools out there do all kinds of business-adjacent things, but somehow never bother to realize that if you can just export a CSV, suddenly you're speaking kind of the language of your ultimate user. Play with pandas a little bit more and spit out an actual Excel file, and now you're cooking with gas.

Ben Hartshorne: Mm-hmm. So the second story is about doing that with Honeycomb.
Taking a number of different graphs, and looking at five different attributes of our Lambda costs and what was going into them, and making changes across all of them in order to accomplish an overall cost reduction of about 50%. Which is really great. So the story does combine my love of graphs, because we've got to see the three lines go down, and the power of spreadsheets, and also this idea that you can't just look for one answer to find the solution to your problems around, well, anything really, but especially around reducing costs. It's going to be a bunch of small things that you can put together into one place.

Corey: There's a lot that's valuable when we start going down that particular path, of starting to look at things through a lens of a particular kind of data that you otherwise wouldn't think to. I maintain that you remain the only customer we have found so far that uses Honeycomb to completely instrument their AWS bill. We had not seen that before or since. It makes sense for you to do it that way, absolutely. It's a bit of a heavy lift for, shall we say, everyone else.

Ben Hartshorne: And it actually is a bit of a lift for us too. To say we've instrumented the entire bill is a wonderful thing to assert, and, as we've talked about, we use the power of spreadsheets too. So there are some aspects of our AWS spending, and actually really dominant ones, that lend themselves very easily to being instrumented using Honeycomb. The best example is Lambda, because Lambda is charged on a per-millisecond basis, and our instrumentation is collecting spans, traces about your compute, on a per-millisecond basis. There's a very easy translation there, and so we can get really good insight into which customers are spending how much, or rather, which customers are causing us to spend how much
to provide our product to them, and understand how we can balance our development resources to both provide new features and also understand when we need to shift and spend our attention managing costs.

Corey: There's a continuum here, and I think that it tends to follow a lot around company ethos and company culture, where folks have varying degrees of insight into the factors that drive their cloud spend. You are clearly an observability company; you have been observing your AWS bill for, I would argue, longer than it would've made sense to on some level. In the very early days you were doing this, and your AWS bill was not the limiting factor to your company's success back in those days, but you did grow into it. Other folks, even at very large enterprise scale, more or less do this based on vibes. And most folks, I think, tend to fall somewhere in the middle of this, but it's not evenly distributed. Some teams tend to have a very deep insight into what they're doing, and others are, "Amazon bill? You mean the books?" Again, most tend to fall somewhere in the center of that. It's a law of large numbers; everything starts to revert to a mean past a certain point.

Ben Hartshorne: Well, I mean, you wouldn't have a job if they didn't make it a bit of a challenge to do so.

Corey: Or I might have a better job, depending, but we'll see. I do wanna detour a little bit here, because as we record this, it is the day after AWS's big, significant outage. I could really mess with the conspiracy theorists and say it is their first major outage of October of 2025, and then people are like, wait, what do you mean? What do you mean, this is World War I? Like, same type of approach. But these things do tend to cluster. How was your day yesterday?

Ben Hartshorne: Well, it did start very early. Our service has a presence in multiple regions, but we do have our main US instance in Amazon's US East 1. And so as things stopped working, a lot of our service stopped working too. I mean, the outage was significant, but wasn't pervasive.
There were still some things that kept functioning, and amazingly, we actually preserved all of the customer telemetry that made it to our front door successfully. Which is a big deal, because we hate dropping data.

Corey: Yeah, that took some work in engineering, and I have to imagine this was also not an accident.

Ben Hartshorne: It was not an accident. Now, their ability to query that data during the outage suffered.

Corey: I'm gonna push back on you on that for a second there. When AWS's US East 1, where you have a significant workload, is impacted to this degree, how important is observability? I know that when I've dealt with outages in the past, the first thing you try and figure out is, is it my shitty, shitty code, or is it a global issue? That's important. And once you establish it's a global issue, then you can begin the mitigation part of that process. And yes, observability becomes extraordinarily important there for some things, but for others, less so. There's also, at least with the cloud being as big as it is now, some reputational headline risk protection here, in that no one is talking about your site going down in some weird ways yesterday. Everyone's talking about AWS going down, like they own the reputation of this.

Ben Hartshorne: Yeah, that's true. And also, when a business's customers are asking them, "Which parts of your service are working? I know AWS is having a thing; how badly is it affecting you?" you wanna be able to give them a solid answer. So our customers were asking us yesterday, hey, are you dropping our data? And we wanted to be able to give them a reasonable answer, even in the moment. So yes, we were able to deflect a certain amount of the reputational harm. But at the same time, there are people that have come back and said, well, I mean, shouldn't you have done better? It's important for us to be able to rebuild our business and to move region to region, and we need you to help us do that too.

Corey: Oh, absolutely.
And I actually encountered a lot of this yesterday when I, early in the morning, tried to get a, what was it, a Halloween costume, and the Amazon site was not working properly for some strange reason. Now, if I read some of the relatively out-of-touch analyses in the mainstream press, that's billions and billions of dollars lost. Therefore, I either went to go get a Halloween costume from another vendor, or I will never wear a Halloween costume this year. Better luck in 2026. Neither of those is necessarily true.

Ben Hartshorne: And that's really exactly why we were focused on preserving, successfully storing, our customers' data in the moment. Because then, when the time comes afterwards, it's like, okay, we said what we said in the moment. Now they're asking us, okay, what really happened? That data is invaluable in helping our customers piece together which parts of their services were working and which weren't, at what times.

Corey: Did you see a drop in telemetry during the outage?

Ben Hartshorne: Yep, for sure.

Corey: Is that because people's systems were down, or is that because their systems could not communicate out?

Ben Hartshorne: Both.

Corey: Excellent.

Ben Hartshorne: We did get some reports from our customers that, specifically, the OpenTelemetry Collector that was gathering the data from their application was unable to successfully send it to Honeycomb. At the same time, we were not rejecting it. So clearly there were challenges in the path between those two things, whether that was in AWS's network or some other network unable to get to AWS, I dunno. So we definitely saw there were issues of reachability, and undoubtedly there was some data drop there that's completely out of our control. The only part we could speak to is that once the data got to us, we were able to successfully store it. So the question is, was it customers' apps going down? Absolutely, many of our customers were down, and they were unable to send us any telemetry because their app was offline.
But the other side is also true: the ones that were up were having trouble getting to us because of our location in US East.

Corey: Now, to continue reading what the mainstream press had to say about this, does that mean that you are now actively considering evacuating AWS entirely to go to a different provider that can be more reliable, probably building your own data centers?

Ben Hartshorne: Yeah, you know, I've heard people say that's the thing to do these days. Now, I have helped build data centers in the past.

Corey: As have I. There's a reason that both of us have a job that does not involve that.

Ben Hartshorne: There is. The data centers I built were not as reliable as any of the data centers that are available from our big public cloud providers.

Corey: I would've said, unless you worked at one of those companies building the data centers, and even back then, given the time you've been at Honeycomb, I can say with certainty, you are not as good at running data centers as they are, because effectively no one is. This is something that you get to learn about at significant scale. The concern, as I see it, is one of consolidation, but I've seen too many folks try and go multi-cloud for resilience reasons, and all they've done is add a second single point of failure. So now they're exposed to everyone's outage, and when that happens, their site continues to fall down in different ways, as opposed to being more resilient, which is a hell of a lot more than just picking multiple providers.

Ben Hartshorne: But there is something to say, though, of looking at a business and saying, okay, what is the cost for us to be, you know, single-region versus what is the cost to be fully multi-region, where we can fail over in an instant and nobody notices? Those cost differences are huge. And for most businesses...

Corey: Of course, it's a massive investment. At least 10x.

Ben Hartshorne: Yeah. So for most businesses, you're not gonna go that far.

Corey: My newsletter publication is entirely bound within US West 2; that just happened to be for latency purposes, not reliability reasons.
But if the region is hard down and I need to send an email newsletter and it's down for several days, I'm writing that one by hand, 'cause I've got a different story to tell that week. I don't need it to do the business-as-usual thing. And that's a reflection of architecture and investment decisions reflecting the reality of my business.

Ben Hartshorne: Yes. And that's exactly where to start. And there are things you can do within a region to increase a little bit of resilience to certain services within that region suffering. So, as an example, I don't remember how many years ago it was, but Amazon had an outage in KMS, the key management service. And that basically made everything stop. You can probably find out exactly when it happened.

Corey: Yes, I'm pulling that up now. Please continue. I'm curious now.

Ben Hartshorne: They provide a really easy way to replicate all of your keys to another region, and a pretty easy way to fail over accessing those keys from one region to another. So even if you're not gonna be fully multi-region, you can insulate against individual services that might have an incident, and prevent any one service from having an outsized impact on your application. You know, we don't need their keys most of the time, but when you do need them, you kind of need them to start your application. So if you need to scale up or do something like that and it's not available, you're really out of luck. So the thing is, I don't wanna advocate that people try and go fully multi-region, but that's not to say that we abdicate all responsibility for insulating our application from transient outages in our dependencies.

Corey: Yeah. To be clear, they did not do a formal writeup on the KMS issue on their basically kind of not-terrific list of post-event summaries. Things have to be sort of noisy for that to hit. I'm sure yesterday's will wind up on that list; they'll have had that up before this thing publishes. But yeah, they did not put the KMS issue there. You're completely correct.
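
For reference, the key replication Ben describes maps onto KMS multi-Region keys. The sketch below is illustrative only, not Honeycomb's setup: the Regions, key description, and fallback helper are assumptions for the example.

```python
import boto3

# Hypothetical Regions for the example.
PRIMARY_REGION = "us-east-1"
REPLICA_REGION = "us-west-2"

# Create a multi-Region primary key. MultiRegion must be set at creation time;
# an existing single-Region key cannot be converted.
kms_primary = boto3.client("kms", region_name=PRIMARY_REGION)
key = kms_primary.create_key(
    Description="example data key (multi-Region primary)",
    MultiRegion=True,
)
key_id = key["KeyMetadata"]["KeyId"]

# Replicate it into the second Region. Primary and replica share key material
# and key ID, so ciphertext produced in one Region can be decrypted in the other.
kms_primary.replicate_key(KeyId=key_id, ReplicaRegion=REPLICA_REGION)


def decrypt_with_fallback(ciphertext: bytes) -> bytes:
    """Try KMS in the primary Region first; fall back to the replica Region."""
    last_error = None
    for region in (PRIMARY_REGION, REPLICA_REGION):
        try:
            kms = boto3.client("kms", region_name=region)
            return kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
        except Exception as err:  # keep the sketch simple; real code would be pickier
            last_error = err
    raise RuntimeError("KMS unavailable in both Regions") from last_error
```
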
Corey: This is the sort of thing of, what is the blast radius of these issues? And I think that there's this sense that before we went to the cloud, everything was more reliable, but just the opposite is true. The difference is that if we were all building our data centers today, my shitty stuff at Duckbill is down, as it is every random Tuesday, and tomorrow Honeycomb is down because, oops, it turns out you once again forgot to replace a bad hard drive. Cool. But those are not happening at the same time. When you start with the centralization story, suddenly a disproportionate swath of the world is down simultaneously, and that's where things get weird. It gets even harder, though, because you can test your durability and your resilience as much as you want, but it doesn't account for the challenge of third-party providers on your critical path. You obviously need to make sure that, in order for Honeycomb to work, Honeycomb itself has to be up. That's sort of step one. But to do that, AWS itself has to be up in certain places. What other vendors factor into this?

Ben Hartshorne: You know, that was, I think, the most interesting part of yesterday's challenge, bringing the service back up: we do rely on an incredible number of other services. There's some list of all of our vendors that is hundreds long. Now, those are obviously very different parts of the business. They involve, you know, companies we contract with for marketing outreach and for business and for all of that.

Corey: Right. We use Dropbox here, and if Dropbox is down, that doesn't necessarily impact our ability to wind up serving our customers, but it does mean I need to find a different way, for example, to get the recorded file from this podcast over to my editing team.

Ben Hartshorne: Yeah, so there's the very long list, and then there's the much, much shorter list of vendors that are really in the critical path, and we have a bunch of those too.
We use vendors for feature flagging and for sending email and for some other forms of telemetry that are destined for other spots. For the most part, when we get that many vendors all relying on each other and they're all down at once, there's this bootstrapping problem where they're all trying to come back, but they all sort of rely on each other in order to come back successfully. And I think that's part of what made yesterday morning's outage move from roughly, what, midnight to 3:00 AM Pacific all the way through the rest of the day, and still have issues with some companies up until five, six, 7:00 PM.

Corey: This episode is sponsored by my own company, The Duckbill Group. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where The Duckbill Group comes in to help. Remember, you can't duck the Duckbill Bill, which I am reliably informed by my business partner is absolutely not our motto. To learn more, visit duckbillhq.com.

The Google SRE book talked about this, oh geez, when was it, 15 years ago now? Damn near that. That at some point, when a service goes down and then it starts to recover, everything that depends on it will often basically pummel it back into submission, trying to talk to the thing. Like, I remember back when I worked as a senior systems engineer at Media Temple, in the days before GoDaddy bought and then ultimately killed them. I was touring the data center my first week. We had three different facilities; I was in one of them, and I asked, okay, great, say I just trip over things and hit the emergency power off switch and kill the entire data center. There's an order that you have to bring things back up in the event of those catastrophic outages. Is there a runbook? And of course there was. Great, where is it? Oh, it's in Confluence. Terrific. Where's that? Oh, in the rack over there.
And I looked at the data center manager, and she was delightful and incredibly on point, and she knew exactly where I was going: we're gonna print that out right now. Excellent, excellent. Like, that's why you ask. It's someone who has never seen it before, but knows how these things work, going through that, because you build dependency on top of dependency, and you never get the luxury of taking a step back and looking at it with fresh eyes. But that's what our industry has done. But you have your vendors that have their own critical dependencies, that they may or may not have done as good a job as you have of identifying, and so on and so forth. It's the end of a very long chain that does kind of eat itself at some point.

Ben Hartshorne: Yeah. There are two things that that brings to mind. First, we absolutely saw exactly what you're describing yesterday in our traffic patterns, where the volume of incoming traffic would sort of come along, and then it would drop as their services went off, and then it's quiet for a little while, and then we get this huge spike as they're trying to, you know, bring everything back on all at once. Thankfully those were sort of spread out across our customers, so we didn't have just one enormous spike hit all of our servers. But we did see them on a per-customer basis. It's a very real pattern. But the second one: for all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks. I'll give you the story: we used LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag defaults in code, and if we can't reach our service, we'll go ahead and use those. And if we can reach our service, great, we'll update them. And if we can update them once, that's great. If we can connect to the streaming service, even better.
And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of a service unavailability, things will continue to work made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching, so that it can continue to serve the correct answer, so far as it knew from whenever it lost its connection. But it means that if they have a transient outage, our stuff doesn't break. And that kind of design really makes recovering from these interdependent outages feasible, in a way that the strict ordering you were describing just is really difficult.

Corey: At least in my case, I have the luxury of knowing these things just because I'm old and I figured this out before it was SRE common knowledge, or SRE was a widely acknowledged thing, where, okay, you have a job server that runs cron jobs every day, and it turns out that, oh, when you find it missed a cron job, oopsy doozy, that's a problem for some of those things. So now you start building in error checking and the rest, and then you do a restore from a three-day-old backup for that thing, and it suddenly thinks it missed all the cron jobs and runs them all, and then hammers some other system to death when it shouldn't. And you learn iteratively of, oh, that's kind of a failure mode. Like when you start externalizing and hardening APIs you build, you learn very quickly: everything needs a rate limit, and you need a way to make bad actors stop hammering your endpoints. Not just bad actors, naive ones.
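
The fallback chain Ben describes, in-code defaults first, then a one-time fetch, then streaming updates, shows up directly in how a flag gets evaluated. A minimal sketch using the LaunchDarkly Python SDK; the SDK key, flag key, and context here are placeholder assumptions, not Honeycomb's real configuration.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Placeholder SDK key for illustration; in production this comes from config.
ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()

# A hypothetical evaluation context for a backend service.
context = Context.builder("ingest-worker-1").kind("service").build()

# The third argument is the in-code default. If LaunchDarkly is unreachable at
# startup, evaluation falls back to this value; once the SDK has connected at
# least once, it keeps answering from its local cache of the last known flag
# state rather than refusing to answer.
use_new_ingest_path = client.variation("new-ingest-path", context, False)

if use_new_ingest_path:
    ...  # new behavior behind the flag
else:
    ...  # existing behavior
```
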
Ben Hartshorne: And rate limits are a good example, because that is one of the things that did happen yesterday. As people were coming back, we actually wound up needing to rate limit ourselves. We didn't have to rate limit our customers, but, so, brief digression here: Honeycomb uses Honeycomb in order to build Honeycomb. We are our own observability vendor. Now, this leads to some obvious challenges in architecture. You know, how do we know we're right? Well, in the beginning we did have some other services that we'd use to checkpoint our numbers and make sure that they were actually correct. But our production instance sits here and serves our customers, and all of its telemetry goes into the next one down the chain. We call that dogfood, because of, you know, the whole phrase of eating your own dog food; drinking your own champagne is the other, more pleasing version. So from our production, it goes to dogfood. And from dogfood? Well, what's dog food made of? It's made of kibble. So our third environment is called kibble. So the dogfood telemetry goes into this third environment, and that third environment, well, we need to know if it's working too, so it feeds back into our production instance. Each of these instances is emitting telemetry, and we have our rate limiting and, I'm sorry, our tail sampling proxy, called Refinery, that helps us reduce volume so it's not a positively amplifying cycle. But in this incident yesterday, we started emitting logs that we don't normally emit. These are coming from some of our SDKs when they can't reach their services. And so suddenly we started getting two or three or four log entries for every event we were sending, and we did get into this kind of amplifying cycle. So we put a pretty heavy rate limit on the kibble environment in order to squash that traffic and disrupt the cycle. Which made it difficult to ensure that it was working correctly, but it was, and that led us to make sure that the production instance was working all right. But this idea of rate limits being a critical part of maintaining an interconnected stack, in order to suppress these kinds of wavelike oscillations that start growing on each other and amplifying themselves and can take any infrastructure down, and being able to put in, at just the right point, a couple of switches and say, nope, suppress that signal, really made a big difference in our ability to bring back all of the services.
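
A rate limit of the kind Ben describes, clamping the lowest environment in the chain to break the feedback loop, can be as simple as a token bucket in front of the sender. This is an illustrative sketch only, not how Honeycomb or Refinery actually implements it; the numbers and names are made up.

```python
import time


class TokenBucket:
    """Allow roughly `rate` events per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical usage: heavily limit telemetry from the lowest environment.
kibble_limiter = TokenBucket(rate=100.0, capacity=200.0)


def send_event(event: dict) -> None:
    if kibble_limiter.allow():
        ...  # forward to the telemetry backend
    # else: drop it; losing some self-telemetry beats feeding an amplifying cycle
```
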
Corey: I want to pivot to one last topic, but we could talk about this outage for days and hours. There's something that you mentioned you wanted to go into that I wanted to pick a fight with you over, which was how to get people to instrument their applications for observability, so they can understand their applications, their performance, and the rest. And I'm gonna go with the easy answer, because it's a pain in the ass. Ben, have you tried instrumenting an application that already exists without having to spend a week on it?

Ben Hartshorne: You're not wrong. It's a pain in the ass, and it's getting better. There's lots of ways to make it better. There are packages that do auto-instrumentation.

Corey: Oh yeah, absolutely. For my case, yeah, it's Claude Code's problem now. I'm getting another drink.

Ben Hartshorne: You know, you say that in jest, and yet they are actually getting really good.

Corey: No, that's what I've been doing. It works super well. You test it first, obviously, but yeah. YOLO, slam that into production. But yeah.

Ben Hartshorne: The LLMs are actually getting pretty good at understanding where instrumentation can be useful. I say "understanding," I put that in air quotes. They're good at finding code that represents a good place to put instrumentation, and adding it to your code in the right place.

Corey: I need to take another try one of these days.
The last time I played with Honeycomb, I instrumented my home Kubernetes cluster, and I exceeded the limits of the free tier, based on ingest volume, by the second day of every month. And that led to either, you have really unfair limits, which I don't believe to be true, or the more insightful question: what the hell is my Kubernetes cluster doing that's that chatty? So I rebuilt the whole thing from scratch, so it's time for me to go back and figure that out.

Ben Hartshorne: Yeah. So, I will say, a lot of instrumentation is terrible. A lot of instrumentation is based on this idea that every single signal must be published all the time, and that's not relevant to you as a person running the Kubernetes cluster. You know, do you need to know every time a local pod checks in to see whether it needs to be evicted? No, you don't. What you're interested in are the types of activities that are relevant to what you need to do as an operator of that cluster. And the same is true of an application. If you just, you know, put, in the tracing language, a span on every single function call, you will not have useful traces, because it doesn't map to a useful way of representing your user's journey through your product. So there's definitely some nuance to getting the right level of instrumentation, and I think the right level is not a single place; it's a continuously moving spectrum based on what you were trying to understand about what your application is doing. So, at least at Honeycomb, we add instrumentation all the time, and we remove instrumentation all the time, because what's relevant to me now, as I'm building out this feature, is different from what I need to know about that feature once it is fully built and stable and running in a regular workload. Furthermore, as I'm looking at a specific problem or question, and we talked about, you know, pricing for Lambdas at the beginning of this, there was a time when we really wanted to understand pricing for S3. And part of our model, it's a struggle: part of our storage model is that we store our customers' telemetry in S3, in many files.
And we put instrumentation around every single S3 access, in order to understand both the volume and the latency of those, to see, like, okay, should we bundle them up or resize them like this, and how does that influence SLOs, and so on. And it's incredibly expensive to do that kind of experiment. And it's not just expensive in dollars; adding that level of instrumentation does have an impact on the overall performance of the system. When you're making 10,000 calls to S3 and you add a span around every one, it takes a bit more time. So once we understood the system well enough to make the change we wanted to make, we pulled all that back out. So for your Kubernetes cluster, you know, maybe it's interesting at the very beginning to look at every single connection that any process might make. But if it's your home cluster, that's not really what you need to know as an operator. So finding the right balance there, of instrumentation that lets you fulfill the needs of the business, that lets you understand the needs of the operator in order to best be able to provide the service that this business is providing to its customers: it's a place somewhere there in the middle, and you're gonna need some people to find it.

Corey: And that's easier said than done for a lot of folks. But you're right, it is getting easier to instrument these things. It is something that is iteratively getting better all the time, to the point where now, like, this is an area where AI is surprisingly effective. It doesn't take a lot to wrap a function call with a decorator.

Ben Hartshorne: Mm-hmm. It just takes a lot of doing that over and over and over again.
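
Wrapping a function call with a decorator, as Corey puts it, is roughly what the OpenTelemetry Python SDK offers for manual instrumentation. A minimal sketch; the span and attribute names are invented for illustration, and the console exporter stands in for whatever backend the spans would really go to.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal setup: print spans to the console. A real service would configure an
# OTLP exporter pointed at its collector or vendor instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example.instrumentation")


@tracer.start_as_current_span("fetch_customer_usage")  # one decorator, one span
def fetch_customer_usage(customer_id: str) -> dict:
    # Attributes on the span are what make the trace queryable later:
    # group by customer.id, sum durations, and so on.
    trace.get_current_span().set_attribute("customer.id", customer_id)
    return {"customer_id": customer_id, "events": 0}  # placeholder result


if __name__ == "__main__":
    fetch_customer_usage("cust-123")
```
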
Ben Hartshorne: You do a lot of them, and you see what it looks like, and then you see, okay, which ones of these are actually useful to me now. And we want to be open to that changing, and willing to understand that this is an evolving thing. And this does actually tie back to one of the core operating principles of modern SaaS architectures: the ability to deploy your code quickly. Because if you're in this cycle of adding instrumentation, of removing instrumentation, and you see a bug, it has to be easy enough to add a little bit more data to get insight into that bug in order to resolve it. Otherwise nobody's gonna do it, and the whole business suffers for it.

Corey: What is "quickly," to you?

Ben Hartshorne: I need to make this change and have it visible in my test environment: a couple of minutes. I need to make this change and have it visible running in production: it depends on, like, how frequent the bug is, but I'm actually okay with it being about an hour for that kind of turnaround. I know a lot of people say you should have your code running in 15 minutes. That's great. I know that's out of reach for a lot of people in a lot of industries. So I'm not a hardliner on how quickly it has to be, but it can't be a week. It can't even be a day, because you're gonna wanna do this two or three times in the course of resolving a bug. And so if it's something too long, you're just really pushing out any ability to respond quickly to a customer.

Corey: I really wanna thank you for taking the time to speak with me about all this. If people wanna learn more, where's the best place for them to go?

Ben Hartshorne: You know, I have backed off of almost all of the platforms in which people carry on conversations on the internet.

Corey: Everyone seems to have done this.

Ben Hartshorne: I did work for Facebook for two and a half years, and...

Corey: Someday I might forgive you.

Ben Hartshorne: Someday I might forgive myself. It was a really different environment, and I could see the allure of the world they're trying to create, and it doesn't match. Oh, I interviewed there in 2009; it was incredibly compelling. It doesn't match the view that I see of the world we're in. And so, I have a presence at Honeycomb. I do have accounts on all of the major platforms, so you can find me there.
There will be links afterwards, I'm sure, but: LinkedIn, Bluesky, I dunno. GitHub, is that a social media platform now?

Corey: They wish. We'll put all this in the show notes. Problem solved for us. Thank you so much for taking the time to speak with me. I appreciate it.

Ben Hartshorne: It's a real pleasure. Thank you.

Corey: Ben Hartshorne is a principal engineer at Honeycomb. One of them; they possibly have more than one. Seems to be something you can scale, unlike my nonsense as Chief Cloud Economist at The Duckbill Group. And this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that won't work, because that platform is down and not accepting comments at this moment.
