
Screaming in the Cloud
E661
Building Systems That Work Even When Everything Breaks with Ben Hartshorne
Episode Transcript
1
00:00:00,000 --> 00:00:04,019
Ben Hartshorne: For all of these
dependencies, there are clearly
2
00:00:04,019 --> 00:00:10,020
several who have built their system
with this challenge in mind and have
3
00:00:10,080 --> 00:00:12,660
a series of different fallbacks.
4
00:00:13,200 --> 00:00:16,170
Uh, I'll, I'll give you the story
of, um, uh, we used LaunchDarkly
5
00:00:16,170 --> 00:00:17,100
for our feature flagging.
6
00:00:17,460 --> 00:00:19,050
Their service was also impacted yesterday.
7
00:00:19,050 --> 00:00:22,230
One would think, oh, we need our
feature flags in order to boot up.
8
00:00:23,070 --> 00:00:27,390
Well, their SDK is built with
the idea that you set your
9
00:00:27,390 --> 00:00:28,650
feature flag defaults in code.
10
00:00:28,995 --> 00:00:31,215
And if we can't reach our service,
we'll go ahead and use those.
11
00:00:32,145 --> 00:00:33,795
And if we can reach our service, great.
12
00:00:33,915 --> 00:00:34,635
We'll update them.
13
00:00:35,145 --> 00:00:37,035
And if we can update
them once, that's great.
14
00:00:37,065 --> 00:00:39,255
If we can connect to the
streaming service even better.
15
00:00:40,455 --> 00:00:44,085
And I, I think they also have, uh,
some, some more, uh, bridging in
16
00:00:44,085 --> 00:00:47,385
there, but we don't use, uh, the,
the more complicated infrastructure.
17
00:00:47,655 --> 00:00:53,535
But this idea that they design the
system with the expectation that in the
18
00:00:53,535 --> 00:00:57,465
event of service unavailability,
things will continue to work,
19
00:00:58,665 --> 00:01:01,155
made the recovery process
all that much better.
20
00:01:01,665 --> 00:01:06,225
And, uh, even when, when, uh, their
service was unavailable and ours
21
00:01:06,225 --> 00:01:10,905
was still running, uh, the SDK
still answers questions in code for
22
00:01:10,905 --> 00:01:12,255
the status of all of these flags.
23
00:01:12,735 --> 00:01:14,925
It doesn't say, oh, I, I
can't reach my upstream.
24
00:01:14,925 --> 00:01:16,575
Suddenly, I can't give
you an answer anymore.
25
00:01:16,785 --> 00:01:20,955
No, the SDK is built with that idea
of local caching so that it can
26
00:01:21,015 --> 00:01:23,745
continue to serve the correct answer.
27
00:01:24,630 --> 00:01:27,300
So far as it knew, from whenever
it lost its connection.
28
00:01:32,160 --> 00:01:33,840
Corey: Welcome to Screaming in the Cloud.
29
00:01:34,080 --> 00:01:35,220
I'm Corey Quinn.
30
00:01:35,400 --> 00:01:39,240
My guest today is one of those folks
that I am disappointed I have not
31
00:01:39,240 --> 00:01:43,289
had on the show until now, just
because I assumed I already had.
32
00:01:43,590 --> 00:01:48,330
Ben Hartshorne is a principal engineer at
Honeycomb, but oh, so much more than that.
33
00:01:48,539 --> 00:01:50,520
Ben, thank you for deigning to join us.
34
00:01:50,789 --> 00:01:52,050
Ben Hartshorne: It's lovely
to be here this morning.
35
00:01:52,440 --> 00:01:55,530
Corey: This episode is sponsored
in part by my day job, Duckbill.
36
00:01:55,530 --> 00:01:58,384
Do you have a horrifying AWS bill?
37
00:01:59,009 --> 00:02:00,869
That can mean a lot of things.
38
00:02:01,110 --> 00:02:06,179
Predicting what it's going to be,
determining what it should be, negotiating
39
00:02:06,179 --> 00:02:11,609
your next long-term contract with AWS,
or just figuring out why it increasingly
40
00:02:11,609 --> 00:02:16,140
resembles a phone number, but nobody
seems to quite know why that is.
41
00:02:16,440 --> 00:02:20,010
To learn more, visit duckbillhq.com.
42
00:02:20,310 --> 00:02:23,190
Remember, you can't duck the Duckbill
43
00:02:23,220 --> 00:02:28,620
Bill, which my CEO reliably informs
me is absolutely not our slogan.
44
00:02:29,025 --> 00:02:35,174
So you gave a talk, uh, about roughly
a month ago, uh, at the inaugural
45
00:02:35,265 --> 00:02:38,144
FinOps, uh, meetup in San Francisco.
46
00:02:39,015 --> 00:02:40,215
Give us the high level.
47
00:02:40,215 --> 00:02:40,945
What did you talk about?
48
00:02:41,520 --> 00:02:43,710
Ben Hartshorne: Well, I got
to talk about two stories.
49
00:02:43,830 --> 00:02:44,970
Um, I love telling stories.
50
00:02:45,210 --> 00:02:49,680
I got to talk about two stories of
how we used Honeycomb and instrumentation
51
00:02:49,950 --> 00:02:51,780
to help optimize our cloud
52
00:02:51,780 --> 00:02:55,680
spending, a topic near and dear to your
heart, uh, is what brought me there.
53
00:02:56,070 --> 00:03:01,410
We gotta look at the overall bill
and say, Hey, what, where are some
54
00:03:01,410 --> 00:03:02,610
of the big things coming from?
55
00:03:02,910 --> 00:03:05,940
Obviously it's people sending
us data and people asking us
56
00:03:05,940 --> 00:03:07,079
questions about those data.
57
00:03:07,920 --> 00:03:09,930
Corey: And if they would just
stop both of those things, your
58
00:03:09,930 --> 00:03:11,250
bill would be so much better.
59
00:03:11,640 --> 00:03:12,990
Ben Hartshorne: It would
be so much smaller.
60
00:03:13,380 --> 00:03:15,390
Um, so would my salary, unfortunately.
61
00:03:15,900 --> 00:03:22,240
Um, so we wanted to reduce some of
those costs, but, uh, it, it's a, it's
62
00:03:22,240 --> 00:03:26,430
a problem that that's hard to get into
just from like a, a general perspective.
63
00:03:26,430 --> 00:03:28,680
You need to really get in and
look at all the details to find
64
00:03:28,680 --> 00:03:29,970
out what you're gonna change.
65
00:03:30,480 --> 00:03:32,550
So, uh, I gotta tell two stories.
66
00:03:33,255 --> 00:03:35,025
Uh, reducing costs.
67
00:03:35,355 --> 00:03:40,665
One by switching from AMD to,
uh, Arm architecture for Amazon.
68
00:03:40,665 --> 00:03:42,765
That's the Graviton chipset,
which is fantastic.
69
00:03:43,155 --> 00:03:47,235
Uh, and the other was about the
amazing power of spreadsheets.
70
00:03:49,455 --> 00:03:52,395
As much as I love graphs,
I also love spreadsheets.
71
00:03:53,040 --> 00:03:54,420
I, I'm sorry.
72
00:03:54,630 --> 00:03:55,980
It's a personal failing.
73
00:03:55,980 --> 00:03:56,430
Perhaps.
74
00:03:56,670 --> 00:04:00,510
Corey: It's wild to me how many tools
out there do all kinds of business
75
00:04:00,510 --> 00:04:04,530
adjacent things, but somehow never bother
to realize that if you can just export
76
00:04:04,530 --> 00:04:09,330
a CSV, suddenly you're speaking kind
of the language of your ultimate user.
77
00:04:09,540 --> 00:04:13,500
Play with pandas a little bit more
and spit out an actual Excel file,
78
00:04:13,500 --> 00:04:14,640
and now you're cooking with gas.
79
00:04:14,645 --> 00:04:14,965
Mm-hmm.
80
00:04:16,320 --> 00:04:19,950
Ben Hartshorne: So, uh, the, the second
story is about doing that with Honeycomb.
81
00:04:20,310 --> 00:04:23,969
Taking, uh, a number of different
graphs and looking at, um, five
82
00:04:23,969 --> 00:04:29,070
different attributes of our lambda,
uh, costs and what was going into
83
00:04:29,070 --> 00:04:33,719
them, and making changes across all
of them in order to, uh, accomplish
84
00:04:33,719 --> 00:04:36,060
an overall cost reduction of about 50%.
85
00:04:36,599 --> 00:04:37,469
Uh, which is really great.
86
00:04:37,860 --> 00:04:42,599
So the, the story, uh, it does
combine my love of graphs because we
87
00:04:42,599 --> 00:04:44,159
gotta see the three lines go down.
88
00:04:44,580 --> 00:04:48,990
Um, the power of spreadsheets
and also this idea that.
89
00:04:49,530 --> 00:04:55,560
You can't just look for one answer
to find the, uh, solution to your
90
00:04:55,620 --> 00:04:58,169
problems around, well, anything really.
91
00:04:58,500 --> 00:05:00,510
Uh, but especially around reducing costs.
92
00:05:00,990 --> 00:05:03,870
It's going to be a bunch of
small things that you can put
93
00:05:03,870 --> 00:05:05,400
together, uh, into one place.
94
00:05:06,000 --> 00:05:09,359
Corey: There's a, there's a lot that's
valuable when we start going down that
95
00:05:09,359 --> 00:05:11,969
particular path of starting to look at
96
00:05:12,719 --> 00:05:15,810
things through a, a lens of a
particular kind of data that
97
00:05:15,810 --> 00:05:17,280
you otherwise wouldn't think to.
98
00:05:17,429 --> 00:05:21,929
I, I remain, I maintain that you
remain the only customer we have
99
00:05:21,929 --> 00:05:28,679
found so far that uses Honeycomb to
completely instrument their AWS bill.
100
00:05:28,859 --> 00:05:31,919
Uh, we had not seen that before or since.
101
00:05:32,099 --> 00:05:34,650
It, it makes sense for
you to do it that way.
102
00:05:34,650 --> 00:05:35,460
Absolutely.
103
00:05:36,150 --> 00:05:38,130
It's a bit of a heavy lift for,
104
00:05:38,565 --> 00:05:40,125
shall we say, everyone else.
105
00:05:40,755 --> 00:05:44,655
Ben Hartshorne: Uh, and it, it actually
is a, a bit of a lift for, for us to, to
106
00:05:44,655 --> 00:05:49,604
say we've instrumented the entire bill,
uh, is a, a wonderful thing to, to assert.
107
00:05:49,695 --> 00:05:55,215
And, uh, as we've talked about, we,
we use the power of spreadsheets too.
108
00:05:55,784 --> 00:06:00,195
So there are some aspects, there's
some aspects of our
109
00:06:00,195 --> 00:06:03,885
AWS spending, and actually
really dominant ones, uh, that
110
00:06:04,664 --> 00:06:06,765
lend themselves very easily to
111
00:06:08,310 --> 00:06:09,000
using Honeycomb.
112
00:06:09,360 --> 00:06:14,400
Um, the best example is Lambda because
Lambda is, uh, charged on a per
113
00:06:14,400 --> 00:06:21,690
millisecond basis and our instrumentation
is collecting spans, traces about your
114
00:06:21,690 --> 00:06:23,610
compute on a per millisecond basis.
115
00:06:23,910 --> 00:06:27,270
There's a very easy translation
there, and so we can get really good
116
00:06:27,270 --> 00:06:30,990
insight into which customers are
spending how much, or rather, which
117
00:06:30,995 --> 00:06:32,825
customers are causing us to spend
118
00:06:32,885 --> 00:06:33,225
how much to
119
00:06:34,815 --> 00:06:39,945
provide our product to them and, uh,
understand how that, how we can balance
120
00:06:39,945 --> 00:06:45,495
our, uh, development resources to both
provide new features and also, uh,
121
00:06:45,495 --> 00:06:49,440
understand when we need to shift and, uh,
spend our attention managing costs.
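
The per-millisecond correspondence Ben describes is easy to sketch. Purely as an illustration of the idea (the span fields, the customer attribute, and the per-GB-second rate below are hypothetical placeholders, not Honeycomb's actual pipeline), rolling Lambda span durations up into an estimated cost per customer could look like this in Go:

```go
package main

import "fmt"

// spanRecord is a hypothetical, simplified view of one Lambda invocation span:
// which customer it served, how long it ran, and how much memory was configured.
type spanRecord struct {
	CustomerID string
	DurationMS float64
	MemoryMB   float64
}

// estimateCostByCustomer rolls span durations up into an approximate Lambda
// cost per customer, using a placeholder per-GB-second rate.
func estimateCostByCustomer(spans []spanRecord, usdPerGBSecond float64) map[string]float64 {
	costs := make(map[string]float64)
	for _, s := range spans {
		gbSeconds := (s.MemoryMB / 1024.0) * (s.DurationMS / 1000.0)
		costs[s.CustomerID] += gbSeconds * usdPerGBSecond
	}
	return costs
}

func main() {
	spans := []spanRecord{
		{CustomerID: "acme", DurationMS: 1200, MemoryMB: 1024},
		{CustomerID: "acme", DurationMS: 300, MemoryMB: 1024},
		{CustomerID: "initech", DurationMS: 4500, MemoryMB: 512},
	}
	// Placeholder rate; the real per-GB-second price varies by region and architecture.
	for customer, usd := range estimateCostByCustomer(spans, 0.0000166667) {
		fmt.Printf("%s: ~$%.8f\n", customer, usd)
	}
}
```
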
122
00:06:50,520 --> 00:06:54,450
Corey: There's a continuum here, and
I think that it tends to follow a
123
00:06:54,450 --> 00:06:59,160
lot around company ethos and company
culture here, where folks have
124
00:06:59,280 --> 00:07:04,110
varying degrees of insight into the
factors that drive their cloud spend.
125
00:07:04,560 --> 00:07:05,230
Uh, you are
126
00:07:05,700 --> 00:07:10,560
clearly an observability company. You
have been observing your AWS bill
127
00:07:10,560 --> 00:07:14,700
for, I would argue longer than it
would've made sense to on some level.
128
00:07:14,700 --> 00:07:18,270
In the very early days you were
doing this and your AWS bill was
129
00:07:18,270 --> 00:07:23,370
not the, the, the limiting factor to
your company's success back in those
130
00:07:23,370 --> 00:07:24,870
days, but, but you did grow into it.
131
00:07:25,349 --> 00:07:28,560
Other folks, even at very large
enterprise scale, more or less,
132
00:07:28,560 --> 00:07:30,300
do this based on vibes.
133
00:07:31,025 --> 00:07:34,805
Most folks, I think, tend to fall
somewhere in the middle of this,
134
00:07:34,805 --> 00:07:36,545
but, but it's not evenly distributed.
135
00:07:36,695 --> 00:07:39,185
Some teams tend to have a
very deep insight into what
136
00:07:39,185 --> 00:07:40,685
they're doing, and others are,
137
00:07:40,685 --> 00:07:42,485
Amazon bill? You mean the books?
138
00:07:42,724 --> 00:07:46,265
It, it, again, most tend to
fall somewhere in the center of that.
139
00:07:46,265 --> 00:07:47,914
It's, it's a law of large numbers.
140
00:07:47,914 --> 00:07:50,255
Everything starts to revert to
a mean, past a certain point.
141
00:07:51,455 --> 00:07:53,585
Ben Hartshorne: Well, I mean, you, you
wouldn't have a job if, if they didn't
142
00:07:53,585 --> 00:07:55,534
make it a bit of a challenge to do so.
143
00:07:56,190 --> 00:07:59,250
Corey: Or I might have a better
job depending, but we'll see.
144
00:07:59,489 --> 00:08:02,729
Uh, I I do wanna detour a little bit
here because as we record this, it is the
145
00:08:02,729 --> 00:08:07,140
day after AWS's big significant outage.
146
00:08:07,140 --> 00:08:11,280
I could really mess with the conspiracy
theorists and say it is their first
147
00:08:11,280 --> 00:08:15,150
major outage of, uh, October of 2025.
148
00:08:15,539 --> 00:08:17,159
Uh, and then people are
like, wait, what do you mean?
149
00:08:17,370 --> 00:08:18,359
What do you mean, is this World War I?
150
00:08:18,780 --> 00:08:22,049
I like the same type of approach, like,
but these things do tend to cluster.
151
00:08:22,320 --> 00:08:24,299
Uh, how was your day yesterday?
152
00:08:25,305 --> 00:08:27,345
Ben Hartshorne: Uh, well,
it did start very early.
153
00:08:27,855 --> 00:08:33,855
Um, uh, our, our service, uh, has,
has presence in multiple regions.
154
00:08:34,275 --> 00:08:40,485
Uh, but we do have our, our main, uh,
US instance in, in Amazon's US East one.
155
00:08:40,965 --> 00:08:46,555
And so as, uh, things stopped working, uh,
a lot of our service stopped working too.
156
00:08:48,104 --> 00:08:52,305
I mean, the, the outage was, was
significant, but wasn't, uh, pervasive.
157
00:08:52,545 --> 00:08:56,594
There were still some things that
kept functioning, and amazingly,
158
00:08:56,805 --> 00:09:01,844
we actually preserved all of the
customer telemetry that made it
159
00:09:01,844 --> 00:09:03,584
to our front door successfully.
160
00:09:04,185 --> 00:09:06,584
Uh, which is a big deal
because we hate dropping data.
161
00:09:06,944 --> 00:09:09,974
Corey: Yeah, it's, that took some
work in engineering and, and I have to
162
00:09:09,974 --> 00:09:11,265
imagine this was also not an accident.
163
00:09:11,775 --> 00:09:12,645
Ben Hartshorne: It was not an accident.
164
00:09:13,125 --> 00:09:16,185
Now, their ability to query
that data during the outage
165
00:09:17,355 --> 00:09:17,685
suffered.
166
00:09:18,255 --> 00:09:20,415
Corey: I, I'm gonna push back on
you on that for a second there.
167
00:09:20,444 --> 00:09:24,525
When AWS's US East, one where
you have a significant workload
168
00:09:24,525 --> 00:09:30,285
is impacted to this degree, how
important is, uh, observability?
169
00:09:30,680 --> 00:09:32,564
I, I know that when I've
dealt with outages in the
170
00:09:32,655 --> 00:09:33,375
past.
171
00:09:33,405 --> 00:09:36,525
There's, uh, the first thing you try
and figure out is, is it my shitty,
172
00:09:36,525 --> 00:09:38,685
shitty code or is it a global issue?
173
00:09:38,714 --> 00:09:39,464
That's important.
174
00:09:39,584 --> 00:09:43,155
And once you establish it's a global
issue, then you can begin, uh, the
175
00:09:43,155 --> 00:09:45,135
mitigation part of that process.
176
00:09:45,314 --> 00:09:48,464
And yes, observability becomes
extraordinarily important there for
177
00:09:48,464 --> 00:09:50,115
some things, but for others it's.
178
00:09:50,865 --> 00:09:54,074
There, there's also, at least with
the cloud being as big as it is now,
179
00:09:54,255 --> 00:09:58,365
there's some reputational headline,
uh, risk protection here in that
180
00:09:58,545 --> 00:10:01,545
no one is talking about your site
going down in some weird ways.
181
00:10:01,545 --> 00:10:05,145
Yesterday everyone's talking
about AWS going down, like they
182
00:10:05,145 --> 00:10:06,555
own the reputation of this.
183
00:10:06,735 --> 00:10:07,005
Yeah,
184
00:10:08,324 --> 00:10:09,015
Ben Hartshorne: that's true.
185
00:10:09,464 --> 00:10:15,314
Um, and also when a business's
customers are asking them.
186
00:10:15,825 --> 00:10:17,355
Which parts of your service are working?
187
00:10:17,565 --> 00:10:21,495
I know AWS is having a thing,
uh, how bad is it affecting you?
188
00:10:21,675 --> 00:10:23,535
You wanna be able to
give them a solid answer.
189
00:10:24,285 --> 00:10:28,185
So our customers were asking us
yesterday, Hey, are you dropping our data?
190
00:10:28,995 --> 00:10:32,775
And we wanted to be able to give them a,
a reasonable answer even in the moment.
191
00:10:32,955 --> 00:10:37,095
So yes, the, we were
able to deflect a certain amount of
192
00:10:39,020 --> 00:10:40,635
the, the reputational harm.
193
00:10:40,785 --> 00:10:44,324
But at the same time, there are people
that have come back and say, well, I
194
00:10:44,324 --> 00:10:45,615
mean, shouldn't you have done better?
195
00:10:45,915 --> 00:10:48,765
It's important for us to be able
to rebuild our business and to
196
00:10:48,824 --> 00:10:52,094
to move region to region, and we
need you to help us do that too.
197
00:10:52,425 --> 00:10:53,145
Corey: Oh, absolutely.
198
00:10:53,145 --> 00:10:56,235
And I, I actually encountered a lot of
this yesterday when I, uh, early in the
199
00:10:56,235 --> 00:11:01,365
morning tried to get a, uh, what was it, a
Halloween costume, and Amazon's site was not
200
00:11:01,365 --> 00:11:03,704
working properly for some strange reason.
201
00:11:03,885 --> 00:11:05,040
Now, if I read some of the
202
00:11:05,895 --> 00:11:09,194
relatively out of touch analyses
in the mainstream press.
203
00:11:09,435 --> 00:11:12,165
Uh, that's billions and
billions of dollars lost.
204
00:11:12,165 --> 00:11:16,035
Therefore, I either went to go get
a Halloween costume from another
205
00:11:16,035 --> 00:11:19,485
vendor, or I will never wear
a Halloween costume this year.
206
00:11:19,635 --> 00:11:21,225
Better luck in 2026.
207
00:11:21,765 --> 00:11:23,985
Neither of those is necessarily true,
208
00:11:24,255 --> 00:11:24,915
Ben Hartshorne: and that's really
209
00:11:25,469 --> 00:11:29,819
exactly why we, we were focused
on preserving, successfully storing,
210
00:11:29,819 --> 00:11:31,199
our customers' data in the moment.
211
00:11:31,709 --> 00:11:34,949
Because then when the, uh, when the
time comes afterwards, they're like,
212
00:11:34,949 --> 00:11:37,920
okay, now we, we, we said what we
said in the time, in the moment.
213
00:11:38,189 --> 00:11:40,319
Now they're asking us,
okay, what really happened?
214
00:11:40,709 --> 00:11:44,729
Uh, that data is invaluable in
helping our customers piece together
215
00:11:44,969 --> 00:11:47,579
which parts of their services
were working and which weren't,
216
00:11:48,150 --> 00:11:48,959
at what times.
217
00:11:49,380 --> 00:11:52,829
Corey: Did you see a drop in,
uh, telemetry during the outage?
218
00:11:53,370 --> 00:11:54,270
Ben Hartshorne: Yep, for sure.
219
00:11:54,660 --> 00:11:56,910
Corey: Is that because people's systems
were down, or is that because their
220
00:11:56,910 --> 00:11:58,200
systems could not communicate out?
221
00:11:58,380 --> 00:11:58,890
Ben Hartshorne: Both.
222
00:11:59,760 --> 00:12:00,210
Corey: Excellent.
223
00:12:01,740 --> 00:12:05,460
Ben Hartshorne: Uh, we did get some
reports of, uh, from our customers that
224
00:12:05,730 --> 00:12:10,740
their, uh, specifically the open telemetry
collector that was, uh, gathering the
225
00:12:10,740 --> 00:12:15,180
data from their application was unable
to successfully send it to Honeycomb.
226
00:12:15,810 --> 00:12:18,990
Uh, at the same time we
were not rejecting it.
227
00:12:19,515 --> 00:12:24,344
So clearly there were challenges in
the, the path between those two things.
228
00:12:24,735 --> 00:12:28,875
Uh, whether that was in AWS's network or
some other network unable to get to AWS,
229
00:12:28,875 --> 00:12:30,015
I, I dunno.
230
00:12:30,194 --> 00:12:35,145
So, uh, we definitely saw there
were issues of reachability.
231
00:12:35,714 --> 00:12:39,615
Uh, and so undoubtedly there was some
data drop there that's completely out
232
00:12:39,615 --> 00:12:43,844
of our control. So the, the only part we could
say is once the data got to us, we
233
00:12:43,844 --> 00:12:45,135
were able to successfully store it.
234
00:12:45,525 --> 00:12:48,380
So, um, the question is, uh, was it
235
00:12:49,110 --> 00:12:50,850
customers' apps going down?
236
00:12:51,704 --> 00:12:57,285
Uh, absolutely many of our customers were
down and they were unable to send us any
237
00:12:57,285 --> 00:12:59,055
telemetry because their app was offline.
238
00:12:59,564 --> 00:13:03,824
Uh, but the other side is also
true that the ones that were up
239
00:13:04,094 --> 00:13:07,854
were having trouble getting to us
because of our location in US East.
240
00:13:08,645 --> 00:13:11,615
Corey: Now to continue reading what
the mainstream press had to say about
241
00:13:11,645 --> 00:13:17,225
this, does that mean that you are now
actively considering evacuating AWS
242
00:13:17,225 --> 00:13:21,875
entirely to go to a different provider
that can be more reliable, probably
243
00:13:21,875 --> 00:13:23,135
building your own data centers?
244
00:13:23,465 --> 00:13:25,055
Ben Hartshorne: Yeah, you know,
I've, I've heard people say
245
00:13:25,055 --> 00:13:26,405
that's the thing to do these days.
246
00:13:26,944 --> 00:13:29,319
Now, I, I have helped build
data centers in the past.
247
00:13:30,375 --> 00:13:32,594
Corey: As have I. There's a
reason that both of us have a
248
00:13:32,594 --> 00:13:33,974
job that does not involve that.
249
00:13:34,454 --> 00:13:37,964
Ben Hartshorne: There is. Uh, the data
centers I built were not as reliable as
250
00:13:37,964 --> 00:13:42,645
any of the data centers that are available
from our, our big public cloud providers.
251
00:13:42,750 --> 00:13:45,194
Corey: I, I would've said, unless you
worked at one of those companies building
252
00:13:45,194 --> 00:13:48,104
the data centers, and even back then,
given the time you've been at Honeycomb,
253
00:13:48,104 --> 00:13:51,584
I can say with certainty, you are
not as good at running data centers as
254
00:13:51,584 --> 00:13:53,925
they are because effectively no one is.
255
00:13:53,925 --> 00:13:56,685
This is something that you get to
learn about at significant scale.
256
00:13:56,835 --> 00:13:59,685
The concern is, I see it as one of
consolidation, but I've seen too
257
00:13:59,685 --> 00:14:04,095
many folks try and go multi-cloud
for resilience reasons, and all
258
00:14:04,095 --> 00:14:06,345
they've done is they've added a
second single point of failure.
259
00:14:06,345 --> 00:14:09,045
So now they're exposed to everyone's
outage, and when that happens, their
260
00:14:09,045 --> 00:14:13,185
site continues to fall down in different
ways as opposed to being more resilient,
261
00:14:13,215 --> 00:14:16,665
which is a hell of a lot more than
just picking multiple providers.
262
00:14:17,205 --> 00:14:19,365
Ben Hartshorne: But there is
something to say though of looking
263
00:14:19,365 --> 00:14:21,405
at a business and saying, okay, what.
264
00:14:22,845 --> 00:14:26,865
What is the cost for us to be, you
know, single region versus what is
265
00:14:26,865 --> 00:14:31,455
the cost to be fully, uh, you know,
multi-region where we can fail over
266
00:14:31,455 --> 00:14:33,435
in an instant and nobody notices?
267
00:14:34,005 --> 00:14:35,955
Uh, those cost differences are huge.
268
00:14:36,630 --> 00:14:38,130
And for most businesses
269
00:14:38,490 --> 00:14:40,200
Corey: Of course, it's
a massive investment.
270
00:14:40,200 --> 00:14:40,830
At least 10x.
271
00:14:41,040 --> 00:14:41,220
Ben Hartshorne: Yeah.
272
00:14:41,640 --> 00:14:44,040
So for most businesses
you're not gonna go that far.
273
00:14:44,400 --> 00:14:47,820
Corey: My, my newsletter publication
is entirely bound within US West
274
00:14:47,820 --> 00:14:51,300
two, because if that goes down, that,
that just happened to be for latency
275
00:14:51,300 --> 00:14:52,830
purposes, not reliability reasons.
276
00:14:52,980 --> 00:14:55,530
But if the region is hard down and
I need to send an email newsletter
277
00:14:55,530 --> 00:14:58,470
and it's down for several days,
I'm writing that one by hand.
278
00:14:58,500 --> 00:15:00,420
'cause I've got a different
story to tell that week.
279
00:15:00,420 --> 00:15:02,550
I don't need it to do
the business-as-usual
280
00:15:03,005 --> 00:15:03,125
thing.
281
00:15:03,305 --> 00:15:06,875
And that that's a reflection of
architecture and investment decisions
282
00:15:07,055 --> 00:15:08,675
reflecting the reality of my business.
283
00:15:08,855 --> 00:15:09,035
Ben Hartshorne: Yes.
284
00:15:09,335 --> 00:15:10,955
And that's, that's exactly where to start.
285
00:15:11,585 --> 00:15:15,395
And there are things you can do within
a region to increase a little bit
286
00:15:15,395 --> 00:15:19,205
of resilience to certain services
within that region suffering.
287
00:15:19,865 --> 00:15:25,085
So, um, as an example, uh, uh, I don't
remember how many years ago it was, uh,
288
00:15:25,085 --> 00:15:29,135
but uh, Amazon had an outage in KMS,
the, uh, the Key Management Service.
289
00:15:29,675 --> 00:15:32,465
And that basically made everything stop.
290
00:15:33,150 --> 00:15:35,400
Uh, you can probably find
out exactly when it happened.
291
00:15:35,640 --> 00:15:36,810
Corey: Yes, I'm pulling that up now.
292
00:15:36,810 --> 00:15:37,560
Please continue.
293
00:15:37,560 --> 00:15:38,130
I'm curious.
294
00:15:38,130 --> 00:15:38,460
Now
295
00:15:38,640 --> 00:15:42,329
Ben Hartshorne: they provide a really
easy way to replicate all of your
296
00:15:42,329 --> 00:15:47,850
keys to another region and a pretty
easy way to fail over accessing those
297
00:15:47,850 --> 00:15:49,110
keys from one region to another.
298
00:15:49,680 --> 00:15:52,920
So even if you're not gonna be
fully multi-region, you can insulate
299
00:15:52,920 --> 00:15:56,490
against individual services that
might have an incident and prevent
300
00:15:56,550 --> 00:15:59,910
those one services from having an
outsized impact on your application.
301
00:16:00,690 --> 00:16:04,110
You know, we don't need their keys
most of the time, but when you
302
00:16:04,110 --> 00:16:07,170
do need them, you kind of need
them to start your application.
303
00:16:07,170 --> 00:16:10,080
So if you need to scale up or do
something like that and it's not
304
00:16:10,080 --> 00:16:12,600
available, you're really out of luck.
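
A minimal sketch of the kind of insulation Ben is describing, assuming hypothetical per-region decrypt functions rather than the actual AWS SDK: the key has been replicated to a second region, and the application tries the primary region first, then falls back to the replica.

```go
package main

import (
	"errors"
	"fmt"
)

// decryptFn stands in for a per-region KMS decrypt call. In a real service this
// would wrap an AWS SDK client configured for that region, against a key that
// has been replicated as a multi-region key.
type decryptFn func(ciphertext []byte) ([]byte, error)

// decryptWithFallback tries each region in order and returns the first success,
// so a KMS incident in the primary region doesn't block application startup.
func decryptWithFallback(ciphertext []byte, order []string, byRegion map[string]decryptFn) ([]byte, error) {
	var errs []error
	for _, region := range order {
		fn, ok := byRegion[region]
		if !ok {
			errs = append(errs, fmt.Errorf("no decrypt client for %s", region))
			continue
		}
		plaintext, err := fn(ciphertext)
		if err == nil {
			return plaintext, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", region, err))
	}
	return nil, errors.Join(errs...)
}

func main() {
	// Stub clients: the primary region is "down", the replica works.
	clients := map[string]decryptFn{
		"us-east-1": func([]byte) ([]byte, error) { return nil, errors.New("KMS unavailable") },
		"us-west-2": func([]byte) ([]byte, error) { return []byte("secret"), nil },
	}
	out, err := decryptWithFallback([]byte("ciphertext"), []string{"us-east-1", "us-west-2"}, clients)
	fmt.Println(string(out), err)
}
```
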
305
00:16:13,290 --> 00:16:18,210
So I, the, the thing is, I, I don't
wanna advocate that people try and go
306
00:16:18,210 --> 00:16:21,990
fully multi-region, but that's not to
say that we advocate all responsibility
307
00:16:21,990 --> 00:16:24,065
for insulating our application from.
308
00:16:24,795 --> 00:16:27,525
Having transient outages
in our dependencies.
309
00:16:27,945 --> 00:16:28,095
Corey: Yeah.
310
00:16:28,150 --> 00:16:31,005
To, to be clear, they did not do
a formal writeup on the KMS issue
311
00:16:31,005 --> 00:16:38,055
on their basically kind of not
terrific, uh, list of, uh, uh, list
312
00:16:38,085 --> 00:16:40,635
of, um, post-event summaries.
313
00:16:40,815 --> 00:16:42,855
It's, things have to be sort
of noisy for that to hit.
314
00:16:43,755 --> 00:16:46,965
I'm sure yesterday's will wind up on
that list once they have it; uh, they may have
315
00:16:47,205 --> 00:16:48,795
had that up before this thing publishes.
316
00:16:49,065 --> 00:16:51,285
But yeah, they did not
put the KMS issue there.
317
00:16:51,465 --> 00:16:52,335
You're completely correct.
318
00:16:53,145 --> 00:16:56,985
It's a, this is the sort of thing
of what is, what is the ba, what is
319
00:16:56,985 --> 00:16:58,365
the blast radius of these issues?
320
00:16:58,725 --> 00:17:04,305
And I, I think that there's this
sense that before we went to the
321
00:17:04,305 --> 00:17:07,305
cloud, everything was more reliable,
but just the opposite is true.
322
00:17:07,665 --> 00:17:10,665
Uh, the difference is that if
we were all building our data centers
323
00:17:10,665 --> 00:17:14,025
today, my shitty stuff at Duckbill
is down as it is every, you know,
324
00:17:14,085 --> 00:17:16,905
every random Tuesday and tomorrow.
325
00:17:16,970 --> 00:17:20,120
Honeycomb is down because, oops,
it turns out you once again have
326
00:17:20,120 --> 00:17:22,010
forgotten to replace a bad hard drive.
327
00:17:22,339 --> 00:17:22,790
Cool.
328
00:17:23,300 --> 00:17:24,710
But those are not happening
329
00:17:24,710 --> 00:17:27,680
at the same time. When you start
with the centralization story,
330
00:17:27,740 --> 00:17:31,970
suddenly a disproportionate swath
of the world is down simultaneously,
331
00:17:31,970 --> 00:17:33,440
and that's where things get weird.
332
00:17:33,830 --> 00:17:37,010
It gets even harder though because
you can test your durability and
333
00:17:37,010 --> 00:17:38,870
your resilience as much as you want.
334
00:17:39,639 --> 00:17:42,459
But it doesn't impact, it doesn't
account for the, the challenge of third
335
00:17:42,459 --> 00:17:44,500
party providers on your critical path.
336
00:17:45,730 --> 00:17:49,060
You're, you obviously need to make
sure that if, in order for Honeycomb
337
00:17:49,060 --> 00:17:51,730
to work, Honeycomb itself has to be up.
338
00:17:51,850 --> 00:17:52,929
That's sort of step one.
339
00:17:53,139 --> 00:17:57,490
But to do that, AWS itself has
to be up in certain places.
340
00:17:57,790 --> 00:17:59,320
What other vendors factor into this?
341
00:17:59,409 --> 00:18:00,790
Ben Hartshorne: You know, that
was, I think, the most interesting
342
00:18:00,790 --> 00:18:04,300
part of yesterday's challenge,
bringing the service back up.
343
00:18:04,990 --> 00:18:05,435
Uh, is that.
344
00:18:06,195 --> 00:18:09,885
We do rely on an incredible
number of other services.
345
00:18:10,215 --> 00:18:14,085
Uh, there's some list of all of
our vendors that is hundreds long.
346
00:18:14,445 --> 00:18:16,335
Now those are obviously very
different parts of the business.
347
00:18:16,335 --> 00:18:20,385
They involve, uh, you know, companies we
contract with for marketing outreach and
348
00:18:20,385 --> 00:18:22,350
for, uh, business and for all of that.
349
00:18:22,940 --> 00:18:23,210
Corey: Right.
350
00:18:23,210 --> 00:18:26,930
We use Dropbox here, and if Dropbox is
down, uh, it, that doesn't necessarily
351
00:18:26,930 --> 00:18:30,770
impact our ability to wind up serving
our customers, but it does mean I need
352
00:18:30,770 --> 00:18:34,700
to find a different way, for
example, to get the recorded file from
353
00:18:34,700 --> 00:18:36,980
this podcast over to my editing team.
354
00:18:37,070 --> 00:18:39,290
Ben Hartshorne: Yeah, so there's,
there's the very long list.
355
00:18:39,850 --> 00:18:42,820
And then there's the much, much
shorter list of vendors that are
356
00:18:42,820 --> 00:18:46,330
really in the critical path, and
we have a bunch of those too.
357
00:18:46,720 --> 00:18:51,189
Um, we use, uh, uh, vendors for
feature flagging and for sending
358
00:18:51,189 --> 00:18:57,250
email and uh, for, um, uh, some, some
other, uh, forms of telemetry that,
359
00:18:57,340 --> 00:18:58,750
that are destined for other spots.
360
00:19:00,550 --> 00:19:05,110
For the most part, when we get that
many vendors all relying on each other,
361
00:19:06,090 --> 00:19:07,380
they're all down at once.
362
00:19:07,650 --> 00:19:10,410
There's this bootstrapping problem
where they're all trying to come back,
363
00:19:10,410 --> 00:19:13,260
but they all sort of rely on each other
in order to come back successfully.
364
00:19:13,830 --> 00:19:17,190
And I think that's part of what
made yesterday morning's, uh, outage
365
00:19:18,240 --> 00:19:24,270
move from, uh, roughly what, like
midnight to 3:00 AM Pacific all the way
366
00:19:24,270 --> 00:19:29,160
through the rest of the day and, and
still have issues, uh, with, with some
367
00:19:29,160 --> 00:19:32,040
companies up until, uh, five, six, 7:00 PM.
368
00:19:32,400 --> 00:19:35,970
Corey: This episode is sponsored
by my own company, Duckbill.
369
00:19:36,210 --> 00:19:38,125
Having trouble with your AWS bill?
370
00:19:38,415 --> 00:19:41,865
Perhaps it's time to renegotiate
a contract with them.
371
00:19:42,195 --> 00:19:47,595
Maybe you're just wondering how to predict
what's going on in the wide world of AWS.
372
00:19:47,655 --> 00:19:50,295
Well, that's where Duck
Bill comes in to help.
373
00:19:50,504 --> 00:19:53,235
Remember, you can't duck the Duckbill
374
00:19:53,235 --> 00:19:56,835
Bill, which I am reliably
informed by my business partner
375
00:19:56,955 --> 00:19:59,445
is absolutely not our motto.
376
00:19:59,685 --> 00:20:02,770
To learn more, visit duckbillhq.com.
377
00:20:03,585 --> 00:20:07,425
The, the Google SRE book talked
about this, oh geez, when was it?
378
00:20:07,455 --> 00:20:08,504
15 years ago now.
379
00:20:08,534 --> 00:20:09,794
Damn near that.
380
00:20:09,825 --> 00:20:13,814
Uh, that at some point when a service
goes down and then it starts to recover,
381
00:20:13,995 --> 00:20:17,715
everything that depends on it will
often basically pummel it back into
382
00:20:17,715 --> 00:20:19,905
submission, trying to talk to the thing.
383
00:20:20,235 --> 00:20:24,945
It's a, like I remember back when I worked
at, uh, as a senior systems engineer at
384
00:20:24,945 --> 00:20:28,185
Media Temple in the days before GoDaddy
bought and then ultimately killed them.
385
00:20:28,514 --> 00:20:28,845
Uh.
386
00:20:29,030 --> 00:20:31,940
They, they, I was touring the
data center my first week.
387
00:20:31,940 --> 00:20:34,040
We had, uh, we had three
different facilities.
388
00:20:34,040 --> 00:20:36,200
I was in one of them and
I asked, okay, great.
389
00:20:36,200 --> 00:20:39,350
I just trip over things and hit
the emergency power off switch.
390
00:20:39,350 --> 00:20:39,770
Great.
391
00:20:39,890 --> 00:20:41,330
And kill the entire data center.
392
00:20:41,630 --> 00:20:43,760
There's an order that you
have to bring things back up in,
393
00:20:43,760 --> 00:20:47,000
in the event of those catastrophic
outages. Is there a runbook?
394
00:20:47,000 --> 00:20:48,380
And of course there was. Great.
395
00:20:48,380 --> 00:20:48,740
Where is it?
396
00:20:48,770 --> 00:20:49,790
Oh, it's in Confluence.
397
00:20:49,880 --> 00:20:50,300
Terrific.
398
00:20:50,300 --> 00:20:50,810
Where's that?
399
00:20:50,840 --> 00:20:52,340
Oh, in the rack over there.
400
00:20:52,939 --> 00:20:55,610
And I looked at the data center manager,
and she, she was delightful and
401
00:20:55,610 --> 00:20:57,979
incredibly on point, and she
knew exactly where I was going.
402
00:20:58,310 --> 00:20:59,719
We're gonna print that out right now.
403
00:21:00,050 --> 00:21:01,399
Excellent, excellent.
404
00:21:01,399 --> 00:21:01,820
Like that.
405
00:21:01,850 --> 00:21:02,840
That's why you ask.
406
00:21:02,840 --> 00:21:05,899
It's, it's someone who has never seen it
before, but knows how these things work,
407
00:21:05,899 --> 00:21:09,439
going through that. Because you build
dependency on top of dependency and you
408
00:21:09,439 --> 00:21:12,830
never get the luxury of taking a step
back and looking at it with fresh eyes.
409
00:21:13,010 --> 00:21:14,300
But that's what our industry has done.
410
00:21:14,340 --> 00:21:17,790
But you have, you have your vendors that
have their own critical dependencies
411
00:21:18,030 --> 00:21:21,419
that they may or may not have done as
good a job as you have of identifying
412
00:21:21,419 --> 00:21:23,399
those and so on and so forth.
413
00:21:23,399 --> 00:21:26,429
It's the end of a very long chain that
does kind of eat itself at some point.
414
00:21:26,850 --> 00:21:27,030
Ben Hartshorne: Yeah.
415
00:21:27,030 --> 00:21:28,500
There are two things
that that brings to mind.
416
00:21:28,649 --> 00:21:31,860
First, we absolutely saw exactly what
you're describing yesterday in our
417
00:21:31,860 --> 00:21:35,399
traffic patterns, where the, the volume
of incoming traffic would sort of
418
00:21:35,399 --> 00:21:36,629
come along and then it would drop.
419
00:21:36,885 --> 00:21:40,125
As their services went off, and
then it's quiet for a little while,
420
00:21:40,125 --> 00:21:43,275
and then we get this huge spike as
they're trying to like, you know,
421
00:21:43,305 --> 00:21:44,745
bring everything back on all at once.
422
00:21:45,135 --> 00:21:48,254
Uh, thankfully those were sort of
spread out across our customers, so
423
00:21:48,254 --> 00:21:52,815
we didn't have like, just one enormous
spike hit all of our, our servers.
424
00:21:53,205 --> 00:21:56,055
Um, but we did see them on
a, on a per customer basis.
425
00:21:56,060 --> 00:21:57,975
It's, it's a real, very real pattern.
426
00:21:58,485 --> 00:22:03,945
Um, but the second one, for all of
these dependencies, there are clearly
427
00:22:03,945 --> 00:22:05,985
several who have built their system
428
00:22:06,659 --> 00:22:12,750
with this challenge in mind and have
a series of different fallbacks.
429
00:22:13,110 --> 00:22:18,060
Uh, and, and, uh, I'll, I'll give
you the story of, um, uh, we used
430
00:22:18,060 --> 00:22:19,530
LaunchDarkly for our feature flagging.
431
00:22:20,550 --> 00:22:22,740
Um, their service was
also impacted yesterday.
432
00:22:24,600 --> 00:22:27,270
One would think, oh, we need our
feature flags in order to boot up.
433
00:22:28,110 --> 00:22:33,270
Well, their SDK is built with the idea
that you set your feature flag defaults
434
00:22:33,270 --> 00:22:35,915
in code, and if we can't reach our
service, we'll go ahead and use those.
435
00:22:37,169 --> 00:22:38,820
And if we can reach our service, great.
436
00:22:38,940 --> 00:22:39,690
We'll update them.
437
00:22:40,169 --> 00:22:42,060
And if we can update
them once, that's great.
438
00:22:42,090 --> 00:22:44,280
If we can connect to the
streaming service even better.
439
00:22:45,480 --> 00:22:49,320
And I, I think they also have, uh,
some, some more, uh, bridging in there.
440
00:22:49,320 --> 00:22:52,980
But, um, we don't use, uh, the, the
more complicated infrastructure.
441
00:22:53,250 --> 00:22:59,129
But this idea that they design the
system with the expectation that in the
442
00:22:59,129 --> 00:23:03,030
event of service unavailability,
things will continue to work,
443
00:23:04,260 --> 00:23:06,750
made the recovery process
all that much better.
444
00:23:07,260 --> 00:23:11,850
And, uh, even when, when, uh, their
service was unavailable and ours
445
00:23:11,850 --> 00:23:16,470
was still running, uh, the SDK
still answers questions in code for
446
00:23:16,470 --> 00:23:17,850
the status of all of these flags.
447
00:23:18,300 --> 00:23:20,520
It doesn't say, oh, I, I
can't reach my upstream.
448
00:23:20,520 --> 00:23:22,170
Suddenly, I can't give
you an answer anymore.
449
00:23:22,410 --> 00:23:26,550
No, the SDK is built with that idea
of local caching so that it can
450
00:23:26,610 --> 00:23:29,340
continue to serve the correct answer.
451
00:23:30,149 --> 00:23:32,909
So far as it knew, from whenever
it lost its connection.
452
00:23:33,030 --> 00:23:37,169
But it means that if, if they have a
transient outage, our stuff doesn't break.
453
00:23:37,830 --> 00:23:42,810
And that kind of design, uh, really,
uh, makes recovering from these like
454
00:23:42,840 --> 00:23:47,520
interdependent outages, uh, feasible
in a way that the, the, uh, the
455
00:23:47,520 --> 00:23:50,185
strict ordering you were describing
just is, is really difficult.
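
The layering Ben walks through, defaults compiled into the code and overridden by the last values the SDK managed to fetch, is simple to sketch. This is a generic illustration of that idea, not the LaunchDarkly SDK's actual API:

```go
package main

import "fmt"

// flagClient illustrates the fallback layering Ben describes: a default baked
// into code, optionally overridden by the last value fetched from the flag
// service while it was reachable.
type flagClient struct {
	defaults map[string]bool // compiled-in defaults, always available
	cached   map[string]bool // last values seen from the flag service
}

// BoolFlag answers from the cache when the service has ever been reached, and
// from the in-code default otherwise, so an outage upstream never turns into
// "I can't give you an answer anymore."
func (c *flagClient) BoolFlag(key string) bool {
	if v, ok := c.cached[key]; ok {
		return v
	}
	return c.defaults[key]
}

// Update records values received from the flag service (a one-shot poll or a
// streaming connection); it simply refreshes the local cache.
func (c *flagClient) Update(values map[string]bool) {
	for k, v := range values {
		c.cached[k] = v
	}
}

func main() {
	client := &flagClient{
		defaults: map[string]bool{"new-query-engine": false},
		cached:   map[string]bool{},
	}
	fmt.Println(client.BoolFlag("new-query-engine")) // false: service never reached, default wins
	client.Update(map[string]bool{"new-query-engine": true})
	fmt.Println(client.BoolFlag("new-query-engine")) // true: last known value survives an outage
}
```
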
456
00:23:51,075 --> 00:23:54,405
Corey: At least in my case, I, I have
the luxury of knowing these things just
457
00:23:54,405 --> 00:23:58,995
because I'm old and I, I figured this
out before it was SRE common knowledge,
458
00:23:58,995 --> 00:24:02,774
or SRE was a widely acknowledged thing
where, okay, you have a job server
459
00:24:02,774 --> 00:24:04,845
that runs cron jobs, uh, every day.
460
00:24:05,115 --> 00:24:08,445
And when it, it turns out that, oh,
when you find it missed a cron job,
461
00:24:08,520 --> 00:24:09,270
oopsy doozy.
462
00:24:09,270 --> 00:24:11,070
That's a problem for some of those things.
463
00:24:11,070 --> 00:24:14,010
So now you start building in error
checking and the rest, and then you
464
00:24:14,010 --> 00:24:17,280
do a restore for three days ago from
backup for that thing, and it suddenly
465
00:24:17,280 --> 00:24:21,150
thinks it missed all the cron jobs and
runs them all, and then hammers some
466
00:24:21,150 --> 00:24:22,980
other system to death when it shouldn't.
467
00:24:22,980 --> 00:24:27,270
And you, you learn iteratively of,
oh, that's kind of a failure mode.
468
00:24:27,450 --> 00:24:29,850
Like when you start externalizing
and hardening APIs, you
469
00:24:29,850 --> 00:24:31,650
build, you learn very quickly.
470
00:24:31,650 --> 00:24:36,210
Everything needs a rate limit, and
you need a way to make bad actors
471
00:24:36,240 --> 00:24:37,830
stop hammering your endpoints.
472
00:24:38,865 --> 00:24:40,575
Not just bad actors, naive ones.
473
00:24:40,755 --> 00:24:43,755
Ben Hartshorne: And, uh, rate limits
are a good, a good example because,
474
00:24:43,875 --> 00:24:48,345
um, uh, that is one of the things that
that did happen uh, yesterday as people
475
00:24:48,345 --> 00:24:52,000
were coming back, we actually wound
up needing to rate limit ourselves.
476
00:24:53,445 --> 00:24:56,594
We didn't have to rate limit
our customers, but the, because,
477
00:24:56,625 --> 00:24:58,485
uh, so brief digression here.
478
00:24:58,814 --> 00:25:02,024
Um, Honeycomb uses Honeycomb
in order to build Honeycomb.
479
00:25:02,115 --> 00:25:04,784
Uh, we, we are our own
observability vendor.
480
00:25:05,145 --> 00:25:10,544
Uh, now this, this leads to some obvious,
um, uh, challenges in architecture.
481
00:25:11,145 --> 00:25:13,274
Uh, you know, how, how
do we know we're right?
482
00:25:13,665 --> 00:25:16,545
Well, in the beginning we did have
some other services that we'd use to
483
00:25:16,545 --> 00:25:19,305
checkpoint our, our numbers and make sure
that they, they were actually correct.
484
00:25:19,665 --> 00:25:22,845
Uh, but our production instance
sits here and serves our customers
485
00:25:23,145 --> 00:25:26,500
and all of its telemetry goes
into the next one down the chain.
486
00:25:28,140 --> 00:25:31,560
We call that dog food because we are,
uh, you know, the, the whole phrase of
487
00:25:31,560 --> 00:25:34,650
eating your own dog food, uh, drinking
your own champagne is the other,
488
00:25:34,650 --> 00:25:37,050
uh, um, more, more pleasing version.
489
00:25:37,350 --> 00:25:40,200
Um, so the, from our
production, it goes to dog food.
490
00:25:40,200 --> 00:25:40,890
From dog food.
491
00:25:40,890 --> 00:25:42,000
Well, what's dog food made of?
492
00:25:42,000 --> 00:25:43,050
It's made up of, of kibble.
493
00:25:43,140 --> 00:25:45,180
So our third environment is called kibble.
494
00:25:45,360 --> 00:25:49,710
Uh, so the, the dog food telemetry, it
goes into this third environment and
495
00:25:49,710 --> 00:25:52,500
that third environment, well, we need
to know if it's working too, so it
496
00:25:52,500 --> 00:25:54,060
feeds back into our production instance.
497
00:25:54,555 --> 00:25:57,315
Each of these instances,
uh, is emitting telemetry.
498
00:25:57,555 --> 00:26:01,515
Uh, and we have our, um, rate
limiting and our, I'm sorry, our
499
00:26:01,515 --> 00:26:03,225
tail sampling proxy called refinery.
500
00:26:03,555 --> 00:26:08,745
That, uh, helps us reduce volume so it's
not a, a positively amplifying cycle.
501
00:26:09,525 --> 00:26:15,705
Um, but in this, in this incident
yesterday, uh, we started emitting
502
00:26:15,705 --> 00:26:17,805
logs that we don't normally emit.
503
00:26:18,585 --> 00:26:20,295
These are coming from some of our SDKs,
504
00:26:23,460 --> 00:26:24,180
their services.
505
00:26:24,899 --> 00:26:30,120
And so suddenly we started getting
two or three or four log entries
506
00:26:30,120 --> 00:26:31,680
for every event we were sending.
507
00:26:31,680 --> 00:26:35,909
And, uh, it did get into this
kind of amplifying cycle.
508
00:26:36,629 --> 00:26:39,899
So we, we put, uh, a pretty
heavy rate limit on the kibble
509
00:26:39,899 --> 00:26:43,620
environment in order to squash
that traffic and disrupt the cycle.
510
00:26:43,980 --> 00:26:47,370
Uh, which, which made it
difficult to ensure that it was
511
00:26:47,370 --> 00:26:49,020
working correctly. But
512
00:26:49,365 --> 00:26:53,205
it was, and, and that let us make sure
that, make sure that the production
513
00:26:53,205 --> 00:26:54,225
instance was working all right.
514
00:26:54,645 --> 00:26:57,735
Um, but this idea of rate limits
being a, a critical part of
515
00:26:57,765 --> 00:27:02,415
maintaining an interconnected stack,
uh, in order to, to suppress these
516
00:27:02,415 --> 00:27:04,815
kind of, um, uh, like wavelike
517
00:27:06,270 --> 00:27:10,230
formations, the oscillations that start
growing on each other and amplifying
518
00:27:10,230 --> 00:27:13,890
themselves, uh, can, can take any
infrastructure down and being able
519
00:27:13,890 --> 00:27:17,100
to put in, uh, just at the right point,
a little, a couple switches and say,
520
00:27:17,250 --> 00:27:21,240
Nope, suppress that signal, uh, really
made a big difference in our ability
521
00:27:21,240 --> 00:27:23,100
to, to bring back all of the services.
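
The switch Ben mentions, a heavy rate limit dropped in front of one environment's ingest to break the feedback loop, can be as simple as a token bucket. A bare-bones sketch, not Honeycomb's or Refinery's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a minimal rate limiter of the kind you can put in front of
// one environment's ingest to squash an amplifying telemetry loop.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	max    float64
	rate   float64 // tokens added per second
	last   time.Time
}

func newTokenBucket(ratePerSecond, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, max: burst, rate: ratePerSecond, last: time.Now()}
}

// Allow refills the bucket based on elapsed time and spends one token per
// event; events that arrive with the bucket empty are dropped (suppressed).
func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.max {
		b.tokens = b.max
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	// Heavily limit a noisy environment: 5 events per second with a burst of 10.
	limiter := newTokenBucket(5, 10)
	accepted, dropped := 0, 0
	for i := 0; i < 100; i++ {
		if limiter.Allow() {
			accepted++
		} else {
			dropped++
		}
	}
	fmt.Println("accepted:", accepted, "dropped:", dropped)
}
```
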
522
00:27:23,195 --> 00:27:23,315
Corey: I,
523
00:27:23,340 --> 00:27:26,610
I want to pivot to one last
topic, but I, we could talk about
524
00:27:26,610 --> 00:27:27,835
this outage for days and hours.
525
00:27:28,605 --> 00:27:32,385
I, but there's, uh, something that you
mentioned you wanted to go into that I
526
00:27:32,385 --> 00:27:37,185
wanted to pick a fight with you over, uh,
was how to get people to instrument their
527
00:27:37,185 --> 00:27:41,534
applications for observability so
they can understand their applications,
528
00:27:41,534 --> 00:27:43,245
their performance, and the rest.
529
00:27:43,425 --> 00:27:47,115
And I'm gonna go with the easy answer
because it's a pain in the ass.
530
00:27:47,115 --> 00:27:50,445
Ben, have you tried instrumenting
an application that already
531
00:27:50,445 --> 00:27:52,875
exists without having to spend a
532
00:27:52,875 --> 00:27:53,445
week on it?
533
00:27:53,804 --> 00:27:54,345
Ben Hartshorne: I.
534
00:27:58,064 --> 00:27:58,784
You're not wrong.
535
00:27:59,115 --> 00:28:02,175
It's a pain in the ass
and it's getting better.
536
00:28:02,625 --> 00:28:04,274
There's lots of ways to make it better.
537
00:28:04,695 --> 00:28:06,975
Uh, there are packages that
do auto instrumentation.
538
00:28:07,215 --> 00:28:07,784
Corey: Oh yeah, absolutely.
539
00:28:07,784 --> 00:28:08,235
In my case.
540
00:28:08,235 --> 00:28:08,354
Yeah.
541
00:28:08,354 --> 00:28:09,405
It's Claude Code's problem
542
00:28:09,405 --> 00:28:10,695
now. I'm getting another drink.
543
00:28:10,784 --> 00:28:14,804
Ben Hartshorne: You know, uh, you,
you say that in jest and yet, um,
544
00:28:14,834 --> 00:28:16,965
they are actually getting really good.
545
00:28:17,235 --> 00:28:17,445
Yeah.
546
00:28:17,804 --> 00:28:19,304
Corey: No, that's what I've been doing.
547
00:28:19,304 --> 00:28:20,264
It works super well.
548
00:28:20,264 --> 00:28:22,844
You test it first, obviously, but yeah.
549
00:28:23,565 --> 00:28:24,825
YOLO slammed that into production.
550
00:28:24,885 --> 00:28:25,275
But yeah,
551
00:28:25,575 --> 00:28:27,555
Ben Hartshorne: the, uh, the,
the LLMs are actually getting
552
00:28:27,555 --> 00:28:30,735
pretty good at understanding where
instrumentation can be useful.
553
00:28:30,735 --> 00:28:32,625
I say understanding, I
put that in air quotes.
554
00:28:32,895 --> 00:28:36,885
Uh, they're good at, uh, finding code
that represents a, a good place to,
555
00:28:36,915 --> 00:28:40,215
to put instrumentation and, and adding
it to your code in the right place.
556
00:28:40,555 --> 00:28:42,835
Corey: I need to take another
try at it one of these days.
557
00:28:42,865 --> 00:28:46,885
Uh, the last time I played with Honeycomb,
I instrumented my home Kubernetes
558
00:28:46,885 --> 00:28:51,085
cluster and I exceeded the limits of
the free tier based on ingest volume
559
00:28:51,145 --> 00:28:52,615
by the second day of every month.
560
00:28:53,215 --> 00:28:58,495
And that led to either you have really
unfair limits, which I don't believe to be
561
00:28:58,525 --> 00:29:03,595
true or the more insightful question, what
the hell is my Kubernetes cluster doing?
562
00:29:03,595 --> 00:29:04,735
That's that chatty.
563
00:29:05,770 --> 00:29:08,199
So I rebuilt the whole thing
from scratch, so it's time for me
564
00:29:08,199 --> 00:29:09,399
to go back and figure that out.
565
00:29:09,429 --> 00:29:09,699
Ben Hartshorne: Yeah.
566
00:29:09,699 --> 00:29:14,260
So, um, I will say a lot of, a lot
of instrumentation is terrible.
567
00:29:14,980 --> 00:29:21,010
A lot of instrumentation is based on
this idea that every single signal must
568
00:29:21,010 --> 00:29:28,510
be published all the time, and, um,
that that's not relevant to you as a
569
00:29:28,510 --> 00:29:30,070
person running the Kubernetes cluster.
570
00:29:30,689 --> 00:29:35,100
You know, do you need to know
every time, uh, the, the, um, a, a
571
00:29:35,100 --> 00:29:38,429
local pod checks in to see whether
it, uh, needs to be evicted?
572
00:29:38,939 --> 00:29:39,750
No, you don't.
573
00:29:40,169 --> 00:29:44,250
What you're interested in are the,
the types of activities that are
574
00:29:44,250 --> 00:29:47,909
relevant to what you need to do
as an operator of that cluster.
575
00:29:48,179 --> 00:29:49,560
And the same is true of an application.
576
00:29:50,040 --> 00:29:56,250
If you just, you know, put, uh,
uh, in the tracing language, put a
577
00:29:56,250 --> 00:29:58,260
span on every single function call.
578
00:29:58,950 --> 00:30:03,784
You will not have useful traces
because it doesn't map to, uh, a,
579
00:30:04,020 --> 00:30:07,710
a useful way of representing your
user's journey through your product.
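
In OpenTelemetry terms, the distinction Ben is drawing looks roughly like this: one span per user-meaningful operation, with the interesting detail attached as attributes, instead of a span around every helper call. The operation and attribute names here are made up for illustration:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// exportReport is a hypothetical user-facing operation: one span for the whole
// thing, with the detail worth querying later carried as attributes, rather
// than a span wrapped around every helper function it calls.
func exportReport(ctx context.Context, userID string, rows int) error {
	ctx, span := otel.Tracer("reports").Start(ctx, "export_report")
	defer span.End()
	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.Int("report.rows", rows),
	)

	// Helpers stay span-free; they inherit the context if they ever need it.
	formatRows(ctx, rows)
	return nil
}

func formatRows(ctx context.Context, n int) { _, _ = ctx, n }

func main() {
	// With no tracer provider configured, otel.Tracer returns a no-op
	// implementation, so this sketch compiles and runs as-is.
	_ = exportReport(context.Background(), "user-123", 42)
}
```
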
580
00:30:08,490 --> 00:30:12,120
So there's definitely some nuance
to getting the right level of
581
00:30:12,120 --> 00:30:17,159
instrumentation, and I think the right
level, it's not a single place, uh, it's
582
00:30:17,159 --> 00:30:20,639
a continuously moving spectrum based
on what you were trying to understand
583
00:30:20,850 --> 00:30:22,350
about what your application is doing.
584
00:30:22,980 --> 00:30:24,600
So, uh, at least at Honeycomb,
585
00:30:25,440 --> 00:30:30,030
we add instrumentation all the time, and
we remove instrumentation all the time
586
00:30:30,960 --> 00:30:35,550
because what's relevant to me now as I'm
building out this feature is different
587
00:30:35,940 --> 00:30:40,379
from what I need to know about that
feature once it is fully built and stable
588
00:30:40,560 --> 00:30:42,600
and running in, in a regular workload.
589
00:30:43,440 --> 00:30:48,720
Um, furthermore, as I'm looking
at a specific problem or question,
590
00:30:48,720 --> 00:30:52,140
I, we talked about, uh, you know, pricing
for Lambdas at the beginning of this.
591
00:30:52,530 --> 00:30:57,300
Um, there was a time when we really
wanted to understand pricing for S3 and
592
00:30:57,390 --> 00:31:00,810
part of our model, it, it's a struggle.
593
00:31:01,020 --> 00:31:04,740
Um, part of our, part of our storage
model is that, uh, we store our customers'
594
00:31:04,740 --> 00:31:07,080
telemetry in S3, in, in many files.
595
00:31:07,295 --> 00:31:07,645
Files.
596
00:31:07,645 --> 00:31:11,520
And we put instrumentation around
every single S3 access.
597
00:31:12,125 --> 00:31:16,294
In order to understand both the volume
and the latency of those to, to see
598
00:31:16,294 --> 00:31:19,294
like, okay, should we bundle them
up or resize it like this and how
599
00:31:19,294 --> 00:31:20,675
does that influence SLOs and so on.
600
00:31:21,004 --> 00:31:24,185
And it's incredibly expensive to
do that kind of, uh, experiment.
601
00:31:24,425 --> 00:31:26,524
And it, it's not just
expensive in dollars.
602
00:31:26,885 --> 00:31:30,725
Adding that level of instrumentation
does have an impact on the overall
603
00:31:30,725 --> 00:31:32,345
performance of, of the system.
604
00:31:32,794 --> 00:31:36,845
When you're making 10,000 calls
to S3 and you add a span around
605
00:31:36,845 --> 00:31:39,305
every one, it takes a bit more time.
606
00:31:39,875 --> 00:31:40,325
So,
607
00:31:40,710 --> 00:31:44,340
once we understood the system well
enough to, to make the change we wanted
608
00:31:44,340 --> 00:31:45,750
to make, we pulled all that back out.
609
00:31:46,890 --> 00:31:49,950
So, for your Kubernetes cluster,
uh, you know, maybe it's interesting
610
00:31:49,950 --> 00:31:53,760
at the very beginning to, to look
at every single, uh, connection
611
00:31:53,760 --> 00:31:55,350
that any, any process might make.
612
00:31:57,240 --> 00:32:00,210
But if it's your home cluster,
that's not really what you
613
00:32:00,210 --> 00:32:01,710
need to know as an operator.
614
00:32:02,385 --> 00:32:07,395
So finding the right balance there of
instrumentation that lets you fulfill
615
00:32:07,395 --> 00:32:11,055
the needs of the business, that lets
you understand the, the needs of the
616
00:32:11,055 --> 00:32:16,275
operator in order to, uh, best be
able to provide the service that this
617
00:32:16,275 --> 00:32:17,985
business is providing to its customers.
618
00:32:19,125 --> 00:32:22,185
It's a, it's a place somewhere
there in the middle, and you're
619
00:32:22,185 --> 00:32:23,385
gonna need some people to find it,
620
00:32:23,745 --> 00:32:24,465
Corey: and that's
621
00:32:25,215 --> 00:32:26,895
easier said than done for a lot of folks.
622
00:32:27,270 --> 00:32:29,280
But you're right, it is getting
easier to instrument these things.
623
00:32:29,280 --> 00:32:33,660
It is something that is iteratively
getting better all the time, uh, to the
624
00:32:33,660 --> 00:32:37,140
point where now, like this is an area
where AI is surprisingly effective.
625
00:32:37,740 --> 00:32:42,750
It doesn't take a lot to wrap a
function call with a decorator.
626
00:32:42,930 --> 00:32:43,140
Ben Hartshorne: Mm-hmm.
627
00:32:43,875 --> 00:32:46,754
It just takes a lot of doing that
over and over and over again.
628
00:32:47,175 --> 00:32:50,685
You, you do a lot of them and you see
what it looks like and then you see, okay,
629
00:32:50,685 --> 00:32:56,895
which ones of these are actually useful
to me now. That's gonna, and uh, we want to be
630
00:32:58,245 --> 00:33:01,965
open to that changing and
willing to understand that, uh,
631
00:33:02,115 --> 00:33:03,495
that this is an evolving thing.
632
00:33:03,824 --> 00:33:07,064
And this does actually tie back to
one of the core operating principles
633
00:33:07,155 --> 00:33:14,385
of modern sa uh, architectures, the
ability to deploy your code quickly.
634
00:33:15,405 --> 00:33:19,215
Because if you're in this cycle of
adding instrumentation, of removing
635
00:33:19,215 --> 00:33:20,445
instrumentation, you see a bug.
636
00:33:20,445 --> 00:33:25,185
It has to be easy enough to add a
little bit more data to get insight
637
00:33:25,185 --> 00:33:26,895
into that bug in order to resolve it.
638
00:33:27,555 --> 00:33:31,305
Or you're not gonna do it and the
whole business suffers for it.
639
00:33:31,965 --> 00:33:33,075
What is quickly to you?
640
00:33:34,095 --> 00:33:36,405
uh, in.
641
00:33:38,685 --> 00:33:42,915
Uh, I need to make this change and, uh,
it's visible in my test environment:
642
00:33:43,185 --> 00:33:46,005
a couple of minutes. I need to
make this change and have it
643
00:33:46,005 --> 00:33:47,415
visible running in production.
644
00:33:47,865 --> 00:33:52,995
Um, it depends on like how, how much
the, the, uh, how frequently, how frequent
645
00:33:52,995 --> 00:33:56,385
the bug comes, but I'm, I'm actually
okay with it being about, about an
646
00:33:56,385 --> 00:33:58,605
hour for that kind of, uh, turnaround.
647
00:33:58,935 --> 00:34:01,485
I know a lot of people say you should
have your code running in 15 minutes.
648
00:34:01,875 --> 00:34:02,445
That's great.
649
00:34:02,955 --> 00:34:06,765
Uh, I know that's outta reach for a lot
of people in a lot of industries, so, um.
650
00:34:07,889 --> 00:34:10,469
I'm, I'm not a hardliner
on, on how quickly it has to
651
00:34:10,469 --> 00:34:12,089
be, but it can't be a week.
652
00:34:12,659 --> 00:34:18,299
It can't, it, it can barely, it can't
be a day, just like, you're gonna
653
00:34:18,359 --> 00:34:20,969
wanna do this two or three times
in the course of resolving a bug.
654
00:34:21,299 --> 00:34:26,219
And so if it's something too long,
you're just really pushing out any
655
00:34:26,219 --> 00:34:27,900
ability to respond quickly to a customer.
656
00:34:28,230 --> 00:34:30,929
Corey: I really wanna thank you for taking
the time to speak with me about all this.
657
00:34:31,049 --> 00:34:33,120
If people wanna learn more, where's
the best place for them to go?
658
00:34:33,989 --> 00:34:38,279
Ben Hartshorne: You know, I have,
uh, backed off of almost all of
659
00:34:38,279 --> 00:34:42,029
the platforms in which people carry
on conversations in the internet.
660
00:34:42,239 --> 00:34:42,870
Corey: Everyone
661
00:34:42,870 --> 00:34:43,830
seems to have done this.
662
00:34:44,669 --> 00:34:48,299
Ben Hartshorne: I, I, uh, I,
I did work for Facebook for
663
00:34:48,419 --> 00:34:50,759
two and a half years and, um,
664
00:34:50,940 --> 00:34:51,899
Corey: someday I might forgive you.
665
00:34:52,739 --> 00:34:53,879
Ben Hartshorne: Someday
I might forgive myself.
666
00:34:54,089 --> 00:34:54,600
Um.
667
00:34:58,050 --> 00:35:03,030
Really different environment and, uh, I
could see the allure of the world they're
668
00:35:03,030 --> 00:35:05,010
trying to create and it doesn't match.
669
00:35:05,220 --> 00:35:06,960
Oh, I interviewed there in 2009.
670
00:35:06,960 --> 00:35:08,430
It was, it was incredibly compelling.
671
00:35:08,970 --> 00:35:12,300
Um, it doesn't match the, the view
that I see of the world we're in.
672
00:35:12,750 --> 00:35:17,610
And so, um, uh, I have a, a
presence at, at Honeycomb.
673
00:35:17,700 --> 00:35:22,950
Um, I do have, uh, accounts on
all of the major, um, platforms,
674
00:35:23,250 --> 00:35:24,420
so you can find me there.
675
00:35:24,825 --> 00:35:30,134
Uh, there, there will be links afterwards
I'm sure, but, um, LinkedIn, Bluesky.
676
00:35:30,855 --> 00:35:31,245
I dunno.
677
00:35:31,755 --> 00:35:33,375
GitHub, is that a social
media platform now?
678
00:35:34,035 --> 00:35:34,665
Corey: They wish.
679
00:35:35,384 --> 00:35:36,075
We'll put all this in
680
00:35:36,075 --> 00:35:38,025
the show notes. Problem solved for us.
681
00:35:38,085 --> 00:35:40,095
Thank you so much for taking
the time to speak with me.
682
00:35:40,095 --> 00:35:40,785
I appreciate it.
683
00:35:41,055 --> 00:35:41,745
Ben Hartshorne: It's a real pleasure.
684
00:35:41,895 --> 00:35:42,285
Thank you.
685
00:35:42,585 --> 00:35:45,765
Corey: Ben Hartshorne is the
principal engineer at Honeycomb.
686
00:35:45,855 --> 00:35:48,495
One of them, possibly, they
might have more than one.
687
00:35:48,495 --> 00:35:51,855
Seems to be something you can scale,
unlike my nonsense as Chief Cloud
688
00:35:51,855 --> 00:35:53,325
Economist at the Duckbill Group.
689
00:35:53,805 --> 00:35:55,395
And this is Screaming in the Cloud.
690
00:35:55,740 --> 00:35:58,589
If you've enjoyed this podcast,
please leave a five star review on
691
00:35:58,589 --> 00:36:00,270
your podcast platform of choice.
692
00:36:00,359 --> 00:36:03,600
Whereas if you've hated this podcast,
please leave a five star review on
693
00:36:03,600 --> 00:36:07,500
your podcast platform of choice along
with an insulting comment that won't
694
00:36:07,500 --> 00:36:10,709
work because that platform is down and
not accepting comments at this moment.