Episode Transcript
1
00:00:03,523 --> 00:00:16,243
So I'm curious if you can define what a Gaussian process is, because I think my audience
has a good idea of what a Bayesian neural network is.
2
00:00:16,643 --> 00:00:22,703
I've had, especially recently, Vincent Fortuin talk about that on the show.
3
00:00:22,703 --> 00:00:25,183
I'll put that also in the show notes.
4
00:00:25,583 --> 00:00:30,270
So this Bayesian deep learning, I think, people are
5
00:00:30,270 --> 00:00:31,831
familiar with.
6
00:00:31,831 --> 00:00:35,514
Can you tell us what a deep Gaussian process is?
7
00:00:35,514 --> 00:00:40,898
I think people see what a Gaussian process is, but what makes it a deep one?
8
00:00:40,898 --> 00:00:42,248
Great episode.
9
00:00:42,248 --> 00:00:43,840
The one with Vincent, by the way.
10
00:00:43,840 --> 00:00:44,820
I checked it out.
11
00:00:44,820 --> 00:00:45,311
Thank you.
12
00:00:45,311 --> 00:00:49,574
Because I guess they said a lot of things that I would probably also say in my
episode.
13
00:00:49,574 --> 00:00:51,845
So it was great to see it.
14
00:00:52,045 --> 00:00:56,549
So yeah, so a Gaussian process, there are many ways in which you can see it.
15
00:00:56,549 --> 00:00:58,942
The easiest way is probably to start from a linear model.
16
00:00:58,942 --> 00:01:02,843
I think I really like the construction from a linear model.
17
00:01:02,843 --> 00:01:12,676
So if we start from a linear model and we make it Bayesian, so we put a prior on the
parameters, then we have analytical forms for the posterior, the predictions, everything
18
00:01:12,676 --> 00:01:14,206
is nice and Gaussian.
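A minimal sketch of the Bayesian linear model being described, assuming a zero-mean Gaussian prior on the weights and Gaussian noise; the hyperparameter values and function names are illustrative:

```python
import numpy as np

def bayesian_linear_regression(X, y, alpha=1.0, sigma=0.1):
    """Prior w ~ N(0, alpha^2 I), likelihood y ~ N(Xw, sigma^2 I)."""
    d = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(d) / alpha**2   # posterior precision
    S = np.linalg.inv(A)                            # posterior covariance
    m = S @ X.T @ y / sigma**2                      # posterior mean
    return m, S

def predictive(X_star, m, S, sigma=0.1):
    """Gaussian predictive mean and variance at new inputs."""
    mean = X_star @ m
    var = np.sum((X_star @ S) * X_star, axis=1) + sigma**2
    return mean, var

# Toy usage on a noisy line y = 2x + noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])  # bias + one feature
y = 2.0 * X[:, 1] + 0.1 * rng.standard_normal(20)
m, S = bayesian_linear_regression(X, y)
print(predictive(X[:3], m, S))
```

Everything here stays Gaussian in closed form, which is the property the construction below keeps exploiting.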
19
00:01:14,206 --> 00:01:21,458
And so now one nice thing we can do is to start thinking about linear regression, but now
with basis functions.
20
00:01:21,458 --> 00:01:28,980
So we start introducing linear combinations, not of just the covariates or features, if
you want to call them that.
21
00:01:29,071 --> 00:01:36,271
But now you have a transformation of them, say sines and cosines; it could be trigonometric functions of
any kind, could be polynomials.
22
00:01:36,271 --> 00:01:46,131
And it turns out that you can use kernel tricks to be able to say what the predictive
distribution is going to be for this.
23
00:01:46,131 --> 00:01:52,271
The model is still linear in the parameters, but now what we can do is to take the number
of basis functions to infinity.
24
00:01:52,271 --> 00:01:57,431
So we can make a very large polynomial.
25
00:01:57,473 --> 00:01:59,945
And now the number of parameters will be infinite.
26
00:01:59,945 --> 00:02:14,077
But what we can do is to use this so-called kernel trick to actually express
everything in terms of scalar products among these mappings of the inputs into this polynomial space.
27
00:02:14,077 --> 00:02:23,304
And so if you do that, then what you can do is to, instead of working with polynomials or
these basis functions, now you can define a so-called kernel function, which is the one
28
00:02:23,304 --> 00:02:26,146
that takes input features.
29
00:02:26,146 --> 00:02:33,190
And it spits out a scalar product of these induced polynomials in this very large,
infinite-dimensional space.
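As a rough illustration of the kernel function being described, here is an RBF kernel, used as one common choice (the specific kernel is an assumption): it takes raw input features and returns the scalar product of implicit, infinite-dimensional feature maps that are never built explicitly.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2)).
    Equals an inner product of infinite-dimensional feature maps of x and x'."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

X = np.linspace(-3, 3, 5)[:, None]   # 5 one-dimensional inputs
K = rbf_kernel(X, X)                 # 5 x 5 Gram (covariance) matrix
print(K.shape)
```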
30
00:02:33,190 --> 00:02:44,216
So this kernel trick allows you to work with something which is infinitely
powerful, in a way, because it's infinitely flexible: you have an infinite
31
00:02:44,216 --> 00:02:45,756
number of parameters now.
32
00:02:45,917 --> 00:02:53,331
But the great thing is that if you have only n observations, all you need to do is care
about what happens at these n observations.
33
00:02:53,331 --> 00:02:56,045
And so you can construct this covariance matrix and
34
00:02:56,045 --> 00:02:58,557
you know, it can do and everything is Gaussian again.
35
00:02:58,557 --> 00:02:59,227
It's very nice.
36
00:02:59,227 --> 00:03:09,884
So the first time you generate a function from a Gaussian process, it's beautiful, because
you get these nice functions that look beautiful, and it's just a multivariate normal
37
00:03:09,884 --> 00:03:10,365
really.
38
00:03:10,365 --> 00:03:12,546
And it's just, that's all it is, you know?
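A short sketch of that "first sample from a GP" moment, assuming an RBF covariance on a grid of inputs and a small jitter term for numerical stability:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.5):
    d2 = (X1[:, None, 0] - X2[None, :, 0])**2
    return np.exp(-0.5 * d2 / lengthscale**2)

x = np.linspace(-3, 3, 200)[:, None]            # grid of inputs
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # covariance matrix + jitter
L = np.linalg.cholesky(K)
rng = np.random.default_rng(1)
samples = L @ rng.standard_normal((len(x), 3))  # three draws from N(0, K)
# Each column of `samples` is one smooth random function evaluated on the grid:
# a GP prior draw really is just a big multivariate normal sample.
```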
39
00:03:12,546 --> 00:03:21,231
So I still remember the first time I generated a function from a GP, because it was a
eureka moment, you know, where you realize how simple and beautiful this is.
40
00:03:21,392 --> 00:03:24,940
And now, so then, you can think of it this way.
41
00:03:24,940 --> 00:03:27,811
This represents a distribution over functions.
42
00:03:27,811 --> 00:03:32,833
So if you draw from this GP, you obtain samples that are functions.
43
00:03:33,074 --> 00:03:43,618
And now what you can do is to say, well, what if I take this function now and instead of
just observing this function alone, I just put it inside as an input to another Gaussian
44
00:03:43,618 --> 00:03:44,438
process.
45
00:03:44,438 --> 00:03:50,441
So in a GP, you have inputs, which are your input data where you have observations.
46
00:03:50,521 --> 00:03:53,582
So now you're mapping into functions.
47
00:03:53,846 --> 00:04:04,049
And then this function can now become the input to another GP, for example, you know.
And then you can even say, okay, let's take these inputs and map them not just to a
48
00:04:04,049 --> 00:04:09,210
univariate Gaussian process where we have just one function, but maybe we can map it into
10 functions.
49
00:04:09,210 --> 00:04:13,131
And then these 10 functions become the input to a new Gaussian process.
50
00:04:13,131 --> 00:04:18,403
And so this would be a one-layer deep Gaussian process, right?
51
00:04:18,403 --> 00:04:23,394
So you now have one layer, which is a first set of hidden functions that then enter
52
00:04:23,394 --> 00:04:27,618
as inputs to another Gaussian process.
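A hedged sketch of that composition as a sampling procedure; the choice of 10 hidden functions and RBF kernels is illustrative, and no inference is done here:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_prior_sample(X, n_functions, lengthscale, rng):
    K = rbf_kernel(X, X, lengthscale) + 1e-8 * np.eye(len(X))
    return np.linalg.cholesky(K) @ rng.standard_normal((len(X), n_functions))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]
H = gp_prior_sample(x, n_functions=10, lengthscale=1.0, rng=rng)  # hidden layer
f = gp_prior_sample(H, n_functions=1, lengthscale=1.0, rng=rng)   # output layer
# `f` is one draw from a one-hidden-layer deep GP prior: the hidden functions H
# become the inputs of the second GP, and the marginals of f are non-Gaussian.
```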
53
00:04:27,618 --> 00:04:28,859
What's the advantage of this?
54
00:04:28,859 --> 00:04:30,080
Why do we do this?
55
00:04:30,080 --> 00:04:41,629
Well, you know, with a Gaussian process, the
characteristics that you observe for the functions that you will generate are determined
56
00:04:41,629 --> 00:04:43,761
by the choice of the covariance function.
57
00:04:43,761 --> 00:04:52,598
So if you take a covariance function which is an RBF, you're going to generate infinitely
smooth functions.
58
00:04:52,928 --> 00:05:02,694
And the way these functions are going to look, their length scale and their
amplitude, is going to be determined by the parameters that you put in the covariance
59
00:05:02,694 --> 00:05:03,644
function.
60
00:05:03,825 --> 00:05:09,378
And of course, you know, there might be problems where you have non-stationarity.
61
00:05:09,378 --> 00:05:13,241
So in a part of the space, functions should be nice and smooth.
62
00:05:13,241 --> 00:05:17,173
In other parts of the space, maybe you want more flexibility.
63
00:05:17,333 --> 00:05:19,254
And then, you know,
64
00:05:19,270 --> 00:05:24,634
A Gaussian process with a standard covariance function cannot achieve that.
65
00:05:24,995 --> 00:05:35,590
And so in order to increase flexibility, you either spend time designing kernels that
actually can do crazy things, which is possible, but relatively hard because now you have
66
00:05:35,590 --> 00:05:36,965
a lot of choices.
67
00:05:36,965 --> 00:05:39,347
You can combine kernels in multiple ways.
68
00:05:39,347 --> 00:05:46,052
And if you have a space of possible kernels you want to choose from, combining them, you
know, becomes a combinatorial problem.
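For illustration, a small sketch of what "combining kernels" looks like: sums and products of valid kernels are again valid kernels, which is exactly why the design space blows up combinatorially. The specific kernels and hyperparameters are just examples.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    d2 = (X1[:, None, 0] - X2[None, :, 0])**2
    return np.exp(-0.5 * d2 / ls**2)

def periodic(X1, X2, period=1.0, ls=1.0):
    d = np.abs(X1[:, None, 0] - X2[None, :, 0])
    return np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ls**2)

def linear(X1, X2):
    return X1 @ X2.T

X = np.linspace(0, 5, 50)[:, None]
# Two of the many possible compositions:
k_trend_plus_season = linear(X, X) + periodic(X, X, period=1.0)       # sum of kernels
k_locally_periodic  = rbf(X, X, ls=2.0) * periodic(X, X, period=1.0)  # product of kernels
```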
69
00:05:46,052 --> 00:05:48,418
So you may say instead, let's just...
70
00:05:48,418 --> 00:05:51,591
compose functions and composition is very powerful.
71
00:05:51,591 --> 00:05:53,383
And this is why deep learning works.
72
00:05:53,383 --> 00:05:57,767
Because in deep learning, you essentially have function compositions.
73
00:05:57,767 --> 00:06:04,894
And so even if you compose simple things, the result is something very complicated and you
can try it yourself.
74
00:06:04,894 --> 00:06:08,757
You know, take a sine function and put it into another sine function.
75
00:06:08,757 --> 00:06:12,821
If you play around with the parameters, you can get things that oscillate in a crazy way.
76
00:06:13,122 --> 00:06:14,112
And this is
77
00:06:14,274 --> 00:06:16,575
Very simple, but very powerful.
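A tiny numerical version of that experiment (the parameter values are arbitrary):

```python
import numpy as np

x = np.linspace(0, 10, 1000)
simple   = np.sin(3.0 * x)                            # a plain sine wave
composed = np.sin(5.0 * np.sin(3.0 * x) + 0.5 * x)    # a sine fed into a sine
# Plot `composed` against x: even this nested pair of simple functions already
# oscillates in an irregular way that no single tuned sine can reproduce.
```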
78
00:06:16,575 --> 00:06:29,823
And so the idea of deep Gaussian processes is exactly this: to try to enrich the
class of functions you can obtain by composing functions, composing Gaussian processes.
79
00:06:29,823 --> 00:06:34,485
And of course, now the marginals, you know, in a Gaussian process, all the marginals are
nice and Gaussian.
80
00:06:34,485 --> 00:06:37,487
If you compose, these marginals become non-Gaussian.
81
00:06:37,487 --> 00:06:43,430
And this is really, you know, getting to the point where you start thinking, well, why
should we then...
82
00:06:43,468 --> 00:06:49,522
restrict ourselves to composing processes that are Gaussian, maybe we can do something
else.
83
00:06:49,562 --> 00:06:58,589
And then maybe thinking about other ways in which you can be flexible in the way you
parametrize these complicated conditional distributions.
84
00:06:58,589 --> 00:06:59,029
Okay.
85
00:06:59,029 --> 00:06:59,629
Yeah.
86
00:06:59,629 --> 00:07:00,900
Damn, this is super fun.
87
00:07:00,900 --> 00:07:05,974
So it sounds to me like Fourier decomposition on steroids, basically.
88
00:07:05,974 --> 00:07:12,748
It's like decomposing everything through these basis functions and plugging everything...
89
00:07:12,842 --> 00:07:14,783
into each other.
90
00:07:14,783 --> 00:07:20,574
Like, you know, like these matryoshkas of Gaussian processes, basically.
91
00:07:20,574 --> 00:07:21,535
So, yeah.
92
00:07:21,535 --> 00:07:23,985
And I can definitely see the power of that.
93
00:07:23,985 --> 00:07:28,836
Like, yeah, it's like having very deep neural networks, basically.
94
00:07:28,836 --> 00:07:33,008
So I see, I definitely see the connection and why that would be super helpful.
95
00:07:33,008 --> 00:07:40,030
Um, and that helps, I'm guessing, uncover
96
00:07:40,546 --> 00:07:47,490
very complex non-linear patterns that are very hard to express in a functional form.
97
00:07:47,570 --> 00:07:51,013
That functional form would be, well, you have to choose the kernels.
98
00:07:51,013 --> 00:07:58,197
And sometimes, as you were saying, the out of the box kernels can't express the complexity
you have in the data.
99
00:07:58,197 --> 00:08:05,521
So then having the machine basically discover the kernels by itself is much easier.
100
00:08:06,326 --> 00:08:09,268
And it's really also about the marginals.
101
00:08:09,268 --> 00:08:14,911
If you believe that your marginals can be Gaussian and you're happy with that, then it's
all fine.
102
00:08:14,911 --> 00:08:16,332
You can do kernel design.
103
00:08:16,332 --> 00:08:22,956
You can spend a bit of time trying to find a good kernel that gives you good fit to the
data, good modeling, good uncertainties.
104
00:08:22,956 --> 00:08:28,599
But then there's still going to be this constraint in a way that you're working with the
Gaussian process.
105
00:08:28,599 --> 00:08:30,600
In the end, marginally, everything is Gaussian.
106
00:08:30,600 --> 00:08:34,432
You may not want that in certain applications where it may be the...
107
00:08:34,446 --> 00:08:43,906
distributions are very skewed, and other things, you know. And then maybe the skewness
is also position dependent, input dependent, you know. So this non-stationarity, again, you
108
00:08:43,906 --> 00:08:48,966
can encode it in certain kernels, you know, but it's just so much easier to compose.
109
00:08:48,966 --> 00:08:58,226
I mean, from the principle of just mathematical composition. Then of course,
how to handle this computationally, that's another story.
110
00:08:58,226 --> 00:08:58,686
Yeah, yeah, yeah.
111
00:08:58,686 --> 00:08:59,166
No, exactly.
112
00:08:59,166 --> 00:09:04,386
I mean, you're trading
113
00:09:04,386 --> 00:09:11,052
basically something that's more comfortable for the user for something that's much harder
to compute for the computer.
114
00:09:11,052 --> 00:09:22,752
But yeah, like in the end, that can also be something that is more transferable, because
unless you're a deep expert in Gaussian processes, coming up with your own
115
00:09:22,752 --> 00:09:26,785
kernels each time you need to work on a project is very time consuming.
116
00:09:26,785 --> 00:09:32,970
So it can actually be worth your time to turn to the deep Gaussian process framework,
117
00:09:32,974 --> 00:09:42,219
throw computing power at it and, you know, go your merry way working on something in the
meantime while the computer samples.
118
00:09:42,219 --> 00:09:43,880
That definitely makes sense.
119
00:09:43,880 --> 00:09:49,824
but again, the deep aspect carries other design choices.
120
00:09:49,824 --> 00:09:54,886
Now you have to choose how many layers, what's the dimensionality of each layer.
121
00:09:54,886 --> 00:10:01,580
And then there is this other problem of what kind of inference you choose,
122
00:10:01,922 --> 00:10:03,802
which definitely has an effect.
123
00:10:03,943 --> 00:10:08,982
So we've done some studies on this, you know, trying to compare a little bit, various
approaches.
124
00:10:08,982 --> 00:10:16,267
I mean, we did this a few years ago now; I think we started working on
this right after TensorFlow came out.
125
00:10:16,267 --> 00:10:18,207
So this was 2016.
126
00:10:18,207 --> 00:10:23,219
So we did our deep GP with a certain kind of approximation that is not
very popular.
127
00:10:23,219 --> 00:10:27,150
I mean, the community seems to have agreed that,
128
00:10:27,550 --> 00:10:31,723
you know, inducing point methods are very powerful for doing approximations.
129
00:10:31,723 --> 00:10:43,442
And, you know, I've also done some work on that with some great people, particularly James
Hensman, who developed GPflow with some other great guys.
130
00:10:43,783 --> 00:10:50,849
But random features is what you said before, when you mentioned the Fourier transform on
steroids.
131
00:10:50,849 --> 00:10:52,590
I mean, the idea is really to...
132
00:10:52,694 --> 00:10:57,785
You know, for certain classes of kernels, you can do some sort of expansions and sort of
linearize the Gaussian process.
133
00:10:57,785 --> 00:11:04,737
So before, I was talking about going from a linear model to something which has an infinite
number of basis functions.
134
00:11:04,737 --> 00:11:08,388
And now the idea is to just truncate this number of basis functions.
135
00:11:08,388 --> 00:11:10,499
You know, you can do it in various ways.
136
00:11:10,499 --> 00:11:15,140
You know, there is a randomized version that we do when we do these random features.
137
00:11:15,140 --> 00:11:18,461
And then you sort of truncate.
138
00:11:18,461 --> 00:11:21,222
And so now, instead of working with this...
139
00:11:21,346 --> 00:11:26,688
You turn a Gaussian process into a linear model with a large number of basis functions.
140
00:11:26,688 --> 00:11:28,688
And then linear models are nice to work with.
141
00:11:28,688 --> 00:11:32,709
And then if you compose them, then that's when you get the deep Gaussian process.
142
00:11:32,709 --> 00:11:38,491
Essentially, you get a deep neural network with some stochasticity in the layers.
143
00:11:38,491 --> 00:11:40,871
And that's all there is to it.
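A hedged sketch of that construction, using Rahimi-and-Recht-style random Fourier features for an RBF kernel as one standard choice; this is not necessarily the exact approximation used in their work, and all dimensions and hyperparameters are illustrative:

```python
import numpy as np

def random_fourier_features(X, D, lengthscale, rng):
    """Phi(X) such that Phi(X) @ Phi(X').T approximates the RBF Gram matrix."""
    d = X.shape[1]
    W = rng.standard_normal((d, D)) / lengthscale   # random spectral frequencies
    b = rng.uniform(0, 2 * np.pi, D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100)[:, None]

# Layer 1: truncated basis + Gaussian weights ~ a draw from an approximate GP.
Phi1 = random_fourier_features(X, D=500, lengthscale=1.0, rng=rng)
H = Phi1 @ rng.standard_normal((500, 10))           # 10 hidden functions

# Layer 2: same construction applied to the hidden functions.
Phi2 = random_fourier_features(H, D=500, lengthscale=1.0, rng=rng)
f = Phi2 @ rng.standard_normal((500, 1))            # one approximate deep-GP sample
# The stack of (features, random weights) layers is exactly a deep network with
# stochasticity in the layers.
```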
144
00:11:40,871 --> 00:11:48,793
And so when we did this, we implemented it in TensorFlow, because it was the new thing and
it was very scalable.
145
00:11:48,793 --> 00:11:51,074
You know, we took some competitors.
146
00:11:51,386 --> 00:11:58,048
And we were really, you know, really fast at converging to good solutions and getting
good results, you know.
147
00:11:58,048 --> 00:12:02,199
So we have an implementation out there in TensorFlow, unfortunately.
148
00:12:02,199 --> 00:12:08,491
I mean, we should now maybe port it to PyTorch, which has become what we work with more.
149
00:12:10,071 --> 00:12:10,851
No, for sure.
150
00:12:10,851 --> 00:12:20,280
I mean, yeah, let's definitely link to that TensorFlow
implementation that you have, because yeah, I'm very big on
151
00:12:20,280 --> 00:12:33,439
pointing people towards how they can apply that in practice and basically making the
bridge between frontier research as you're doing and then helping people implement that in
152
00:12:33,439 --> 00:12:36,021
their own modeling workflows and problems.
153
00:12:36,021 --> 00:12:37,582
So let's definitely do that.
154
00:12:37,582 --> 00:12:42,818
um And yeah, I was actually going to ask you, okay, so that's...
155
00:12:42,818 --> 00:12:47,101
That's a great explanation and thank you so much for laying that out so, so clearly.
156
00:12:47,101 --> 00:13:01,934
I think it's awesome to start from the linear representation, as you were saying, and
basically, yeah, going to the very big deep GPs, which are in a way easier for me to
157
00:13:01,934 --> 00:13:07,698
represent to myself because, you know, it's like in the infinity, in the limit.
158
00:13:07,698 --> 00:13:12,522
It's easier, I find, to work with than deep neural networks, for instance.
159
00:13:12,522 --> 00:13:21,288
But yes, like, can you give us a lay of the land of what the field is about right now?
160
00:13:21,288 --> 00:13:25,781
Let's start with the practicality of it.
161
00:13:25,781 --> 00:13:30,855
What would you recommend for people?
162
00:13:30,855 --> 00:13:35,438
In which cases would these deep GPs be useful?
163
00:13:35,438 --> 00:13:41,410
That's the first question. And second question, why wouldn't they just use
164
00:13:41,410 --> 00:13:44,571
deep neural networks instead of deep GPs?
165
00:13:44,571 --> 00:13:45,491
Let's start with that.
166
00:13:45,491 --> 00:13:47,752
I have a lot of other questions, but let's start with that.
167
00:13:47,752 --> 00:13:49,432
I think it's the most general.
168
00:13:49,632 --> 00:13:50,233
Yeah.
169
00:13:50,233 --> 00:13:50,463
Yeah.
170
00:13:50,463 --> 00:13:52,453
I think, I mean, it's a great question.
171
00:13:52,453 --> 00:13:55,004
It's the mother of all questions, really.
172
00:13:55,004 --> 00:13:57,325
I mean, what kind of model should you choose for your data?
173
00:13:57,325 --> 00:14:07,277
And I think there is going to be a lot of great work happening soon
where we're going to maybe be able to give more definite answers to this.
174
00:14:07,277 --> 00:14:08,938
You know, I think.
175
00:14:09,356 --> 00:14:17,329
We're starting to realize that this overparameterization that we see in deep learning is
not so bad after all.
176
00:14:17,329 --> 00:14:26,405
You know, so for someone working in Bayesian statistics, I think we have this image in mind
where, you know, we should find the right complexity for the data that we have.
177
00:14:26,405 --> 00:14:33,760
So there's going to be a sweet spot of a model that is sort of parsimonious in looking at
the data and, you know, not too parameterized.
178
00:14:33,760 --> 00:14:38,642
But actually deep learning is telling us now a different story, which is not different
from
179
00:14:38,713 --> 00:14:42,665
the story that we know for non-parametric modeling, for Gaussian processes.
180
00:14:42,665 --> 00:14:45,768
In Gaussian processes, we push the number of parameters to infinity.
181
00:14:46,109 --> 00:14:52,093
And in deep learning now we're sort of doing the same, but in a slightly different
mathematical form.
182
00:14:54,315 --> 00:15:07,205
So what we're getting at is a point where actually this enormous complexity is in a way
facilitating certain behaviors for these models, to be able to represent our data in a very
183
00:15:07,205 --> 00:15:08,216
simple way.
184
00:15:08,216 --> 00:15:13,208
So the emergence of simplicity seems to be connected to this explosion in parameters.
185
00:15:13,208 --> 00:15:24,373
And I think Andrew Wilson has done some amazing work on this; it's recently published,
and I can link you to that paper, which says deep learning is not so mysterious.
186
00:15:24,373 --> 00:15:28,014
And it's something I was reading recently.
187
00:15:28,014 --> 00:15:29,475
It's a beautiful read.
188
00:15:30,715 --> 00:15:33,616
I think, you know, to go back to your question, so today, what should we do?
189
00:15:33,616 --> 00:15:35,897
Should we stick to a GP?
190
00:15:35,897 --> 00:15:37,718
Should we go for a deep neural network?
191
00:15:37,718 --> 00:15:43,000
I think for certain problems, we may have some understanding of the kind of functions we
want.
192
00:15:43,080 --> 00:15:52,924
And so for those, if it's possible and easy to encode them with GPs, I think it's
definitely a good idea to go for that.
193
00:15:52,985 --> 00:16:04,209
But there might be other problems where we have no idea, or maybe there are too many
complications in the way we can think about the uncertainties and other things.
194
00:16:04,209 --> 00:16:06,220
And so maybe just throwing a...
195
00:16:06,432 --> 00:16:12,323
a data-driven approach, I mean. If we have a lot of data, maybe we can say, okay, maybe we can go
for an approach that is data hungry.
196
00:16:12,324 --> 00:16:17,925
And then, you know, we can leverage that and deep learning seems to be like maybe a right
choice there.
197
00:16:17,925 --> 00:16:27,278
But of course now, there is also a lot of stuff happening in other spaces, let's say in
terms of foundation models.
198
00:16:27,278 --> 00:16:34,329
So now there is this class, this breed of new things, new models that have been trained on
a lot of data.
199
00:16:36,419 --> 00:16:40,380
And with some fine-tuning on your small data, you can actually adapt them.
200
00:16:40,380 --> 00:16:49,990
You know, this transfer learning actually works, and we've done it. So there's this
paper, again by Andrew Wilson, on predicting time series with language models.
201
00:16:49,990 --> 00:17:01,109
So you take ChatGPT and you make it predict: you discretize your time series, you
tokenize it, you give it to GPT, you look at the predictions, and you invert the
202
00:17:01,109 --> 00:17:02,040
transformation.
203
00:17:02,040 --> 00:17:04,982
And you get back scalar values.
204
00:17:04,982 --> 00:17:07,724
And actually this seems to be working quite well.
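As a rough sketch of the recipe being described, here is one way the discretize-tokenize-invert steps could look; the actual encoding in the paper may differ, and no call to a language model is shown:

```python
import numpy as np

def encode(series, digits=3):
    """Rescale and discretize a series into a digit string for an LLM prompt."""
    scale = np.max(np.abs(series))
    ints = np.round(series / scale * 10**digits).astype(int)
    return ", ".join(str(v) for v in ints), scale    # e.g. "512, 498, -530, ..."

def decode(text, scale, digits=3):
    """Invert the transformation on the model's predicted continuation."""
    ints = np.array([int(tok) for tok in text.split(",")])
    return ints / 10**digits * scale

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 6, 40)) + 0.05 * rng.standard_normal(40)
prompt, scale = encode(y)
# `prompt` would be sent to the language model as text; parsing its continuation
# with `decode` gives forecasts back on the original scale.
print(np.allclose(decode(prompt, scale), y, atol=1e-3))
```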
205
00:17:07,724 --> 00:17:13,708
So we've now tried this with multivariate versions, probabilistic multivariate versions, and
so on.
206
00:17:13,708 --> 00:17:15,769
So we've done some work on that also.
207
00:17:15,790 --> 00:17:25,827
But just to say that, I mean, now this is something also kind of new that is happening,
you know, because before maybe it was really hard to train these models at such a large
208
00:17:25,827 --> 00:17:26,257
scale.
209
00:17:26,257 --> 00:17:33,772
But now if you train a model on the entire web with all the language, language is
Markovian in a way.
210
00:17:33,772 --> 00:17:37,354
So, you know, these Markovian structures are sort of learned by these models.
211
00:17:37,354 --> 00:17:44,769
And now if you feed these models with stuff that is Markovian, they will try to make a
prediction that is actually going to be reasonable.
212
00:17:44,769 --> 00:17:48,571
And this is what we've seen in the literature.
213
00:17:48,571 --> 00:18:00,138
And all these things, I think, are going to change a lot of the way we think about
designing a model for the data we have and how we do inference and all these things.
214
00:18:00,138 --> 00:18:00,992
So.
215
00:18:00,992 --> 00:18:10,219
So as of today, I think maybe it is still relevant to think about: okay, if I have a particular
type of data, I know that, you know, it makes sense to use a Gaussian process because I
216
00:18:10,219 --> 00:18:11,790
want certain properties in the functions.
217
00:18:11,790 --> 00:18:18,234
want certain... you know, the Matérn kernel, for example, gives us some sort of smoothness up to a
certain degree.
218
00:18:18,234 --> 00:18:23,598
And it's easy to encode length scales of these functions for the prior of the functions.
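A small sketch of that kind of prior control, using a Matérn 3/2 kernel as an example (the specific form and parameter values are illustrative): the kernel family fixes the degree of smoothness, and the lengthscale directly encodes how fast the prior functions vary.

```python
import numpy as np

def matern32(X1, X2, lengthscale=1.0, variance=1.0):
    """Matern 3/2 kernel: once-differentiable sample paths."""
    r = np.abs(X1[:, None, 0] - X2[None, :, 0])
    s = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + s) * np.exp(-s)

x = np.linspace(0, 5, 200)[:, None]
K_wiggly = matern32(x, x, lengthscale=0.2)   # prior over fast-varying functions
K_smooth = matern32(x, x, lengthscale=2.0)   # prior over slowly varying functions
```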
219
00:18:23,598 --> 00:18:26,938
And this is great, you know, for neural networks, this is very hard to do.
220
00:18:26,938 --> 00:18:29,658
So we've done some work trying to map the two, right?
221
00:18:29,658 --> 00:18:40,178
So we try to say, okay, can we make a neural network imitate what Gaussian processes
do, so that we gain sort of the interpretability and the nice properties of a Gaussian
222
00:18:40,178 --> 00:18:41,158
process.
223
00:18:41,158 --> 00:18:52,958
But then we also inherit the flexibility and the power of these deep learning models, so that
they can really perform well and also give us sound uncertainty quantification.
224
00:18:52,958 --> 00:18:53,262
Yeah.
225
00:18:53,262 --> 00:18:54,642
Okay, yeah, yeah.
226
00:18:56,463 --> 00:18:59,458
Be sure you had to be a good peasy.
