
diskcache: Your secret Python perf weapon

Episode Transcript

Michael Kennedy

Your cloud SSD is sitting there, bored, and it would like a job.

Today, we're putting it to work with DiskCache, a simple, practical cache built on SQLite that can speed things up without spinning up Redis or other extra servers.

Once you start to see what it can do, a universe of possibilities opens up.

We're joined by Vincent Warmerdam to dive into DiskCache.

This is Talk Python To Me, episode 534, recorded December 19th, 2025.

Talk Python To Me, yeah, we ready to roll.

Upgrading the code, no fear of getting old / Async in the air, new frameworks in sight / Geeky rap on deck, Quart crew, it's time to unite / We started in Pyramid, cruising old school lanes / Had that stable base, yeah sir.

Welcome to Talk Python To Me, the number one Python podcast for developers and data scientists.

This is your host, Michael Kennedy.

I'm a PSF fellow who's been coding for over 25 years.

Let's connect on social media.

You'll find me and Talk Python on Mastodon, Bluesky, and X.

The social links are all in your show notes.

You can find over 10 years of past episodes at talkpython.fm.

And if you want to be part of the show, you can join our recording live streams.

That's right.

We live stream the raw uncut version of each episode on YouTube.

Just visit talkpython.fm/youtube to see the schedule of upcoming events.

Be sure to subscribe there and press the bell so you'll get notified anytime we're recording.

Vincent, hello.

Michael, Michael, we're back.

Awesome.

Awesome to be back with you.

Yeah, this is almost the sequel to the last time you were on the show.

So it's going to be fun.

Vincent Warmerdam

Yeah, so sequel in this case, not the query language, like an actual sequel of events.

Yes.

Yeah, you can correct me if I'm wrong, but I think what happened is you had me on a podcast a while ago to talk about a course that I made, and a big chunk of the course that we were very enthusiastic about was about this tool called DiskCache.

And then we kind of came to the conclusion, well, we had to cap it off.

Maybe it's fun to do an episode on just DiskCache.

since we're both pretty huge fans of it.

I think that's how we got here.

Michael Kennedy

I think that is how we got here as well.

And we're going to dive into this.

Honestly, it's a pretty simple library called DiskCache, but what it unlocks is really, really sweet.

And I'm going to talk about a lot of different angles.

And now, even though it's just been not that long since you were on the show, maybe just give us a quick intro of who you are.

Vincent Warmerdam

Hi, my name is Vincent.

I've done a bunch of data machine learning stuff, mainly in the past.

That's sort of what a lot of people know me from.

These days, though, I work for a company called Marimo.

You might have heard from us.

We make very modern Python notebooks.

We took some lessons from Jupyter, and we put a new spin on it.

So that's my day to day.

But I still like to write notebooks and do kind of fun little benchmarks and also stuff with LLMs.

And I've just noticed that for a lot of that work, boy, disk cache is amazing.

And I also use it for web stuff.

And I think that's also what your use case is a little bit more of.

But yeah, in notebook land, you also like to have a very good caching mechanism. And on the Marimo side of things, we are also working on different caching mechanisms, which I might talk about in a bit.

But just for me, the bread and butter, the thing I've used for years at this point is disk cache whenever it comes to that territory.

Michael Kennedy

Yeah, it's funny.

This was recommended to me for Python Bytes as a news item over there quite a while ago, like years ago.

And I'm like, oh, that's pretty interesting.

And then I saw you using it in the LLM Building Blocks course, and it just unlocked for me.

Like, oh, my.

Oh, this is something else.

And so since then, I've been doing a bunch with it, and I'm a big fan.

I've been on this, like trying to avoid complexity, but still getting really cool responses, performance, et cetera, out of your apps.

And I think this is a really nice way to add multi-process, super fast caching to your app without involving more servers and more stuff that's got to get connected and keep running and so on.

But before we get into the details of that, maybe let's just talk about caching in general.

Like what types of caching is there?

You know, I sort of give a little precursor there.

But yeah, dive into it.

Vincent Warmerdam

So like in the course, the main example I remember talking about was the one-- you've got this LLM, and you want to do some benchmarks.

And it might be the case that, I don't know, using an LLM for, let's say, classification, like some text goes in, we got to know whether or not it's about a certain topic, yes, no, or something like that.

Then it would be really great if, suppose, the same text came by for whatever reason, that we don't run the query on the LLM again. It's like wasted compute, wasted money.

So it'd be kind of nice if the same text goes in that we then say, oh, we know what the answer to that thing is already.

We cached it, so here you can go back.

And that's the case when you're dealing with heavy compute ML systems.

But there's a similar situation that you might have, I guess, with expensive SQL queries, or you want to reduce the load on a database somewhere.

Then having some sort of a caching layer that's able to say, oh, you're querying for something, but I already know what it is.

Boom, we can send it back.

I think the classical thing you would do in Python is you have this decorator in functools, I think, right?

The lru_cache.
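For readers following along, here's a quick sketch of that standard-library decorator; the function is a made-up stand-in for an expensive call:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def classify(text: str) -> str:
    # Stand-in for an expensive LLM or ML call.
    return "sports" if "football" in text else "other"

classify("football scores")  # computed the first time
classify("football scores")  # served from memory, but lost on process restart
```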

Michael Kennedy

Yeah, exactly.

Yeah.

Vincent Warmerdam

That's sort of the hello world of caching.

But the downside of that thing is that it's all in memory.

So if you were to reboot your Python process, you lose all that caching.

So that's why people historically, I think, resorted to-- I think Redis, I think, is the most well-known caching tool.

It's the one I've always used.

There's Memcache, I think.

There's other tools.

You could use Postgres for some of this stuff as well.

But recently, especially because disks are just getting quicker and quicker, people have been looking at SQLite for this sort of a thing as well.

So that's, I think, the quickest summary and also sort of the entryway to how I got started with disk cache.

Michael Kennedy

Yeah, and so for this example that you highlight in the LLM Building Blocks course, it's not a conversation.

It's like a one-shot situation, right?

You come up-- you say, I have some code or some documents, and I have almost like an API.

I'm going to send that off to the LLM and ask it, tell me X, Y, and Z about it.

And sure, it's got some kind of temperature and it won't always give exactly the same answer, but you're willing to, you know, you're willing to accept an answer.

And at that point, like why ask it again and again and again, which it might take seconds, it might cost money.

Whereas if you just remember through caching somehow, you remember it, it's like, boom, instant.

Vincent Warmerdam

Yeah, and it tends to come up a lot when you're doing benchmarks, for example.

So you have this for loop, you want to go over your entire data set, try all these different approaches.

And if you've got a new approach, then you want that to run, of course.

But if you accidentally trigger an old approach, then you don't want to incur the cost of like going through all those different LLMs.

I should say, like, even if you just forget about LLMs, let's just say machine learning in general.

Let's say there's some sort of image classification thing you're using in the cloud.

There also, you would say, like, file name goes in.

that's an image, and if the same file name goes in, we don't want the expensive compute cost to happen either. So it's definitely more general than LLMs, but LLMs do feel like the zeitgeisty thing to worry about.

Michael Kennedy

Yeah, I think for two reasons. One, because they're just the topic du jour, and two, because they're, I think, a part of computing that most people experience that is way slower than they're used to.

Vincent Warmerdam

Yeah, well, and especially if you're, you know, suppose that you have an attic somewhere and you're a dad and you want to do home lab stuff and you're playing with all these open source LLM models, then you also learn that, yeah, they're fun to play with, but they also take a lot of time to compute things.

So then immediately you get the motivation to do it the right way.

Michael Kennedy


Yeah, I built a couple of little utilities that talk to a local LLM.

I think it's the OpenAI OpenWeights one, that 20 billion parameter one I have running on my Mac Mini, and it's pretty good, a little bit slow, but, you know, it's fine for what it's being used for.

And I put your disk cache technique on it.

And if I ask it the same question again, it's like, boom.

You don't need to wait 10 seconds.

Here's the answer.

Yeah.

Vincent Warmerdam

So that-- and I guess like-- but I guess from your perspective, I think your main entry point to this domain was a little bit more from the web dev perspective, right?

Like that's-- and I suppose you're using it a lot for preventing expensive queries from going to Postgres, or I don't exactly know your backend.

Michael Kennedy

You know how-- you won't believe how optimized my website is.

There's not a single query that goes to Postgres, because they go to MongoDB.

I'm just kidding.

Vincent Warmerdam

There you go.

Michael Kennedy

No, but your point is totally valid.

Going to the database, right?

Now, I don't actually cache that many requests.

I don't avoid that many requests going to the database.

They're really quite quick, and so I'm OK with that.

But when you think about a feature-rich database, feature-rich web app, there's just tons of these little edge cases you're like, oh, got to do that thing.

And it's not a big deal, but we've got to do it 500 times in a request.

Then it is kind of a thing.

So let me give you an example.

I'll give you some examples.

So for example, the good portions of the show notes on talkpython.fm are in Markdown.

I don't want to show people Markdown.

I want to show them HTML, right?

So when a request comes in, it'll see any fragment of Markdown that needs to be turned into HTML, and instead of just going, oh, let me process that.

It just goes, all right, what is the hash of this or some other indicator of the content?

And then I've already computed that and stored it in disk cache.

So here's the HTML result.
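A rough sketch of the pattern being described, keyed on a content hash; the markdown package and the cache directory are assumptions for illustration:

```python
import hashlib

import diskcache
import markdown  # assumption: any Markdown-to-HTML renderer works here

cache = diskcache.Cache("./markdown_cache")

def render_html(md_text: str) -> str:
    # Key the cache on the content itself, so edits naturally miss.
    key = hashlib.sha256(md_text.encode("utf-8")).hexdigest()
    html = cache.get(key)
    if html is None:
        html = markdown.markdown(md_text)  # only runs on a cache miss
        cache.set(key, html)
    return html
```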

Another example is there's a little YouTube icon on each page.

And that's actually in the show notes, but then the website parses the YouTube ID out and then embeds it. Like, there's a bunch of stuff going on there to keep YouTube from spying on my visitors.

But stuff happens, YouTube ID is used.

That could be parsed every time.

Or I can just say this episode has this YouTube ID.

That information goes into a cache, right?

And because it's a disk cache sort of scenario, like a file-based one, not an LRU cache.

It doesn't change the memory footprint and it's shared across processes.

So in like the web world, it's really common to have a web garden where you've got like two or four processes all being round-robined to from some web server manager thing, right?

If you don't somehow move that out of process, to either Redis or SQLite or a database or something, then all of those things are recreating that, right?

They can't reuse that, right?

So there's a lot of interesting components there.

Vincent Warmerdam

And I suppose your web deployment, you have like a big VM, I suppose, and then there's like multiple Docker containers running, but they do all have access to the same volume, and that's how you access SQLite.

Michael Kennedy

Bingo, yeah, exactly, exactly.

And here's how I'm doing it.

Yeah, so what I have done is, in the Docker Compose file, I have an external volume. This is also important for Docker.

So I have an external folder on a big hard drive in the big VM that says, here's where all the caches go.

And then depending on which app, it'll pick a subdirectory it can go look at or whatever that it's using.

And so that way, even if I do a complete rebuild of the Docker image, it still retains its cache from version to version and all that kind of business.

You could do that with a persistent volume as well.

But I've just decided-- you can go and inspect it a little easier and see how big the cache is and stuff like that.

Vincent Warmerdam

OK, so we're going to get into the weeds of how disk cache works exactly.

But I'm triggered here because it sounds like you've done something clever there.

Because what you can do in disk cache is you can say, look, here's a file that's SQLite.

And then it behaves like a dictionary, but it's persisted on disk.

But what I just heard you say is that you've got multiple caches.

So am I right to hear that, oh, for some things that need to be cached, let's say the YouTube things, that's a separate file.

And then all the markdown stuff, that's also a separate file, and therefore if connections need to be made to either, it's also kind of nicely split.

Is that also the design there?

Michael Kennedy

Yeah, that is.

And actually, we're going to dive into all the details of how it works, but I'll just give people a little glimpse first.

I'll go ahead and show, I've got this whole admin backend here.

And I've got different caches for different purposes.

Because they're just SQLite files, you can either say, give me the same one, or you can say, this one is named something else, and it has a different file name or different folder or whatever.

Right, so I've got one that stores things like that YouTube ID I talked about, and any fragment of Markdown anywhere in the web app that needs to go to HTML.

Vincent Warmerdam

Yeah, and it's like 8,000 items in that thing.

Michael Kennedy

Yeah.

In this one, there's 8,970 items, which is nine megs, right?

I mean, it's not huge, but it's not too bad.

And you can actually even see where it thinks it lives, but that's not really where it lives because there's, you know, the volume redirects and stuff.

But I've also got stuff for directly about the episodes that it needs to pull back.

And then I do a lot of HTTP caching.

And one of the things that I think is really wrong with web development is people say, well, that's like a stale image or that's a stale CSS file or JavaScript, you know, all that kind of stuff.

So if you just do super minor tricks and put some kind of hash ID on the end of your content URLs, and you teach your CDN or whatever that it's a different file if it varies by query string, then you never, ever have to worry about stale content.

Right, but computing that can be expensive, especially for remote stuff, like if it's on a different server, like an S3 thing, but you still want to have it do that. So I have a special cache for that, and that's pretty complicated to build up, because it's got to do almost 700 web requests to figure out what those are. But once they're done, it's blazing fast. You don't have to do it again unless it changes, and it doesn't change much. So that's the way that I'm sort of using and appreciating DiskCache.
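For listeners, a hedged sketch of that cache-busting idea; the helper name and paths are hypothetical, and a real version would hash remote S3 objects rather than local files:

```python
import hashlib

import diskcache

cache = diskcache.Cache("./static_cache")

@cache.memoize()
def busted_url(path: str) -> str:
    # Hash the file contents once; the CDN treats each new digest
    # as a brand-new file, so stale content is never served.
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()[:12]
    return f"/static/{path}?v={digest}"
```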

Vincent Warmerdam

Yeah, it works well in your setup because you've gone for the VM route.

I mean, if you go for something like Fly.io, or maybe even DigitalOcean has, I think, a really nice app service, but that all revolves around Docker containers that spin up horizontally.

And I don't think those containers can be configured in such a way they share the volume.

So in that sense, you could still use disk cache, but then each individual instance of the Docker container would have its own cache, which still could work out.

It's not going to function as well.

It's going to be better with your setup, though.

Michael Kennedy

Yeah, absolutely.

I agree, though.

You could still do it.

Or you could go, I'll take the zen of what Vincent and Michael are saying today, and I'll apply that to Postgres, or I'll apply that to whatever database.

You could pull this off in a database.

Vincent Warmerdam

You would just have to do more work.

Yeah.

I mean, I've had a couple of, I think it was like a Django conference talk I saw a while ago.

They were also raving about disk cache.

But the merits of disk cache do depend a little bit on your deployment, though.

That is, I think, one observation.

Like in your setup, I can definitely imagine it.

Interesting.

Yeah.

Michael Kennedy

Yeah.

Well, I don't even think we properly introduced this thing yet, so.

Vincent Warmerdam

But let's maybe go there.

Yeah.

Let's start there.

Michael Kennedy

Let's start there.

Vincent Warmerdam

It's time.

OK.

It's time.

Yeah.

I guess the simplest way I usually describe it: it really behaves like a dictionary, except it's persisted to disk, and under the hood it's using SQLite.

I think that's the-- it doesn't cover everything, but you get quite close if you think of it that way.

Michael Kennedy

I think there might be-- you know, I keep harping on this on the show, but there are so many people that are new to Python and programming these days.

Many, many of them, almost half of them.

I think it's worth pointing out, just like, what is SQLite?

Like, why is it different than any other database?

Like, why have I been using the word database or SQLite when SQLite is a database, right?

That's weird.

Vincent Warmerdam

So, I never really took a good database course, of course.

I might be ruining the formalism of it.

But the main, like, for me at least, the way I like to think about it is Postgres, that's a thing I can run on a VM, and then other Docker containers can connect to it because it's running out of process.

There's some other process that has the database somewhere, and I can connect to it.

And I think the main thing that makes SQLite different is that, no, you got to run it on the same machine, on the same process where your program is running.

And that's, I think, the main-- and there's all sorts of little details, like how the data structures are used internally, and SQLite doesn't have a lot of types.

There's lots of other differences.

I think that's the main one.

Unless, Michael, I forgot something.

Michael Kennedy

Yeah, no, I think it's-- and operationally, it's a separate thing to run.

It has to have both, it has to be secure because if your data gets exposed, like-

Vincent Warmerdam

For Postgres, that is, not for SQLite? Yes, it's running somewhere.

Yes, it's running somewhere.

People can SSH in if you're not careful.

You've got to be mindful of passwords and all that stuff.

That's totally true.

Michael Kennedy

Right.

And it can go down.

Like it could just become unavailable because you've screwed up something or whatever, right?

It's a thing you have to manage in the complexity of running your app when it's like, well, it used to just be one thing I could run in a Docker container.

Well, now I've got different servers, they've got to coordinate, and there's firewalls. It just takes it so much higher in terms of complexity, whereas SQLite is a file.

Yes.

Vincent Warmerdam

I mean, I do want to maybe defend Postgres a little bit there.

Because one thing that's really nice and convenient in terms of CI/CD and deployments and all that: suppose you want to scale horizontally, and there's Docker containers running on the left and there's this one Postgres thing running on the right.

I mean, you can just turn on and off all those Docker containers as you see fit.

they're just going to connect to the Postgres instance.

And I've done this trick for Calm Code a bunch of times where I just switch cloud providers, because Postgres is running there, and I can just move the Docker containers to another cloud provider, and it all works fine.

No migration necessary.

With SQLite, that aspect is a little bit more tricky.

You have to be a bit more mindful.

Although, I should mention, might be worth a Google.

There's actually this one new cloud provider that's very much Python-focused.

It's called Plash, P-L-A dot S-H, I think.

Oh, this is new to me.

Michael Kennedy

Yeah, so I think-- Wow, OK.

Look at this.

From .py to .com in seconds.

Vincent Warmerdam

Yeah, it's the Answer.AI folks, Jeremy Howard and friends.

I don't know to what extent this is super production ready.

And SQLite, you've got to be mindful of the production aspect for some reasons as well.

But one thing that is kind of cool about them is they give you a persistent SQLite as a database and a Python process that can just kind of attach to it.

And they just-- in their mind, that's the simplest way that a cloud provider should be. They take a very opinionated approach.

So yeah, if you're interested in maybe running this as a web service, migrations are a little bit tricky in that realm, because you do have to download the entire data set to do a migration and upload it again, I think, if I recall correctly.

Michael Kennedy

And for some apps, that's no big deal.

Others, that's a mega deal.

Depends how big that data is.

Vincent Warmerdam

So I'm not suggesting this is going to be for everything and everyone, but I do think it's cool, which is why I figured I'd mention it.

Michael Kennedy

Oh, it's new to me.

I'm going to follow up with Litestream, litestream.io.

Have you seen this?

Vincent Warmerdam

Yeah, that is also really neat.

So basically, what if you want to back up your SQLite?

Like, how could you do that?

Oh, it might be nice to do that with S3.

And I think it's like the guy who made the thing works at Fly.io.

He's doing a bunch of low-level stuff.

One thing about that open source package that is also really interesting, by the way, is that I think he refuses PRs from the outside.

He just wants to have no distractions whatsoever.

He has a very interesting way of developing software.

You can submit issues, of course.

I think if you scroll down, there used to be a notice that basically said, hey, this is a-- I'm not running this-- Yeah.

There you go.

We welcome-- yeah, contribution guide.

We welcome bug reports.

Yeah, this is a way where you can basically stream updates to S3.

And the main observation there is S3 is actually really cheap if all you do is push stuff into it.

If you never pull it out, usually getting it out is the expensive bit of S3.

So this is like pennies on the dollar for really decent backup.

And you can also send it to multiple-- you can send it to Amazon and also to DigitalOcean, if you like.

Michael Kennedy

Yeah.

Yeah, because these days, S3 is really a synonym for blob storage on almost any hosting platform.

Like, it used to be S3 might go to literally S3 at AWS.

But now it's like, or DigitalOcean object spaces, or to you name it.

They've all adopted the API, kind of like OpenAI's API.

Vincent Warmerdam

Yeah, I will say it's a little bit awkward that you have to-- like, sometimes you go to a cloud provider, and they say, you have to download a SDK from a competing cloud provider, and then you can connect to our cloud bucket.

Michael Kennedy

I know.

And it's usually Boto3.

And Boto3 is-- if you want to cry because you're using a library, Boto3 has a good chance of being the first one to make you do it.

It is so bad for me.

It's so not custom-- It's not built with craft and love.

It's like auto-generated where you pass these-- like, you pass this kind of dictionary, and then the other argument takes a separate dictionary that relates back-- it's just like, could you give me a real API here?

Vincent Warmerdam

I mean, the one thing I can appreciate about Boto3 that I do think is honest to mention is they do try to just maintain it.

The backward compatibility of that thing also means it can't move in any direction as well.

And there is this meme where Google kills all of its products way too early, and Amazon's meme is that they kill them way too late, sometimes never.

Right?

So in that sense, I can appreciate that they just try to keep Boto3, not necessarily as user friendly, but they do keep it super stable.

Like, I get there's a balance there.

Michael Kennedy

Yeah.

I feel like we still haven't really introduced DiskCache.

We've kind of set the stage.

Vincent Warmerdam

Anyway, but yeah, SQLite, super cool.

How does it work under the hood?

Well, it's really just like a Python dictionary.

So you can say something like, hey, make a new cache.

And then you can do things like cache, square brackets, string name, equals, and then whatever Python object you like can go in.

And Python has this serialization method called a pickle.

Serialization just means, well, you can persist it to disk in some way, and then you can sort of get it back into memory again.

And that's what disk cache just uses under the hood.

So in theory, any Python object that you can think of can go into disk cache.

The only sort of thing to be mindful of is if you have, like, NumPy version 1 in Python 3.6, and you're going to inject a whole lot of that into disk cache.

Don't expect those objects to serialize nicely back if you're using Python 3.12 and NumPy version 2 or something.
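In code, the dictionary-style usage just described looks roughly like this (the directory name is arbitrary):

```python
import diskcache

cache = diskcache.Cache("./my_cache")  # directory holding the SQLite file

cache["episode"] = {"number": 534, "topic": "DiskCache"}  # any picklable object
print(cache["episode"])                 # read it back, even after a restart
print(cache.get("missing", default=0))  # dict-style .get() with a default
del cache["episode"]                    # delete a key, just like a dict
```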

Michael Kennedy

Right, because pickle is almost an in-memory representation of the thing.

And that may have evolved over time.

That's also a true statement about your own classes, potentially.

Vincent Warmerdam

Yeah, so if you're dealing with multiple Python versions and multiple versions of different packages, there's a little bit of a danger zone to be aware of there.

That said, for most of the stuff that I do, that's basically a non-issue.

But I do get this nice little object that can just store stuff into SQLite and can get it out.

And it's very general.

It's going to try to be clever about it.

Like if you give it an int, it's going to actually store it as an int and not use the pickle format.

So there's a couple of clever things that it can do.

And it's also really like a Python dictionary.

So you can do the square bracket thing.

You can also do the delete and then cache square bracket thing to delete a key from the cache.

Just like a Python dictionary, you have the get method.

So you can say dot get key.

And if it's missing, you can pass a default value.

So it's very much like a dictionary.

I think Bob's your uncle on that one.

Unless, Michael, I've forgotten something.

But I think that's the simplest way to do it.

Michael Kennedy

Yeah, pretty much.

Yeah, I think so.

The difference being it's not in memory.

It's stored to a file.

It happens-- it's not always a SQLite file.

But often, it is a SQLite file as its core foundation that it's stored to.

So it gives you process restart ability, where it still remembers the stuff you cached.

It's not like LRU cache.

We got to redo it every single time.

And I think, I don't know where it is in the docs here, but the thread safety bit of it and the cross-process safety is really nice.

You've got this whole table here with things like, is it persistent?

Yes.

Is it thread safe?

Yes.

Is it process safe?

Yes.

Compared against other things people might choose.

And that, honestly, I think that is the other half of the magic.

Vincent Warmerdam

Yeah, so especially for your web stuff, I would say that that's the thing you really want.

And some of that, of course, is just SQLite itself.

Historically, one reason why people always used to say, like, use Postgres, not SQLite, has to do with precisely this concurrency stuff.

My impression is that SQLite is really good at reading, but writing can be slow if multiple processes do it.

Some of that, I think, is related to the disk as well.

I don't know to what extent that has changed.

But historically, at least, whenever I was doing Django, hanging out at Django events, People are always saying, like, just use Postgres because it's better for the web thing.

But it is safe, the SQLite.

It might become slower, but it is thread safe.

Michael Kennedy

Right. They've actually thought a lot in this thing about transactions, concurrency, and basically dealing with that.

But it is ultimately, for the most part, still SQLite underneath.

But the thing with a cache is if you're writing it more than you're reading it, you probably shouldn't have a cache.

Yeah.

I mean, like...

Vincent Warmerdam

That defeats the purpose.

Michael Kennedy

Exactly.

Like, you get no value if you're recreating it.

You're only probably just doing overhead and wasting memory or disk space.

So it's inherently a situation where it's going to be pretty read-heavy, and SQLite is good at read-heavy scenarios.

Vincent Warmerdam

And maybe it's also fair to say, like, the LRU cache that you get with basic Python, so also to maybe explain that one: the LRU cache is a little bit different because you decorate a function with it, and then given the same inputs, the same output goes out, so you can keep track of that in a dictionary that's in memory.

If you don't have a lot of stuff to keep in the back of your mind, then maybe you don't have to write to disk, right?

So there's also maybe a reason to just stick to caching mechanisms that use Python memory, because I also think, I would imagine it to be quicker too.

But maybe that's also...

Probably, yes.

That should be quicker.

It's just that if you're capped at memory, then you might want to spill to disk, and then disk cache becomes interesting too.

Michael Kennedy

Right, for example, you have literally zero serialization, deserialization.

What you put in LRU cache is the pointer to the object that you're caching, right?

If you've got a class or a list that's part of the LRU cache.

Vincent Warmerdam

The one thing that is good to mention is also a really nice feature of disk cache is just like LRU cache has a decorator, so you can decorate a function, disk cache also has that.

And it works kind of interestingly, too.

So when you decorate the function, you do have to be a little bit careful if you use that.

But then disk cache will-- I think it will hash the function name and the inputs that you pass.

I don't know if it also hashes the contents of the function.

Like if you change the function itself, I don't know if DiskCache will actually put that in a different slot, if that makes sense.

Michael Kennedy

Yeah, you can say @cache.memoize, which is the design pattern speak for just remember this.

Yeah.

It takes the arguments.

Vincent Warmerdam

And then it has like the Fibonacci sequence, which is the classic example, of course.

And like there are some extra things you can set there as well.

So you can say things like, hey, I think you're able to-- yeah, you're able to set the expiry.

So you can say things like, I want to cache this, but only for the next five minutes or so, which can make a lot of sense if you're doing a front page kind of a thing.

So like the Reddit front page or something like that that updates, but not every second.

It probably updates once every five minutes or something like that.

And then you do want to have something that's cached, but then after that, you want the cache to maybe basically just reset.

And that is something you can also control with a few parameters here.
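A minimal sketch of that decorator with an expiry, using the five-minute front page idea; the function body is a stand-in:

```python
import time

import diskcache

cache = diskcache.Cache("./cache")

@cache.memoize(expire=300)  # seconds; the cached result is dropped after 5 minutes
def front_page() -> str:
    time.sleep(2)  # stand-in for an expensive page render
    return "<html>...</html>"

front_page()  # slow the first time
front_page()  # instant until the five minutes are up
```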

Michael Kennedy

Right, that's interesting.

There's a couple good use cases that come to mind for me.

Like one, if I put this on the function that generated the RSS feed for Talk Python, I could just say every one minute, and then it might be a little bit expensive to compute because it's got to parse, you know, 535 episodes or whatever.

But then for one minute, all the subsequent requests, just here's the answer, here's the answer.

And then without me managing anything, it will just automatically the next minute refresh itself by the nature of how it works, right?

Vincent Warmerdam

How much traffic do you get on that endpoint?

Just roughly, if you don't mind me asking.

Michael Kennedy

One terabyte of RSS a month.

Okay, gotcha.

Vincent Warmerdam

Okay, but there you go.

Like then just doing that like once a minute instead of like many times a minute will be a huge cookie.

Michael Kennedy

I would say it's probably more than one request a second.

And the file size, the response size of the RSS feed is over a meg.

And so it's a non-trivial amount of traffic, you know.

Vincent Warmerdam

Yeah.

And then, like, how do you fix that? With a whole bunch of infrastructure? No, with a decorator. Like, that feels...

No, with a decorator.

Like that feels...

Exactly.

Michael Kennedy

Exactly.

You pretty much summed up all the reasons why I'm so excited about this, because it's like you could do all of this complex stuff or just like you could just literally in such a simple way, just not recompute it as often.

Yeah.

Here's the danger.

What if there's a race condition?

And oh my goodness, two of them sneak in.

You know what I mean?

Like, okay, so you've done a little bit extra work and you throw it away.

Vincent Warmerdam

Who cares?

I want to use Redis now.

And Redis is cool, but I've never used it before.

Better buy a book.

Okay, no.

With DiskCache, I mean, I'm sure it won't solve everything, but I make a bit of a joke by saying, just use a decorator.

But it's honestly that feeling that this library really does give you.

You can just use it as a decorator, which has a lot of great use cases.

You can just use it as a dictionary.

So it still feels like you're writing Python.

It's just Python with one concern less.

And that is the magic.

Michael Kennedy

And it takes on so many of the cool aspects of these high-end servers like Postgres or Redis or Valkey.

Valkey is sort of the shiny new Redis, right?

Vincent Warmerdam

I would actually love to do a Redis benchmark.

I haven't done that yet.

But one thing I do wonder is, disks are getting so much faster.

Michael Kennedy

Yes.

Vincent Warmerdam

Right?

And so you can actually at some point wonder like how much faster is Redis really going to be and how much money are you willing to spend on it?

Because if your cache is huge, it all has to go in memory in Redis. I could be wrong, but Redis is fully in memory, I think, right?

Michael Kennedy

I believe so.

There is a database aspect.

Redis is weird because it could be so many.

Redis is cool.

It can do a lot.

But they do have, they actually have benchmarks here in the docs.

Vincent Warmerdam

Ah, there you go.

Michael Kennedy

Compared against memcached and Redis.

And it has the get speed and the write speed.

And this is smaller is better.

Yeah, look at this.

DiskCache beats Redis.

Vincent Warmerdam

And I imagine that's because of the network hop or something.

Michael Kennedy

Yeah, exactly.

I bet it's the network, the network connection.

So if you're running it on the same machine, you would have a different number there.

Vincent Warmerdam


Might be good to maybe caveat that.

Michael Kennedy

Yeah, it might be.

Vincent Warmerdam

I mean, but also that I think that I don't know when they ran this benchmark, but I just checked on PyPI.

This project started in 2016.

So it might-

Michael Kennedy

I bet this is 2016 data right here.

If I know how these docs go.

Vincent Warmerdam

Yeah, so it could also be that those are old disks compared against old memory, right?

Michael Kennedy

So then this is one of those weird benchmarks.

Vincent Warmerdam

You got to really run them every six months or so for them to remain relevant.

Michael Kennedy

Yeah, yeah.

I mean, the new NVMe, whatever, disks,

Vincent Warmerdam

SSD disks are so fast.

And also they're not memory.

Memory is expensive nowadays.

Yes, it is.

People want to build data centers with them, I've heard.

Yeah, yeah.

Michael Kennedy

And on the cloud there, this is a totally, this is another really interesting aspect to discuss.

Probably more of a web dev side of things.

But if you do LRU caches or even to a bigger degree, I run a whole separate server, even if it is just a Docker container that holds a bunch of this stuff in memory, that's going to take more memory on your VM or your cloud deployment or whatever.

And if you just say, well, I have this 160 gig hard drive.

That's an NVMe high-speed drive.

Like maybe I could just put a bunch of stuff there, and you could really thin down your deployments. Not just because it's not in memory in a cache somewhere, but if you're not having any form of cache, you might be able to dramatically lower how much compute you need.

Right.

Like there's layers of how this could shave things off.

Vincent Warmerdam


And again, it's one of those things of like, oh, can I just pay for disk instead?

Oh, that's a whole lot cheaper.

What else do I got to do?

You just got to write a decorator.

Michael Kennedy

Yeah, I think I pay five dollars for, I don't remember exactly, but I pay something like five dollars for 400 gigs of disk. There you go. And do you know how much 400 gigs of RAM will cost on the cloud?

Vincent Warmerdam

Well, I mean, more. There goes the college tuition. Exactly, sorry kids. Yeah, no, but again, I vividly remember when I started college, people were always saying, oh, keep it in memory because it's way faster than disk. But I think we've got to let a lot of that stuff just go.

Michael Kennedy

Interesting idea.

Yeah, I agree, though.

I think you're right.

But anyway,

Vincent Warmerdam

So far, we've mainly been discussing the mechanics of it, but there are some bells and whistles I think we should maybe also mention.

The expiry definitely is one of them.

There's also first in, first out kinds of things that you can do.

So maybe if we could go back to that.

Michael Kennedy

Yeah, there's a bunch of features, actually, there.

The expiry is interesting already because if you do regular, say, LRU caching like that, you have a natural expiry.

It's going to go away when the process restarts, and that's going to happen eventually.

Even in a web app, you ship a new version, you've got to restart the thing or something.

But when it goes to the disk, it starts to pile up, right?

That's why I have this little admin page that has, like, how big is this and a button to clear it.

Vincent Warmerdam

In fairness, that can blow up quite a lot as well, if you're not careful.

Michael Kennedy

It is a concern.

Yeah, it is definitely a concern.

And so I think a big way to fix it, like the expiry, we already talked about why it's interesting for stale data.

You want it to just like auto refresh, but it's also just a safeguard of maybe we'll just recompute this once a month.

It's really quick and easy.

Yeah.

Maybe just don't let it linger forever.

Vincent Warmerdam

I think what you can also do though, if I'm not mistaken, is I think you can also set like a max key size.

So you can say this cache, this particular disk cache can only have 10,000 keys in it and use a first in first out kind of a principle.

Michael Kennedy

Right, or last accessed or number of times.

There's a bunch of metrics there actually for how that works.

Yeah, it's pretty interesting.

Vincent Warmerdam

I've never had to fiddle around with them too much, but it's one of those things where even if I don't need it right now, it is just a relief to see that the feature is there in case you might need it later.

Michael Kennedy

Yeah, yeah, for sure.

Let me see if I can find out where that is.

I don't know, I keep bouncing around the same spot.

I've got it, let's just talk through them.

So I think the tags and expiries are pretty interesting, but then there's also, I think, something that surprised me a little bit is these different kinds of caches.

So there's like a fan out cache.

Have you looked at these?

These are interesting.

Vincent Warmerdam

I remember reading about them. I've never used them, but I do remember reading about them.

Michael Kennedy

So let me give you the quick rundown and then you'll understand instantly.

It's super quick.

So it uses sharding, which is a database term, right?

So sharding is like, if I've got a billion records and it's a challenge to put them all in the same database entry, database table or server, I could actually have 10 servers and decide, well, okay, what we're going to do is if it's a number of the ID of the user, if the first number is one, then they go into this database.

If it's two, then they go in that, right?

So 2005 goes into the second database and so on.

So it does that as well.

This is one of the things it does to try to avoid the issues of multiple writers, I believe.

So you're less likely to write to the same database.

So it doesn't have to lock as hard.

Vincent Warmerdam

It's kind of what you do where you say, oh, I've got this for the YouTube link.

And I've got this for the HTML markdown.

Except those are like different chunks because of their use case.

But you can also imagine, well, I've got this long list of users.

but I still want to benefit from having multiple SQLite instances.

And I suppose that's when you use this, right?

Michael Kennedy

Yeah, I think that is why.

So it says it's built on top of Cache, and the fanout cache automatically shards.

Automatically, you just said how many you want, it figures out what that means.

And it says, while readers and writers don't block each other, writers block other writers.

Therefore, a shard for every concurrent writer suggests it.

This will depend on your scenario, default is eight.

So that's pretty cool.
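In code, that looks something like the sketch below; the shard count and timeout are just examples of the knobs being described:

```python
from diskcache import FanoutCache

# Keys are sharded across several SQLite files, so concurrent
# writers rarely contend for the same write lock.
cache = FanoutCache("./fanout", shards=8, timeout=0.05)

cache["user:2005"] = {"name": "Ada"}  # lands in one shard based on the key
print(cache["user:2005"])
```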

Vincent Warmerdam

Yeah, okay.

And presumably internally it does something like hashing to figure out how to send it around the shards.

Michael Kennedy


Right.

The keys themselves have to be hashable anyway, probably.

So just hashes the key and then shards on like the first couple letters or whatever.

Yeah, this is cool.

So it avoids the concurrency crashes.

The one difference for me, the reason I didn't choose fanout cache is because I want to be able to say, I want to clear all the YouTube IDs, but I want to keep the really expensive to compute hashes.

I want to be able to clear stuff if I really have to by category.

And I guess you could also do that with tags, but I'm just not that advanced.

Vincent Warmerdam

Well, and also keep it simple, right?

And again, it's one of those things where it's, oh, it's nice to know that this is in here, even if you don't use it directly.

I do agree.

This is a really nice feature.

Michael Kennedy

It is.

And I probably will never in my life use it, but it's really cool that it's like, you know, it's one of those things about a library that when you're thinking about picking it, it's like, okay, the core feature is great, but if I outgrow it, what is my next step?

Do I have to completely switch to something really different like Redis or Valkey?

Or do I just change the class I'm using?

Vincent Warmerdam

I had that with this.

So I actually had that feeling a while ago.

We're going to get to my example I think a bit later.

But I was using the Memoize decorator to decorate a function to properly cache that.

But the one issue I had is an input to that function was a progress bar.

It's kind of a Marimo-specific thing.

I wanted this one progress bar to update from inside of this one function.

But the downside was that every time I rerun the notebook, I make a new progress bar object.

Oh, and that means that the input has a different object going in.

So you never actually hit the cache.

So, oh my god, I'm never hitting the cache.

This is horrible.

Then it turns out the Memoize decorator also allows you to ignore a couple of the inputs of the function.

So you can also-- Oh, interesting.

Michael Kennedy

Like ignore by keyword or something, keyword argument.

Vincent Warmerdam

Precisely.

And there's a bunch of use cases for it, and this was one.

And you can just imagine my relief after writing the entire notebook to then look at the docs and go, oh, sweet.
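A sketch of that escape hatch; the function and the progress bar are hypothetical stand-ins for the notebook code being described:

```python
import diskcache

cache = diskcache.Cache("./cache")

# 'progress' changes on every notebook run, so it's excluded from the
# cache key; only 'commit' and 'path' determine a hit or miss.
@cache.memoize(ignore={"progress"})
def blame_file(commit: str, path: str, progress=None) -> int:
    if progress is not None:
        progress.update(1)  # hypothetical progress-bar API
    return 42  # stand-in for the real line count
```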

Michael Kennedy

Yeah, that's super sweet.

Yeah.

Okay.

I think there's right below this.

Yeah, there's a couple.

We can just go down this list here.

There's some cool.

Oh, Django.

So they have a legit Django cache.

Vincent Warmerdam

Yeah, yeah, yeah.

Michael Kennedy

Straight in.

Vincent Warmerdam

Sweet.

Yeah, I recall a little bit.

There was like a huge Django following for this thing.

Michael Kennedy

Yeah, and I think this is a part why.

And when I first saw it, the reason it got sent to me is somebody's like, oh, this disk cache, Django cache is a really cool thing to just drop into Django.

And I'm like, that's cool.

I'm not using Django, but I admire it.

That's why I didn't really look into it until I saw your use case outside of Django.

I'm like, oh, okay, I understand how much this can do.

So the DiskCache Django cache says it uses the fanout cache, which we just discussed with the sharding, to provide a Django-compatible cache interface.

And you just do that in your settings file, and you just say the backend is diskcache.DjangoCache, and you give it a location.

Boom, off it goes, right?

So really, really nice.

Cool.
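The settings block looks roughly like this, following the DiskCache docs; the location path is an example:

```python
# settings.py
CACHES = {
    "default": {
        "BACKEND": "diskcache.DjangoCache",
        "LOCATION": "/var/tmp/django_cache",  # example path
        "TIMEOUT": 300,  # default entry expiry, in seconds
        "SHARDS": 8,     # it's a FanoutCache under the hood
        "DATABASE_TIMEOUT": 0.010,  # seconds to wait on SQLite
    },
}
```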

Yeah, and it sounds like you've done more Django than me.

How's this sit with you?

Vincent Warmerdam

I mean, to be very clear, I do think Django is really nice and really mature.

I do sometimes have a bit of a love-hate relationship with it because Django can go really, really deep.

And some of that configuration stuff definitely can be a little bit in your face.

So the main thing I just want to observe is doing everything manually inside of Django can be very time-consuming.

So it's definitely nice to know that someone took the effort to make a proper Django plugin in the way that Django wants it.

That's definitely the thing to appreciate here.

I've never really used this in a Django app, to be honest.

Michael Kennedy

Yeah.

You know, it has a lot of nice settings here.

Like you can set the number of shards, the timeout, so in case there's a write or read contention, it can deal with that or it can at least let you know you're failing.

It even has a size limit.

Vincent Warmerdam

Does it say how you can configure what to cache and whatnot?

Or is like the-- I've never really used caches in Django in general.

So I don't know if there's a general cache feature in Django itself that it will just plug into or if--

Michael Kennedy

I think there is a general cache feature in Django.

The Django people are like screaming silently.

Yes.

I apologize.

I know, I know.

But I'm pretty sure it's just like a built-in Django cache functionality.

Exactly.

Yeah, okay.

It just routes into this thing.

Vincent Warmerdam

Exactly.

So instead of configuring Redis, you just feed it this and you're good.

That's the idea.

Michael Kennedy

Yes, exactly.

Exactly.

So there's more.

The next one, I have to rage against the machine.

I'm sure it's the way, but Deque, pronounced deck.

Vincent Warmerdam

Yeah.

Michael Kennedy

So I don't know.

For me, I still say DQ.

I don't say deck.

Like, it's spelled D-E-Q-U-E.

And I know a lot of computer science people just call that deck, but diskcache.Deque, or deck, however you want to say it.

It provides, well, there's a couple of higher-order data structures that operate on what we talked about so far, but give you data structure behavior, right?

Like what we talked about so far is sort of dictionary, but not list or order or any of that.

But with this, you can actually go and add a thing.

How do we add one?

Anyway, you can say pop left, pop left, over and over to get things out. I guess you just append to add things.

Vincent Warmerdam

It's kind of like a queue.

Michael Kennedy

Yeah, like a queue, exactly.

With the goal of taking stuff out of it instead of into it.

But you don't normally think of a cache as doing that.

But it'd be a cool way actually to fan out work across processes.
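A small sketch of that cross-process hand-off; the job payloads are invented:

```python
from diskcache import Deque

queue = Deque(directory="./jobs")

queue.append({"task": "render", "episode": 534})  # producer side
job = queue.popleft()  # consumer side, which can be a different process
```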

Vincent Warmerdam

I was about to say, that's a really good one, I think. I mean, there's people that have made, I forget the name, the Python queuing system, Celery.

So that one is also built in such a way that you can say like, oh, where do you have the list of jobs that still need doing?

And I think also Redis is used-- Yeah, right, Redis as the queue.

Michael Kennedy

Yeah, exactly.

Or RabbitMQ as well.

Vincent Warmerdam

Yeah, exactly.

But you can configure SQLite if you want to, though, if I recall with those.

It's just that in this particular case, if you don't want to use Celery, you can still kind of roll your own by using DiskCache as well.

I'm assuming it uses the same pickle tricks, so you can do general Python things and if the process breaks for whatever reason, you still have the jobs that need doing.

Michael Kennedy

Yeah, we're still going to have to talk about this serialization thing, these pickles.

Yes.

Not yet.

Let's go through this.

Let's go through this first.

Let's first.

Before we get distracted, because it's a deep topic.

So Deque, I guess we'll go deck.

Deque provides an efficient and safe means for cross-thread, cross-process communication.

Like, you would never think you would get that out of a cache, really.

Vincent Warmerdam

But it's...

Michael Kennedy

You would do that in SQLite.

Yeah, exactly.

But you would do work to do that, right?

You would do like transactions.

And you would do sorting.

You would figure out, well, what if there's contention?

I mean, the fact that it's just kind of a pop, it's pretty nice.

Vincent Warmerdam

Yeah, that's definitely nice.

Michael Kennedy

No, that's definitely true.

Vincent Warmerdam

Although one thing that makes it easy in this case, though, is again, it is all running in like one process.

So it's not like we've got SQLite running in one place and there's 16 Docker containers that can randomly interact with it.

No, that can, though.

Is that the case?

Michael Kennedy

Because disk cache itself is already cross-process safe.

Like, that's why I was so excited about it.

Vincent Warmerdam

But it has to be on the same machine, though.

Like, that's what I do think.

Michael Kennedy

Yes, it's got to at least be accessible.

Right, that is true.

because technically there's nothing that says you can't put the file anywhere.

I think there are mega performance issues and locking issues; it says basically don't use it on network drives.

Vincent Warmerdam

Yeah, so that's the thing.

Some of this is like, okay, you can do the locking.

You can do all those things.

You can do it well, but the practicality of the network overhead is something that usually causes a lot of kerfuffle,

Michael Kennedy

at least in my experience.

Yeah.

Okay, another one is the DiskCache Index, which creates a mutable mapping and ordered dictionary.

So if you kind of really want to lay into the dictionary side,

Vincent Warmerdam

Yeah, you can do that.

Michael Kennedy

That one has transactions as well.

So you can actually, it has like sort of in-place updates and other things you can do.

So you can say, I want to make sure that I'm going to get two different things out of the cache.

And I want to make sure that they're not changed while I'm doing that, right?

Just like you would with threading or something.

Yeah, so nothing can happen in between.

Vincent Warmerdam

So they both have to come out at the same time.

So the two values that I get, they both existed at the same time in the cache at the point in time that I was retrieving them.

Michael Kennedy

Yeah, yeah, exactly.

So just with cache.transact and you just go to town on it.
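A minimal sketch of that transaction context manager:

```python
import diskcache

cache = diskcache.Cache("./cache")
cache["alpha"] = 1
cache["beta"] = 2

with cache.transact():  # both reads come from one consistent snapshot
    a = cache["alpha"]
    b = cache["beta"]
```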

That's pretty straightforward, right?

Yep.

Are there any more in here?

There's a bunch of recipes for like barriers and throttling and probably semaphore-like stuff, but I don't really want to talk about those.

But you touched on these eviction policies.

Here's what I was looking for.

There's these different ones here that are kind of cool.

Whoops.

It didn't go away.

Vincent Warmerdam

Yeah, so you can set a maximum to the cache.

I think you do that by number of items typically in it.

It could also be the case.

Michael Kennedy

Size or something, yeah.

Vincent Warmerdam

Yeah, or like total disk size, maybe we should double check.

Michael Kennedy

The default for the disk size is one gig.

Yeah, there you go.

So there's already a built-in one, yeah.

Which might catch people off guard.

Most of the stuff stays cached, but not always, and you don't understand why.

Vincent Warmerdam

Yeah, you've got to be a little bit mindful of that, I suppose.

But it's a sane default, not to have it go to infinity.

Agreed.

Yeah, I guess small screen on my side, but like, yeah, least recently stored.

Michael Kennedy

I'll read them out.

I'll read them out for you.

Yeah.

So we've got least recently stored as the default: every cache item records the time it was stored in the cache, and that adds an index to that field.

So it's nice and fast, which is cool.

We have, this is, there's some other ones that are more nuanced, like least frequently used: not in terms of time, but we've got one that was accessed a hundred times and one that was accessed two times.

Even if the one that was accessed two times was just accessed, that one's getting kicked out because it's not as useful.

I don't know.

That's a pretty neat feature.

And then the one people would expect is, I don't know, maybe least recently used.

How'd I do?
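Putting those knobs together, a sketch of the size limit, an eviction policy, and the tag-based clearing mentioned earlier; the keys and tags are invented:

```python
from diskcache import Cache

cache = Cache(
    "./cache",
    size_limit=2**30,  # the documented 1 GB default, shown explicitly
    eviction_policy="least-frequently-used",  # or least-recently-stored/-used
    tag_index=True,  # makes evicting by tag efficient
)

cache.set("yt:episode-534", "some-youtube-id", tag="youtube")
cache.evict("youtube")  # clear one category, keep everything else
```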

Vincent Warmerdam

Yeah.

Yeah, exactly.

And there's also pruning mechanisms, if I'm not mistaken.

So there's all sorts of fun.

You can argue they're bells and whistles until you need them.

And one thing I have always found is every item that I see here, you might not need it right now, but for every item you see, you do plausibly go, oh, but that might be useful later down the line somewhere.

Like the transaction thing where you retrieve two things at the same time.

I don't really have a use case for it, but I can imagine one might, where the consistency really matters.

Michael Kennedy

Yeah, I can.

I could see using the fan out cache.

Vincent Warmerdam

Yeah, definitely.

Michael Kennedy

But probably not the transaction.

But I'm already talking to MongoDB, which doesn't have transactions effectively.

So not really.

What about performance?

Should we talk about your graphs?

You brought pictures.

Vincent Warmerdam

Yes.

So that might be, so when you told me like, hey, let's do an episode on disk cache, and I kind of told myself, okay, then I need to do some homework.

Like, I actually have to use it for something real.

It's a bit complex.

So what we're going to try and do is we're looking at a chart right now, and I'm going to explain to Michael what it does, and I'm going to try to explain it in such a way such that if you're not watching but listening, that you're also going to be fairly interested in what you're seeing.

Michael Kennedy

And I'll link to the chart, of course, so people can check it out.

Vincent Warmerdam

Yeah, so this is all running on GitHub pages, and the charts that you see here definitely needed a bit of disk cache to make it less painful.

So what I've done is I've downloaded a Git repository.

What you're looking at right now is the Git repository for Marimo.

And then I just take a point in time and I say, okay, let's just see all the lines of code.

And then I take another point in time.

And then I basically just do kind of a Git blame to see if the line got changed in between.

So what you're looking at here is kind of a chart over time where it's basically like a bar chart, but it changes colors as time moves forward.

and the shape that you see is that things that happened early on, well, there's a nice thick slab, but it gets a little bit thinner and thinner as time moves forward because some of those lines of code got replaced.

But in the case of Marimo, you can see that, you know, most of the lines of code actually stay around for a long time

Michael Kennedy

and it's kind of like a smooth sedimentary layer every time we move forward.

It's compressing a little over time.

Like the weight of the project has sort of compressed it.

So, yeah, it's pretty interesting.

So, okay, so that's pretty cool.

Vincent Warmerdam

But you can also go to Django.

Michael Kennedy

So there's a selector on top.

So if I go to Django... oh, you can pick it right here.

This is really different.

What is this telling us?

Vincent Warmerdam

So what you can see here is that at some point in time, there's a huge shift in the sediment.

There's a lot of light sand and a lot of the dark sand goes away.

There's also a button that allows you to show the version number.

So I've-- yep, there you go.

Michael Kennedy

There we go, yeah.

Vincent Warmerdam

So you can see that right before a new version, a bunch of changes got introduced, or right after.

It's usually around the version number that you can see that shift.

Michael Kennedy

Right, once the feature freeze is lifted, some stuff comes in, PRs come in maybe or something.

Yes.

Vincent Warmerdam

And one other thing that's actually kind of fun: there's this project called scikit-lego that you can also go ahead and select.

And folks--

Michael Kennedy

I've heard a pretty cool guy maintains that, yeah.

Vincent Warmerdam

Well, so the funny thing is you can see that there's a massive shift there at some point.

Michael Kennedy

OK.

Vincent Warmerdam

That's when we got the new maintainer.

Michael Kennedy

Are you on the purple or the green side?

Vincent Warmerdam

So there's this dark blue sediment that sort of goes down massively at that point.

But yeah, in this case, the first thing he did was redo all the docs.

So we went from Sphinx to MkDocs.

And if you look at the lines of code that changed as a result, that's quite a lot.

But if we now start talking about how you make a chart like this, you got to imagine like, I take the start of the GitHub history, I take the end of the GitHub history, I sample like 100 points in between, and then for every line in every file, I do a git blame.

Michael Kennedy

I think Django is something like 300,000 lines of code.

I mean, that's a lot of-- A lot of--

Vincent Warmerdam

So that thing took two hours and 15 minutes on my M4 Mac.

And if you go there, you can actually select it.

That one-- that was a chunky boy, is what I'll say.

Yeah, 550,000 lines.

Yeah.

There you go.

But you can see that there's one version change, I think, where they made a bunch of changes.

And it could be that-- Yeah.

That might have been, again, a big docs change, because I checked the docs and Markdown files as well.

But hopefully by just looking at this, you can go like, oh yeah, this is probably a notebook somewhere.

And there's a huge for loop that does threading and tries to do as much in parallel as possible.

And there's a progress bar.

Right?

Yeah.

And we don't--

Michael Kennedy

Now I see why you had this problem with the caching.

Vincent Warmerdam

That's right.

But yeah, but here's also where the threading came in.

Because the moment you say this point in time, now do all the files, do the git blame, that's definitely something that can happen in parallel.

But then for every file, for every point in time, you do want to have something in the cache that says, okay, if I have to restart this notebook for whatever reason, that number is just known.

Don't check it again.
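(A hypothetical sketch of that per-file, per-commit caching. The blame_at helper and directory name are invented for illustration; memoize is the real DiskCache decorator.)

    import subprocess
    from diskcache import Cache

    cache = Cache('./blame-cache')

    @cache.memoize()  # keyed on (repo, commit, path); survives kernel restarts
    def blame_at(repo: str, commit: str, path: str) -> str:
        # blame one file at one point in time
        result = subprocess.run(
            ['git', '-C', repo, 'blame', '--line-porcelain', commit, '--', path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout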

Michael Kennedy

Yeah.

Yeah, super interesting.

Okay.

I wonder about this 4.0 in 2022. Might that be when they switched to async?

When they started supporting async.

It could be docs as well.

I'm not sure.

Vincent Warmerdam

Well, yeah, so that's kind of the hard thing of some of these charts.

Like, I could expand these charts by saying things like, okay, only the Python files, et cetera.

But, like, the way that this is hosted, this is, like, really using the disk as a cache because all these charts are Altair charts, and you can save them to disk, and then you can easily upload them to GitHub pages.

So I do everything in DiskCache to make sure that if, for whatever reason, the notebook fails, I don't have to do anything fancy to get it back up.

But then once it's time to actually put it on the site, I could use disk cache to show the charts, but then I would need a server.

So using the disk to just serve some static files is also a perfectly fine idea.

There are some things on the Marimo side where we are also hoping to maybe give better caching tools to the library itself.

It's just that when I was doing this, I actually found a bug in our caching layer, so then I switched back to disk cache.

Michael Kennedy

You know what?

Look, that's valuable.

That's maybe not the way you will find, but it's valuable.

Vincent Warmerdam

It's-- oh, so one thing you learn is that caching is actually hard to get right.

Michael Kennedy

Oh, it is.

Vincent Warmerdam

It is.

Michael Kennedy

Very hard.

Vincent Warmerdam

It's on par with naming things.

Michael Kennedy

It is one of the two hard things in computer science-- naming things, cache invalidation, and off-by-one errors.

Vincent Warmerdam

Yes, exactly.

Michael Kennedy

It's the middle one.

Vincent Warmerdam

Yeah.

Dad jokes are amazing.

Anyway, so one thing about this repo, by the way, this is all my-- we're going to add a link to the show notes.

There is a notebook, so if you feel like adding your own project, feel free to spend two and a half hours of your compute resources to add a popular project.

I would love to have that.

One thing I think will be cool with these sorts of charts is to see what will change when LLMs kind of get into the mix.

Do we see more code shifts happen if more LLMs get used for these libraries?

Michael Kennedy

MARK MANDEL: Very interesting.

Vincent Warmerdam

I don't know if more code--

Michael Kennedy

old code will get changed, but they are verbose code writers, those things.

Vincent Warmerdam

Yes.

So this, assuming we can do this over time and we're going to start tracking this, I'm calling this code archaeology.

I do think it will be an interesting chart.

As is, I think it's already quite interesting to see differences between different projects.

I think if you go to Sentence Transformers, you can also see when the project got moved from an academic lab to Hugging Face.

So there are interesting things you can see with these charts.

But you are going through every file, every line,

Michael Kennedy

git blame, 100 times.

Vincent Warmerdam

Yeah, a lot.

Michael Kennedy

You've got to do it 100 times per line as well.

Vincent Warmerdam

Well, so you start at the beginning of the Git repository, and then to make a chart like this, you've got to sample over the entire timeline.

And it is a bit cheeky because sometimes you can go like, OK, but there's a character that changed because of a linter.

And then is that really a change?

Does it really matter?

Michael Kennedy

It's whoever decided to run the formatter on the thing or whatever.

Vincent Warmerdam

We're looking at the Django chart.

It could also just be that Black got an update or something like that, right?

Yeah, exactly.

It's also possible.

Michael Kennedy

It's very possible.

Vincent Warmerdam

It's unlikely, but it's not impossible, let me say.

But yeah, anyway.

Yeah, this was one of the benchmarks that I did with disk cache that I thought was pretty amusing and pretty interesting.

But there's this one other feature that I think we should also talk about, which is that if you want to, disk cache actually lets you do the serialization yourself.

So normally, what it would do is it would say, like, OK, let's do the pickle thing.

And it's a bit clever, right?

So if the thing you're storing is, like, an integer, it doesn't go through the whole pickle machinery; it just stores it as an integer. There are these native types

Michael Kennedy

that SQLite has, and then it's able to do something clever. But as soon as it becomes, like, a custom class or a list of weird things, then, yeah... I personally don't like the pickling. I would prefer that it makes me do something explicit. I think it's weird.

Vincent Warmerdam

Well, so the thing is, you can write your own Disk class, and then you can pass that disk class to DiskCache itself. And I'm just kind of wondering, when might it make sense to do this sort of thing? If you go to the docs, there's actually a really good example

Michael Kennedy

right there, which is JSON. Yeah, I lost your example. I'll get it back. If you type 'disk'--

Vincent Warmerdam

you'll find it. So JSON has this interesting property: it is text, if you think about it, but it's text that has a bit of structure. And that means there are compression libraries you can actually run on it. Especially if you have a pattern that repeats itself, say a list of users, where there's always the user key and maybe always the email key, and those things just repeat themselves all over the place, then there's an opportunity. There's this library called zlib where you can just take that string, compress it, and then that compressed representation can go into DiskCache instead.
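(The recipe they're describing, adapted from memory of the JSONDisk example in the DiskCache docs; treat the exact method signatures as an approximation.)

    import json
    import zlib
    from diskcache import Cache, Disk, UNKNOWN

    class JSONDisk(Disk):
        def __init__(self, directory, compress_level=1, **kwargs):
            self.compress_level = compress_level
            super().__init__(directory, **kwargs)

        def put(self, key):
            # compress the JSON-encoded key before SQLite sees it
            data = zlib.compress(json.dumps(key).encode('utf-8'), self.compress_level)
            return super().put(data)

        def get(self, key, raw):
            data = super().get(key, raw)
            return json.loads(zlib.decompress(data).decode('utf-8'))

        def store(self, value, read, key=UNKNOWN):
            if not read:
                value = zlib.compress(json.dumps(value).encode('utf-8'), self.compress_level)
            return super().store(value, read, key=key)

        def fetch(self, mode, filename, value, read):
            data = super().fetch(mode, filename, value, read)
            if not read:
                data = json.loads(zlib.decompress(data).decode('utf-8'))
            return data

    # 'disk_'-prefixed kwargs get forwarded to the Disk constructor
    cache = Cache(disk=JSONDisk, disk_compress_level=6)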

Yeah, I figured that sounds like a lot of fun.

You can just grab the implementation there.

I have this notebooks repository where I have LLMs just write fun little notebooks.

I always check the results obviously just to be clear on that.

But one thing that was I think pretty cool to see, if you just do the normal data type and you pickle it, then you get a certain size.

And if you just have a very short, normal, basic Python dictionary, then the gain is negligible.

You don't really need this JSON trick there.

But the moment you get text heavy, there's just a lot of text that you're inputting there and there's some repetition of characters.

Or if you really do something that's highly compressible, it is not unheard of to get like 80%, 90% savings on your disk space, basically.

Now, there is a little bit of overhead because you were doing the compression and decompression.

But if you're doing text-heavy stuff, this is something that can actually save you a whole bunch.

And I can imagine for LLMs, this would also be a win.

Michael Kennedy

OK.

So this JSONDisk not only serializes to and from JSON, which I think is safer.

It can be a pain if you have datetimes, though.

Vincent Warmerdam

You've got to do something about that, with orjson or something like that.

You can.

Michael Kennedy

Yeah, yeah, I think.

But then it's using Zlib here.

You know, I just actually did something like this for just something in my database.

It had nothing to do with caching.

But these records are holding tons of text for something sort of tangential to the podcast.

And I'm like, I don't really want to put 100K of text that I'm not going to query against or search into the database.

So I used Python's built-in XZ implementation, the lzma module.

And like you said.

Vincent Warmerdam

There's a bunch of compression algorithms you could use.

Michael Kennedy

It was way fast.

So I just store it as bytes now, and it's like a tenth the size.

It's great.

So I guess this is the same, but for the cache back end, right?
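(The standard-library version of that trick is only a couple of lines; lzma is Python's built-in XZ module, and the sample text is obviously made up.)

    import lzma

    transcript_text = 'lots of repetitive text... ' * 1000
    blob = lzma.compress(transcript_text.encode('utf-8'))  # store these bytes
    restored = lzma.decompress(blob).decode('utf-8')       # round-trips exactly
    assert restored == transcript_text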

Vincent Warmerdam

Yeah.

And I think-- well, you can see, I think Zlib is being used internally.

Michael Kennedy

Yeah.

I mean, it's not XZ, but it's the same idea.

Vincent Warmerdam

Yeah, exactly.

And there's always new compression algorithms.

Like, feel free to check whatever makes sense.

But the fact that there's one very cool example in the docs that you can just copy and paste means a lot of people benefit from it.

But why stop here?

Michael Kennedy

Because this is what you can do for JSON.

Vincent Warmerdam

What else can you do?

Michael Kennedy

Before we move on, though, if I were writing this, I would recommend using ujson or orjson or whatever, some of the more high-performance versions right there.

Vincent Warmerdam

Yeah, and I think ORJSON-- I mean, performance is cool.

The reason I use ORJSON a lot more has to do with the types that it supports.

So it can accept NumPy arrays, for example, and just listifies them.

And I think it has a few things with dates as well.

It just has a slightly better support for a few things.
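(A small, hedged example: orjson needs the OPT_SERIALIZE_NUMPY flag to turn arrays into JSON lists, while datetimes serialize out of the box.)

    import datetime
    import numpy as np
    import orjson

    payload = {'vec': np.arange(3, dtype=np.float32),
               'ts': datetime.datetime(2025, 12, 19)}
    raw = orjson.dumps(payload, option=orjson.OPT_SERIALIZE_NUMPY)
    print(orjson.loads(raw))  # {'vec': [0.0, 1.0, 2.0], 'ts': '2025-12-19T00:00:00'}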

Michael Kennedy

OK, good to know.

All right, where are we going?

What's next?

Vincent Warmerdam

Numpy arrays.

OK.

So a lot of people like to do things with embeddings nowadays.

So like text thing goes in, some sort of array thing comes out.

And then hopefully, if two texts are similar, then the arrays are also similar.

So you can do all sorts of fun little lookups.

And I do a fair share of doing things with embeddings.

And embeddings are not notoriously expensive to calculate, but they're still pretty expensive.

OK, but you can write your own full Python serializer in there.

So if you compare storing NumPy as raw bytes to a pickle, it's actually even.

There's very little to gain there.

But one thing you could do is you could say, well, let's maybe bring it down to float 16.

That's a thing you can do.

You can sort of say, before we save it, we actually make it just a little bit less accurate on the numeric part of it.

But that'll save us a whole bunch of disk space.
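(A minimal sketch of that downcast-before-store idea; the key name and vector size are made up.)

    import numpy as np
    from diskcache import Cache

    cache = Cache('./emb-cache')
    vec = np.random.rand(1536).astype(np.float32)  # pretend embedding

    # float16 halves the bytes; fine if you can tolerate the precision loss
    cache.set('doc:123', vec.astype(np.float16).tobytes())
    back = np.frombuffer(cache.get('doc:123'), dtype=np.float16).astype(np.float32)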

Michael Kennedy

So that's a trade-off.

Well, you need it to be super precise when it's involved in calculations.

But then in the end, if you're not going to report the numbers to great decimal places, maybe going down is good, yeah.

Vincent Warmerdam

Yeah, it depends on the use case.

But typically, you could argue maybe a 1% difference in similarity if we have 100x savings on disk.

That'll be kind of a win.

So one thing I was sort of focusing on is just this-- you can do things like, OK, come up with your own little weird data structure where you say, OK, let's pretend we're going to quantize the whole thing.

So we're going to calculate the quantiles of the float values that it can take.

And we're going to take basically 256 buckets.

We're going to store the scale.

We're going to store the mean.

And then we're going to store in what bucket the number was in.

And you can turn that into a string representation.

These things are pretty fun to write.

Nice.
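(A hypothetical version of that 256-bucket trick, just to make the idea concrete. This is not the exact code from Vincent's notebook.)

    import numpy as np

    def quantize(vec: np.ndarray):
        # store the mean and scale, plus one signed byte (256 levels) per value
        mean = float(vec.mean())
        spread = float(np.abs(vec - mean).max())
        scale = spread / 127 if spread else 1.0
        buckets = np.clip(np.round((vec - mean) / scale), -128, 127).astype(np.int8)
        return mean, scale, buckets.tobytes()

    def dequantize(mean, scale, raw):
        return np.frombuffer(raw, dtype=np.int8).astype(np.float32) * scale + mean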

And yeah, and then you scroll down into your big notebook and then you find this.

There you go.

That's the retrieval time.

I think I got like a 4x improvement in terms of disk space saved.

It was like a 1% similarity score that I had to give up for doing things like this.

Mileage can vary, of course, but, again, these are fun things to start playing with, because you have control over the way that it gets written down.

So that was also like a fun little exercise to do.

Michael Kennedy

Yeah.

Could you save NumPy arrays by just converting them to bytes?

There's probably some efficient way.

You know what?

What about a parquet file?

Like in an in-memory Parquet file, then you just say, here's the value in bytes.

Vincent Warmerdam

So I tried the bytes thing and compared it to the pickle thing, and that was basically the same size.

Michael Kennedy

OK.

Vincent Warmerdam

That barely led to anything.

About the Parquet one, I mean-- You do get compression.

Well, yeah, but I could be wrong on this one.

But I think Parquet is optimized to be a disk representation.

And then once you want to have it in memory, it becomes an arrow representation.

I see.

Yeah, probably.

So in that sense, what I would do is: OK, if you have something in Arrow, you could use DiskCache to make sure it's written as Parquet.

But then you have to-- you kind of have to know what you're doing if you're going to make Parquet files.

And also, the benefit of a parquet file is that you have one huge table.

Because then-- Right, you can scan it, yeah.

Yeah, it's a columnar format.

So then if I were a column, boy, would I want to have all the rows in me.

So what I would do instead is: if, for whatever reason, you just have a lot of data, it's still kind of a cache, but it makes more sense to store all of it in one huge Parquet file.

In Parquet, you can store partitions.

So you can say this one column is partitioned; a date would be a very typical thing to partition on.

And then if you point Polars at the Parquet files but say, I only want this date, it can do a filtered scan and only pick the rows that you're interested in.

And I would imagine that that would beat anything we might do with this cache, especially if the table is big.
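(Sketching that out with invented paths: Polars can push a date filter into a partitioned Parquet scan.)

    import polars as pl

    # hypothetical dataset laid out as events/date=YYYY-MM-DD/part-*.parquet
    lazy = pl.scan_parquet('events/**/*.parquet', hive_partitioning=True)
    # depending on how the partition column is parsed, you may need a Date literal
    day = lazy.filter(pl.col('date') == '2025-01-15').collect()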

Michael Kennedy

So you don't always want to cache stuff.

Like I said, I don't avoid hitting the Mongo database for a lot of my projects, because the response time is quick enough, so might as well.

I want to take two little avenues here.

But the first one is, what about DuckDB?

I know at least on the data science side and the analytics side, DuckDB is really popular, really well respected, really fast.

Maybe you don't even cache it.

Maybe you just use DuckDB as a thing. How do you feel about that?

Vincent Warmerdam


I mean, DuckDB does solve a very different problem than SQLite or Postgres in a way.

So I don't believe-- to name one thing, I believe DuckDB does assume that everything under the hood is immutable.

So it will never be ASCID compliant, because it doesn't necessarily have to be.

You can still insert rows, if I'm not mistaken.

But the use case is just assumed to be analytical in general.

That like-- I see.

--it's really designed to sort of fit that use case.

You can insert rows, though.

Michael Kennedy

So like-- I mean, you might be caching data science things that you're only computing once.

Like, for example, your charts.

Once you've computed that, it's not going to change because it's historical.

Vincent Warmerdam

I mean, I might want to rerun it a month later or something like that.

That's something I might want to do.

And in that particular case, it would be cool if, when the sampling is the same and I just want to add one sample at the end, all those samples I had before are in a cache somewhere.

Michael Kennedy

Or maybe you want faster, better resolution.

Instead of going 100, you're going to go to 1,000 points, but you could do 10% less because those are done, right?

Vincent Warmerdam

So stuff like that.

But then what you would never do with a cache is do a group-by and then a mean, for example.

It's like--

Michael Kennedy

It's not-- it's outgrown its use at that point.

That's for sure.

Vincent Warmerdam

Yeah.

And if it were a part of it, then the docs would say so.

But like-- no, so in my mind, DuckDB really just solves a different problem, similar to how SQLite in general also solves a different problem than DiskCache.

And also Postgres is also solving a slightly different problem.

Michael Kennedy

Sure.

Vincent Warmerdam

All right, fair.

Michael Kennedy

Like the other angle-- Yeah?

No, go ahead.

Finish your thoughts.

Vincent Warmerdam

Well, I also really love Postgres, I got to say.

Like, the thing I really like about it is that it is boring software, but in a good way, where I have a Postgres thing running there, and whatever SSH setup I need, I can just swap the cloud provider and it'll just go ahead and still run without me having to rework anything or do any migration or anything like that.

That is also just like a nice feeling, but it solves a different problem.

Michael Kennedy

Yeah.

Yeah.

These would very likely be used together, not instead of-- I mean, Postgres can be used instead of disk cache, but disk cache, definitely not instead of Postgres.

So the other angle I wanted to riff on, have you riff on just a little bit, is I think people, especially people who are maybe new to this idea of caching, can end up thinking, okay, I'm going to store stuff.

We talked a lot about like, I get something back from the database.

I could store that in a cache.

So I don't have to query it again or whatever.

And those are certainly good use cases.

But I think a lot of times an even better use case is if you're going to get 20 rows back from a database, do some Python and construct them along with a little other information into some object.

And then that's really what you want to work with.

Store that constructed thing in the cache.

You know what I mean?

Go as far down the compute pipeline as you can. Don't just stop at, well, it comes back from the database, so we cache it.

If there's a bunch of work after that, think about how you might cache at that level.
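(A minimal sketch of caching the assembled object rather than the raw rows; fetch_rows stands in for a real database query.)

    from diskcache import Cache

    cache = Cache('./app-cache')

    def fetch_rows(episode_id):
        # stand-in for the real database query
        return [{'guest': 'Vincent'}, {'guest': 'Michael'}]

    @cache.memoize(expire=300)
    def episode_view(episode_id):
        rows = fetch_rows(episode_id)
        # cache the fully constructed object, not the raw rows
        return {'id': episode_id, 'guests': [r['guest'] for r in rows]}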

Vincent Warmerdam

Okay, I'm going to pitch you a dream then.

Imagine you have a Python notebook, and you're running a cell, you're running a cell, you're running a cell, and halfway through, the kernel dies for whatever weird reason.

Michael Kennedy

Right.

Vincent Warmerdam

It'd be nice if I could just reboot the notebook and it would just pick it up again and move on.

Because again, I'm picking something out of a database and I'm doing something little with it and processing, processing, processing.

But wouldn't it be nice if maybe every cell had a caching mechanism?

If only you had some influence.

If only we were a company that did this kind of stuff.

You can imagine these are things that we are thinking about.

And again, what I'm about to suggest is definitely a dream.

This is not something that works right now.

Don't pin me down on this.

Like, we're thinking out loud here.

You can also imagine this being super useful where an entire team can share the cache.

Yeah.

Right?

So if your colleague already calculated something, you don't have to recalculate it again.

There's all sorts of use cases like that as well.

But there are these moments when you want to have very tight manual control over what goes into the cache.

That makes a lot of sense and it's great.

But there are also moments when you just really don't want to think about it at all.

And you just want everything to be cached.

Michael Kennedy

Could I just use this thing as a checkpoint?

I can autosave as my code runs.

Yeah, that's cool.

Vincent Warmerdam

Yeah.

And again, doing this right is hard, because there's all sorts of weird Python types.

And I mentioned the progress bar thing.

And there's all sorts of things that we've got to be mindful of here.

But if you're really thinking about how you would use this in data science, where you fetch a little bit of data and deal with it, then to me it's starting to feel natural to think about cells in a notebook and maybe cache on that level.

Michael Kennedy

Yeah, that's pretty interesting.

just sort of cascade them along as a hash of the hashes of the prior cells or--

Vincent Warmerdam

Well, and this is where things become tricky, of course, because then, OK, I've got this one cell, and I change one function in it.

Oh, if you're going to cache on the entire cell, oh, everything has to rerun.

And then, oh, if you're not careful, your cache is going to be huge.

So like, OK, how do you do this in a user-friendly way?

There's all sorts of-- it sounds easier than it is, is the one thing I want to say.

Michael Kennedy

Yeah, well, I think you've got a better chance with Marimo than with Jupyter, at least, because you have a dependency graph.

So you can at least say, if this one is invalid, that means the following three are also invalid, and have that sort of propagate through.

Vincent Warmerdam

Totally.

But I try to focus a little bit more on the user experience side of things.

And one thing I've really learned from the notebook with the progress bar is just there were moments when I felt like, oh, I just want this entire thing to be automated.

Don't make me think about this.

And then there were moments where I thought, oh, it's really nice to have tight manual control.

How do I provide you with both?

Yeah.

That's quite tricky.

But it is a dream, so that's something to keep in mind.

Michael Kennedy

Yeah, maybe someday there'll be a turn-on-caching checkbox or something.

Vincent Warmerdam

Yeah, or well, at least till then, I do think having something that works on disk instead of memory these days is also just a boon.

Michael Kennedy

Right.

So this works in data science notebooks.

It works in web apps.

It works in little TUIs.

It doesn't care.

Vincent Warmerdam

It works with LLMs.

And actually, similar to your setup: if you have multiple processes with your web app running on one VM, or if you have one big VM that you share with your colleagues, you can also just share the cache.

Michael Kennedy

Yeah, that's true.

You can just point them all at the same files, yeah.

Vincent Warmerdam

Yeah, and especially if you're doing big experiments, like grid search results or stuff like that, where you really don't want to recalculate the big compute thing.

That's actually not too unreasonable.
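(A hedged sketch of that shared setup; the directory is hypothetical, and SQLite handles the cross-process locking.)

    from diskcache import FanoutCache

    # every process on the shared VM opens the same directory
    cache = FanoutCache('/srv/shared/team-cache', shards=8)

    @cache.memoize()
    def grid_point(learning_rate: float, depth: int) -> float:
        # stand-in for the expensive training run
        return learning_rate * depth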

Michael Kennedy

Yeah, that's cool.

Yeah, if you have a shared Jupyter server.

Vincent Warmerdam

Yeah, and a bunch of universities have that, right?

Michael Kennedy

Yeah, exactly.

I don't want to go down this path because we're basically out of time.

I respect your time.

However, I do think there's a whole interesting conversation to be had about how you choose the right key for what goes into the cache.

Because you can end up with staleness really easily if something changes.

But if you incorporate the right stuff, you might never run into stale data problems because, you know, like for example, I talked about the YouTube ID.

Basically, the cache key is something like episode:{episode ID}:youtube:{hash of the show notes}, right?

Something like that, where there's no way that the show notes are going to change and I'll get the old data, because guess what?

It's constructed out of the source, right?

Things like that.

There's probably a lot, especially like in your notebook side.

There's a lot to consider there, I think.

Vincent Warmerdam

Yeah, I mean, I remember this one thing with the course where I still wanted it to be cached, but I wanted to have, like, text goes in and then five responses from the LLM come out.

And the way you solve that is you just add another key.

But you have to be mindful of the cache key.

And you can-- oh, and you can use tuples, by the way.

That's also something you can totally use as a cache key.

Michael Kennedy

Right, right.

Vincent Warmerdam

So that was easy to fix.

It's just that you have to be mindful.
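(A small sketch of both ideas: a key derived from the source content, and a tuple key for the multiple-responses case. All the names here are illustrative.)

    import hashlib
    from diskcache import Cache

    cache = Cache('./app-cache')

    def youtube_key(episode_id: int, show_notes: str) -> str:
        # key derived from the source text, so edits invalidate automatically
        digest = hashlib.sha256(show_notes.encode('utf-8')).hexdigest()[:16]
        return f'episode:{episode_id}:youtube:{digest}'

    # tuples also work as keys, e.g. one entry per LLM attempt
    cache.set(('summary', youtube_key(534, 'notes...'), 3), 'third response')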

Michael Kennedy

Yeah, that's kind of-- I want to give a quick shout-out to that.

I don't want to leave on a sour note.

But I think it's necessary to call this out, rather, is the way I should say it: I think this project is awesome.

You think it's awesome.

Honestly, I don't think it needs very much.

But if you look at the last-updated date, it really hasn't gotten a lot of attention in the last six months or so.

Vincent Warmerdam

Yeah, and if I look at PyPI, the last release was 2023, which is, yeah, a couple of years ago.

Michael Kennedy

Yeah, and I would like to say that it's okay for things to be done.

Things don't have to change. But there's also a decent amount of conversation on the issues that hasn't gotten answers; a couple of days ago, actually, someone asked about this. And the last changes-- I believe the guy Grant, who works on it, started working at OpenAI around the time the changes stopped going in.

I'm not entirely sure.

I feel like I could be confusing this with another project.

So Grant, if that's not true, I apologize, but I think that it is.

Pretty sure that is.

Do we have LinkedIn?

Vincent Warmerdam

I mean, what I am comfortable stating is-- let me put it this way.

Yes, my colleague might make a different caching mechanism.

And yes, I might use that at some point.

But at least for where I'm at right now, this cache needs to break vividly in front of my face for me to consider not using it.

Because it does feel like it's done in a really good way.

The main thing that needs to happen, I think, functionally, to make sure this doesn't rot too badly, is you've got to update the Python versions.

When a new Python version comes out, you've got to update PyPI to confirm, like, OK, we do support this Python version.

But I mean, most of the-- if you look at the area that needs to be covered, a lot of that has been covered by SQLite.

And that thing is definitely still being maintained.

Michael Kennedy

It's getting mega maintained.

That's right.

So I also don't see the problem.

I'm not going to not use it.

I just want to put it out there on the radar, so nobody goes, oh, Michael and Vincent were so psyched, and I started to use this, and now I'm really disappointed because of whatever.

Vincent Warmerdam

I mean, the only real doom scenario I can come up with is if SQLite made like a breaking change.

That's the only thing I can kind of come up with.

Michael Kennedy

But the odds of that seem very low.

Yeah, and it's on GitHub.

You can fork it.

I'd fork it.

Exactly, exactly.

So, no, I definitely am still super excited about it.

I just want to make sure that we put that out there.

I'd intended to talk about it sooner in the conversation, but you know what?

We were just so excited.

Vincent Warmerdam

Yeah, no.

This is definitely in my top five favorite Python libraries outside of the standard lib.

Awesome.

Michael Kennedy

Yeah, I've really, really have gotten awesome results out of it as well.

So remember the way we opened the show and you talked about this, like when we talked about the LLM building block stuff on the previous time you were on the show, it was like, oh, we better not go too deep on this, even though we're both so excited because it's going to derail the show.

We're now one hour and 15 minutes into it.

We kind of cut ourselves off.

I think that was accurate.

Vincent Warmerdam

Yeah, I mean, you get two dads making dad jokes and riffing on tools they both like.

It's basically a barbecue.

Michael Kennedy

Yes, I know.

I wonder what would happen if sometime we just removed the time limit, just got real comfortable and just riffed on something.

It could be hours.

It would be fun.

But maybe not today.

Two-hour live streams exist, Michael.

I know.

I've listened to some podcasts at over three hours.

I'm like, how is this still going?

But you know what?

Yeah.

It's all good.

Vincent Warmerdam

But yeah, this is a good point in time.

Michael Kennedy

We're both excited.

That's the summary.

That is the summary.

And I think I'm going to let you have the final word on this topic here.

Like, maybe speak to people about caching in general and DiskCache in particular as we close it out?

Vincent Warmerdam

I mean, I guess the main thing that I learned with the whole caching thing in the last couple of years is that I always thought it was kind of a web thing.

Like, oh, you know, front page of Reddit, that thing has to be cached.

That's the way you think about it.

Yeah, of course.

And thinking about it too much that way totally blocked me from considering, oh, but if you do stuff in notebooks and data science land, then you need this as well.

And I think there's actually a little emerging discovery phenomenon happening where people that do things with LLMs at some point go, oh, I need a cache.

And then, oh.

So that's the main thing I suppose I want to say.

Like, even if you're doing more data stuff, like give this disk cache thing a try.

It's just good.

Michael Kennedy

Yeah, it's so easy to adopt and try out.

Like, you can throw it in there.

Just add a decorator.

Exactly, see what you get.

See what you get.
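(The decorator path they mean, as a minimal sketch with an invented function.)

    import urllib.request
    from diskcache import Cache

    cache = Cache('./cache')

    @cache.memoize(expire=3600)  # results live on disk for an hour
    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url) as resp:
            return resp.read()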

All right, Vincent, thank you for coming back.

I really appreciate it.

Thanks for having me.

Always good to talk to you.

Yeah.

Vincent Warmerdam

And yeah, see you next time when we, again, find out there's a cool Python library.

Michael Kennedy

Yeah, that's going to be the three-hour episode.

Watch out, y'all.

Have a good one.

Later.

This has been another episode of Talk Python To Me.

Thank you to our sponsors.

Be sure to check out what they're offering.

It really helps support the show.

If you or your team needs to learn Python, we have over 270 hours of beginner and advanced courses on topics ranging from complete beginners to async code, Flask, Django, HTMX, and even LLMs.

Best of all, there's no subscription in sight.

browse the catalog at talkpython.fm.

And if you're not already subscribed to the show on your favorite podcast player, what are you waiting for?

Just search for Python in your podcast player.

We should be right at the top.

If you enjoyed that geeky rap song, you can download the full track.

The link is actually in your podcast player's show notes.

This is your host, Michael Kennedy.

Thank you so much for listening.

I really appreciate it.

I'll see you next time.

I'm out.
