·E33

Storing Files in the Database w/ Bogdan Kharchenko and Skyler Katz

Episode Transcript

00:00:07.05 Chris Morrell All right. Welcome back to Overengineered, the podcast where we ask the question: what's the absolute best way to do things we already have a perfectly acceptable solution for? Today, I am back with Bogdan Kharchenko and Skyler Katz, and we're going to be hearkening back to the early episodes of Overengineered. I started this podcast with this concept that there are these things that we talk about and never have time to really dig deep on and come up with a really great solution for, because they just don't matter that much. 00:00:47.28 Chris Morrell And I feel like the show has shifted away from that some over the years, but I still want to have those conversations from time to time. So we had an opportunity arise and I thought we'd get into it. But before we do, you guys want to say hi? 00:01:06.34 Skyler Katz Hello, hello. It's good to be back. 00:01:09.62 Bogdan Kharchenko Hey, Chris. Hey, everybody. Thank you, Chris, for inviting me and Skyler back on the show. We are the OG. If it wasn't for us, I don't know where your show would be today, Chris. 00:01:17.12 Chris Morrell That's right. 00:01:18.95 Bogdan Kharchenko So it's great to have us back. 00:01:20.39 Chris Morrell It wouldn't exist without you. Yeah. 00:01:22.87 Bogdan Kharchenko Exactly. 00:01:23.45 Skyler Katz You're welcome. 00:01:27.70 Chris Morrell All right, let me set the stage. So we have this sort of programming architecture, database architecture question that we want to work through, which is: when you have tables 00:01:48.12 Chris Morrell that hold big blobs of data, right? So we have a couple of use cases. One is a table that holds the contents of large documents, and others are tables where we're holding Base64-encoded image data.
00:02:03.59 Chris Morrell And in both of those cases, the amount of data at the individual row level is not that insane, right? The Base64 images are small cropped images of a black-and-white signature, so it's not that much data. 00:02:23.17 Chris Morrell But when you compound that over many, many rows and a couple of different tables holding this kind of data, it starts to add up. And in the other case, we have this table that's got a million and a half rows in it. 00:02:37.30 Chris Morrell It's not an insane number of rows. It's not an insignificant number of rows either. But each one of those rows contains an entire document. 00:02:51.05 Chris Morrell And just operating on the table is a little bit clumsy. If you have to take a database dump, it takes a long time. And if you need to import that database dump, it takes a long time. 00:03:01.88 Chris Morrell It's a little awkward to work with. And we're also seeing this with Verbs events. If we fire a ton of Verbs events that have large 00:03:11.29 Chris Morrell chunks of data attached to them, now our Verbs events table is getting kind of large. So it's a problem that we've bumped into a bunch of times. It's a conversation that's happened on the Verbs Discord a bunch: how do I fire events that have files associated with them? And maybe those files could be even larger than what we're talking about. Right now, Verbs doesn't have a great solution to that other than to tell people: put the file in S3, set up some permissions on that bucket such that the files can't be easily deleted, and attach a reference to those files in your event. 00:03:52.33 Chris Morrell But anyway, there are a couple of different ways to approach this.
They range from "we just don't change anything and it's fine" up to a very involved but very clever solution that I think I've come up with, and I thought we'd run through them. 00:04:13.48 Chris Morrell Is there anything that I'm missing? 00:04:17.59 Skyler Katz I don't think so. I mean, those are the two ends of the spectrum, I suppose. 00:04:24.22 Chris Morrell I think that this is a problem that probably a lot of people have bumped into in different ways, right? It's very convenient to have relational data. 00:04:36.12 Chris Morrell Actually, before I get started on that, I do think there's one other relevant piece of information, which is that in almost all the cases that we've been talking about internally at InterNACHI, it's been data that 00:04:55.45 Chris Morrell we care about for a specific period of time, and then it becomes less and less likely that we'll ever need to pull up the larger parts of the data later, right? 00:05:10.99 Chris Morrell But we still need to keep it around and we still may need it. In the case of the documents, while the document is active, we're going to be loading it from the database regularly. 00:05:22.31 Chris Morrell But then those records are mostly just used for aggregate queries and directory listings and stuff like that, and not necessarily shown or queried ever. 00:05:37.05 Chris Morrell So there's an element of time to the way we're thinking about it that I think comes into play, where it's perfectly feasible to potentially offload that data somewhere else after a certain period of time, so that we can keep the database size down and keep the database nimble, but still have quick, immediate access to the data when needed.
00:06:07.61 Chris Morrell So that just feels like one other relevant detail. 00:06:12.66 Bogdan Kharchenko Yeah, so one thing I wanted to maybe talk about a little bit before we get into the weeds of your overengineered solution, Chris, is the issues that we're having, right? We're trying to just, with our ORM, query a few hundred records or a few thousand records. 00:06:29.19 Bogdan Kharchenko And because we have to pull this additional column every time, it's slowing things down, because it is doing an I/O read of that data. I just want to set the stage of what we're actually dealing with. And I think one of the ways that we've been combating this so far is 00:06:45.34 Chris Morrell Yeah. 00:06:50.58 Bogdan Kharchenko explicitly asking for certain columns from our database in our Eloquent query. And that's great, but I think it is a little bit tedious. Having to say, yeah, of course I need to get the created_at column and some other column, the user ID column, in six different places is just not ideal. 00:07:14.60 Chris Morrell Yeah. 00:07:16.72 Chris Morrell And it's easy to forget as well. 00:07:17.03 Bogdan Kharchenko Yeah. 00:07:18.69 Chris Morrell Like maybe you optimize a bunch of different queries and then you add a new feature and you don't remember, oh, I have to explicitly select columns with this particular model. 00:07:30.64 Chris Morrell It's just not an obvious thing to keep track of. 00:07:34.30 Bogdan Kharchenko Yeah, totally. I mean, we've obviously just been experiencing that by grabbing too much data. The query itself is very fast and it's all optimized, but because it has to get all this data, it's just super slow.
00:07:49.57 Skyler Katz And it's just not fun to write that code, when you've got your model query and all these things and you're like, ah, I've got to select with an array of column IDs. 00:08:01.40 Skyler Katz It just isn't, I don't know. It's not Laravel. 00:08:04.54 Chris Morrell It's not eloquent code. 00:08:06.51 Skyler Katz Exactly. 00:08:06.65 Bogdan Kharchenko It's true. It's true. Yeah. 00:08:09.20 Chris Morrell Yeah, I know that there are people who take a different stance than I do on this stuff, but my general rule of thumb with Eloquent is: try not to heavily optimize until you need to, because 00:08:29.47 Chris Morrell those optimizations always come with some cost. Oftentimes the upside is minimal, if any. And the downside is you're going to spend three hours debugging something because in this one place you happened to decide not to fetch a column, and the error isn't showing up in an obvious way, or your relational query is scoped somewhere else and you're not getting the data back that you're expecting. I just find that the more you can do straight queries with Eloquent, even if it's a little bit less optimal, the better your life is going to be. And then you optimize them when you actually start to see a need. 00:09:16.14 Bogdan Kharchenko Amen. 00:09:21.88 Chris Morrell Or, you know, there are some cases where it's like, okay, I'm going to be operating on every single record in this table, and I only need the ID. Of course I'm only going to query just the ID in that case. 00:09:33.04 Chris Morrell But in most cases, you just don't need to do that, and it makes your life so much better. So I think you're right, Skyler. It just sucks to have to do it all the time, basically. 00:09:43.45 Skyler Katz Yeah.
I've run into situations previously where I've selected columns and then I'm just like, oh, it's null. Why is it null? Oh, because in the part where I was fetching the data, I didn't select that new column that I added. And pain. 00:09:59.66 Chris Morrell Right. 00:10:01.98 Chris Morrell Yes. And the default Eloquent behavior is that it just gives you a null, and you don't know: is this actually null, or did I not request this data? 00:10:13.49 Chris Morrell Which there is a way to prevent, but. 00:10:13.94 Bogdan Kharchenko Oh, yeah. I was just going to mention that, Chris. I didn't mean to cut you off. There is a way to turn that flag on so that your application blows up a little bit when you try to do that. 00:10:27.16 Bogdan Kharchenko But still, I agree. It's not obvious. It's painful to work with. 00:10:32.84 Chris Morrell Yeah. 00:10:35.31 Chris Morrell So there are a couple of simple solutions to this problem, or simpler solutions to this problem. 00:10:39.09 Bogdan Kharchenko Mm-hmm. 00:10:41.40 Chris Morrell The most obvious one is just to move the heavy data to another table, right? Query it when you need it. You can even query it through joins so that you can still treat it like a regular attribute on the table and just use a scope to join it in when you need to. 00:11:03.42 Chris Morrell And that way it's an opt-in instead of an opt-out. Maybe throw an accessor on there so that if it's not loaded,
00:11:12.79 Chris Morrell it lazily loads it, and you could throw some sort of exception if it was an N+1 situation. I'm sure there's a way to tie into the default Laravel behavior there. And I think that would solve a lot of the problems, right? Frankly, that may be the solution that we end up reaching for, because it's pretty simple and it doesn't require us to reinvent too much of the wheel. It doesn't solve a bunch of the DX problems, though, because you still have 40 gigs of data in your database dumps if you need to pull that backup down. 00:11:59.14 Skyler Katz Well, and I feel like. 00:11:59.34 Chris Morrell And it just means, yeah, you've got separate tables now that you're dealing with, right? 00:12:04.78 Skyler Katz You have separate tables, but also, even in the Verbs events context, weird stuff would happen if you don't end up with a record in the other table where the blob of data is supposed to be. 00:12:21.88 Chris Morrell Right. 00:12:22.02 Skyler Katz And so if we had an articles table and the content was in some other table, but for some reason it didn't get written, you just end up in a weird place too, with things potentially getting out of sync. 00:12:33.77 Chris Morrell Right. Yeah, for sure. 00:12:34.67 Skyler Katz Which isn't resolved by your complicated situation either, but it's, 00:12:37.72 Bogdan Kharchenko Yeah. 00:12:41.04 Bogdan Kharchenko I mean, I will say, yes, there's an extra table and there's maybe an additional join, but this is a pattern that everybody's used to. You just have this belongs-to relationship or has-one relationship and you call it a day, right? This is a very well-known pattern.
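The "move the heavy data to another table" option discussed here can be sketched in a few lines. This is a language-neutral illustration using Python's built-in sqlite3 module rather than Laravel or MySQL, and the table and column names (documents, document_contents, body) are hypothetical, not taken from the episode. It also shows the out-of-sync case Skyler raises: a row whose contents were never written.

```python
import sqlite3

# In-memory database standing in for the application's real database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        created_at TEXT NOT NULL
    );
    CREATE TABLE document_contents (
        document_id INTEGER PRIMARY KEY REFERENCES documents(id),
        body TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO documents VALUES (1, 'First', '2024-01-01')")
conn.execute("INSERT INTO documents VALUES (2, 'Second', '2024-01-02')")
# Only document 1 gets a contents row; document 2 simulates a failed write.
conn.execute("INSERT INTO document_contents VALUES (1, ?)", ("x" * 50_000,))


def list_documents():
    """Listing query: reads the light table only, never the heavy one."""
    return conn.execute(
        "SELECT id, title, created_at FROM documents ORDER BY id"
    ).fetchall()


def load_with_body(document_id):
    """Opt-in join; body comes back as None if the contents row is missing."""
    return conn.execute(
        "SELECT d.id, d.title, c.body FROM documents d "
        "LEFT JOIN document_contents c ON c.document_id = d.id "
        "WHERE d.id = ?",
        (document_id,),
    ).fetchone()
```

The listing query stays cheap no matter how large the bodies get, while the LEFT JOIN makes the missing-contents case visible as a None instead of silently dropping the document.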
00:12:57.38 Bogdan Kharchenko And we do these types of relationships, not necessarily for a specific blob column or a long text column, but they exist and I feel like developers are used to that. 00:13:10.18 Bogdan Kharchenko So yes, there could be some instances where something goes south and that thing didn't get copied over to that additional table. But I really think that could be mitigated. And I just think in general, people are much more comfortable with that type of offloading of data into another table. 00:13:34.35 Bogdan Kharchenko So I think, yeah, totally, totally. 00:13:34.47 Chris Morrell Yeah, I mean, it's definitely more straightforward, for sure. 00:13:39.32 Chris Morrell The big downside for me, or the thing that makes me think that at least in our use case it's not the right solution, is Verbs. Because 00:13:51.48 Chris Morrell either we have to introduce some sort of special, I guess we could create a new table called like verbs heavy data or, I don't know, some other table that we make clear is also part of Verbs, 00:14:08.17 Chris Morrell and just internally treat that like the event store, like, this table can't be messed with. 00:14:19.87 Chris Morrell Or, yeah, I don't know what the solution is. We would have to do some weird stuff with Verbs, right? Maybe first we'd write that data to this other table, and then we'd fire the event and reference the ID of the other, I don't know. 00:14:38.94 Chris Morrell It just feels pretty bad when you get into the event-sourced side of things to have something that's so separate from all the other event-sourced data, but is so integral to it. 00:14:57.29 Chris Morrell That's my take, at least.
00:15:00.11 Bogdan Kharchenko But I mean, it sounds like in the case of Verbs specifically, the event data is very crucial for making sure the event state is built up correctly, right? 00:15:13.56 Bogdan Kharchenko And maybe I don't fully understand the problem with Verbs, but there is obviously also a large chunk of JSON data inside the table. 00:15:24.41 Bogdan Kharchenko And it's basically unneeded unless you're replaying the event, right? Is that what I understand? 00:15:31.61 Chris Morrell Well, yeah, and even if you're replaying the event, there are lots of replays that you might do that never touch the document contents, right? So in the case of these documents, 00:15:40.84 Bogdan Kharchenko Sure. I see. 00:15:43.02 Chris Morrell we have document created, document updated type events. And right now we're projecting that to the database, but we may want to project to some sort of analytical tool as well in the future. 00:15:59.74 Chris Morrell And most likely, for all of those projections, we don't need the document contents at all. So we'd want to be able to quickly and efficiently fetch all of that data without having to get that big blob of contents that we don't care about, which in this case we can't exclude because it's inside a JSON column. I guess we could do some really crazy JSON subquery type stuff, but I don't know what the efficiency of that looks like. I'm just not sure what that would look like. 00:16:37.49 Skyler Katz I mean, maybe... 00:16:37.57 Bogdan Kharchenko I see, I understand. 00:16:37.87 Chris Morrell And I imagine that no matter what, InnoDB still has to fetch the JSON contents, right? In the case of a long text, InnoDB is putting that content on a different page, right? 00:16:52.15 Bogdan Kharchenko Mm-hmm. Mm-hmm.
00:16:52.38 Chris Morrell So if you don't ask for it, there's literally no performance overhead. Whereas with JSON content, I can't imagine that that's the same. I could be wrong, but I can't imagine it's the same. 00:17:04.86 Skyler Katz In Verbs, when we're serializing the data down to store it in the database, could there be some thing that Verbs does where it's just like, this JSON is large, or a key is large, and it's just going to, within Verbs, dump that somewhere and just store a reference to it in the data column? 00:17:31.28 Chris Morrell Yeah, 100%. 00:17:31.83 Skyler Katz And then pull it back out, sort of behind the scenes, so that the end user has no care in the world about it. 00:17:42.38 Chris Morrell I mean, I would like Verbs to have a built-in solution. It feels bad for us to 00:17:53.56 Chris Morrell put together some sort of taped-on extra thing that we do on top of Verbs. It would be so much better if it was something that was built in. We already have the serialization and deserialization pipeline in Verbs, so it's the type of thing that could be handled fairly transparently. 00:18:19.48 Skyler Katz Put an attribute on the key in the payload or something and just ship it off. 00:18:22.86 Chris Morrell Yeah, exactly. Yeah, yeah. 00:18:28.05 Chris Morrell But whatever it is, I think creating a separate big contents table that's just a UUID and a long text, and just referencing that UUID somewhere, feels gross. 00:18:46.94 Chris Morrell If it weren't for Verbs, I think I could live with it. But with the way we do Verbs, it feels a little gross to me. I'm not necessarily opposed to it, 00:18:59.83 Chris Morrell because the simplicity side of it is really nice.
Right? But, yeah, it just feels not great. 00:19:16.06 Bogdan Kharchenko Yeah, so let me step back a little bit. One of the things that we want to solve with potentially this overengineered solution is not just the query performance, but not having to always ask for all of the data if you just need a small little column in the Verbs JSON payload, right? 00:19:43.76 Bogdan Kharchenko Because what you're saying is, in the Verbs JSON event payload, you would just say, I just want this and this key, rather than the entire document that's attached to it. 00:19:57.44 Bogdan Kharchenko Is that what I understand? 00:19:59.29 Chris Morrell Yeah, I think you would essentially wrap the data behind some sort of abstraction that has a way to retrieve the data. 00:20:11.08 Chris Morrell And until you retrieve it, it's like, the thing it makes me think of is, when I was exploring Swift, there's the concept of wrapped variables, where 00:20:24.12 Chris Morrell when the variable's wrapped, you don't know if it has a value or if it's an error, and you can pass them around. And then the code that actually cares unwraps the variable and deals with the consequences of that action. 00:20:38.88 Bogdan Kharchenko Mm-hmm. Mm-hmm. 00:20:39.97 Chris Morrell So it was kind of that idea: you'd pass around some sort of object, and if you never need the data, all you're doing is passing around a reference to an object. 00:20:51.63 Chris Morrell And then the moment you need the data, either the object would already have the data, or the object could load the data, or the object could throw an exception if for some reason the data wasn't there, right? 00:21:02.61 Chris Morrell And I did a little bit of research, and the term for this is the claim check pattern.
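The "wrapped" object Chris describes (already has the data, can load it, or throws if it is missing) can be sketched as a small lazy wrapper. This is an illustrative Python sketch only, not how Verbs or Eloquent implement anything; the names Deferred, MissingPayload, and unwrap are hypothetical.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class MissingPayload(Exception):
    """Raised when the backing store has no data for the reference."""


class Deferred(Generic[T]):
    """Holds only a key until unwrap() is called (hypothetical name)."""

    def __init__(self, key: str, loader: Callable[[str], Optional[T]]):
        self.key = key
        self._loader = loader
        self._loaded = False
        self._value: Optional[T] = None

    def unwrap(self) -> T:
        # Load at most once; later calls reuse the cached value.
        if not self._loaded:
            value = self._loader(self.key)
            if value is None:
                raise MissingPayload(self.key)
            self._value = value
            self._loaded = True
        return self._value


# A dict stands in for whatever secondary store holds the payloads.
store = {"doc-1": "full document text"}
loads = []  # records every trip to the store


def load_from_store(key: str) -> Optional[str]:
    loads.append(key)
    return store.get(key)


contents = Deferred("doc-1", load_from_store)
```

Passing `contents` around costs nothing until some code calls `contents.unwrap()`, which is the point where the store is actually hit.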
And so this is a relatively well-established pattern in applications that deal with lots of 00:21:19.82 Chris Morrell big data payloads, where you essentially have a system where you exchange some data for a claim check, which is just some metadata. 00:21:31.11 Chris Morrell And then at any time you can exchange the claim check back for the data. But until you need it, you're just passing around a UUID or something like that that represents the data, and you only need to load it if you need it. 00:21:46.58 Chris Morrell And I feel like you could implement this with a database table. You can implement this with something like S3. You could put it on the file system if you wanted to, although you probably don't want to. But 00:22:00.87 Chris Morrell it's a nice generalized pattern that solves this problem of, let's offload this and only get it if we need it. You know what I mean? 00:22:12.72 Bogdan Kharchenko Yeah, I mean, I've obviously heard you talk about this already and it sounds very promising. I do want to poke at this a little bit more. Maybe if you dive deeper, I'll find some holes in it. But it just seems like maybe in bulk operations, retrieving data with this claim check, are you going to be back in the same spot as storing data on the row itself? If you just had the JSON blob, or whatever the content is, in that event, and you have to rehydrate this thing with some claim check from another table or S3 bucket in real time, are you just back at step one? 00:23:01.52 Chris Morrell Yeah. I mean, it's 100% a trade-off, right? Essentially, you're trading optimizing for the one case for optimizing for the other case. 00:23:13.77 Chris Morrell And the downside of the traditional claim check pattern, or just...
I mean, the other thing that we could do, like we do with other file-system-related things: there are lots of tables 00:23:27.64 Chris Morrell that we have that are related to a file, where we just have a disk column and a path column, right? And you just load those from the file system as needed. 00:23:34.41 Bogdan Kharchenko Mm-hmm. 00:23:39.32 Chris Morrell And I think the claim check pattern has that same downside, which is that every time you need the data, you have to load it from this secondary store. 00:23:51.89 Chris Morrell And so the solution that I had proposed before we got on this call was the idea that, well, what if we still had a long text column on this table, and we just write the data to the long text column as usual? 00:24:05.98 Bogdan Kharchenko Mm-hmm. 00:24:13.57 Chris Morrell Then we implement some sort of trait or attribute or something that you can put on your model, you'd probably use a custom cast, to describe which data should be handled in this way. 00:24:36.10 Chris Morrell And we run a scheduled job every day. If I'm stepping back and thinking about this more as an open source package and less as just a solution that we're going to use (because the advantage of implementing some sort of open source package is that we can also adopt it as the canonical way to do this in Verbs), 00:24:57.06 Chris Morrell we could just have a scheduled task that runs a command that essentially discovers every model that has this cast. 00:25:09.28 Chris Morrell And sort of like Laravel Scout, you could optionally define a method that tells this job how to efficiently load the data. 00:25:21.44 Chris Morrell And you could optionally implement a method that tells this job the logic of when things should be moved to what I'm calling cold storage.
00:25:23.89 Skyler Katz Mm-hmm. 00:25:34.87 Chris Morrell So by default, maybe any record that's older than seven days or 14 days or 30 days or whatever it is gets moved to cold storage. But each model could separately define its own internal logic. 00:25:49.55 Chris Morrell And that way you might say, okay, if this document hasn't been sealed yet, then it never moves to cold storage. But the moment it has been sealed and is more than 30 days old, it moves to cold storage. And what moving to cold storage would be is: we grab the contents of the record, we write it to S3, and we replace it with some sort of JSON payload that says, this is the disk, this is the path, 00:26:27.61 Chris Morrell this is the timestamp that it was last accessed, perhaps. Well, that's a thing that I haven't quite figured out. But essentially, it would swap it out with a reference to a file in S3, right? 00:26:43.07 Chris Morrell But then, since we're using a custom cast, when you try to access the attribute, Eloquent would just say, okay, well, if the contents is there, just return it. 00:26:57.07 Chris Morrell So for all the stuff that's in the hot path, the stuff that's in the last 30 days or whatever, that long text column is just going to have the contents and it'll work like it does right now. 00:27:01.58 Bogdan Kharchenko Mm-hmm. Mm-hmm. 00:27:10.05 Chris Morrell If the contents isn't there, the cast can just exchange that claim check for the value in S3. 00:27:25.75 Chris Morrell And we could even have drivers. There could be a database driver, an S3 driver, and a DynamoDB driver, who cares, whatever. And so, transparently, you could just access the contents as though it was always there.
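A rough sketch of the cold-storage idea described here: the column holds either the real contents or a small JSON reference, and the read path resolves the reference transparently. It is written in Python for illustration, with a dict standing in for the S3 bucket; the claim_check marker key and the function names are assumptions, not part of any existing package.

```python
import json

# Stand-in for an S3 bucket: path -> contents.
cold_store = {}


def _parse_reference(raw):
    """Return the reference dict if raw is a claim check, else None."""
    try:
        decoded = json.loads(raw)
    except ValueError:
        return None  # plain text, so this is hot data
    if isinstance(decoded, dict) and "claim_check" in decoded:
        return decoded["claim_check"]
    return None


def move_to_cold(row, path):
    """Write the contents to cold storage, then swap in a reference."""
    cold_store[path] = row["contents"]
    row["contents"] = json.dumps({"claim_check": {"disk": "s3", "path": path}})


def read_contents(row):
    """What a custom cast's getter might do: hot data is returned as-is,
    cold data is fetched through the reference."""
    reference = _parse_reference(row["contents"])
    if reference is None:
        return row["contents"]
    return cold_store[reference["path"]]


row = {"id": 1, "contents": "full document text"}
```

One design caveat: a document whose own contents happen to be JSON with a top-level claim_check key would be misread as a reference, so a real implementation would need a less ambiguous marker, such as a separate column.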
00:27:42.16 Chris Morrell And for the content that is most likely to be needed, it will just be there. And for the content that has been moved to this cold storage concept, it's just one additional call to S3, but it's transparent to the application, 00:27:59.45 Chris Morrell which has a lot of appeal to it. The downside is that now we're back to those records always being loaded no matter what. 00:28:17.16 Chris Morrell So it doesn't solve that first problem that we talked about, right? It's not a perfect solution to that, but it's an interesting solution to a lot of the other problems that we've faced. 00:28:27.72 Skyler Katz I mean, what's the downside of just getting rid of the column in the database altogether and just storing it in S3? 00:28:39.16 Chris Morrell Right. So just always storing it in S3. 00:28:42.86 Bogdan Kharchenko But then you would lose, 00:28:43.02 Skyler Katz I mean, even in the Verbs events context, 00:28:50.17 Skyler Katz there's a small performance penalty, but I just don't think it's that big. If you're hosting your application in 00:29:00.60 Chris Morrell Right. 00:29:00.84 Skyler Katz AWS and S3, I mean, even if you were on Laravel Forge's new boxes and you were using R2 from Cloudflare, this stuff is so close to your box that I just don't know that there's really that big of a performance penalty to 00:29:00.88 Chris Morrell We're on, yeah. 00:29:20.73 Skyler Katz pulling stuff that's not accessed that much. I mean, these documents are accessed several times over the course of a week, and then probably never again until somebody decides to look back at their document for historical purposes. 00:29:36.89 Skyler Katz And then you just pull it back in.
00:29:38.33 Skyler Katz I mean, in Verbs, we don't use Verbs state. I suppose if you were doing mostly state stuff and you needed it to hydrate all of the events every time, that would 00:29:51.02 Skyler Katz come at a cost. 00:29:53.51 Chris Morrell Yeah. Now, I think because the cold storage concept doesn't address that first issue, which is that you're having to manually handle which columns you're loading, it doesn't feel like actually a great solution to our problem. 00:30:18.58 Bogdan Kharchenko I mean, 00:30:18.70 Chris Morrell I do think that maybe 00:30:20.91 Chris Morrell abstracting that behind a concept called a claim check, right? Where there is an object that you get back, and you can pass that object around until you need it. And then the moment you need it, it either has already been loaded or can be loaded dynamically. 00:30:43.68 Chris Morrell And that would provide a way to potentially say, if you're storing your claim check data in the database instead of S3, you could eager load that stuff. Or maybe there's a way to even load multiple records more efficiently from S3 in a single call. 00:31:00.22 Chris Morrell But I do think that maybe just a simple, okay, here's an S3 bucket that's only for this. It's got restrictive policies on it so that the data can't be deleted, 00:31:16.41 Chris Morrell and we just store essentially a UUID that is a reference to that file. That could be 00:31:29.85 Chris Morrell the best option. 00:31:31.79 Skyler Katz I mean, I've never used it, but S3 has a way to query S3 files in a SQL-like syntax. And I wonder, if the file was stored in a certain way, whether you could just pull them back. 00:31:40.87 Chris Morrell Okay. 00:31:49.89 Skyler Katz Which would be interesting. 00:31:50.33 Chris Morrell Yeah.
00:31:54.37 Bogdan Kharchenko Yeah, I mean, one thing that strikes me is, if the file exists only in S3, you trade off something else, right? You can't do JSON queries, for example, on that column, even if you've had some data hydrated. You can't do text search, for example, of whatever document. 00:32:21.72 Bogdan Kharchenko And one thing that came to my mind as you guys were discussing this, and maybe, Chris, you touched on this a little bit: if there was a table called claim check that had some sort of long text column, 00:32:37.17 Bogdan Kharchenko and it had a UUID or an ID or whatever, that is the thing that you've exchanged somewhere else, right? So there is this relationship, and that kind of solves the issue of having to limit or explicitly say which columns you want to fetch from your main model, 00:32:58.73 Bogdan Kharchenko because the content is just in another table, but it ultimately will end up in some sort of S3 bucket. But still there's a direct reference to that column in this table. 00:33:14.96 Bogdan Kharchenko And you don't have to query everything all at once. And if you do have the warm data, you could maybe do some sort of searching or operations. 00:33:26.28 Skyler Katz You could put the warm data in a claim check table. And then that table takes old records that haven't been accessed in a while and dumps them off to S3 with a reference. 00:33:40.53 Skyler Katz And, 00:33:42.42 Bogdan Kharchenko Well, that's basically what I'm describing. 00:33:43.54 Skyler Katz And then a custom, yeah, a custom relationship where, if you're eager loading, we just fetch the content from S3 and push it back into the model when you're grabbing them.
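The claim check table that Bogdan and Skyler converge on (warm data kept in one table, stale rows dumped to S3 with a reference left behind) might look roughly like this. It is a Python sketch with dicts standing in for the table and the bucket; ClaimCheck, archive_stale, and redeem are hypothetical names, and the 30-day window used in the example is only the figure floated in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ClaimCheck:
    id: str
    body: Optional[str]          # None once the row has been archived
    archive_path: Optional[str]  # set once the row has been archived
    last_accessed_at: datetime


def archive_stale(table: Dict[str, ClaimCheck], bucket: Dict[str, str],
                  now: datetime, max_age: timedelta) -> int:
    """Move bodies not accessed within max_age into the bucket."""
    moved = 0
    for check in table.values():
        if check.body is not None and now - check.last_accessed_at > max_age:
            path = f"claim-checks/{check.id}"
            bucket[path] = check.body  # write the copy before clearing the row
            check.archive_path = path
            check.body = None
            moved += 1
    return moved


def redeem(table: Dict[str, ClaimCheck], bucket: Dict[str, str],
           check_id: str, now: datetime) -> str:
    """Exchange a claim check for its data, warm or archived."""
    check = table[check_id]
    check.last_accessed_at = now
    if check.body is not None:
        return check.body
    return bucket[check.archive_path]


now = datetime(2024, 6, 1)
table = {
    "old": ClaimCheck("old", "old body", None, now - timedelta(days=90)),
    "new": ClaimCheck("new", "new body", None, now - timedelta(days=1)),
}
bucket: Dict[str, str] = {}
```

Because every heavy payload lives in this one table, the scheduled job only has one place to look, and skipping the job entirely leaves everything in the database.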
00:33:57.74 Bogdan Kharchenko It almost doesn't even have to be in the model, right? It doesn't even have to be a column. 00:34:01.82 Chris Morrell This is a relationship. 00:34:03.21 Bogdan Kharchenko It's just a relationship, right? There's just this claim check and, whatever it is, you can change a cast to it or whatever you want to do. 00:34:05.91 Skyler Katz Yeah. 00:34:14.37 Bogdan Kharchenko I don't know. 00:34:14.64 Chris Morrell I do like that. 00:34:14.69 Bogdan Kharchenko I think that's a happy medium. 00:34:16.63 Chris Morrell I like that offloading that data to the database also solves the problem with, or the most complicated piece of, this cold storage idea, which was having to auto-discover all the places where you needed to look. 00:34:29.83 Bogdan Kharchenko Mm-hmm. 00:34:31.29 Chris Morrell But if you just had all that data in one table, then that scheduled task could just run every day and grab everything that hasn't been accessed in more than 30 days or whatever, you just configure it, and offload all of that to S3. And that could be a totally optional process, right? You could choose to do it or not. You could turn that off and just keep the stuff in the database if that's what you wanted. 00:35:01.19 Chris Morrell Or if you were concerned about the size of your database tables, you could offload some of the data to S3 as you needed to. 00:35:09.77 Skyler Katz Yeah. I mean, storing data that's rarely accessed in S3 is going to be cheaper than storing it in RDS by fractions of a penny per hour, but yeah. 00:35:19.40 Bogdan Kharchenko Yeah, I mean, I think the savings in data storage will probably be offset by access reads and writes to S3. 00:35:22.54 Chris Morrell Sure.
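[Editor's sketch] The daily scheduled task Chris describes (grab everything in the one claim check table that hasn't been accessed in more than 30 days and offload it to S3, leaving a reference behind) could be sketched as below. This is an illustration in Python with assumed names, not the hosts' implementation: offload_stale, the row fields (content, path, last_accessed), and the "claim-checks/" key prefix are hypothetical, and a plain dict stands in for the S3 bucket.

```python
from datetime import datetime, timedelta


def offload_stale(rows, bucket, now, max_age=timedelta(days=30)):
    """Move content of rows not accessed within max_age into the bucket.

    rows:   dict of id -> {"content", "path", "last_accessed"}
            (hypothetical schema for a claim check table)
    bucket: dict standing in for an S3 bucket (key -> content)
    Returns the ids that were offloaded on this run.
    """
    moved = []
    for row_id, row in rows.items():
        if row["content"] is None:
            continue  # already offloaded on an earlier run
        if now - row["last_accessed"] <= max_age:
            continue  # still warm, leave it in the database
        key = f"claim-checks/{row_id}"
        bucket[key] = row["content"]  # write to cold storage first
        row["path"] = key             # keep only a reference in the row
        row["content"] = None
        moved.append(row_id)
    return moved
```

Because the job is driven purely by max_age, turning the process off (as suggested in the discussion) amounts to never scheduling it; the rows simply stay in the database.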
00:35:33.77 Bogdan Kharchenko I don't know how much it costs, but it probably is also very minimal. But I think the solution that we're going after is dealing with large data, right? And maybe when you scale up to 10 gigabytes, or 20, 30, 50, 100 gigabytes of data, then it actually makes sense. 00:35:53.84 Bogdan Kharchenko But if you only have one gigabyte of data or a hundred megabytes, which even that is a lot, I think, just keep it in a database. Like, if somebody was like, here's this package and you have to use S3, I'm like, oh man, that's another thing I've got to maintain. 00:36:10.59 Bogdan Kharchenko But if I can just optionally say, oh yeah, well, this is where big blob data is stored, and I'm okay with the cost because I have two users on my application, me and my wife, 00:36:21.74 Bogdan Kharchenko I think that's super appealing, just to have a specialized table to hold some of this bigger-size data. 00:36:32.79 Chris Morrell Yeah. I mean, that's another thing that is pretty common in this pattern: just having a data size threshold. 00:36:43.59 Chris Morrell So that could be another configuration option that we could offer and tweak for ourselves, where it's like, yeah, if the size is over X, don't even write it to the database. 00:36:56.48 Chris Morrell Just go straight to S3, because we don't want to put a gig of data in a blob column. 00:37:02.37 Bogdan Kharchenko Yep. 00:37:03.55 Chris Morrell You know what I mean? But as long as it's under some number of kilobytes of data, write it to the database first and then archive it to S3 later once it hasn't been accessed for a while. We could also implement something like that. 00:37:23.06 Chris Morrell And that would give us sort of the best of both worlds.
You could essentially write whatever amount of data to one of these things and just know that it's going to be transparently handled for you. 00:37:38.16 Bogdan Kharchenko I like it. 00:37:40.19 Skyler Katz Ship it. 00:37:41.25 Chris Morrell There we go, we solved it. 00:37:42.51 Bogdan Kharchenko Chris already wrote the code. 00:37:42.84 Chris Morrell This is a rare occurrence: 37 minutes, and we just solved the problem. 00:37:48.87 Skyler Katz Yeah, it is. 00:37:49.19 Bogdan Kharchenko Yeah, I mean, I think on paper it all looks good. I know that actually implementing this and working with it long term is going to prove challenging no matter what. I'm sure there's probably table partitioning that we should think about, or maybe, as you're writing this package, how to automatically partition by date. 00:38:09.81 Bogdan Kharchenko I don't know if it's necessary, but it could be worth exploring, because if you say to me, yeah, here's this magical table that's going to store millions of records, then at some point it's going to come to a grinding halt the same way some of the other tables that we've been dealing with have. 00:38:28.74 Bogdan Kharchenko I don't know. It has to be, I think, time-proven. But I think it's a good premise. And obviously we didn't invent anything. 00:38:39.05 Bogdan Kharchenko Like you mentioned earlier, this is a common pattern for storing data in long-term storage, cold storage, with this claim check concept. 00:38:49.12 Chris Morrell Yeah. 00:38:50.10 Skyler Katz I mean, there's one other thing. When we were upgrading our database, the person from Percona was like, in MySQL you can make a shadow column, where the column is there, but when you do select star, it just never returns it.
00:39:08.49 Skyler Katz You have to explicitly say, give me the content column. 00:39:09.06 Chris Morrell Yeah. 00:39:10.42 Bogdan Kharchenko Oh, that's a good idea. Yeah. 00:39:14.00 Skyler Katz Yeah. Which could also be an approach for some of these tables, and then we just have a query scope that says, like, with big content or whatever, that adds the select star, comma, select content to the query. 00:39:25.94 Chris Morrell Right. 00:39:26.13 Bogdan Kharchenko Yeah. 00:39:27.58 Chris Morrell Mm-hmm. 00:39:31.14 Chris Morrell who 00:39:32.24 Skyler Katz That's an option instead of having to remember to do all of them. I don't know. Or an attribute that, I don't know, makes a second query to the table if you're trying to access the content column but it's not there. 00:39:47.37 Skyler Katz It's a potential approach. 00:39:49.18 Bogdan Kharchenko I mean, I really like that simple solution, Skyler, but I think it's not over-engineered enough. This is the problem. 00:39:54.93 Skyler Katz Well, I mean, the downside to this is that it's opaque and you're like, why do I have to say with content? 00:39:59.33 Bogdan Kharchenko Yes. 00:40:03.12 Skyler Katz Like, do you still get that same problem of, I did a select star, but it's not giving me back all of the data that I'm expecting? Yeah. 00:40:10.72 Chris Morrell Well, I mean, I would go further and say the problem is that it doesn't solve two of the major headaches that we are specifically dealing with, right? 00:40:18.27 Skyler Katz Yes. 00:40:19.78 Chris Morrell It does solve the one. It solves the querying problem, but it doesn't solve the size of data when we need to do debugging. And yeah, it doesn't, like... 00:40:34.25 Chris Morrell I mean, ultimately the data is still just in that table, right?
All the problems around the data being in that table remain. Just because it's not being selected by default makes it slightly better, but it doesn't actually address the fact that the data is still there. 00:40:54.40 Chris Morrell And I mean, for context, one part of our development process at InterNACHI is we have, 00:40:58.28 Skyler Katz Thank you. 00:41:01.83 Chris Morrell we have a job that runs every day that takes a production database dump, restores it, runs a bunch of sanitization and cleanup on the data, like anonymization and stuff like that, and then dumps that to an S3 bucket that we can then use for local development. 00:41:26.67 Chris Morrell And when we introduced this feature, our database dumps went from five gigabytes to 40 gigabytes overnight. 00:41:41.17 Chris Morrell And that's just a real frustrating experience. So that's another piece of it that I do want to address. Right now, we're just sort of clearing out the contents of those tables, 00:41:56.82 Chris Morrell and we've got that dump size back down to even a little bit smaller than what it was before, but it comes with the trade-off that now we don't have any of that data if we're debugging locally. 00:42:10.89 Bogdan Kharchenko One thing I wanted to add, Chris, that unfortunately we still deal with sometimes, is what you just said about how we don't get that content because we cleared it out. But even with this mechanism, right, we would then have to take an S3 bucket, potentially back it up, 00:42:31.76 Bogdan Kharchenko and sanitize that data too and make sure that it's available for our development environment. So it is another trade-off. Right now, at least we can say, oh yeah, well, maybe we should keep one week's worth of data.
00:42:36.28 Chris Morrell Yeah. 00:42:44.19 Bogdan Kharchenko Right. And it's all kind of contained there, but if we have a place where we're offloading it to, another bucket, we have to clone that bucket as well and do that same kind of sanitization process on the data so that it's available for local debugging and development. 00:43:03.73 Bogdan Kharchenko And that's just another challenge that would... 00:43:06.87 Skyler Katz I mean, we could do something similar to what we do with Stripe, where we have the read-only key for pulling in the Stripe billing stuff, where if we try to write when we're in local, it'll just throw an exception because it's like, 00:43:13.93 Bogdan Kharchenko Hmm. 00:43:23.27 Skyler Katz you're trying to write test data to prod, but you're able to read the prod transaction history or whatever. So we could have a read-only production key to S3. That's like, all right, well, here's the claim check read key. 00:43:37.70 Skyler Katz And in local, it can read from there, but it can't write to it. 00:43:42.85 Chris Morrell But it can't write to it. 00:43:44.36 Bogdan Kharchenko Yeah. 00:43:45.00 Skyler Katz We can't overwrite the production stuff. I mean, I think at least with Stripe, when we're pulling in customer stuff, it tries to find it in our dev instance. And if it can't, it reads from the prod instance. So 00:44:02.53 Chris Morrell Yeah, I mean, the other thing is, if we do this sort of two-phase approach, where first it goes to the database and then it moves to S3, then it's only an issue if you're trying to access data that got moved to S3, which in a lot of cases theoretically wouldn't happen because, like, 00:44:03.66 Skyler Katz yeah. 00:44:06.67 Bogdan Kharchenko Mm-hmm.
00:44:19.22 Bogdan Kharchenko True. 00:44:24.63 Chris Morrell it's content from years ago that you're not likely to need for local debugging, right? So I do like the idea of optionally supporting some sort of alternate read key for the claim checks so that in your local environment, you could set it up so that, yeah, if you don't have it locally, there's a way to access a production claim check locally. 00:44:42.65 Bogdan Kharchenko Thank you. 00:44:52.56 Chris Morrell But I don't even think you would need that most of the time because the data would just be in the database. 00:44:56.39 Skyler Katz I mean, that's special to our instance. I don't know that many people that just use their production database dumps as their local environment. 00:44:59.44 Chris Morrell Sure. 00:45:06.38 Chris Morrell I mean, I think more people than want to admit it do. 00:45:11.64 Skyler Katz It's fair. 00:45:12.49 Bogdan Kharchenko It's true. 00:45:12.53 Chris Morrell I don't know many people who publicly say on the internet that they do it. But, 00:45:18.54 Bogdan Kharchenko Yeah. 00:45:20.12 Chris Morrell yeah. 00:45:21.51 Bogdan Kharchenko It is pretty interesting. I will say, one other thing I noticed the other day: I was on Twitter and I think Tim McDonald posted that they migrated like six billion records in their ClickHouse database, in their staging environment or something like that. 00:45:36.23 Bogdan Kharchenko And I was just like, man, that's a lot of data, right? And I'm not saying let's go use ClickHouse, but I think it's worth maybe investigating 00:45:39.29 Chris Morrell Yeah. 00:45:44.71 Bogdan Kharchenko if some of this type of data could be suitable for some of these column store tables. Because it seems like Nightwatch, for example, itself stores a lot of data and they're constantly sorting it, querying it by time.
00:46:03.17 Bogdan Kharchenko Like, show me this exception, or a list of exceptions by these timeframes. So I don't know if that's a potential solution or potentially another can of worms. 00:46:16.78 Bogdan Kharchenko But I just thought, man, that is a lot of data, having just gone through a minor migration ourselves in comparison. 00:46:27.34 Chris Morrell Yeah. 00:46:27.100 Bogdan Kharchenko Six billion is a lot of rows. And I suspect that they are also quite heavy. I don't know, just some food for thought. 00:46:34.63 Chris Morrell Yeah. I mean, my impression, based on what I know of ClickHouse and similar databases, is that they are not optimized for returning a single record by ID, right? 00:46:53.87 Bogdan Kharchenko Mm-hmm. 00:46:54.26 Chris Morrell They're there for very efficiently doing aggregate queries on data. And so it's kind of a different trade-off. I don't know that that is the right solution. 00:47:08.26 Chris Morrell I do think that there's probably, I mean, what was the one that you brought up the other day? 00:47:13.70 Bogdan Kharchenko Parquet. But I believe, from what I understand, that S3 is built on top of Parquet. 00:47:14.81 Chris Morrell Yeah, yeah, the Apache project. 00:47:22.52 Bogdan Kharchenko And I think, Skyler, what you were referring to as far as that SQL language, I think that is that Parquet, whatever, querying language. 00:47:22.79 Chris Morrell Yeah. 00:47:30.02 Bogdan Kharchenko And I could be totally wrong, obviously. Do your own research. But I've just heard that name in various contexts of dealing with data stores, and it's like S3 and all these R2 buckets just all seem to be wrappers on top of that project. 00:47:43.37 Chris Morrell Yeah, I mean, I wouldn't be surprised.
00:47:49.94 Skyler Katz I mean, well, I was going to change the topic a little bit. 00:47:51.03 Chris Morrell Go ahead, Skyler. 00:47:55.72 Bogdan Kharchenko Let's do it. 00:47:56.36 Chris Morrell Oh, well, all I was going to say is, I wouldn't be surprised if S3 is their own custom thing, but that, if we wanted to run our own version of this database, we could. But I am happy to just use S3 and not run our own bespoke database that's for these large chunks of content. You know what I mean? 00:48:26.37 Skyler Katz I mean, this would not be helpful for Verbs events, but the things that we're running into are that we're storing the base64-encoded signature doodle canvas drawings. 00:48:44.96 Skyler Katz And maybe we should be storing those on S3 later. 00:48:49.53 Bogdan Kharchenko Immediately. 00:48:49.65 Skyler Katz As PNGs to begin with. 00:48:49.73 Chris Morrell Right. Yeah. 00:48:50.77 Bogdan Kharchenko Yeah. 00:48:51.65 Skyler Katz And then these documents that are HTML in nature, maybe they also just should have been stored in S3 as HTML and loaded, 00:48:51.83 Chris Morrell yeah 00:49:07.61 Skyler Katz loaded client side, or still just pulled in with file_get_contents. Like, maybe for these particular use cases the database isn't actually the right place for them. 00:49:21.10 Chris Morrell Yeah. Right. 00:49:21.59 Skyler Katz It doesn't solve the Verbs events thing, other than we then wouldn't be storing the document in Verbs events. And in these tables, we would store a reference to the document in the Verbs event because we'd have to push it to S3 first, but then we lose versioning of these documents unless you turned on versioning in S3. 00:49:42.76 Chris Morrell Right. 00:49:46.35 Skyler Katz So, 00:49:48.93 Skyler Katz so 00:49:49.86 Chris Morrell Right. Yeah.
00:49:51.47 Bogdan Kharchenko Thank you. 00:49:52.99 Chris Morrell Yeah, I mean, I think that... 00:49:59.41 Chris Morrell I think that ultimately that is probably true. I mean, I think the reason that we decided to store these base64-encoded PNGs the way we are is because we are fundamentally interacting with them as base64 data that gets used in the JavaScript component instead of loaded as a file. 00:50:29.90 Chris Morrell So I think that's why the approach was the way that it was. But in hindsight, I think there's a good argument for, yeah, these five or eight tables that have these columns that just have a ton of data in them, they should probably all be offloaded from the database in some way. 00:50:54.60 Chris Morrell Or at the very least offloaded into their own table in the database, or tables, maybe one per. I don't know exactly what the solution in each case is, but yeah, something that mostly looks like a file should probably be stored in a place that's mostly made for things that look like files, you know? 00:51:07.06 Skyler Katz Yeah. 00:51:21.58 Bogdan Kharchenko Yeah, I agree. I always hear sometimes on the internet, oh yeah, we have like 20 megabytes of images in our table. And I'm like, why would you ever do that? 00:51:34.63 Bogdan Kharchenko And then, just looking back at what I did last week or two years ago or whatever, we do the same thing. It's just that the scale hadn't caught up to us. And now it has, I suppose. 00:51:42.87 Chris Morrell Yeah. 00:51:44.49 Bogdan Kharchenko And we're kind of like, oh yeah, of course. Why would we ever do that? Yeah. 00:51:48.55 Skyler Katz And then we're just like, oh, another program that needs these canvas signatures.
00:51:51.64 Bogdan Kharchenko Yep. 00:51:51.87 Skyler Katz Let's just create another table so that each table doesn't look too big. 00:51:53.33 Bogdan Kharchenko Ship it, ship it. Yeah. 00:51:56.85 Chris Morrell Yeah. I mean, sometimes those are just the decisions that you have to make in the moment. And ultimately, I mean, a lot of these things are not really problems. 00:52:07.01 Bogdan Kharchenko It's true. 00:52:08.06 Skyler Katz They're only problems when I open TablePlus and I'm trying to look at something. 00:52:11.75 Bogdan Kharchenko Well, that's because they're using TablePlus. 00:52:11.94 Chris Morrell Yes. 00:52:13.92 Skyler Katz Querious is also just slow and spins on this. 00:52:14.43 Bogdan Kharchenko Yeah. Yeah. Yeah. 00:52:17.97 Chris Morrell No. Yeah. They only became... This is the first time we hit a real problem, which was when we were doing these big queries where we're processing thousands of records and we're loading an entire document's worth of content in each record. 00:52:40.90 Chris Morrell And we were able to address that, right? All of this is mostly manageable through just the solutions that we already have, but I do think that there is a better option, where we solve this once and then we just have a solution to this type of problem. And arguably, we're at the point where, okay, we've hit this enough times. 00:53:10.20 Chris Morrell And like I said, it comes up often enough in the Verbs Discord, this question of, what do I do with an event like an avatar changed event? 00:53:23.69 Chris Morrell You know, what do I do with that? And that's even easier because it's not that much data. But a video 00:53:35.33 Chris Morrell queued for processing event, where you've got like a 30 gig 4K video, right?
00:53:35.38 Bogdan Kharchenko Yeah. 00:53:43.70 Chris Morrell That you're trying to add to some data pipeline, but you want to run it through an event-sourced system. Like, Verbs should have a solution to that problem. We're actively having this problem in Verbs ourselves. 00:54:00.03 Chris Morrell We've had to think creatively about how to deal with events and files in the past already. And this is a case where we're dealing with it even more. 00:54:11.96 Chris Morrell So it just feels like, okay, we've hit this point where a better solution is warranted. And I like this sort of two-phase claim check concept where it goes to the database first and then gets pushed off to S3 as appropriate. 00:54:33.20 Chris Morrell The thing that I'm not certain about is, is it better to hide that whole process behind a cast? Or do you just have a cast that returns a claim check object and make the consuming code be, quote unquote, claim check aware, right? Where you get a claim check and you have to explicitly exchange it, so that you understand in the consuming code what's happening? 00:55:02.54 Chris Morrell Or do we do that transparently? 00:55:04.42 Skyler Katz I think this is where I was referencing a custom relationship, because if we had a custom has-one relationship, in the relationship logic we can say, well, is the attribute in the database column or is it a claim check? 00:55:14.94 Bogdan Kharchenko I, the 00:55:21.84 Skyler Katz And if it's a claim check, then fetch it from S3 and just return the content. Yeah. 00:55:29.90 Chris Morrell Mm-hmm. 00:55:30.37 Skyler Katz Like, in the custom relationship when it's doing its matching. 00:55:35.63 Chris Morrell Well, would you do that?
Or you could also potentially do something where, in your custom relationship, it tries to load all the data, and then if it's not there, it actually inserts the data back into the database. Yeah. 00:55:56.65 Skyler Katz Yeah, I mean, I think that is also something that you could do, because then it's accessed again. 00:55:56.99 Bogdan Kharchenko And 00:56:01.76 Skyler Katz So it would insert it back in and it would go back into the hot path. 00:56:02.01 Chris Morrell Right, so then it goes back into the hot path and eventually gets moved off. Yeah. 00:56:06.98 Skyler Katz And it all just happens transparently. 00:56:08.94 Chris Morrell Yeah, that's interesting. 00:56:12.72 Skyler Katz I feel like all these file system calls have to get wrapped in retries and all sorts of stuff that... 00:56:19.66 Chris Morrell Yes. 00:56:20.34 Skyler Katz like 00:56:21.33 Chris Morrell Yeah. 00:56:22.37 Skyler Katz S3 is notoriously, well, all file systems are notoriously just flaky in, 00:56:27.89 Chris Morrell Right. 00:56:29.39 Bogdan Kharchenko This is the other unfortunate downside with some of this. 00:56:29.71 Chris Morrell Right. 00:56:32.23 Bogdan Kharchenko When you insert into a database, you almost have a guarantee it's there, right? A very high chance, especially if you do a transaction. But when you're pushing stuff into S3, now it's going over the wire, and even if it's located in the same data center, things happen. 00:56:42.62 Chris Morrell Yeah. 00:56:51.49 Bogdan Kharchenko So that's certainly something to be aware of. 00:56:55.100 Chris Morrell Yeah. And I do think that's another argument for making this an open source package, because I'm certain that there are other people out there who have the same problem. 00:57:07.65 Chris Morrell Like, we could collectively improve the resilience of that.
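[Editor's sketch] Two ideas from the exchange above fit together: when a cold record is accessed, pull its content back from S3 and re-insert it so it returns to the hot path, and wrap the storage call in retries because the fetch goes over the wire. The Python below is a hedged illustration, not the hosts' code. TransientStorageError, with_retries, fetch_content, the row fields, and the bucket_get callable are hypothetical names, and the linear backoff is an assumed policy chosen for brevity.

```python
import time


class TransientStorageError(Exception):
    """Hypothetical stand-in for a network or storage hiccup."""


def with_retries(operation, attempts=3, base_delay=0.0):
    """Call operation(), retrying on TransientStorageError.

    Sleeps base_delay * attempt between tries (linear backoff) and
    re-raises the error once the final attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientStorageError:
            if attempt == attempts:
                raise
            time.sleep(base_delay * attempt)


def fetch_content(row, bucket_get, now):
    """Return a row's content, re-warming it if it had been offloaded.

    row:        {"content", "path", "last_accessed"} (hypothetical schema)
    bucket_get: callable taking a key and returning the stored content
    """
    if row["content"] is None:
        # Cold: fetch from the bucket and put it back in the database row.
        row["content"] = with_retries(lambda: bucket_get(row["path"]))
    row["last_accessed"] = now  # back on the hot path until it ages out again
    return row["content"]
```

Updating last_accessed on every read is what lets a later archive pass treat the record as warm again and move it off only after it has gone unused for a while.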
00:57:09.37 Bogdan Kharchenko Thank you. 00:57:12.59 Chris Morrell To make it less likely. But it is true. I mean, fundamentally, writing to a database or writing to S3, there's a chance for either of those to go wrong. 00:57:24.83 Chris Morrell But yeah, with transactions and, like, 00:57:25.60 Skyler Katz them. 00:57:28.97 Chris Morrell all of the stuff that we have built around interacting with the database in Laravel, you have certain guarantees, and we would have to make sure we were getting the same assurance with another solution. 00:57:46.26 Skyler Katz Yeah. 00:57:47.68 Chris Morrell Does S3 do any, like, is there any way to do checksums or fingerprinting on S3, or some sort of data verification? 00:57:47.72 Bogdan Kharchenko I like it. 00:57:56.28 Skyler Katz You can't, well, you can pass tags, like, you can pass keys, metadata, in. 00:58:03.49 Chris Morrell Hmm. 00:58:03.98 Skyler Katz And so you can then get that metadata back out when you're fetching an object. I don't know, with Laravel, just with the file system adapter, if you're going to get all of that information 00:58:17.26 Chris Morrell Mm-hmm. Mm-hmm. 00:58:17.84 Skyler Katz in one go. Well, and I actually am not sure that you can get it in one go. I'm pretty sure it's a get metadata call, even with the SDK itself. 00:58:29.51 Bogdan Kharchenko Yeah, but if it's a background job, there's not that much load, right? If it's just offloading and you're just making sure that the thing was stored and the checksum matches, I guess on the retrieval, that could be an additional call. 00:58:44.72 Chris Morrell I mean, if we wanted to go crazy, we could always write it to S3 and then read it from S3 before we delete the data from the database. Like, it's running in a background process. 00:58:58.92 Bogdan Kharchenko Right.
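[Editor's sketch] The "go crazy" option Chris describes (write to S3, read it back, and only then delete the data from the database) can be combined with the checksum question he raised. The Python below is a minimal sketch under assumptions: archive_verified and the row fields are hypothetical, SHA-256 is an arbitrary choice of digest, and a dict stands in for the bucket. It does not use or describe any S3-side checksum feature.

```python
import hashlib


def archive_verified(row, bucket):
    """Copy a row's content to the bucket; clear it only after a verified read-back.

    row:    {"id", "content", "path", "checksum"} (hypothetical schema)
    bucket: dict-like object standing in for an S3 bucket
    Returns True if the content was archived and removed from the row,
    False if verification failed (the row is then left untouched).
    """
    data = row["content"].encode("utf-8")
    expected = hashlib.sha256(data).hexdigest()
    key = f"claim-checks/{row['id']}"

    bucket[key] = data        # 1. write to cold storage
    stored = bucket.get(key)  # 2. read it back (the deliberately inefficient step)

    # 3. compare digests before touching the database copy
    if stored is None or hashlib.sha256(stored).hexdigest() != expected:
        return False

    row["checksum"] = expected  # keep the digest for checks on later retrieval
    row["path"] = key
    row["content"] = None       # 4. only now drop the database copy
    return True
```

Since this runs in a background process, the extra read is a cost paid off the request path, which matches the trade-off discussed in the conversation.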
00:58:59.02 Chris Morrell So we could choose to be inefficient there for data integrity considerations. 00:58:59.68 Bogdan Kharchenko Thank you. 00:59:06.82 Chris Morrell And there's also building in encryption at rest. That would be relatively straightforward to do in a system like this. So that would be really nice. 00:59:20.51 Chris Morrell Yeah. I don't know. I like that as an approach. 00:59:25.19 Skyler Katz Yeah, it seems interesting. And I mean, in our case of dumping out the database, having the claim check table that has a handful of days of big content, and then the rest of the records are all just paths, keeps the whole database size smaller. 00:59:50.93 Chris Morrell Yeah. 00:59:52.34 Chris Morrell Well, and ironically, we've now solved the problem enough for ourselves that we don't necessarily need to go and implement this solution now. 00:59:59.77 Bogdan Kharchenko Thank you. 01:00:01.51 Chris Morrell But I like that we came to a good place with it. It feels like a good solution. 01:00:09.80 Skyler Katz Chris, you have any train trips coming up? You're just going train-induced development of a new package? 01:00:17.58 Chris Morrell Yeah, no long train trips on the horizon, unfortunately. 01:00:22.76 Bogdan Kharchenko Me and Skyler are going to sponsor you and send you off to, I don't know, Alaska on the train or something. 01:00:28.35 Chris Morrell Oh, God. 01:00:28.62 Skyler Katz Yeah. 01:00:30.02 Chris Morrell What a nightmare. 01:00:30.98 Skyler Katz Well, you know, Bogdan's going to be gone at the convention this week. And so, you know, we're not doing anything. We're just 01:00:36.50 Chris Morrell God. 01:00:39.83 Skyler Katz Bogdan's working hard and then we'll just work on claim check. 01:00:39.85 Chris Morrell Oh, God. We'll just noodle around.
01:00:42.99 Bogdan Kharchenko Yeah, let's do 01:00:44.40 Chris Morrell Yeah, there you go. 01:00:44.70 Bogdan Kharchenko Yeah, I like it. 01:00:46.22 Chris Morrell Oh, man. All right. Well, yeah, I like it. 01:00:47.49 Bogdan Kharchenko Awesome, I think this was cool. I think we came to a good conclusion. And aside from coming to a conclusion, I feel like all of us have recognized all the trade-offs, right? 01:00:59.38 Bogdan Kharchenko Because all of this stuff has trade-offs, whether you're inserting directly in the table, offloading somewhere, and 01:01:00.03 Chris Morrell Yeah. 01:01:06.44 Bogdan Kharchenko so on and so forth. So I don't know. For me, I feel pretty good about seeing the reality of what this would potentially look like if we go about it. 01:01:17.33 Chris Morrell Yeah. 01:01:19.73 Chris Morrell I mean, and also just to be clear, Claude, and I quote, says, now, where did I put that? It's a genuine innovation opportunity right here that we've got. 01:01:35.71 Skyler Katz Yeah. 01:01:35.83 Bogdan Kharchenko You're absolutely right, Chris. 01:01:39.99 Bogdan Kharchenko Yes, do it. 01:01:40.48 Chris Morrell Oh, God. 01:01:42.77 Skyler Katz Yeah, I'm sure Claude could just write all this code in, you know, an afternoon. 01:01:43.06 Chris Morrell All right. 01:01:46.24 Chris Morrell Oh, my God. 01:01:47.71 Skyler Katz It'd be fine. Just look, Verbs 1.0 hasn't been tagged yet. 01:01:49.02 Chris Morrell Yeah, easy peasy. 01:01:54.94 Skyler Katz And, you know, you have an opportunity here to get claim check in. 01:02:00.60 Chris Morrell Just sic Claude on it. 01:02:02.21 Skyler Katz Exactly. 01:02:02.33 Chris Morrell Let's introduce a poorly written, buggy version of claim check. 01:02:02.76 Bogdan Kharchenko Yeah. 01:02:09.56 Chris Morrell All right.
01:02:10.84 Bogdan Kharchenko V zero. Yeah, so I just had my Upstate PHP meetup in Greenville, South Carolina, October 9th. 01:02:11.15 Chris Morrell Well, before we stop, Bogdan, you had your meetup recently. How was that? 01:02:27.72 Bogdan Kharchenko So we're recording, I believe it's October 14th. 01:02:30.98 Chris Morrell It's the 14th. Yeah. 01:02:32.11 Bogdan Kharchenko Yep, and it was awesome, man. We had five speakers scheduled to talk, including me. I was going to do a presentation and I did. One of the speakers, unfortunately, called out sick, but we still had a pretty jam-packed event and we had over 20 people show up. 01:02:48.53 Bogdan Kharchenko And I don't know, I love doing these meetups and I like talking to people. I mentioned this to you guys before, I'm always surprised how many new faces I see. 01:03:01.09 Bogdan Kharchenko People are just willing to connect. And I know not everybody is able to come out to all these things regularly. But as far as the new guys, I mean, they just keep showing up and it's super awesome. And then after some time, they tend to be regulars. So I don't know. I'm excited about the fact that I started this, and I'm excited that people are still coming and sending me text messages the next day and saying, dude, I had such a great time, keep it going. 01:03:30.39 Bogdan Kharchenko So it was a really great event. 01:03:33.05 Chris Morrell And then there's PHPX Atlanta now, right? You went down for that? 01:03:37.65 Bogdan Kharchenko Yeah, so Jace runs PHPX Atlanta, and I went down there maybe three weeks ago, and he is also trying hard to organize and find a permanent venue spot. I think that's one of the challenges Jace is having right now: he has to swap to different places.
01:03:57.37 Bogdan Kharchenko Um, but you know, he also has a pretty good turnout, and people are getting together. There's a bunch of old-timers, it seems like, in the Atlanta PHP meetup, you know, people who have been attending the Atlanta PHP meetup for like 15 years, 20 years. They have a pretty long history of people showing up. But yeah, I mean, it's super awesome to, you know, go down there and hang out with them as well. 01:04:25.55 Bogdan Kharchenko And, you know, it's relatively close to me. And I told Jace, I'm going to come out to all the events that you have because, you know, I just want to support him, make sure that, you know, 01:04:36.77 Bogdan Kharchenko Because, you know, when you show up, that's actual support. You know what I mean? And I feel like a lot of people get hung up on sponsorships or all this other stuff. 01:04:40.81 Chris Morrell Yeah. Yeah. 01:04:46.71 Bogdan Kharchenko But if nobody shows up and you have a fully sponsored event, you know, it sucks. You know what I mean? 01:04:54.94 Chris Morrell Yeah. 01:04:55.16 Bogdan Kharchenko But if you have people showing up, even if you're losing money on it, I feel like it's a good time. So yeah, I'm excited to visit Atlanta for all the meetups there too. 01:05:07.16 Chris Morrell And Skyler, do we have another PHPX St. Louis on the horizon or is that slowed down for a little while? 01:05:13.32 Skyler Katz Ah, it slowed down for the summer and then, ah, Mark Binder or my co-organizer was just moving, but we've been texting about the next meetup. 01:05:24.48 Skyler Katz I think we're going to do, like, a November happy hour meet and greet to hang out and regroup, see where people want to travel. 01:05:29.74 Chris Morrell Nice. 01:05:37.09 Skyler Katz The St. Louis region is quite large and a lot of people are coming from all over the region. 
And so we're trying to find a good central location to meet up at. 01:05:47.66 Chris Morrell Yeah, I came back from Laracon all energized, ready to do another meetup, and then life got in the way and I still haven't scheduled another PHPX Philly. 01:05:58.51 Chris Morrell I went up to New York for the last PHPX NYC event and that was really fun. 01:05:58.80 Bogdan Kharchenko Well 01:06:06.53 Chris Morrell But I'm hoping to get another PHP Philly started soon. 01:06:11.87 Bogdan Kharchenko Yeah, I mean, I was gonna say, first off, I wanna come to the next PHP event sometime. 01:06:12.18 Chris Morrell What were you going to say, Bogdan? 01:06:19.46 Bogdan Kharchenko And I was actually talking to Steven Fox. He was visiting the meetup I just had and we were kind of low-key talking about crashing the PHP NYC meetup one of these days, just gonna show up unannounced. 01:06:28.100 Chris Morrell That'd 01:06:31.29 Bogdan Kharchenko So watch out. 01:06:32.06 Chris Morrell be awesome. 01:06:33.09 Bogdan Kharchenko Um, but, yeah, I don't know, Chris, if you're the only one pushing the meetup, but you need to find a co-host who can do the scheduling for you, you know, and then you just have to show up, or do some of that legwork. Because I know it's hard to get motivated to schedule, to put something on a calendar. 01:06:53.97 Chris Morrell Yeah, we've got a couple people who are involved that I probably should invite to be a little bit more involved in that piece, since it's such a challenge for me. 01:07:04.82 Chris Morrell Um, and I'm actually, you know, talking with Matt Stauffer. He wants to come up and join for one of the meetups, maybe do a talk. And Ian Landsman is planning to come down at some point for one of the events too. 01:07:16.01 Bogdan Kharchenko Thank you. 
01:07:17.95 Chris Morrell So we've got a couple of, like, bigger names in the industry who folks would want to come see that I'd love to have do something at one of our next events. So, 01:07:30.25 Chris Morrell If you're in the Philly area, you know, keep an eye out. And just in general, if you're listening to this an hour plus in, you will definitely enjoy going to a local meetup. 01:07:48.80 Chris Morrell So if you're not already a part of one, you know, take a look: phpx.world, uh, there's laravel.com/meetups, I think it is, um, there are a couple others that I can't think of off the top of my head that have a good list of events. And, uh, yeah, I mean, like Bogdan said, if you can show up for one of these events, it means a lot to the organizers and it's going to be fantastic for you. But, um, I don't know. 01:08:20.32 Chris Morrell It's just been so much fun to be involved in these events, and I love it. Yeah. 01:08:31.47 Bogdan Kharchenko Yeah, same. It's a rush, honestly. I think after the meetup, I couldn't fall asleep for like two hours because I was just pumped. But, you know, I just know a lot of people had that same energy. 01:08:40.26 Chris Morrell Yeah. 01:08:44.15 Bogdan Kharchenko You know, they were just, you know, chit-chatting and getting to know each other. But... Yeah, I highly encourage anybody to, even if it's not a PHP meetup, or JavaScript, whatever you're into, just go do it. I think the in-person energy is worthwhile. 01:09:02.79 Skyler Katz Yes, they're tons of fun. 01:09:04.83 Chris Morrell Yeah. All right. Well, with that, this has been fun. I'm glad that once again, we have just solved the problem. Case closed. 01:09:16.04 Chris Morrell No further action needed. And we could just move on. 01:09:19.100 Skyler Katz Yeah. Just feed this transcript to Claude. Let's go. 01:09:25.41 Chris Morrell There you go. 
01:09:26.37 Bogdan Kharchenko All right, Chris. See you later. 01:09:26.70 Chris Morrell Ah, all right. Thanks, guys. 01:09:29.57 Skyler Katz See ya.
