
·S2 E20
20 - Wingman, your AI Control Tower
Episode Transcript
All right, I'm back.
You.
What's?
You sound different.
Your voice is deep.
It's changed again.
For anybody not oh Faye with the the good stuff history like Andy was feeling a little bit high pitched, bit high pitched and I had to look at the car park, although that is a comfier seat.
Oh, you had you switched up the seating.
Arrangement.
That was Oh, no, I'm sorry to hear that.
Should have moved the sprites around on the screen.
Yeah, really.
Oh, that's interesting.
I've never noticed but they actually sit in the right they.
Actually sit in the correct spots, yeah.
Interesting.
Well, it was good.
It was a good pod in my absence.
Very good.
Yeah, Beth Ann did a very good job.
It was entertaining, very interesting.
I thought so.
Yeah.
So I'm glad you guys were.
It was actually really fortuitous the way that worked out because I think we talked about doing an episode with her for her book launch, which was on Saturday.
Yeah.
And I just for a book launch happened to get the man flew the week before.
Yeah.
It's almost too well timed.
I was a special person.
I got taken out from the pod this last week just to to get that episode in.
No, but it was good.
Worked out perfectly well.
Welcome back.
Thank you.
Glad you're feeling a little bit.
Better indeed.
That's good.
This is the big episode 20, big episode 20, The big one.
Almost the big one.
Like the real big one.
The real big one's probably next week.
This is probably our last in person beach pod for a little while as.
Well, yeah, yeah, I'm going to be away.
Oh, for mystery, mystery reasons.
Mystery reasons.
Anybody that's listened to the part probably knows.
It's not that much of A mystery, is it?
But there there comes a time in every man's life where you've got to disappear to the middle of the Atlantic with a bunch of friends and try and fix the Internet.
That's it.
Yep.
And I'm, I think I've reached that, that point in my journey.
So we'll see how it goes.
I think it's going to be.
Came to see what comes out of it, yeah.
Yeah, same, same.
It's going to be cool.
I mean, that's why I've been sort of going hard on like Wingman this last, yeah, certainly this last week or so, like, but maybe week and a half, two weeks where it it suddenly came together and became the most useful thing in my life.
I mean, it's amazing.
Give me the like the round up on this as well, because I mean, we haven't even really like touched base because of because of said man flu.
It's kept me out of action, but like what?
What's the update?
What's been happening with with Wing Man?
So I guess if we rewind ourselves maybe 4 weeks ago, maybe it was maybe a bit longer.
I'd popped off after a podcast to go and buy myself a like a Mac mini computer.
And I had this like intuition, like probably the best way to run AI stuff is it's not on your current computer and it's not like with a SAS environment.
It's, it's probably to give the computer, like to give the AI the whole computer and just like let it have its own environment in which it works.
And so that was the, the reasoning behind buying that.
It's a bit of a test.
And having its own environment is like purely from a privacy perspective or you.
Install it on this computer and it has access to everything on that computer and it can run whatever it wants in the terminal.
So it can make stuff like there's almost like it gets to the point I can't even like hide stuff probably on this computer because you can always like even if I don't let you run all these commands, you can just write, you can run programs.
So you write a program that then extracts the stuff that's in there.
Like it's just like, all right, this just has this is yours now.
And it's, it's like when you hire an employee and you go, well, here's your computer on which you will work.
Yeah.
And I, I think like taking that same paradigm into, into AI and being like, right.
OK, Goose, this is your computer.
Like I'm going to just, I'm going to set some stuff up so I can see what's going on.
And then but beyond that, like it's yours.
And I will assume that anything I put on here is compromised, so I won't have any of my personal stuff on it.
Yeah.
And so it kind of works a bit like this.
And so, so on that computer, I have two chief agents that I use, which are Goose and Claude Code.
And the main problem with this setup is that this is it's like a little Mac mini.
So I can't take it anywhere.
So it's like, all right, well, I can sit at this computer and do coding stuff if I want to and I can run stuff and that's cool.
But if I, you know, if I get up at the office and walk into the kitchen and it's stuck, like it's, I don't know anything about it until I go back in or it, it just like I can't go anywhere.
If I'm going like going to drive somewhere, pick up the kids, Yeah, it changes.
I don't know what's going on.
Like, and there's this time it's like this doesn't feel like how it should be like, because a person could just let me know if they're stuck or it could, you know, get on the phone.
I could give it some direction.
Things would keep going like, so it's like, all right, well, how could I, how could I run the stuff on here so that instead of like clawed code running in a way that I have to sit at this particular computer, like can I run it in a way that I can access it from anywhere?
And I was experimenting with, you know, various ways of using like SSH tunnels to manage it from other places and things.
And it was it was kind of working and I resolved around using like this T Max terminal emulation type system to to run multiple sessions at once that I can then attach to from different computers.
All right, that works really well.
And then so that was like, this is kind of like my start point with all, yeah, this is all right, That's cool.
And I thought, but I don't really like, it's a bit of a pain doing it this way.
It'd be great if I could just, you know, run the agents and then have but have like a nice way of talking to them.
And I think I'd said a while ago, like I was really, but I get a really good vibe of Goose, but I was struggling to to get it good enough or like in a form factor that I would it would replace just general use of say cord desktop form.
Yeah, that's that's gone.
I haven't by the only reason I use like cord and ChatGPT now is if I just want to ask a really quick question.
Yeah, because I respond fast, but I don't.
And even then I often just Google it in the because Googling in the address bar just pulls up the, the Google response usually, which is based on a lot of the search data that it gets back.
It's like, you know, it's actually pretty good for what I need from a search perspective.
So, yeah.
But so then I, what I did is I built, I had, I had a, again, another intuition.
I was like, look, the the goose like desktop agent.
I can't do much with.
Yeah.
And trying to like run that through like a VNC tunnel or remote desktop.
This doesn't it just feels crap like the response rates on that bad.
I just want to turn turn this into a website so I can just host it at home and then hit it from anywhere.
And then I so I did that is like I took, I think I tried this originally with called code and I just couldn't get the wrapper to work properly.
But it worked very nicely with Goose that I was able to wrap the the Goose CLI inside, you know like a little node JS control plane type thing which will allow me to just spin up multiple versions of the Goose CLI in a different process thread.
Yeah.
So I could have 123456 versions of Goose running at any one time.
I could specify different like they have a thing inside Goose called recipes and your recipe is your system prompt plus tools, plus instructions like what you should be doing.
And so I so I started to use those and I was like, OK, well, this is, this is kind of cool.
And I built like a whole web infrastructure for how you would manage and then monitor what these agents were doing.
So, so wing man is kind of like your central control system or operating system for multiple agents that are running on a remote computer.
And it is fucking amazing.
It's like, it's like what I've always want is I'm not doing any like down and dirty.
Like I know building of agent systems with this one anymore.
Like, that's obviously a lot of that stuff exists in Everest, but it just didn't seem like it was going to be worthwhile to do that when we've got good versions of agents that I can just pick up and use with Goose.
So, yeah.
So effectively how Wingman would work now is I can go on Wingman and I've got my recipe manager and I can build as many recipes as I want.
I can reuse the MCP servers across different ones.
So I specify it once, use many I can set up.
Yeah, So I have all these different sort of agents in there and I can set up a new session.
Like when I start a new session, it doesn't close the old one.
So if you think about like saying ChatGPT, if you go from one chat to the other chat, your previous chat stops.
Yeah, that doesn't happen in Wingman.
These are all like running systems, so every one of them exists in its own thread.
So I can go like, OK, we'll do this, like create this plan around this one.
And then I just Click to the next session.
I'm like, OK, now I want you to review this PR request that I've got and update the comments on GitHub.
And they're like, you do this and I just issue commands.
And then at any point from again, because everything runs on the Mac mini, I don't have a good way of explaining like how just fucking nice this is.
But because it's always all running on this server that just like sits in the office.
If I'm at my computer, I can see exactly what's going on.
If I use my phone, I pull it up, I can see exactly what's going on and like where it's at.
And then the output of all the stuff instead of dicking around trying to build like a really good artifact system, like, oh, it's done.
I was like, well, I don't want like outputs in a chat.
A chat is for having a chat.
It's for issuing orders, checking in on stuff, getting updates.
Like it's not for where we display the content.
So so I just like attached like an Obsidian MCP server and said like here's again, like here's this is a separate vault sync.
So all my devices you can, you have full access to everything that's in here.
And so it'll write like all of its project documentation, like all the plans.
I just really nicely formatted Obsidian like Markdown docs.
I can like it's all structured nicely.
I know exactly what's in what product, what's what feature like where it is in the backlog versus active States and all this.
And it's all written and controlled by by Wingman.
But then I can get at it from anywhere.
So it's on my laptop, it's on my mobile.
On my mobile I was showing you the other day, I installed a piece of software called Allowed, which is an Obsidian plugin.
So it will just read them out to me.
So that was nice.
If I'm out and about like I was doing it this morning, I was just like wandering along the beach and I was, I had it.
I've installed a group brain developer.
I'll go into that in a minute.
And I had it reviewing my road map for Wingman and turning it into big shiny rocks and stuff.
Like it was just having it read out to me while I was wandering along the beach.
And I'll stop and like, dictate my thoughts on the answer and it carries on, creates a new version.
It's just, yeah, it's, it's a really natural way of working.
And like, it's like you use the chat for the chat and you use your document systems for your documents and everything suddenly makes sense.
Like it's the right separation of like mental concerns.
And then the bit that was that I wasn't really getting great responses from in in Goose was probably around like the execution environment.
So cloud code is just given a good plan.
Yeah, I just don't think there's another coding agent.
I haven't, granted I haven't actually tested the codex one recently or cursor so who knows, but it is just great.
No, it's like.
It's better than Casa.
Reminder and when you give it like a really good plan, yeah, I mean root code is great, but because it runs in a browser environment, it's not much good to me.
Yeah, inside Wingman.
So but I'm finding these days that code code and sub agents and stuff just just works.
Yeah, I feel like code, though, benefits from adding to context as the code base grows, which is the thing we were kind of talking about yesterday a little bit.
Because I would kept finding myself hitting a loop, yeah, where it would just be trying to like fix a bug with like an ad hoc set of fixes that would cause another problem.
And then that'd fix that problem.
And then it would recreate the original problem.
And it was just going over and over.
And yeah, I think you'd suggested just going back and getting it to create a fixed document.
And so I did that, but I actually just like went back into the PRDS and just went like, actually this has progressed a lot since that original PRD.
Let's spend a bit of time just on the vision documents again and work out how well this aligns to the changes that we've made along the way.
So I started doing effectively audit documents prior to a PRD and because like in particularly in wing man, there's a few areas where I need to do like a big refactor into a different way of working.
I was like, OK, well that that works to get it does the job.
Yeah, this isn't how it should run.
It should run like this.
And so I so I set it the task of what's that goose the tasks out like?
Well, I want you to go through the code base, go through like all the documentation and I want you to audit exactly how this works right now.
Yeah, with a view that these are the things that I think are wrong with it.
And then that's the world that we should be going to.
And it came out of it like a just, it generally gave you a really good audit of what was going on and how it worked that would allow you to then write APRD against that.
Yeah.
And that was that solved most of my.
Problems.
Do you use a sub agent to do that or do you just use?
Them I I'm, I don't even like give.
I don't even set any.
I haven't.
Or rules or anything like that.
I just because I do so I think because I do so much plan, I, I have custom like agents and stuff to do the planning.
Yeah, I think because the plan is detailed enough like all the, the context that the like the execution agent might need is already in the plan anyway.
Like it's just I don't know if I really need to spend a lot of time with.
That was my thinking too.
It felt a bit like unnecessary.
It didn't superfluous to do, yeah.
Yeah, there might be some stuff, you know, maybe I should have a doc clawed rules or something that just says that, you know, just do it right.
Like it's just, but Oh yeah.
So, so where was I going with Wingman?
Is I was like, OK, cloud code is still the way to do it, but what I don't want to do is be stuck in my computer.
Yeah.
So instead I've got inside Wingman, we've got like a almost like a deep dive.
It's called the deep dive.
All right It's a terminal emulator that because again, because I'm connecting to like a web server on the machine natively, I can then behind.
And again, this isn't exposed to the Internet or like this is all on my own, like private overlay networks and stuff, which he yeah, don't, don't you?
You don't want to like Yolo this stuff necessarily into into the Internet is that I can connect to that and then I can use that site to run a terminal emulator on the machine to connect to all the like available teamwork sessions.
So we also have like a little Wingman CLI that runs inside Wingman that just provides like a very like lightweight terminal user interface.
Nice to like.
So it goes like, OK, these are all the sessions that are running right now.
His are all their names.
Like you can create new ones, delete them, great.
OK, click into that one bang.
And then I have called code running, but it can render like called code beautifully onto my mobile.
Oh, nice.
And then I've because again, I've got control of this whole thing.
I was like, OK, well, OK, the bits I can't do are all of these like overlay like these key binding commands for teamworks aren't working.
So I need something to do that.
So I wrote, I've got a little drop down menu and it gives me all the common commands I would want for resizing stuff and bringing in new Windows and closing stuff and zooming.
I was like, this is this is like better than any terminal app that I've found on on my on my machine.
So now it's become my like default way of remote accessing the machine at home to do anything to do this.
And I, I sometimes forget because I'll go in and I'll go right, yeah, restart Wingman.
So why does Wingman not work?
And so, Oh yeah, I just turned it off using it.
All right, OK, I'll use one of the other sessions, but it's just, it's, it's nuts.
Yeah.
So then I use.
So what I'll do is I'll do all the planning inside.
So pretty similar to what we did previously, slow code, same methodology, but different tooling.
Now where all the stuff happens inside Goose to do the planning.
And then I'll go into deep dive, fire up the term like fire up a clock code session and just inside the particular work tree.
So I've got a bit more automation about how I create like the, the projects and the work trees and things like that for Git just to make it easier to have lots of different feature branches going on and stuff like that.
And I also have like a goose agent that will rebase and merge like all of my PRS and stuff.
So it's like, OK, if I, I just don't want, I don't like rebasing stuff and dealing with the conflicts.
It's just like, all right, you do that.
Like it's pretty obvious what's going on and it it just, it just runs it all.
And I just run code.
Like when we was carrying a coffee earlier.
Yeah, I was, that's what I was just going quickly typing always like I'll just go and do this and I just give it a command.
It goes and does it.
It's like.
Very human to be honest.
It's like awesome, I can do all of it from the phone, which was like, yeah, so like what the first like Phase Zero version of Wing Man is just can I do everything that I need to from the phone and I can I've done like full features everything like.
And again, because I've got Obsidian on the phone, I can copy an Obsidian like link on my phone and then just paste it into the terminal in wing man for called code and say implement this and then soon.
So I also have AI built.
Yeah.
So then like so that that was like phase one, like it's the next thing I got was like, look, what I should be able to do is instead of me starting a wingman session manually, I should be able to just create an API endpoint.
I think I'm just sending you like the screenshots of this last week where it's like, OK, now I can just hit this endpoint and it creates A wingman session with loads of particular recipe.
So it has access to like particular data sources and things and then injects the first prompt to kick off a particular session.
Like it's like, OK, cool, so I can have this just run.
And as soon as I've got that, it's like, well, it's a piece of piss to now have a scheduler that just goes well at this time call this kill command.
Like, you know, just stick that in a Chrome.
You need to really write anything for it.
So now I've got scheduling and I've got like automated stuff or I've got like any other program, as long as I do the networking, it's going to be able to hit this and just trigger off the session.
Obviously, right now, all of a sudden there's a bunch of whitelisting and stuff in there because I can't have this coming from any domain.
And then there's checks to say like, OK, are you somebody that can start these sessions or not?
Because that would be an easy DDoS.
Yeah.
So all that's in there as well.
It means that I can just start sessions without being involved.
But then as soon as I put my phone, I hit active sessions.
I can see what's active and what's running and I can check in on it and I click in and it shows me exactly what's going on and I can read it.
Yeah.
I'm like OK, cool.
And then I wrote I was like OK, that's start.
Now I need like the sign off procedure, which is like OK, I've done the work.
I'm going to update something.
Yeah.
Which I always feel this is almost like the missing bit in a lot of AI systems because they kind of assume that you're in the chat.
So it just leaves the chat message and it's like no, no, no, I want like.
When you've finished, you should update a system.
And then send me a message or you should write a file somewhere because like dropping a file in this folder probably triggers something else that's going to go on in a batch system, something like this.
It's like there should be like a way of formally doing this.
So I wrote this MCP called return to base, which will allow you to take the output of your session.
It'll summarize your session into a particular output that's either like a summary of the conversation into a file or a particular like data something into a file.
Or it'll read the system prompt that you've got that'll include like a Jason object and it will interpret that into a schema.
And then it will write a schema for you that you need to send a payload to a webhook.
So you have a very flexible way of every time you call this particular MCP tool, it can go, all right, well, what's the, what's the schema I'm using?
That's the schema, right?
I need to send a message that works for that scheme into that web hook, right?
What's the, what have I got in my?
So this is what the AI is doing in the background.
OK, what's in the session?
Like?
Oh, OK.
Oh, it's pretty clear.
Like that's the output.
That's the thing.
That's this data, that's this data.
So it fills in the Jason object and then posts it to a web hook.
And like as soon as you've got that, like that is also the same thing that would kick off another session, right?
It's just a particular payload of data hitting a trigger is what could kick off a goose session.
So all of a sudden you've got the ability to go from like a goose to a goose to a goose to a goose to a human to a goose to a goose to a human.
And like, that's the dream.
So like it's got all the primitives now.
So, and this is what I'm like refactoring and tightening up.
I was like, OK, shit, I've just proved like a lot of what I've, you know, like the 12 month road map.
I need to we need a bigger road map like what's going on here.
But first I should, you know, make it good.
Yeah, but it needs to be good and boring first that this just always works and there's no issues.
But it's it's sort of amazing.
Yeah, point, because I'm like, OK, because this is what's going to allow me to, to very quickly go, all right, every night, what were the commits, right post take for each commit post to this web hook with the, you know, PRS are generated like to, to this PR and then it says with the instruction, download this and check it and give me a review.
And then that goes off and it creates APR in GitHub and then posts effectively that post to the webhook is I'm going to spin up a new goose session whose job it is to go and review that.
And I'll say, look, you give me a review on this and write a comment that outlines like all of your concerns.
And then somebody else would come in.
And then and then your return to base is fire off this person.
And this person comes in and read your comment and then like addresses it and then fire off this person.
This person fires off called code to fix it.
And then when it's done, it goes, OK, there's a new one there now.
And then you go, OK, well, that triggers like another action, which goes, OK, well, now he's just posted back to this PR.
There's a new, you know, there's an update here.
Let's go and review it again.
And then you can just keep going through this.
And you could have like one of them could be deterministically.
So like, what's the next action?
Is it a yeah.
Is it a review or approve?
Like if it's a sorry, if it's is it approve or rework?
If it's a rework action, then send it over here and this person's going to trigger clock code.
If it's an approved action, just leave it there as approval and I'll look at it in the morning.
Yeah, and.
I was like all this stuff, it's going to be a case of just plugging in like building the recipes and plugging in the data and it's just needs to be put in.
So this is what I'm going to do to my entire workflow to make everything because this is like like this, this is like one of the ways to start catching like the security issues that you've got.
And that is to start, all right, well, have versions of these agents with access to, you know, exploit libraries and various things.
Just come and check it like all the time.
Like just do way more review and analysis than you would ever do if you were a reasonable person.
And your code is going to be like 1000 times better.
Like it's just, it's and I'm just like so stoked because this is just like a.
Yeah, it's very cool.
Well, it seems like a very cool framework to actually just build a business.
I think you can, well, that's what we're talking about, right?
I think this is the way to do it is OK.
You could try it like this.
There's a lot of like free stuff you get and we've, again, we've discussed this so many times in the past that when you just like run it on your machine.
Yeah.
There's a lot of stuff you don't have to care about.
Like if you build something out as like a SAS service or something.
Yeah, it's like, oh God, I've got to have this whole like, you know, there's multiple people on this machine now.
Like, what does that mean?
Exactly?
How am I segregating all?
And it's like, I don't want to.
Why?
It's just not that interesting, like compared to guys like I'll sell you a Mac mini with this install.
Exactly.
Like would you like one?
Send SAT's.
I'm like, we'll give you this thing and then I'll show you how to set it up.
Yeah, I'm like, or you just go, all right, well, I mean, could this Mac mini just be a business?
Like it doesn't even need me.
Can I like, what can we do here as a fully autonomous business?
Because that's that's I think that's the win.
What are the what are the markets like?
Where do we tell you?
Like if you take the SAR, the SAS analogy, as you build the SAS, that's the engine of growth.
You need to sell it like an inch deeper, mile wide.
And the hard thing now we know we talked about as a bunch is obviously defensibility, but also, you know, distribution incredibly hard.
But the engine of growth in this case is just your business.
And you've got tremendous leverage from being able to tap into to wing man and all of these agents to go and basically go and unlock a bunch of capacity that that business is never going to touch.
And you can just re you can just build that business off that standard and be really, really powerful.
I think that's going to be a huge unlock.
It's whether or not you can marry the like the the like a functional understanding of what can be done and how it can be done with someone who's just running a kind of more like a traditional business.
Yeah, I mean, it really, it takes us back to episode 1 the, the business model of AI, right?
Like it's like nothing's changed in a way.
Yeah, we just go like, all right, This, this is how.
You do it.
This is a better way of doing it.
Like this is because I mean, I, I already have this like turned to several different ways of like a few different like writing assistants and like it's, it's, it's solid like I'm the agent.
I, I feel like the good thing that Goose did and the thing that lets it down a little bit on the coding is like called code is for coding.
Can you get it to do other stuff like yes, but it's hard enough like with all the pre programmed system prompted stuff in there to get it to stop coding when you want it to code.
That's like just stop.
Like you're doing too much.
Just fucking rein it in a bit.
But then to assume that that same agent framework is necessarily going to be a great like market writer, it's just the wrong base.
Whereas with Goose, like they've kept the agent, like the agent loop inside it quite pure.
So it really is a kind of like, you know, what am I being asked?
Like is there any tool that's useful here?
What would my next step be?
Am I ready to talk back to this guy?
What's my what am I being asked?
Like what's my?
It's a more general purpose.
It's a very simple general purpose loop.
Yeah.
And I think it's going to work very well for like the management agents and like writing agents.
It seems to be pretty good once you put in the time.
It's correct, right?
So one of the ones I did recently, so I've done like obviously a version of myself where I took like just a whole host of like what I thought was some of our good podcasting output.
So there might be a bit of you in there as well, plus like a bunch of long form articles I've written in the past.
And then I like I used goose to crunch all that down into a system prompt that I could use for a recipe with like advice on writing style.
I reviewed it and I was like, I don't, no, not that, that's not that's I do do that, but I don't need that in this instance.
You don't need to swear like it's just, it will just happen.
It should be spontaneous.
It's just, and so I went through that, that process and what was coming out of it was solid.
And I've had it rewriting stuff for me.
Like I said, I was working on the road map, but because we needed a bigger road map, I was like, fuck, I've done the road map like in week one, just like, OK, like going to take me, you know, might take me a little bit of time to make this more robust, but there's, there's more stuff to build.
And so I, I was just like, it's right now.
It's solid.
I was writing really good stuff.
It was like in great language.
It was coming out with good ideas for it.
And it's like, this is interesting.
And then I got I took as a blog.
So, you know, in no solutions for the podcast feed, there's the picture.
So that picture is actually from like A blog.
I don't know if that's the original place, but it's a blog called Group Brain Developer.
Yeah.
And it's like what I assume is like an old, an older developer who's just like you just need to like stop obsessing with all the shiny crap.
It's like it's like it's like wisdom of many years just written out as like group.
No, like complexity, complexity, demon ruin things keep simple like say no.
It's just like really good solid like development advice.
I was like, this would make a good like AI agent.
So I like decompose all that into like a group brain developer.
So then use that to do like reviews.
Oh, nice, like planning reviews and stuff.
And it's like, OK, well, OK, now you take my road map and you tell me like, where am I adding complexity here?
How would you do it?
And we got a version of it.
And then I sort of went back to my writing age and I said, look, so there's some good stuff in here.
And I like this, this and this, but I don't like this.
And then it rewrites and you see like playing off these different personalities.
It's so quick to to produce stuff.
And then you just think, like, the only thing I'm not doing is writing the stuff with my own hands.
Yeah, it's like I'm still reading it.
The bit I miss is the actual writing and rewriting and, I don't know, pressing the buttons and deleting stuff and going, do I mean that?
And like, thinking it through.
So I'm not getting.
This is the thing.
Of like brain exercise, but you are reading, considering like looking at different views like I'm definitely like, so I'm still, I feel like I'm still engaged in what I'm writing and the final output.
And I'll usually like my general way of doing all this stuff is I'll use the AIS throughout the process.
And then I end up with something that's close.
And then I'm like, I need to just do a pass and I'll physically, physically rewrite the whole thing.
But like with small adjustments and just putting it more closely into how I would think about it.
Yeah.
Or over ruling some stuff that there's always like some stuff as I stop putting that back in some somewhere down the stack like it's getting this and the system prompt but.
Yeah.
Do you feel like you're like abdicating your thinking there at all?
Is it just is a different way of doing it?
It's.
Thinking differently, Yeah, it's in the same way that like it's to me this feels like very natural because most of my career is involve managing other people and this is this is just managing it.
Feels like an extension of that.
Yeah, it feels very natural.
I could imagine if your if your default mode of operation is pushing the buttons, yeah, it's you're going to lose.
Yeah, it's not.
I mean, again, there's probably, The thing is this feels natural to me.
So I I'll over index on that and go that way.
Like it's not obvious to me.
That's the only way.
No, of course you have to interact with these things.
So I'm sure there's other ways that people will use this where they don't pull themselves out of that bit of the loop.
Yeah.
But I think the big insight is that it's possible to continue working in a kind of more management style, yes, approach to work, which is native to a lot of people, especially non-technical people.
And you know, because that would be a concern or a fear for a lot of people is that it's like I'm going to have to get more technically proficient to work with the machines.
And it's really it's not.
No, you need better soft skills.
Yeah, exactly right.
Well, it's not soft skills.
I don't know.
Yeah.
It's like strategy planning, thinking, like checking biases, not being overly married to a particular output that you've got and questioning that well, could I just throw all that away and do it differently cuz.
You can.
That part's easier, I think with the machine because it's like there's less interpersonal bias that gets kicked into the process as well.
It's like, well, ego or combativeness or competitiveness with other people in the workplace.
You just don't need to worry about it.
At the moment I'm doing some like dueling agents.
Like when I want to build something I'll just I build it twice in two different work trees and just see which one works best.
Yeah, like it.
Isn't against each other.
Yeah.
So this is like where I'm trying to figure out, you know, like is this something where do I still need core code?
Can I run everything through being there?
Like where is the sort of line where goose is good.
I mean there is AI haven't been able to reliably get this to work, but there is like a concept in Goose CLI of goose runs called code provider.
Now, I don't know what you get there like because called code itself does a lot of useful things in the background, like where it will the way like it seems to be more proficient at spinning up testing, like running environments, checking that things work in a way that like I have to tell Goose what to do and then it won't do it.
Which yeah, like where I'm like, this is like way more autonomous, this one over here.
So I don't know if you keep all that on.
I don't know if it's like kind of you end up with as I say, like whenever I try it just goes, oh, I'm not working.
And I'm like I can't.
I have time to debug this.
Properly, but at some.
Point I think ideally what you want is that the code code is running and it does all the execution work and then Goose is just looking at the CLI of code code and choosing the next best action based on what it says.
And then it's the thing that, you know, selects option 2 constantly or shift tabs.
Just go yeah, do it and don't ask again.
Yeah, do it and don't ask again.
Yeah, do it and don't ask again.
Yeah, do it and don't ask again.
It's like, come on, come on, we've been through this, but do it well.
No, don't do that.
You're forgetting about this dickhead like that.
And it's like, I, I won't.
I won't goose the swear at Claude for me.
Yeah, Like, that's the answer.
It's just like, tell it's OK to carry on.
We trust it.
Like he can't.
He can't kill anything.
Like he's isolated.
You can do whatever you want over there.
Just do it.
And then we'll test it.
And if it's crap, I'll throw it away like.
But it's I may as well just build the whole thing in one shot.
Yeah, versus not and we've already done like a good amount of planning before this.
So just I can't have at it.
Yeah, I think that makes a little difference.
Build it three times and let's have a look.
Yeah, find the best way.
I feel like Apple might have just stumbled across like the most useful product in their entire arsenal, really, with the Mac mini by accident, seemingly.
Because I feel like this is going to be the thing.
I mean, there is a little bit of me that was like, should I have bought Mac Mini Pro?
I could have.
Yeah, but I don't know, like it's quite a bit more money.
Yeah, so, but I thought right, well, this is.
Well, you didn't know at the time either.
It's like, worth the experiment, but yeah.
Well, if it's good enough but also like local.
Like the purpose of the Mac Mini Pro would be more local models.
I mean look, you could run Wingman on a server as well.
But that too seems like another reason why it'd be useful in enterprise, because you could just run these things locally rather than having to basically potentially risk your data to the big, yeah, I mean the big.
Models.
The interesting thing is like, I mean, I, you know, I've said it on this podcast before, I love open source stuff.
Yeah.
And I think it's just delightfully funny that like, this is just not a concern of mine.
So I'm like, oh, you're closed source code.
Got one big model.
It's like this is like the virgin enterprise IT versus like the chat open source maximalist.
That's right.
It's just the chat open source Max was like lol.
I mean it's just send you've already stolen it anyway.
Yeah, I was intending to.
It's good yeah.
It's like you've already like read all my code anyway.
Like if you want to keep this, yeah good on go at it.
It's probably not the best thing to base it off, but anyway, like send, send, you know, I have more yeah.
I just, I can just take full advantage of like whatever.
And it's it's cheap enough that even trying to like run stuff, it's like the only bit with the local stuff springs out to me a bit more is like, all right, well, what if I ran cheap local versions of this for some of the queries or like, you know, I'm not doing anything personal in this environment right now.
So maybe I would want a more personalized version of this.
And so there is some.
So I will be spending some time with some of the guys from Maple, I think soon at least one of them is on the island.
So that will be good to look at.
OK, well, can I use something like it's a Maple AI.
It's like a set of Bitcoiners that built a bunch of really cool privacy tech for a Bitcoin wallet that then ported that into running private AI.
So they have provably privately executed AI like inside the secure enclave on a GPU.
So your query that goes to the AI is never seen by anybody.
And then that will run quite fast on like it's all running like AWS infrastructure, I'm pretty sure, but leverages these enclaves in a way that nobody sees anything.
Yeah, that's cool.
So it's like, all right, well, that's a pretty good.
Yeah.
What you're not getting away from there is like the inherent bias in a model like that will still come through.
So there's still a problem of, OK, well, how do we deal with bias in these models?
It's interesting.
I was doing on on my local models, I was using DeepSeek and I was trying to see how far I could trick it into talking about Tiananmen Square.
Yeah, and it will, it will like more so than the hosted models.
I can get it to engage in discussions around Tiananmen Square and what happened.
And effectively all I'm really doing is just forcing its context through the system problem to be like, here's the Amnesty International write up of what happened in Tiananmen Square.
You know, they'll ask the questions and see it's thinking.
But as if you ask if it's a massacre.
And that is like in the system prompt, like this is a massacre.
This is a problem.
This is what happened.
It.
I can't discuss this.
Yeah.
And you're like, oh, wow, you've got this is this is like trained in.
Yeah.
Like this is.
That's interesting.
So because like my, my gut feel, there's a lot of the bias in these models.
You can get around most of it by just having better context.
Like I feel like that's the 90%, like your Pareto answer to this is just what we actually need is really good open source context databases for information.
And you want to draw from that alongside asking the question.
So when you ask the question, you give it all the context that you're probably interested in.
Yeah.
So that's like, you know, we talked to Paul previously about the graphs and what they're building it stack work like that open source graph stuff I think is is going to be magic.
That could be a huge, huge, like excellent resource for the world.
So I need to see just to check in see where that's at.
Yeah, yeah.
Definitely.
But something like that.
Because then you can pull in all of your bias and all your echo chamber, whichever 1 is important to you, and the model will reflect that.
Like, it won't get rid of this.
Like, no, no, no, I just can't discuss that stuff.
But it's like baked in those models.
But then we'll more quickly find out, like where those rough edges are.
Yeah, it's kind of nuts.
Yeah, it's a prompt engineering plus contacts engineering in the future.
That's the way to do it, yes.
I like it, yeah.
I mean, certainly for, as I said, like you want to be private in this stuff, I think there will come a point where I want to do more private stuff of Goose and that like, OK, so I can do that either on my local model.
And again, there's degrees of this right and it's so I could run it locally and then it's, you know, it hasn't even left this little box.
So that's pretty private, but it's going to be slow.
So I'll just run it like overnight when I'm not trying to get a question answered instantly.
I mean, this, this is like one thing that's different with Wingman.
Yeah, certainly.
Is that you right now?
I wouldn't use it.
I might build like a chat straight chat version.
So you're just talking directly to the API.
But because it goes in this agent loop, it's it's slow to respond like it could take even for a simple question, it might take like a couple of seconds for us getting back to you, right.
But this is because it's like setting up a new session and like setting up CLI, loading the CLI, loading all your tools, then thinking about what it's going to do.
And so it's like it's not the same as just going to change.
Give me this.
Yeah, but what you, what this is really for is like longer running persistent sessions where I can have a ton of them open at the same time.
They can talk to each other, they can like pass information around and take actions on them.
And it's just like, that's that's what, yeah, it's a different thing.
So it's a little bit slow.
It doesn't quickly answer your questions.
That would be super easy to add another point.
And maybe I'll put that, but I just, I already, because I've got so many subscriptions, I'm just like, I don't need that.
That's not the thing I need.
What the what I want is something that I can access from anywhere that has access to all the information that I want that doesn't involve me giving everything over to court on their website or it's just the wrong or running everything in across 12.
So like this is the model they've got at the moment.
It's like, oh, here's like all of your data across 40 different SAS providers, like none of which you have any access to.
We can access it if you grant it.
And it's like, this is not the model, guys.
The data is in the computer.
You were in the computer.
Talk to it.
Like, guys, you're all here.
Like we've put you in a room.
You've got the data now.
Yeah, do it.
Do the thing, just do the work.
And then I'll check in.
Yeah.
Long term, my wing man is going to be this like control deck for.
I have like at any point in time, anywhere between 1 and 100 agents running and I need to know, right, what are they doing?
What are they working on?
So there's a whole like, because you've got, OK, well, I've got sessions and then you're like, OK, but then how are those sessions grouped?
Are they doing development work?
Are they doing marketing work?
Are they grouped into companies?
Are they grouped into operations or projects or like what?
How does all that work?
Like all that needs to be.
I mean, I'd love to be able to see the interoperability across those sessions as well.
So like what information flows to different sessions via?
Different.
You mean the?
You mean the flight plan?
The.
Flight plan, yeah.
You know the I'm really.
Leaning into like the aeronautic.
Yeah, Yeah, yeah.
When when you said that, I was like immediately just thought of someone in a control tower.
Yeah, that's, that's it.
Yeah.
So.
Yeah, like this is one of the things that is brainstorming with group brain developer this.
Morning.
Yeah.
It's like, yeah.
All right, what are the?
What are the correct Like we can use terms about geese and we can use terms about flying.
Yeah, these are all.
That's it.
These are our things that we've got to work with.
He came up with quite a few.
And there was like the control towers, the log book for looking through the history of all the runs.
I was like, oh, that's good, That's good.
That's the least I wouldn't have thought of.
I mean, that's.
Squadron.
That's yeah.
That's multiplayer mode.
That's yes, I like it.
The control tower operator is like one of the most stressful jobs in the world though, isn't it?
So this is going to be like the least stressful job in the world, so.
You do end up like spinning a lot of plates.
That's the interesting thing is there's a lot of multitasking involved in this, a lot of context switching.
But I feel like now.
In the system agent.
And it's all, well, it kind of resolves that because it has like seriously just like centralizing around like, all right, I'm gonna put all of my documentation into this one Obsidian notebook that we both share was a huge win.
Like it was like, OK, this works Now we, we all have access to the same set of notes and we can all write in there.
We've got some enough version control to recover stuff, but it's OK, we can just iterate on stuff, create new versions.
I mean, I, I should like look at doing it inside Google Docs as well and just see how that see how it works, see how it runs.
But every time I have to interface with creating a new oaf identity for an app for Google, I'm just like, I hate this.
It's just like I don't.
Know it's such a pain.
I don't want to do that.
I don't want to.
Yeah, it's just screw it, let's just get everyone using.
I'll just use.
Something simple.
Well, it's nice because like Obsidian is just an app that lives on the computer that I've set to open on login and it just opens up.
It's got, it runs.
I've got a plug in that runs the API server for it and then it pops up and there we go, yeah, we've got it.
And then it just syncs to everywhere else.
It's like, all right, sometimes the sync is a bit slow and I'm like, oh, it's done, it's done.
It's not there, it's not there.
Have you really done it?
Have you really done it backwards and forwards a bit?
And then inevitably it's just like, all right, I'll be there in 5 minutes.
Exactly.
Yeah, maybe I need to just chill out a little bit.
Like I've already saved about 42 weeks on my story points of development that have been done.
I mean, that's.
That's maybe these 5 minutes aren't the issue.
That is always a bit of a running joke where you get a a road map and it's like the AI will still give you like human centered time frames for some of this stuff.
And it's like I'll see you in 5 minutes.
That is 1.
Of the rules I have and some of these is just to avoid just don't never quote timelines.
Yeah, yeah, this point.
Just give me story points if you feel the need to size jobs appropriately, but just don't let's not pretend yeah that this has got any relevance.
Yeah, like OK for the next six weeks now.
I must say, like, one of my insights the last couple of weeks has been to spend more time in ideation and design docs than I feel is necessary.
Yeah, because I tend to have this like, I'll just get a bit ADHD about it and I'll be like, OK, let's try and get this out.
This is a good enough set of design docs, now let's just jump into building.
But I've actually resisted that a lot more lately.
Yeah.
And given time and space to actually let the idea evolve a little bit.
And why try not to even do it in one session, but to even just like walk away from it for a bit and do something else and then come back to it and then tinker with the the PRD and.
And yeah, I feel like the act of spending more, more time upfront on that work just makes the whole process a lot a lot easier, a lot more efficient, a lot more effective.
Yeah, it's I think there is like a bit.
The tendency is to rush it, because you've got the tool to actually go and do this at lightspeed.
Yeah, but you don't need to.
I think the issue with it is that you can end up like you.
You learn so much from just.
It's like, just give me this implementation and then I run it and go, Oh yeah, this is entirely wrong.
Like this, this whole bit isn't there unless like I maybe I just didn't specify it, but it's almost hard to know.
Like it's hard to like think about what you're specifying and that you have to know that it's right that you like.
But I think it's just a trade.
Like there's just a slider and like sometimes like particularly for big, like big changes, I'll spend a lot more time on the docks or I'll leave them for a little bit and then I'll come back to them and I'll audit stuff and I'll get them to rewrite and go for a few iterations.
But the other thing I started doing a lot is just keeping like a little active snag list of like, I don't like that.
I don't like that.
I don't like that.
I don't like that as I'm going along for all the little UI things aren't working or somewhere.
I suspect there's a bug there, but it's not enough to fix necessarily.
And I'll every now and again, I'll just go, all right, new work tree snags and we'll go for it.
And I'll just like blast for about 40 different things.
And he just gives me this like, immense feeling of power and energy.
I'm just solving like all the shit that bothers me inside the app in one go.
And you get such a big boost from it.
Yeah, that's that's that's been quite useful for me.
And it's the stuff where I'm like, I don't need a feature for this.
Yes, I just need to screenshot a couple of bits, show you what it is, and then it's right on review.
And like it's like code test, code, test, code test, code, test, like every two minutes until it's all done.
Yeah.
And then that, yeah, that's been good.
I think the other thing is like me, not like what I'll do is I'll just try and bash through even when I have like a different idea of how something should perform or act or function.
I'll just try and bash through a cloud code instead of actually going and taking a step back and reviewing the docs and going how does this how does what I want to occur here impact the rest of the flow that I'm building?
And what does that mean for the you know how effectively this product's gonna run?
And so, yeah, I'm trying to be more conscious about like stepping out, getting back into the kind of ideation design docs a lot more and then giving that as fresh context back to code, code to start a new session and go right, let's work from this base now it.
Would almost be kind of giving me the idea of like having a, you know, like nightly, weekly, something like this, like deviations from architecture type job, like going into certain features and auditing them and saying like, well, how does this work?
Let's check it against how it's how I think it works.
I think it works like this.
And these are my 10 bullet points that specify it and then tell me where it's deviated and where it's different.
And then I'll either accept them into the baseline or not.
You almost want some sort of simple baseline that you can have the thing checking back against to go like, look, I expect it to work like this because sometimes they'll just, they go, oh, no, no, no, no, I just, I just built this over database over here and everything's in that one.
And you're like, oh, interesting choice.
And like, it's the sort of thing where it's like, well, I can't necessarily just notice that, yeah, when I'm testing the app.
I needed to see that you'd done that, but I wasn't paying attention.
That's the problem, right?
It's not paying well.
You can't pay attention.
You can't write.
It's not like.
Exactly if you're.
So quick at reading all the code that you're there putting that you can pay, you probably should have just typed it yourself.
Exactly.
Also a robot.
You're just like scanning it best looking for stuff and you'd like one of my little bug bears with clawed code is it'll sometimes I think you've talked about this before as well, like it will sometimes just hard code things in order to make it work and I I won't be keeping a running record of what it's hard coded.
And then it doesn't even tell you how to.
Yeah.
And then you're like, well, why is this?
And then you change something and you go, why isn't that working?
Yeah.
And it's because of the hard code thing.
But you get back to that by the auditing.
Yes.
And so I feel like this is the.
It's the same way that junior developers fuck stuff up.
Yeah, right.
It's just like, all right.
So we just need to use the same processes of like, well, we need to audit the code and some of the PRS that get submitted and we need to audit it like this.
And like, here's my checklist of things that I'm looking for.
And you just, I think you, you know, I feel like I'm just going through like speed running all the pain.
Yeah.
Then go, OK, I need to feel the pain before I resolve it with an automation first.
Yes, because a wide motivation won't be worth shit.
They'll be based on whatever.
So let's just do that.
I'm like go for it.
Yeah, I.
Spend like it's I don't know I feel like I've kind of personally built like what's one of the most sophisticated yeah like little platforms that it's just like sits in my office it's very cool rebuilds itself yeah all right I need to do more with this but this is completely I felt like the right move before this weekend was the focus on that because then it's going to make the next six weeks I think so why it's.
Going to be amazing.
Yeah, no, I love it.
I'm like really excited to see where that where that goes.
But yeah, I've been doing the opposite.
I've just been like 8020 and a bunch of tools that I have that I use on a weekly, daily basis in some instances.
And I just don't want to use them because they do way more than I need them to do and I need a little bit done from them.
And so this last week I had started work in a YouTube real generator because we obviously need to push more reels out.
And it's kind of the bane of my existence because it's like it is just time consuming enough to push them out because there's a little bit of editing that needs to be like, you need to be able to find where the right clips are, which often just means kind of listening back through it.
You know, I'd kind of do it at 2 1/2 speed, but listening back through the.
Apparently we sound good at 2 1/2 speed.
Look my.
Friends of the Pod.
I quite enjoy it.
I'm glad people have pointed that out, but it's, that's not the worst part of the job, but it is a time consuming part of the job finding the clips.
And then like I will use some combination of Da Vinci and cap cut and I'm sure like there's someone listening, it'll be like, I use cap cut and I can do everything possible in there.
And you know, there are guys who do it like very proficient with video editing that probably just know exactly what to do.
But I just find it time consuming to it kind of like work across like 2 bits of software that do bits of what I need, but neither it necessarily does everything that I need, but I also do like 10,000 times more than what I need.
And so I was like, could I just build like a lightweight system that I can pop our video in like once we've posted it every week, we'll use like an agent to go basically transcribe the whole thing and then look for the clips that will, well, the sections in the transcripts that will make the best clips between like 30 seconds and a minute.
And then it then extracts those as video and audio, but in 916 audio aspect ratio for YouTube.
Yeah.
And I can just review them and I can do little edits to them by like changing like instead of having the timeline as you would in like a video editing software, I basically just have like a little slider so I can just add seconds or remove seconds from it because we've over.
Clip it and then just clip down and.
You can clip down if maybe or if it cuts off like mid word as it it often will figure out exactly the right sentence to end on but it'll like maybe sometimes clip the like a last word.
If we got like slightly incorrect timing information.
I think it just needs a little boss add it in, but it works.
Always add like a few seconds either way.
But then if your job is like, OK, here's the 10 clips, now just make them one second shorter there and one second shorter there.
Like that's easy.
It's easy.
And it works.
And so that was really good.
And then I got to a point where I was like, could I make this like quite an enjoyable space to operate and try and make it as beautiful as possible?
And I was like, I want a little like cool, like floating NAV bar where I can just run the whole process out of this little NAV bar and it can progress through and just has a like session updates in there.
And that is cool.
It's actually looking pretty good.
But it's amazing like this go reverse.
That's what we were talking about previously because I'd gotten the whole thing like firing and then I was like, I want to change the UI UX of this whole thing.
I didn't go and update any of the design docs to do that.
I basically just like went off with a bunch of prompts with code to re architect the look of it and that's when things started to break and I was like oh fuck like what am I?
What am I doing?
So now I've got a couple of.
Really nice UI, but nothing works.
Well, it works.
But there's a couple of little bugs in there, so I'm just ironing those out.
I reckon by the end of today I'll have that fully functioning.
And it was pretty good.
I ran small clips.
I had a couple of clips from YouTube that I just pulled from there that were like 3-5 minutes long just so I can test them before I pop in an hour and a half long episodes and it was pretty good.
I was finding like 10 to 15 real clips from like a 5 minute video and I was like.
Then we can turn all of that input into an MCP server and we can just ask women to do it.
To do it exactly, yeah, do this.
That would be ideal.
Yeah.
Yeah.
So that was that was that.
But there's a lot of little like 8020 in exercises where, you know, I think last the week before last I'd spoken about like a little screen sharing tool I built as well.
I'd like made a couple of changes to that so.
I spent obviously a decent amount of time of open source devs in the last decade or so.
And you kind of like you, you just get a feeling from these people that are like, oh, yeah.
Like, you know, I just like there's almost this superpower that they've always had of saying like, yeah, oh, yeah, I use all this software because then when it doesn't do exactly what I want it to do, I just make it do exactly what I want it to do.
And you're like, oh, that sounds nice.
Yeah.
And then now like, I can do that.
And I'm like, oh, shit, this is nice.
Like I can just go, all right, I don't want it to work.
I want something that works.
Like when I was doing the fact controller stuff.
It's like, all right, this should exist.
It's not anywhere.
So I'll just build a version for me.
Yes.
And I'm like, I should finish it off by obviously super distracted by Wingman.
It's gonna be quicker to finish off now.
I could turn this into something that anybody could use.
Simple enough and then off we go.
It's like fast fashion for software, but I kind of feel like that's just a byproduct of having so many different demographics wrapped up into a target market.
It's like each one has a small number of things that they want to do and so devs then need to go and implement a bunch of different things, but then anyone user might not use like you know, 30% of those those functions.
And there, well, there is an economics thing as well of like everybody wants the costs of like onboarding and keeping users seem to have dictated.
That's like, all right, well, everyone's software costs at least like $15.00 a month for something.
Yeah.
And it's like, well, well, I don't, I don't want 100 bits of software that costs $50.00 a month.
That's a lot.
That becomes quite a lot of money.
I don't, I don't need to spend that for my personal stuff.
And you're like, well, could I just, yeah, even if I spent like $100 a month in vibe coding, Yeah.
Could I just then have?
I've got that forever now, Can I just run it on this machine?
Well, I reckon I've replaced at least two or three bits of software already.
Yeah.
So it's kind of like already paying for some of my AI monthly spend.
Yep.
So it kind of feels worth it.
But I did have another like idea that I started playing with the other day.
Not for anything like not for any particular reason, but I was just, did you ever play like the board game Settlers of Catan?
I'm aware of it, but I've never actually.
Played it.
Yeah.
For me, it was like, there was like too much time that's gone on.
I can't quite remember all the intricacies of the game, but I was like, this would be like a cool experiment for testing language models in like a real environment where they've actually got to like they've got a goal.
Yeah, you know, they've actually got to go and like they've got their tiles and you need to go and like progressively develop them and harvest resources and trade resources, negotiate.
And I just wondered like, I want like, I want to try and build this out like text based so they can actually play through a game.
And you could have like 3 or 4 players, different models and you could test them and there'll be a way of benchmarking those models.
That might be someone's already converted it.
Like probably such a it is a very popular game, right?
But I reckon you could like have that have the models running, running through the game in like a text based environment, but you could spin up a like a three JS like environment, like game environment, game environment.
So you can actually see like what's happening on the screen as yeah, as like playing through on a text base.
And I was like, I wonder like what that would look like.
So I started like writing up a few docs for it and then I had man flu.
So I didn't actually progress it in any The problem.
Is you're writing your own docs, well then you just have to like roll over and every now and again go wait, man.
Right.
Yeah, Well, I completely lost my voice for a couple of days.
So it was like couldn't even, couldn't even gotten that far.
But yeah, I kind of wonder, like, I think that's going to be a useful proxy for seeing how these things actually run in like enterprise environments as well.
Because all of a sudden, it's like, you've got to manage politics, you've got to manage some negotiation, you've got resources at your disposal, You've got competing interests, like departmentally.
But how do you play through this?
And I was just like, because I was reading through the benchmarks when they released ChatGPT 5.
And I was like, OK, cool.
Yeah.
But how do these things actually perform when you just like, pit them against each other in a kind of a game environment?
Who would win fight?
I think it'd be.
Interesting because then when you had a new model released, you could just then add that into like Catan and get them get that new model to play off against the old models and to see how it performs that.
Becomes the new benchmark.
We've got to beat each other.
It's like a game benchmark for language models, but again, that feels like it's a good proxy for testing a language model because it is essentially how humans would play in an environment.
I've got something similar.
I've been talking with Mikey about where, again, he likes his tabletop war games that we play.
And like, the problem of those sort of games from like a simulation point of view is that like, it really matters like position and line of sight on the yeah, yeah.
And So what I suggested is like, well, like, and then they from when I was a kid, like they used to have like a lot more complexity in around like the rules and the weapons and like different options and things.
And so, so over the years, it feels like the game got dumbed down in the complexity piece because it was too much overhead for the game.
And I thought, well, if we, if I was an AI and I can't, I can't physically be in the world.
Like I can understand this game.
I can track it, I can see what's going on, but I can't see where the where everything is in order to play it.
But if we, if we compromised there and said that, like, all right, well, instead of stuff, the reality is that like, you know, you might be like here or here and I'm like showing 2 fingers like an inch part.
And like that really matters.
But it kind of doesn't because you can move so far and you can just go and hit somebody And it's like you're assuming somebody or hitting somebody or trying not to get hit.
And I was like, well, we can make that.
The actions is if you're in the same space, you're considered like entangled, as it were.
And if you're not, then you have like these different options of how you move.
And then like if you're in the same space, you've got to fight or you've got to try and get away and shoot or hide or whatever.
And you make like that the way you issue orders.
And then that becomes very simple around the placement.
And then you see you still have all the stuff on the table and you know, he likes like boom, boom, boom, boom, boom, boom and stuff and like playing with them.
So you get all that and he could arrange them however he wants.
But then we you issue it like orders for that set of people to get taken out.
But then because that becomes quite simple, the AI can understand exactly like, you know, I've got zones ABCD 1234, these guys are in a one, these guys are in B2.
Yeah, that's super simple to report.
And you could say like you can issue orders sort of in speaking that it can understand.
And you say, OK, well, you know, Space Marine moves from A1 to B2.
This guy space elf goes and charges the specimen B2 and it goes OK.
And then it can resolve like everything else that's complex.
So then you can get like as complex as you want and how you arm them, for instance, they could have like 1000 different options that you could give them.
And then as long as you've specified that beforehand, that's what they've got.
And it can just tell you like, all right, you roll free dice hitting on a four, you roll 2 dice saving on a free or whatever.
And it does all the like, all the complexity, all the remembering of rules can happen on your behalf.
Yeah, we take all that out of it.
But then it gives him what he wants, which is like, oh, I'd want to like go away and like have this team and really think about it and think about all the weapons because I'm a child and I have unlimited time.
And then he wants to come and play me and I'm like, right, I'm a dad and I have no time at all.
Like I would like the a default version, so.
And, and I can't remember the rules because I can't play often enough to remember them.
And so, yeah, so it kind of, I feel like it could really fit that gap of like putting AI into that sort of game.
But yeah, I've been developing that like with just going through writing the doctor, rewriting them, thinking about it.
I have my game developer recipe.
Nice.
Yeah.
And I feel like it's probably going to be quite complex to get like enough of these rules and to the language models where they're actually following them as well.
Because, you know, the thing we talk about a lot is like these models like tend to be really agreeable.
And my immediate thought was like 1 model is just going to talk to another model And it's going to say like, I need all of your weights to win the game.
And the other model is going to be like, you're absolutely right.
Here's all of my weights.
And this is going to, like, devolve into chaos because these things are just going to be like, there's some degree of self-interest, but they're also, like, quite agreeable.
Well, there's probably a question between like what needs to be like an LLM rule and what is programmatic like.
So if there's an obvious next move, there might just be an algorithm.
So you, because you could give the AI and MCP server that returns next best move for Katan from like a known algorithm list.
And then it can decide whether it wants to take that move a different move.
You say like give me the free best moves.
OK, let's think about this and then, yeah, there could be someone else that, you know, allows it to plan into the future or something like.
That that's the.
But these were the tools.
That you then give it.
It's like it doesn't need to be in the model.
How far into the future could you actually plan your strategy out?
How far into the future do you need to be planning it as well?
I think these are like interesting questions for a model to try and work through in order to function in the game.
Business as well right exactly say like OK, well here's the rules of like the game that we're playing and the game is I need to make more money than these guys so like how do I do it like think into the future give me some strategies like how would I approach.
This what was the kind of the origin of the thought was how do you know which models are actually going to be most useful when, especially for orchestration in like an enterprise environment?
Yeah.
And like, we don't really have good obvious benchmarks for this stuff.
So it feels like the next closest simulation would be a game environment where they have goals, there's some degree of like self-interest involved.
You need to have strategy.
There's like obvious like resource harvesting and resource trading that goes on as well.
Like how well could you do that?
And then obviously like you have a matrix for how well it performs because it wins the game based on like growth on their tiles.
So like how do you progress on those tiles?
I think it's like a city building type thing on those tiles by memory.
It's been a long time, so I need to look at that again, but that just feels like it would be a really cool way to test a model to see how well it would be.
It would act as an orchestrator in like a commercial environment.
I mean, treating like life itself, right, can be described as a series of repeatable games, you know, like with memory.
So it, you know, it matters that like, yes, in this particular instance of the game, if you cheat, you'll come out ahead.
But it's a repeated game.
Yeah.
And everybody remembers that you cheat.
Yeah.
So you do yourself a disservice to do that.
Yeah.
It's like it's, you know, it's quite easy to dismiss games, but it's, you know, game fairies led to a lot of good outcomes as well, so.
Yeah, yeah.
So that was.
I'm a fan.
I'm going to, I'm going to play.
Around von Neumann.
Good cap.
Well, well done, friend of the pot.
Friend of the Pot, Well, I think possibly, probably, probably gone over time again.
But we're about there like 1 out of 10.
That's good.
It's not bad.
So there's some good experiments this week.
I'm I'm keen to see what happens in the next like 4 to six weeks.
It's going to be yeah, it's going to be an interesting time.
We'll probably.
Reveal a bit more about that at some point, hopefully we're going to have some interesting special guests.
We've got some guests, yeah, so.
We've got some guests lined up, yeah.
All right, excellent.
That's the big episode 20.
See you in a week for the big episode 21, which is actually the big one.
Nice see you then.
That's the good stuff.