The Ollama Model-Swap Death Spiral That Killed Every Cron at Once

May 6
6 mins

Episode Description

3 a.m. Every cron job on the Mac Studio failed inside the same 90-second window. No code changes. No model updates. No new jobs. Just a wall of timeout errors that lit up every channel I had wired to alerts. The culprit was hiding in plain sight: a fallback chain doing exactly what I told it to.

The Setup

One Mac Studio. One Ollama daemon. A handful of cron jobs each calling the local LLM for different tasks: code review, log summarization, doc indexing, a nightly digest. Each cron specified a preferred model. Each one inherited a "be resilient" fallback chain from the task router: try the preferred model, fall back to a smaller one, fall back to a tiny one if both fail.
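In sketch form, the chain looked something like this. This is a reconstruction, not the actual router code — the 32b and 14b tags are the real models from the incident, but the 0.5b "safety net" tag, the `ask` helper, and the retry logic are illustrative:

```bash
#!/usr/bin/env bash
# Illustrative reconstruction of the router's fallback chain -- not the real code.
MODELS=("qwen2.5-coder:32b" "qwen2.5-coder:14b" "qwen2.5-coder:0.5b")
TIMEOUT=30  # sized for warm generation (5-10s), not for a ~60s cold load

ask() {
  local prompt="$1" model out
  for model in "${MODELS[@]}"; do
    # --max-time aborts the request mid-load; --fail maps HTTP errors to nonzero exits.
    # Naive JSON quoting of $prompt -- fine for a sketch, not for production.
    if out=$(curl --fail -s --max-time "$TIMEOUT" \
          http://localhost:11434/api/generate \
          -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}"); then
      printf '%s\n' "$out"
      return 0
    fi
  done
  return 1
}
```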

It looked clean on paper. Big model for the smart stuff, smaller model when the big one chokes, tiny model as a safety net. Classic graceful degradation. The kind of pattern you'd put in a "production-ready" checklist without thinking twice.

The models on disk ranged from 4GB to 22GB. Loading the big one into VRAM took roughly 60 seconds cold. Generation, once warm, took 5 to 10 seconds. Guess which number I used to set the timeout.
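Those two numbers are easy to reproduce on your own box: run the same request twice against a model that isn't loaded. The first call pays the load, the second is warm:

```bash
# First run: cold load + generation. Second run: generation only.
time curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5-coder:32b","prompt":"hi","stream":false}' > /dev/null
```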

What's Actually Going On

Here's the cascade. Cron A fires at 3:00:00 and asks for `qwen2.5-coder:32b`. The model isn't loaded. Ollama spends the entire 30-second timeout just paging the weights into VRAM. It never gets to generation. The request fails. The fallback chain kicks in and asks for `qwen2.5-coder:14b`. Ollama evicts the half-loaded 32b, starts loading the 14b. Another 30 seconds gone. Fallback again. Tiny model loads, finally generates. Cron A "succeeds" with degraded output.

Meanwhile, Cron B fires at 3:00:15 expecting the 32b model that Cron A's first attempt was loading. Now there's a tiny model in VRAM instead. Cron B starts the same dance from a different starting point. Cron C lands on top of that. Within 90 seconds, every cron is waiting on a model swap that the next cron is about to invalidate.

The fallback chain wasn't degrading gracefully. It was thrashing the VRAM and guaranteeing nobody finished. Every safety net I'd added was making the failure worse.
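If you want to see this kind of thrash as it happens, `ollama ps` lists the models currently resident and their memory footprint. macOS ships no `watch`, so a plain loop during the cron window does the job:

```bash
# Poll resident models every 5 seconds; each lineup change is a VRAM swap.
while true; do clear; date; ollama ps; sleep 5; done
```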

The Fix

Two changes. No clever code. Just operational discipline.

First, pin one model in VRAM with `keep_alive: 24h`. This is a request-level option that tells Ollama how long to keep the model resident after the response instead of evicting it. The default is to unload after 5 minutes of idle, and that eviction is exactly what lets the next caller's load attempt thrash everything.

```bash
# Pin the model in VRAM with keep_alive
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "test",
  "keep_alive": "24h"
}'
```
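If you'd rather not pass the option on every request, Ollama also supports a server-wide default via the `OLLAMA_KEEP_ALIVE` environment variable. On macOS, one way to hand it to the app's daemon is `launchctl setenv` — restart Ollama afterwards so it picks up the new environment:

```bash
# Server-wide default: every request inherits a 24h keep_alive.
launchctl setenv OLLAMA_KEEP_ALIVE "24h"
```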

Second, force every frequent cron to use that same pinned model. Kill the fallback chain for hot-path workloads. Fallback is fine for one-off scripts you run by hand. It's poison when three crons fire in parallel against shared VRAM.
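The resulting hot-path wrapper is boring by design. A sketch under the same assumptions as before — the `run_llm` helper name is mine, not the router's:

```bash
#!/usr/bin/env bash
# Hot-path wrapper: one pinned model, no retry ladder. Sketch, not the actual router.
set -euo pipefail
PINNED="qwen2.5-coder:32b"

run_llm() {
  # The timeout only has to cover warm generation now;
  # keep_alive re-ups the 24h pin as a side effect of every call.
  curl --fail -s --max-time 30 http://localhost:11434/api/generate \
    -d "{\"model\":\"$PINNED\",\"prompt\":\"$1\",\"keep_alive\":\"24h\",\"stream\":false}"
}

run_llm "summarize last night's error logs"
```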

To make sure the model is loaded before any cron fires, I added a LaunchAgent that runs the warm-up curl at login (a LaunchAgent fires at login, not at boot):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.local.ollama-warmup</string>
  <key>RunAtLoad</key>
  <true/>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/curl</string>
    <string>-s</string>
    <string>http://localhost:11434/api/generate</string>
    <string>-d</string>
    <string>{"model":"qwen2.5-coder:32b","prompt":"warmup","keep_alive":"24h"}</string>
  </array>
</dict>
</plist>
```

Load it with `launchctl load ~/Library/LaunchAgents/ollama-warmup.plist`. Now the model is hot before login completes. Every cron hits a warm model and finishes in the 5-to-10-second window the timeouts were designed for.
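Two quick checks confirm it took — the agent is registered and the model is resident:

```bash
launchctl list | grep com.local.ollama-warmup  # agent registered?
ollama ps                                      # model loaded in memory?
```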

Result: zero model-swap thrashing since the change. Crons that used to fail intermittently now run consistently.

Why This Matters

The lesson isn't about Ollama. It's about cold-load math. Anytime your "graceful degradation" path is slower than your timeout, every retry makes the next caller's situation worse. Fallback chains assume the fallback is fast. Model loads aren't fast. Database failovers aren't fast. Cold containers aren't fast.

Operational discipline beats clever code here. One hot model, no swaps, every cron pointed at the same target. The "less resilient" design is actually more reliable because it removes the failure mode entirely.

If you're running local LLMs on shared hardware, assume VRAM is a single resource that gets thrashed under parallelism. Pin what matters. Warm it before it's needed. Don't trust fallback chains during peak hours.

Quick Reference

* Cold load of a 20GB+ model: roughly 60 seconds

* Warm generation: 5 to 10 seconds

* Default Ollama eviction: 5 minutes of idle

* Pin a model: `keep_alive: 24h` in the API request body

* Warm-up on boot: LaunchAgent (macOS) or systemd unit (Linux)

* Hot path rule: one model, no fallback, same model across every concurrent caller

* Reserve fallback chains for interactive, single-caller use
