E16 Running LLMs Locally: Privacy, Performance, and Practical Trade-offs

May 29
35 mins

Episode Description

We explore what it really means to run AI models locally instead of relying on cloud providers like OpenAI or Anthropic. From powerful desktop setups with dual NVIDIA RTX 3090s to tiny models running on embedded systems, we cover the full spectrum of local AI deployment.

Luca shares hands-on experience running local models for client work, explaining the hardware requirements (spoiler: you need fast VRAM, not just lots of RAM), performance trade-offs, and practical tools like Ollama and LM Studio. We discuss how modern open-weight models from Meta, Google, and Chinese companies compare to hosted solutions - typically about a year behind state-of-the-art but surprisingly capable. We also look at edge AI applications, from elderly fall detection to traffic accident monitoring, where compact models shine. The conversation covers context window limitations, quantization techniques, and why getting started is easier than you might think - though you'll need to manage expectations about what local models can deliver compared to their cloud-based cousins.

Key Topics:

  • [00:00] Introduction: What are local models and why run them?
  • [02:30] Hardware requirements: VRAM vs system RAM, and why graphics cards matter
  • [05:45] Luca's setup: dual RTX 3090s and real-world client work with local models
  • [08:20] Performance metrics: time to first token, tokens per second, and output quality
  • [12:00] Tiny models for edge AI: Google's 270M parameter model and specific use cases
  • [15:30] Tools and workflows: Ollama, LM Studio, and OpenAI-compatible APIs
  • [18:45] Where models come from: Hugging Face, Meta's Llama, and the open-weight ecosystem
  • [22:10] Context window limitations and quantization techniques
  • [25:00] Getting started: realistic expectations and practical first steps

Notable Quotes:

"What those models really need is tons of memory and as fast memory as you can get it. This is why people like Macs because they've got the unified memory architecture." — Luca Ingianni

"Modern models are actually pretty good. They are smaller, so they will have less knowledge. They are a tad slower. They struggle with much smaller context windows. But if you can work within those bounds, they work pretty well." — Luca Ingianni

"Using AI is a different mindset. It's a different way of thinking about solving a problem, but it's just another tool in your toolbox and it's just another way of getting work done." — Ryan Torvik

Resources Mentioned:

  • Ollama - Docker-like tool for running local LLMs with simple pull/run commands and OpenAI-compatible API
  • LM Studio - Graphical user interface for downloading and running local models easily
  • Open WebUI - Web interface for local models that mimics ChatGPT's chat interface
  • Hugging Face - Repository with hundreds of thousands of models in various sizes and configurations
  • Meta Llama - Open-weight model family from Meta that helped start the local LLM movement
  • Google Gemma - Model family from Google including compact vision-capable models (270M parameters)
See all episodes