Cloud APIs are convenient, but local LLMs offer privacy, control, and predictable costs. Here’s what you need to run them effectively.
The Basics
Running LLMs locally comes down to three things:
- GPU memory — Determines model size you can run
- GPU compute — Determines inference speed
- System RAM — For context windows and data processing
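A quick way to ground the "GPU memory determines model size" point: a model's weight footprint is roughly parameter count times bytes per parameter, plus some headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is an assumption for illustration, not a measured figure):

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float,
                        overhead: float = 1.2) -> float:
    """Estimate VRAM needed to hold a model.

    params_billions: parameter count in billions (e.g. 7, 13, 70)
    bits_per_param:  16 for fp16, 8 for 8-bit, 4 for 4-bit quantization
    overhead:        multiplier for KV cache/activations (assumed ~20%)
    """
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    return weight_bytes * overhead / 1e9

# 13B at fp16 needs ~31 GB -- too big for a 16GB card
print(round(weight_footprint_gb(13, 16), 1))
# 13B at 4-bit needs ~8 GB -- fits comfortably
print(round(weight_footprint_gb(13, 4), 1))
```

This is why quantization dominates the local-LLM world: dropping from 16-bit to 4-bit cuts the footprint by 4x at a modest quality cost.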
Budget Build ($1,500-2,500)
GPU: RTX 4070 Ti Super (16GB VRAM)
- Can run 7B-13B models comfortably
- 30B-class models with 4-bit quantization (70B only with heavy CPU offload, and slowly)
CPU: AMD Ryzen 7 7700X
RAM: 64GB DDR5
Storage: 2TB NVMe SSD
What it runs:
- Llama 3 8B, or Llama 2 13B with 8-bit quantization
- Mistral 7B with large context
- CodeLlama for development tasks
Mid-Range Build ($3,000-5,000)
GPU: RTX 4090 (24GB VRAM)
- Sweet spot for most local LLM work
- Can run 30B-class models fully in VRAM; 70B with partial CPU offload
CPU: AMD Ryzen 9 7950X
RAM: 128GB DDR5
Storage: 4TB NVMe SSD
What it runs:
- Llama 2/3 70B at 4-bit with partial CPU offload (very low-bit quants can squeeze into 24GB)
- Multiple smaller models simultaneously
- Embedding models and vector DBs
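GPU compute matters less for single-user chat than memory bandwidth: during decoding, every generated token streams all the weights from VRAM once, so tokens/sec is roughly bandwidth divided by model size. A back-of-envelope sketch (the 1008 GB/s figure is the RTX 4090's published spec; the model size is illustrative):

```python
def decode_tokens_per_sec(model_gb: float, bandwidth_gbps: float) -> float:
    """Upper-bound estimate for batch-1 decoding: each token reads
    every weight from VRAM once, so speed ~ bandwidth / model size."""
    return bandwidth_gbps / model_gb

RTX_4090_BW = 1008  # GB/s, published memory bandwidth spec
# A 4-bit 30B-class model (~15 GB of weights) tops out near ~67 tok/s
print(round(decode_tokens_per_sec(15, RTX_4090_BW)))
```

Real throughput lands below this ceiling, but the ratio explains why a quantized model is faster as well as smaller.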
High-End Workstation ($8,000+)
GPU: Dual RTX 4090s or RTX 6000 Ada (48GB)
CPU: Threadripper or Xeon
RAM: 256GB+ DDR5
Storage: 8TB+ NVMe RAID
What it runs:
- 70B models at 4-bit entirely in VRAM
- Fine-tuning workflows
- Multi-user inference serving
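Fine-tuning is the reason this tier needs so much memory: full fine-tuning with Adam stores far more than the weights themselves. A rough sketch of the commonly cited estimate (fp16 weights and gradients, fp32 optimizer moments and master weights, ~16 bytes per parameter):

```python
def full_finetune_gb(params_billions: float, bytes_per_param: int = 16) -> float:
    """Rough full fine-tuning footprint: fp16 weights (2) + fp16 grads (2)
    + fp32 Adam moments (8) + fp32 master weights (4) ~= 16 bytes/param.
    A common estimate; ignores activations, which add more."""
    return params_billions * bytes_per_param  # billions x bytes = GB

# Even a 7B full fine-tune wants ~112 GB -- beyond any single card here
print(full_finetune_gb(7))
```

This is why even high-end rigs typically fine-tune via LoRA/QLoRA (frozen quantized weights plus a small trainable adapter) or offload optimizer state to that 256GB of system RAM.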
Cloud vs. Local Economics
At current API prices, a $4,000 local rig breaks even around 20M tokens/month. If you’re doing serious volume, local becomes attractive quickly.
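The break-even figure falls out of simple amortization arithmetic. A sketch (the $8-per-million-token blended API price and the two-year amortization window are assumptions for illustration, not quotes):

```python
def breakeven_mtokens_per_month(rig_cost: float, months: int,
                                api_price_per_mtok: float) -> float:
    """Millions of tokens/month at which local hardware matches API spend.
    Ignores electricity and maintenance; all prices are illustrative."""
    monthly_budget = rig_cost / months
    return monthly_budget / api_price_per_mtok

# $4,000 rig amortized over 24 months vs. an assumed $8 per million tokens
print(round(breakeven_mtokens_per_month(4000, 24, 8), 1))
```

Plug in your own API rates and time horizon; the shape of the answer is what matters, not the exact constants.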
Software Stack
- llama.cpp — Lightweight, fast CPU/GPU inference with GGUF quantized models
- Ollama — Easy local model management
- vLLM — High-throughput serving
- Text Generation WebUI — User-friendly interface
The Verdict
Start with cloud APIs. When you know your workload and volume, consider local for cost savings and control. The hardware pays for itself if you’re a heavy user.