Why Local LLMs Are Back in Serious Conversation
For a while, running models locally felt like a hobbyist pursuit — the quality gap between local models and GPT-4 class APIs was wide enough that serious production applications defaulted to the cloud. That gap has narrowed dramatically. Llama 3.1 405B, Mistral Large, and Qwen 2.5 72B all deliver output quality that is competitive for a wide range of production workloads, and they run on hardware that organizations can actually buy.
The business case has also shifted. Data residency requirements in healthcare, finance, and government now make cloud LLMs non-starters for many use cases. The economics have changed too: API costs at scale add up to meaningful infrastructure spend, while a server with four A100 GPUs amortizes nicely over two to three years of inference.
Ollama's growth reflects this shift. It crossed 10 million downloads in early 2025 and added OpenAI-compatible API endpoints, making it a drop-in local backend for any application already using the OpenAI SDK. The desktop client release in July 2025 made it accessible to non-technical users. The ecosystem has genuinely arrived.
The Ollama Architecture
Ollama runs as a local server exposing a REST API on port 11434. It handles model loading, quantization management, GPU allocation, and request queuing transparently. From an application perspective, you point your HTTP client at localhost instead of api.openai.com and the interface is nearly identical.
Model management is handled through a CLI: ollama pull llama3.1:70b downloads the model, applies GGUF quantization if needed, and makes it available for inference. Ollama maintains a model library at ollama.com with hundreds of models across families including Llama, Mistral, Gemma, Qwen, Phi, and Code Llama.
Performance depends heavily on hardware. A consumer M3 Max MacBook Pro with 64GB unified memory handles Llama 3.1 8B at around 60 tokens per second — fast enough for interactive use. Llama 3.1 70B drops to about 12 tokens per second on the same machine, which is acceptable for background processing but feels slow in interactive contexts. A dedicated server with an RTX 4090 (24GB VRAM) runs 70B models at 20–30 tokens per second with full-precision weights.
What Changes for Production
Running Ollama locally for development is straightforward. Running it in production requires thinking through several concerns that the tool itself doesn't address. Request concurrency is the first: Ollama queues requests by default and processes them sequentially per model. Under load, this produces unpredictable latency. For high-throughput applications, vLLM or llama.cpp with tensor parallelism across multiple GPUs is the better foundation.
Model versioning and reproducibility also require explicit management. Unlike cloud APIs where the model is pinned by version string, local deployments can drift as Ollama updates its quantization or base models receive updates. Pin specific model digests in production configurations and test before updating.
Monitoring deserves particular attention. Cloud providers surface token usage, latency, and error rates automatically. For local deployments, you need to instrument your Ollama instances explicitly — track GPU utilization, queue depth, time-to-first-token, and tokens-per-second. Without this telemetry, debugging production issues is guesswork.
Model Selection for Production Workloads
Not all local models are equal for production, and the right choice depends heavily on the task. For code generation and technical assistance, Code Llama 70B and DeepSeek Coder V2 consistently outperform general-purpose models of similar size. For instruction following and document processing, Llama 3.1 70B and Qwen 2.5 72B are strong generalists. For multilingual applications, Qwen 2.5 has notably stronger coverage across Asian languages than the Llama family.
Quantization is a significant lever. GGUF Q4_K_M quantization reduces memory requirements by roughly 60% with quality degradation that is minimal for most tasks. Q8_0 quantization is near-lossless but requires more VRAM. The tradeoff depends on your hardware budget and quality requirements — for most business document processing and summarization tasks, Q4_K_M is indistinguishable from full precision.
LM Studio, the main alternative to Ollama, deserves mention here. It offers a more polished UI for model experimentation and benchmarking, which makes it useful for model selection before committing to a production deployment. Once you've chosen a model, Ollama's API compatibility and lighter server footprint make it the better runtime choice for automated workloads.
The Hybrid Architecture Pattern
The most pragmatic production pattern we've seen is a hybrid routing layer: lightweight requests go to local models for latency and cost, complex reasoning tasks route to frontier API models where quality is paramount. A classification step at the ingress determines which path each request takes.
This approach captures the cost and privacy benefits of local inference for the bulk of requests — typically 70–80% of volume in enterprise applications — while maintaining frontier-model quality for the tasks where it genuinely matters. It requires more architectural work upfront, but the operational economics over time are compelling for applications at meaningful scale.