Llama 4: How Meta Changed the Open-Weights Game Again

THE TWO-MODEL STRATEGY

Meta released Llama 4 as a family rather than a single model, following the playbook established by Llama 3.1. The Scout variant uses a 16-expert Mixture of Experts (MoE) architecture with 17 billion active parameters per token and 109 billion in total: small enough to fit on a single NVIDIA H100 with Int4 quantisation, and punching significantly above its active parameter weight thanks to the MoE routing. The Maverick variant keeps the same 17 billion active parameters but scales to 400 billion total parameters with 128 experts, placing it in the same tier as DeepSeek V3 and squarely competitive with mid-tier proprietary API offerings.

Scout's target is latency-sensitive applications where inference speed matters as much as output quality — chatbots, real-time summarisation, interactive coding assistants, voice interfaces where response time is user-experience-critical. Maverick is for the cases where quality is the primary constraint and you're willing to pay the compute cost: complex reasoning, multi-document analysis, code generation across large codebases, research synthesis.

The architectural choice to use MoE across both variants reflects lessons learned from DeepSeek's efficiency demonstrations in 2025. Activating a fraction of the total parameters per forward pass dramatically improves tokens-per-second on equivalent hardware, making the quality-per-dollar proposition of open-weights models increasingly difficult for proprietary APIs to match at scale.
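To make "activating a fraction of the total parameters" concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration, not Meta's router: the scalar "experts", the random router logits, and the function names are all hypothetical, and a real MoE layer routes vectors through feed-forward blocks rather than scalars through lambdas.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=1):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, top_k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))

# Toy setup (assumption): 16 experts, each a simple scalar function.
NUM_EXPERTS = 16
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits, top_k=1)
output = moe_forward(2.0, experts, logits, top_k=1)
active_fraction = len(selected) / NUM_EXPERTS

print(f"experts used: {[i for i, _ in selected]}")
print(f"fraction of experts evaluated for this token: {active_fraction:.4f}")
```

Only the selected expert's function is ever called, which is where the compute saving comes from. Every expert still has to exist in the `experts` list, though, and that is the memory cost discussed later in this article.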

WHAT THE BENCHMARKS SHOW — AND DON'T SHOW

Llama 4 Maverick's benchmark performance on coding tasks is legitimately impressive. On HumanEval+, it scores competitively with Claude Sonnet and GPT-4o, closing a gap that Llama 3.1 405B left open. On the MATH benchmark, its reasoning capabilities reflect the training recipe improvements Meta incorporated from the reasoning model research that dominated 2025: more process reward modelling, more diverse chain-of-thought data, and better calibration between the model's confidence and its actual accuracy.

Benchmarks, as always, are a partial picture. The categories where Llama 4 still trails the frontier are instruction following on complex multi-constraint prompts and output formatting consistency under adversarial conditions. These gaps matter in production: an application that needs a model to reliably produce a specific JSON schema, follow a 15-point instruction set, and handle edge-case inputs gracefully will find Opus 4.7 or GPT-4o more reliable than Maverick on any given request. The question is whether the reliability gap justifies the cost difference at your specific scale.
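One rough way to frame the reliability-versus-cost question is expected spend per valid response when failed outputs are retried. The sketch below assumes each attempt succeeds independently with a fixed probability, which is a simplification. The prices and success rates are placeholders rather than measurements, and the function names are hypothetical.

```python
def expected_cost_per_success(cost_per_call, success_rate):
    """Expected spend to obtain one valid response when failures are retried.

    Assumes independent attempts, so the number of attempts is geometric
    with mean 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate

def break_even_success_rate(cheap_cost, premium_cost, premium_success_rate):
    """Success rate at which the cheaper model matches the premium model's effective cost."""
    return cheap_cost * premium_success_rate / premium_cost

# Placeholder inputs (assumptions): relative cost units and made-up success rates.
cheap = expected_cost_per_success(cost_per_call=1.0, success_rate=0.90)
premium = expected_cost_per_success(cost_per_call=4.0, success_rate=0.99)
threshold = break_even_success_rate(1.0, 4.0, 0.99)

print(f"cheaper model, effective cost per valid response: {cheap:.3f}")
print(f"premium model, effective cost per valid response: {premium:.3f}")
print(f"cheaper model breaks even at a success rate of {threshold:.4f}")
```

This ignores the latency of retries and the cost of a bad output that slips past validation. Either can dominate in customer-facing paths, so treat the result as a lower bound on what the reliability gap costs you.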

Scout's benchmark position is more nuanced. With 17 billion active parameters, it outperforms Llama 3.1 70B on most benchmarks despite activating roughly a quarter as many parameters per token; the MoE efficiency is real. Its 109 billion total parameters, however, make it larger than the 70B model in memory. For teams that were running Llama 3.1 70B as their local inference model, Scout offers better quality and faster inference, but it is a straightforward upgrade only if the hardware can hold the larger weight footprint.

THE LICENCE SITUATION THIS TIME

Llama 4 ships under a custom Meta licence that is meaningfully more permissive than some earlier iterations but still not an OSI-approved open-source licence. The key practical constraint: the licence attaches conditions to using Llama 4 or its outputs to train other language models, so read that clause closely before building on it. Commercial use is broadly permitted, including SaaS products and enterprise internal tools, without per-seat fees. There is no revenue threshold, but organisations with more than 700 million monthly active users must request a separate licence from Meta.

For most commercial applications, the licence is workable. For AI infrastructure companies whose business model involves using open-weights models as training data for competing models, it's a meaningful restriction. This was a deliberate choice — Meta wants the community benefit of open weights without enabling direct model distillation pipelines that would replicate their capability improvements without the research investment.

The practical advice for teams evaluating Llama 4: read the acceptable use policy, not just the licence headline. The acceptable use restrictions cover a specific list of prohibited applications (CSAM, weapons development, election interference infrastructure) that won't affect the vast majority of commercial builders. If your application is in a grey area, get legal review before committing to a Llama 4 dependency in production.

RUNNING IT IN PRODUCTION

Scout's 109 billion total parameters come to roughly 55GB at Q4 quantisation, so it fits on a single 80GB data-centre GPU such as an H100 or A100 but not in the 24GB of an NVIDIA RTX 4090; consumer cards need multiple GPUs or expert offloading to system RAM, at a cost in throughput. At BF16 the weights alone are about 218GB, which means several 80GB cards. Maverick requires multi-GPU setups in every configuration: 4× A100s (80GB) hold the roughly 200GB of 4-bit weights with room left for the KV cache, while 8-bit weights (about 400GB) and high-concurrency applications call for 8×. Ollama supports Llama 4, and vLLM's tensor parallelism implementation handles Maverick's multi-GPU requirements cleanly.

Memory management is more complex for MoE models than for dense models of equivalent active parameter counts because every expert's weights must be resident in GPU memory even though only a fraction are active per token: the router can send any token to any expert. Plan your inference infrastructure around total parameter count, not active parameter count, when sizing hardware. A Maverick instance uses far more memory than a dense 70B model, despite activating only 17 billion parameters per token.
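The sizing rule can be turned into back-of-envelope arithmetic. The sketch below estimates weight memory from total parameter count and precision, then the minimum number of GPUs needed to hold the weights. The 109 billion total for Scout is Meta's published figure and the 400 billion for Maverick is as stated above. The 20% per-GPU headroom for KV cache and activations is an assumption, the bytes-per-parameter values ignore quantisation metadata, and the function names are hypothetical.

```python
import math

# Approximate storage per parameter; real quantised formats add some overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params_billions, precision):
    """Weight footprint in GB (1 GB = 1e9 bytes), sized on TOTAL parameters."""
    return total_params_billions * BYTES_PER_PARAM[precision]

def gpus_needed(total_params_billions, precision, gpu_memory_gb,
                overhead_fraction=0.2):
    """Minimum GPU count to hold the weights.

    overhead_fraction (assumption) reserves part of each GPU for KV cache,
    activations, and runtime buffers.
    """
    usable_gb = gpu_memory_gb * (1.0 - overhead_fraction)
    return math.ceil(weight_memory_gb(total_params_billions, precision) / usable_gb)

MODELS = {"Scout": 109, "Maverick": 400, "dense 70B": 70}

for name, total_b in MODELS.items():
    for precision in ("int4", "int8", "bf16"):
        gb = weight_memory_gb(total_b, precision)
        n = gpus_needed(total_b, precision, gpu_memory_gb=80)
        print(f"{name:>10} {precision:>5}: {gb:6.1f} GB weights, >= {n} x 80GB GPU")
```

With these assumptions Scout at 4-bit needs one 80GB card, and Maverick at 4-bit needs four. These are floors for holding the weights, not throughput targets; long contexts and high concurrency push the KV cache well past the 20% headroom.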

Cloud providers had Llama 4 available within a week of the weight release across Together AI, Fireworks AI, and Groq. For teams that want Llama 4 quality without managing GPU infrastructure, the hosted options are mature and competitively priced — typically 3–5x cheaper per token than equivalent Claude or GPT-4o API calls, with the quality tradeoff that the benchmark comparisons describe.

WHAT THIS CHANGES FOR INDEPENDENT TEAMS

The cumulative effect of Llama 4 alongside DeepSeek V3 and Qwen 2.5 is a decisive shift in the open-weights quality ceiling. In 2023, the honest assessment was that open-weights models were good enough for prototypes and low-stakes internal tools but not for customer-facing applications where quality was a differentiator. In 2026, that assessment no longer holds as a blanket statement.

For teams building data-sensitive applications where proprietary API data handling terms create compliance friction, Llama 4 Maverick running on-premise is now a genuinely viable production alternative for many use cases. For teams running high-volume workloads where API costs are a material operating expense, the economics of self-hosted Llama 4 at scale are increasingly compelling. The main remaining argument for proprietary APIs in these contexts is the reliability and instruction-following consistency advantage — which is real, but narrowing.
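Whether self-hosting is "compelling" depends on volume, and the break-even point is simple to estimate. The sketch below compares a fixed monthly GPU bill against per-token API pricing. Every number in it is a placeholder rather than a quoted price, and the function names are hypothetical.

```python
def monthly_self_host_cost(gpu_count, gpu_hourly_rate, hours_per_month=730.0):
    """Fixed monthly cost of keeping gpu_count GPUs running around the clock."""
    return gpu_count * gpu_hourly_rate * hours_per_month

def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """Usage-based monthly cost of a hosted API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def break_even_tokens(gpu_count, gpu_hourly_rate, price_per_million_tokens,
                      hours_per_month=730.0):
    """Monthly token volume above which self-hosting is cheaper than the API."""
    fixed = monthly_self_host_cost(gpu_count, gpu_hourly_rate, hours_per_month)
    return fixed / price_per_million_tokens * 1_000_000

# Placeholder inputs (assumptions): 8 GPUs at 2.00 per GPU-hour,
# API priced at 1.00 per million tokens.
fixed_cost = monthly_self_host_cost(gpu_count=8, gpu_hourly_rate=2.0)
volume = break_even_tokens(gpu_count=8, gpu_hourly_rate=2.0,
                           price_per_million_tokens=1.0)

print(f"self-hosting fixed cost per month: {fixed_cost:,.0f}")
print(f"break-even volume: {volume:,.0f} tokens per month")
```

The calculation leaves out engineering time, idle capacity, and whether the cluster can actually sustain the break-even throughput, so it flatters self-hosting. It is still a useful first filter before a more careful model.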