Llama 3.1

What is it?

Llama 3.1 is Meta’s open-weight large language model family, available in 8B, 70B, and 405B parameter sizes. The models are free to use commercially, can be fine-tuned, and run on your own infrastructure. The 70B variant approaches GPT-4 performance on many benchmarks while fitting on a single high-end GPU or a modest multi-GPU setup.

Why does it matter?

Open-weight models fundamentally change the economics and governance of AI. Instead of sending your data to an API provider, you run the model on your own hardware. This matters for healthcare, finance, legal, and any domain where data can’t leave your infrastructure. It also eliminates per-token API costs at scale — once you’re past the infrastructure investment, marginal inference cost drops dramatically.

Trade-offs

Strengths:

70B model competitive with GPT-4 on reasoning and coding benchmarks
Full control over data — no external API calls
Fine-tuning enabled for domain-specific optimization
Active open-source community (quantized versions, optimized inference)
No per-token costs beyond infrastructure

Limitations:

Self-hosting requires GPU infrastructure expertise
70B needs ~40GB VRAM (A100/H100 or multi-GPU consumer cards)
405B is impractical for most teams to self-host
No built-in tool use or function calling (requires custom implementation)
Keeping up with model releases and optimization techniques is a full-time job

Our take

Llama stays at Trial — the right choice for teams with specific data sovereignty requirements or high-volume inference workloads where API costs become prohibitive. For most startups, the operational overhead of self-hosting doesn’t justify the savings versus using Claude or GPT-4 APIs. But if you’re processing millions of requests or handling sensitive data, Llama 70B on your own infrastructure is a credible production option.

What is it?

Why does it matter?

Trade-offs

Our take

Stay sharp