What is it?

Fine-tuning adapts a pre-trained language model to your specific domain by training it further on your data. This can mean full fine-tuning (updating all model weights), LoRA (low-rank adaptation, which freezes the base weights and trains small adapter matrices in selected layers), or distillation (training a smaller model to mimic a larger one). The goal is better performance on your specific task than prompting alone can achieve.
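The low-rank idea behind LoRA can be sketched in a few lines of NumPy. This is an illustration of the mechanism, not any library's API; all names, shapes, and hyperparameter values below are made up for the example:

```python
import numpy as np

# LoRA in miniature: instead of updating a frozen weight matrix W
# (d_out x d_in), train two small matrices A (r x d_in) and B (d_out x r)
# with rank r << d_in, so only r * (d_in + d_out) parameters change.
rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: the
                                            # adapter starts as a no-op
alpha = 16                                  # scaling hyperparameter

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without ever
    # materializing a full-rank update to W.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(lora_params / full_params)  # fraction of parameters actually trained
```

With these toy shapes the adapter trains about 2% of the parameters of the full matrix, which is why LoRA fits on modest GPUs: only A and B (and their optimizer state) need gradients.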

Why does it matter?

When RAG and prompt engineering aren’t enough — when the model needs to learn a specific output format, adopt a particular reasoning style, or handle domain-specific terminology consistently — fine-tuning is the next tool in the arsenal. It can also reduce inference costs by teaching a smaller model to perform tasks that otherwise require a larger, more expensive model.
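The cost-reduction path usually runs through distillation, whose core loss is simple enough to sketch. A sketch of the standard soft-target loss, with illustrative logits and a hypothetical temperature value (nothing here comes from a specific framework):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # Soften both distributions with temperature T, then take the
    # cross-entropy of the student against the teacher's soft targets.
    # Higher T exposes more of the teacher's relative preferences
    # among non-top classes.
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.5])   # confident large model
aligned = np.array([3.8, 1.1, 0.4])   # student that mimics the teacher
off     = np.array([0.5, 4.0, 1.0])   # student that disagrees

print(distillation_loss(teacher, aligned) < distillation_loss(teacher, off))
```

Training the student to minimize this loss (often mixed with ordinary cross-entropy on ground-truth labels) is what lets a small model inherit task behavior from a large one.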

Trade-offs

Strengths:

  • Can achieve higher quality than prompting alone for specific tasks
  • Smaller fine-tuned models can replace larger general models (cost savings)
  • Teaches consistent output formats and domain-specific behavior
  • LoRA makes fine-tuning accessible with modest GPU resources

Limitations:

  • Requires high-quality training data (garbage in, garbage out amplified)
  • Training costs are significant (compute time, data preparation, iteration)
  • Risk of catastrophic forgetting (model loses general capabilities)
  • Evaluation is difficult — how do you measure “better” for open-ended generation?
  • Model updates from providers invalidate your fine-tuned weights

Our take

Fine-tuning moved down to Assess because we’ve seen too many teams jump to it prematurely. The decision tree should be: (1) try better prompting, (2) try RAG, (3) try few-shot examples, (4) only then consider fine-tuning. In our experience, 80% of “we need to fine-tune” cases are actually “we need better retrieval” cases. Reserve fine-tuning for when you have clear evidence that the base model can’t learn your task from prompting and you have a robust evaluation pipeline to measure improvement.