Why Fine-Tune

The commitwriter model works because the job is small. 3 example pairs in the Modelfile are enough to lock in Conventional Commits format, and the model holds that format reliably.

FROM granite4.1:3b  

# [...truncated...]

MESSAGE user added a retry with backoff to the HTTP client and a test for it

MESSAGE assistant feat(http): add retry with backoff to client

MESSAGE user fixed a typo in the README install section

MESSAGE assistant docs(readme): fix typo in install section

MESSAGE user renamed getUser to fetchUser across the codebase, no behavior change

MESSAGE assistant refactor: rename getUser to fetchUser

That is the ceiling of few-shot prompting in our context. Knowing why it is the ceiling tells you when to stop prompting and start fine-tuning.

The examples you put in the Modelfile are not free. Every MESSAGE pair gets prepended to the conversation on each request, so the model reads all of them again before it sees the actual diff. You pay for that in tokens, in prompt-processing time, and in context budget, on every single call. 3 examples is a rounding error, but 30 is not.

The deeper limit is what few-shot actually does. It steers the model, it does not teach it. The base model's weights never change. You are reminding it what good output looks like, and a capable base model can follow a short, clear pattern. For a tight, well-specified format like a commit message, a short reminder is all it takes, which is why the commitwriter Modelfile is the correct tool for that job.

Few-shot breaks down on two fronts, the first one is volume, and the second one is novelty:

Volume: when a behavior only becomes reliable after dozens of examples, those examples cannot fit in the Modelfile or the context window without crowding out the input you actually care about.

Novelty: when the task involves a format or skill the base model has never encountered, steering has nothing to grab onto, because there is no learned behavior to nudge toward.

Consider a model that turns plain-English requests into a proprietary internal DSL, a custom syntax your team uses to define alerting rules. A request like page me if the error rate goes over 5 percent for 10 minutes has to become valid rule syntax with the right operators, thresholds, and nesting. The base model has never seen this grammar. Three examples cannot teach it. You would need hundreds, covering the operators, the edge cases, and the ways rules combine, and hundreds of examples do not fit in a prompt and would cost you on every request if they did. This is the case where fine-tuning is the right move and the only practical one.

Fine-tuning writes the behavior into the weights. You show the model many examples once, during training, and the pattern becomes part of the model itself. The examples never ride along in the request, so each call stays cheap. And because the model learns the underlying pattern rather than memorizing 3 samples, it generalizes to inputs you never demonstrated.

The decision comes down to a few questions:

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next