The KV Cache

The KV cache is the model's short-term memory during generation. As the model reads each token, it computes that token's key and value vectors (the "K" and "V" in the name) in every attention layer and stores them in the cache. Every following token reuses that stored work instead of redoing it from scratch.

When producing the next token, the model attends over every token already in the context: the system prompt, the full chat history, your latest message, and whatever it has generated so far. Without a cache, it would recompute the keys and values for all of those tokens on every single generation step. The cost of each step would then grow quadratically with context length, which quickly becomes unusable.

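The cache fixes this. As the model processes each token, it saves that token's keys and values into memory once, and every subsequent token reuses them. Each new token then needs only its own keys and values computed, so that part of the work stays constant per token. The one cost that still grows is comparing the new token against the cached entries, which scales linearly with context length rather than quadratically.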
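
To make the saving concrete, here is a minimal sketch of one attention layer with a single head, written in plain Python. It is a toy under stated assumptions, not how any particular runtime implements its cache: the vector size `D`, the random weight matrices `W_q`, `W_k`, `W_v`, and the helper names are all made up for illustration. It decodes the same six token vectors twice, once recomputing keys and values at every step and once caching them, and counts the key/value projections each approach performs.

```python
import math
import random

D = 4  # toy vector size, chosen only for illustration

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over stored keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

rng = random.Random(0)

def rand_matrix():
    """Hypothetical stand-in for a learned projection matrix."""
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def decode_without_cache(tokens):
    """Recompute keys and values for the whole context at every step."""
    outputs, kv_projections = [], 0
    for t in range(1, len(tokens) + 1):
        context = tokens[:t]
        keys = [matvec(W_k, x) for x in context]
        values = [matvec(W_v, x) for x in context]
        kv_projections += 2 * len(context)
        outputs.append(attend(matvec(W_q, context[-1]), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens):
    """Compute each token's key and value once, then reuse them."""
    outputs, kv_projections = [], 0
    cached_keys, cached_values = [], []
    for x in tokens:
        cached_keys.append(matvec(W_k, x))
        cached_values.append(matvec(W_v, x))
        kv_projections += 2
        outputs.append(attend(matvec(W_q, x), cached_keys, cached_values))
    return outputs, kv_projections

slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)  # 42 12
```

Both functions produce the same attention outputs, but the uncached version performs 42 key/value projections for six tokens while the cached version performs 12. The gap widens as the context gets longer. In a real multi-layer model the uncached cost is worse than this toy suggests, since every layer would have to be rerun for every earlier token.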

There are two things to know about the cache in practice:
