Feedback

Chat Icon

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache
19%

What Is GGUF and Why Does It Exist?

Before GGUF, running a model meant juggling several files, including but not limited to:

Tokenizer

The rules for splitting text into tokens and mapping them to IDs.

Model's Settings

The architecture details (layer count, hidden size (aka dimension), and other parameters).

Generation Defaults

The default sampling parameters (temperature, top_p, etc.) the model ships with.

Chat Template

A chat model needs to know who said what. The chat template is the formatting recipe that labels each message as coming from the user, the assistant, or the system (the hidden instructions that set the model's behavior). Different model families use different recipes, kind of like how letters, emails, and text messages all communicate but follow different conventions.

Example of a template that has 3 parts:

<system>
    {{ system_message }}
</system>

{% for message in messages %}
    <{{ message.role }}>
        {{ message.content }}
    </{{ message.

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.