Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache

What Is a Weight?

A layer is two things working together:

  • A matrix: a grid of numbers, like a spreadsheet of values
  • An activation function: a small curve applied to each number after the matrix does its work

The matrix takes every incoming number, multiplies each by a weight from the grid, and sums those products to produce each output number. For example, with 3 inputs (x1, x2, x3) and weights (w1, w2, w3), one output is computed as: (x1 * w1) + (x2 * w2) + (x3 * w3). A layer with many outputs does this calculation once per output, each with its own row of weights. This operation is called matrix multiplication.

Matrix multiplication
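To make the weighted-sum idea concrete, here is a minimal sketch of one layer in plain Python. The function names, the input values, and the weights are made up for illustration, and ReLU is only one possible activation function, chosen here as an assumption because it is easy to read.

```python
def relu(value):
    # Assumed activation for this sketch: negative sums become 0.
    return max(0.0, value)


def layer_forward(inputs, weight_rows):
    # Each row of weights produces one output number.
    outputs = []
    for row in weight_rows:
        # (x1 * w1) + (x2 * w2) + (x3 * w3) for this row
        total = sum(x * w for x, w in zip(inputs, row))
        outputs.append(relu(total))
    return outputs


# Hypothetical example: 3 inputs, 2 outputs (so 2 rows of 3 weights).
inputs = [1.0, 2.0, 3.0]
weight_rows = [
    [0.5, -1.0, 1.0],
    [-1.0, 0.0, 0.25],
]
outputs = layer_forward(inputs, weight_rows)
print(outputs)  # [1.5, 0.0]
```

The first row sums to 1.5 and passes through unchanged. The second row sums to -0.25, which the activation turns into 0.0.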

The numbers inside all those matrices are called weights, and they hold the entire knowledge of the model.

When you see "8B" or "70B" next to a model name, that's the parameter count: 8 billion or 70 billion individual weights. Training is the process of adjusting those weights, one tiny nudge at a time, across billions of examples, until the network produces good predictions. Once training is done, the weights are frozen (no longer changing), saved to a file, and shipped.
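As a rough illustration of what a parameter count measures, the sketch below counts the numbers stored in a few weight matrices. The layer shapes are hypothetical and tiny, and the count ignores anything a real model stores besides matrix entries.

```python
def count_weights(layer_shapes):
    # A matrix with `rows` rows and `cols` columns holds rows * cols weights.
    return sum(rows * cols for rows, cols in layer_shapes)


# Hypothetical toy network: three small matrices.
layer_shapes = [(3, 4), (4, 4), (4, 2)]
total_weights = count_weights(layer_shapes)
print(total_weights)  # 36
```

An "8B" model is the same idea at scale: add up every entry of every matrix and the total comes to about 8 billion.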

(i) When you download a model, you're downloading its weights (plus a small amount of config and metadata). Running the model is called inference: your computer turns your input into numbers, then pushes those numbers through the model's layers, doing the math against those frozen weights to produce the output.

The outputs of a layer are not text. They are signals passed to the next layer. Only the final layer is decoded into text, and it has a special shape: it produces one score for every token in the model's vocabulary. With a vocabulary of 128,000 tokens, that is 128,000 scores, each one a raw rating (a logit) for a candidate next token. Those scores are converted to probabilities, one token is selected from that distribution, and the selected token is an integer ID. The tokenizer maps that ID back to a piece of text, such as a word, a fragment, or a punctuation mark. The chosen token is then fed back in as input, and the model runs again to produce the next token, one step at a time.

From scores to text
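The path from scores to a token can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular runtime implements it: the 4-entry vocabulary and the logit values are invented, softmax is assumed as the conversion from scores to probabilities, and plain weighted random sampling stands in for whatever selection strategy a real runtime applies.

```python
import math
import random


def softmax(logits):
    # Subtract the largest score first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token_id(logits, rng):
    # Turn raw scores into probabilities, then draw one token ID from them.
    probabilities = softmax(logits)
    token_ids = list(range(len(logits)))
    return rng.choices(token_ids, weights=probabilities, k=1)[0]


# Hypothetical 4-token vocabulary; a real model has tens of thousands or more.
vocabulary = ["cat", "dog", ".", " the"]
logits = [2.0, 1.0, 0.1, -1.0]  # one raw score per vocabulary entry

rng = random.Random(0)  # seeded so repeated runs pick the same token
probabilities = softmax(logits)
token_id = sample_token_id(logits, rng)
print(probabilities)
print(token_id, repr(vocabulary[token_id]))
```

The last line plays the role of the tokenizer: it maps the selected integer ID back to a piece of text. In a full generation loop, that token would be appended to the input and the model would run again.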

Why This Matters in Practice

Once you accept that "a model = a big pile of numbers + a recipe for multiplying them", several things stop being mysterious. A few common questions answer themselves:

Why are model files huge?

The file size is roughly the parameter count times the bytes per weight:

file size = number of weights * bytes per weight

Most models ship at 16 bits per weight (called half precision). That is 2 bytes per weight (16 bits / 8 bits per byte), so an 8-billion-parameter model lands at about 16 GB:

# 16 bits = 2 bytes
8,000,000,000 * 2 bytes = 16,000,000,000 bytes = 16 GB

At this precision there's no compression involved; the file is essentially the weights, plus a small amount of tokenizer data, config, and metadata.
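The same arithmetic as a small Python helper, in case you want to try other model sizes. The function name is made up for this sketch, 1 GB is taken as 1,000,000,000 bytes to match the calculation above, and the result covers the weights only, not the tokenizer data, config, or metadata.

```python
def model_file_size_bytes(num_weights, bits_per_weight):
    # file size = number of weights * bytes per weight, with 8 bits per byte
    return num_weights * bits_per_weight // 8


BYTES_PER_GB = 1_000_000_000  # decimal gigabytes, as in the text above

size_8b = model_file_size_bytes(8_000_000_000, 16)
size_70b = model_file_size_bytes(70_000_000_000, 16)

print(size_8b / BYTES_PER_GB)   # 16.0
print(size_70b / BYTES_PER_GB)  # 140.0
```

At the same 16 bits per weight, a 70B model comes out at about 140 GB.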

Why do RAM and VRAM requirements track model size?
