Choosing the Right Number of Parallel Slots

The right value depends on memory headroom and how much per-request latency you can tolerate. The math is the same in both cases: the weights take a fixed amount of memory, and the KV cache scales linearly with OLLAMA_NUM_PARALLEL × num_ctx.
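To make the linear scaling concrete, here is a minimal sketch that estimates KV cache size for a few slot counts. The architecture defaults (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions matching a Llama-3-8B-style model, and the function name is hypothetical. Real usage also includes compute buffers, which this ignores.

```python
def kv_cache_bytes(num_parallel, num_ctx, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    Defaults are assumptions for a Llama-3-8B-style model with an f16 cache.
    """
    # Each token stores one key and one value vector per layer per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # Total cached tokens grow with slots times context length.
    return per_token * num_ctx * num_parallel


GIB = 1024 ** 3

for slots in (1, 2, 4, 8):
    size = kv_cache_bytes(num_parallel=slots, num_ctx=8192)
    print(f"{slots} slots x 8192 ctx: {size / GIB:.1f} GiB of KV cache")
```

Under these assumptions each slot at an 8192-token context costs about 1 GiB, so doubling either the slot count or num_ctx doubles the cache.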

On GPU, VRAM is the hard limit, and compute scales reasonably well up to a point. Rough starting points:

VRAM	Typical hardware	Starting slots
8 GB	Consumer cards, small models only	1 to 2
16 GB	Mid-range cards, 7B-8B at Q4	2 to 4
24 GB	RTX 4090, RTX 3090, 7B-8B with room	4 to 8
48 GB+	RTX A6000, dual GPUs, datacenter cards	8 to 16+
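A memory-only ceiling for a given card can be sketched as below. The weight size, per-slot KV cost, and reserve are illustrative assumptions rather than measured values, and the helper name is hypothetical. The result is an upper bound: the starting points in the table are lower because per-request latency also limits how many slots are useful.

```python
def max_slots(vram_gib, weights_gib, kv_gib_per_slot, reserve_gib=1.0):
    """Largest slot count whose KV cache fits next to the weights.

    reserve_gib is an assumed allowance for compute buffers and other
    overhead; tune it for your own setup.
    """
    free = vram_gib - weights_gib - reserve_gib
    if free <= 0 or kv_gib_per_slot <= 0:
        return 0
    return int(free // kv_gib_per_slot)


# Assumed example: roughly 5 GiB of weights (an 8B model at Q4)
# and about 1 GiB of KV cache per slot.
for vram in (8, 16, 24, 48):
    slots = max_slots(vram, weights_gib=5, kv_gib_per_slot=1)
    print(f"{vram} GiB VRAM: memory ceiling of {slots} slots")
```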

The numbers above are only rough estimates, so base your decision on testing: raise the value, restart the server, and watch the output of ollama ps.
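The check after each restart could be scripted along these lines. This is a sketch, not a definitive tool: it assumes the ollama CLI is on PATH, and it assumes that a row mentioning CPU means part of the model was pushed off the GPU. Both helper names are hypothetical.

```python
import shutil
import subprocess


def ollama_ps():
    """Return the text printed by `ollama ps`, or None if the CLI is missing."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(["ollama", "ps"], capture_output=True,
                            text=True, timeout=30)
    return result.stdout


def lines_mentioning_cpu(ps_output):
    """Return model rows that mention CPU (assumed to signal offload)."""
    rows = ps_output.splitlines()[1:]  # skip the header row
    return [row for row in rows if "CPU" in row]


output = ollama_ps()
if output is None:
    print("ollama CLI not found on PATH")
else:
    print(output)
    for row in lines_mentioning_cpu(output):
        print("Possible CPU offload:", row)
```

If a row is flagged after raising OLLAMA_NUM_PARALLEL, that suggests the larger KV cache no longer fits in VRAM, and stepping the value back down is a reasonable next move.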
