Feedback

Chat Icon

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Concurrency: Parallel Requests and the Queue
54%

Choosing the Right Number of Parallel Slots

The right value depends on memory headroom and how much per-request latency you can tolerate. The math is the same in both cases: weights stay constant, and KV cache scales linearly with OLLAMA_NUM_PARALLEL × num_ctx.

On GPU, VRAM is the hard limit, and compute scales reasonably well up to a point. Rough starting points:

VRAMTypical hardwareStarting slots
8 GBConsumer cards, small models only1 to 2
16 GBMid-range cards, 7B-8B at Q42 to 4
24 GBRTX 4090, RTX 3090, 7B-8B with room4 to 8
48 GB+RTX A6000, dual GPUs, datacenter cards8 to 16+

While the numbers above are just rough estimations, you can base your decisions on testing: raise the value, restart the server, watch ollama ps:

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.