Concurrency: Parallel Requests and the Queue
How They Interact
The flow on a busy server:
1. A request arrives.
2. If a parallel slot is free on the target model, the request starts immediately.
3. If all slots are busy and the queue has room, the request waits in the queue.
4. If the queue is full, the request is rejected with an HTTP 503 error.
5. As slots free up, queued requests are picked up in order.
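The flow above can be sketched as a toy scheduler. This is only an illustrative model of the admission logic, not Ollama's actual implementation; the class `ModelScheduler` and its methods are hypothetical names, and the two constructor arguments merely stand in for OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE.

```python
from collections import deque


class ModelScheduler:
    """Toy model of the admission flow for one model (hypothetical, not Ollama code)."""

    def __init__(self, num_parallel: int, max_queue: int):
        self.num_parallel = num_parallel  # stands in for OLLAMA_NUM_PARALLEL
        self.max_queue = max_queue        # stands in for OLLAMA_MAX_QUEUE
        self.running = set()              # requests currently holding a slot
        self.queue = deque()              # requests waiting, oldest first

    def submit(self, request_id: str) -> str:
        # A free slot means the request starts immediately.
        if len(self.running) < self.num_parallel:
            self.running.add(request_id)
            return "started"
        # All slots busy, but the queue has room: wait.
        if len(self.queue) < self.max_queue:
            self.queue.append(request_id)
            return "queued"
        # Queue is full: the server would answer with HTTP 503.
        return "rejected (503)"

    def finish(self, request_id: str):
        # Free the slot, then promote the oldest queued request, if any.
        self.running.remove(request_id)
        if self.queue:
            promoted = self.queue.popleft()
            self.running.add(promoted)
            return promoted
        return None


# Two slots and a queue of one: the fourth request has nowhere to go.
sched = ModelScheduler(num_parallel=2, max_queue=1)
results = [sched.submit(r) for r in ["a", "b", "c", "d"]]
promoted = sched.finish("a")  # "c" leaves the queue and takes the freed slot
```

With these small limits, the first two requests start, the third queues, and the fourth is rejected. Raising either limit changes where that cutoff falls.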
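The three knobs you actually tune for a multi-user setup are OLLAMA_KEEP_ALIVE (how long a model stays loaded after its last request), OLLAMA_NUM_PARALLEL (how many requests each model handles at once), and OLLAMA_MAX_QUEUE (how many requests can wait before new ones are rejected).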
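As a minimal sketch of how these three variables might be set before starting the server from a script, the helper below builds an environment dictionary. The function name `ollama_env` is hypothetical and the values are placeholders chosen for illustration, not recommendations; tune them to your hardware and user count.

```python
import os


def ollama_env(keep_alive="30m", num_parallel=4, max_queue=128, base=None):
    """Return a copy of the environment with the three knobs set.

    The default values here are illustrative assumptions only.
    """
    env = dict(os.environ if base is None else base)
    env.update({
        "OLLAMA_KEEP_ALIVE": str(keep_alive),
        "OLLAMA_NUM_PARALLEL": str(num_parallel),
        "OLLAMA_MAX_QUEUE": str(max_queue),
    })
    return env


# Built on an empty base here so the result is easy to inspect.
env = ollama_env(base={})

# To launch the server with these settings, something like the following
# should work, assuming the `ollama` binary is on PATH:
#   import subprocess
#   subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

Environment variables are read when the server starts, so a running server needs a restart to pick up new values.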
