Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Concurrency: Parallel Requests and the Queue

How They Interact

The flow on a busy server:

  1. A request arrives.

  2. If a parallel slot is free on the target model, the request starts immediately.

  3. If all slots are busy and the queue has room, the request queues.

  4. If the queue is full, the request is rejected with an HTTP 503 (server overloaded) response.

  5. As slots free up, queued requests are picked up in order.
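The steps above can be sketched as a small simulation. This is a toy model for a single loaded model, not Ollama's actual scheduler: the class name `ModelScheduler`, its methods, and the status strings are hypothetical and exist only to make the slot-then-queue-then-reject order concrete.

```python
from collections import deque


class ModelScheduler:
    """Toy admission model for one loaded model (illustrative only)."""

    def __init__(self, num_parallel: int, max_queue: int):
        self.num_parallel = num_parallel  # parallel slots on this model
        self.max_queue = max_queue        # how many requests may wait
        self.running = set()              # request ids currently in a slot
        self.waiting = deque()            # FIFO queue of waiting request ids

    def submit(self, request_id: str) -> str:
        """Apply steps 2-4: start, queue, or reject a new request."""
        if len(self.running) < self.num_parallel:
            self.running.add(request_id)
            return "started"
        if len(self.waiting) < self.max_queue:
            self.waiting.append(request_id)
            return "queued"
        return "rejected (503)"

    def finish(self, request_id: str):
        """Apply step 5: free a slot and promote the oldest queued request.

        Returns the id of the promoted request, or None if the queue is empty.
        """
        self.running.remove(request_id)
        if self.waiting:
            next_id = self.waiting.popleft()
            self.running.add(next_id)
            return next_id
        return None


# Two slots, one queue position, four requests arriving back to back.
scheduler = ModelScheduler(num_parallel=2, max_queue=1)
outcomes = [scheduler.submit(r) for r in ["a", "b", "c", "d"]]
promoted = scheduler.finish("a")
```

With two slots and a queue of one, the first two requests start, the third queues, and the fourth is rejected. When request "a" finishes, the queued request "c" takes its slot.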

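The three knobs you actually tune for a multi-user setup are OLLAMA_KEEP_ALIVE (how long a model stays loaded after its last request), OLLAMA_NUM_PARALLEL (the number of parallel slots per model), and OLLAMA_MAX_QUEUE (how many requests may wait before new ones are rejected).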
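One way to set these three variables is to build the server's environment in a launcher script. The sketch below is a minimal example: the helper names `build_ollama_env` and `start_server` are hypothetical, and the values shown are placeholders rather than recommendations, so size them to your own hardware and user count.

```python
import os
import subprocess


def build_ollama_env(keep_alive, num_parallel, max_queue, base=None):
    """Return a copy of the environment with the three concurrency knobs set."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_KEEP_ALIVE"] = str(keep_alive)      # e.g. "10m"
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)  # parallel slots per model
    env["OLLAMA_MAX_QUEUE"] = str(max_queue)        # waiting requests before 503
    return env


def start_server(env):
    """Start `ollama serve` with the given environment (assumes ollama is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Placeholder values for illustration only.
env = build_ollama_env(keep_alive="10m", num_parallel=4, max_queue=128,
                       base={"PATH": "/usr/bin"})
# To launch for real, pass a full environment: start_server(build_ollama_env("10m", 4, 128))
```

The server reads these variables at startup, so a change takes effect only after the server is restarted.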
