The Context Window: num_ctx

Everything the template just rendered has to fit inside num_ctx: the system prompt, every prior turn, the tool definitions if any, plus headroom for the reply. Exceed it and Ollama drops messages from the front of the history without telling you.

Here is a random example of what fills the window on a single tool-calling turn, with num_ctx at 4096:

[SYSTEM]  ~40 tokens
You are a helpful assistant. Use tools when needed.

[TOOLS] ~110 tokens
get_weather(city: string) -> returns current
temperature and conditions for a city.

[USER]  ~10 tokens
What's the weather in Paris?

[ASSISTANT]  (tool call)  ~20 tokens
get_weather(city="Paris")

[TOOL]  (result fed back in)  ~25 tokens
{"temp_c": 14, "conditions": "light rain"}

[USER]  ~12 tokens
And in Tokyo?

Running total going into this turn: ~217 tokens. The model then generates its
reply inside whatever is left: 4096 - 217 = 3879 tokens of headroom.

Context window

num_ctx values vary widely by model. Here are some examples:

deepseek-v4-flash: 1M tokens.
qwen3.6: 256K tokens.
gpt-oss: 128K tokens.
granite3.3: 128K tokens.
smollm2: 8K tokens.
smollm: 2K tokens.

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next