Cost Notes

Summarization is not free. Each time the middleware triggers, it makes a separate model call to compress the old messages. On a small local model that's typically a one to three second pause, perceptible but not painful. If you switch to a larger model for the worker but want summarization to stay fast, point the summarizer parameter at a smaller model you have pulled locally (e.g., qwen2.5:0.5b or llama3.2:1b). If you're using a GPU, you have more power to spare, so point the summarizer

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next