Pass 4: Trim the History so It Never Outgrows the Model's Context Window

The REPL we have now is correct, but it has a quiet problem that surfaces only when you actually use it for a while. Every turn re-sends the full conversation history to the server, which means turn 20 sends roughly 20 times more text than turn 1. The server has to re-evaluate that entire prompt before it can start generating the reply, so each turn gets slower than the one before it. Eventually you hit a harder limit: the model's context window. When the prompt grows past it, the server silently drops the oldest tokens to make room, and the model loses the start of the conversation. There's no error or warning, just a model that suddenly forgets things you told it ten minutes ago.

The right way to handle this depends on what you're building. For a basic REPL we'll use the simplest strategy that works for now: cap the total size of the conversation by character count, and drop the oldest user/assistant pairs once we exceed the cap. It's crude, it's not what you'd ship to production, but it solves the immediate problem with a dozen lines of code, and it's a good baseline to understand before you reach for anything more sophisticated.

Step 1: Define a Size Budget

At the top of the file we set a maximum total character count for the history:

MAX_HISTORY_CHARS = 8000

Characters are a rough proxy for tokens (roughly 4 chars per token for English). 8000 chars is about 2000 tokens, which leaves plenty of room for the model's reply inside a typical context window.

The character threshold is a deliberate simplification. Tokens are what the model actually counts, and the conversion depends on the tokenizer.

Step 2: Write the Trimming Function

trim_history(messages, max_chars) is the new piece of logic. It returns a shortened copy of messages that fits the budget by dropping the older turns.

There are 2 important rules:

Rule 1: protect the system message.

If the very first entry has role == "system", we hold it aside and never drop it. System messages set the model's overall behavior, and removing one mid-conversation changes how the model responds:

has_system = (
    bool(messages)
    and messages[0].get("role") == "system"
)
head = messages[:1] if has_system else []
body = messages[1:] if has_system else messages[:]

head is the protected part (zero or one message). body is everything we're allowed to trim.

Rule 2: drop messages in pairs.

A turn is always a user message followed by an assistant reply. If we drop only one half, we leave an orphan, e.g., a user question with no answer, or an assistant reply with no question. That confuses the model. So we always remove two at a time from the front:

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next