DeepSeek-OCR flips the OCR script. Instead of feeding the decoder every image token, it leans on a vision encoder to compress them up front, cutting input length and GPU strain in one move. That context diet? It opens the door to much larger effective context windows in LLMs.
Why it matters: Pushing compression earlier in the pipeline could change how multimodal models are trained and served, especially when hardware is tight.
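To make the idea concrete, here's a minimal sketch of the general pattern: a pile of patch tokens from a page image gets pooled into a much smaller set of latent tokens before the language decoder ever sees them. This is not DeepSeek-OCR's actual architecture; the module names, shapes, and the 16x compression ratio are illustrative assumptions.

```python
# Illustrative sketch only: a learned "compressor" that turns many vision
# patch tokens into a short prefix for a language decoder. All names and
# numbers here are assumptions, not details from DeepSeek-OCR.
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    """Compress N patch tokens into N // ratio latent tokens."""

    def __init__(self, dim: int, ratio: int = 16, num_heads: int = 8):
        super().__init__()
        self.ratio = ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = patch_tokens.shape
        # Seed the compressed tokens by average-pooling groups of patches,
        # then let them attend back over the full set to recover detail.
        queries = patch_tokens.view(b, n // self.ratio, self.ratio, d).mean(dim=2)
        compressed, _ = self.attn(queries, patch_tokens, patch_tokens)
        return compressed  # shape: (b, n // ratio, d)


if __name__ == "__main__":
    # 4096 patch tokens from a high-res page -> 256 tokens for the decoder.
    patches = torch.randn(1, 4096, 768)
    compressor = TokenCompressor(dim=768, ratio=16)
    vision_prefix = compressor(patches)
    print(vision_prefix.shape)  # torch.Size([1, 256, 768])
```

The payoff in the sketch is the shape change: the decoder attends over 256 tokens instead of 4096, which is where the context and GPU savings come from.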









