Local AI Engineering with Ollama

What you'll learn

	Understand what a model is actually doing: You will learn how text becomes tokens, how tokens become predictions, and what weights, embeddings, attention, and the KV cache really are. Just enough to make decisions, with every concept tied to a setting you will later change.
	Install Ollama and size your hardware honestly: You will learn to install the runtime, tell whether a model fits in your RAM or VRAM before you download it, and read the tradeoffs between parameter count, quantization, and speed so you stop pulling models you will delete an hour later.
	Pick, pull, and manage models: You will learn to read the Ollama library and Hugging Face GGUF repos, choose the right quantization (Q4_K_M, Q5_K_M, Q8_0, and the rest), and manage what is on disk and in memory with list, show, ps, stop, copy, and remove.
	Drive Ollama from its API: You will move past the CLI and talk to Ollama the way your apps will, over HTTP, so anything you build (a script, a backend, an agent) can run models without a human typing commands. You will also learn to read tokens-per-second straight off the API so you can compare models and hardware on numbers, not vibes.
	Control the context window: You will take control of how much your model remembers in a single conversation, so you can stop a model from silently forgetting the start of a long chat and start sizing the context window deliberately for the job at hand. You will also learn to see exactly what gets sent to the model on each turn, which is the difference between guessing why a model misbehaves and knowing.
	Operate a model under real conditions: You will learn to tune behavior at runtime with temperature, top_p, top_k, penalties, and seed, control how long models stay loaded with keep-alive, and set concurrency so one model can serve parallel requests without falling over.
	Package a custom model with a Modelfile: You will turn a general-purpose model into a customized one that does a specific job the same way every time, then ship it as a single named artifact a teammate can pull and run with zero setup.
	Fine-tune a model on your own data: You will learn when prompting stops being enough and training begins, then fine-tune Granite to turn plain English into SQL using QLoRA with Unsloth, understand SFT versus preference tuning, and export the result to GGUF to run it in Ollama.
	Build against the Python SDK: You will stop parsing raw JSON by hand and start building real Python programs against Ollama, with typed responses your editor can autocomplete and your code can trust, ending with a small CLI that does the everyday model-management jobs from inside your own tooling.
	Build a working chat loop and see why it forgets: You will write a REPL that sends one message and prints one reply, then watch it fail to recall the previous turn, the concrete proof that the model itself holds no state.
	Give the conversation a memory: You will keep a running message list and resend it every turn, so the assistant can follow a multi-turn conversation within a session.
	Stream replies and accept multi-line input: You will print tokens the moment they arrive instead of waiting for the full reply, and take pasted, multi-line prompts without breaking the loop.
	Keep long chats inside the context window: You will build chats that keep working past the point where they normally break, dropping the oldest turns on your terms so the prompt never overflows the context window and the model never silently forgets where it started.
	Summarize old turns instead of dropping them: You will replace hard trimming with a second model that condenses earlier messages, wired in through LangChain's summarization middleware, so a long conversation keeps its gist instead of its raw length.
	Cache replies in Redis: You will return repeated questions instantly from a cache, cutting both latency and the compute you spend regenerating the same answer.
	Add long-term memory that survives restarts: You will wire in mem0 so the assistant recalls facts about a user across separate sessions, not just within the current one, and handle the background writes cleanly on exit.
	Give the model tools to fetch live data: You will add function calling so the model can invoke your Python functions for things it cannot know, like the current weather or air quality, and guard it with a prompt that makes it admit ignorance instead of inventing numbers when a tool fails.
	Source those tools from an external MCP server: You will swap your hand-written tools for ones served over MCP, so the same agent gains capabilities you did not write and do not have to maintain, and you will see why the M times N integration problem becomes M plus N.
	Put a graphical interface in front of Ollama: You will stand up Open WebUI in Docker against a local or remote Ollama, pull models and chat with your own documents from the browser, and lock it down with the admin approval gate that turns a personal install into something you can safely hand to a team.
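
To make the tokens-per-second point above concrete, here is a minimal sketch, not the book's own code. It assumes a non-streaming response from Ollama's /api/generate or /api/chat endpoint, which reports eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds). The function name and the sample numbers are hypothetical and exist only for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama API response.

    Assumes the response dict carries `eval_count` (generated tokens)
    and `eval_duration` (nanoseconds), as in a final, non-streaming
    /api/generate or /api/chat response.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds before dividing.
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical numbers, not a measurement: 120 tokens in 4 seconds.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running the same prompt against two models, or the same model on two machines, and comparing this one number is usually enough to tell which setup is faster for your workload.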

Description

Most people will only ever rent intelligence. They type into a box, a bill runs in the background, and the model that answers them lives on hardware they will never see, owned by a company that can change the price, the rules, or the model itself whenever it wants. That is the default now. I think it is a bad default.

This is not a theoretical risk. In August 2025, OpenAI retired GPT-4o overnight and pushed everyone onto GPT-5. People who had built their daily work around a fast, predictable model woke up to one that was slower and behaved differently, with no way to go back. In June 2026, a US government export control directive forced Anthropic to cut off Fable 5 and Mythos 5 for every customer at once, citing national security. Teams that had started building on those models lost them in an afternoon, for reasons that had nothing to do with their own work. None of these people did anything wrong. They just did not own what they depended on, so the model, the price, and the rules stayed in someone else's hands, and any of them could move without warning.

This book exists because I got tired of that arrangement. I wanted to own the thing that runs: take a model apart, change how it behaves, feed it my own data, and break it without a meter running. The first time a model answered me from my own machine, with the network unplugged, something clicked. It was mine. It would still be mine next year. The vendor could not revoke my access, change the terms, or deprecate the model out from under me. The limit was my own: I can run any model I want, when I want, where I want, and most importantly, how I want.

What surprised me was how few people knew this was possible, and how much of what little they could find was either a marketing page or a research paper. Almost nothing existed for the person in the middle: the developer who can read code and run commands but has no machine learning degree and wants none. So I wrote the book I needed when I started.

This is a practical book. You will not find history lessons or grand claims about the future. You will find things that run. By the end, three artifacts sit on your own disk: a model you customized to do one job the same way every time, a model you fine-tuned on your own data, and a chat application you built in nine passes until it became a tool-using agent. Every chapter leaves you with something working, because the only way to learn this is to make it work on your own hardware and watch it break in your own particular way.

It is also a tested book. Every command in it was run on a real machine, and every output you see (the JSON responses, the error messages, the token counts, the training logs) came from an actual session, not from documentation I trusted and pasted in. When Ollama behaved differently from its own docs, I say so and pin the version it happened on. When I could not verify a claim, I checked it against the source or cut it. The tooling moves fast enough that a confident guess is worse than no answer, so where accuracy and polish pulled apart, accuracy won. That is where the months went, and it is the part that ages well.

Local AI Engineering with Ollama: Run, understand, customize, fine-tune, and build agentic apps on your own hardware is a map crafted with a single purpose: to make local agentic AI a technology you control, not a service you submit to.

If you want to stop renting and start owning, this book is for you. Start at the first chapter, get a model running tonight, and keep going. The rest follows from there.

This book moves in one direction: from running your first model to shipping an agent that runs on your own hardware. Each chapter ends with something working, and each skill builds on the one before it.

This book is for people who want local AI to be something they build with, not just read about. If you can run a command and edit a file, you are qualified. Each role will get something different out of it.

Whatever your title, if you want local agentic AI to be a tool you control instead of a service you call, you are in the right place.

Tools and technologies you will practice

Redis

Docker

Ollama

Unsloth

LangChain

Learning path

Follow the winding road from start to finish

Local AI Engineering with Ollama

4 sections · 46m read

Why This Book Exists 13m
What You Will Learn 19m
Who Is This Book For? 13m
About the Author 1m

How to Get the Most Out of This Book

2 sections · 19m read

What This Book Asks of You 9m
Conventions 10m

What's the Point of Local AI?

3 sections · 20m read

Should You Run AI Locally, or Just Use an API? 8m
Why People Run Local Even When the API Is Cheaper 7m
A Sensible Default 5m

What Is Ollama?

3 sections · 21m read

What It Actually Solves 6m
Why Local, and Why Now 5m
What Ollama Is, and What It Is Not 10m

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache

9 sections · 127m read

What Is a Token? 4m
Embeddings 8m
What Is a Neural Network? 11m
What Is a Weight? 24m
What Are Inference Parameters? 39m
What Is GGUF and Why Does It Exist? 11m
Quantization: Trade Precision You Don't Need for Memory You Do 17m
Transformer Models 8m
The KV Cache 5m

Requirements and Setup

1 section · 9m read

Installing Ollama 9m

Picking and Pulling Models

5 sections · 48m read

Understanding What You Can Run on Your System 4m
Pulling Models 3m
Understanding How Models Are Stored 7m
Where to Find Models 20m
Reading Ollama's Model Library 14m

Running Models and Understanding How They Work inside Ollama

4 sections · 48m read

Running a Model 4m
One-Shot Mode 4m
Running Models with the Ollama API 37m
Ollama Conversation Flow 3m

The Context Window

4 sections · 22m read

How a Model "Remembers" with Ollama 4m
What Actually Goes to the Model 9m
The Context Window: num_ctx 5m
Silent Truncation: The Trap 4m

Controlling and Tuning Model Behavior at Runtime

4 sections · 21m read

The /set Command 13m
Using the API to Control the Model 5m
The /save and /load Commands 2m
Erasing History with `/clear` 1m

Working with the Model Library

6 sections · 36m read

Understanding What's Loaded 7m
Inspecting Models 18m
Listing Saved and Loaded Models 4m
Stopping a Loaded Model 2m
Removing a Model 2m
Copying a Model 3m

Keep-Alive and Memory Control

6 sections · 17m read

Why Keep Models Loaded at All 2m
Setting Keep-Alive Globally 4m
Setting Keep-Alive per Request 2m
Forcing an Unload Right Now 2m
Multiple Models in Memory at Once 4m
Picking a Keep-Alive That Makes Sense 3m

Concurrency: Parallel Requests and the Queue

4 sections · 17m read

How Many Requests One Model Handles at Once 7m
Choosing the Right Number of Parallel Slots 6m
How Many Waiting Requests Are Tolerated 2m
How They Interact 2m

Building, Running, and Sharing Custom Models for Ollama (Modelfile)

8 sections · 53m read

Step 1: Put the Model under Your Own Name 4m
Step 2: Give It One Job with SYSTEM 5m
Step 3: Control the Output with PARAMETER 8m
Step 4: Stop the Chatter with a Stop Sequence 8m
Step 5: Teach the Format by Example with MESSAGE 13m
Step 6: See and Pin the Prompt with TEMPLATE 6m
Step 7: Package It and Read It Back 3m
Sharing Your Model 6m

Creating a Fine-Tuned Model (English to SQL)

6 sections · 136m read

Why Fine-Tune 19m
How Fine-Tuning Works 19m
Base Models vs Instruct Models 6m
Fine-Tuning the Model 80m
Understanding What Happened 3m
Fine-Tuning in the Browser with Unsloth Studio 9m

Running Your Fine-Tuned Model in Ollama

3 sections · 12m read

Step 1: Export to GGUF 7m
Step 2: Create the Model in Ollama 4m
Step 3: Run and Test the Fine-Tuned Model 1m

Building a Management CLI for Ollama Using the SDK

2 sections · 31m read

Setup and Requirements 15m
Building a Management CLI for Ollama 16m

Building Advanced Agents: Introduction

1 section · 12m read

Pass 1: A Bare-Minimum Chat Loop against a Local Ollama Model 12m

Building Advanced Agents: Conversation History

1 section · 9m read

Pass 2: Keeping a Conversation History 9m

Building Advanced Agents: Streaming and Multi-Line Input

1 section · 15m read

Pass 3: Stream the Reply Token-by-Token and Accept Multi-Line Input 15m

Building Advanced Agents: Long Conversations

1 section · 13m read

Pass 4: Trim the History so It Never Outgrows the Model's Context Window 13m

Building Advanced Agents: Summarization with LangChain

2 sections · 32m read

Pass 5: Swap Hard Trimming for an Automatic Summary of Older Messages 29m
Cost Notes 3m

Building Advanced Agents: Caching

1 section · 9m read

Pass 6: Cache Model Replies in Redis so Repeated Questions Come Back Instantly 9m

Building Advanced Agents: Long-Term Memory with mem0

1 section · 39m read

Pass 7: Give the Chat a Long-Term Memory That Survives Restarts 39m

Building Advanced Agents: Function-Calling

2 sections · 36m read

Pass 8: Let the Model Call Python Functions ("Tools") to Fetch Live Data 6m
The Shape of a Tool Call, End to End 30m

Building Advanced Agents: Integrating MCP Servers

2 sections · 32m read

LangChain and MCP 4m
Pass 9: Get Tools from an External MCP Server Instead of Writing Them In-Process 28m

User-Friendly Interfaces for Ollama

2 sections · 46m read

Local Chat UIs for Ollama 13m
Installing and Using Open WebUI 33m

Afterword: Where to Go from Here

3 sections · 9m read

What's Next? 6m
Keep Going 1m
Your Feedback Matters 2m

The author

Aymen El Amri

@eon01

Aymen El Amri is an engineer, author, and founder of FAUN.dev, a developer platform reaching hundreds of thousands of engineers. He has 15+ years in software engineering and production systems and was recognized by TechBeacon as one of the top 100 DevOps professionals to follow. He writes the practical, tested books he wishes he'd had; this one was born from running local AI on his own hardware until it worked.