Building a Management CLI for Ollama Using the SDK
By the end of this section you will have a mycli command with four subcommands (list, show, pull, ps) that mirror the behavior of the equivalent ollama commands, written against the Python SDK.
The goal is to exercise four SDK methods, see how typed responses work, and handle one streaming case (pull progress). We are not building a chat client yet. Instead, the CLI will support commands like the following:
uv run mycli list
# NAME            SIZE     MODIFIED
# granite3.3:2b   1.5 GB   2 hours ago
# llama3.2:3b     2.0 GB   3 days ago

uv run mycli show granite3.3:2b
# Architecture: granite
# Parameters: 2.5B
# Quantization: Q4_K_M
# Context: 131072
# Capabilities: completion, tools

uv run mycli pull qwen2.5:3b
# Pulling qwen2.5:3b
# downloading: 1.85 GB / 1.85 GB (100.0%)
# Done.

uv run mycli ps
# NAME            SIZE     PROCESSOR   CONTEXT   UNTIL
# granite3.3:2b   1.9 GB   100% CPU    4096      4 minutes from now
All four subcommands live in a single file, each in its own function so the code stays readable. We use argparse to parse the command line; show and pull take one additional argument, the model name.
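The argparse wiring described above could look roughly like the sketch below. The names build_parser, main, and the handlers mapping are hypothetical and not part of the SDK; the sketch only assumes that every subcommand function takes (client, args) and returns an exit code, as the functions in this section do.

```python
import argparse
from typing import Callable, Mapping, Optional, Sequence

# A handler receives the SDK client and the parsed arguments, and returns an exit code.
Handler = Callable[[object, argparse.Namespace], int]


def build_parser(handlers: Mapping[str, Handler]) -> argparse.ArgumentParser:
    """Create the mycli parser with one subparser per subcommand."""
    parser = argparse.ArgumentParser(prog="mycli", description="Manage local Ollama models")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text, needs_model in (
        ("list", "List models on disk", False),
        ("show", "Show metadata for one model", True),
        ("pull", "Download a model", True),
        ("ps", "List models loaded in memory", False),
    ):
        p = sub.add_parser(name, help=help_text)
        if needs_model:
            # show and pull need the model tag as a positional argument.
            p.add_argument("model", help="model tag, e.g. granite3.3:2b")
        # Remember which function handles this subcommand.
        p.set_defaults(func=handlers[name])
    return parser


def main(client: object, handlers: Mapping[str, Handler],
         argv: Optional[Sequence[str]] = None) -> int:
    """Parse argv and dispatch to the selected subcommand."""
    args = build_parser(handlers).parse_args(argv)
    return args.func(client, args)
```

In the real file, the mapping would presumably be {"list": cmd_list, "show": cmd_show, "pull": cmd_pull, "ps": cmd_ps}, and the client an ollama.Client instance. Passing the handlers in as a mapping also makes the dispatch testable without a running server.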
Here are the functions:
The list command:
def cmd_list(client: Client, _args: argparse.Namespace) -> int:
    """List models on disk. Same data as `ollama list` and GET /api/tags."""
    resp = client.list()
    rows = [
        [
            m.model,                   # the tag, e.g. "granite3.3:2b"
            human_size(m.size),        # bytes -> "1.5 GB"
            human_age(m.modified_at),  # datetime -> "2 hours ago"
        ]
        for m in resp.models
    ]
    print_table(rows, ["NAME", "SIZE", "MODIFIED"])
    return 0
The show command:
def cmd_show(client: Client, args: argparse.Namespace) -> int:
    """Print the metadata for one model. Same data as `ollama show`."""
    resp = client.show(args.model)
    # `details` is a small object with architecture/quantization/etc.
    # `modelinfo` (the `model_info` field of the API response) is the full raw
    # metadata dict from the GGUF; we pull the context length out of it. The
    # key is prefixed by architecture, e.g. "granite.context_length" for
    # granite, "llama.context_length" for llama.
    arch = resp.details.family
    ctx_key = f"{arch}.context_length"
    context = resp.modelinfo.get(ctx_key) if resp.modelinfo else None
    print(f"Architecture: {arch}")
    print(f"Parameters: {resp.details.parameter_size}")
    print(f"Quantization: {resp.details.quantization_level}")
    print(f"Context: {context if context is not None else 'unknown'}")
    if resp.capabilities:
        print(f"Capabilities: {', '.join(resp.capabilities)}")
    return 0
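The lookup above relies on details.family matching the key prefix in the metadata. If the two ever differ, the context length prints as "unknown". A slightly more defensive lookup is sketched below; find_context_length is a hypothetical helper, and it assumes the metadata may also carry the standard GGUF key "general.architecture".

```python
from typing import Any, Mapping, Optional


def find_context_length(model_info: Optional[Mapping[str, Any]],
                        family: Optional[str]) -> Optional[int]:
    """Return the context length from GGUF-style metadata, or None if absent."""
    if not model_info:
        return None
    # Try the architecture recorded in the metadata first, then the family name.
    for arch in (model_info.get("general.architecture"), family):
        if arch:
            value = model_info.get(f"{arch}.context_length")
            if value is not None:
                return int(value)
    # Last resort: accept any key that ends with ".context_length".
    for key, value in model_info.items():
        if key.endswith(".context_length") and value is not None:
            return int(value)
    return None
```

Inside cmd_show this would be called as find_context_length(resp.modelinfo, resp.details.family).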
The pull command:
def cmd_pull(client: Client, args: argparse.Namespace) -> int:
    """Pull a model, printing each progress event."""
    for event in client.pull(args.model, stream=True):
        if event.total and event.completed is not None:
            print(f"{event.status}: {event.completed}/{event.total}")
        else:
            print(event.status)
    return 0
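As written, cmd_pull prints raw byte counts, whereas the sample output at the top of the section shows gigabytes and a percentage. One way to produce that style of line is sketched below. progress_line is a hypothetical helper, and it assumes decimal gigabytes (1 GB = 10^9 bytes).

```python
from typing import Optional


def progress_line(status: str, completed: Optional[int], total: Optional[int]) -> str:
    """Format one pull event as 'status: X GB / Y GB (P%)', or just the status."""
    # Events without byte counts (e.g. manifest or verification steps) only carry a status.
    if not total or completed is None:
        return status
    pct = completed / total * 100
    return f"{status}: {completed / 1e9:.2f} GB / {total / 1e9:.2f} GB ({pct:.1f}%)"


print(progress_line("downloading", 1_850_000_000, 1_850_000_000))
# downloading: 1.85 GB / 1.85 GB (100.0%)
```

In the loop, the print calls would become print(progress_line(event.status, event.completed, event.total)).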
The ps command:
def cmd_ps(client: Client, _args: argparse.Namespace) -> int:
    """List models currently loaded in memory."""
    print(f"{'NAME':<20} {'SIZE':<10} {'GPU':<6} EXPIRES")
    for m in client.ps().models:
        # Share of the loaded model that sits in GPU memory.
        gpu_pct = (m.size_vram / m.size * 100) if m.size else 0
        print(
            f"{m.model:<20} {human_size(m.size):<10} {gpu_pct:>4.0f}% {m.expires_at.astimezone()}"
        )
    return 0
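The sample mycli ps output at the top of the section shows a PROCESSOR column such as "100% CPU", while the function above prints a single GPU percentage. A label in that style can be derived from the same two fields. The sketch below is an assumption modeled on that sample output, not a confirmed copy of what ollama ps does, and processor_label is a hypothetical name.

```python
def processor_label(size: int, size_vram: int) -> str:
    """Describe where a loaded model lives, e.g. '100% CPU' or '25%/75% CPU/GPU'."""
    if size <= 0:
        return "unknown"
    if size_vram <= 0:
        return "100% CPU"   # nothing was offloaded to the GPU
    if size_vram >= size:
        return "100% GPU"   # the whole model fits in VRAM
    # Split model: report the CPU share first, then the GPU share.
    cpu_pct = round((size - size_vram) / size * 100)
    return f"{cpu_pct}%/{100 - cpu_pct}% CPU/GPU"
```

The label would replace the gpu_pct column in the formatted row, using m.size and m.size_vram as inputs.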
The print_table, human_size, and human_age helpers should also be defined:
def human_size
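As a minimal sketch of what these three helpers might look like, assuming only the call signatures used in the functions above: human_size takes a byte count, human_age takes a timezone-aware datetime, and print_table takes a list of rows plus a list of headers. The decimal units, the optional now parameter, and the "N hours ago" wording are assumptions chosen to match the sample output.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence


def human_size(num_bytes: int) -> str:
    """Format a byte count with decimal units, e.g. 1_500_000_000 -> '1.5 GB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


def human_age(when: datetime, now: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime relative to now, e.g. '2 hours ago'."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - when).total_seconds())
    for name, span in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // span
        if count >= 1:
            return f"{count} {name}{'s' if count != 1 else ''} ago"
    return "just now"


def print_table(rows: Sequence[Sequence[object]], headers: Sequence[str]) -> None:
    """Print rows under headers, padding each column to its widest cell."""
    table = [list(headers), *rows]
    widths = [max(len(str(row[i])) for row in table) for i in range(len(headers))]
    for row in table:
        line = "  ".join(str(cell).ljust(width) for cell, width in zip(row, widths))
        print(line.rstrip())
```

The now parameter exists only so that human_age can be tested with a fixed reference time; callers can omit it.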
