Building a Management CLI for Ollama

By the end of this section you will have a mycli command with four subcommands (list, show, pull, ps) that mirrors the equivalent ollama CLI behavior, written against the Python SDK.

The point of this section is to touch four SDK methods, see how typed responses work, and handle one streaming case (pull progress). We're not building a chat yet; instead, the CLI will let us run commands like the following:

uv run mycli list
# NAME              SIZE      MODIFIED
# granite3.3:2b     1.5 GB    2 hours ago
# llama3.2:3b       2.0 GB    3 days ago

uv run mycli show granite3.3:2b
# Architecture:  granite
# Parameters:    2.5B
# Quantization:  Q4_K_M
# Context:       131072
# Capabilities:  completion, tools

uv run mycli pull qwen2.5:3b
# Pulling qwen2.5:3b
# downloading: 1.85 GB / 1.85 GB (100.0%)
# Done.

uv run mycli ps
# NAME              SIZE      PROCESSOR    CONTEXT    UNTIL
# granite3.3:2b     1.9 GB    100% CPU     4096       4 minutes from now

All four subcommands live in a single file, each as its own function so the code stays readable. We use argparse to parse the command line (show and pull take one additional argument, the model name).
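
The argparse wiring described above is not shown in this section. As a minimal sketch, assuming a hypothetical build_parser helper and a positional argument named model (matching the args.model attribute the functions below read), it could look like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the top-level parser with one subparser per subcommand.

    build_parser is a hypothetical name for this sketch, not part of
    the lesson's code.
    """
    parser = argparse.ArgumentParser(
        prog="mycli", description="Manage local Ollama models."
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # list and ps take no extra arguments.
    sub.add_parser("list", help="list models on disk")
    sub.add_parser("ps", help="list models loaded in memory")

    # show and pull each need the model tag as a positional argument.
    show = sub.add_parser("show", help="print metadata for one model")
    show.add_argument("model", help='model tag, e.g. "granite3.3:2b"')
    pull = sub.add_parser("pull", help="download a model")
    pull.add_argument("model", help='model tag, e.g. "qwen2.5:3b"')
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["show", "granite3.3:2b"])
print(args.command, args.model)
```

In the real entry point, each subcommand name would map to its cmd_* function, and the function's integer return value would be passed to sys.exit as the exit code.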

Here are the functions:

The list command:

def cmd_list(client: Client, _args: argparse.Namespace) -> int:
    """List models on disk. Same data as `ollama list` and GET /api/tags."""
    resp = client.list()
    rows = [
        [
            m.model,                          # the tag, e.g. "granite3.3:2b"
            human_size(m.size),               # bytes -> "1.5 GB"
            human_age(m.modified_at),         # datetime -> "2 hours ago"
        ]
        for m in resp.models
    ]
    print_table(rows, ["NAME", "SIZE", "MODIFIED"])
    return 0

The show command:

def cmd_show(client: Client, args: argparse.Namespace) -> int:
    """Print the metadata for one model. Same data as `ollama show`."""
    resp = client.show(args.model)

    # `details` is a small object with architecture/quantization/etc.
    # `modelinfo` (the `model_info` field in the JSON response) is the full
    # raw metadata dict from the GGUF; we pull the context length out of it.
    # The key is prefixed by architecture, e.g. "granite.context_length"
    # for granite, "llama.context_length" for llama.
    arch = resp.details.family
    ctx_key = f"{arch}.context_length"
    context = resp.modelinfo.get(ctx_key) if resp.modelinfo else None

    print(f"Architecture:  {arch}")
    print(f"Parameters:    {resp.details.parameter_size}")
    print(f"Quantization:  {resp.details.quantization_level}")
    print(f"Context:       {context if context is not None else 'unknown'}")
    if resp.capabilities:
        print(f"Capabilities:  {', '.join(resp.capabilities)}")
    return 0
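
One caveat with the lookup above: it assumes details.family matches the architecture prefix used in the metadata keys, which is not guaranteed for every model. A more defensive sketch reads the GGUF "general.architecture" key first and falls back to any key ending in ".context_length". The function name find_context_length is hypothetical, and it works on a plain mapping so it can be tried without a running server:

```python
from typing import Any, Mapping, Optional


def find_context_length(model_info: Optional[Mapping[str, Any]]) -> Optional[int]:
    """Return the context length from GGUF-style metadata, or None.

    Hypothetical helper for illustration; not part of the Ollama SDK.
    """
    if not model_info:
        return None

    # Preferred path: the metadata names its own architecture.
    arch = model_info.get("general.architecture")
    if arch is not None:
        value = model_info.get(f"{arch}.context_length")
        if value is not None:
            return value

    # Fallback: take the first key that looks like "<arch>.context_length".
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return value
    return None


sample = {"general.architecture": "granite", "granite.context_length": 131072}
print(find_context_length(sample))
```

Inside cmd_show this would be called as find_context_length(resp.modelinfo).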

The pull command:

def cmd_pull(client: Client, args: argparse.Namespace) -> int:
    """Pull a model, printing each progress event."""
    for event in client.pull(args.model, stream=True):
        if event.total and event.completed is not None:
            print(f"{event.status}: {event.completed}/{event.total}")
        else:
            print(event.status)
    return 0
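
The loop above prints raw byte counts, while the sample output at the top of this section shows sizes in GB with a percentage. A minimal sketch of a formatter that produces that shape is below. The name format_progress is hypothetical, and always reporting in decimal GB is a simplification chosen to match the sample:

```python
from typing import Optional

GB = 1000 ** 3  # decimal gigabyte, an assumption made to match the sample output


def format_progress(status: str, completed: Optional[int], total: Optional[int]) -> str:
    """Format one pull progress event as a single line.

    Events without byte counts are shown as the bare status.
    """
    if not total or completed is None:
        return status
    percent = completed / total * 100
    return (
        f"{status}: {completed / GB:.2f} GB / {total / GB:.2f} GB "
        f"({percent:.1f}%)"
    )


print(format_progress("downloading", 925_000_000, 1_850_000_000))
```

In cmd_pull, the two print calls could then collapse into print(format_progress(event.status, event.completed, event.total)).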

The ps command:

def cmd_ps(client: Client, _args: argparse.Namespace) -> int:
    """List models currently loaded in memory."""
    print(f"{'NAME':<20} {'SIZE':<10} {'GPU':<6} EXPIRES")
    for m in client.ps().models:
        # Share of the model held in VRAM; 0% means it runs entirely on CPU.
        gpu_pct = (m.size_vram / m.size * 100) if m.size else 0
        print(
            f"{m.model:<20} {human_size(m.size):<10} {gpu_pct:>5.0f}% {m.expires_at.astimezone()}"
        )
    return 0  # same exit-code convention as the other subcommands

The print_table, human_size, and human_age helpers should also be defined:

def human_size
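
The helper definitions are cut off here. As a minimal sketch, assuming only the behavior implied by the sample output (decimal units such as "1.5 GB", ages such as "2 hours ago", left-aligned columns), they could look like the following. The format_table function and the optional now parameter are additions for this sketch, and human_age assumes it receives timezone-aware datetimes:

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence


def human_size(num_bytes: int) -> str:
    """Render a byte count using decimal units, e.g. 1_500_000_000 -> '1.5 GB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


def human_age(when: datetime, now: Optional[datetime] = None) -> str:
    """Render how long ago `when` was, in the largest whole unit."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - when).total_seconds())
    for name, length in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // length
        if count >= 1:
            return f"{count} {name}{'' if count == 1 else 's'} ago"
    return "just now"


def format_table(rows: Sequence[Sequence[object]], headers: Sequence[str]) -> List[str]:
    """Return the table as lines, each column padded to its widest cell."""
    table = [list(headers)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]


def print_table(rows: Sequence[Sequence[object]], headers: Sequence[str]) -> None:
    """Print the table produced by format_table."""
    for line in format_table(rows, headers):
        print(line)


fixed_now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print_table(
    [["granite3.3:2b", human_size(1_500_000_000),
      human_age(fixed_now - timedelta(hours=2), fixed_now)]],
    ["NAME", "SIZE", "MODIFIED"],
)
```

Splitting formatting from printing keeps the table logic easy to check in isolation.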

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware
