Docker Model Runner: Running Machine Learning Models with Docker
Docker Model Runner APIs
Docker Model Runner provides a local server that exposes APIs compatible with OpenAI, Anthropic, and Ollama. This means existing applications written against any of these APIs can talk to your locally hosted models without modification, apart from changing the endpoint URL.
If the application that will consume an installed model runs in a Docker container, use http://172.17.0.1:12434 as the API endpoint (172.17.0.1 is the default gateway of the docker0 bridge, through which containers reach the host). If the application runs on the host machine itself, use http://localhost:12434.
If your code uses OpenAI's SDK, point it at the /engines/v1 endpoint. For both Anthropic and Ollama compatibility, use the root / endpoint.
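For example, if your code uses the OpenAI Python SDK, switching to DMR is just a matter of overriding the base URL. Here is a minimal sketch, assuming the openai package is installed and DMR is listening on localhost:12434; the api_key value is a placeholder, since the local server does not appear to validate it but the SDK requires one:

from openai import OpenAI

# Point the SDK at Docker Model Runner instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed",  # placeholder: DMR is local, the SDK just needs a value
)

# Quick smoke test: print the models pulled with `docker model pull`.
for model in client.models.list():
    print(model.id)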
Here is a quick example of how to list the available models using curl with the OpenAI-compatible API:
curl -X GET "http://localhost:12434/engines/v1/models"
You should be able to see the list of models you have pulled using DMR, for example:
{
  "object": "list",
  "data": [
    {
      "id": "hf.co/qwen/qwen3-0.6b:latest",
      "object": "model",
      "created": 1769621930,
      "owned_by": "docker"
    },
    {
      "id": "ai/smollm2:135m-q2_k",
      "object": "model",
      "created": 1769622005,
      "owned_by": "docker"
    }
  ]
}
Let's go back to an example we saw earlier. When we ran docker model run ai/smollm2:135M-Q2_K "How to create a teleportation device using common household items.", the model (at least in my case) didn't stop after a reasonable amount of time; it kept generating text indefinitely. When using the OpenAI-compatible API, you can set the max_tokens parameter to cap the number of tokens generated in response to a prompt:
curl -s -X POST "http://localhost:12434/engines/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q2_K",
    "messages": [{"role": "user", "content": "How to create a teleportation device using common household items."}],
    "max_tokens": 300
  }' | jq
The above command returns a JSON response containing several keys, including:
- choices: an array of generated responses from the model. Each choice includes a message object with the generated content and a finish_reason indicating why generation stopped ("stop" when the model finished naturally, "length" when it hit the max_tokens limit).
- usage: an object reporting how many tokens were consumed by the prompt and the generated response, which helps you understand the cost of the API call in terms of token usage.
- model: the identifier of the model that was used to generate the response.
- other metadata about the request and response.
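If you prefer to inspect these keys from code rather than with jq, here is a short Python sketch using the requests library (my choice here, not something DMR mandates; any HTTP client works). It sends the same request as above and checks finish_reason and usage:

import requests

resp = requests.post(
    "http://localhost:12434/engines/v1/chat/completions",
    json={
        "model": "ai/smollm2:135M-Q2_K",
        "messages": [{"role": "user", "content": "How to create a teleportation device using common household items."}],
        "max_tokens": 300,
    },
    timeout=120,  # local models can be slow; adjust to taste
)
resp.raise_for_status()
body = resp.json()

choice = body["choices"][0]
print(choice["message"]["content"])

# "length" means the model hit the max_tokens cap before a natural stop.
if choice["finish_reason"] == "length":
    print("--- output truncated at max_tokens ---")

# Token counts for the prompt and the completion.
print(body["usage"])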
What we're interested in is the choices array, which contains the answer. You can get it directly with jq like this:
curl -s -X POST "http://localhost:12434/engines/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q2_K",
    "messages": [{"role": "user", "content": "How to create a teleportation device using common household items."}],
    "max_tokens": 300
  }' | jq -r '.choices[0].message.content'
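The same extraction with the OpenAI Python SDK looks like this; again a sketch assuming DMR on localhost:12434, where the attribute path choices[0].message.content mirrors the jq filter:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed",  # placeholder, see above
)

completion = client.chat.completions.create(
    model="ai/smollm2:135M-Q2_K",
    messages=[{"role": "user", "content": "How to create a teleportation device using common household items."}],
    max_tokens=300,
)

# Same extraction as the jq filter above.
print(completion.choices[0].message.content)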