Running Models with the Ollama API

The Ollama CLI is convenient for poking around, but every real integration goes through the HTTP API on http://localhost:11434. Three endpoints cover 90% of what you'll do:

/api/pull to get a model
/api/generate for one-shot completions
/api/chat for multi-turn conversations with roles and tool calls

Prerequisites

Before using the API, make sure the server is running by testing the /api/version endpoint:

curl http://localhost:11434/api/version

Expected response:

{"version":"0.30.0"}

If you get Connection refused, start the server with systemctl start ollama. All examples below assume the default host (localhost) and port (11434). Override with OLLAMA_HOST and OLLAMA_PORT if you have a different setup.

Set a model variable so the rest of the lesson is simple to copy/paste:

export MODEL="llama3.2:3b"

/api/pull: Download a Model

/api/pull is the HTTP equivalent of ollama pull. It streams NDJSON (Newline Delimited JSON) progress events while the layers download. Use this when your app needs to provision a model on first run instead of requiring the user to pre-pull it.

curl http://localhost:11434/api/pull \
  -d "{\"model\": \"$MODEL\"}"

You'll see a series of JSON lines like:

{"status":"pulling manifest"}
{"status":"pulling dde5aa3fc5ff","digest":"sha256:...","total":2019377376}

// [... progress ...]

{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}

If you don't care about progress and just want to block until done, set stream: false:

curl http://localhost:11434/api/pull \
  -d "{\"model\": \"$MODEL\", \"stream\": false}"

You get a single response when the pull finishes. The tradeoff is no progress feedback, so use streaming for anything user-facing where a multi-GB download would otherwise look hung.

/api/generate: Single-Shot Completion

/api/generate takes a prompt and returns a completion without taking into consideration the context of previous interactions. Reach for it when you want stateless generation like a one-off summary, a classification, or a code rewrite.

Non-streaming for a clean response object:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Explain what a transformer architecture is in two sentences.\",
  \"stream\": false
}" | jq

Response (truncated):

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-12T08:10:31.659858474Z",
  "response": "A transformer architecture is a type of neural network design that uses self-attention mechanisms to process sequential data, such as text or images...",
  "done": true,
  "done_reason": "stop",
  "context": [
    128006,
    // [... other tokens ...]
  ],
  "total_duration": 5946213748,
  "load_duration": 334729261,
  "prompt_eval_count": 37,
  "prompt_eval_duration": 91284365,
  "eval_count": 63,
  "eval_duration": 5389747128
}

The response object has these fields:

Field	What it tells you
`model`	Which model produced the response. Useful when your app routes between models.
`created_at`	Timestamp when the response completed, in RFC 3339 format.
`response`	The generated text. Empty string when `stream: true` since each token arrived in a prior event.
`done`	`true` on the final event, `false` on intermediate streaming events.
`done_reason`	Why generation stopped. Common values: `stop` (hit a stop token or natural end), `length` (hit `num_predict` limit), `load` (model was just loaded, no generation happened).
`context`	Token IDs representing the full conversation state. Pass this back in the next `/api/generate` call as `context` to continue without resending the prompt. Deprecated in favor of `/api/chat` with a `messages` array, but still works.
`total_duration`	Whole-request time in nanoseconds, including model load, prompt eval, and generation.
`load_duration`	Time spent loading the model into memory in nanoseconds. Zero or tiny when the model was already resident.
`prompt_eval_count`	Number of tokens in your prompt. When the prompt was served from cache, this will not be exact, so don't rely on it as an exact input-token count across repeated calls.
`prompt_eval_duration`	Time spent processing the prompt (the prefill phase) in nanoseconds.
`eval_count`	Number of tokens generated.
`eval_duration`	Time spent generating tokens (the decode phase) in nanoseconds.

Tokens per second is eval_count / (eval_duration / 1e9). This is the number you actually care about when comparing models or hardware.

TPS is, indeed, the headline number for "is this model usable on this hardware". People use it in concrete ways to compare models and quantization (on the same hardware). Here's a table of TPS for some hardware and the use cases that make sense on it:

Use case	Scenario	TPS	Hardware example
Background summarization	RAG, digest jobs, async tasks	5 to 10	CPU only, low-end GPU
Interactive chat	Streaming responses to a user	15 to 20	Apple Silicon M2, RTX 3060 12GB
Code completion	Inline suggestions in an editor	50+	RTX 4090 24GB, M3 Max

Streaming (default) returns one JSON object per token chunk. Each chunk has done: false until the last one, which carries the stats:

curl http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"List three reasons to run models locally.\"
}"

You'll see something like:

{"model":"llama3.2:3b","created_at":"...","response":"Here","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" are","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" three","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" reasons","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" to","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" run","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" models","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" locally","done":false}

// ...

{"model":"llama3.2:3b","created_at":"...","response":"","done":true,"total_duration":"..."}

To watch it accumulate in a terminal:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"List 3 reasons to run models locally.\"
}" | jq -j '.response'

The -j flag tells jq to skip newlines so the tokens flow as a single stream.

/api/generate accepts a handful of extra fields that matter in practice:

system: a system prompt prepended to your input
options: model parameters such as temperature, top_p, num_ctx, num_predict, seed
format: "json" or a JSON schema object for structured output
keep_alive: how long to keep the model in memory after the call (reminder: default 5m, set "0" to unload immediately, "-1" to keep loaded indefinitely)

Example with options and a system prompt:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"system\": \"You answer in exactly one sentence.\",
  \"prompt\": \"What is quantization?\",
  \"options\": {\"temperature\": 0.2, \"num_predict\": 80},
  \"stream\": false
}" | jq -j '.response'

/api/chat: Multi-Turn Conversations

/api/chat accepts a messages array where each message has a role (system, user, assistant, or tool) and content. The server applies the model's chat template, so you don't hand-build the prompt. Use this when you want conversation history, system prompts handled cleanly, multimodal input, or tool calling.

To get a non-streaming chat with a system prompt, you can use an array of 2 messages:

curl -s http://localhost:11434/api/chat -d "{
  \"model\": \"$MODEL\",
  \"messages\": [
    {\"role\": \"system\", \"content\": \"You are a terse Linux assistant. You answer in one sentence.\"},
    {\"role\": \"user\", \"content\": \"How do I find files larger than 100MB?\"}
  ],
  \"stream\": false
}" | jq

The response will look something like:

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-12T08:50:25.218995078Z"

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next