Running Models and Understanding How They Work inside Ollama
Running Models with the Ollama API
The Ollama CLI is convenient for poking around, but every real integration goes through the HTTP API on http://localhost:11434. Three endpoints cover 90% of what you'll do:
/api/pullto get a model/api/generatefor one-shot completions/api/chatfor multi-turn conversations with roles and tool calls
Prerequisites
Before using the API, make sure the server is running by testing the /api/version endpoint:
curl http://localhost:11434/api/version
Expected response:
{"version":"0.30.0"}
If you get Connection refused, start the server with systemctl start ollama. All examples below assume the default host (localhost) and port (11434). Override with OLLAMA_HOST and OLLAMA_PORT if you have a different setup.
Set a model variable so the rest of the lesson is simple to copy/paste:
export MODEL="llama3.2:3b"
/api/pull: Download a Model
/api/pull is the HTTP equivalent of ollama pull. It streams NDJSON (Newline Delimited JSON) progress events while the layers download. Use this when your app needs to provision a model on first run instead of requiring the user to pre-pull it.
curl http://localhost:11434/api/pull \
-d "{\"model\": \"$MODEL\"}"
You'll see a series of JSON lines like:
{"status":"pulling manifest"}
{"status":"pulling dde5aa3fc5ff","digest":"sha256:...","total":2019377376}
// [... progress ...]
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}
If you don't care about progress and just want to block until done, set stream: false:
curl http://localhost:11434/api/pull \
-d "{\"model\": \"$MODEL\", \"stream\": false}"
You get a single response when the pull finishes. The tradeoff is no progress feedback, so use streaming for anything user-facing where a multi-GB download would otherwise look hung.
/api/generate: Single-Shot Completion
/api/generate takes a prompt and returns a completion without taking into consideration the context of previous interactions. Reach for it when you want stateless generation like a one-off summary, a classification, or a code rewrite.
Non-streaming for a clean response object:
curl -s http://localhost:11434/api/generate -d "{
\"model\": \"$MODEL\",
\"prompt\": \"Explain what a transformer architecture is in two sentences.\",
\"stream\": false
}" | jq
Response (truncated):
{
"model": "llama3.2:3b",
"created_at": "2026-05-12T08:10:31.659858474Z",
"response": "A transformer architecture is a type of neural network design that uses self-attention mechanisms to process sequential data, such as text or images...",
"done": true,
"done_reason": "stop",
"context": [
128006,
// [... other tokens ...]
],
"total_duration": 5946213748,
"load_duration": 334729261,
"prompt_eval_count": 37,
"prompt_eval_duration": 91284365,
"eval_count": 63,
"eval_duration": 5389747128
}
The response object has these fields:
| Field | What it tells you |
|---|---|
model | Which model produced the response. Useful when your app routes between models. |
created_at | Timestamp when the response completed, in RFC 3339 format. |
response | The generated text. Empty string when stream: true since each token arrived in a prior event. |
done | true on the final event, false on intermediate streaming events. |
done_reason | Why generation stopped. Common values: stop (hit a stop token or natural end), length (hit num_predict limit), load (model was just loaded, no generation happened). |
context | Token IDs representing the full conversation state. Pass this back in the next /api/generate call as context to continue without resending the prompt. Deprecated in favor of /api/chat with a messages array, but still works. |
total_duration | Whole-request time in nanoseconds, including model load, prompt eval, and generation. |
load_duration | Time spent loading the model into memory in nanoseconds. Zero or tiny when the model was already resident. |
prompt_eval_count | Number of tokens in your prompt. When the prompt was served from cache, this will not be exact, so don't rely on it as an exact input-token count across repeated calls. |
prompt_eval_duration | Time spent processing the prompt (the prefill phase) in nanoseconds. |
eval_count | Number of tokens generated. |
eval_duration | Time spent generating tokens (the decode phase) in nanoseconds. |
Tokens per second is eval_count / (eval_duration / 1e9). This is the number you actually care about when comparing models or hardware.
TPS is, indeed, the headline number for "is this model usable on this hardware". People use it in concrete ways to compare models and quantization (on the same hardware). Here's a table of TPS for some hardware and the use cases that make sense on it:
| Use case | Scenario | TPS | Hardware example |
|---|---|---|---|
| Background summarization | RAG, digest jobs, async tasks | 5 to 10 | CPU only, low-end GPU |
| Interactive chat | Streaming responses to a user | 15 to 20 | Apple Silicon M2, RTX 3060 12GB |
| Code completion | Inline suggestions in an editor | 50+ | RTX 4090 24GB, M3 Max |
Streaming (default) returns one JSON object per token chunk. Each chunk has done: false until the last one, which carries the stats:
curl http://localhost:11434/api/generate -d "{
\"model\": \"$MODEL\",
\"prompt\": \"List three reasons to run models locally.\"
}"
You'll see something like:
{"model":"llama3.2:3b","created_at":"...","response":"Here","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" are","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" three","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" reasons","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" to","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" run","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" models","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" locally","done":false}
// ...
{"model":"llama3.2:3b","created_at":"...","response":"","done":true,"total_duration":"..."}
To watch it accumulate in a terminal:
curl -s http://localhost:11434/api/generate -d "{
\"model\": \"$MODEL\",
\"prompt\": \"List 3 reasons to run models locally.\"
}" | jq -j '.response'
The -j flag tells jq to skip newlines so the tokens flow as a single stream.
/api/generate accepts a handful of extra fields that matter in practice:
system: a system prompt prepended to your inputoptions: model parameters such astemperature,top_p,num_ctx,num_predict,seedformat: "json"or a JSON schema object for structured outputkeep_alive: how long to keep the model in memory after the call (reminder: default5m, set"0"to unload immediately,"-1"to keep loaded indefinitely)
Example with options and a system prompt:
curl -s http://localhost:11434/api/generate -d "{
\"model\": \"$MODEL\",
\"system\": \"You answer in exactly one sentence.\",
\"prompt\": \"What is quantization?\",
\"options\": {\"temperature\": 0.2, \"num_predict\": 80},
\"stream\": false
}" | jq -j '.response'
/api/chat: Multi-Turn Conversations
/api/chat accepts a messages array where each message has a role (system, user, assistant, or tool) and content. The server applies the model's chat template, so you don't hand-build the prompt. Use this when you want conversation history, system prompts handled cleanly, multimodal input, or tool calling.
To get a non-streaming chat with a system prompt, you can use an array of 2 messages:
curl -s http://localhost:11434/api/chat -d "{
\"model\": \"$MODEL\",
\"messages\": [
{\"role\": \"system\", \"content\": \"You are a terse Linux assistant. You answer in one sentence.\"},
{\"role\": \"user\", \"content\": \"How do I find files larger than 100MB?\"}
],
\"stream\": false
}" | jq
The response will look something like:
{
"model": "llama3.2:3b",
"created_at": "2026-05-12T08:50:25.218995078Z"Local AI Engineering with Ollama
Run, understand, customize, fine-tune, and build agentic apps on your own hardwareEnroll now to unlock all content and receive all future updates for free.
