Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support

This commit represents a complete rework after pulling the latest changes from
official ollama/ollama repository and re-applying Tesla K80 compatibility patches.

## Key Changes

### CUDA Compute Capability 3.7 Support (Tesla K80)
- Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt
- Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset
- Using 37-virtual (PTX with JIT compilation) for maximum compatibility

### Legacy Toolchain Compatibility
- **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80)
- **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7)
- **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h)

### CPU Architecture Trade-offs
Due to the GCC 10.5 limitation, newer CPU optimizations were sacrificed:
- Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+)
- Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA
- Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility)

### Build System Updates
- Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7
- Added -Wno-deprecated-gpu-targets flag to suppress warnings
- Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI

### Upstream Sync
Merged latest llama.cpp changes including:
- Enhanced KV cache management with ISWA and hybrid memory support
- Improved multi-modal support (mtmd framework)
- New model architectures (Gemma3, Llama4, Qwen3, etc.)
- GPU backend improvements for CUDA, Metal, and ROCm
- Updated quantization support and GGUF format handling

### Documentation
- Updated CLAUDE.md with comprehensive build instructions
- Documented toolchain constraints and CPU architecture trade-offs
- Removed outdated CI/CD workflows (tesla-k80-*.yml)
- Cleaned up temporary development artifacts

## Rationale

This fork maintains Tesla K80 GPU support (compute 3.7), which was dropped in
official Ollama due to legacy driver/CUDA requirements. The toolchain forms a hard
dependency chain:
- K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI

We accept the loss of cutting-edge CPU optimizations to enable running modern
LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Shang Chieh Tseng
Date: 2025-11-05 14:03:05 +08:00
Parent: fabe2c5cb7
Commit: ef14fb5b26
817 changed files with 241634 additions and 70888 deletions


@@ -0,0 +1,113 @@
---
title: Embeddings
description: Generate text embeddings for semantic search, retrieval, and RAG.
---
Embeddings turn text into numeric vectors you can store in a vector database, search with cosine similarity, or use in RAG pipelines. The vector length depends on the model (typically 384–1024 dimensions).
## Recommended models
- [embeddinggemma](https://ollama.com/library/embeddinggemma)
- [qwen3-embedding](https://ollama.com/library/qwen3-embedding)
- [all-minilm](https://ollama.com/library/all-minilm)
## Generate embeddings
Use `/api/embed` with a single string.
<Tabs>
<Tab title="cURL">
```shell
curl -X POST http://localhost:11434/api/embed \
-H "Content-Type: application/json" \
-d '{
"model": "embeddinggemma",
"input": "The quick brown fox jumps over the lazy dog."
}'
```
</Tab>
<Tab title="Python">
```python
import ollama
single = ollama.embed(
model='embeddinggemma',
input='The quick brown fox jumps over the lazy dog.'
)
print(len(single['embeddings'][0])) # vector length
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
const single = await ollama.embed({
model: 'embeddinggemma',
input: 'The quick brown fox jumps over the lazy dog.',
})
console.log(single.embeddings[0].length) // vector length
```
</Tab>
</Tabs>
<Note>
The `/api/embed` endpoint returns L2-normalized (unit-length) vectors.
</Note>
## Generate a batch of embeddings
Pass an array of strings to `input`.
<Tabs>
<Tab title="cURL">
```shell
curl -X POST http://localhost:11434/api/embed \
-H "Content-Type: application/json" \
-d '{
"model": "embeddinggemma",
"input": [
"First sentence",
"Second sentence",
"Third sentence"
]
}'
```
</Tab>
<Tab title="Python">
```python
import ollama
batch = ollama.embed(
model='embeddinggemma',
input=[
'The quick brown fox jumps over the lazy dog.',
'The five boxing wizards jump quickly.',
'Jackdaws love my big sphinx of quartz.',
]
)
print(len(batch['embeddings'])) # number of vectors
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
const batch = await ollama.embed({
model: 'embeddinggemma',
input: [
'The quick brown fox jumps over the lazy dog.',
'The five boxing wizards jump quickly.',
'Jackdaws love my big sphinx of quartz.',
],
})
console.log(batch.embeddings.length) // number of vectors
```
</Tab>
</Tabs>
## Tips
- Use cosine similarity for most semantic search use cases (see the sketch below).
- Use the same embedding model for both indexing and querying.
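As a minimal sketch of these tips (the model name and query text are illustrative), the snippet below ranks the earlier batch of sentences against a query. Because `/api/embed` returns unit-length vectors, a plain dot product gives the cosine similarity.
```python
import ollama

docs = [
    'The quick brown fox jumps over the lazy dog.',
    'The five boxing wizards jump quickly.',
    'Jackdaws love my big sphinx of quartz.',
]

# Index and query with the same embedding model
doc_vectors = ollama.embed(model='embeddinggemma', input=docs)['embeddings']
query_vector = ollama.embed(model='embeddinggemma', input='animals moving fast')['embeddings'][0]

def cosine(a, b):
    # Vectors are already unit length, so the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))

for text, vec in sorted(zip(docs, doc_vectors), key=lambda p: cosine(query_vector, p[1]), reverse=True):
    print(f'{cosine(query_vector, vec):.3f}  {text}')
```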


@@ -0,0 +1,99 @@
---
title: Streaming
---
Streaming allows you to render text as it is produced by the model.
Streaming is enabled by default through the REST API, but disabled by default in the SDKs.
To enable streaming in the SDKs, set the `stream` parameter to `True` (Python) or `true` (JavaScript).
## Key streaming concepts
1. Chatting: Stream partial assistant messages. Each chunk includes the `content` so you can render messages as they arrive.
2. Thinking: Thinking-capable models emit a `thinking` field alongside regular content in each chunk. Detect this field in streaming chunks to show or hide reasoning traces before the final answer arrives.
3. Tool calling: Watch for streamed `tool_calls` in each chunk, execute the requested tool, and append tool outputs back into the conversation.
## Handling streamed chunks
<Note> It is necessary to accumulate the partial fields in order to maintain the history of the conversation. This is particularly important for tool calling where the thinking, tool call from the model, and the executed tool result must be passed back to the model in the next request. </Note>
<Tabs>
<Tab title="Python">
```python
from ollama import chat
stream = chat(
model='qwen3',
messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
stream=True,
)
in_thinking = False
content = ''
thinking = ''
for chunk in stream:
if chunk.message.thinking:
if not in_thinking:
in_thinking = True
print('Thinking:\n', end='', flush=True)
print(chunk.message.thinking, end='', flush=True)
# accumulate the partial thinking
thinking += chunk.message.thinking
elif chunk.message.content:
if in_thinking:
in_thinking = False
print('\n\nAnswer:\n', end='', flush=True)
print(chunk.message.content, end='', flush=True)
# accumulate the partial content
content += chunk.message.content
# append the accumulated fields to the messages for the next request
new_messages = [{'role': 'assistant', 'thinking': thinking, 'content': content}]
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
async function main() {
const stream = await ollama.chat({
model: 'qwen3',
messages: [{ role: 'user', content: 'What is 17 × 23?' }],
stream: true,
})
let inThinking = false
let content = ''
let thinking = ''
for await (const chunk of stream) {
if (chunk.message.thinking) {
if (!inThinking) {
inThinking = true
process.stdout.write('Thinking:\n')
}
process.stdout.write(chunk.message.thinking)
// accumulate the partial thinking
thinking += chunk.message.thinking
} else if (chunk.message.content) {
if (inThinking) {
inThinking = false
process.stdout.write('\n\nAnswer:\n')
}
process.stdout.write(chunk.message.content)
// accumulate the partial content
content += chunk.message.content
}
}
// append the accumulated fields to the messages for the next request
  const newMessages = [{ role: 'assistant', thinking, content }]
}
main().catch(console.error)
```
</Tab>
</Tabs>


@@ -0,0 +1,194 @@
---
title: Structured Outputs
---
Structured outputs let you enforce a JSON schema on model responses so you can reliably extract structured data, describe images, or keep every reply consistent.
## Generating structured JSON
<Tabs>
<Tab title="cURL">
```shell
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "gpt-oss",
"messages": [{"role": "user", "content": "Tell me about Canada in one line"}],
"stream": false,
"format": "json"
}'
```
</Tab>
<Tab title="Python">
```python
from ollama import chat
response = chat(
model='gpt-oss',
messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
format='json'
)
print(response.message.content)
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
const response = await ollama.chat({
model: 'gpt-oss',
messages: [{ role: 'user', content: 'Tell me about Canada.' }],
format: 'json'
})
console.log(response.message.content)
```
</Tab>
</Tabs>
## Generating structured JSON with a schema
Provide a JSON schema to the `format` field.
<Note>
For best results, also include the JSON schema as a string in the prompt to ground the model's response.
</Note>
<Tabs>
<Tab title="cURL">
```shell
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "gpt-oss",
"messages": [{"role": "user", "content": "Tell me about Canada."}],
"stream": false,
"format": {
"type": "object",
"properties": {
"name": {"type": "string"},
"capital": {"type": "string"},
"languages": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["name", "capital", "languages"]
}
}'
```
</Tab>
<Tab title="Python">
Use Pydantic models and pass `model_json_schema()` to `format`, then validate the response:
```python
from ollama import chat
from pydantic import BaseModel
class Country(BaseModel):
name: str
capital: str
languages: list[str]
response = chat(
model='gpt-oss',
messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
format=Country.model_json_schema(),
)
country = Country.model_validate_json(response.message.content)
print(country)
```
</Tab>
<Tab title="JavaScript">
Serialize a Zod schema with `zodToJsonSchema()` and parse the structured response:
```javascript
import ollama from 'ollama'
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'
const Country = z.object({
name: z.string(),
capital: z.string(),
languages: z.array(z.string()),
})
const response = await ollama.chat({
model: 'gpt-oss',
messages: [{ role: 'user', content: 'Tell me about Canada.' }],
format: zodToJsonSchema(Country),
})
const country = Country.parse(JSON.parse(response.message.content))
console.log(country)
```
</Tab>
</Tabs>
## Example: Extract structured data
Define the objects you want returned and let the model populate the fields:
```python
from ollama import chat
from pydantic import BaseModel
class Pet(BaseModel):
name: str
animal: str
age: int
color: str | None
favorite_toy: str | None
class PetList(BaseModel):
pets: list[Pet]
response = chat(
model='gpt-oss',
messages=[{'role': 'user', 'content': 'I have two cats named Luna and Loki...'}],
format=PetList.model_json_schema(),
)
pets = PetList.model_validate_json(response.message.content)
print(pets)
```
## Example: Vision with structured outputs
Vision models accept the same `format` parameter, enabling deterministic descriptions of images:
```python
from ollama import chat
from pydantic import BaseModel
from typing import Literal, Optional
class Object(BaseModel):
name: str
confidence: float
attributes: str
class ImageDescription(BaseModel):
summary: str
objects: list[Object]
scene: str
colors: list[str]
time_of_day: Literal['Morning', 'Afternoon', 'Evening', 'Night']
setting: Literal['Indoor', 'Outdoor', 'Unknown']
text_content: Optional[str] = None
response = chat(
model='gemma3',
messages=[{
'role': 'user',
'content': 'Describe this photo and list the objects you detect.',
'images': ['path/to/image.jpg'],
}],
format=ImageDescription.model_json_schema(),
options={'temperature': 0},
)
image_description = ImageDescription.model_validate_json(response.message.content)
print(image_description)
```
## Tips for reliable structured outputs
- Define schemas with Pydantic (Python) or Zod (JavaScript) so they can be reused for validation.
- Lower the temperature (e.g., set it to `0`) for more deterministic completions.
- Structured outputs also work through the OpenAI-compatible API via `response_format` (see the sketch below).
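As a minimal sketch of that last point, the request below uses the official `openai` Python client pointed at a default local Ollama install; the base URL, the placeholder API key, and JSON-object mode (rather than a full schema) are assumptions for illustration.
```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint at /v1; the API key is unused locally
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='gpt-oss',
    messages=[{'role': 'user', 'content': 'Tell me about Canada. Respond in JSON.'}],
    response_format={'type': 'json_object'},
)
print(response.choices[0].message.content)
```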


@@ -0,0 +1,153 @@
---
title: Thinking
---
Thinking-capable models emit a `thinking` field that separates their reasoning trace from the final answer.
Use this capability to audit model steps, animate the model *thinking* in a UI, or hide the trace entirely when you only need the final response.
## Supported models
- [Qwen 3](https://ollama.com/library/qwen3)
- [GPT-OSS](https://ollama.com/library/gpt-oss) *(use `think` levels: `low`, `medium`, `high` — the trace cannot be fully disabled)*
- [DeepSeek-v3.1](https://ollama.com/library/deepseek-v3.1)
- [DeepSeek R1](https://ollama.com/library/deepseek-r1)
- Browse the latest additions under [thinking models](https://ollama.com/search?c=thinking)
## Enable thinking in API calls
Set the `think` field on chat or generate requests. Most models accept booleans (`true`/`false`).
GPT-OSS instead expects one of `low`, `medium`, or `high` to tune the trace length.
The `message.thinking` (chat endpoint) or `thinking` (generate endpoint) field contains the reasoning trace while `message.content` / `response` holds the final answer.
<Tabs>
<Tab title="cURL">
```shell
curl http://localhost:11434/api/chat -d '{
"model": "qwen3",
"messages": [{
"role": "user",
"content": "How many letter r are in strawberry?"
}],
"think": true,
"stream": false
}'
```
</Tab>
<Tab title="Python">
```python
from ollama import chat
response = chat(
model='qwen3',
messages=[{'role': 'user', 'content': 'How many letter r are in strawberry?'}],
think=True,
stream=False,
)
print('Thinking:\n', response.message.thinking)
print('Answer:\n', response.message.content)
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
const response = await ollama.chat({
model: 'deepseek-r1',
messages: [{ role: 'user', content: 'How many letter r are in strawberry?' }],
think: true,
stream: false,
})
console.log('Thinking:\n', response.message.thinking)
console.log('Answer:\n', response.message.content)
```
</Tab>
</Tabs>
<Note>
GPT-OSS requires `think` to be set to `"low"`, `"medium"`, or `"high"`. Passing `true`/`false` is ignored for that model.
</Note>
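As a hedged sketch of the GPT-OSS case (this assumes a recent Python SDK version that accepts the string levels for `think`), the request mirrors the example above but passes an effort level instead of a boolean.
```python
from ollama import chat

response = chat(
    model='gpt-oss',
    messages=[{'role': 'user', 'content': 'How many letter r are in strawberry?'}],
    think='high',  # one of 'low', 'medium', 'high'; booleans are ignored by gpt-oss
)

print('Thinking:\n', response.message.thinking)
print('Answer:\n', response.message.content)
```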
## Stream the reasoning trace
Thinking streams interleave reasoning tokens before answer tokens. Detect the first `thinking` chunk to render a "thinking" section, then switch to the final reply once `message.content` arrives.
<Tabs>
<Tab title="Python">
```python
from ollama import chat
stream = chat(
model='qwen3',
messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
think=True,
stream=True,
)
in_thinking = False
for chunk in stream:
if chunk.message.thinking and not in_thinking:
in_thinking = True
print('Thinking:\n', end='')
if chunk.message.thinking:
print(chunk.message.thinking, end='')
elif chunk.message.content:
if in_thinking:
print('\n\nAnswer:\n', end='')
in_thinking = False
print(chunk.message.content, end='')
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
async function main() {
const stream = await ollama.chat({
model: 'qwen3',
messages: [{ role: 'user', content: 'What is 17 × 23?' }],
think: true,
stream: true,
})
let inThinking = false
for await (const chunk of stream) {
if (chunk.message.thinking && !inThinking) {
inThinking = true
process.stdout.write('Thinking:\n')
}
if (chunk.message.thinking) {
process.stdout.write(chunk.message.thinking)
} else if (chunk.message.content) {
if (inThinking) {
process.stdout.write('\n\nAnswer:\n')
inThinking = false
}
process.stdout.write(chunk.message.content)
}
}
}
main()
```
</Tab>
</Tabs>
## CLI quick reference
- Enable thinking for a single run: `ollama run deepseek-r1 --think "Where should I visit in Lisbon?"`
- Disable thinking: `ollama run deepseek-r1 --think=false "Summarize this article"`
- Hide the trace while still using a thinking model: `ollama run deepseek-r1 --hidethinking "Is 9.9 bigger or 9.11?"`
- Inside interactive sessions, toggle with `/set think` or `/set nothink`.
- GPT-OSS only accepts levels: `ollama run gpt-oss --think=low "Draft a headline"` (replace `low` with `medium` or `high` as needed).
<Note>Thinking is enabled by default in the CLI and API for supported models.</Note>


@@ -0,0 +1,777 @@
---
title: Tool calling
---
Ollama supports tool calling (also known as function calling), which allows a model to invoke tools and incorporate their results into its replies.
## Calling a single tool
Invoke a single tool and include its response in a follow-up request.
Also known as "single-shot" tool calling.
<Tabs>
<Tab title="cURL">
```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"messages": [{"role": "user", "content": "What's the temperature in New York?"}],
"stream": false,
"tools": [
{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get the current temperature for a city",
"parameters": {
"type": "object",
"required": ["city"],
"properties": {
"city": {"type": "string", "description": "The name of the city"}
}
}
}
}
]
}'
```
**Generate a response with a single tool result**
```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"messages": [
{"role": "user", "content": "What's the temperature in New York?"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"index": 0,
"name": "get_temperature",
"arguments": {"city": "New York"}
}
}
]
},
{"role": "tool", "tool_name": "get_temperature", "content": "22°C"}
],
"stream": false
}'
```
</Tab>
<Tab title="Python">
Install the Ollama Python SDK:
```bash
# with pip
pip install ollama -U
# with uv
uv add ollama
```
```python
from ollama import chat
def get_temperature(city: str) -> str:
"""Get the current temperature for a city
Args:
city: The name of the city
Returns:
The current temperature for the city
"""
temperatures = {
"New York": "22°C",
"London": "15°C",
"Tokyo": "18°C",
}
return temperatures.get(city, "Unknown")
messages = [{"role": "user", "content": "What's the temperature in New York?"}]
# pass functions directly as tools in the tools list or as a JSON schema
response = chat(model="qwen3", messages=messages, tools=[get_temperature], think=True)
messages.append(response.message)
if response.message.tool_calls:
# only recommended for models which only return a single tool call
call = response.message.tool_calls[0]
result = get_temperature(**call.function.arguments)
# add the tool result to the messages
messages.append({"role": "tool", "tool_name": call.function.name, "content": str(result)})
final_response = chat(model="qwen3", messages=messages, tools=[get_temperature], think=True)
print(final_response.message.content)
```
</Tab>
<Tab title="JavaScript">
Install the Ollama JavaScript library:
```bash
# with npm
npm i ollama
# with bun
bun i ollama
```
```typescript
import ollama from 'ollama'
function getTemperature(city: string): string {
const temperatures: Record<string, string> = {
'New York': '22°C',
'London': '15°C',
'Tokyo': '18°C',
}
return temperatures[city] ?? 'Unknown'
}
const tools = [
{
type: 'function',
function: {
name: 'get_temperature',
description: 'Get the current temperature for a city',
parameters: {
type: 'object',
required: ['city'],
properties: {
city: { type: 'string', description: 'The name of the city' },
},
},
},
},
]
const messages = [{ role: 'user', content: "What's the temperature in New York?" }]
const response = await ollama.chat({
model: 'qwen3',
messages,
tools,
think: true,
})
messages.push(response.message)
if (response.message.tool_calls?.length) {
// only recommended for models which only return a single tool call
const call = response.message.tool_calls[0]
const args = call.function.arguments as { city: string }
const result = getTemperature(args.city)
// add the tool result to the messages
messages.push({ role: 'tool', tool_name: call.function.name, content: result })
// generate the final response
const finalResponse = await ollama.chat({ model: 'qwen3', messages, tools, think: true })
console.log(finalResponse.message.content)
}
```
</Tab>
</Tabs>
## Parallel tool calling
<Tabs>
<Tab title="cURL">
Request multiple tool calls in parallel, then send all tool responses back to the model.
```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"messages": [{"role": "user", "content": "What are the current weather conditions and temperature in New York and London?"}],
"stream": false,
"tools": [
{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get the current temperature for a city",
"parameters": {
"type": "object",
"required": ["city"],
"properties": {
"city": {"type": "string", "description": "The name of the city"}
}
}
}
},
{
"type": "function",
"function": {
"name": "get_conditions",
"description": "Get the current weather conditions for a city",
"parameters": {
"type": "object",
"required": ["city"],
"properties": {
"city": {"type": "string", "description": "The name of the city"}
}
}
}
}
]
}'
```
**Generate a response with multiple tool results**
```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"messages": [
{"role": "user", "content": "What are the current weather conditions and temperature in New York and London?"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"index": 0,
"name": "get_temperature",
"arguments": {"city": "New York"}
}
},
{
"type": "function",
"function": {
"index": 1,
"name": "get_conditions",
"arguments": {"city": "New York"}
}
},
{
"type": "function",
"function": {
"index": 2,
"name": "get_temperature",
"arguments": {"city": "London"}
}
},
{
"type": "function",
"function": {
"index": 3,
"name": "get_conditions",
"arguments": {"city": "London"}
}
}
]
},
{"role": "tool", "tool_name": "get_temperature", "content": "22°C"},
{"role": "tool", "tool_name": "get_conditions", "content": "Partly cloudy"},
{"role": "tool", "tool_name": "get_temperature", "content": "15°C"},
{"role": "tool", "tool_name": "get_conditions", "content": "Rainy"}
],
"stream": false
}'
```
</Tab>
<Tab title="Python">
```python
from ollama import chat
def get_temperature(city: str) -> str:
"""Get the current temperature for a city
Args:
city: The name of the city
Returns:
The current temperature for the city
"""
temperatures = {
"New York": "22°C",
"London": "15°C",
"Tokyo": "18°C"
}
return temperatures.get(city, "Unknown")
def get_conditions(city: str) -> str:
"""Get the current weather conditions for a city
Args:
city: The name of the city
Returns:
The current weather conditions for the city
"""
conditions = {
"New York": "Partly cloudy",
"London": "Rainy",
"Tokyo": "Sunny"
}
return conditions.get(city, "Unknown")
messages = [{'role': 'user', 'content': 'What are the current weather conditions and temperature in New York and London?'}]
# The python client automatically parses functions as a tool schema so we can pass them directly
# Schemas can be passed directly in the tools list as well
response = chat(model='qwen3', messages=messages, tools=[get_temperature, get_conditions], think=True)
# add the assistant message to the messages
messages.append(response.message)
if response.message.tool_calls:
# process each tool call
for call in response.message.tool_calls:
# execute the appropriate tool
if call.function.name == 'get_temperature':
result = get_temperature(**call.function.arguments)
elif call.function.name == 'get_conditions':
result = get_conditions(**call.function.arguments)
else:
result = 'Unknown tool'
# add the tool result to the messages
messages.append({'role': 'tool', 'tool_name': call.function.name, 'content': str(result)})
# generate the final response
final_response = chat(model='qwen3', messages=messages, tools=[get_temperature, get_conditions], think=True)
print(final_response.message.content)
```
</Tab>
<Tab title="JavaScript">
```typescript
import ollama from 'ollama'
function getTemperature(city: string): string {
const temperatures: { [key: string]: string } = {
"New York": "22°C",
"London": "15°C",
"Tokyo": "18°C"
}
return temperatures[city] || "Unknown"
}
function getConditions(city: string): string {
const conditions: { [key: string]: string } = {
"New York": "Partly cloudy",
"London": "Rainy",
"Tokyo": "Sunny"
}
return conditions[city] || "Unknown"
}
const tools = [
{
type: 'function',
function: {
name: 'get_temperature',
description: 'Get the current temperature for a city',
parameters: {
type: 'object',
required: ['city'],
properties: {
city: { type: 'string', description: 'The name of the city' },
},
},
},
},
{
type: 'function',
function: {
name: 'get_conditions',
description: 'Get the current weather conditions for a city',
parameters: {
type: 'object',
required: ['city'],
properties: {
city: { type: 'string', description: 'The name of the city' },
},
},
},
}
]
const messages = [{ role: 'user', content: 'What are the current weather conditions and temperature in New York and London?' }]
const response = await ollama.chat({
model: 'qwen3',
messages,
tools,
think: true
})
// add the assistant message to the messages
messages.push(response.message)
if (response.message.tool_calls) {
// process each tool call
for (const call of response.message.tool_calls) {
// execute the appropriate tool
let result: string
if (call.function.name === 'get_temperature') {
const args = call.function.arguments as { city: string }
result = getTemperature(args.city)
} else if (call.function.name === 'get_conditions') {
const args = call.function.arguments as { city: string }
result = getConditions(args.city)
} else {
result = 'Unknown tool'
}
// add the tool result to the messages
messages.push({ role: 'tool', tool_name: call.function.name, content: result })
}
// generate the final response
const finalResponse = await ollama.chat({ model: 'qwen3', messages, tools, think: true })
console.log(finalResponse.message.content)
}
```
</Tab>
</Tabs>
## Multi-turn tool calling (Agent loop)
An agent loop allows the model to decide when to invoke tools and incorporate their results into its replies.
It might also help to tell the model that it is in a loop and can make multiple tool calls.
<Tabs>
<Tab title="Python">
```python
from ollama import chat, ChatResponse
def add(a: int, b: int) -> int:
    """Add two numbers

    Args:
        a: The first number
        b: The second number
    Returns:
        The sum of the two numbers
    """
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiply two numbers

    Args:
        a: The first number
        b: The second number
    Returns:
        The product of the two numbers
    """
    return a * b
available_functions = {
'add': add,
'multiply': multiply,
}
messages = [{'role': 'user', 'content': 'What is (11434+12341)*412?'}]
while True:
response: ChatResponse = chat(
model='qwen3',
messages=messages,
tools=[add, multiply],
think=True,
)
messages.append(response.message)
print("Thinking: ", response.message.thinking)
print("Content: ", response.message.content)
if response.message.tool_calls:
for tc in response.message.tool_calls:
if tc.function.name in available_functions:
print(f"Calling {tc.function.name} with arguments {tc.function.arguments}")
result = available_functions[tc.function.name](**tc.function.arguments)
print(f"Result: {result}")
# add the tool result to the messages
messages.append({'role': 'tool', 'tool_name': tc.function.name, 'content': str(result)})
else:
# end the loop when there are no more tool calls
break
# continue the loop with the updated messages
```
</Tab>
<Tab title="JavaScript">
```typescript
import ollama from 'ollama'
type ToolName = 'add' | 'multiply'
function add(a: number, b: number): number {
return a + b
}
function multiply(a: number, b: number): number {
return a * b
}
const availableFunctions: Record<ToolName, (a: number, b: number) => number> = {
add,
multiply,
}
const tools = [
{
type: 'function',
function: {
name: 'add',
description: 'Add two numbers',
parameters: {
type: 'object',
required: ['a', 'b'],
properties: {
a: { type: 'integer', description: 'The first number' },
b: { type: 'integer', description: 'The second number' },
},
},
},
},
{
type: 'function',
function: {
name: 'multiply',
description: 'Multiply two numbers',
parameters: {
type: 'object',
required: ['a', 'b'],
properties: {
a: { type: 'integer', description: 'The first number' },
b: { type: 'integer', description: 'The second number' },
},
},
},
},
]
async function agentLoop() {
const messages = [{ role: 'user', content: 'What is (11434+12341)*412?' }]
while (true) {
const response = await ollama.chat({
model: 'qwen3',
messages,
tools,
think: true,
})
messages.push(response.message)
console.log('Thinking:', response.message.thinking)
console.log('Content:', response.message.content)
const toolCalls = response.message.tool_calls ?? []
if (toolCalls.length) {
for (const call of toolCalls) {
const fn = availableFunctions[call.function.name as ToolName]
if (!fn) {
continue
}
const args = call.function.arguments as { a: number; b: number }
console.log(`Calling ${call.function.name} with arguments`, args)
const result = fn(args.a, args.b)
console.log(`Result: ${result}`)
messages.push({ role: 'tool', tool_name: call.function.name, content: String(result) })
}
} else {
break
}
}
}
agentLoop().catch(console.error)
```
</Tab>
</Tabs>
## Tool calling with streaming
When streaming, gather every chunk of `thinking`, `content`, and `tool_calls`, then return those fields together with any tool results in the follow-up request.
<Tabs>
<Tab title="Python">
```python
from ollama import chat
def get_temperature(city: str) -> str:
"""Get the current temperature for a city
Args:
city: The name of the city
Returns:
The current temperature for the city
"""
temperatures = {
'New York': '22°C',
'London': '15°C',
}
return temperatures.get(city, 'Unknown')
messages = [{'role': 'user', 'content': "What's the temperature in New York?"}]
while True:
stream = chat(
model='qwen3',
messages=messages,
tools=[get_temperature],
stream=True,
think=True,
)
thinking = ''
content = ''
tool_calls = []
done_thinking = False
# accumulate the partial fields
for chunk in stream:
if chunk.message.thinking:
thinking += chunk.message.thinking
print(chunk.message.thinking, end='', flush=True)
if chunk.message.content:
if not done_thinking:
done_thinking = True
print('\n')
content += chunk.message.content
print(chunk.message.content, end='', flush=True)
if chunk.message.tool_calls:
tool_calls.extend(chunk.message.tool_calls)
print(chunk.message.tool_calls)
# append accumulated fields to the messages
if thinking or content or tool_calls:
messages.append({'role': 'assistant', 'thinking': thinking, 'content': content, 'tool_calls': tool_calls})
if not tool_calls:
break
for call in tool_calls:
if call.function.name == 'get_temperature':
result = get_temperature(**call.function.arguments)
else:
result = 'Unknown tool'
messages.append({'role': 'tool', 'tool_name': call.function.name, 'content': result})
```
</Tab>
<Tab title="JavaScript">
```typescript
import ollama from 'ollama'
function getTemperature(city: string): string {
const temperatures: Record<string, string> = {
'New York': '22°C',
'London': '15°C',
}
return temperatures[city] ?? 'Unknown'
}
const getTemperatureTool = {
type: 'function',
function: {
name: 'get_temperature',
description: 'Get the current temperature for a city',
parameters: {
type: 'object',
required: ['city'],
properties: {
city: { type: 'string', description: 'The name of the city' },
},
},
},
}
async function agentLoop() {
const messages = [{ role: 'user', content: "What's the temperature in New York?" }]
while (true) {
const stream = await ollama.chat({
model: 'qwen3',
messages,
tools: [getTemperatureTool],
stream: true,
think: true,
})
let thinking = ''
let content = ''
const toolCalls: any[] = []
let doneThinking = false
for await (const chunk of stream) {
if (chunk.message.thinking) {
thinking += chunk.message.thinking
process.stdout.write(chunk.message.thinking)
}
if (chunk.message.content) {
if (!doneThinking) {
doneThinking = true
process.stdout.write('\n')
}
content += chunk.message.content
process.stdout.write(chunk.message.content)
}
if (chunk.message.tool_calls?.length) {
toolCalls.push(...chunk.message.tool_calls)
console.log(chunk.message.tool_calls)
}
}
if (thinking || content || toolCalls.length) {
messages.push({ role: 'assistant', thinking, content, tool_calls: toolCalls } as any)
}
if (!toolCalls.length) {
break
}
for (const call of toolCalls) {
if (call.function.name === 'get_temperature') {
const args = call.function.arguments as { city: string }
const result = getTemperature(args.city)
messages.push({ role: 'tool', tool_name: call.function.name, content: result } )
} else {
messages.push({ role: 'tool', tool_name: call.function.name, content: 'Unknown tool' } )
}
}
}
}
agentLoop().catch(console.error)
```
</Tab>
</Tabs>
This loop streams the assistant response, accumulates partial fields, passes them back together, and appends the tool results so the model can complete its answer.
## Using functions as tools with Ollama Python SDK
The Python SDK automatically parses functions as a tool schema so we can pass them directly.
Schemas can still be passed if needed.
```python
from ollama import chat
def get_temperature(city: str) -> str:
"""Get the current temperature for a city
Args:
city: The name of the city
Returns:
The current temperature for the city
"""
temperatures = {
'New York': '22°C',
'London': '15°C',
}
return temperatures.get(city, 'Unknown')
available_functions = {
'get_temperature': get_temperature,
}
messages = [{'role': 'user', 'content': "What's the temperature in New York?"}]

# directly pass the functions as part of the tools list
response = chat(model='qwen3', messages=messages, tools=list(available_functions.values()), think=True)
```


@@ -0,0 +1,85 @@
---
title: Vision
---
Vision models accept images alongside text so the model can describe, classify, and answer questions about what it sees.
## Quick start
```shell
ollama run gemma3 "./image.png What's in this image?"
```
## Usage with Ollama's API
Provide an `images` array. SDKs accept file paths, URLs, or raw bytes, while the REST API expects base64-encoded image data.
<Tabs>
<Tab title="cURL">
```shell
# 1. Download a sample image
curl -L -o test.jpg "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
# 2. Encode the image
IMG=$(base64 < test.jpg | tr -d '\n')
# 3. Send it to Ollama
curl -X POST http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3",
"messages": [{
"role": "user",
"content": "What is in this image?",
"images": ["'"$IMG"'"]
}],
"stream": false
}'
"
```
</Tab>
<Tab title="Python">
```python
from ollama import chat
# from pathlib import Path
# Pass in the path to the image
path = input('Please enter the path to the image: ')
# You can also pass in base64 encoded image data
# img = base64.b64encode(Path(path).read_bytes()).decode()
# or the raw bytes
# img = Path(path).read_bytes()
response = chat(
model='gemma3',
messages=[
{
'role': 'user',
'content': 'What is in this image? Be concise.',
'images': [path],
}
],
)
print(response.message.content)
```
</Tab>
<Tab title="JavaScript">
```javascript
import ollama from 'ollama'
const imagePath = '/absolute/path/to/image.jpg'
const response = await ollama.chat({
model: 'gemma3',
messages: [
{ role: 'user', content: 'What is in this image?', images: [imagePath] }
],
stream: false,
})
console.log(response.message.content)
```
</Tab>
</Tabs>


@@ -0,0 +1,360 @@
---
title: Web search
---
Ollama's web search API can be used to augment models with the latest information to reduce hallucinations and improve accuracy.
Web search is provided as a REST API, with deeper tool integrations in the Python and JavaScript libraries. This also enables models such as OpenAI's gpt-oss to conduct long-running research tasks.
## Authentication
For access to Ollama's web search API, create an [API key](https://ollama.com/settings/keys). A free Ollama account is required.
## Web search API
Performs a web search for a single query and returns relevant results.
### Request
`POST https://ollama.com/api/web_search`
- `query` (string, required): the search query string
- `max_results` (integer, optional): maximum results to return (default 5, max 10)
### Response
Returns an object containing:
- `results` (array): array of search result objects, each containing:
- `title` (string): the title of the web page
- `url` (string): the URL of the web page
- `content` (string): relevant content snippet from the web page
### Examples
<Note>
Ensure `OLLAMA_API_KEY` is set in your environment, or pass the key in the `Authorization` header.
</Note>
#### cURL Request
```bash
curl https://ollama.com/api/web_search \
--header "Authorization: Bearer $OLLAMA_API_KEY" \
-d '{
"query":"what is ollama?"
}'
```
**Response**
```json
{
"results": [
{
"title": "Ollama",
"url": "https://ollama.com/",
"content": "Cloud models are now available..."
},
{
"title": "What is Ollama? Introduction to the AI model management tool",
"url": "https://www.hostinger.com/tutorials/what-is-ollama",
"content": "Ariffud M. 6min Read..."
},
{
"title": "Ollama Explained: Transforming AI Accessibility and Language ...",
"url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/",
"content": "Data Science Data Science Projects Data Analysis..."
}
]
}
```
#### Python library
```python
import ollama
response = ollama.web_search("What is Ollama?")
print(response)
```
**Example output**
```python
results = [
{
"title": "Ollama",
"url": "https://ollama.com/",
"content": "Cloud models are now available in Ollama..."
},
{
"title": "What is Ollama? Features, Pricing, and Use Cases - Walturn",
"url": "https://www.walturn.com/insights/what-is-ollama-features-pricing-and-use-cases",
"content": "Our services..."
},
{
"title": "Complete Ollama Guide: Installation, Usage & Code Examples",
"url": "https://collabnix.com/complete-ollama-guide-installation-usage-code-examples",
"content": "Join our Discord Server..."
}
]
```
See the full Ollama [Python example](https://github.com/ollama/ollama-python/blob/main/examples/web-search.py).
#### JavaScript Library
```tsx
import { Ollama } from "ollama";
const client = new Ollama();
const results = await client.webSearch({ query: "what is ollama?" });
console.log(JSON.stringify(results, null, 2));
```
**Example output**
```json
{
"results": [
{
"title": "Ollama",
"url": "https://ollama.com/",
"content": "Cloud models are now available..."
},
{
"title": "What is Ollama? Introduction to the AI model management tool",
"url": "https://www.hostinger.com/tutorials/what-is-ollama",
"content": "Ollama is an open-source tool..."
},
{
"title": "Ollama Explained: Transforming AI Accessibility and Language Processing",
"url": "https://www.geeksforgeeks.org/artificial-intelligence/ollama-explained-transforming-ai-accessibility-and-language-processing/",
"content": "Ollama is a groundbreaking..."
}
]
}
```
See the full Ollama [JavaScript example](https://github.com/ollama/ollama-js/blob/main/examples/websearch/websearch-tools.ts).
## Web fetch API
Fetches a single web page by URL and returns its content.
### Request
`POST https://ollama.com/api/web_fetch`
- `url` (string, required): the URL to fetch
### Response
Returns an object containing:
- `title` (string): the title of the web page
- `content` (string): the main content of the web page
- `links` (array): array of links found on the page
### Examples
#### cURL Request
```bash
curl --request POST \
--url https://ollama.com/api/web_fetch \
--header "Authorization: Bearer $OLLAMA_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "ollama.com"
}'
```
**Response**
```json
{
"title": "Ollama",
"content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...",
"links": [
"http://ollama.com/",
"http://ollama.com/models",
"https://github.com/ollama/ollama"
  ]
}
```
#### Python SDK
```python
from ollama import web_fetch
result = web_fetch('https://ollama.com')
print(result)
```
**Result**
```python
WebFetchResponse(
title='Ollama',
content='[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama\n\n**Chat & build
with open models**\n\n[Download](https://ollama.com/download) [Explore
models](https://ollama.com/models)\n\nAvailable for macOS, Windows, and Linux',
links=['https://ollama.com/', 'https://ollama.com/models', 'https://github.com/ollama/ollama']
)
```
#### JavaScript SDK
```tsx
import { Ollama } from "ollama";
const client = new Ollama();
const fetchResult = await client.webFetch({ url: "https://ollama.com" });
console.log(JSON.stringify(fetchResult, null, 2));
```
**Result**
```json
{
"title": "Ollama",
"content": "[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama...",
"links": [
"https://ollama.com/",
"https://ollama.com/models",
"https://github.com/ollama/ollama"
]
}
```
## Building a search agent
Use Ollama's web search API as a tool to build a mini search agent.
This example uses Alibaba's Qwen 3 model with 4B parameters.
```bash
ollama pull qwen3:4b
```
```python
from ollama import chat, web_fetch, web_search
available_tools = {'web_search': web_search, 'web_fetch': web_fetch}
messages = [{'role': 'user', 'content': "what is ollama's new engine"}]
while True:
response = chat(
model='qwen3:4b',
messages=messages,
tools=[web_search, web_fetch],
think=True
)
if response.message.thinking:
print('Thinking: ', response.message.thinking)
if response.message.content:
print('Content: ', response.message.content)
messages.append(response.message)
if response.message.tool_calls:
print('Tool calls: ', response.message.tool_calls)
for tool_call in response.message.tool_calls:
function_to_call = available_tools.get(tool_call.function.name)
if function_to_call:
args = tool_call.function.arguments
result = function_to_call(**args)
print('Result: ', str(result)[:200]+'...')
# Result is truncated for limited context lengths
messages.append({'role': 'tool', 'content': str(result)[:2000 * 4], 'tool_name': tool_call.function.name})
else:
messages.append({'role': 'tool', 'content': f'Tool {tool_call.function.name} not found', 'tool_name': tool_call.function.name})
else:
break
```
**Result**
```
Thinking: Okay, the user is asking about Ollama's new engine. I need to figure out what they're referring to. Ollama is a company that develops large language models, so maybe they've released a new model or an updated version of their existing engine....
Tool calls: [ToolCall(function=Function(name='web_search', arguments={'max_results': 3, 'query': 'Ollama new engine'}))]
Result: results=[WebSearchResult(content='# New model scheduling\n\n## September 23, 2025\n\nOllama now includes a significantly improved model scheduling system. Ahead of running a model, Ollama's new engine
Thinking: Okay, the user asked about Ollama's new engine. Let me look at the search results.
First result is from September 23, 2025, talking about new model scheduling. It mentions improved memory management, reduced crashes, better GPU utilization, and multi-GPU performance. Examples show speed improvements and accurate memory reporting. Supported models include gemma3, llama4, qwen3, etc...
Content: Ollama has introduced two key updates to its engine, both released in 2025:
1. **Enhanced Model Scheduling (September 23, 2025)**
- **Precision Memory Management**: Exact memory allocation reduces out-of-memory crashes and optimizes GPU utilization.
- **Performance Gains**: Examples show significant speed improvements (e.g., 85.54 tokens/s vs 52.02 tokens/s) and full GPU layer utilization.
- **Multi-GPU Support**: Improved efficiency across multiple GPUs, with accurate memory reporting via tools like `nvidia-smi`.
- **Supported Models**: Includes `gemma3`, `llama4`, `qwen3`, `mistral-small3.2`, and more.
2. **Multimodal Engine (May 15, 2025)**
- **Vision Support**: First-class support for vision models, including `llama4:scout` (109B parameters), `gemma3`, `qwen2.5vl`, and `mistral-small3.1`.
- **Multimodal Tasks**: Examples include identifying animals in multiple images, answering location-based questions from videos, and document scanning.
These updates highlight Ollama's focus on efficiency, performance, and expanded capabilities for both text and vision tasks.
```
### Context length and agents
Web search results can return thousands of tokens. It is recommended to increase the context length of the model to at least ~32000 tokens. Search agents work best with full context length. [Ollama's cloud models](https://docs.ollama.com/cloud) run at the full context length.
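As a minimal sketch of that recommendation (the 32000 value and the summarization prompt are illustrative), the call below raises the context window through the `num_ctx` option so long search results are not truncated.
```python
from ollama import chat, web_search

results = web_search("what is ollama's new engine")

response = chat(
    model='qwen3:4b',
    messages=[{'role': 'user', 'content': 'Summarize these search results:\n' + str(results)}],
    # Larger context window so the full search results fit in the prompt
    options={'num_ctx': 32000},
)
print(response.message.content)
```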
## MCP Server
You can enable web search in any MCP client through the [Python MCP server](https://github.com/ollama/ollama-python/blob/main/examples/web-search-mcp.py).
### Cline
Ollama's web search can be integrated with Cline easily using the MCP server configuration.
`Manage MCP Servers` > `Configure MCP Servers` > Add the following configuration:
```json
{
"mcpServers": {
"web_search_and_fetch": {
"type": "stdio",
"command": "uv",
"args": ["run", "path/to/web-search-mcp.py"],
"env": { "OLLAMA_API_KEY": "your_api_key_here" }
}
}
}
```
![Cline MCP Configuration](/images/cline-mcp.png)
### Codex
Ollama works well with OpenAI's Codex tool.
Add the following configuration to `~/.codex/config.toml`:
```toml
[mcp_servers.web_search]
command = "uv"
args = ["run", "path/to/web-search-mcp.py"]
env = { "OLLAMA_API_KEY" = "your_api_key_here" }
```
![Codex MCP Configuration](/images/codex-mcp.png)
### Goose
Ollama can integrate with Goose via its MCP feature.
![Goose MCP Configuration 1](/images/goose-mcp-1.png)
![Goose MCP Configuration 2](/images/goose-mcp-2.png)
### Other integrations
Ollama can be integrated with most available tools through direct use of Ollama's API, the Python and JavaScript libraries, the OpenAI-compatible API, or the MCP server.