Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support

This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches. ## Key Changes ### CUDA Compute Capability 3.7 Support (Tesla K80) - Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt - Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset - Using 37-virtual (PTX with JIT compilation) for maximum compatibility ### Legacy Toolchain Compatibility - **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80) - **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7) - **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h) ### CPU Architecture Trade-offs Due to GCC 10.5 limitation, sacrificed newer CPU optimizations: - Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+) - Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA - Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility) ### Build System Updates - Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7 - Added -Wno-deprecated-gpu-targets flag to suppress warnings - Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI ### Upstream Sync Merged latest llama.cpp changes including: - Enhanced KV cache management with ISWA and hybrid memory support - Improved multi-modal support (mtmd framework) - New model architectures (Gemma3, Llama4, Qwen3, etc.) - GPU backend improvements for CUDA, Metal, and ROCm - Updated quantization support and GGUF format handling ### Documentation - Updated CLAUDE.md with comprehensive build instructions - Documented toolchain constraints and CPU architecture trade-offs - Removed outdated CI/CD workflows (tesla-k80-*.yml) - Cleaned up temporary development artifacts ## Rationale This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements. The toolchain constraint creates a deadlock: - K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-09 23:37:06 +00:00 · 2025-11-05 14:03:05 +08:00
parent fabe2c5cb7
commit ef14fb5b26
817 changed files with 241634 additions and 70888 deletions
--- a/docs/api.md
+++ b/docs/api.md
@@ -1,5 +1,7 @@
 # API

+> Note: Ollama's API docs are moving to https://docs.ollama.com/api
+
 ## Endpoints

 - [Generate a completion](#generate-a-completion)
@@ -104,7 +106,7 @@ The final response in the stream also includes additional data about the generat
 - `context`: an encoding of the conversation used in this response, this can be sent in the next request to keep a conversational memory
 - `response`: empty if the response was streamed, if not streamed, this will contain the full response

-To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration` * `10^9`.
+To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration` \* `10^9`.

 ```json
 {
@@ -500,11 +502,11 @@ The `message` object has the following fields:
 - `thinking`: (for thinking models) the model's thinking process
 - `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
 - `tool_calls` (optional): a list of tools in JSON that the model wants to use
- `tool_name` (optional): add the name of the tool that was executed to inform the model of the result 
+- `tool_name` (optional): add the name of the tool that was executed to inform the model of the result

 Advanced parameters (optional):

- `format`: the format to return a response in. Format can be `json` or a JSON schema. 
+- `format`: the format to return a response in. Format can be `json` or a JSON schema.
 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
 - `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
@@ -617,25 +619,26 @@ curl http://localhost:11434/api/chat -d '{
 ##### Response

 A stream of JSON objects is returned:
+
 ```json
 {
-    "model": "llama3.2",
-    "created_at": "2025-07-07T20:22:19.184789Z",
-    "message": {
-        "role": "assistant",
-        "content": "",
-        "tool_calls": [
-            {
-                "function": {
-                    "name": "get_weather",
-                    "arguments": {
-                        "city": "Tokyo"
-                    }
-                },
-            }
-        ]
-    },
-    "done": false
+  "model": "llama3.2",
+  "created_at": "2025-07-07T20:22:19.184789Z",
+  "message": {
+    "role": "assistant",
+    "content": "",
+    "tool_calls": [
+      {
+        "function": {
+          "name": "get_weather",
+          "arguments": {
+            "city": "Tokyo"
+          }
+        }
+      }
+    ]
+  },
+  "done": false
 }
 ```

@@ -643,8 +646,8 @@ Final response:

 ```json
 {
-  "model":"llama3.2",
-  "created_at":"2025-07-07T20:22:19.19314Z",
+  "model": "llama3.2",
+  "created_at": "2025-07-07T20:22:19.19314Z",
  "message": {
    "role": "assistant",
    "content": ""
@@ -701,7 +704,6 @@ curl http://localhost:11434/api/chat -d '{

 ##### Request

-
 ```shell
 curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
@@ -730,7 +732,7 @@ curl http://localhost:11434/api/chat -d '{
      }
    }
  ],
-  "stream": false 
+  "stream": false
 }'
 ```

@@ -750,7 +752,7 @@ curl http://localhost:11434/api/chat -d '{
          "arguments": {
            "city": "Tokyo"
          }
-        },
+        }
      }
    ]
  },
@@ -801,7 +803,10 @@ curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json"
 {
  "model": "llama3.1",
  "created_at": "2024-12-06T00:46:58.265747Z",
-  "message": { "role": "assistant", "content": "{\"age\": 22, \"available\": false}" },
+  "message": {
+    "role": "assistant",
+    "content": "{\"age\": 22, \"available\": false}"
+  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 2254970291,
@@ -871,7 +876,6 @@ Final response:
 }
 ```

-
 #### Chat request (With history, with tools)

 ##### Request
@@ -948,10 +952,8 @@ curl http://localhost:11434/api/chat -d '{
  "eval_count": 11,
  "eval_duration": 90282125
 }
-
 ```

-
 #### Chat request (with images)

 ##### Request
@@ -1123,7 +1125,7 @@ curl http://localhost:11434/api/chat -d '{
 ```json
 {
  "model": "llama3.2",
-  "created_at":"2024-09-12T21:17:29.110811Z",
+  "created_at": "2024-09-12T21:17:29.110811Z",
  "message": {
    "role": "assistant",
    "content": ""
@@ -1154,7 +1156,7 @@ A single JSON object is returned:
 ```json
 {
  "model": "llama3.2",
-  "created_at":"2024-09-12T21:33:17.547535Z",
+  "created_at": "2024-09-12T21:33:17.547535Z",
  "message": {
    "role": "assistant",
    "content": ""
@@ -1171,9 +1173,10 @@ POST /api/create
 ```

 Create a model from:
- * another model;
- * a safetensors directory; or
- * a GGUF file.
+
+- another model;
+- a safetensors directory; or
+- a GGUF file.

 If you are creating a model from a safetensors directory or from a GGUF file, you must [create a blob](#create-a-blob) for each of the files and then use the file name and SHA256 digest associated with each blob in the `files` field.

@@ -1193,11 +1196,11 @@ If you are creating a model from a safetensors directory or from a GGUF file, yo

 #### Quantization types

-| Type | Recommended |
-| --- | :-: |
-| q4_K_M | * |
-| q4_K_S | |
-| q8_0 | * |
+| Type   | Recommended |
+| ------ | :---------: |
+| q4_K_M |     \*      |
+| q4_K_S |             |
+| q8_0   |     \*      |

 ### Examples

@@ -1268,7 +1271,6 @@ A stream of JSON objects is returned:

 Create a model from a GGUF file. The `files` parameter should be filled out with the file name and SHA256 digest of the GGUF file you wish to use. Use [/api/blobs/:digest](#push-a-blob) to push the GGUF file to the server before calling this API.

-
 ##### Request

 ```shell
@@ -1291,7 +1293,6 @@ A stream of JSON objects is returned:
 {"status":"success"}
 ```

-
 #### Create a model from a Safetensors directory

 The `files` parameter should include a dictionary of files for the safetensors model which includes the file names and SHA256 digest of each file. Use [/api/blobs/:digest](#push-a-blob) to first push each of the files to the server before calling this API. Files will remain in the cache until the Ollama server is restarted.
@@ -1406,9 +1407,7 @@ A single JSON object will be returned.
        "parent_model": "",
        "format": "gguf",
        "family": "qwen2",
-        "families": [
-          "qwen2"
-        ],
+        "families": ["qwen2"],
        "parameter_size": "7.6B",
        "quantization_level": "Q4_K_M"
      }
@@ -1423,9 +1422,7 @@ A single JSON object will be returned.
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
-        "families": [
-          "llama"
-        ],
+        "families": ["llama"],
        "parameter_size": "3.2B",
        "quantization_level": "Q4_K_M"
      }
@@ -1461,20 +1458,18 @@ curl http://localhost:11434/api/show -d '{

 ```json5
 {
-  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSISTANT:\"",
-  "parameters": "num_keep                       24\nstop                           \"<|start_header_id|>\"\nstop                           \"<|end_header_id|>\"\nstop                           \"<|eot_id|>\"",
-  "template": "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>",
-  "details": {
-    "parent_model": "",
-    "format": "gguf",
-    "family": "llama",
-    "families": [
-      "llama"
-    ],
-    "parameter_size": "8.0B",
-    "quantization_level": "Q4_0"
+  modelfile: '# Modelfile generated by "ollama show"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE """{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: """\nPARAMETER num_ctx 4096\nPARAMETER stop "\u003c/s\u003e"\nPARAMETER stop "USER:"\nPARAMETER stop "ASSISTANT:"',
+  parameters: 'num_keep                       24\nstop                           "<|start_header_id|>"\nstop                           "<|end_header_id|>"\nstop                           "<|eot_id|>"',
+  template: "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>",
+  details: {
+    parent_model: "",
+    format: "gguf",
+    family: "llama",
+    families: ["llama"],
+    parameter_size: "8.0B",
+    quantization_level: "Q4_0",
  },
-  "model_info": {
+  model_info: {
    "general.architecture": "llama",
    "general.file_type": 2,
    "general.parameter_count": 8030261248,
@@ -1491,16 +1486,13 @@ curl http://localhost:11434/api/show -d '{
    "llama.vocab_size": 128256,
    "tokenizer.ggml.bos_token_id": 128000,
    "tokenizer.ggml.eos_token_id": 128009,
-    "tokenizer.ggml.merges": [],            // populates if `verbose=true`
+    "tokenizer.ggml.merges": [], // populates if `verbose=true`
    "tokenizer.ggml.model": "gpt2",
    "tokenizer.ggml.pre": "llama-bpe",
-    "tokenizer.ggml.token_type": [],        // populates if `verbose=true`
-    "tokenizer.ggml.tokens": []             // populates if `verbose=true`
+    "tokenizer.ggml.token_type": [], // populates if `verbose=true`
+    "tokenizer.ggml.tokens": [], // populates if `verbose=true`
  },
-  "capabilities": [
-    "completion",
-    "vision"
-  ],
+  capabilities: ["completion", "vision"],
 }
 ```

@@ -1593,7 +1585,7 @@ Then there is a series of downloading responses. Until any of the download is co

 ```json
 {
-  "status": "downloading digestname",
+  "status": "pulling digestname",
  "digest": "digestname",
  "total": 2142590208,
  "completed": 241970
@@ -1708,6 +1700,7 @@ Advanced parameters:
 - `truncate`: truncates the end of each input to fit within context length. Returns error if `false` and context length is exceeded. Defaults to `true`
 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
+- `dimensions`: number of dimensions for the embedding

 ### Examples

@@ -1725,10 +1718,12 @@ curl http://localhost:11434/api/embed -d '{
 ```json
 {
  "model": "all-minilm",
-  "embeddings": [[
-    0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
-    0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
-  ]],
+  "embeddings": [
+    [
+      0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
+      0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
+    ]
+  ],
  "total_duration": 14143917,
  "load_duration": 1019500,
  "prompt_eval_count": 8
@@ -1749,17 +1744,21 @@ curl http://localhost:11434/api/embed -d '{
 ```json
 {
  "model": "all-minilm",
-  "embeddings": [[
-    0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
-    0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
-  ],[
-    -0.0098027075, 0.06042469, 0.025257962, -0.006364387, 0.07272725,
-    0.017194884, 0.09032035, -0.051705178, 0.09951512, 0.09072481
-  ]]
+  "embeddings": [
+    [
+      0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
+      0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
+    ],
+    [
+      -0.0098027075, 0.06042469, 0.025257962, -0.006364387, 0.07272725,
+      0.017194884, 0.09032035, -0.051705178, 0.09951512, 0.09072481
+    ]
+  ]
 }
 ```

 ## List Running Models
+
 ```
 GET /api/ps
 ```
@@ -1790,9 +1789,7 @@ A single JSON object will be returned.
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
-        "families": [
-          "llama"
-        ],
+        "families": ["llama"],
        "parameter_size": "7.2B",
        "quantization_level": "Q4_0"
      },
@@ -1839,8 +1836,10 @@ curl http://localhost:11434/api/embeddings -d '{
 ```json
 {
  "embedding": [
-    0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
-    0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
+    0.5670403838157654, 0.009260174818336964, 0.23178744316101074,
+    -0.2916173040866852, -0.8924556970596313, 0.8785552978515625,
+    -0.34576427936553955, 0.5742510557174683, -0.04222835972905159,
+    -0.137906014919281
  ]
 }
 ```
@@ -1868,5 +1867,3 @@ curl http://localhost:11434/api/version
  "version": "0.5.1"
 }
 ```
-
-