mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-09 23:37:06 +00:00
Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support
This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches. ## Key Changes ### CUDA Compute Capability 3.7 Support (Tesla K80) - Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt - Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset - Using 37-virtual (PTX with JIT compilation) for maximum compatibility ### Legacy Toolchain Compatibility - **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80) - **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7) - **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h) ### CPU Architecture Trade-offs Due to GCC 10.5 limitation, sacrificed newer CPU optimizations: - Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+) - Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA - Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility) ### Build System Updates - Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7 - Added -Wno-deprecated-gpu-targets flag to suppress warnings - Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI ### Upstream Sync Merged latest llama.cpp changes including: - Enhanced KV cache management with ISWA and hybrid memory support - Improved multi-modal support (mtmd framework) - New model architectures (Gemma3, Llama4, Qwen3, etc.) - GPU backend improvements for CUDA, Metal, and ROCm - Updated quantization support and GGUF format handling ### Documentation - Updated CLAUDE.md with comprehensive build instructions - Documented toolchain constraints and CPU architecture trade-offs - Removed outdated CI/CD workflows (tesla-k80-*.yml) - Cleaned up temporary development artifacts ## Rationale This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements. The toolchain constraint creates a deadlock: - K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
169
docs/api.md
169
docs/api.md
@@ -1,5 +1,7 @@
|
||||
# API
|
||||
|
||||
> Note: Ollama's API docs are moving to https://docs.ollama.com/api
|
||||
|
||||
## Endpoints
|
||||
|
||||
- [Generate a completion](#generate-a-completion)
|
||||
@@ -104,7 +106,7 @@ The final response in the stream also includes additional data about the generat
|
||||
- `context`: an encoding of the conversation used in this response, this can be sent in the next request to keep a conversational memory
|
||||
- `response`: empty if the response was streamed, if not streamed, this will contain the full response
|
||||
|
||||
To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration` * `10^9`.
|
||||
To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` / `eval_duration` \* `10^9`.
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -500,11 +502,11 @@ The `message` object has the following fields:
|
||||
- `thinking`: (for thinking models) the model's thinking process
|
||||
- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
|
||||
- `tool_calls` (optional): a list of tools in JSON that the model wants to use
|
||||
- `tool_name` (optional): add the name of the tool that was executed to inform the model of the result
|
||||
- `tool_name` (optional): add the name of the tool that was executed to inform the model of the result
|
||||
|
||||
Advanced parameters (optional):
|
||||
|
||||
- `format`: the format to return a response in. Format can be `json` or a JSON schema.
|
||||
- `format`: the format to return a response in. Format can be `json` or a JSON schema.
|
||||
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
|
||||
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
|
||||
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
|
||||
@@ -617,25 +619,26 @@ curl http://localhost:11434/api/chat -d '{
|
||||
##### Response
|
||||
|
||||
A stream of JSON objects is returned:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "llama3.2",
|
||||
"created_at": "2025-07-07T20:22:19.184789Z",
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "",
|
||||
"tool_calls": [
|
||||
{
|
||||
"function": {
|
||||
"name": "get_weather",
|
||||
"arguments": {
|
||||
"city": "Tokyo"
|
||||
}
|
||||
},
|
||||
}
|
||||
]
|
||||
},
|
||||
"done": false
|
||||
"model": "llama3.2",
|
||||
"created_at": "2025-07-07T20:22:19.184789Z",
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "",
|
||||
"tool_calls": [
|
||||
{
|
||||
"function": {
|
||||
"name": "get_weather",
|
||||
"arguments": {
|
||||
"city": "Tokyo"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
"done": false
|
||||
}
|
||||
```
|
||||
|
||||
@@ -643,8 +646,8 @@ Final response:
|
||||
|
||||
```json
|
||||
{
|
||||
"model":"llama3.2",
|
||||
"created_at":"2025-07-07T20:22:19.19314Z",
|
||||
"model": "llama3.2",
|
||||
"created_at": "2025-07-07T20:22:19.19314Z",
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": ""
|
||||
@@ -701,7 +704,6 @@ curl http://localhost:11434/api/chat -d '{
|
||||
|
||||
##### Request
|
||||
|
||||
|
||||
```shell
|
||||
curl http://localhost:11434/api/chat -d '{
|
||||
"model": "llama3.2",
|
||||
@@ -730,7 +732,7 @@ curl http://localhost:11434/api/chat -d '{
|
||||
}
|
||||
}
|
||||
],
|
||||
"stream": false
|
||||
"stream": false
|
||||
}'
|
||||
```
|
||||
|
||||
@@ -750,7 +752,7 @@ curl http://localhost:11434/api/chat -d '{
|
||||
"arguments": {
|
||||
"city": "Tokyo"
|
||||
}
|
||||
},
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
@@ -801,7 +803,10 @@ curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json"
|
||||
{
|
||||
"model": "llama3.1",
|
||||
"created_at": "2024-12-06T00:46:58.265747Z",
|
||||
"message": { "role": "assistant", "content": "{\"age\": 22, \"available\": false}" },
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "{\"age\": 22, \"available\": false}"
|
||||
},
|
||||
"done_reason": "stop",
|
||||
"done": true,
|
||||
"total_duration": 2254970291,
|
||||
@@ -871,7 +876,6 @@ Final response:
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
#### Chat request (With history, with tools)
|
||||
|
||||
##### Request
|
||||
@@ -948,10 +952,8 @@ curl http://localhost:11434/api/chat -d '{
|
||||
"eval_count": 11,
|
||||
"eval_duration": 90282125
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
|
||||
#### Chat request (with images)
|
||||
|
||||
##### Request
|
||||
@@ -1123,7 +1125,7 @@ curl http://localhost:11434/api/chat -d '{
|
||||
```json
|
||||
{
|
||||
"model": "llama3.2",
|
||||
"created_at":"2024-09-12T21:17:29.110811Z",
|
||||
"created_at": "2024-09-12T21:17:29.110811Z",
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": ""
|
||||
@@ -1154,7 +1156,7 @@ A single JSON object is returned:
|
||||
```json
|
||||
{
|
||||
"model": "llama3.2",
|
||||
"created_at":"2024-09-12T21:33:17.547535Z",
|
||||
"created_at": "2024-09-12T21:33:17.547535Z",
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": ""
|
||||
@@ -1171,9 +1173,10 @@ POST /api/create
|
||||
```
|
||||
|
||||
Create a model from:
|
||||
* another model;
|
||||
* a safetensors directory; or
|
||||
* a GGUF file.
|
||||
|
||||
- another model;
|
||||
- a safetensors directory; or
|
||||
- a GGUF file.
|
||||
|
||||
If you are creating a model from a safetensors directory or from a GGUF file, you must [create a blob](#create-a-blob) for each of the files and then use the file name and SHA256 digest associated with each blob in the `files` field.
|
||||
|
||||
@@ -1193,11 +1196,11 @@ If you are creating a model from a safetensors directory or from a GGUF file, yo
|
||||
|
||||
#### Quantization types
|
||||
|
||||
| Type | Recommended |
|
||||
| --- | :-: |
|
||||
| q4_K_M | * |
|
||||
| q4_K_S | |
|
||||
| q8_0 | * |
|
||||
| Type | Recommended |
|
||||
| ------ | :---------: |
|
||||
| q4_K_M | \* |
|
||||
| q4_K_S | |
|
||||
| q8_0 | \* |
|
||||
|
||||
### Examples
|
||||
|
||||
@@ -1268,7 +1271,6 @@ A stream of JSON objects is returned:
|
||||
|
||||
Create a model from a GGUF file. The `files` parameter should be filled out with the file name and SHA256 digest of the GGUF file you wish to use. Use [/api/blobs/:digest](#push-a-blob) to push the GGUF file to the server before calling this API.
|
||||
|
||||
|
||||
##### Request
|
||||
|
||||
```shell
|
||||
@@ -1291,7 +1293,6 @@ A stream of JSON objects is returned:
|
||||
{"status":"success"}
|
||||
```
|
||||
|
||||
|
||||
#### Create a model from a Safetensors directory
|
||||
|
||||
The `files` parameter should include a dictionary of files for the safetensors model which includes the file names and SHA256 digest of each file. Use [/api/blobs/:digest](#push-a-blob) to first push each of the files to the server before calling this API. Files will remain in the cache until the Ollama server is restarted.
|
||||
@@ -1406,9 +1407,7 @@ A single JSON object will be returned.
|
||||
"parent_model": "",
|
||||
"format": "gguf",
|
||||
"family": "qwen2",
|
||||
"families": [
|
||||
"qwen2"
|
||||
],
|
||||
"families": ["qwen2"],
|
||||
"parameter_size": "7.6B",
|
||||
"quantization_level": "Q4_K_M"
|
||||
}
|
||||
@@ -1423,9 +1422,7 @@ A single JSON object will be returned.
|
||||
"parent_model": "",
|
||||
"format": "gguf",
|
||||
"family": "llama",
|
||||
"families": [
|
||||
"llama"
|
||||
],
|
||||
"families": ["llama"],
|
||||
"parameter_size": "3.2B",
|
||||
"quantization_level": "Q4_K_M"
|
||||
}
|
||||
@@ -1461,20 +1458,18 @@ curl http://localhost:11434/api/show -d '{
|
||||
|
||||
```json5
|
||||
{
|
||||
"modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSISTANT:\"",
|
||||
"parameters": "num_keep 24\nstop \"<|start_header_id|>\"\nstop \"<|end_header_id|>\"\nstop \"<|eot_id|>\"",
|
||||
"template": "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>",
|
||||
"details": {
|
||||
"parent_model": "",
|
||||
"format": "gguf",
|
||||
"family": "llama",
|
||||
"families": [
|
||||
"llama"
|
||||
],
|
||||
"parameter_size": "8.0B",
|
||||
"quantization_level": "Q4_0"
|
||||
modelfile: '# Modelfile generated by "ollama show"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE """{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: """\nPARAMETER num_ctx 4096\nPARAMETER stop "\u003c/s\u003e"\nPARAMETER stop "USER:"\nPARAMETER stop "ASSISTANT:"',
|
||||
parameters: 'num_keep 24\nstop "<|start_header_id|>"\nstop "<|end_header_id|>"\nstop "<|eot_id|>"',
|
||||
template: "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>",
|
||||
details: {
|
||||
parent_model: "",
|
||||
format: "gguf",
|
||||
family: "llama",
|
||||
families: ["llama"],
|
||||
parameter_size: "8.0B",
|
||||
quantization_level: "Q4_0",
|
||||
},
|
||||
"model_info": {
|
||||
model_info: {
|
||||
"general.architecture": "llama",
|
||||
"general.file_type": 2,
|
||||
"general.parameter_count": 8030261248,
|
||||
@@ -1491,16 +1486,13 @@ curl http://localhost:11434/api/show -d '{
|
||||
"llama.vocab_size": 128256,
|
||||
"tokenizer.ggml.bos_token_id": 128000,
|
||||
"tokenizer.ggml.eos_token_id": 128009,
|
||||
"tokenizer.ggml.merges": [], // populates if `verbose=true`
|
||||
"tokenizer.ggml.merges": [], // populates if `verbose=true`
|
||||
"tokenizer.ggml.model": "gpt2",
|
||||
"tokenizer.ggml.pre": "llama-bpe",
|
||||
"tokenizer.ggml.token_type": [], // populates if `verbose=true`
|
||||
"tokenizer.ggml.tokens": [] // populates if `verbose=true`
|
||||
"tokenizer.ggml.token_type": [], // populates if `verbose=true`
|
||||
"tokenizer.ggml.tokens": [], // populates if `verbose=true`
|
||||
},
|
||||
"capabilities": [
|
||||
"completion",
|
||||
"vision"
|
||||
],
|
||||
capabilities: ["completion", "vision"],
|
||||
}
|
||||
```
|
||||
|
||||
@@ -1593,7 +1585,7 @@ Then there is a series of downloading responses. Until any of the download is co
|
||||
|
||||
```json
|
||||
{
|
||||
"status": "downloading digestname",
|
||||
"status": "pulling digestname",
|
||||
"digest": "digestname",
|
||||
"total": 2142590208,
|
||||
"completed": 241970
|
||||
@@ -1708,6 +1700,7 @@ Advanced parameters:
|
||||
- `truncate`: truncates the end of each input to fit within context length. Returns error if `false` and context length is exceeded. Defaults to `true`
|
||||
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
|
||||
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
|
||||
- `dimensions`: number of dimensions for the embedding
|
||||
|
||||
### Examples
|
||||
|
||||
@@ -1725,10 +1718,12 @@ curl http://localhost:11434/api/embed -d '{
|
||||
```json
|
||||
{
|
||||
"model": "all-minilm",
|
||||
"embeddings": [[
|
||||
0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
|
||||
0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
|
||||
]],
|
||||
"embeddings": [
|
||||
[
|
||||
0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
|
||||
0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
|
||||
]
|
||||
],
|
||||
"total_duration": 14143917,
|
||||
"load_duration": 1019500,
|
||||
"prompt_eval_count": 8
|
||||
@@ -1749,17 +1744,21 @@ curl http://localhost:11434/api/embed -d '{
|
||||
```json
|
||||
{
|
||||
"model": "all-minilm",
|
||||
"embeddings": [[
|
||||
0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
|
||||
0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
|
||||
],[
|
||||
-0.0098027075, 0.06042469, 0.025257962, -0.006364387, 0.07272725,
|
||||
0.017194884, 0.09032035, -0.051705178, 0.09951512, 0.09072481
|
||||
]]
|
||||
"embeddings": [
|
||||
[
|
||||
0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
|
||||
0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
|
||||
],
|
||||
[
|
||||
-0.0098027075, 0.06042469, 0.025257962, -0.006364387, 0.07272725,
|
||||
0.017194884, 0.09032035, -0.051705178, 0.09951512, 0.09072481
|
||||
]
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## List Running Models
|
||||
|
||||
```
|
||||
GET /api/ps
|
||||
```
|
||||
@@ -1790,9 +1789,7 @@ A single JSON object will be returned.
|
||||
"parent_model": "",
|
||||
"format": "gguf",
|
||||
"family": "llama",
|
||||
"families": [
|
||||
"llama"
|
||||
],
|
||||
"families": ["llama"],
|
||||
"parameter_size": "7.2B",
|
||||
"quantization_level": "Q4_0"
|
||||
},
|
||||
@@ -1839,8 +1836,10 @@ curl http://localhost:11434/api/embeddings -d '{
|
||||
```json
|
||||
{
|
||||
"embedding": [
|
||||
0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
|
||||
0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
|
||||
0.5670403838157654, 0.009260174818336964, 0.23178744316101074,
|
||||
-0.2916173040866852, -0.8924556970596313, 0.8785552978515625,
|
||||
-0.34576427936553955, 0.5742510557174683, -0.04222835972905159,
|
||||
-0.137906014919281
|
||||
]
|
||||
}
|
||||
```
|
||||
@@ -1868,5 +1867,3 @@ curl http://localhost:11434/api/version
|
||||
"version": "0.5.1"
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user