mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-18 03:37:09 +00:00
Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support
This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches.

## Key Changes

### CUDA Compute Capability 3.7 Support (Tesla K80)

- Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt
- Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset
- Using 37-virtual (PTX with JIT compilation) for maximum compatibility

### Legacy Toolchain Compatibility

- **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80)
- **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7)
- **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h)

### CPU Architecture Trade-offs

Due to GCC 10.5 limitation, sacrificed newer CPU optimizations:

- Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+)
- Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA
- Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility)

### Build System Updates

- Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7
- Added -Wno-deprecated-gpu-targets flag to suppress warnings
- Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI

### Upstream Sync

Merged latest llama.cpp changes including:

- Enhanced KV cache management with ISWA and hybrid memory support
- Improved multi-modal support (mtmd framework)
- New model architectures (Gemma3, Llama4, Qwen3, etc.)
- GPU backend improvements for CUDA, Metal, and ROCm
- Updated quantization support and GGUF format handling

### Documentation

- Updated CLAUDE.md with comprehensive build instructions
- Documented toolchain constraints and CPU architecture trade-offs
- Removed outdated CI/CD workflows (tesla-k80-*.yml)
- Cleaned up temporary development artifacts

## Rationale

This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements.

The toolchain constraint creates a deadlock:

- K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI

We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
63
docs/api/authentication.mdx
Normal file
@@ -0,0 +1,63 @@
---
title: Authentication
---

No authentication is required when accessing Ollama's API locally via `http://localhost:11434`.

Authentication is required for the following:

* Running cloud models via ollama.com
* Publishing models
* Downloading private models

Ollama supports two authentication methods:

* **Signing in**: sign in from your local installation, and Ollama will automatically take care of authenticating requests to ollama.com when running commands
* **API keys**: API keys for programmatic access to ollama.com's API

## Signing in

To sign in to ollama.com from your local installation of Ollama, run:

```
ollama signin
```

Once signed in, Ollama will automatically authenticate commands as required:

```
ollama run gpt-oss:120b-cloud
```

Similarly, when accessing a local API endpoint that requires cloud access, Ollama will automatically authenticate the request:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b-cloud",
  "prompt": "Why is the sky blue?"
}'
```

## API keys

For direct access to ollama.com's API served at `https://ollama.com/api`, authentication via API keys is required.

First, create an [API key](https://ollama.com/settings/keys), then set the `OLLAMA_API_KEY` environment variable:

```shell
export OLLAMA_API_KEY=your_api_key
```

Then use the API key in the Authorization header:

```shell
curl https://ollama.com/api/generate \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "gpt-oss:120b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```
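
The same authenticated request can also be prepared from Python. The following is a minimal sketch using only the standard library; `build_generate_request` is a hypothetical helper name, the fallback key is a placeholder rather than a real key, and the request is built but not sent.

```python
import json
import os
import urllib.request


def build_generate_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to ollama.com's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        "https://ollama.com/api/generate",
        data=body,
        headers={
            # Same header the curl example passes with -H.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Read the key from OLLAMA_API_KEY; the fallback value is only a placeholder.
api_key = os.environ.get("OLLAMA_API_KEY", "your_api_key")
request = build_generate_request("gpt-oss:120b", "Why is the sky blue?", api_key)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the key in an environment variable, as in the shell example, avoids hard-coding it in source files.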

API keys don't currently expire; however, you can revoke them at any time in your [API keys settings](https://ollama.com/settings/keys).
36
docs/api/errors.mdx
Normal file
@@ -0,0 +1,36 @@
---
title: Errors
---

## Status codes

Endpoints return appropriate HTTP status codes based on the success or failure of the request in the HTTP status line (e.g. `HTTP/1.1 200 OK` or `HTTP/1.1 400 Bad Request`). Common status codes are:

- `200`: Success
- `400`: Bad Request (missing parameters, invalid JSON, etc.)
- `404`: Not Found (model doesn't exist, etc.)
- `429`: Too Many Requests (e.g. when a rate limit is exceeded)
- `500`: Internal Server Error
- `502`: Bad Gateway (e.g. when a cloud model cannot be reached)

## Error messages

Errors are returned in the `application/json` format with the following structure, with the error message in the `error` property:

```json
{
  "error": "the model failed to generate a response"
}
```
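
A client can combine the status code with the `error` property. The sketch below is one possible way to do that; `error_message` is a hypothetical helper, and the fallback to raw text for non-JSON bodies is an assumption about how a client might want to behave, not documented server behavior.

```python
import json
from typing import Optional


def error_message(status: int, body: bytes) -> Optional[str]:
    """Return the `error` string for a failed response, or None on success."""
    if 200 <= status < 300:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        # Body is not valid JSON (assumed case, e.g. a proxy page): use raw text.
        return body.decode("utf-8", errors="replace")
    if isinstance(payload, dict) and "error" in payload:
        return str(payload["error"])
    return body.decode("utf-8", errors="replace")


print(error_message(500, b'{"error": "the model failed to generate a response"}'))
```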

## Errors that occur while streaming

If an error occurs mid-stream, the error will be returned as an object in the `application/x-ndjson` format with an `error` property. Since the response has already started, the status code of the response will not be changed.

```json
{"model":"gemma3","created_at":"2025-10-26T17:21:21.196249Z","response":" Yes","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:21:21.207235Z","response":".","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:21:21.219166Z","response":"I","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:21:21.231094Z","response":"can","done":false}
{"error":"an error was encountered while running the model"}
```
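
Because the status code cannot change once streaming has begun, a client has to inspect every chunk for an `error` property. A minimal sketch of that check follows; `StreamError` and `iter_chunks` are hypothetical names, and the sample lines are shortened copies of the chunks above.

```python
import json
from typing import Iterable, Iterator


class StreamError(RuntimeError):
    """Raised when a streamed chunk carries an `error` property."""


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed NDJSON chunks, raising StreamError on an error object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        if "error" in chunk:
            raise StreamError(chunk["error"])
        yield chunk


stream = [
    '{"model":"gemma3","response":" Yes","done":false}',
    '{"model":"gemma3","response":".","done":false}',
    '{"error":"an error was encountered while running the model"}',
]

text = ""
try:
    for chunk in iter_chunks(stream):
        text += chunk["response"]
except StreamError as exc:
    # The text received before the error is still available to the caller.
    print(f"partial output {text!r}, then error: {exc}")
```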
47
docs/api/index.mdx
Normal file
@@ -0,0 +1,47 @@
---
title: Introduction
---

Ollama's API allows you to run and interact with models programmatically.

## Get started

If you're just getting started, follow the [quickstart](/quickstart) documentation to get up and running with Ollama's API.

## Base URL

After installation, Ollama's API is served by default at:

```
http://localhost:11434/api
```

For running cloud models on **ollama.com**, the same API is available with the following base URL:

```
https://ollama.com/api
```

## Example request

Once Ollama is running, its API is automatically available and can be accessed via `curl`:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Why is the sky blue?"
}'
```

## Libraries

Ollama has official libraries for Python and JavaScript:

- [Python](https://github.com/ollama/ollama-python)
- [JavaScript](https://github.com/ollama/ollama-js)

Several community-maintained libraries are available for Ollama. For a full list, see the [Ollama GitHub repository](https://github.com/ollama/ollama?tab=readme-ov-file#libraries-1).

## Versioning

Ollama's API isn't strictly versioned, but the API is expected to be stable and backwards compatible. Deprecations are rare and will be announced in the [release notes](https://github.com/ollama/ollama/releases).
368
docs/api/openai-compatibility.mdx
Normal file
@@ -0,0 +1,368 @@
---
title: OpenAI compatibility
---

Ollama provides compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama.

## Usage

### OpenAI Python library

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',

    # required but ignored
    api_key='ollama',
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='llama3.2',
)

response = client.chat.completions.create(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": "",
                },
            ],
        }
    ],
    max_tokens=300,
)

completion = client.completions.create(
    model="llama3.2",
    prompt="Say this is a test",
)

list_completion = client.models.list()

model = client.models.retrieve("llama3.2")

embeddings = client.embeddings.create(
    model="all-minilm",
    input=["why is the sky blue?", "why is the grass green?"],
)
```
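
The `image_url` value in the vision example is empty in this listing. The supported request fields for `/v1/chat/completions` list base64-encoded images as supported and image URLs as unsupported, so one plausible way to fill it is a base64 data URL. The sketch below is an assumption along those lines, not the original example's value: `image_data_url` is a hypothetical helper, the bytes are a placeholder for a real file, and the `{"url": ...}` shape follows the `curl` example in this document.

```python
import base64


def image_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; in practice read a real file, e.g. open(path, "rb").read().
url = image_data_url(b"\x89PNG\r\n\x1a\n")

# A content part shaped like the one in the curl example.
content_part = {"type": "image_url", "image_url": {"url": url}}
print(content_part["image_url"]["url"][:22])
```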

#### Structured outputs

```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Define the schema for the response
class FriendInfo(BaseModel):
    name: str
    age: int
    is_available: bool

class FriendList(BaseModel):
    friends: list[FriendInfo]

try:
    completion = client.beta.chat.completions.parse(
        temperature=0,
        model="llama3.1:8b",
        messages=[
            {"role": "user", "content": "I have two friends. The first is Ollama 22 years old busy saving the world, and the second is Alonso 23 years old and wants to hang out. Return a list of friends in JSON format"}
        ],
        response_format=FriendList,
    )

    friends_response = completion.choices[0].message
    if friends_response.parsed:
        print(friends_response.parsed)
    elif friends_response.refusal:
        print(friends_response.refusal)
except Exception as e:
    print(f"Error: {e}")
```

### OpenAI JavaScript library

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:11434/v1/",

  // required but ignored
  apiKey: "ollama",
});

const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: "user", content: "Say this is a test" }],
  model: "llama3.2",
});

const response = await openai.chat.completions.create({
  model: "llava",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image_url",
          image_url:
            "",
        },
      ],
    },
  ],
});

const completion = await openai.completions.create({
  model: "llama3.2",
  prompt: "Say this is a test.",
});

const listCompletion = await openai.models.list();

const model = await openai.models.retrieve("llama3.2");

const embedding = await openai.embeddings.create({
  model: "all-minilm",
  input: ["why is the sky blue?", "why is the grass green?"],
});
```

### `curl`

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What'\''s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": ""
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Say this is a test"
  }'

curl http://localhost:11434/v1/models

curl http://localhost:11434/v1/models/llama3.2

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "all-minilm",
    "input": ["why is the sky blue?", "why is the grass green?"]
  }'
```

## Endpoints

### `/v1/chat/completions`

#### Supported features

- [x] Chat completions
- [x] Streaming
- [x] JSON mode
- [x] Reproducible outputs
- [x] Vision
- [x] Tools
- [ ] Logprobs

#### Supported request fields

- [x] `model`
- [x] `messages`
  - [x] Text `content`
  - [x] Image `content`
    - [x] Base64 encoded image
    - [ ] Image URL
  - [x] Array of `content` parts
- [x] `frequency_penalty`
- [x] `presence_penalty`
- [x] `response_format`
- [x] `seed`
- [x] `stop`
- [x] `stream`
- [x] `stream_options`
  - [x] `include_usage`
- [x] `temperature`
- [x] `top_p`
- [x] `max_tokens`
- [x] `tools`
- [ ] `tool_choice`
- [ ] `logit_bias`
- [ ] `user`
- [ ] `n`

### `/v1/completions`

#### Supported features

- [x] Completions
- [x] Streaming
- [x] JSON mode
- [x] Reproducible outputs
- [ ] Logprobs

#### Supported request fields

- [x] `model`
- [x] `prompt`
- [x] `frequency_penalty`
- [x] `presence_penalty`
- [x] `seed`
- [x] `stop`
- [x] `stream`
- [x] `stream_options`
  - [x] `include_usage`
- [x] `temperature`
- [x] `top_p`
- [x] `max_tokens`
- [x] `suffix`
- [ ] `best_of`
- [ ] `echo`
- [ ] `logit_bias`
- [ ] `user`
- [ ] `n`

#### Notes

- `prompt` currently only accepts a string

### `/v1/models`

#### Notes

- `created` corresponds to when the model was last modified
- `owned_by` corresponds to the ollama username, defaulting to `"library"`

### `/v1/models/{model}`

#### Notes

- `created` corresponds to when the model was last modified
- `owned_by` corresponds to the ollama username, defaulting to `"library"`

### `/v1/embeddings`

#### Supported request fields

- [x] `model`
- [x] `input`
  - [x] string
  - [x] array of strings
  - [ ] array of tokens
  - [ ] array of token arrays
- [x] `encoding_format`
- [x] `dimensions`
- [ ] `user`

## Models

Before using a model, pull it locally with `ollama pull`:

```shell
ollama pull llama3.2
```

### Default model names

For tooling that relies on default OpenAI model names such as `gpt-3.5-turbo`, use `ollama cp` to copy an existing model name to a temporary name:

```shell
ollama cp llama3.2 gpt-3.5-turbo
```

Afterwards, this new model name can be specified in the `model` field:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

### Setting the context size

The OpenAI API does not have a way of setting the context size for a model. If you need to change the context size, create a `Modelfile` which looks like:

```
FROM <some model>
PARAMETER num_ctx <context size>
```

Use the `ollama create mymodel` command to create a new model with the updated context size. Call the API with the updated model name:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mymodel",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
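
For scripted setups, the two-line `Modelfile` described above can be generated rather than written by hand. The following is a small sketch; `render_modelfile` is a hypothetical helper, and the base model `llama3.2` and context size `8192` are illustrative values only.

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that sets the context size for a base model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive integer")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


# Illustrative values; substitute your own model and context size.
modelfile = render_modelfile("llama3.2", 8192)
print(modelfile, end="")
# Save the text as ./Modelfile, then run: ollama create mymodel -f Modelfile
```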
35
docs/api/streaming.mdx
Normal file
@@ -0,0 +1,35 @@
---
title: Streaming
---

Certain API endpoints stream responses by default, such as `/api/generate`. These responses are provided in the newline-delimited JSON format (i.e. the `application/x-ndjson` content type). For example:

```json
{"model":"gemma3","created_at":"2025-10-26T17:15:24.097767Z","response":"That","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.109172Z","response":"'","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.121485Z","response":"s","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.132802Z","response":" a","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.143931Z","response":" fantastic","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.155176Z","response":" question","done":false}
{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"!","done":true, "done_reason": "stop"}
```
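
Reassembling the full text from such a stream amounts to parsing each line as JSON and concatenating the `response` fragments until a chunk reports `done`. A minimal sketch follows; `collect_response` is a hypothetical helper, and the sample reuses the chunks above with the `created_at` fields omitted for brevity.

```python
import json
from typing import Optional, Tuple


def collect_response(ndjson_text: str) -> Tuple[str, Optional[str]]:
    """Join the `response` fragments and return (text, done_reason)."""
    parts = []
    done_reason = None
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            done_reason = chunk.get("done_reason")
            break
    return "".join(parts), done_reason


sample = """\
{"model":"gemma3","response":"That","done":false}
{"model":"gemma3","response":"'","done":false}
{"model":"gemma3","response":"s","done":false}
{"model":"gemma3","response":" a","done":false}
{"model":"gemma3","response":" fantastic","done":false}
{"model":"gemma3","response":" question","done":false}
{"model":"gemma3","response":"!","done":true, "done_reason": "stop"}
"""

text, reason = collect_response(sample)
print(text, reason)
```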

## Disabling streaming

Streaming can be disabled by providing `{"stream": false}` in the request body for any endpoint that supports streaming. This will cause responses to be returned in the `application/json` format instead:

```json
{"model":"gemma3","created_at":"2025-10-26T17:15:24.166576Z","response":"That's a fantastic question!","done":true}
```

## When to use streaming vs non-streaming

**Streaming (default)**:
- Real-time response generation
- Lower perceived latency
- Better for long generations

**Non-streaming**:
- Simpler to process
- Better for short responses or structured outputs
- Easier to handle in some applications
36
docs/api/usage.mdx
Normal file
@@ -0,0 +1,36 @@
---
title: Usage
---

Ollama's API responses include metrics that can be used for measuring performance and model usage:

* `total_duration`: How long the response took to generate
* `load_duration`: How long the model took to load
* `prompt_eval_count`: How many input tokens were processed
* `prompt_eval_duration`: How long it took to evaluate the prompt
* `eval_count`: How many output tokens were processed
* `eval_duration`: How long it took to generate the output tokens

All timing values are measured in nanoseconds.

## Example response

For endpoints that return usage metrics, the response body will include the usage fields. For example, a non-streaming call to `/api/generate` may return the following response:

```json
{
  "model": "gemma3",
  "created_at": "2025-10-17T23:14:07.414671Z",
  "response": "Hello! How can I help you today?",
  "done": true,
  "done_reason": "stop",
  "total_duration": 174560334,
  "load_duration": 101397084,
  "prompt_eval_count": 11,
  "prompt_eval_duration": 13074791,
  "eval_count": 18,
  "eval_duration": 52479709
}
```

For endpoints that return **streaming responses**, usage fields are included as part of the final chunk, where `done` is `true`.
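
Since all durations are in nanoseconds, throughput in tokens per second is the token count divided by the duration scaled by 1e9. The sketch below applies that to the example response above; `tokens_per_second` is a hypothetical helper, and returning `0.0` for a non-positive duration is an assumed convention.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/second."""
    if duration_ns <= 0:
        return 0.0  # assumed convention: avoid division by zero
    return count / (duration_ns / 1e9)


# Usage fields copied from the example response.
metrics = {
    "total_duration": 174560334,
    "load_duration": 101397084,
    "prompt_eval_count": 11,
    "prompt_eval_duration": 13074791,
    "eval_count": 18,
    "eval_duration": 52479709,
}

generation_rate = tokens_per_second(metrics["eval_count"], metrics["eval_duration"])
prompt_rate = tokens_per_second(metrics["prompt_eval_count"], metrics["prompt_eval_duration"])

print(f"generation: {generation_rate:.1f} tokens/s")
print(f"prompt eval: {prompt_rate:.1f} tokens/s")
print(f"total: {metrics['total_duration'] / 1e6:.1f} ms")
```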