---
title: Cloud
sidebarTitle: Cloud
---

<Info>Ollama's cloud is currently in preview.</Info>

## Cloud Models

Ollama's cloud models are a new kind of model in Ollama that can run without a powerful GPU. Instead, cloud models are automatically offloaded to Ollama's cloud service while offering the same capabilities as local models, making it possible to keep using your local tools while running larger models that wouldn't fit on a personal computer.

Ollama currently supports the following cloud models, with more coming soon:

- `deepseek-v3.1:671b-cloud`
- `gpt-oss:20b-cloud`
- `gpt-oss:120b-cloud`
- `kimi-k2:1t-cloud`
- `qwen3-coder:480b-cloud`
- `glm-4.6:cloud`
- `minimax-m2:cloud`

### Running cloud models

Ollama's cloud models require an account on [ollama.com](https://ollama.com). To sign in or create an account, run:

```
ollama signin
```

<Tabs>
<Tab title="CLI">

To run a cloud model, open the terminal and run:

```
ollama run gpt-oss:120b-cloud
```

</Tab>
<Tab title="Python">

First, pull a cloud model so it can be accessed:

```
ollama pull gpt-oss:120b-cloud
```

Next, install [Ollama's Python library](https://github.com/ollama/ollama-python):

```
pip install ollama
```

Next, create and run a simple Python script:

```python
from ollama import Client

client = Client()

messages = [
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
]

for part in client.chat('gpt-oss:120b-cloud', messages=messages, stream=True):
    print(part['message']['content'], end='', flush=True)
```

</Tab>

<Tab title="JavaScript">

First, pull a cloud model so it can be accessed:

```
ollama pull gpt-oss:120b-cloud
```

Next, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js):

```
npm i ollama
```

Then use the library to run a cloud model:

```typescript
import { Ollama } from "ollama";

const ollama = new Ollama();

const response = await ollama.chat({
  model: "gpt-oss:120b-cloud",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const part of response) {
  process.stdout.write(part.message.content);
}
```

</Tab>

<Tab title="cURL">

First, pull a cloud model so it can be accessed:

```
ollama pull gpt-oss:120b-cloud
```

Then run the following cURL command to chat with the model via Ollama's API:

```
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b-cloud",
  "messages": [{
    "role": "user",
    "content": "Why is the sky blue?"
  }],
  "stream": false
}'
```

</Tab>
</Tabs>

## Cloud API access

Cloud models can also be accessed directly via ollama.com's API. In this mode, ollama.com acts as a remote Ollama host.

### Authentication

For direct access to ollama.com's API, first create an [API key](https://ollama.com/settings/keys).

Then, set the `OLLAMA_API_KEY` environment variable to your API key:

```
export OLLAMA_API_KEY=your_api_key
```

### Listing models

Models available directly via ollama.com's API can be listed with:

```
curl https://ollama.com/api/tags
```
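
If you are already using the Python library from the examples above, the same listing can be fetched through it. The snippet below is a minimal sketch; it assumes the `ollama` package's `Client.list()` helper, which wraps the same `/api/tags` endpoint, behaves identically when pointed at ollama.com:

```python
from ollama import Client

# Point the standard client at ollama.com instead of a local instance.
client = Client(host="https://ollama.com")

# Client.list() wraps the same GET /api/tags endpoint as the cURL command above;
# the response contains a "models" array describing each available model.
print(client.list())
```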

### Generating a response

<Tabs>
<Tab title="Python">

First, install [Ollama's Python library](https://github.com/ollama/ollama-python):

```
pip install ollama
```

Then make a request:

```python
import os
from ollama import Client

client = Client(
    host="https://ollama.com",
    headers={'Authorization': 'Bearer ' + os.environ.get('OLLAMA_API_KEY')}
)

messages = [
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
]

for part in client.chat('gpt-oss:120b', messages=messages, stream=True):
    print(part['message']['content'], end='', flush=True)
```

</Tab>

<Tab title="JavaScript">

First, install [Ollama's JavaScript library](https://github.com/ollama/ollama-js):

```
npm i ollama
```

Next, make a request to the model:

```typescript
import { Ollama } from "ollama";

const ollama = new Ollama({
  host: "https://ollama.com",
  headers: {
    Authorization: "Bearer " + process.env.OLLAMA_API_KEY,
  },
});

const response = await ollama.chat({
  model: "gpt-oss:120b",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const part of response) {
  process.stdout.write(part.message.content);
}
```

</Tab>

<Tab title="cURL">

Generate a response via Ollama's chat API:

```
curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{
      "role": "user",
      "content": "Why is the sky blue?"
    }],
    "stream": false
  }'
```

</Tab>
</Tabs>
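
For comparison with the cURL tab above, a non-streaming request can also be made from Python. This is a minimal sketch reusing the same authenticated `Client` as the Python tab; it assumes that omitting `stream=True` returns a single complete response whose `message.content` holds the full answer:

```python
import os
from ollama import Client

# Same authenticated client as in the Python tab above.
client = Client(
    host="https://ollama.com",
    headers={'Authorization': 'Bearer ' + os.environ.get('OLLAMA_API_KEY')}
)

# Without stream=True, chat() returns one complete response instead of parts.
response = client.chat('gpt-oss:120b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'},
])
print(response['message']['content'])
```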