---
title: Streaming
---

Streaming allows you to render text as it is produced by the model.

Streaming is enabled by default through the REST API, but disabled by default in the SDKs.

To enable streaming in the SDKs, set the `stream` parameter to `True` in Python or `true` in JavaScript.
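
Streaming responses from the REST API arrive as newline-delimited JSON. Below is a minimal sketch of reading that stream directly; it assumes a local server on the default port (`http://localhost:11434`) and uses the third-party `requests` package, and the prompt is illustrative.

```python
import json

import requests

# POST /api/chat streams one JSON object per line by default
response = requests.post(
    'http://localhost:11434/api/chat',
    json={
        'model': 'qwen3',
        'messages': [{'role': 'user', 'content': 'What is 17 × 23?'}],
    },
    stream=True,
)

for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    # each chunk carries a partial assistant message
    print(chunk['message'].get('content', ''), end='', flush=True)
    if chunk.get('done'):
        break
```
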
## Key streaming concepts

1. Chatting: Stream partial assistant messages. Each chunk includes the `content` field so you can render messages as they arrive.

1. Thinking: Thinking-capable models emit a `thinking` field alongside regular content in each chunk. Detect this field in streaming chunks to show or hide reasoning traces before the final answer arrives.

1. Tool calling: Watch for streamed `tool_calls` in each chunk, execute the requested tool, and append the tool output back into the conversation, as in the sketch after this list.

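The examples under "Handling streamed chunks" below cover chatting and thinking. For tool calling, the overall flow looks roughly like this Python sketch; the `add_two_numbers` tool is illustrative, and details such as passing a plain function as a tool or the `tool_name` field on tool-result messages may vary between SDK versions.

```python
from ollama import chat


def add_two_numbers(a: int, b: int) -> int:
    """Add two numbers together."""
    return a + b


messages = [{'role': 'user', 'content': 'What is 17 + 23? Use the tool.'}]

stream = chat(
    model='qwen3',
    messages=messages,
    tools=[add_two_numbers],  # the SDK can derive a tool schema from the function
    stream=True,
)

content = ''
tool_calls = []

for chunk in stream:
    # tool calls may arrive in any chunk; collect them as they stream in
    if chunk.message.tool_calls:
        tool_calls.extend(chunk.message.tool_calls)
    if chunk.message.content:
        content += chunk.message.content

# append the assistant turn, then execute each requested tool and append its result
messages.append({'role': 'assistant', 'content': content, 'tool_calls': tool_calls})
for call in tool_calls:
    if call.function.name == 'add_two_numbers':
        result = add_two_numbers(**call.function.arguments)
        messages.append({'role': 'tool', 'content': str(result), 'tool_name': call.function.name})

# send the tool results back to the model for the final answer
for chunk in chat(model='qwen3', messages=messages, stream=True):
    print(chunk.message.content, end='', flush=True)
```
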
## Handling streamed chunks

<Note> Accumulate the partial fields so you can maintain the history of the conversation. This is particularly important for tool calling, where the thinking, the tool call from the model, and the executed tool result must all be passed back to the model in the next request. </Note>

<Tabs>
<Tab title="Python">

```python
from ollama import chat

stream = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
    stream=True,
)

in_thinking = False
content = ''
thinking = ''

for chunk in stream:
    if chunk.message.thinking:
        if not in_thinking:
            in_thinking = True
            print('Thinking:\n', end='', flush=True)
        print(chunk.message.thinking, end='', flush=True)
        # accumulate the partial thinking
        thinking += chunk.message.thinking
    elif chunk.message.content:
        if in_thinking:
            in_thinking = False
            print('\n\nAnswer:\n', end='', flush=True)
        print(chunk.message.content, end='', flush=True)
        # accumulate the partial content
        content += chunk.message.content

# append the accumulated fields to the messages for the next request
new_messages = [{'role': 'assistant', 'thinking': thinking, 'content': content}]
```
</Tab>
<Tab title="JavaScript">

```javascript
import ollama from 'ollama'

async function main() {
  const stream = await ollama.chat({
    model: 'qwen3',
    messages: [{ role: 'user', content: 'What is 17 × 23?' }],
    stream: true,
  })

  let inThinking = false
  let content = ''
  let thinking = ''

  for await (const chunk of stream) {
    if (chunk.message.thinking) {
      if (!inThinking) {
        inThinking = true
        process.stdout.write('Thinking:\n')
      }
      process.stdout.write(chunk.message.thinking)
      // accumulate the partial thinking
      thinking += chunk.message.thinking
    } else if (chunk.message.content) {
      if (inThinking) {
        inThinking = false
        process.stdout.write('\n\nAnswer:\n')
      }
      process.stdout.write(chunk.message.content)
      // accumulate the partial content
      content += chunk.message.content
    }
  }

  // append the accumulated fields to the messages for the next request
  const newMessages = [{ role: 'assistant', thinking: thinking, content: content }]
}

main().catch(console.error)
```
</Tab>
</Tabs>
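
On the next request, the accumulated fields are sent back as part of the conversation history. Here is a minimal Python sketch, reusing the `thinking` and `content` variables from the example above with an illustrative follow-up question:

```python
from ollama import chat

# `thinking` and `content` are the accumulated fields from the streaming loop above
messages = [
    {'role': 'user', 'content': 'What is 17 × 23?'},
    {'role': 'assistant', 'thinking': thinking, 'content': content},
    {'role': 'user', 'content': 'Now add 9 to that result.'},
]

for chunk in chat(model='qwen3', messages=messages, stream=True):
    print(chunk.message.content, end='', flush=True)
```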