Ollama37 🚀
Tesla K80 Compatible Ollama Fork
Run modern LLMs on NVIDIA Tesla K80 and other CUDA Compute Capability 3.7 GPUs. While official Ollama dropped legacy GPU support, Ollama37 keeps your Tesla K80 hardware functional with the latest models and features.
Key Features
- ⚡ Tesla K80 Support - Full compatibility with CUDA Compute Capability 3.7
- 🔄 Always Current - Synced with upstream Ollama for latest models and fixes
- 🛠️ Optimized Build - CUDA 11 toolchain for maximum legacy GPU compatibility
- 💰 Cost Effective - Leverage existing hardware without expensive upgrades
Quick Start
Docker (Recommended)
# Pull and run
docker pull dogkeeper886/ollama37
docker run --runtime=nvidia --gpus all -p 11434:11434 dogkeeper886/ollama37
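Once the container is running, you can confirm the API is reachable from the host. A minimal check using the standard /api/tags endpoint, which lists locally available models (empty on a fresh install):
# Verify the server answers on the mapped port
curl http://localhost:11434/api/tags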
Docker Compose
services:
  ollama:
    image: dogkeeper886/ollama37
    ports: ["11434:11434"]
    volumes: ["./.ollama:/root/.ollama"]
    runtime: nvidia
    restart: unless-stopped
docker-compose up -d
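To confirm the service sees both K80 GPUs, tail its logs and run nvidia-smi inside the container. A short sketch, assuming the ollama service name from the compose file above and that the NVIDIA runtime injects nvidia-smi into the container (the nvidia-container-toolkit default):
# Follow server logs
docker compose logs -f ollama
# Check GPU visibility from inside the container
docker compose exec ollama nvidia-smi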
Usage
Run Your First Model
# Download and run a model
ollama pull gemma3
ollama run gemma3 "Why is the sky blue?"
# Interactive chat
ollama run gemma3
Tesla K80 Multi-GPU Example
# GPT-OSS utilizes both GPUs automatically
ollama pull gpt-oss
ollama run gpt-oss "Explain the advantages of dual GPU inference"
# Monitor GPU usage
nvidia-smi -l 1 # Shows ~94%/74% utilization on dual K80s
Supported Models
All models from ollama.com/library including Llama 3.2, Gemma3n, Qwen 2.5, Phi-4, Code Llama, and GPT-OSS (multi-GPU optimized for Tesla K80).
REST API
# Generate response
curl http://localhost:11434/api/generate -d '{"model": "gemma3", "prompt": "Hello Tesla K80!"}'
# Chat
curl http://localhost:11434/api/chat -d '{"model": "gemma3", "messages": [{"role": "user", "content": "Hello!"}]}'
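The API also accepts a stream flag and per-request options, following the upstream Ollama REST API. A non-streaming request with a custom temperature looks like this:
# Non-streaming generation with request-level options
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Summarize why Tesla K80 GPUs remain useful.",
  "stream": false,
  "options": {"temperature": 0.7, "num_predict": 128}
}'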
Technical Details
Tesla K80 Support
- CUDA 3.7 Support: Maintained via CMAKE_CUDA_ARCHITECTURES "37;50;61;70;75;80" (see the build sketch after this list)
- CUDA 11 Toolchain: Compatible with legacy GPUs (CUDA 12 dropped Compute Capability 3.7 support)
- Flash Attention: Not supported on Compute Capability 3.7 (requires Tensor Cores); attempts abort with a clear error instead of breaking CUDA backend loading
- Multi-GPU Optimization: GPT-OSS runs efficiently across dual K80 GPUs with a 13,12 tensor split
- Memory Management: Enhanced VMM pool with granularity alignment and progressive fallback
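As a sketch of how the architecture list enters the build, assuming a CMake-based configure step as in upstream Ollama (the Docker build below handles this automatically):
# Pass the legacy-friendly architecture list at configure time
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="37;50;61;70;75;80"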
Tesla K80 Memory Improvements (v1.4.0)
This release includes major stability improvements for Tesla K80 dual-GPU systems:
VMM Pool Crash Fixes
- Issue: cuMemAddressReserve failures causing CUDA_ERROR_INVALID_VALUE crashes
- Solution: Memory granularity alignment and progressive fallback (4GB → 2GB → 1GB → 512MB)
- Result: Stable memory allocation with 93.8%/74.0% GPU utilization on dual K80s
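To watch how the pool settles while a large model loads, a simple monitoring loop (the query fields are standard nvidia-smi options; exact numbers vary by model):
# Sample per-GPU memory and utilization once per second
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 1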
Multi-GPU Model Switching
- Issue: Scheduler deadlocks when switching between multi-GPU (GPT-OSS) and single-GPU (Llama 3.2) models
- Solution: Enhanced conflict detection and proper unload sequencing in scheduler
- Result: Seamless gpt-oss ↔ llama3.2 switching with 4-17s load times
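A quick way to exercise this switching path yourself, assuming both models have already been pulled:
# Alternate between the multi-GPU and single-GPU models, checking loaded state each time
ollama run gpt-oss "Warm up on both GPUs" && ollama ps
ollama run llama3.2 "Switch to a single GPU" && ollama ps
ollama run gpt-oss "And back again" && ollama ps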
Silent Inference Failures
- Issue: Models loaded successfully but failed to generate output after model switching
- Solution: Critical cudaSetDevice() validation - fail fast instead of silent failures
- Result: Self-healing system with automatic recovery, no system reboots required
These improvements enable robust production use of Tesla K80 hardware for LLM inference with model switching capabilities that rival modern GPU setups.
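To spot a silent failure of this kind, check that a request actually returns text after a model switch. A minimal sketch using the REST API and jq (any JSON tool works):
# Expect a non-empty "response" field; an empty result after switching indicates the old failure mode
curl -s http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Say hello.", "stream": false}' | jq -r '.response'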
Recent Updates
- v1.4.0 (2025-08-10): GPT-OSS multi-GPU support, critical Tesla K80 memory fixes, robust model switching
- v1.3.0 (2025-07-19): Added Gemma3n, Qwen2.5VL, latest upstream sync
- v1.2.0 (2025-05-06): Qwen3, Gemma 3 12B, Phi-4 14B support
Building from Source
Docker Build
docker build -f ollama37.Dockerfile -t ollama37 .
Manual Build
For detailed manual compilation instructions including CUDA 11.4, GCC 10, and CMake setup, see our Manual Build Guide.
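As a rough outline of what that guide covers, a native build looks approximately like the following; paths and versions are illustrative, so defer to the Manual Build Guide for the authoritative steps:
# Point the build at GCC 10 and the CUDA 11.4 toolkit
export CC=gcc-10 CXX=g++-10
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc
# Build the CUDA backend, then the ollama binary itself
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="37;50;61;70;75;80"
cmake --build build --parallel
go build .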
Contributing
Found an issue or want to contribute? Check our GitHub issues or submit Tesla K80-specific bug reports and compatibility fixes.
License
Same license as upstream Ollama. See LICENSE file for details.
Advanced Usage
Custom Models
# Import GGUF model
ollama create custom-model -f Modelfile
# Customize existing model
echo 'FROM llama3.2
PARAMETER temperature 0.8
SYSTEM "You are a helpful Tesla K80 expert."' > Modelfile
ollama create tesla-expert -f Modelfile
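To confirm the new model picked up the custom system prompt, using the tesla-expert name from the example above:
# Inspect the stored Modelfile and try the model
ollama show tesla-expert --modelfile
ollama run tesla-expert "What should I know about keeping a K80 cool?"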
CLI Commands
ollama list # List models
ollama show llama3.2 # Model info
ollama ps # Running models
ollama stop llama3.2 # Stop model
ollama serve # Start server
Libraries & Community
See API documentation for complete REST API reference.