Mirror of https://github.com/dogkeeper886/ollama37.git (synced 2025-12-10 07:46:59 +00:00)
Document Phase 9 completion: Fix CUDA backend loading for CC 3.7
Phase 9 resolved the runtime loading issue where the CUDA backend failed to load because of undefined Flash Attention symbols.

Solution:
- Disabled the flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() for backend selection (sketched below)

Testing Results:
✅ CUDA backend loads without undefined symbol errors
✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
✅ Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores); if it is attempted, the backend aborts gracefully with a clear error message. All 9 phases of the CC 3.7-only optimization are now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
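The `ggml_backend_cuda_score()` function mentioned above is not shown in this diff. As a rough illustration of what such a backend-selection hook could look like, here is a hedged sketch: the function name comes from the commit message, but the body, the score values, and the assumption that the loader treats 0 as "unsupported" and prefers higher scores are illustrative, not the fork's actual code. The device queries use the standard CUDA runtime API.

```cpp
// Hypothetical sketch only, not the code added by this commit.
// Idea: return 0 when no CUDA device is usable so the loader can fall back to
// another backend, and a high score when a CC 3.7 (Tesla K80) device exists so
// this CC 3.7-only build of the CUDA backend is preferred.
#include <cuda_runtime.h>

extern "C" int ggml_backend_cuda_score(void) {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        return 0;   // no usable CUDA device: let another backend be selected
    }
    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) {
            continue;
        }
        if (prop.major == 3 && prop.minor == 7) {
            return 100; // CC 3.7 device present: strongly prefer this build
        }
    }
    return 1;           // CUDA available, but no CC 3.7 device
}
```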
Changed files: CLAUDE.md (54 lines changed)
@@ -26,10 +26,23 @@ The project uses CUDA 11 toolchain to maintain compatibility with Tesla K80 and
## CC 3.7-Only Optimization Strategy

-**Status**: ✅ **COMPLETED** - All 8 phases finished, compilation successful
+**Status**: ✅ **COMPLETED** - All 9 phases complete and tested successfully

**Completion Summary**: Successfully simplified the CUDA backend to support only CC 3.7 (Kepler/Tesla K80). After the initial optimization removed modern GPU architecture constants from `common.cuh`, additional fixes were required to handle undefined constant references throughout the codebase. All MMA (tensor core) functions have been properly disabled while preserving the DP4A functions needed for CC 3.7 compatibility.
**Critical Runtime Fix - Phase 9 (2025-10-29)**: After Phase 8, the CUDA backend failed to load due to undefined Flash Attention symbols. Solution implemented (a sketch of the resulting stub follows the list):

1. Disabled all flash attention helper functions with `#if 0` (lines 126-274 in fattn.cu)
2. Simplified the main `ggml_cuda_flash_attn_ext()` function to abort immediately for CC 3.7
3. Added `GGML_UNUSED` macros to prevent compiler warnings
4. **Build successful** ✅
5. **Runtime testing successful** ✅ - CUDA backend loads, GPU offloading works correctly
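A minimal sketch of what the simplified stub from steps 1-3 might look like, assuming the upstream `fattn.cu` signature (`ggml_backend_cuda_context & ctx, ggml_tensor * dst`) and ggml's `GGML_UNUSED`/`GGML_ABORT` macros; the repository's exact code and error text may differ:

```cpp
// Hedged sketch, not the commit's verbatim code: with the helper kernels
// compiled out via `#if 0`, the public entry point reduces to a guarded abort.
#include "common.cuh"   // as in upstream fattn.cu

void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    GGML_UNUSED(ctx); // parameters are unused once the kernels are gone
    GGML_UNUSED(dst);
    GGML_ABORT("Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores)");
}
```

Per the commit message, this path aborts gracefully with a clear error if flash attention is ever requested on CC 3.7.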
**Verified Working**:

- ✅ CUDA backend loads without undefined symbol errors
- ✅ Log shows: `load_backend: loaded CUDA backend from libggml-cuda.so`
- ✅ Layers offload to GPU correctly (e.g., 35/35 layers for gemma3:4b)
- ✅ Fast GPU inference confirmed
**Goal**: Simplify the codebase by removing support for all CUDA Compute Capabilities except 3.7, since newer GPUs (CC 5.0+) are already supported by upstream Ollama.

### Rationale
@@ -89,9 +102,48 @@ Detailed cleanup instructions are maintained in folder-specific `CLAUDE.md` file
- `ml/backend/ggml/ggml/src/ggml-cuda/CLAUDE.md` - CUDA kernel cleanup instructions
- `ml/CLAUDE.md` - Go-level GPU detection simplification
- `llm/CLAUDE.md` - Memory estimation optimization for single-GPU preference

These files contain specific line numbers, code blocks, and commands to execute the cleanup incrementally across sessions.
## 🎯 Tesla K80 Performance Optimizations

### Memory Estimation Optimization for Single-GPU Preference

**Status**: ⚠️ **IN PROGRESS** - Design complete, implementation pending

**Goal**: Reduce unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80 dual-GPU systems.

**Problem Identified** (2025-10-29):
Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestimation** causing unnecessary multi-GPU splits:

| Component | Estimated | Actual | Issue |
|-----------|-----------|--------|-------|
| GPU 0 | 7.7 GiB | 4.1 GiB | 47% overestimate |
| GPU 1 | 5.3 GiB | 6.3 GiB | Accurate |
| **Total** | **13.0 GiB** | **10.4 GiB** | **Fits in single GPU!** |
**Root Cause**: `llm/memory.go:289-298` allocates the full graph memory (1.3 GiB) to **EACH GPU**, but actual usage shows that only the primary GPU needs the full graph; secondary GPUs need only ~15% of the graph size (~186 MiB).
**Impact**:

- Models that fit in a single GPU (11.2 GiB) are unnecessarily split across 2 GPUs
- Cross-GPU communication overhead reduces inference speed
- Wasted VRAM: space is reserved but never used
**Solution**: Modify the graph allocation logic to use empirically measured ratios (a sketch follows this list):

- Primary GPU (the last GPU with the most layers): 100% of the graph size (1.3 GiB)
- Secondary GPUs: 15% of the graph size (~186 MiB)
- Expected reduction: 13.0 GiB → 10.8 GiB (fits in a single K80)
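As a sketch of how the ratio-based split could be implemented, assuming a hypothetical helper in `llm/memory.go`: the function name, the way the primary GPU is chosen, and the constant below are illustrative only; the actual change is described in `llm/CLAUDE.md`.

```go
package llm

// secondaryGraphRatio reflects the empirical measurement quoted above:
// secondary GPUs needed only ~15% of the graph (~186 MiB of a 1.3 GiB graph).
const secondaryGraphRatio = 0.15

// splitGraphMemory is a hypothetical helper: reserve the full graph only on
// the primary GPU (the last GPU with the most layers) and the reduced ratio
// on every other GPU, instead of the full graph on each GPU.
func splitGraphMemory(graphSize uint64, gpuLayers []int) []uint64 {
	alloc := make([]uint64, len(gpuLayers))
	if len(gpuLayers) == 0 {
		return alloc
	}

	primary := 0
	for i, layers := range gpuLayers {
		if layers >= gpuLayers[primary] { // ">=" keeps the last GPU on ties
			primary = i
		}
	}

	for i := range alloc {
		if i == primary {
			alloc[i] = graphSize // full graph on the primary GPU
		} else {
			alloc[i] = uint64(float64(graphSize) * secondaryGraphRatio)
		}
	}
	return alloc
}
```

With the 1.3 GiB graph from the example above, this reserves one full graph plus roughly 0.2 GiB for the second GPU instead of a full graph on each.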
**Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.

**Benefits**:

- More models run on a single GPU = faster inference
- Better VRAM utilization
- Simpler deployment for single-model workloads
- Empirically validated with real Tesla K80 measurements
## Documentation Structure

The project documentation is organized as follows: