Document Phase 9 completion: Fix CUDA backend loading for CC 3.7

Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection
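The scoring hook itself is not reproduced here. As a hypothetical sketch of what such a function could look like — only the name `ggml_backend_cuda_score()` comes from this commit; the body, the scoring scale, and how the loader consumes the score are assumptions:

```cpp
// Hypothetical sketch: report a non-zero score only when a CC 3.7 device is
// present, so the loader prefers this backend variant on Tesla K80 systems.
#include <cuda_runtime.h>

extern "C" int ggml_backend_cuda_score(void) {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        return 0;  // no usable CUDA device: never select this backend
    }
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return 0;
    }
    // This build targets CC 3.7 (Kepler / Tesla K80) only.
    return (prop.major == 3 && prop.minor == 7) ? 100 : 0;
}
```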

Testing Results:
- ✅ CUDA backend loads without undefined symbol errors
- ✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
- ✅ Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (it requires Volta-class Tensor Cores).
If it is requested, the backend aborts gracefully with a clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Shang Chieh Tseng
Date: 2025-10-29 17:44:36 +08:00
Commit: 5077ab3fb4 (parent 66fca1b685)
5 changed files with 663 additions and 72 deletions


@@ -26,10 +26,23 @@ The project uses CUDA 11 toolchain to maintain compatibility with Tesla K80 and
## CC 3.7-Only Optimization Strategy
**Status**: ✅ **COMPLETED** - All 8 phases finished, compilation successful
**Status**: ✅ **COMPLETED** - All 9 phases complete and tested successfully
**Completion Summary**: Successfully simplified CUDA backend to support only CC 3.7 (Kepler/Tesla K80). After the initial optimization removed modern GPU architecture constants from `common.cuh`, additional fixes were required to handle undefined constant references throughout the codebase. All MMA (tensor core) functions have been properly disabled while preserving DP4A functions for CC 3.7 compatibility.
**Critical Runtime Fix - Phase 9 (2025-10-29)**: After Phase 8, CUDA backend failed to load due to undefined Flash Attention symbols. Solution implemented:
1. Disabled all flash attention helper functions with `#if 0` (lines 126-274 in fattn.cu)
2. Simplified main `ggml_cuda_flash_attn_ext()` function to abort immediately for CC 3.7
3. Added `GGML_UNUSED` macros to prevent compiler warnings
4. **Build successful**
5. **Runtime testing successful** ✅ - CUDA backend loads, GPU offloading works correctly
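A minimal sketch of what the simplified entry point could look like, assuming the upstream fattn.cuh signature for `ggml_cuda_flash_attn_ext()`; the include, abort macro, and message wording are assumptions rather than the verbatim patch:

```cpp
// Sketch of the simplified Flash Attention entry point for a CC 3.7-only build.
#include "common.cuh"   // ggml_backend_cuda_context, GGML_UNUSED, GGML_ABORT

void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    GGML_UNUSED(ctx);
    GGML_UNUSED(dst);
    // The tensor-core kernels Flash Attention needs are no longer compiled in a
    // CC 3.7-only build, so fail loudly instead of referencing missing symbols.
    GGML_ABORT("flash attention is not supported on compute capability 3.7");
}
```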
**Verified Working**:
- ✅ CUDA backend loads without undefined symbol errors
- ✅ Log shows: `load_backend: loaded CUDA backend from libggml-cuda.so`
- ✅ Layers offload to GPU correctly (e.g., 35/35 layers for gemma3:4b)
- ✅ Fast GPU inference confirmed
**Goal**: Simplify the codebase by removing support for all CUDA Compute Capabilities except 3.7, since newer GPUs (CC 5.0+) are already supported by upstream Ollama.
### Rationale
@@ -89,9 +102,48 @@ Detailed cleanup instructions are maintained in folder-specific `CLAUDE.md` file
- `ml/backend/ggml/ggml/src/ggml-cuda/CLAUDE.md` - CUDA kernel cleanup instructions
- `ml/CLAUDE.md` - Go-level GPU detection simplification
- `llm/CLAUDE.md` - Memory estimation optimization for single-GPU preference
These files contain specific line numbers, code blocks, and commands to execute the cleanup incrementally across sessions.
## 🎯 Tesla K80 Performance Optimizations
### Memory Estimation Optimization for Single-GPU Preference
**Status**: ⚠️ **IN PROGRESS** - Design complete, implementation pending
**Goal**: Reduce unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80 dual-GPU systems.
**Problem Identified** (2025-10-29):
Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestimation** causing unnecessary multi-GPU splits:
| Component | Estimated | Actual | Issue |
|-----------|-----------|--------|-------|
| GPU 0 | 7.7 GiB | 4.1 GiB | 47% overestimate |
| GPU 1 | 5.3 GiB | 6.3 GiB | Accurate |
| **Total** | **13.0 GiB** | **10.4 GiB** | **Fits in single GPU!** |
**Root Cause**: `llm/memory.go:289-298` allocates the full graph memory (1.3 GiB) to **EACH GPU**, but actual usage shows that only the primary GPU needs the full graph; secondary GPUs need only ~15% of the graph size (~186 MiB).
**Impact**:
- Models that fit in single GPU (11.2 GiB) are unnecessarily split across 2 GPUs
- Cross-GPU communication overhead reduces inference speed
- Wasted VRAM reserves space that's never used
**Solution**: Modify graph allocation logic to use empirically-measured ratios:
- Primary GPU (last GPU with most layers): 100% of graph size (1.3 GiB)
- Secondary GPUs: 15% of graph size (~186 MiB)
- Expected reduction: 13.0 GiB → 10.8 GiB (fits in single K80)
**Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.
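As a rough illustration of the intended allocation logic (identifiers here are hypothetical, not the actual names in `llm/memory.go`):

```go
package llm

// Non-primary GPUs were measured to need only ~15% of the full graph buffer.
const secondaryGraphRatio = 0.15

// graphAllocPerGPU returns the graph-memory reservation for each GPU: the
// primary GPU (the one holding the most layers) keeps the full graph, while
// the remaining GPUs reserve only the reduced share.
func graphAllocPerGPU(fullGraphBytes uint64, gpuCount, primaryIdx int) []uint64 {
	allocs := make([]uint64, gpuCount)
	for i := range allocs {
		if i == primaryIdx {
			allocs[i] = fullGraphBytes
		} else {
			allocs[i] = uint64(float64(fullGraphBytes) * secondaryGraphRatio)
		}
	}
	return allocs
}
```

With a full graph of ~1.3 GiB on a dual-GPU K80, this reserves ~1.3 GiB on the primary GPU and ~186 MiB on the secondary, matching the ratios above.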
**Benefits**:
- More models run on single GPU = faster inference
- Better VRAM utilization
- Simpler deployment for single-model workloads
- Empirically validated with real Tesla K80 measurements
## Documentation Structure
The project documentation is organized as follows: