Document Phase 9 completion: Fix CUDA backend loading for CC 3.7

Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection
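The scoring hook itself is not reproduced here. As a hypothetical sketch of what such a function could look like — only the name `ggml_backend_cuda_score()` comes from this commit; the body, the scoring scale, and how the loader consumes the score are assumptions:

```cpp
// Hypothetical sketch: report a non-zero score only when a CC 3.7 device is
// present, so the loader prefers this backend variant on Tesla K80 systems.
#include <cuda_runtime.h>

extern "C" int ggml_backend_cuda_score(void) {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        return 0;  // no usable CUDA device: never select this backend
    }
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return 0;
    }
    // This build targets CC 3.7 (Kepler / Tesla K80) only.
    return (prop.major == 3 && prop.minor == 7) ? 100 : 0;
}
```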

Testing Results:
- ✅ CUDA backend loads without undefined symbol errors
- ✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
- ✅ Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (it requires Volta-class Tensor Cores).
If it is requested, the backend aborts gracefully with a clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Shang Chieh Tseng
Date: 2025-10-29 17:44:36 +08:00
Commit: 5077ab3fb4 (parent 66fca1b685)
5 changed files with 663 additions and 72 deletions


@@ -26,10 +26,23 @@ The project uses CUDA 11 toolchain to maintain compatibility with Tesla K80 and
## CC 3.7-Only Optimization Strategy
**Status**: ✅ **COMPLETED** - All 8 phases finished, compilation successful
**Status**: ✅ **COMPLETED** - All 9 phases complete and tested successfully
**Completion Summary**: Successfully simplified CUDA backend to support only CC 3.7 (Kepler/Tesla K80). After the initial optimization removed modern GPU architecture constants from `common.cuh`, additional fixes were required to handle undefined constant references throughout the codebase. All MMA (tensor core) functions have been properly disabled while preserving DP4A functions for CC 3.7 compatibility.
**Critical Runtime Fix - Phase 9 (2025-10-29)**: After Phase 8, CUDA backend failed to load due to undefined Flash Attention symbols. Solution implemented:
1. Disabled all flash attention helper functions with `#if 0` (lines 126-274 in fattn.cu)
2. Simplified main `ggml_cuda_flash_attn_ext()` function to abort immediately for CC 3.7
3. Added `GGML_UNUSED` macros to prevent compiler warnings
4. **Build successful**
5. **Runtime testing successful** ✅ - CUDA backend loads, GPU offloading works correctly
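A minimal sketch of what the simplified entry point could look like, assuming the upstream fattn.cuh signature for `ggml_cuda_flash_attn_ext()`; the include, abort macro, and message wording are assumptions rather than the verbatim patch:

```cpp
// Sketch of the simplified Flash Attention entry point for a CC 3.7-only build.
#include "common.cuh"   // ggml_backend_cuda_context, GGML_UNUSED, GGML_ABORT

void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    GGML_UNUSED(ctx);
    GGML_UNUSED(dst);
    // The tensor-core kernels Flash Attention needs are no longer compiled in a
    // CC 3.7-only build, so fail loudly instead of referencing missing symbols.
    GGML_ABORT("flash attention is not supported on compute capability 3.7");
}
```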
**Verified Working**:
- ✅ CUDA backend loads without undefined symbol errors
- ✅ Log shows: `load_backend: loaded CUDA backend from libggml-cuda.so`
- ✅ Layers offload to GPU correctly (e.g., 35/35 layers for gemma3:4b)
- ✅ Fast GPU inference confirmed
**Goal**: Simplify the codebase by removing support for all CUDA Compute Capabilities except 3.7, since newer GPUs (CC 5.0+) are already supported by upstream Ollama.
### Rationale
@@ -89,9 +102,48 @@ Detailed cleanup instructions are maintained in folder-specific `CLAUDE.md` file
- `ml/backend/ggml/ggml/src/ggml-cuda/CLAUDE.md` - CUDA kernel cleanup instructions
- `ml/CLAUDE.md` - Go-level GPU detection simplification
- `llm/CLAUDE.md` - Memory estimation optimization for single-GPU preference
These files contain specific line numbers, code blocks, and commands to execute the cleanup incrementally across sessions.
## 🎯 Tesla K80 Performance Optimizations
### Memory Estimation Optimization for Single-GPU Preference
**Status**: ⚠️ **IN PROGRESS** - Design complete, implementation pending
**Goal**: Reduce unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80 dual-GPU systems.
**Problem Identified** (2025-10-29):
Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestimation** causing unnecessary multi-GPU splits:
| Component | Estimated | Actual | Issue |
|-----------|-----------|--------|-------|
| GPU 0 | 7.7 GiB | 4.1 GiB | 47% overestimate |
| GPU 1 | 5.3 GiB | 6.3 GiB | Accurate |
| **Total** | **13.0 GiB** | **10.4 GiB** | **Fits in single GPU!** |
**Root Cause**: `llm/memory.go:289-298` allocates the full graph memory (1.3 GiB) to **EACH GPU**, but actual usage shows that only the primary GPU needs the full graph; secondary GPUs need only ~15% of the graph size (~186 MiB).
**Impact**:
- Models that fit in single GPU (11.2 GiB) are unnecessarily split across 2 GPUs
- Cross-GPU communication overhead reduces inference speed
- Wasted VRAM reserves space that's never used
**Solution**: Modify graph allocation logic to use empirically-measured ratios:
- Primary GPU (last GPU with most layers): 100% of graph size (1.3 GiB)
- Secondary GPUs: 15% of graph size (~186 MiB)
- Expected reduction: 13.0 GiB → 10.8 GiB (fits in single K80)
**Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.
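As a rough illustration of the intended allocation logic (identifiers here are hypothetical, not the actual names in `llm/memory.go`):

```go
package llm

// Non-primary GPUs were measured to need only ~15% of the full graph buffer.
const secondaryGraphRatio = 0.15

// graphAllocPerGPU returns the graph-memory reservation for each GPU: the
// primary GPU (the one holding the most layers) keeps the full graph, while
// the remaining GPUs reserve only the reduced share.
func graphAllocPerGPU(fullGraphBytes uint64, gpuCount, primaryIdx int) []uint64 {
	allocs := make([]uint64, gpuCount)
	for i := range allocs {
		if i == primaryIdx {
			allocs[i] = fullGraphBytes
		} else {
			allocs[i] = uint64(float64(fullGraphBytes) * secondaryGraphRatio)
		}
	}
	return allocs
}
```

With a full graph of ~1.3 GiB on a dual-GPU K80, this reserves ~1.3 GiB on the primary GPU and ~186 MiB on the secondary, matching the ratios above.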
**Benefits**:
- More models run on single GPU = faster inference
- Better VRAM utilization
- Simpler deployment for single-model workloads
- Empirically validated with real Tesla K80 measurements
## Documentation Structure
The project documentation is organized as follows: