Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction

Added Phase 2 documentation for single-GPU optimization:
- CC 3.7 graph correction factor (85% of estimate)
- gemma3:12b now loads on single GPU
- Memory estimate improved from 11.9 GiB → 11.0 GiB
- Validated with 10.0 GiB actual usage, 94% GPU utilization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
commit 296d537a2c (parent 6d87524e22)
Author: Shang Chieh Tseng
Date: 2025-10-30 00:16:38 +08:00


@@ -110,44 +110,56 @@ These files contain specific line numbers, code blocks, and commands to execute
### Memory Estimation Optimization for Single-GPU Preference
**Status**: ✅ **COMPLETED** - Fully implemented and tested (2025-10-30)
**Goal**: Eliminate unnecessary multi-GPU splits by fixing graph memory overestimation on Tesla K80 dual-GPU systems.
### Phase 1: Per-GPU Graph Allocation (2025-10-29)
**Problem**: Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestimation** causing unnecessary multi-GPU splits:
| Component | Estimated | Actual | Issue |
|-----------|-----------|--------|-------|
| GPU 0 | 7.7 GiB | 4.1 GiB | 47% overestimate |
| GPU 1 | 5.3 GiB | 6.3 GiB | Accurate |
| **Total** | **13.0 GiB** | **10.4 GiB** | **Fits in single GPU!** |
**Root Cause**: `llm/memory.go:289-298` allocated the full graph memory (1.3 GiB) to **each** GPU, but actual usage shows that only the primary GPU needs the full graph; secondary GPUs need only ~15% of the graph size (~186 MiB).
**Impact**:
- Models that fit in a single GPU (11.2 GiB) were unnecessarily split across 2 GPUs
- Cross-GPU communication overhead reduced inference speed
- Wasted VRAM: space was reserved but never used
**Solution Implemented** (see the sketch below):
1. Per-GPU graph allocations: secondary GPUs reserve 190 MiB while the primary keeps the full 1.3 GiB (based on empirical measurements)
2. Reverse-order layer distribution (prefer loading on the last GPU first)
**Results**: gemma3:12b split improved from 25,24 → 1,48 layers, but still not single-GPU.
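A minimal sketch of the per-GPU allocation idea (illustrative only; the function and variable names below are hypothetical, not the actual `llm/memory.go` code):

```go
// Sketch of Phase 1: reserve the full compute graph only on the primary GPU;
// secondary GPUs reserve roughly 15% of it. Illustrative names, not llm/memory.go.
package main

import "fmt"

const MiB uint64 = 1024 * 1024

func perGPUGraph(graphSize uint64, gpuCount int) []uint64 {
	alloc := make([]uint64, gpuCount)
	for i := range alloc {
		if i == 0 {
			alloc[i] = graphSize // primary GPU: full graph (~1.3 GiB)
		} else {
			alloc[i] = graphSize * 15 / 100 // secondary GPUs: ~15% (~190 MiB)
		}
	}
	return alloc
}

func main() {
	for i, a := range perGPUGraph(1331*MiB, 2) {
		fmt.Printf("GPU %d graph reservation: %d MiB\n", i, a/MiB)
	}
}
```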
### Phase 2: CC 3.7 Graph Correction Factor (2025-10-30)
**Problem**: Graph estimates were 15-20% higher than actual usage for CC 3.7 GPUs:
- Estimated: 1.3 GiB
- Actual: 1.1 GiB
- This caused the gemma3:12b single-GPU check to fail by a ~200 MiB margin (see the check below)
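The 85% figure follows from those measurements; a quick sanity check with the numbers above (plain arithmetic, not repository code):

```go
package main

import "fmt"

func main() {
	estimated := 1.3 // GiB, graph estimate before correction
	actual := 1.1    // GiB, graph usage measured on the Tesla K80

	fmt.Printf("measured/estimated ratio: %.0f%%\n", actual/estimated*100) // ≈ 85%
	fmt.Printf("estimate recovered: %.1f GiB\n", estimated-actual)         // ≈ 0.2 GiB, the missing ~200 MiB margin
}
```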
**Root Cause**: Output layer (2.6 GiB) couldn't fit after 48 layers (8.5 GiB) due to overestimated graph overhead.
**Solution** (`llm/memory.go:173-182`):
```go
// Apply empirical 85% correction factor for Tesla K80 (CC 3.7)
if gpus[0].Library == "cuda" && gpus[0].Compute == "3.7" {
	graphPartialOffload = (graphPartialOffload * 85) / 100
	graphFullOffload = (graphFullOffload * 85) / 100
}
```
**Results Achieved**:
**After Phase 1** (per-GPU graph allocation):
- **gemma3:4b**: Single GPU (no split)
- **gemma3:12b**: 1,48 layer split (down from 25,24), 98% on primary GPU
- **Memory estimate**: Reduced from 13.0 GiB → 11.9 GiB
- **Actual usage**: 10.4-10.5 GiB total (fits on a single K80)
**After Phase 2** (CC 3.7 graph correction):
- **gemma3:4b**: Single GPU ✅
- **gemma3:12b**: Single GPU ✅ (was a 1,48 split)
- **Memory estimate**: 11.9 GiB → 11.0 GiB (-900 MiB; see the fit check below)
- **Actual usage**: 10.0 GiB on a single GPU
- **GPU utilization**: 94% during inference
- **nvidia-smi**: GPU 0: 10,015 MiB, GPU 1: 7 MiB (idle)
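A sketch of how those estimates translate into the scheduling decision, using the numbers reported in this section (illustrative; this is not Ollama's actual scheduler code):

```go
package main

import "fmt"

const MiB uint64 = 1024 * 1024

// fitsOnSingleGPU is the basic check the memory estimate feeds into.
func fitsOnSingleGPU(estimate, freeVRAM uint64) bool {
	return estimate <= freeVRAM
}

func main() {
	free := 11469 * MiB // usable VRAM on one Tesla K80 GPU (~11.2 GiB)

	phase1 := 12186 * MiB // ~11.9 GiB estimate before the CC 3.7 correction
	phase2 := 11264 * MiB // ~11.0 GiB estimate after the correction

	fmt.Println("before correction, fits on one GPU:", fitsOnSingleGPU(phase1, free)) // false -> split across GPUs
	fmt.Println("after correction, fits on one GPU: ", fitsOnSingleGPU(phase2, free)) // true  -> single GPU
}
```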
**Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.
**Technical Details**:
- Only affects CUDA CC 3.7 GPUs (e.g., Tesla K80; see the sketch after this list)
- No impact on newer GPUs (CC 5.0+)
- Maintains 10% safety margin between estimate and actual
- Preserves multi-GPU functionality for models >11 GiB
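A hypothetical helper expressing the same gating in one place (names are illustrative, not from `llm/memory.go`); it makes explicit that only compute capability 3.7 is corrected and every other GPU keeps its estimate unchanged:

```go
package main

import "fmt"

// graphCorrectionPct returns the percentage applied to graph memory estimates.
func graphCorrectionPct(library, computeCapability string) uint64 {
	if library == "cuda" && computeCapability == "3.7" {
		return 85 // empirical Tesla K80 (CC 3.7) correction
	}
	return 100 // CC 5.0+ GPUs and non-CUDA backends are unaffected
}

func main() {
	const graphMiB uint64 = 1331 // ~1.3 GiB uncorrected graph estimate
	for _, cc := range []string{"3.7", "8.6"} {
		fmt.Printf("CC %s: corrected graph estimate %d MiB\n", cc, graphMiB*graphCorrectionPct("cuda", cc)/100)
	}
}
```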
**Benefits**:
- ✅ gemma3:12b runs on a single GPU (no cross-GPU communication)
- ✅ Faster inference (no tensor-split overhead); more models run on a single GPU
- ✅ Better VRAM utilization
- ✅ Simpler deployment for single-model workloads
- ✅ Empirically validated with real Tesla K80 measurements
- ✅ Conservative correction maintains stability
## Model Architecture Compatibility