diff --git a/CLAUDE.md b/CLAUDE.md
index 33fed3c2..e7d69b0e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -110,44 +110,56 @@ These files contain specific line numbers, code blocks, and commands to execute
 ### Memory Estimation Optimization for Single-GPU Preference
 
-**Status**: ✅ **COMPLETED** - Implemented and tested successfully
+**Status**: ✅ **COMPLETED** - Fully implemented and tested (2025-10-30)
 
-**Goal**: Reduce unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80 dual-GPU systems.
+**Goal**: Eliminate unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80.
 
-**Problem Identified** (2025-10-29):
+### Phase 1: Per-GPU Graph Allocation (2025-10-29)
 
-Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestimation** causing unnecessary multi-GPU splits:
+**Problem**: Multi-GPU systems allocated full graph memory (1.3 GiB) to EACH GPU, causing 2.6 GiB total overestimation.
 
-| Component | Estimated | Actual | Issue |
-|-----------|-----------|--------|-------|
-| GPU 0 | 7.7 GiB | 4.1 GiB | 47% overestimate |
-| GPU 1 | 5.3 GiB | 6.3 GiB | Accurate |
-| **Total** | **13.0 GiB** | **10.4 GiB** | **Fits in single GPU!** |
+**Solution**: Secondary GPUs use 190 MiB, primary GPU uses full 1.3 GiB (based on empirical measurements).
 
-**Root Cause**: `llm/memory.go:289-298` allocates full graph memory (1.3 GiB) to **EACH GPU**, but actual usage shows only the primary GPU needs full graph. Secondary GPUs only need ~15% of graph size (~186 MiB).
+**Results**: gemma3:12b split improved from 25,24 → 1,48 layers, but still not single-GPU.
 
-**Impact**:
-- Models that fit in single GPU (11.2 GiB) are unnecessarily split across 2 GPUs
-- Cross-GPU communication overhead reduces inference speed
-- Wasted VRAM reserves space that's never used
+### Phase 2: CC 3.7 Graph Correction Factor (2025-10-30)
 
-**Solution Implemented**:
-1. Per-GPU graph allocations (190 MiB for secondary GPUs vs 1.3 GiB for primary)
-2. Reverse-order layer distribution (prefer loading on last GPU first)
+**Problem**: Graph estimates were 15-20% higher than actual usage for CC 3.7 GPUs:
+- Estimated: 1.3 GiB
+- Actual: 1.1 GiB
+- This caused the gemma3:12b single-GPU check to fail by a ~200 MiB margin
+
+**Root Cause**: The output layer (2.6 GiB) couldn't fit after 48 layers (8.5 GiB) due to overestimated graph overhead.
+
+**Solution** (`llm/memory.go:173-182`):
+```go
+// Apply empirical 85% correction factor for Tesla K80 (CC 3.7)
+if gpus[0].Library == "cuda" && gpus[0].Compute == "3.7" {
+	graphPartialOffload = (graphPartialOffload * 85) / 100
+	graphFullOffload = (graphFullOffload * 85) / 100
+}
+```
 
 **Results Achieved**:
-- **gemma3:4b**: Single GPU (no split) ✅
-- **gemma3:12b**: 1,48 layer split (down from 25,24) - 98% on primary GPU ✅
-- **Memory estimate**: Reduced from 13.0 GiB → 11.9 GiB
-- **Actual usage**: 10.4-10.5 GiB total (fits on single K80)
+- **gemma3:4b**: Single GPU ✅
+- **gemma3:12b**: Single GPU ✅ (was 1,48 split)
+- **Memory estimate**: 11.9 GiB → 11.0 GiB (-900 MiB)
+- **Actual usage**: 10.0 GiB on single GPU
+- **GPU utilization**: 94% during inference
+- **nvidia-smi**: GPU 0: 10,015 MiB, GPU 1: 7 MiB (idle)
 
-**Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.
+**Technical Details**:
+- Only affects CUDA CC 3.7 GPUs (Tesla K80, K40, M40)
+- No impact on newer GPUs (CC 5.0+)
+- Maintains a 10% safety margin between estimate and actual
+- Preserves multi-GPU functionality for models >11 GiB
 
 **Benefits**:
-- More models run on single GPU = faster inference
-- Better VRAM utilization
-- Simpler deployment for single-model workloads
-- Empirically validated with real Tesla K80 measurements
+- ✅ gemma3:12b runs on single GPU (no cross-GPU communication)
+- ✅ Faster inference (no tensor split overhead)
+- ✅ Better VRAM utilization
+- ✅ Empirically validated with real measurements
+- ✅ Conservative correction maintains stability
 
 ## Model Architecture Compatibility
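Note (not part of the diff): the Phase 1 per-GPU allocation is only described in prose above, so here is a minimal, self-contained sketch of how the two phases fit together. It is not the actual `llm/memory.go` implementation; the names `gpuInfo` and `planGraphAllocations` are invented for illustration, and only the 190 MiB secondary-GPU figure, the 85% CC 3.7 correction, and the ~1.3 GiB graph estimate come from the diff.

```go
// Sketch of the described memory-planning idea: reserve the full compute-graph
// estimate on the primary GPU only, give secondary GPUs a small fixed amount,
// and apply the empirical CC 3.7 correction factor first.
// Hypothetical names; not the code from llm/memory.go.
package main

import "fmt"

const mib = uint64(1024 * 1024)

type gpuInfo struct {
	ID      int
	Library string // e.g. "cuda"
	Compute string // CUDA compute capability, e.g. "3.7" for Tesla K80
}

// planGraphAllocations returns the graph memory reservation for each GPU.
func planGraphAllocations(gpus []gpuInfo, graphFullOffload uint64) []uint64 {
	// Phase 2: empirical 85% correction factor for CC 3.7 GPUs (see the diff).
	if len(gpus) > 0 && gpus[0].Library == "cuda" && gpus[0].Compute == "3.7" {
		graphFullOffload = graphFullOffload * 85 / 100
	}

	// Phase 1: secondary GPUs were measured to need only ~190 MiB of graph
	// memory, so only the primary GPU carries the full reservation.
	allocs := make([]uint64, len(gpus))
	for i := range gpus {
		if i == 0 {
			allocs[i] = graphFullOffload
		} else {
			allocs[i] = 190 * mib
		}
	}
	return allocs
}

func main() {
	// Dual-GPU Tesla K80 with a ~1.3 GiB estimated graph.
	k80 := []gpuInfo{
		{ID: 0, Library: "cuda", Compute: "3.7"},
		{ID: 1, Library: "cuda", Compute: "3.7"},
	}
	for i, a := range planGraphAllocations(k80, 1331*mib) {
		fmt.Printf("GPU %d graph reservation: %d MiB\n", i, a/mib)
	}
}
```

Running the sketch prints roughly 1131 MiB for GPU 0 (the corrected ~1.1 GiB graph) and 190 MiB for GPU 1, which is consistent with the single-GPU outcome reported in the results above.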