# Fix gemma3:12b to load on single Tesla K80 GPU

## Problem

gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs despite fitting in a single Tesla K80 (11.2 GiB available). This caused:

- Unnecessary multi-GPU splits (1,48 layer distribution)
- Cross-GPU communication overhead
- Slower inference performance
- Wasted VRAM on the secondary GPU

## Root Cause

Graph memory estimates for CUDA compute capability (CC) 3.7 were consistently 15-20% higher than actual usage:

- Estimated: 1.3 GiB per GPU
- Actual: 1.1 GiB on the primary GPU, ~86 MiB on the secondary GPU
- As a result, single-GPU placement failed by a margin of roughly 200 MiB

## Solution

Apply an empirical 85% correction factor to graph memory estimates for Tesla K80 (CC 3.7) GPUs, based on measured actual usage.

## Changes

- llm/memory.go: Add CC 3.7 graph correction (lines 173-182); see the sketch at the end of this description
  - Reduces graphPartialOffload and graphFullOffload by 15%
  - Only applies to the CUDA library with compute capability 3.7
  - Based on empirical measurements from gemma3:12b testing

## Results

### Before

- Memory estimate: 11.9 GiB
- GPU split: 1,48 layers across 2 GPUs
- GPU 0: 617 MiB, GPU 1: 9,866 MiB
- Command: --tensor-split 1,48

### After

- Memory estimate: 11.0 GiB (-900 MiB)
- GPU split: none (single GPU)
- GPU 0: 10,015 MiB, GPU 1: 7 MiB
- Command: --parallel 1 (no tensor-split)
- GPU utilization: 94% during inference

## Testing

- ✅ gemma3:12b loads on a single GPU
- ✅ All 49 layers offloaded to GPU 0
- ✅ Inference works correctly with 94% GPU utilization
- ✅ No cross-GPU communication overhead
- ✅ Memory usage: 10.0 GiB actual vs 11.0 GiB estimated (~10% safety margin)

## Compatibility

- Only affects Tesla K80 and other CC 3.7 GPUs
- No impact on newer GPUs (CC 5.0+)
- Maintains existing multi-GPU functionality for models >11 GiB
- Preserves safety margins for stable operation
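
## Implementation sketch

For reference, a minimal standalone sketch of the correction described under Changes. The actual patch lives inside the memory-estimation path in llm/memory.go; the helper name `ccGraphCorrection`, the string-based library/compute-capability check, and the sample byte counts below are illustrative assumptions, not the exact code in the diff.

```go
package main

import "fmt"

// ccGraphCorrection returns the multiplier applied to graph memory
// estimates. Empirical measurements on a Tesla K80 (gemma3:12b) showed
// CC 3.7 estimates running 15-20% above actual usage, so CUDA CC 3.7
// estimates are scaled by 0.85; all other GPUs keep the unmodified value.
func ccGraphCorrection(library, computeCapability string) float64 {
	if library == "cuda" && computeCapability == "3.7" {
		return 0.85
	}
	return 1.0
}

func main() {
	// Hypothetical graph estimates (bytes), standing in for the values
	// derived from the model's compute graph (~1.3 GiB in this case).
	graphPartialOffload := uint64(1_395_864_371)
	graphFullOffload := uint64(1_395_864_371)

	factor := ccGraphCorrection("cuda", "3.7")
	graphPartialOffload = uint64(float64(graphPartialOffload) * factor)
	graphFullOffload = uint64(float64(graphFullOffload) * factor)

	fmt.Printf("partial: %.2f GiB, full: %.2f GiB\n",
		float64(graphPartialOffload)/(1<<30),
		float64(graphFullOffload)/(1<<30))
}
```

With the sample 1.3 GiB input, the corrected estimate comes out to roughly 1.10 GiB, which lines up with the 1.1 GiB measured on the primary GPU and explains why the single-GPU placement now fits within the ~200 MiB margin.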