Fix gemma3:12b to load on single Tesla K80 GPU

## Problem
gemma3:12b (10.2 GiB actual) was being split across 2 GPUs despite fitting
in a single Tesla K80 (11.2 GiB available). This caused:
- Unnecessary multi-GPU splits (1,48 layer distribution)
- Cross-GPU communication overhead
- Slower inference performance
- Wasted VRAM on secondary GPU

## Root Cause
Graph memory estimates for CUDA CC 3.7 were consistently 15-20% higher
than actual usage:
- Estimated: 1.3 GiB per GPU
- Actual: 1.1 GiB on the primary GPU, ~86 MiB on the secondary GPU
- This caused single-GPU placement to fail by a ~200 MiB margin (see the
  arithmetic below)
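
The 85% factor and the GiB figures come straight from the measurements above;
the snippet below is illustrative arithmetic only, not code from this change:

```go
package main

import "fmt"

func main() {
	const estimatedGraphGiB = 1.3 // graph estimate for CC 3.7 before correction
	const correction = 0.85       // empirical factor derived from measured usage

	corrected := estimatedGraphGiB * correction
	recoveredMiB := (estimatedGraphGiB - corrected) * 1024

	fmt.Printf("corrected graph estimate: %.2f GiB\n", corrected) // ~1.1 GiB, matching actual usage
	fmt.Printf("recovered headroom: %.0f MiB\n", recoveredMiB)    // ~200 MiB, the margin the fit check was missing
}
```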

## Solution
Applied an empirical 85% correction factor to graph memory estimates for
Tesla K80 (CC 3.7) GPUs, based on measured actual usage.

## Changes
- llm/memory.go: Add CC 3.7 graph correction (lines 173-182); see the
  sketch below
- Reduces graphPartialOffload and graphFullOffload by 15%
- Only applies to the CUDA library with compute capability 3.7
- Based on empirical measurements from gemma3:12b testing
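
A minimal sketch of the correction described above, reusing the
graphPartialOffload/graphFullOffload names from this message; the function
name, signature, and gating below are illustrative assumptions and do not
reproduce the actual diff in llm/memory.go:

```go
// applyCC37GraphCorrection scales down graph memory estimates for CUDA
// compute capability 3.7 GPUs (e.g. Tesla K80), whose measured graph
// usage runs ~15% below the estimator's prediction.
func applyCC37GraphCorrection(library, computeCapability string,
	graphPartialOffload, graphFullOffload uint64) (uint64, uint64) {
	// Leave every other library / compute capability untouched.
	if library != "cuda" || computeCapability != "3.7" {
		return graphPartialOffload, graphFullOffload
	}
	// Empirical 85% correction factor from gemma3:12b measurements.
	return graphPartialOffload * 85 / 100, graphFullOffload * 85 / 100
}
```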

## Results
### Before:
- Memory estimate: 11.9 GiB
- GPU split: 1,48 layers across 2 GPUs
- GPU 0: 617 MiB, GPU 1: 9,866 MiB
- Command: --tensor-split 1,48
### After:
- Memory estimate: 11.0 GiB (-900 MiB)
- GPU split: None (single GPU)
- GPU 0: 10,015 MiB, GPU 1: 7 MiB
- Command: --parallel 1 (no tensor-split)
- GPU utilization: 94% during inference

## Testing
- ✅ gemma3:12b loads on single GPU
- ✅ All 49 layers offloaded to GPU 0
- ✅ Inference works correctly with 94% GPU utilization
- ✅ No cross-GPU communication overhead
- ✅ Memory usage: 10.0 GiB vs 11.0 GiB estimated (10% safety margin)

## Compatibility
- Only affects Tesla K80 and other CC 3.7 GPUs; see the gate check sketched below
- No impact on newer GPUs (CC 5.0+)
- Maintains existing multi-GPU functionality for models >11 GiB
- Preserves safety margins for stable operation
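
To make the gating concrete, a hypothetical table-driven test against the
sketch in the Changes section (this does not ship with the change; it only
illustrates which devices are affected):

```go
package llm // hypothetical placement, alongside the sketch above

import "testing"

func TestCC37GraphCorrectionGate(t *testing.T) {
	cases := []struct {
		library, cc string
		wantScaled  bool
	}{
		{"cuda", "3.7", true},  // Tesla K80 and other CC 3.7 devices: corrected
		{"cuda", "5.0", false}, // newer CUDA GPUs: estimates unchanged
		{"rocm", "3.7", false}, // non-CUDA libraries: estimates unchanged
	}
	for _, c := range cases {
		partial, full := applyCC37GraphCorrection(c.library, c.cc, 1000, 1000)
		scaled := partial == 850 && full == 850
		if scaled != c.wantScaled {
			t.Errorf("%s / CC %s: scaled=%v, want %v", c.library, c.cc, scaled, c.wantScaled)
		}
	}
}
```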