Fix gemma3:12b to load on single Tesla K80 GPU

## Problem

gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs despite fitting
in a single Tesla K80 (11.2 GiB available). This caused:

- Unnecessary multi-GPU splits (1,48 layer distribution)
- Cross-GPU communication overhead
- Slower inference performance
- Wasted VRAM on secondary GPU

## Root Cause

Graph memory estimates for CUDA CC 3.7 were consistently 15-20% higher
than actual usage:

- Estimated: 1.3 GiB per GPU
- Actual: 1.1 GiB on the primary GPU, ~86 MiB on the secondary GPU
- This caused the single-GPU fit check to fail by a ~200 MiB margin
  (see the quick check below)
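
The 85% factor in the fix follows directly from these numbers. A quick
standalone check of the ratio, using only the figures above (this is an
illustrative snippet, not code from this change):

```go
package main

import "fmt"

func main() {
	// Figures from the Root Cause bullets above (primary GPU).
	estimated := 1.3 // GiB, graph estimate before correction
	measured := 1.1  // GiB, measured graph usage

	// Fraction of the estimate actually used -> the correction factor.
	fmt.Printf("measured/estimated = %.2f\n", measured/estimated) // ~0.85

	// Overestimate relative to measured usage, i.e. the 15-20% band above.
	fmt.Printf("estimate is %.0f%% above measured\n", (estimated/measured-1)*100) // ~18%
}
```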

## Solution

Applied an empirical 85% correction factor to graph memory estimates for
Tesla K80 (CC 3.7) GPUs, based on measured actual usage.

## Changes

- llm/memory.go: Add CC 3.7 graph correction (lines 173-182, sketched below)
- Reduces graphPartialOffload and graphFullOffload by 15%
- Only applies to the CUDA library with compute capability 3.7
- Based on empirical measurements from gemma3:12b testing
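
For reference, a minimal sketch of the shape this correction takes. This
is not the literal code from llm/memory.go: it assumes the two estimates
live in variables named graphPartialOffload and graphFullOffload as listed
above, and that the device's library name and compute capability are at
hand; the helper name and signature are illustrative only.

```go
package llm

// cc37GraphFactor is the empirical correction: graph estimates for
// compute capability 3.7 (Tesla K80) run roughly 15-20% above measured usage.
const cc37GraphFactor = 0.85

// applyCC37GraphCorrection (illustrative name) scales the partial- and
// full-offload graph estimates down for CUDA devices reporting CC 3.7,
// leaving all other devices untouched.
func applyCC37GraphCorrection(library string, major, minor int, graphPartialOffload, graphFullOffload uint64) (uint64, uint64) {
	if library != "cuda" || major != 3 || minor != 7 {
		return graphPartialOffload, graphFullOffload
	}
	return uint64(float64(graphPartialOffload) * cc37GraphFactor),
		uint64(float64(graphFullOffload) * cc37GraphFactor)
}
```

With the estimates scaled this way, the gemma3:12b total drops below the
11.2 GiB available on a single K80, so the scheduler no longer needs a
tensor split.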

## Results

### Before:

- Memory estimate: 11.9 GiB
- GPU split: 1,48 layers across 2 GPUs
- GPU 0: 617 MiB, GPU 1: 9,866 MiB
- Command: --tensor-split 1,48

### After:

- Memory estimate: 11.0 GiB (-900 MiB)
- GPU split: None (single GPU)
- GPU 0: 10,015 MiB, GPU 1: 7 MiB
- Command: --parallel 1 (no tensor-split)
- GPU utilization: 94% during inference

## Testing

- ✅ gemma3:12b loads on single GPU
- ✅ All 49 layers offloaded to GPU 0
- ✅ Inference works correctly with 94% GPU utilization
- ✅ No cross-GPU communication overhead
- ✅ Memory usage: 10.0 GiB vs 11.0 GiB estimated (10% safety margin)

## Compatibility

- Only affects Tesla K80 and other CC 3.7 GPUs
- No impact on newer GPUs (CC 5.0+)
- Maintains existing multi-GPU functionality for models >11 GiB
- Preserves safety margins for stable operation