Fix gemma3:12b to load on single Tesla K80 GPU

## Problem

gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs despite fitting
in a single Tesla K80 (11.2 GiB available). This caused:

- Unnecessary multi-GPU splits (1,48 layer distribution)
- Cross-GPU communication overhead
- Slower inference performance
- Wasted VRAM on secondary GPU

## Root Cause

Graph memory estimates for CUDA CC 3.7 were consistently 15-20% higher
than actual usage:

- Estimated: 1.3 GiB per GPU
- Actual: 1.1 GiB on the primary GPU, ~86 MiB on the secondary GPU
- This caused the single-GPU fit check to fail by a ~200 MiB margin
  (see the quick check below)
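
The 85% factor in the fix follows directly from these numbers. A quick
standalone check of the ratio, using only the figures above (this is an
illustrative snippet, not code from this change):

```go
package main

import "fmt"

func main() {
	// Figures from the Root Cause bullets above (primary GPU).
	estimated := 1.3 // GiB, graph estimate before correction
	measured := 1.1  // GiB, measured graph usage

	// Fraction of the estimate actually used -> the correction factor.
	fmt.Printf("measured/estimated = %.2f\n", measured/estimated) // ~0.85

	// Overestimate relative to measured usage, i.e. the 15-20% band above.
	fmt.Printf("estimate is %.0f%% above measured\n", (estimated/measured-1)*100) // ~18%
}
```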

## Solution

Applied an empirical 85% correction factor to graph memory estimates for
Tesla K80 (CC 3.7) GPUs, based on measured actual usage.

## Changes

- llm/memory.go: Add CC 3.7 graph correction (lines 173-182, sketched below)
- Reduces graphPartialOffload and graphFullOffload by 15%
- Only applies to the CUDA library with compute capability 3.7
- Based on empirical measurements from gemma3:12b testing
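
For reference, a minimal sketch of the shape this correction takes. This
is not the literal code from llm/memory.go: it assumes the two estimates
live in variables named graphPartialOffload and graphFullOffload as listed
above, and that the device's library name and compute capability are at
hand; the helper name and signature are illustrative only.

```go
package llm

// cc37GraphFactor is the empirical correction: graph estimates for
// compute capability 3.7 (Tesla K80) run roughly 15-20% above measured usage.
const cc37GraphFactor = 0.85

// applyCC37GraphCorrection (illustrative name) scales the partial- and
// full-offload graph estimates down for CUDA devices reporting CC 3.7,
// leaving all other devices untouched.
func applyCC37GraphCorrection(library string, major, minor int, graphPartialOffload, graphFullOffload uint64) (uint64, uint64) {
	if library != "cuda" || major != 3 || minor != 7 {
		return graphPartialOffload, graphFullOffload
	}
	return uint64(float64(graphPartialOffload) * cc37GraphFactor),
		uint64(float64(graphFullOffload) * cc37GraphFactor)
}
```

With the estimates scaled this way, the gemma3:12b total drops below the
11.2 GiB available on a single K80, so the scheduler no longer needs a
tensor split.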

## Results

### Before:

- Memory estimate: 11.9 GiB
- GPU split: 1,48 layers across 2 GPUs
- GPU 0: 617 MiB, GPU 1: 9,866 MiB
- Command: --tensor-split 1,48

### After:

- Memory estimate: 11.0 GiB (-900 MiB)
- GPU split: None (single GPU)
- GPU 0: 10,015 MiB, GPU 1: 7 MiB
- Command: --parallel 1 (no tensor-split)
- GPU utilization: 94% during inference

## Testing

- ✅ gemma3:12b loads on single GPU
- ✅ All 49 layers offloaded to GPU 0
- ✅ Inference works correctly with 94% GPU utilization
- ✅ No cross-GPU communication overhead
- ✅ Memory usage: 10.0 GiB vs 11.0 GiB estimated (10% safety margin)

## Compatibility

- Only affects Tesla K80 and other CC 3.7 GPUs
- No impact on newer GPUs (CC 5.0+)
- Maintains existing multi-GPU functionality for models >11 GiB
- Preserves safety margins for stable operation