# Solution: Fix gemma3:12b Single-GPU Loading on Tesla K80

**Date**: 2025-10-29
**Branch**: `fix-memory-estimation-gemma12b`
**Status**: Root cause identified, solution designed

---

## Problem Summary

**Issue**: gemma3:12b (10.2 GiB actual usage) splits across 2 GPUs despite fitting on a single Tesla K80 GPU (11.2 GiB).

**Symptoms**:
- Estimated memory: 11.9 GiB (layers split 1,48 across the two GPUs)
- Actual memory: 10.2 GiB (fits in a single GPU!)
- Overestimation: 1.7 GiB

---

## Root Cause Analysis

### Discovery from Debug Logs

The memory estimation function runs **4 times** with different GPU configurations:

1. **Estimations 1 & 2**: Single GPU (GPU 0)
   - Result: `used="8.5 GiB" required="8.6 GiB" fits=true`
   - **All 48 layers fit!** ✅

2. **Estimations 3 & 4**: Multi-GPU (GPU 0 + GPU 1)
   - Result: layers split 1,48
   - `memory.required.allocations="[3.3 GiB 8.6 GiB]"` = 11.9 GiB total

### The Real Problem

**Location**: `server/sched.go` lines 865-891

**Logic Flow**:

```go
// Lines 865-877: Try a single GPU first
for _, g := range sgl {
    if ok, estimatedVRAM = llm.PredictServerFit([]discover.GpuInfo{g}, ...); ok {
        return []discover.GpuInfo{g} // ← Should succeed here!
    }
}

// Lines 883-891: Fall back to multi-GPU
if ok, estimatedVRAM = llm.PredictServerFit(sgl, ...); ok {
    return sgl // ← But returns multi-GPU instead!
}
```

**Why the Single-GPU Check Fails**:

The single-GPU check at line 870 calls `PredictServerFit([GPU 0], ...)`, which:

1. Calls `EstimateGPULayers([GPU 0], ...)`
2. Gets an estimate with `is_multi_gpu=false`, `graph_alloc="1.3 GiB"`
3. Used: 8.5 GiB + overhead
4. Checks: `8.6 GiB < 11.1 GiB` ✅ **Fits!**
5. But `PredictServerFit` **still returns false**!

### The Bug

Looking at `llm/memory.go:18-36` (`PredictServerFit`):

```go
func PredictServerFit(...) (bool, uint64) {
    for _, gpus := range allGpus.ByLibrary() {
        estimate := EstimateGPULayers(gpus, f, projectors, opts, numParallel)
        layerCount, estimatedVRAM = estimate.Layers, estimate.VRAMSize
        if opts.NumGPU < 0 {
            if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {
                return true, estimatedVRAM // ← Needs 49 layers
            }
        }
    }
    return false, estimatedVRAM
}
```

**The issue**: `f.KV().BlockCount()` returns **48** (repeating layers), so the check requires **49 layers** (48 + 1 output). But the debug logs show:

```
total_layers=48
```

The estimate only counts **48 layers**, NOT 49! So the check `layerCount >= 49` **fails**, even though all layers actually fit!

---

## Solution Options

### Option A: Fix Layer Count (Safest)

**File**: `llm/memory.go`
**Lines**: Around 282-303 (output layer handling)

**Issue**: The output layer is handled separately and may not be counted in `layerCount`.

**Fix**: Ensure the output layer is included in the layer count.

### Option B: Adjust Comparison Logic

**File**: `llm/memory.go` line 26

**Change**:

```go
// Before:
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {

// After (if output layer not in BlockCount):
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
```

### Option C: Fix EstimateGPULayers to Always Count Output

**Most robust**: Ensure the layer count explicitly includes the output layer when it is successfully placed.

---

## Recommended Solution

**Approach**: Option A + C (fix both the counting and the verification)

### Step 1: Verify Output Layer Counting

Check whether output layer placement increments `layerCount`:

```go
// Around lines 282-303 in memory.go
if memoryLastLayer > 0 {
    // ... placement logic ...
    gpuAllocations[g.i] += memoryLastLayer
    layerCounts[g.i]++ // ← Does this happen?
    layerCount++       // ← Does this happen?
}
```

### Step 2: Adjust Comparison if Needed

If the output layer is NOT included in `BlockCount()`, adjust the comparison at line 26:

```go
// Check against BlockCount() only (48 layers)
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
    return true, estimatedVRAM
}
```
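Tying Options A and C together, here is a self-contained sketch of the intended behavior. It is a toy model, not the real `EstimateGPULayers`: the types, sizes, and helper names (`gpu`, `estimateLayers`, `predictFit`) are made up for illustration, but the counting logic mirrors the bug described above — if the output layer is placed without incrementing the layer count, the `>= BlockCount()+1` check can never pass.

```go
package main

import "fmt"

// gpu is a simplified stand-in for the real GPU info type (hypothetical).
type gpu struct {
    free uint64 // bytes of free VRAM
}

// estimateLayers mimics a greedy placement: it places blockCount repeating
// layers plus one output layer on a single GPU and returns how many layers
// were placed. The fix is that the output layer, when it fits, increments
// the same counter that the fit check later compares against blockCount+1.
func estimateLayers(g gpu, blockCount int, layerSize, outputSize uint64) (layerCount int) {
    used := uint64(0)
    for i := 0; i < blockCount; i++ {
        if used+layerSize > g.free {
            return layerCount
        }
        used += layerSize
        layerCount++
    }
    // Output layer handling: without the increment below, layerCount tops out
    // at blockCount (48) and the ">= blockCount+1" check can never pass, even
    // when everything fits on the GPU.
    if used+outputSize <= g.free {
        layerCount++ // ← the fix: count the output layer too
    }
    return layerCount
}

// predictFit mirrors the comparison in PredictServerFit for NumGPU < 0.
func predictFit(layerCount, blockCount int) bool {
    return layerCount > 0 && layerCount >= blockCount+1
}

func main() {
    k80 := gpu{free: 11*1024*1024*1024 + 200*1024*1024} // ~11.2 GiB free (illustrative)
    const blockCount = 48
    layers := estimateLayers(k80, blockCount, 180*1024*1024, 400*1024*1024) // made-up sizes
    fmt.Printf("layers=%d fits=%v\n", layers, predictFit(layers, blockCount))
}
```

Run as-is this prints `layers=49 fits=true`; removing the `layerCount++` for the output layer reproduces the current behavior (`layers=48 fits=false`), which is exactly why the scheduler falls through to the multi-GPU path.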
---

## Testing Plan

1. **Verify current behavior**:
   - Add logging to show the `f.KV().BlockCount()` value
   - Add logging to show `layerCount` from the estimate
   - Add logging in the output layer placement to see whether it increments the count

2. **Apply the fix**

3. **Test gemma3:12b**:
   - Should load on a single GPU
   - Should show `layers.split=""` (no split)
   - Should use ~10.2 GiB on a single GPU

4. **Regression test**:
   - Test gemma3:4b (should still work)
   - Test larger models that genuinely NEED multi-GPU

---

## Expected Results

**After the fix**:

```
Single-GPU check succeeds: PredictServerFit([GPU 0], ...) returns true
Scheduler selects a single GPU
Model loads on GPU 1 only (preferred by the reverse-order logic)
nvidia-smi shows:
  GPU 0: ~3 MiB (minimal Xorg)
  GPU 1: ~10.2 GiB (full model)
```

**Performance improvement**:
- No cross-GPU communication overhead
- Faster inference
- Simpler memory management

---

## Next Steps

1. Add more detailed logging to confirm output layer counting (see the sketch below)
2. Implement the fix
3. Test and verify
4. Clean up debug logging before merging
5. Update documentation
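For step 1 above (and item 1 of the testing plan), the temporary logging could look roughly like the sketch below. It is a standalone example using Go's standard `log/slog`; the values are stand-ins for `f.KV().BlockCount()` and `estimate.Layers`, and the actual call sites in `llm/memory.go` would use whatever logger the package already has.

```go
package main

import (
    "log/slog"
    "os"
)

func main() {
    // Example values standing in for f.KV().BlockCount() and estimate.Layers;
    // in llm/memory.go these would come from the GGML metadata and the
    // EstimateGPULayers result.
    blockCount := uint64(48)
    layerCount := 48

    logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
    logger.Debug("layer count check",
        "block_count", blockCount,
        "required_layers", blockCount+1,
        "estimated_layers", layerCount,
        "passes", layerCount >= int(blockCount+1),
    )
}
```

Emitting the same key/value pairs right before the comparison at line 26 and inside the output-layer placement should make it immediately visible whether the count stops at 48 or reaches 49.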