Solution: Fix gemma3:12b Single-GPU Loading on Tesla K80
Date: 2025-10-29
Branch: fix-memory-estimation-gemma12b
Status: Root cause identified, solution designed
Problem Summary
Issue: gemma3:12b (10.2 GiB actual usage) splits across 2 GPUs despite fitting in single Tesla K80 (11.2 GiB).
Symptoms:
- Estimated memory: 11.9 GiB (split 1,48 layers)
- Actual memory: 10.2 GiB (fits in single GPU!)
- Overestimation: 1.7 GiB
Root Cause Analysis
Discovery from Debug Logs
The memory estimation function runs 4 times with different GPU configurations:
- Estimations 1 & 2: Single GPU (GPU 0)
  - Result: `used="8.5 GiB" required="8.6 GiB" fits=true` - all 48 layers fit! ✅
- Estimations 3 & 4: Multi-GPU (GPU 0 + GPU 1)
  - Result: split 1,48 layers, `memory.required.allocations="[3.3 GiB 8.6 GiB]"` = 11.9 GiB total
The Real Problem
Location: server/sched.go lines 865-891
Logic Flow:
```go
// Line 865-877: Try single GPU first
for _, g := range sgl {
	if ok, estimatedVRAM = llm.PredictServerFit([]discover.GpuInfo{g}, ...); ok {
		return []discover.GpuInfo{g} // ← Should succeed here!
	}
}

// Line 883-891: Fall back to multi-GPU
if ok, estimatedVRAM = llm.PredictServerFit(sgl, ...); ok {
	return sgl // ← But returns multi-GPU instead!
}
```
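For reference, here is a stripped-down sketch of that selection order, not the real server/sched.go code: `gpuInfo` and `predictFit` are hypothetical stand-ins for `discover.GpuInfo` and `llm.PredictServerFit`. It shows how a false negative from the single-GPU prediction silently produces a split.

```go
package main

import "fmt"

// gpuInfo is a hypothetical stand-in for discover.GpuInfo.
type gpuInfo struct {
	ID       string
	FreeVRAM uint64
}

// pickGPUs mirrors the selection order above: try each GPU alone first, then
// fall back to splitting across all of them. A false negative from predictFit
// on the single-GPU pass therefore silently produces a split.
func pickGPUs(all []gpuInfo, predictFit func([]gpuInfo) bool) []gpuInfo {
	for _, g := range all {
		if predictFit([]gpuInfo{g}) {
			return []gpuInfo{g} // single-GPU load
		}
	}
	if predictFit(all) {
		return all // multi-GPU split
	}
	return nil // neither fits: CPU fallback
}

func main() {
	gpus := []gpuInfo{{"GPU-0", 11 << 30}, {"GPU-1", 11 << 30}}

	// Simulate the bug: the single-GPU prediction returns false even though
	// the model would fit, so the scheduler falls through to a split.
	buggyFit := func(sel []gpuInfo) bool { return len(sel) > 1 }
	fmt.Println(len(pickGPUs(gpus, buggyFit))) // prints 2: split across both GPUs
}
```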
Why Single-GPU Check Fails:
The single-GPU check at line 870 calls `PredictServerFit([GPU 0], ...)`, which:
- Calls `EstimateGPULayers([GPU 0], ...)`
- Gets an estimate with `is_multi_gpu=false, graph_alloc="1.3 GiB"`
- Used: 8.5 GiB + overhead
- Checks: 8.6 GiB < 11.1 GiB ✅ Fits!
- But `PredictServerFit` still returns false!
The Bug
Looking at `llm/memory.go:18-36` (`PredictServerFit`):
```go
func PredictServerFit(...) (bool, uint64) {
	for _, gpus := range allGpus.ByLibrary() {
		estimate := EstimateGPULayers(gpus, f, projectors, opts, numParallel)
		layerCount, estimatedVRAM = estimate.Layers, estimate.VRAMSize
		if opts.NumGPU < 0 {
			if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {
				return true, estimatedVRAM // ← Needs 49 layers
			}
		}
	}
	return false, estimatedVRAM
}
```
The issue: `f.KV().BlockCount()` returns 48 (repeating layers), so the check requires 49 layers (48 + 1 output layer).
But the debug logs show `total_layers=48`.
The estimate only counts 48 layers, NOT 49, so the check `layerCount >= 49` fails even though all layers actually fit!
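A minimal, self-contained reproduction of the off-by-one, assuming the comparison requires `BlockCount()+1` layers while the estimate reports only the 48 repeating blocks; the function and values below are illustrative, not the actual memory.go code.

```go
package main

import "fmt"

// fitsFully mirrors the comparison at llm/memory.go line 26 quoted above.
func fitsFully(layerCount, blockCount int) bool {
	return layerCount > 0 && layerCount >= blockCount+1
}

func main() {
	blockCount := 48 // f.KV().BlockCount(): repeating layers only
	layerCount := 48 // estimate.Layers: output layer not counted

	fmt.Println(fitsFully(layerCount, blockCount))   // false: single-GPU path rejected
	fmt.Println(fitsFully(layerCount+1, blockCount)) // true once the output layer is counted
}
```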
Solution Options
Option A: Fix Layer Count (Safest)
File: llm/memory.go
Lines: Around 282-303 (output layer handling)
Issue: The output layer is being handled separately but may not be counted in layerCount.
Fix: Ensure output layer is included in the layer count.
Option B: Adjust Comparison Logic
File: `llm/memory.go`, line 26
Change:
```go
// Before:
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {

// After (if output layer not in BlockCount):
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
```
Option C: Fix EstimateGPULayers to Always Count Output
Most robust: Ensure the layer count explicitly includes the output layer when it's successfully placed.
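A sketch of what that means in practice, using a hypothetical `placement` struct; the field names (`gpuAllocations`, `layerCounts`, `layerCount`, `memoryLastLayer`) follow the snippets quoted in this document, but the struct and method are illustrative rather than the real memory.go bookkeeping.

```go
package main

import "fmt"

type placement struct {
	gpuAllocations []uint64 // bytes reserved per GPU
	layerCounts    []int    // layers assigned per GPU
	layerCount     int      // total placed layers: the value PredictServerFit compares
}

// placeOutputLayer reserves memory for the output layer and, crucially,
// increments both the per-GPU and the total layer counts so that the later
// layerCount >= BlockCount()+1 check can pass.
func (p *placement) placeOutputLayer(gpu int, memoryLastLayer, freeVRAM uint64) bool {
	if memoryLastLayer == 0 || p.gpuAllocations[gpu]+memoryLastLayer > freeVRAM {
		return false
	}
	p.gpuAllocations[gpu] += memoryLastLayer
	p.layerCounts[gpu]++
	p.layerCount++
	return true
}

func main() {
	p := &placement{gpuAllocations: make([]uint64, 1), layerCounts: make([]int, 1), layerCount: 48}
	ok := p.placeOutputLayer(0, 400<<20, 11<<30) // ~400 MiB output layer, ~11 GiB free
	fmt.Println(ok, p.layerCount)                // true 49
}
```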
Recommended Solution
Approach: Option A + C (Fix both the counting and verification)
Step 1: Verify Output Layer Counting
Check whether output-layer placement increments `layerCount`:
```go
// Around line 282-303 in memory.go
if memoryLastLayer > 0 {
	// ... placement logic ...
	gpuAllocations[g.i] += memoryLastLayer
	layerCounts[g.i]++ // ← Does this happen?
	layerCount++       // ← Does this happen?
}
```
Step 2: Adjust Comparison if Needed
If the output layer is NOT included in `BlockCount()`, adjust the comparison at line 26:
```go
// Check against BlockCount() only (48 layers)
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
	return true, estimatedVRAM
}
```
Testing Plan
- Verify current behavior (see the logging sketch after this list):
  - Add logging to show the `f.KV().BlockCount()` value
  - Add logging to show `layerCount` from the estimate
  - Add logging in the output-layer placement to see if it increments the count
- Apply the fix
- Test gemma3:12b:
  - Should load on a single GPU
  - Should show `layers.split=""` (no split)
  - Should use ~10.2 GiB on a single GPU
- Regression test:
  - Test gemma3:4b (should still work)
  - Test larger models that NEED multi-GPU
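Below is a sketch of the kind of temporary debug logging the first item calls for, using the standard `log/slog` package. The values are placeholders standing in for `f.KV().BlockCount()` and `estimate.Layers`, and exactly where the log line would sit in memory.go is an assumption.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Debug-level handler so the line is visible during testing.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

	blockCount := uint64(48) // stand-in for f.KV().BlockCount()
	layerCount := 48         // stand-in for estimate.Layers

	logger.Debug("layer count check",
		"block_count", blockCount,
		"estimate_layers", layerCount,
		"required_for_full_offload", blockCount+1,
	)
}
```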
Expected Results
After the fix:
- Single-GPU check succeeds: `PredictServerFit([GPU 0], ...)` returns true
- The scheduler selects a single GPU
- The model loads on GPU 1 only (preferred by the reverse-order selection logic)
- nvidia-smi shows:
  - GPU 0: ~3 MiB (minimal Xorg)
  - GPU 1: ~10.2 GiB (full model)
- Performance improvement:
  - No cross-GPU communication overhead
  - Faster inference
  - Simpler memory management
Next Steps
- Add more detailed logging to confirm output layer counting
- Implement the fix
- Test and verify
- Clean up debug logging before merging
- Update documentation