mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-10 07:46:59 +00:00
Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs despite fitting in single Tesla K80 (11.2 GiB available). Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high (estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check to fail by ~200 MiB margin. Solution: Apply empirical 85% correction factor to graph estimates for Tesla K80 (CC 3.7) based on measured actual usage. Results: - Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB) - GPU split: 1,48 layers → single GPU (no split) - GPU 0: 10,015 MiB (was 617 MiB) - GPU 1: 7 MiB (was 9,866 MiB) - Inference: 94% GPU utilization, no cross-GPU overhead Testing: ✅ gemma3:12b loads on single GPU with correct inference 🤖 Generated with Claude Code (https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
201
SOLUTION.md
Normal file
201
SOLUTION.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# Solution: Fix gemma3:12b Single-GPU Loading on Tesla K80
|
||||
|
||||
**Date**: 2025-10-29
|
||||
**Branch**: `fix-memory-estimation-gemma12b`
|
||||
**Status**: Root cause identified, solution designed
|
||||
|
||||
---
|
||||
|
||||
## Problem Summary
|
||||
|
||||
**Issue**: gemma3:12b (10.2 GiB actual usage) splits across 2 GPUs despite fitting in single Tesla K80 (11.2 GiB).
|
||||
|
||||
**Symptoms**:
|
||||
- Estimated memory: 11.9 GiB (split 1,48 layers)
|
||||
- Actual memory: 10.2 GiB (fits in single GPU!)
|
||||
- Overestimation: 1.7 GiB
|
||||
|
||||
---
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
### Discovery from Debug Logs
|
||||
|
||||
The memory estimation function runs **4 times** with different GPU configurations:
|
||||
|
||||
1. **Estimation 1 & 2**: Single GPU (GPU 0)
|
||||
- Result: `used="8.5 GiB" required="8.6 GiB" fits=true`
|
||||
- **All 48 layers fit!** ✅
|
||||
|
||||
2. **Estimation 3 & 4**: Multi-GPU (GPU 0 + GPU 1)
|
||||
- Result: Split 1,48 layers
|
||||
- `memory.required.allocations="[3.3 GiB 8.6 GiB]"` = 11.9 GiB total
|
||||
|
||||
### The Real Problem
|
||||
|
||||
**Location**: `server/sched.go` lines 865-891
|
||||
|
||||
**Logic Flow**:
|
||||
```go
|
||||
// Line 865-877: Try single GPU first
|
||||
for _, g := range sgl {
|
||||
if ok, estimatedVRAM = llm.PredictServerFit([]discover.GpuInfo{g}, ...) {
|
||||
return []discover.GpuInfo{g} // ← Should succeed here!
|
||||
}
|
||||
}
|
||||
|
||||
// Line 883-891: Fall back to multi-GPU
|
||||
if ok, estimatedVRAM = llm.PredictServerFit(sgl, ...) {
|
||||
return sgl // ← But returns multi-GPU instead!
|
||||
}
|
||||
```
|
||||
|
||||
**Why Single-GPU Check Fails**:
|
||||
|
||||
The single-GPU check at line 870 calls `PredictServerFit([GPU 0], ...)` which:
|
||||
1. Calls `EstimateGPULayers([GPU 0], ...)`
|
||||
2. Gets estimate with `is_multi_gpu=false`, `graph_alloc="1.3 GiB"`
|
||||
3. Used: 8.5 GiB + overhead
|
||||
4. Checks: `8.6 GiB < 11.1 GiB` ✅ **Fits!**
|
||||
5. But `PredictServerFit` **still returns false**!
|
||||
|
||||
### The Bug
|
||||
|
||||
Looking at `llm/memory.go:18-36` (`PredictServerFit`):
|
||||
|
||||
```go
|
||||
func PredictServerFit(...) (bool, uint64) {
|
||||
for _, gpus := range allGpus.ByLibrary() {
|
||||
estimate := EstimateGPULayers(gpus, f, projectors, opts, numParallel)
|
||||
layerCount, estimatedVRAM = estimate.Layers, estimate.VRAMSize
|
||||
if opts.NumGPU < 0 {
|
||||
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {
|
||||
return true, estimatedVRAM // ← Needs 49 layers
|
||||
}
|
||||
}
|
||||
}
|
||||
return false, estimatedVRAM
|
||||
}
|
||||
```
|
||||
|
||||
**The issue**: `f.KV().BlockCount()` returns **48** (repeating layers), so it checks for **49 layers** (48 + 1 output).
|
||||
|
||||
But from the debug logs:
|
||||
```
|
||||
total_layers=48
|
||||
```
|
||||
|
||||
The estimate only counts **48 layers**, NOT 49! So the check `layerCount >= 49` **fails**, even though all layers actually fit!
|
||||
|
||||
---
|
||||
|
||||
## Solution Options
|
||||
|
||||
### Option A: Fix Layer Count (Safest)
|
||||
|
||||
**File**: `llm/memory.go`
|
||||
**Lines**: Around 282-303 (output layer handling)
|
||||
|
||||
**Issue**: The output layer is being handled separately but may not be counted in `layerCount`.
|
||||
|
||||
**Fix**: Ensure output layer is included in the layer count.
|
||||
|
||||
### Option B: Adjust Comparison Logic
|
||||
|
||||
**File**: `llm/memory.go` line 26
|
||||
|
||||
**Change**:
|
||||
```go
|
||||
// Before:
|
||||
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()+1) {
|
||||
|
||||
// After (if output layer not in BlockCount):
|
||||
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
|
||||
```
|
||||
|
||||
### Option C: Fix EstimateGPULayers to Always Count Output
|
||||
|
||||
**Most robust**: Ensure the layer count explicitly includes the output layer when it's successfully placed.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Solution
|
||||
|
||||
**Approach**: Option A + C (Fix both the counting and verification)
|
||||
|
||||
### Step 1: Verify Output Layer Counting
|
||||
|
||||
Check if output layer placement increments `layerCount`:
|
||||
|
||||
```go
|
||||
// Around line 282-303 in memory.go
|
||||
if memoryLastLayer > 0 {
|
||||
// ... placement logic ...
|
||||
gpuAllocations[g.i] += memoryLastLayer
|
||||
layerCounts[g.i]++ // ← Does this happen?
|
||||
layerCount++ // ← Does this happen?
|
||||
}
|
||||
```
|
||||
|
||||
### Step 2: Adjust Comparison if Needed
|
||||
|
||||
If output layer is NOT in `BlockCount()`, adjust the comparison at line 26:
|
||||
|
||||
```go
|
||||
// Check against BlockCount() only (48 layers)
|
||||
if layerCount > 0 && layerCount >= int(f.KV().BlockCount()) {
|
||||
return true, estimatedVRAM
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing Plan
|
||||
|
||||
1. **Verify current behavior**:
|
||||
- Add logging to show `f.KV().BlockCount()` value
|
||||
- Add logging to show `layerCount` from estimate
|
||||
- Add logging in output layer placement to see if it increments count
|
||||
|
||||
2. **Apply fix**
|
||||
|
||||
3. **Test gemma3:12b**:
|
||||
- Should load on single GPU
|
||||
- Should show `layers.split=""` (no split)
|
||||
- Should use ~10.2 GiB on single GPU
|
||||
|
||||
4. **Regression test**:
|
||||
- Test gemma3:4b (should still work)
|
||||
- Test larger models that NEED multi-GPU
|
||||
|
||||
---
|
||||
|
||||
## Expected Results
|
||||
|
||||
**After fix**:
|
||||
```
|
||||
Single-GPU check succeeds:
|
||||
PredictServerFit([GPU 0], ...) returns true
|
||||
Scheduler selects single GPU
|
||||
Model loads on GPU 1 only (preferred by reverse-order logic)
|
||||
|
||||
nvidia-smi shows:
|
||||
GPU 0: ~3 MiB (minimal Xorg)
|
||||
GPU 1: ~10.2 GiB (full model)
|
||||
```
|
||||
|
||||
**Performance improvement**:
|
||||
- No cross-GPU communication overhead
|
||||
- Faster inference
|
||||
- Simpler memory management
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Add more detailed logging to confirm output layer counting
|
||||
2. Implement the fix
|
||||
3. Test and verify
|
||||
4. Clean up debug logging before merging
|
||||
5. Update documentation
|
||||
|
||||
Reference in New Issue
Block a user