# Memory Estimation Trace Analysis for gemma3:12b

**Date**: 2025-10-29

**Goal**: Understand why estimated memory (11.9 GiB) exceeds actual usage (10.48 GiB) by 1.42 GiB

## Input Data from Logs

### System Configuration

- GPUs: 2x Tesla K80 (11.2 GiB each)
- Model: gemma3:12b
- Layers: 49 total (48 repeating + 1 output)
- Context: 4096 tokens
- Batch: 512 tokens
- Parallel: 1

### Log Output - Estimated Memory

```
memory.available="[11.1 GiB 11.1 GiB]"
memory.required.full="11.9 GiB"
memory.required.partial="11.9 GiB"
memory.required.kv="736.0 MiB"
memory.required.allocations="[3.3 GiB 8.6 GiB]"
memory.weights.total="6.8 GiB"
memory.weights.repeating="6.0 GiB"
memory.weights.nonrepeating="787.5 MiB"
memory.graph.full="1.3 GiB"
memory.graph.partial="1.3 GiB"
projector.weights="795.9 MiB"
projector.graph="1.0 GiB"
layers.split="1,48"
```

### Log Output - Actual Memory Usage

```
Model weights loaded:
  CPU buffer: 787.5 MiB
  CUDA0 buffer: 136.7 MiB
  CUDA1 buffer: 7.4 GiB
  Total: 8.324 GiB

Compute graphs allocated:
  CUDA0: 85.8 MiB
  CUDA1: 1.1 GiB
  CPU: 7.5 MiB
  Total: 1.193 GiB

nvidia-smi readings:
  GPU0: 617 MiB (0.602 GiB)
  GPU1: 9866 MiB (9.635 GiB)
  Total: 10.237 GiB
```

## Component-by-Component Analysis

### 1. Model Weights

- **Estimated**: 6.8 GiB (memory.weights.total)
- **Actual**: 8.324 GiB (787.5 MiB CPU + 136.7 MiB GPU0 + 7.4 GiB GPU1)
- **Delta**: +1.524 GiB (actual > estimate)
- **Status**: ⚠️ UNDERESTIMATED

**Note**: This is odd - weights are UNDERESTIMATED, not overestimated!

### 2. KV Cache

- **Estimated**: 736 MiB
- **Actual**: Included in nvidia-smi totals, hard to isolate
- **Status**: ❓ UNKNOWN

### 3. Compute Graphs

- **Estimated**: 1.3 GiB (per log: memory.graph.full)
- **Actual**: 1.193 GiB (85.8 MiB GPU0 + 1.1 GiB GPU1 + 7.5 MiB CPU)
- **Delta**: -0.107 GiB (slight overestimate)
- **Status**: ✅ CLOSE

### 4. Projector Components

- **Estimated**: 795.9 MiB weights + 1.0 GiB graph ≈ 1.78 GiB
- **Actual**: Unclear from logs (likely included in the weights/graph totals)
- **Status**: ❓ POSSIBLY DOUBLE-COUNTED

### 5. GPU Allocations

```
Estimated per GPU:
  GPU0: 3.3 GiB
  GPU1: 8.6 GiB
  Total: 11.9 GiB

Actual per GPU (nvidia-smi):
  GPU0: 0.602 GiB
  GPU1: 9.635 GiB
  Total: 10.237 GiB

Delta:
  GPU0: -2.698 GiB (MASSIVE overestimate)
  GPU1: +1.035 GiB (underestimate)
  Total: -1.663 GiB (net overestimate)
```
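
For reference, here is a throwaway sketch (plain Go, not part of the codebase) that re-derives the actual totals and the per-GPU deltas from the raw numbers quoted above; small differences against the logged totals are rounding:

```go
package main

import "fmt"

func main() {
	const MiB = 1.0 / 1024.0 // one MiB expressed in GiB

	// Actual buffers from the "Actual Memory Usage" excerpt.
	weightsActual := 787.5*MiB + 136.7*MiB + 7.4 // CPU + CUDA0 + CUDA1
	graphActual := 85.8*MiB + 1.1 + 7.5*MiB      // CUDA0 + CUDA1 + CPU
	gpu0, gpu1 := 617*MiB, 9866*MiB              // nvidia-smi per GPU

	fmt.Printf("weights: actual %.2f GiB vs estimated 6.8 GiB\n", weightsActual)
	fmt.Printf("graph:   actual %.2f GiB vs estimated 1.3 GiB\n", graphActual)
	fmt.Printf("GPU0 delta: %+.3f GiB\n", gpu0-3.3) // estimate was 3.3 GiB
	fmt.Printf("GPU1 delta: %+.3f GiB\n", gpu1-8.6) // estimate was 8.6 GiB
	fmt.Printf("net delta:  %+.3f GiB\n", gpu0+gpu1-11.9)
}
```
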
## Key Findings

### Finding 1: GPU0 Massive Overestimation

GPU0 was estimated at **3.3 GiB** but actually uses only **0.602 GiB**.

**Possible causes:**

1. Full graph allocation assigned to GPU0 during estimation
2. Layer weights estimated for GPU0 but actually loaded elsewhere
3. Conservative buffers that aren't actually needed

### Finding 2: Weights Accounting Mismatch

- Log says `memory.weights.total="6.8 GiB"`
- But actual weight buffers sum to **8.324 GiB**
- **Gap: 1.524 GiB underestimate**

This suggests the `memory.weights.total` in logs **excludes something** (KV cache? buffers?).
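
One hedged observation (possibly a coincidence): the estimate log reports `projector.weights="795.9 MiB"` outside of `memory.weights.total`, and the actual buffer sum includes the 787.5 MiB CPU buffer; together those two come to roughly 1.55 GiB, close to the 1.524 GiB gap. A quick check of that arithmetic:

```go
package main

import "fmt"

func main() {
	// All figures are taken from the log excerpts above, converted to GiB.
	weightsEstimated := 6.8   // memory.weights.total
	weightsActual := 8.324    // CPU + CUDA0 + CUDA1 weight buffers
	projector := 795.9 / 1024 // projector.weights, reported separately
	cpuBuffer := 787.5 / 1024 // weight buffer that stayed on the CPU

	fmt.Printf("gap:                 %.3f GiB\n", weightsActual-weightsEstimated)
	fmt.Printf("projector + CPU buf: %.3f GiB\n", projector+cpuBuffer)
}
```
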
### Finding 3: Layer Split Decision

With split `1,48`:

- GPU0: 1 layer only (why?)
- GPU1: 48 layers

If GPU0 can only hold 1 layer, why estimate 3.3 GiB for it?

## Hypothesis: The Root Cause

**Theory**: The layer placement algorithm is placing 1 layer on GPU0 unnecessarily because:

1. GPU0 gets allocated the **full graph overhead** (1.3 GiB) during estimation
2. This leaves ~9.8 GiB "available" on GPU0
3. The algorithm tries to place layers, but only 1 fits after accounting for real overheads
4. This triggers multi-GPU mode
5. But if we **didn't place ANY layers on GPU0**, all 49 layers could fit on GPU1

**Test hypothesis**: What happens if we disable GPU0 entirely (e.g. by exposing only GPU1 via `CUDA_VISIBLE_DEVICES=1`)?

## Next Steps

1. **Add debug logging** to track exact layer-by-layer placement decisions
2. **Calculate theoretical single-GPU memory** (see the sketch after this list):
   - All weights on GPU1: 8.324 GiB
   - Full graph on GPU1: 1.3 GiB
   - KV cache: 0.736 GiB
   - Total: ~10.36 GiB
   - **Result**: Fits in 11.2 GiB! ✅
3. **Find out why the algorithm splits**:
   - Is it the `overhead` value?
   - Is it the layer placement logic at lines 243-277?
   - Is it the graph allocation at lines 230-241?
4. **Possible fixes**:
   - Option A: Be more conservative about GPU0 free space
   - Option B: Prefer single-GPU placement until a split is proven necessary
   - Option C: Adjust overhead calculations
   - Option D: Fix the layer placement algorithm to try single-GPU first
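
Below is a minimal sketch of the step-2 arithmetic and the shape Option B/D could take, using the trace's numbers as placeholders. This is not the actual ollama scheduler code: `fitsOnOneGPU`, the `gpu` struct, and the 0.5 GiB reserve are all hypothetical; a real change would live in the layer-placement logic referenced below.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for the scheduler's per-GPU bookkeeping.
type gpu struct {
	name       string
	freeGiB    float64 // free VRAM reported for the device
	reserveGiB float64 // driver/runtime reserve (MinimumMemory/GpuOverhead analogue)
}

// fitsOnOneGPU reports whether the whole model (weights + full compute graph +
// KV cache) fits on a single device - the check Option B/D would run before
// falling back to a multi-GPU split.
func fitsOnOneGPU(g gpu, weightsGiB, graphGiB, kvGiB float64) bool {
	required := weightsGiB + graphGiB + kvGiB
	return required <= g.freeGiB-g.reserveGiB
}

func main() {
	// Numbers from this trace; the 0.5 GiB reserve is a made-up placeholder.
	g1 := gpu{name: "Tesla K80 #1", freeGiB: 11.1, reserveGiB: 0.5}
	weights, graph, kv := 8.324, 1.3, 0.736

	if fitsOnOneGPU(g1, weights, graph, kv) {
		fmt.Println("single-GPU placement: all 49 layers on", g1.name)
	} else {
		fmt.Println("fall back to splitting layers across GPUs")
	}
}
```

With these inputs the single-GPU branch is taken (10.36 GiB required vs 10.6 GiB usable), which matches the step-2 result above.
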
## Code Sections to Investigate

1. **Line 106**: `overhead := envconfig.GpuOverhead()` - What is this value?
2. **Lines 193-213**: GPU filtering logic - Which GPUs are deemed "viable"?
3. **Lines 230-241**: Graph allocation per GPU - Is GPU0 getting full 1.3 GiB?
4. **Lines 243-277**: Layer placement loop - Why does it place layers on GPU0?
5. **Lines 282-303**: Output layer placement - Does this trigger GPU0 usage?

## Questions to Answer

1. What is `envconfig.GpuOverhead()` returning?
2. What is `gpus[i].MinimumMemory` for each GPU?
3. During layer placement, what are the `used` values for each GPU?
4. What is `gpusWithSpace` after filtering?
5. Is the 190 MiB optimization actually being applied?
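
The fastest way to answer these is the debug logging from step 1 of Next Steps. A hedged sketch of the kind of helper that could be called inside the placement loop - `overhead`, `gpusWithSpace`, and `MinimumMemory` are the names quoted in the questions above; everything else (the `gpuState` struct, field names, dummy values) is illustrative rather than the real ollama types:

```go
package main

import "log/slog"

// gpuState mirrors the per-GPU values asked about above; the real struct used
// by the scheduler will have different and additional fields.
type gpuState struct {
	ID            int
	MinimumMemory uint64 // bytes reserved for the driver/runtime
	FreeMemory    uint64 // bytes reported free on the device
	Used          uint64 // bytes already assigned during placement
}

// logPlacementState is the kind of call step 1 would drop into the
// layer placement loop, once per layer.
func logPlacementState(layer int, overhead uint64, gpus []gpuState, gpusWithSpace int) {
	for _, g := range gpus {
		slog.Debug("placement state",
			"layer", layer,
			"gpu", g.ID,
			"overhead", overhead,
			"minimum_memory", g.MinimumMemory,
			"used", g.Used,
			"free", g.FreeMemory,
		)
	}
	slog.Debug("gpus with space", "layer", layer, "count", gpusWithSpace)
}

func main() {
	slog.SetLogLoggerLevel(slog.LevelDebug)

	// Dummy values only; real values come from the scheduler at runtime.
	gpus := []gpuState{
		{ID: 0, MinimumMemory: 512 << 20, FreeMemory: 11900 << 20},
		{ID: 1, MinimumMemory: 512 << 20, FreeMemory: 11900 << 20},
	}
	logPlacementState(0, 0, gpus, len(gpus))
}
```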