ollama37/memory_trace_analysis.md
Commit 6d87524e22 (Shang Chieh Tseng): Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on a single GPU and inference runs correctly

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Memory Estimation Trace Analysis for gemma3:12b
**Date**: 2025-10-29
**Goal**: Understand why estimated memory (11.9 GiB) exceeds actual usage (10.24 GiB per nvidia-smi) by ~1.66 GiB
## Input Data from Logs
### System Configuration
- GPUs: 2x Tesla K80 (11.2 GiB each)
- Model: gemma3:12b
- Layers: 49 total (48 repeating + 1 output)
- Context: 4096 tokens
- Batch: 512 tokens
- Parallel: 1
### Log Output - Estimated Memory
```
memory.available="[11.1 GiB 11.1 GiB]"
memory.required.full="11.9 GiB"
memory.required.partial="11.9 GiB"
memory.required.kv="736.0 MiB"
memory.required.allocations="[3.3 GiB 8.6 GiB]"
memory.weights.total="6.8 GiB"
memory.weights.repeating="6.0 GiB"
memory.weights.nonrepeating="787.5 MiB"
memory.graph.full="1.3 GiB"
memory.graph.partial="1.3 GiB"
projector.weights="795.9 MiB"
projector.graph="1.0 GiB"
layers.split="1,48"
```
### Log Output - Actual Memory Usage
```
Model weights loaded:
  CPU buffer:   787.5 MiB
  CUDA0 buffer: 136.7 MiB
  CUDA1 buffer: 7.4 GiB
  Total:        8.324 GiB

Compute graphs allocated:
  CUDA0: 85.8 MiB
  CUDA1: 1.1 GiB
  CPU:   7.5 MiB
  Total: 1.193 GiB

nvidia-smi readings:
  GPU0:  617 MiB (0.602 GiB)
  GPU1:  9866 MiB (9.635 GiB)
  Total: 10.237 GiB
```
## Component-by-Component Analysis
### 1. Model Weights
- **Estimated**: 6.8 GiB (memory.weights.total)
- **Actual**: 8.324 GiB (787.5 MiB CPU + 136.7 MiB GPU0 + 7.4 GiB GPU1)
- **Delta**: +1.524 GiB (actual > estimate)
- **Status**: ⚠️ UNDERESTIMATED
**Note**: This is odd - weights are UNDERESTIMATED, not overestimated!
### 2. KV Cache
- **Estimated**: 736 MiB
- **Actual**: Included in nvidia-smi totals, hard to isolate
- **Status**: ❓ UNKNOWN
### 3. Compute Graphs
- **Estimated**: 1.3 GiB (per log: memory.graph.full)
- **Actual**: 1.193 GiB (85.8 MiB GPU0 + 1.1 GiB GPU1)
- **Delta**: -0.107 GiB (slight overestimate)
- **Status**: ✅ CLOSE
### 4. Projector Components
- **Estimated**: 795.9 MiB weights + 1.0 GiB graph ≈ 1.78 GiB
- **Actual**: Unclear from logs (likely included in weights/graph totals)
- **Status**: ❓ POSSIBLY DOUBLE-COUNTED
### 5. GPU Allocations
```
Estimated per GPU:
  GPU0:  3.3 GiB
  GPU1:  8.6 GiB
  Total: 11.9 GiB

Actual per GPU (nvidia-smi):
  GPU0:  0.602 GiB
  GPU1:  9.635 GiB
  Total: 10.237 GiB

Delta:
  GPU0:  -2.698 GiB (MASSIVE overestimate)
  GPU1:  +1.035 GiB (underestimate)
  Total: -1.663 GiB (net overestimate)
```
## Key Findings
### Finding 1: GPU0 Massive Overestimation
GPU0 estimated at **3.3 GiB** but actually uses only **0.602 GiB**.
**Possible causes:**
1. Full graph allocation assigned to GPU0 during estimation
2. Layer weights estimated for GPU0 but actually loaded elsewhere
3. Conservative buffers that aren't actually needed
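As a rough plausibility check on cause 1, the logged numbers alone can be combined to land near the 3.3 GiB figure. The breakdown below is a guess assembled from those values, not something traced through ollama's estimation code:

```go
package main

import "fmt"

// Speculative breakdown of the 3.3 GiB GPU0 estimate, using only numbers
// from the logs above: full compute graph + projector graph + projector
// weights + the single layer assigned to GPU0.
func main() {
	graphFull := 1.3                 // GiB, memory.graph.full
	projectorGraph := 1.0            // GiB, projector.graph
	projectorWeights := 795.9 / 1024 // GiB, projector.weights
	oneLayer := 6.0 / 48             // GiB, repeating weights spread over 48 layers

	total := graphFull + projectorGraph + projectorWeights + oneLayer
	fmt.Printf("GPU0 if charged graph + projector + 1 layer: %.2f GiB\n", total)
	// Prints ≈3.20 GiB, within ~0.1 GiB of the logged 3.3 GiB allocation.
}
```

The remaining ~0.1 GiB could plausibly be minimum-memory reserve or overhead, which is exactly what the questions at the end of this document aim to pin down.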
### Finding 2: Weights Accounting Mismatch
- Log says `memory.weights.total="6.8 GiB"`
- But actual weight buffers sum to **8.324 GiB**
- **Gap: 1.524 GiB underestimate**
This suggests the `memory.weights.total` in logs **excludes something** (KV cache? buffers?).
### Finding 3: Layer Split Decision
With split `1,48`:
- GPU0: 1 layer only (why?)
- GPU1: 48 layers
If GPU0 can only hold 1 layer, why estimate 3.3 GiB for it?
## Hypothesis: The Root Cause
**Theory**: The layer placement algorithm is placing 1 layer on GPU0 unnecessarily due to:
1. GPU0 gets allocated **full graph overhead** (1.3 GiB) during estimation
2. This leaves ~9.8 GiB "available" on GPU0
3. Algorithm tries to place layers, but only 1 fits after accounting for real overheads
4. This triggers multi-GPU mode
5. But if we **didn't place ANY layers on GPU0**, all 49 layers could fit on GPU1
**Test hypothesis**: What if we disable GPU0 entirely?
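Before changing any code, the hypothesis can be sanity-checked with the measured values above (the same calculation as step 2 below). This is illustrative arithmetic only, not ollama's estimation logic:

```go
package main

import "fmt"

// Back-of-the-envelope single-GPU fit check using the measured values from
// the logs above. Illustrative arithmetic only, not ollama's estimation code.
func main() {
	weights := 8.324   // GiB, sum of the actual weight buffers
	graph := 1.3       // GiB, full compute graph estimate
	kv := 736.0 / 1024 // GiB, estimated KV cache (736 MiB)
	available := 11.2  // GiB, usable memory on one Tesla K80

	total := weights + graph + kv
	fmt.Printf("single-GPU total: %.2f GiB, fits on one K80: %v\n",
		total, total < available)
	// Prints: single-GPU total: 10.34 GiB, fits on one K80: true
}
```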
## Next Steps
1. **Add debug logging** to track exact layer-by-layer placement decisions
2. **Calculate theoretical single-GPU memory**:
- All weights on GPU1: 8.324 GiB
- Full graph on GPU1: 1.3 GiB
- KV cache: 736 MiB (≈0.72 GiB)
- Total: ≈10.34 GiB
- **Result**: Fits in 11.2 GiB! ✅
3. **Find why algorithm splits**:
- Is it the `overhead` value?
- Is it the layer placement logic at lines 243-277?
- Is it the graph allocation at lines 230-241?
4. **Possible fixes** (see the sketch after this list):
- Option A: Be more conservative about GPU0 free space
- Option B: Prefer single-GPU until proven necessary
- Option C: Adjust overhead calculations
- Option D: Fix the layer placement algorithm to try single-GPU first
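For reference, the commit message at the top of this page describes the fix that was eventually applied: an empirical 85% correction factor on graph estimates for Tesla K80 (CC 3.7). A minimal sketch of that idea follows; `applyGraphCorrection` and the compute-capability arguments are hypothetical names for illustration, not ollama's actual API, and 0.85 is simply the measured actual/estimated ratio (≈1.1 GiB / 1.3 GiB):

```go
package main

import "fmt"

// k80GraphCorrection is the measured actual/estimated graph-memory ratio for
// the Tesla K80 in this trace (≈1.1 GiB actual vs 1.3 GiB estimated).
const k80GraphCorrection = 0.85

// applyGraphCorrection scales a graph-memory estimate (in bytes) down for
// compute capability 3.7 (Tesla K80); other GPUs keep the original estimate.
// Hypothetical helper for illustration, not ollama's actual API.
func applyGraphCorrection(graphEstimate uint64, ccMajor, ccMinor int) uint64 {
	if ccMajor == 3 && ccMinor == 7 {
		return uint64(float64(graphEstimate) * k80GraphCorrection)
	}
	return graphEstimate
}

func main() {
	const mib = uint64(1024 * 1024)
	estimated := 1331 * mib // ≈1.3 GiB graph estimate from the logs
	corrected := applyGraphCorrection(estimated, 3, 7)
	fmt.Printf("graph estimate: %d MiB -> %d MiB\n", estimated/mib, corrected/mib)
	// Prints: graph estimate: 1331 MiB -> 1131 MiB (≈1.1 GiB)
}
```

Gating the correction on CC 3.7 keeps the conservative estimates intact for GPUs where the ratio has not been measured.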
## Code Sections to Investigate
1. **Line 106**: `overhead := envconfig.GpuOverhead()` - What is this value?
2. **Lines 193-213**: GPU filtering logic - Which GPUs are deemed "viable"?
3. **Lines 230-241**: Graph allocation per GPU - Is GPU0 getting full 1.3 GiB?
4. **Lines 243-277**: Layer placement loop - Why does it place layers on GPU0?
5. **Lines 282-303**: Output layer placement - Does this trigger GPU0 usage?
## Questions to Answer
1. What is `envconfig.GpuOverhead()` returning?
2. What is `gpus[i].MinimumMemory` for each GPU?
3. During layer placement, what are the `used` values for each GPU?
4. What is `gpusWithSpace` after filtering?
5. Is the 190 MiB optimization actually being applied?
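Most of these questions could be answered with a few temporary slog lines around the placement loop. The sketch below only shows the shape of such instrumentation with placeholder values; `gpuState` merely mirrors the fields named above and is not ollama's actual GPU info type, and the log keys are illustrative:

```go
package main

import "log/slog"

// gpuState mirrors only the fields named in the questions above; it is not
// ollama's actual GPU info type, and the values below are placeholders.
type gpuState struct {
	ID            string
	FreeMemory    uint64 // bytes reported free by the driver
	MinimumMemory uint64 // reserved floor subtracted during estimation
	Used          uint64 // bytes assigned to this GPU so far during placement
}

func main() {
	// Stand-in for envconfig.GpuOverhead() / OLLAMA_GPU_OVERHEAD (question 1).
	overhead := uint64(0)

	gpus := []gpuState{
		{ID: "GPU-0", FreeMemory: 11_918_000_000, MinimumMemory: 450_000_000},
		{ID: "GPU-1", FreeMemory: 11_918_000_000, MinimumMemory: 450_000_000},
	}

	// Emit one line per GPU per placement step; here just the initial state.
	slog.Info("placement inputs", "overhead", overhead)
	for _, g := range gpus {
		slog.Info("gpu state",
			"gpu", g.ID,
			"free", g.FreeMemory,
			"minimum", g.MinimumMemory,
			"used", g.Used,
			"headroom", g.FreeMemory-g.MinimumMemory-overhead-g.Used)
	}
}
```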