ollama37/memory_trace_analysis.md
Commit 6d87524e22 (Shang Chieh Tseng): Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on a single GPU and inference runs correctly

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Memory Estimation Trace Analysis for gemma3:12b
**Date**: 2025-10-29
**Goal**: Understand why estimated memory (11.9 GiB) exceeds actual usage (10.24 GiB per nvidia-smi) by ~1.66 GiB
## Input Data from Logs
### System Configuration
- GPUs: 2x Tesla K80 (11.2 GiB each)
- Model: gemma3:12b
- Layers: 49 total (48 repeating + 1 output)
- Context: 4096 tokens
- Batch: 512 tokens
- Parallel: 1
### Log Output - Estimated Memory
```
memory.available="[11.1 GiB 11.1 GiB]"
memory.required.full="11.9 GiB"
memory.required.partial="11.9 GiB"
memory.required.kv="736.0 MiB"
memory.required.allocations="[3.3 GiB 8.6 GiB]"
memory.weights.total="6.8 GiB"
memory.weights.repeating="6.0 GiB"
memory.weights.nonrepeating="787.5 MiB"
memory.graph.full="1.3 GiB"
memory.graph.partial="1.3 GiB"
projector.weights="795.9 MiB"
projector.graph="1.0 GiB"
layers.split="1,48"
```
### Log Output - Actual Memory Usage
```
Model weights loaded:
  CPU buffer:   787.5 MiB
  CUDA0 buffer: 136.7 MiB
  CUDA1 buffer: 7.4 GiB
  Total:        8.324 GiB

Compute graphs allocated:
  CUDA0: 85.8 MiB
  CUDA1: 1.1 GiB
  CPU:   7.5 MiB
  Total: 1.193 GiB

nvidia-smi readings:
  GPU0:  617 MiB (0.602 GiB)
  GPU1:  9866 MiB (9.635 GiB)
  Total: 10.237 GiB
```
## Component-by-Component Analysis
### 1. Model Weights
- **Estimated**: 6.8 GiB (memory.weights.total)
- **Actual**: 8.324 GiB (787.5 MiB CPU + 136.7 MiB GPU0 + 7.4 GiB GPU1)
- **Delta**: +1.524 GiB (actual > estimate)
- **Status**: ⚠️ UNDERESTIMATED
**Note**: This is odd - weights are UNDERESTIMATED, not overestimated!
### 2. KV Cache
- **Estimated**: 736 MiB
- **Actual**: Included in nvidia-smi totals, hard to isolate
- **Status**: ❓ UNKNOWN
### 3. Compute Graphs
- **Estimated**: 1.3 GiB (per log: memory.graph.full)
- **Actual**: 1.193 GiB (85.8 MiB GPU0 + 1.1 GiB GPU1)
- **Delta**: -0.107 GiB (slight overestimate)
- **Status**: ✅ CLOSE
### 4. Projector Components
- **Estimated**: 795.9 MiB weights + 1.0 GiB graph ≈ 1.78 GiB
- **Actual**: Unclear from logs (likely included in weights/graph totals)
- **Status**: ❓ POSSIBLY DOUBLE-COUNTED
### 5. GPU Allocations
```
Estimated per GPU:
  GPU0:  3.3 GiB
  GPU1:  8.6 GiB
  Total: 11.9 GiB

Actual per GPU (nvidia-smi):
  GPU0:  0.602 GiB
  GPU1:  9.635 GiB
  Total: 10.237 GiB

Delta:
  GPU0:  -2.698 GiB (MASSIVE overestimate)
  GPU1:  +1.035 GiB (underestimate)
  Total: -1.663 GiB (net overestimate)
```
## Key Findings
### Finding 1: GPU0 Massive Overestimation
GPU0 estimated at **3.3 GiB** but actually uses only **0.602 GiB**.
**Possible causes:**
1. Full graph allocation assigned to GPU0 during estimation
2. Layer weights estimated for GPU0 but actually loaded elsewhere
3. Conservative buffers that aren't actually needed
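As a rough plausibility check on cause 1, the logged numbers alone can be combined to land near the 3.3 GiB figure. The breakdown below is a guess assembled from those values, not something traced through ollama's estimation code:

```go
package main

import "fmt"

// Speculative breakdown of the 3.3 GiB GPU0 estimate, using only numbers
// from the logs above: full compute graph + projector graph + projector
// weights + the single layer assigned to GPU0.
func main() {
	graphFull := 1.3                 // GiB, memory.graph.full
	projectorGraph := 1.0            // GiB, projector.graph
	projectorWeights := 795.9 / 1024 // GiB, projector.weights
	oneLayer := 6.0 / 48             // GiB, repeating weights spread over 48 layers

	total := graphFull + projectorGraph + projectorWeights + oneLayer
	fmt.Printf("GPU0 if charged graph + projector + 1 layer: %.2f GiB\n", total)
	// Prints ≈3.20 GiB, within ~0.1 GiB of the logged 3.3 GiB allocation.
}
```

The remaining ~0.1 GiB could plausibly be minimum-memory reserve or overhead, which is exactly what the questions at the end of this document aim to pin down.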
### Finding 2: Weights Accounting Mismatch
- Log says `memory.weights.total="6.8 GiB"`
- But actual weight buffers sum to **8.324 GiB**
- **Gap: 1.524 GiB underestimate**
This suggests the `memory.weights.total` in logs **excludes something** (KV cache? buffers?).
### Finding 3: Layer Split Decision
With split `1,48`:
- GPU0: 1 layer only (why?)
- GPU1: 48 layers
If GPU0 can only hold 1 layer, why estimate 3.3 GiB for it?
## Hypothesis: The Root Cause
**Theory**: The layer placement algorithm is placing 1 layer on GPU0 unnecessarily due to:
1. GPU0 gets allocated **full graph overhead** (1.3 GiB) during estimation
2. This leaves ~9.8 GiB "available" on GPU0
3. Algorithm tries to place layers, but only 1 fits after accounting for real overheads
4. This triggers multi-GPU mode
5. But if we **didn't place ANY layers on GPU0**, all 49 layers could fit on GPU1
**Test hypothesis**: What if we disable GPU0 entirely?
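Before changing any code, the hypothesis can be sanity-checked with the measured values above (the same calculation as step 2 below). This is illustrative arithmetic only, not ollama's estimation logic:

```go
package main

import "fmt"

// Back-of-the-envelope single-GPU fit check using the measured values from
// the logs above. Illustrative arithmetic only, not ollama's estimation code.
func main() {
	weights := 8.324   // GiB, sum of the actual weight buffers
	graph := 1.3       // GiB, full compute graph estimate
	kv := 736.0 / 1024 // GiB, estimated KV cache (736 MiB)
	available := 11.2  // GiB, usable memory on one Tesla K80

	total := weights + graph + kv
	fmt.Printf("single-GPU total: %.2f GiB, fits on one K80: %v\n",
		total, total < available)
	// Prints: single-GPU total: 10.34 GiB, fits on one K80: true
}
```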
## Next Steps
1. **Add debug logging** to track exact layer-by-layer placement decisions
2. **Calculate theoretical single-GPU memory**:
- All weights on GPU1: 8.324 GiB
- Full graph on GPU1: 1.3 GiB
- KV cache: 736 MiB (≈0.72 GiB)
- Total: ≈10.34 GiB
- **Result**: Fits in 11.2 GiB! ✅
3. **Find why algorithm splits**:
- Is it the `overhead` value?
- Is it the layer placement logic at lines 243-277?
- Is it the graph allocation at lines 230-241?
4. **Possible fixes** (see the sketch after this list):
- Option A: Be more conservative about GPU0 free space
- Option B: Prefer single-GPU until proven necessary
- Option C: Adjust overhead calculations
- Option D: Fix the layer placement algorithm to try single-GPU first
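For reference, the commit message at the top of this page describes the fix that was eventually applied: an empirical 85% correction factor on graph estimates for Tesla K80 (CC 3.7). A minimal sketch of that idea follows; `applyGraphCorrection` and the compute-capability arguments are hypothetical names for illustration, not ollama's actual API, and 0.85 is simply the measured actual/estimated ratio (≈1.1 GiB / 1.3 GiB):

```go
package main

import "fmt"

// k80GraphCorrection is the measured actual/estimated graph-memory ratio for
// the Tesla K80 in this trace (≈1.1 GiB actual vs 1.3 GiB estimated).
const k80GraphCorrection = 0.85

// applyGraphCorrection scales a graph-memory estimate (in bytes) down for
// compute capability 3.7 (Tesla K80); other GPUs keep the original estimate.
// Hypothetical helper for illustration, not ollama's actual API.
func applyGraphCorrection(graphEstimate uint64, ccMajor, ccMinor int) uint64 {
	if ccMajor == 3 && ccMinor == 7 {
		return uint64(float64(graphEstimate) * k80GraphCorrection)
	}
	return graphEstimate
}

func main() {
	const mib = uint64(1024 * 1024)
	estimated := 1331 * mib // ≈1.3 GiB graph estimate from the logs
	corrected := applyGraphCorrection(estimated, 3, 7)
	fmt.Printf("graph estimate: %d MiB -> %d MiB\n", estimated/mib, corrected/mib)
	// Prints: graph estimate: 1331 MiB -> 1131 MiB (≈1.1 GiB)
}
```

Gating the correction on CC 3.7 keeps the conservative estimates intact for GPUs where the ratio has not been measured.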
## Code Sections to Investigate
1. **Line 106**: `overhead := envconfig.GpuOverhead()` - What is this value?
2. **Lines 193-213**: GPU filtering logic - Which GPUs are deemed "viable"?
3. **Lines 230-241**: Graph allocation per GPU - Is GPU0 getting full 1.3 GiB?
4. **Lines 243-277**: Layer placement loop - Why does it place layers on GPU0?
5. **Lines 282-303**: Output layer placement - Does this trigger GPU0 usage?
## Questions to Answer
1. What is `envconfig.GpuOverhead()` returning?
2. What is `gpus[i].MinimumMemory` for each GPU?
3. During layer placement, what are the `used` values for each GPU?
4. What is `gpusWithSpace` after filtering?
5. Is the 190 MiB optimization actually being applied?
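Most of these questions could be answered with a few temporary slog lines around the placement loop. The sketch below only shows the shape of such instrumentation with placeholder values; `gpuState` merely mirrors the fields named above and is not ollama's actual GPU info type, and the log keys are illustrative:

```go
package main

import "log/slog"

// gpuState mirrors only the fields named in the questions above; it is not
// ollama's actual GPU info type, and the values below are placeholders.
type gpuState struct {
	ID            string
	FreeMemory    uint64 // bytes reported free by the driver
	MinimumMemory uint64 // reserved floor subtracted during estimation
	Used          uint64 // bytes assigned to this GPU so far during placement
}

func main() {
	// Stand-in for envconfig.GpuOverhead() / OLLAMA_GPU_OVERHEAD (question 1).
	overhead := uint64(0)

	gpus := []gpuState{
		{ID: "GPU-0", FreeMemory: 11_918_000_000, MinimumMemory: 450_000_000},
		{ID: "GPU-1", FreeMemory: 11_918_000_000, MinimumMemory: 450_000_000},
	}

	// Emit one line per GPU per placement step; here just the initial state.
	slog.Info("placement inputs", "overhead", overhead)
	for _, g := range gpus {
		slog.Info("gpu state",
			"gpu", g.ID,
			"free", g.FreeMemory,
			"minimum", g.MinimumMemory,
			"used", g.Used,
			"headroom", g.FreeMemory-g.MinimumMemory-overhead-g.Used)
	}
}
```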