ollama37/memory_trace_analysis.md
Commit 6d87524e22 (Shang Chieh Tseng, 2025-10-30): Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on a single GPU with correct inference

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>


Memory Estimation Trace Analysis for gemma3:12b

Date: 2025-10-29
Goal: Understand why the estimated memory (11.9 GiB) exceeds actual usage (10.48 GiB) by 1.42 GiB

Input Data from Logs

System Configuration

  • GPUs: 2x Tesla K80 (11.2 GiB each)
  • Model: gemma3:12b
  • Layers: 49 total (48 repeating + 1 output)
  • Context: 4096 tokens
  • Batch: 512 tokens
  • Parallel: 1

Log Output - Estimated Memory

memory.available="[11.1 GiB 11.1 GiB]"
memory.required.full="11.9 GiB"
memory.required.partial="11.9 GiB"
memory.required.kv="736.0 MiB"
memory.required.allocations="[3.3 GiB 8.6 GiB]"
memory.weights.total="6.8 GiB"
memory.weights.repeating="6.0 GiB"
memory.weights.nonrepeating="787.5 MiB"
memory.graph.full="1.3 GiB"
memory.graph.partial="1.3 GiB"
projector.weights="795.9 MiB"
projector.graph="1.0 GiB"
layers.split="1,48"
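
For reference, the individual estimates above can be summed to see how close they come to the reported 11.9 GiB total. A quick check in plain Go arithmetic over the logged values; interpreting the leftover as per-GPU reserves is a guess, since the log does not print those:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // work in MiB, print GiB

    // Estimated components exactly as printed in the log above (MiB).
    weights := 6.8 * GiB   // memory.weights.total
    kv := 736.0            // memory.required.kv
    graph := 1.3 * GiB     // memory.graph.full
    projWeights := 795.9   // projector.weights
    projGraph := 1.0 * GiB // projector.graph

    sum := weights + kv + graph + projWeights + projGraph
    reported := 11.9 * GiB // memory.required.full

    fmt.Printf("component sum: %.2f GiB\n", sum/GiB)            // ~10.6 GiB
    fmt.Printf("reported full: %.2f GiB\n", reported/GiB)       // 11.9 GiB
    fmt.Printf("unaccounted:   %.2f GiB\n", (reported-sum)/GiB) // ~1.3 GiB, presumably reserves/overhead
}

Roughly 1.3 GiB of the 11.9 GiB estimate is not any of the named components, which lines up with the "conservative buffers" suspicion in Finding 1 below.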

Log Output - Actual Memory Usage

Model weights loaded:
  CPU buffer: 787.5 MiB
  CUDA0 buffer: 136.7 MiB
  CUDA1 buffer: 7.4 GiB
  Total: 8.324 GiB

Compute graphs allocated:
  CUDA0: 85.8 MiB
  CUDA1: 1.1 GiB
  CPU: 7.5 MiB
  Total: 1.193 GiB

nvidia-smi readings:
  GPU0: 617 MiB (0.602 GiB)
  GPU1: 9866 MiB (9.635 GiB)
  Total: 10.237 GiB
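
Only part of that total is directly attributable to the buffers the loader reports. Summing the GPU-resident buffers and comparing against nvidia-smi gives a rough size for everything the buffer lines do not break out; attributing the gap to the KV cache plus CUDA context overhead is an inference, not something the logs state:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // MiB per GiB

    // GPU-resident buffers from the loader log (MiB).
    weightsGPU := 136.7 + 7.4*GiB // CUDA0 + CUDA1 weight buffers
    graphGPU := 85.8 + 1.1*GiB    // CUDA0 + CUDA1 compute graphs

    // What the driver reports (MiB).
    nvidiaSMI := 617.0 + 9866.0

    tracked := weightsGPU + graphGPU
    gap := nvidiaSMI - tracked

    fmt.Printf("tracked GPU buffers: %.2f GiB\n", tracked/GiB)   // ~8.7 GiB
    fmt.Printf("nvidia-smi total:    %.2f GiB\n", nvidiaSMI/GiB) // ~10.2 GiB
    // ~1.5 GiB gap: plausibly the 736 MiB KV cache plus per-GPU CUDA
    // context overhead, neither of which appears in the buffer log.
    fmt.Printf("gap:                 %.2f GiB\n", gap/GiB)
}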

Component-by-Component Analysis

1. Model Weights

  • Estimated: 6.8 GiB (memory.weights.total)
  • Actual: 8.324 GiB (787.5 MiB CPU + 136.7 MiB GPU0 + 7.4 GiB GPU1)
  • Delta: +1.524 GiB (actual > estimate)
  • Status: ⚠️ UNDERESTIMATED

Note: This is odd - weights are UNDERESTIMATED, not overestimated!

2. KV Cache

  • Estimated: 736 MiB
  • Actual: Included in nvidia-smi totals, hard to isolate
  • Status: UNKNOWN

3. Compute Graphs

  • Estimated: 1.3 GiB (per log: memory.graph.full)
  • Actual: 1.193 GiB (85.8 MiB GPU0 + 1.1 GiB GPU1 + 7.5 MiB CPU)
  • Delta: -0.107 GiB (slight overestimate)
  • Status: CLOSE
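
Although the absolute delta is small, this ~15-20% graph overestimate is what the fix described in the commit message above targets: scale the graph estimate down for CC 3.7 devices. A minimal sketch of that idea, using the 0.85 factor from the commit message; the function and parameter names below are placeholders, not the actual scheduler code:

package main

import "fmt"

// applyGraphCorrection scales a graph-memory estimate by an empirically
// measured factor for Tesla K80-class GPUs (compute capability 3.7),
// where the stock estimate runs roughly 15-20% high. Hypothetical names.
func applyGraphCorrection(graphBytes uint64, ccMajor, ccMinor int) uint64 {
    const k80Correction = 0.85 // measured ~1.1 GiB actual vs ~1.3 GiB estimated
    if ccMajor == 3 && ccMinor == 7 {
        return uint64(float64(graphBytes) * k80Correction)
    }
    return graphBytes
}

func main() {
    const bytesPerGiB = float64(1 << 30)
    estimated := 1.3 * bytesPerGiB // graph estimate from the log
    corrected := applyGraphCorrection(uint64(estimated), 3, 7)
    fmt.Printf("corrected graph estimate: %.2f GiB\n", float64(corrected)/bytesPerGiB) // ~1.10 GiB
}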

4. Projector Components

  • Estimated: 795.9 MiB weights + 1.0 GiB graph ≈ 1.78 GiB
  • Actual: Unclear from logs (likely included in weights/graph totals)
  • Status: POSSIBLY DOUBLE-COUNTED

5. GPU Allocations

Estimated per GPU:
  GPU0: 3.3 GiB
  GPU1: 8.6 GiB
  Total: 11.9 GiB

Actual per GPU (nvidia-smi):
  GPU0: 0.602 GiB
  GPU1: 9.635 GiB
  Total: 10.237 GiB

Delta:
  GPU0: -2.698 GiB (MASSIVE overestimate)
  GPU1: +1.035 GiB (underestimate)
  Total: -1.663 GiB (net overestimate)

Key Findings

Finding 1: GPU0 Massive Overestimation

GPU0 estimated at 3.3 GiB but actually uses only 0.602 GiB.

Possible causes:

  1. Full graph allocation assigned to GPU0 during estimation
  2. Layer weights estimated for GPU0 but actually loaded elsewhere
  3. Conservative buffers that aren't actually needed

Finding 2: Weights Accounting Mismatch

  • Log says memory.weights.total="6.8 GiB"
  • But actual weight buffers sum to 8.324 GiB
  • Gap: 1.524 GiB underestimate

This suggests that memory.weights.total in the logs excludes something. The separately logged projector weights (795.9 MiB) are the most obvious candidate, but they only cover about half of the gap.
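
A quick reconciliation using the buffer figures above (the totals differ slightly from the rounded log values):

package main

import "fmt"

func main() {
    // All values in MiB, taken from the log excerpts above.
    estWeights := 6144.0 + 787.5           // repeating + nonrepeating estimate
    actWeights := 787.5 + 136.7 + 7.4*1024 // CPU + CUDA0 + CUDA1 weight buffers
    projWeights := 795.9                   // logged separately from weights.total

    gap := actWeights - estWeights
    fmt.Printf("gap:             %.0f MiB\n", gap)             // ~1570 MiB
    fmt.Printf("minus projector: %.0f MiB\n", gap-projWeights) // ~775 MiB still unexplained
}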

Finding 3: Layer Split Decision

With split 1,48:

  • GPU0: 1 layer only (why?)
  • GPU1: 48 layers

If GPU0 can only hold 1 layer, why estimate 3.3 GiB for it?

Hypothesis: The Root Cause

Theory: The layer placement algorithm is placing 1 layer on GPU0 unnecessarily due to:

  1. GPU0 gets allocated full graph overhead (1.3 GiB) during estimation
  2. This leaves ~9.8 GiB "available" on GPU0
  3. Algorithm tries to place layers, but only 1 fits after accounting for real overheads
  4. This triggers multi-GPU mode
  5. But if we didn't place ANY layers on GPU0, all 49 layers could fit on GPU1

Test hypothesis: What if we disable GPU0 entirely?

Next Steps

  1. Add debug logging to track exact layer-by-layer placement decisions

  2. Calculate theoretical single-GPU memory (see the fit-check sketch after this list):

    • All weights on GPU1: 8.324 GiB
    • Full graph on GPU1: 1.3 GiB
    • KV cache: 0.736 GiB
    • Total: ~10.36 GiB
    • Result: Fits in 11.2 GiB!
  3. Find out why the algorithm splits:

    • Is it the overhead value?
    • Is it the layer placement logic at lines 243-277?
    • Is it the graph allocation at lines 230-241?
  4. Possible fixes:

    • Option A: Be more conservative about GPU0 free space
    • Option B: Prefer single-GPU until proven necessary
    • Option C: Adjust overhead calculations
    • Option D: Fix the layer placement algorithm to try single-GPU first
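
Following up on step 2, a single-GPU fit check with the figures above. minimumMemory and gpuOverhead are placeholder guesses standing in for the values the questions below are trying to pin down, not numbers taken from the code:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // MiB

    // Figures from this document (MiB).
    weights := 8.324 * GiB // all weight buffers, if everything lands on GPU1
    graph := 1.3 * GiB     // full compute graph estimate
    kv := 736.0            // KV cache estimate
    available := 11.2 * GiB

    // Unknowns the estimator also subtracts; placeholder guesses only.
    minimumMemory := 512.0 // stand-in for gpus[i].MinimumMemory
    gpuOverhead := 0.0     // stand-in for envconfig.GpuOverhead()

    need := weights + graph + kv + minimumMemory + gpuOverhead
    fmt.Printf("needed:    %.2f GiB\n", need/GiB)      // ~10.8 GiB with these guesses
    fmt.Printf("available: %.2f GiB\n", available/GiB) // 11.2 GiB
    fmt.Printf("fits on one K80: %v\n", need <= available)
}

Even with half a GiB reserved for the GPU, the whole model clears 11.2 GiB, which is consistent with the 11.0 GiB estimate the fix above ultimately reports.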

Code Sections to Investigate

  1. Line 106: overhead := envconfig.GpuOverhead() - What is this value?
  2. Lines 193-213: GPU filtering logic - Which GPUs are deemed "viable"?
  3. Lines 230-241: Graph allocation per GPU - Is GPU0 getting full 1.3 GiB?
  4. Lines 243-277: Layer placement loop - Why does it place layers on GPU0?
  5. Lines 282-303: Output layer placement - Does this trigger GPU0 usage?
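
If Option B/D turns out to be the right direction, the shape of the change is roughly a pre-pass before the placement loop: check whether any single viable GPU can take the whole model and only fall back to splitting if none can. The type and helper below are invented for illustration and do not correspond to the functions at the line numbers above:

package main

import "fmt"

// gpu is a stand-in for ollama's GPU info struct; only the fields needed
// for the illustration are included.
type gpu struct {
    ID            string
    FreeMemory    uint64 // bytes
    MinimumMemory uint64 // bytes
}

// pickSingleGPU returns the first GPU that can hold the entire model
// (weights + full graph + KV + overhead) by itself, or nil if none can.
func pickSingleGPU(gpus []gpu, weights, graph, kv, overhead uint64) *gpu {
    need := weights + graph + kv + overhead
    for i := range gpus {
        if gpus[i].FreeMemory >= need+gpus[i].MinimumMemory {
            return &gpus[i]
        }
    }
    return nil // fall back to the existing multi-GPU split logic
}

func main() {
    k80s := []gpu{
        {ID: "GPU-0", FreeMemory: 11392 << 20, MinimumMemory: 512 << 20},
        {ID: "GPU-1", FreeMemory: 11392 << 20, MinimumMemory: 512 << 20},
    }
    // ~8.3 GiB weights, ~1.3 GiB graph, 736 MiB KV, no extra overhead.
    if g := pickSingleGPU(k80s, 8524<<20, 1331<<20, 736<<20, 0); g != nil {
        fmt.Println("load everything on", g.ID)
    } else {
        fmt.Println("fall back to splitting")
    }
}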

Questions to Answer

  1. What is envconfig.GpuOverhead() returning?
  2. What is gpus[i].MinimumMemory for each GPU?
  3. During layer placement, what are the used values for each GPU?
  4. What is gpusWithSpace after filtering?
  5. Is the 190 MiB optimization actually being applied?
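
Most of these could be answered in one run with a debug line inside the placement loop. ollama's server logging goes through log/slog, so something along these lines would do; the variable names are guesses at what is in scope around lines 243-277, not actual identifiers:

// Hypothetical instrumentation for the layer-placement loop (lines 243-277);
// envconfig.GpuOverhead() is the call already referenced at line 106.
slog.Debug("layer placement",
    "layer", layerIdx,                       // current layer being placed
    "gpu", gpus[i].ID,                       // candidate GPU
    "gpu_overhead", envconfig.GpuOverhead(), // question 1
    "minimum_memory", gpus[i].MinimumMemory, // question 2
    "used", used[i],                         // question 3
    "gpus_with_space", len(gpusWithSpace),   // question 4
)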