ollama37/memory_trace_analysis.md
Commit 6d87524e22 (Shang Chieh Tseng, 2025-10-30): Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on a single GPU with correct inference

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>


Memory Estimation Trace Analysis for gemma3:12b

Date: 2025-10-29
Goal: Understand why the estimated memory (11.9 GiB) exceeds actual usage (10.48 GiB) by 1.42 GiB

Input Data from Logs

System Configuration

  • GPUs: 2x Tesla K80 (11.2 GiB each)
  • Model: gemma3:12b
  • Layers: 49 total (48 repeating + 1 output)
  • Context: 4096 tokens
  • Batch: 512 tokens
  • Parallel: 1

Log Output - Estimated Memory

memory.available="[11.1 GiB 11.1 GiB]"
memory.required.full="11.9 GiB"
memory.required.partial="11.9 GiB"
memory.required.kv="736.0 MiB"
memory.required.allocations="[3.3 GiB 8.6 GiB]"
memory.weights.total="6.8 GiB"
memory.weights.repeating="6.0 GiB"
memory.weights.nonrepeating="787.5 MiB"
memory.graph.full="1.3 GiB"
memory.graph.partial="1.3 GiB"
projector.weights="795.9 MiB"
projector.graph="1.0 GiB"
layers.split="1,48"
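
For reference, the individual estimates above can be summed to see how close they come to the reported 11.9 GiB total. A quick check in plain Go arithmetic over the logged values; interpreting the leftover as per-GPU reserves is a guess, since the log does not print those:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // work in MiB, print GiB

    // Estimated components exactly as printed in the log above (MiB).
    weights := 6.8 * GiB   // memory.weights.total
    kv := 736.0            // memory.required.kv
    graph := 1.3 * GiB     // memory.graph.full
    projWeights := 795.9   // projector.weights
    projGraph := 1.0 * GiB // projector.graph

    sum := weights + kv + graph + projWeights + projGraph
    reported := 11.9 * GiB // memory.required.full

    fmt.Printf("component sum: %.2f GiB\n", sum/GiB)            // ~10.6 GiB
    fmt.Printf("reported full: %.2f GiB\n", reported/GiB)       // 11.9 GiB
    fmt.Printf("unaccounted:   %.2f GiB\n", (reported-sum)/GiB) // ~1.3 GiB, presumably reserves/overhead
}

Roughly 1.3 GiB of the 11.9 GiB estimate is not any of the named components, which lines up with the "conservative buffers" suspicion in Finding 1 below.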

Log Output - Actual Memory Usage

Model weights loaded:
  CPU buffer: 787.5 MiB
  CUDA0 buffer: 136.7 MiB
  CUDA1 buffer: 7.4 GiB
  Total: 8.324 GiB

Compute graphs allocated:
  CUDA0: 85.8 MiB
  CUDA1: 1.1 GiB
  CPU: 7.5 MiB
  Total: 1.193 GiB

nvidia-smi readings:
  GPU0: 617 MiB (0.602 GiB)
  GPU1: 9866 MiB (9.635 GiB)
  Total: 10.237 GiB
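
Only part of that total is directly attributable to the buffers the loader reports. Summing the GPU-resident buffers and comparing against nvidia-smi gives a rough size for everything the buffer lines do not break out; attributing the gap to the KV cache plus CUDA context overhead is an inference, not something the logs state:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // MiB per GiB

    // GPU-resident buffers from the loader log (MiB).
    weightsGPU := 136.7 + 7.4*GiB // CUDA0 + CUDA1 weight buffers
    graphGPU := 85.8 + 1.1*GiB    // CUDA0 + CUDA1 compute graphs

    // What the driver reports (MiB).
    nvidiaSMI := 617.0 + 9866.0

    tracked := weightsGPU + graphGPU
    gap := nvidiaSMI - tracked

    fmt.Printf("tracked GPU buffers: %.2f GiB\n", tracked/GiB)   // ~8.7 GiB
    fmt.Printf("nvidia-smi total:    %.2f GiB\n", nvidiaSMI/GiB) // ~10.2 GiB
    // ~1.5 GiB gap: plausibly the 736 MiB KV cache plus per-GPU CUDA
    // context overhead, neither of which appears in the buffer log.
    fmt.Printf("gap:                 %.2f GiB\n", gap/GiB)
}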

Component-by-Component Analysis

1. Model Weights

  • Estimated: 6.8 GiB (memory.weights.total)
  • Actual: 8.324 GiB (787.5 MiB CPU + 136.7 MiB GPU0 + 7.4 GiB GPU1)
  • Delta: +1.524 GiB (actual > estimate)
  • Status: ⚠️ UNDERESTIMATED

Note: This is odd - weights are UNDERESTIMATED, not overestimated!

2. KV Cache

  • Estimated: 736 MiB
  • Actual: Included in nvidia-smi totals, hard to isolate
  • Status: UNKNOWN

3. Compute Graphs

  • Estimated: 1.3 GiB (per log: memory.graph.full)
  • Actual: 1.193 GiB (85.8 MiB GPU0 + 1.1 GiB GPU1 + 7.5 MiB CPU)
  • Delta: -0.107 GiB (slight overestimate)
  • Status: CLOSE
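
Although the absolute delta is small, this ~15-20% graph overestimate is what the fix described in the commit message above targets: scale the graph estimate down for CC 3.7 devices. A minimal sketch of that idea, using the 0.85 factor from the commit message; the function and parameter names below are placeholders, not the actual scheduler code:

package main

import "fmt"

// applyGraphCorrection scales a graph-memory estimate by an empirically
// measured factor for Tesla K80-class GPUs (compute capability 3.7),
// where the stock estimate runs roughly 15-20% high. Hypothetical names.
func applyGraphCorrection(graphBytes uint64, ccMajor, ccMinor int) uint64 {
    const k80Correction = 0.85 // measured ~1.1 GiB actual vs ~1.3 GiB estimated
    if ccMajor == 3 && ccMinor == 7 {
        return uint64(float64(graphBytes) * k80Correction)
    }
    return graphBytes
}

func main() {
    const bytesPerGiB = float64(1 << 30)
    estimated := 1.3 * bytesPerGiB // graph estimate from the log
    corrected := applyGraphCorrection(uint64(estimated), 3, 7)
    fmt.Printf("corrected graph estimate: %.2f GiB\n", float64(corrected)/bytesPerGiB) // ~1.10 GiB
}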

4. Projector Components

  • Estimated: 795.9 MiB weights + 1.0 GiB graph ≈ 1.78 GiB
  • Actual: Unclear from logs (likely included in weights/graph totals)
  • Status: POSSIBLY DOUBLE-COUNTED

5. GPU Allocations

Estimated per GPU:
  GPU0: 3.3 GiB
  GPU1: 8.6 GiB
  Total: 11.9 GiB

Actual per GPU (nvidia-smi):
  GPU0: 0.602 GiB
  GPU1: 9.635 GiB
  Total: 10.237 GiB

Delta:
  GPU0: -2.698 GiB (MASSIVE overestimate)
  GPU1: +1.035 GiB (underestimate)
  Total: -1.663 GiB (net overestimate)

Key Findings

Finding 1: GPU0 Massive Overestimation

GPU0 estimated at 3.3 GiB but actually uses only 0.602 GiB.

Possible causes:

  1. Full graph allocation assigned to GPU0 during estimation
  2. Layer weights estimated for GPU0 but actually loaded elsewhere
  3. Conservative buffers that aren't actually needed

Finding 2: Weights Accounting Mismatch

  • Log says memory.weights.total="6.8 GiB"
  • But actual weight buffers sum to 8.324 GiB
  • Gap: 1.524 GiB underestimate

This suggests that memory.weights.total in the logs excludes something. The separately logged projector weights (795.9 MiB) are the most obvious candidate, but they only cover about half of the gap.
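
A quick reconciliation using the buffer figures above (the totals differ slightly from the rounded log values):

package main

import "fmt"

func main() {
    // All values in MiB, taken from the log excerpts above.
    estWeights := 6144.0 + 787.5           // repeating + nonrepeating estimate
    actWeights := 787.5 + 136.7 + 7.4*1024 // CPU + CUDA0 + CUDA1 weight buffers
    projWeights := 795.9                   // logged separately from weights.total

    gap := actWeights - estWeights
    fmt.Printf("gap:             %.0f MiB\n", gap)             // ~1570 MiB
    fmt.Printf("minus projector: %.0f MiB\n", gap-projWeights) // ~775 MiB still unexplained
}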

Finding 3: Layer Split Decision

With split 1,48:

  • GPU0: 1 layer only (why?)
  • GPU1: 48 layers

If GPU0 can only hold 1 layer, why estimate 3.3 GiB for it?

Hypothesis: The Root Cause

Theory: The layer placement algorithm is placing 1 layer on GPU0 unnecessarily due to:

  1. GPU0 gets allocated full graph overhead (1.3 GiB) during estimation
  2. This leaves ~9.8 GiB "available" on GPU0
  3. Algorithm tries to place layers, but only 1 fits after accounting for real overheads
  4. This triggers multi-GPU mode
  5. But if we didn't place ANY layers on GPU0, all 49 layers could fit on GPU1

Test hypothesis: What if we disable GPU0 entirely?

Next Steps

  1. Add debug logging to track exact layer-by-layer placement decisions

  2. Calculate theoretical single-GPU memory (see the fit-check sketch after this list):

    • All weights on GPU1: 8.324 GiB
    • Full graph on GPU1: 1.3 GiB
    • KV cache: 0.736 GiB
    • Total: ~10.36 GiB
    • Result: Fits in 11.2 GiB!
  3. Find out why the algorithm splits:

    • Is it the overhead value?
    • Is it the layer placement logic at lines 243-277?
    • Is it the graph allocation at lines 230-241?
  4. Possible fixes:

    • Option A: Be more conservative about GPU0 free space
    • Option B: Prefer single-GPU until proven necessary
    • Option C: Adjust overhead calculations
    • Option D: Fix the layer placement algorithm to try single-GPU first
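
Following up on step 2, a single-GPU fit check with the figures above. minimumMemory and gpuOverhead are placeholder guesses standing in for the values the questions below are trying to pin down, not numbers taken from the code:

package main

import "fmt"

func main() {
    const GiB = 1024.0 // MiB

    // Figures from this document (MiB).
    weights := 8.324 * GiB // all weight buffers, if everything lands on GPU1
    graph := 1.3 * GiB     // full compute graph estimate
    kv := 736.0            // KV cache estimate
    available := 11.2 * GiB

    // Unknowns the estimator also subtracts; placeholder guesses only.
    minimumMemory := 512.0 // stand-in for gpus[i].MinimumMemory
    gpuOverhead := 0.0     // stand-in for envconfig.GpuOverhead()

    need := weights + graph + kv + minimumMemory + gpuOverhead
    fmt.Printf("needed:    %.2f GiB\n", need/GiB)      // ~10.8 GiB with these guesses
    fmt.Printf("available: %.2f GiB\n", available/GiB) // 11.2 GiB
    fmt.Printf("fits on one K80: %v\n", need <= available)
}

Even with half a GiB reserved for the GPU, the whole model clears 11.2 GiB, which is consistent with the 11.0 GiB estimate the fix above ultimately reports.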

Code Sections to Investigate

  1. Line 106: overhead := envconfig.GpuOverhead() - What is this value?
  2. Lines 193-213: GPU filtering logic - Which GPUs are deemed "viable"?
  3. Lines 230-241: Graph allocation per GPU - Is GPU0 getting full 1.3 GiB?
  4. Lines 243-277: Layer placement loop - Why does it place layers on GPU0?
  5. Lines 282-303: Output layer placement - Does this trigger GPU0 usage?
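
If Option B/D turns out to be the right direction, the shape of the change is roughly a pre-pass before the placement loop: check whether any single viable GPU can take the whole model and only fall back to splitting if none can. The type and helper below are invented for illustration and do not correspond to the functions at the line numbers above:

package main

import "fmt"

// gpu is a stand-in for ollama's GPU info struct; only the fields needed
// for the illustration are included.
type gpu struct {
    ID            string
    FreeMemory    uint64 // bytes
    MinimumMemory uint64 // bytes
}

// pickSingleGPU returns the first GPU that can hold the entire model
// (weights + full graph + KV + overhead) by itself, or nil if none can.
func pickSingleGPU(gpus []gpu, weights, graph, kv, overhead uint64) *gpu {
    need := weights + graph + kv + overhead
    for i := range gpus {
        if gpus[i].FreeMemory >= need+gpus[i].MinimumMemory {
            return &gpus[i]
        }
    }
    return nil // fall back to the existing multi-GPU split logic
}

func main() {
    k80s := []gpu{
        {ID: "GPU-0", FreeMemory: 11392 << 20, MinimumMemory: 512 << 20},
        {ID: "GPU-1", FreeMemory: 11392 << 20, MinimumMemory: 512 << 20},
    }
    // ~8.3 GiB weights, ~1.3 GiB graph, 736 MiB KV, no extra overhead.
    if g := pickSingleGPU(k80s, 8524<<20, 1331<<20, 736<<20, 0); g != nil {
        fmt.Println("load everything on", g.ID)
    } else {
        fmt.Println("fall back to splitting")
    }
}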

Questions to Answer

  1. What is envconfig.GpuOverhead() returning?
  2. What is gpus[i].MinimumMemory for each GPU?
  3. During layer placement, what are the used values for each GPU?
  4. What is gpusWithSpace after filtering?
  5. Is the 190 MiB optimization actually being applied?
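
Most of these could be answered in one run with a debug line inside the placement loop. ollama's server logging goes through log/slog, so something along these lines would do; the variable names are guesses at what is in scope around lines 243-277, not actual identifiers:

// Hypothetical instrumentation for the layer-placement loop (lines 243-277);
// envconfig.GpuOverhead() is the call already referenced at line 106.
slog.Debug("layer placement",
    "layer", layerIdx,                       // current layer being placed
    "gpu", gpus[i].ID,                       // candidate GPU
    "gpu_overhead", envconfig.GpuOverhead(), // question 1
    "minimum_memory", gpus[i].MinimumMemory, // question 2
    "used", used[i],                         // question 3
    "gpus_with_space", len(gpusWithSpace),   // question 4
)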