
LLM Package - Memory Estimation Optimization Guide

Status: ⚠️ IN PROGRESS - Implementation pending

This file contains instructions for optimizing GPU memory estimation to reduce unnecessary multi-GPU splits on Tesla K80 dual-GPU systems.


🎯 Goal

Fix the graph memory overestimation that causes models to split across multiple GPUs when they could fit on a single GPU.

Objective: Reduce total estimated memory from 13.0 GiB → 10.8 GiB for gemma3:12b, allowing it to run on a single Tesla K80 (11.2 GiB).


📊 Problem Analysis (2025-10-29)

Real-World Measurements

Test Case: gemma3:12b model on dual Tesla K80 GPUs

Current Behavior:

Estimated Memory:
  GPU 0: 7.7 GiB  (graph: 1.3 GiB allocated)
  GPU 1: 5.3 GiB  (graph: 1.3 GiB allocated)
  Total: 13.0 GiB (graph: 2.6 GiB total)

Actual Memory (nvidia-smi):
  GPU 0: 4.1 GiB  (graph: 181.3 MiB actual)
  GPU 1: 6.3 GiB  (graph: 1.1 GiB actual)
  Total: 10.4 GiB (graph: 1.28 GiB total)

Result: Split across 2 GPUs (slower inference)

Root Cause

File: llm/memory.go
Lines: 289-298
Issue: Graph memory is allocated to ALL GPUs at 100% of the estimated graph size, but only the primary GPU needs the full graph.

// Current problematic code:
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }
    if fullyLoaded {
        gpuAllocations[i] += graphFullOffload  // ← 1.3 GiB added to EACH GPU
    } else {
        gpuAllocations[i] += graphPartialOffload  // ← 1.3 GiB added to EACH GPU
    }
}

Empirical Findings

From actual Tesla K80 measurements:

  • Primary GPU (GPU 1, which holds the output layer): needs the full graph (1.1 GiB ≈ 100% of the estimate)
  • Secondary GPU (GPU 0, intermediate layers only): needs minimal graph memory (181.3 MiB ≈ 14% of the estimate)
  • Ratio: the secondary GPU uses roughly 1/7 (~14%) of the estimated graph size (derivation sketched below)
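
The arithmetic behind the 1/7 choice can be reproduced directly from these numbers. The following standalone Go sketch is not part of memory.go; the byte values are simply the measurements quoted above. It prints the measured ratio and what a /7 reservation would amount to:

// ratio_sketch.go - standalone arithmetic only; nothing is read from ollama.
package main

import "fmt"

func main() {
	const mib = 1024.0 * 1024.0
	estimatedGraph := 1.3 * 1024 * mib // per-GPU graph estimate from the server log
	secondaryActual := 181.3 * mib     // measured graph usage on the secondary GPU

	fmt.Printf("secondary / estimated = %.1f%%\n", 100*secondaryActual/estimatedGraph) // ~13.6%

	// 1/7 (~14.3%) sits just above the measured ratio, leaving a small
	// safety margin while keeping the change to simple integer division.
	fmt.Printf("graph / 7 = %.0f MiB\n", estimatedGraph/7/mib) // ~190 MiB
}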

🔧 Implementation Instructions

Step 1: Locate the Target Code

File: /home/jack/Documents/ollama37/llm/memory.go
Target Lines: 289-298

Original code block:

// Add the applicable (full or partial) graph allocations
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }
    if fullyLoaded {
        gpuAllocations[i] += graphFullOffload
    } else {
        gpuAllocations[i] += graphPartialOffload
    }
}

Step 2: Replace with Optimized Code

Action: Replace lines 289-298 with the following:

// Add the applicable (full or partial) graph allocations
// ollama37: Multi-GPU optimization for Tesla K80
// Primary GPU (the last GPU, which holds the output layer) needs full graph memory
// Secondary GPUs only need ~15% of graph size based on empirical measurements
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }

    var graphAlloc uint64

    // Determine which GPU gets full graph vs reduced graph
    if len(gpus) > 1 && i < len(gpus)-1 {
        // Secondary GPU: Use 15% of graph size
        // Empirical data: GPU 0 used 181.3 MiB vs 1.3 GiB estimate = ~14% ratio
        // Using 1/7 ratio (14.3%) provides conservative buffer
        if fullyLoaded {
            graphAlloc = graphFullOffload / 7
        } else {
            graphAlloc = graphPartialOffload / 7
        }
    } else {
        // Primary GPU (or single GPU): Full graph allocation
        if fullyLoaded {
            graphAlloc = graphFullOffload
        } else {
            graphAlloc = graphPartialOffload
        }
    }

    gpuAllocations[i] += graphAlloc
}

Step 3: Verification

After making the change, verify the code compiles:

# Navigate to project root
cd /home/jack/Documents/ollama37

# Build Go binary
go build -o ollama .

# Should complete without errors
echo $?  # Should output: 0
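
Optionally, before a full end-to-end test, the new branching logic can be exercised in isolation. This standalone Go sketch mirrors the loop added in Step 2; it does not import or call ollama's EstimateGPULayers, and the layer counts and ~1.3 GiB graph value are illustrative, taken from the gemma3:12b example above:

// graphalloc_sketch.go - standalone sketch that mirrors the new loop in
// llm/memory.go; nothing here touches real ollama code.
package main

import "fmt"

// graphAllocations returns only the graph portion of each GPU's estimate,
// applying the 1/7 reduction to every GPU except the last one.
func graphAllocations(layerCounts []int, graphFullOffload, graphPartialOffload uint64, fullyLoaded bool) []uint64 {
	allocs := make([]uint64, len(layerCounts))
	for i := range layerCounts {
		if layerCounts[i] <= 0 {
			continue
		}
		graph := graphPartialOffload
		if fullyLoaded {
			graph = graphFullOffload
		}
		// Secondary GPUs (all but the last) reserve ~1/7 of the graph.
		if len(layerCounts) > 1 && i < len(layerCounts)-1 {
			graph /= 7
		}
		allocs[i] = graph
	}
	return allocs
}

func main() {
	const MiB = uint64(1024 * 1024)
	graphFull := 1331 * MiB // ~1.3 GiB, as in the gemma3:12b logs

	// Dual-GPU split, fully loaded (layers.split="25,24").
	for i, a := range graphAllocations([]int{25, 24}, graphFull, graphFull, true) {
		fmt.Printf("GPU %d graph reservation: %d MiB\n", i, a/MiB)
	}
	// Expected: GPU 0 -> ~190 MiB (graph/7), GPU 1 -> 1331 MiB (full graph).

	// Single GPU: the reduction branch never triggers.
	for i, a := range graphAllocations([]int{49}, graphFull, graphFull, true) {
		fmt.Printf("single GPU %d graph reservation: %d MiB\n", i, a/MiB)
	}
}

If the printed reservations match the expectations in the comments, the branch behaves as intended before it is wired into the real estimator.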

🧪 Testing Procedure

Test 1: Memory Estimation Check

Objective: Verify the estimated memory now fits in single GPU

# Start ollama server
./ollama serve &

# Wait for server to start
sleep 2

# Load gemma3:12b and watch logs
./ollama run gemma3:12b

# Expected log output should show:
# memory.required.allocations="[X.X GiB Y.Y GiB]"
# Where total (X.X + Y.Y) ≈ 10.8 GiB (down from 13.0 GiB)
#
# With the fix:
# - GPU 0: ~5.5 GiB (down from 7.7 GiB)
# - GPU 1: ~5.3 GiB (unchanged)
# - Total: ~10.8 GiB (down from 13.0 GiB)

Test 2: Single GPU Loading

Objective: Verify model now loads on single GPU instead of splitting

# Monitor GPU memory during load
watch -n 1 nvidia-smi

# Expected behavior:
# BEFORE FIX: Model splits across GPU 0 (4.1 GiB) + GPU 1 (6.3 GiB)
# AFTER FIX:  Model loads on single GPU (likely GPU 1: ~10.4 GiB)

Test 3: Inference Performance

Objective: Verify inference still works correctly

# Run inference test
./ollama run gemma3:12b "Explain quantum computing in one sentence."

# Expected:
# - Response should generate successfully
# - Check nvidia-smi during inference
# - Verify GPU utilization is normal (>80%)

Test 4: Multi-GPU Models Still Work

Objective: Ensure models that TRULY need multi-GPU still split correctly

# Test with a larger model that requires >11 GiB
# (If you have one available)
./ollama run [larger-model]

# Expected:
# - Should still split across both GPUs
# - Primary GPU should still get full graph
# - Secondary GPUs should still get reduced graph

📈 Expected Results

Memory Allocation Improvements

Metric              Before Fix   After Fix   Improvement
GPU 0 allocation    7.7 GiB      5.5 GiB     -2.2 GiB (29%)
GPU 1 allocation    5.3 GiB      5.3 GiB     unchanged
Total estimate      13.0 GiB     10.8 GiB    -2.2 GiB (17%)
Fits single K80?    No           Yes         n/a
Graph overhead      2.6 GiB      1.49 GiB    -1.11 GiB (43%)

Performance Expectations

Single-GPU Mode (NEW):

  • Faster inference (no cross-GPU communication)
  • Simpler memory management
  • Better VRAM utilization
  • Models up to ~10.5 GiB can run on single GPU

Multi-GPU Mode (for larger models):

  • Still works correctly for models >11 GiB
  • More accurate memory estimation
  • Reduced wasted VRAM on secondary GPUs

🔍 How It Works

Memory Allocation Logic

Key Insight: In multi-GPU splits, the last GPU (highest index) handles the output layer and the final computations, so it requires the full graph memory. Earlier GPUs handle only intermediate layers and need minimal graph memory.

Layer Distribution Example (gemma3:12b):

layers.split="25,24"
  GPU 0: 25 layers (intermediate layers only)
  GPU 1: 24 layers (includes output layer)

GPU 1 needs full graph for final computations
GPU 0 only needs small graph for intermediate passes

Graph Memory Breakdown

Full Graph Memory: 1.3 GiB (from model.graphFullOffload)

Multi-GPU Allocation:
  GPU 0 (secondary): 1.3 GiB / 7 ≈ 190 MiB  (~14% - close to the empirical 181.3 MiB)
  GPU 1 (primary):   1.3 GiB / 1 = 1.3 GiB  (100% - matches empirical 1.1 GiB)

Total Graph: 1.49 GiB (vs 2.6 GiB before) = 43% reduction

Ratio Selection Rationale

Why 1/7 (14.3%)?

  1. Empirical measurement: GPU 0 used 181.3 MiB / 1.3 GiB = 14.0%
  2. Conservative buffer: 1/7 = 14.3% provides slight headroom
  3. Simple integer division: Easy to compute, no floating point
  4. Validated: Matches real-world usage within 3%

Alternative ratios to consider (if needed):

  • / 8 = 12.5% (more aggressive)
  • / 6 = 16.7% (more conservative)
  • / 5 = 20.0% (very conservative)

The current choice of /7 provides the best balance of accuracy and safety margin; the sketch below works the alternatives out against the measured usage.
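
For reference, a standalone Go sketch working those alternatives out in MiB; the ~1.3 GiB estimate and 181.3 MiB measurement are the values from the Problem Analysis section, and nothing here is real ollama code:

// divisor_sketch.go - standalone arithmetic for the candidate divisors above.
package main

import "fmt"

func main() {
	const mib = 1024.0 * 1024.0
	graph := 1.3 * 1024 * mib // per-GPU graph estimate
	measured := 181.3 * mib   // actual secondary-GPU graph usage

	for _, d := range []float64{5, 6, 7, 8} {
		fmt.Printf("graph/%.0f = %.0f MiB (measured: %.0f MiB)\n", d, graph/d/mib, measured/mib)
	}
	// graph/8 ≈ 166 MiB falls below the measured 181 MiB, so it risks under-reserving;
	// /7 (~190 MiB) is the smallest divisor that still clears the measurement.
}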


🐛 Troubleshooting

Issue: Model still splits across GPUs after fix

Diagnosis:

# Check the log output for memory.required.allocations
grep "memory.required.allocations" /path/to/ollama.log

Possible causes:

  1. Code change not applied correctly - verify lines 289-298
  2. Binary not rebuilt - run go build -o ollama . again
  3. Old process still running - pkill ollama and restart

Issue: Compilation errors after change

Error: undefined: graphAlloc

Solution: Ensure the entire for-loop block (lines 289-298) is replaced, not just part of it.

Issue: Out of memory errors during inference

Symptoms: Model loads but fails during generation

Solution: The 1/7 ratio may be too aggressive. Edit memory.go and change:

graphAlloc = graphFullOffload / 7  // Change to /6 for more headroom

Issue: Model loads but inference is slow

Diagnosis: Check if model actually loaded on single GPU:

nvidia-smi  # During inference

Expected: one GPU should show ~10-11 GiB usage while the other stays minimal.
If both GPUs are active, the model may still be splitting (check the logs).


📝 Additional Notes

Preserving Multi-GPU Functionality

This optimization ONLY affects multi-GPU systems. Single-GPU systems are unaffected because:

if len(gpus) > 1 && i < len(gpus)-1 {
    // Only executes when multiple GPUs present
}

Future Enhancements (Optional)

If more fine-tuning is needed, consider:

  1. Model-specific ratios: Larger models might need different ratios
  2. Layer-count based calculation: Scale ratio based on layer distribution
  3. Environment variable: OLLAMA_SECONDARY_GPU_GRAPH_RATIO for user control (a sketch follows below)

For now, the hardcoded 1/7 ratio provides best results for Tesla K80 based on empirical data.
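
If option 3 is ever implemented, a minimal sketch could look like the following. Both the helper and the reading of OLLAMA_SECONDARY_GPU_GRAPH_RATIO (interpreted here as an integer divisor) are hypothetical, not existing ollama behavior:

// Hypothetical sketch only; nothing below exists in ollama today.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// secondaryGraphDivisor reads OLLAMA_SECONDARY_GPU_GRAPH_RATIO as an integer
// divisor (7 -> reserve 1/7 of the graph), falling back to the current default.
func secondaryGraphDivisor() uint64 {
	if v := os.Getenv("OLLAMA_SECONDARY_GPU_GRAPH_RATIO"); v != "" {
		if d, err := strconv.ParseUint(v, 10, 64); err == nil && d > 0 {
			return d
		}
	}
	return 7 // matches the hardcoded 1/7 default above
}

func main() {
	const MiB = uint64(1024 * 1024)
	graphFullOffload := 1331 * MiB // ~1.3 GiB example value
	fmt.Printf("secondary GPU graph reservation: %d MiB\n",
		graphFullOffload/secondaryGraphDivisor()/MiB)
}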


Completion Checklist

  • Code modified: Lines 289-298 in llm/memory.go replaced with optimized version
  • Build successful: go build -o ollama . completes without errors
  • Test 1 passed: Memory estimation reduced to ~10.8 GiB
  • Test 2 passed: Model loads on single GPU
  • Test 3 passed: Inference works correctly
  • Test 4 passed: Large models still split when needed
  • Documentation updated: Update root CLAUDE.md status from "IN PROGRESS" to "COMPLETED"
  • Performance verified: Single-GPU inference faster than multi-GPU split

Once all items checked, update status in /home/jack/Documents/ollama37/CLAUDE.md:

**Status**: ✅ **COMPLETED** - Single-GPU preference optimization deployed

📚 Reference

Related Files:

  • llm/memory.go - Memory estimation logic (the file this guide modifies)
  • llm/server.go - LLM server process management
  • server/sched.go - GPU scheduler
  • discover/gpu.go - GPU detection and capabilities

Key Functions:

  • EstimateGPULayers() - Main memory estimation function (line 74)
  • PredictServerFit() - Determines if model fits in available VRAM (line 18)

Empirical Data Source:

  • User logs from 2025-10-29 showing gemma3:12b memory usage
  • nvidia-smi measurements during actual inference
  • Ollama server logs with detailed memory allocations