LLM Package - Memory Estimation Optimization Guide
Status: ⚠️ IN PROGRESS - Implementation pending
This file contains instructions for optimizing GPU memory estimation to reduce unnecessary multi-GPU splits on Tesla K80 dual-GPU systems.
🎯 Goal
Fix the graph memory overestimation that causes models to split across multiple GPUs when they could fit on a single GPU.
Objective: Reduce total estimated memory from 13.0 GiB → 10.8 GiB for gemma3:12b, allowing it to run on a single Tesla K80 (11.2 GiB).
📊 Problem Analysis (2025-10-29)
Real-World Measurements
Test Case: gemma3:12b model on dual Tesla K80 GPUs
Current Behavior:
Estimated Memory:
GPU 0: 7.7 GiB (graph: 1.3 GiB allocated)
GPU 1: 5.3 GiB (graph: 1.3 GiB allocated)
Total: 13.0 GiB (graph: 2.6 GiB total)
Actual Memory (nvidia-smi):
GPU 0: 4.1 GiB (graph: 181.3 MiB actual)
GPU 1: 6.3 GiB (graph: 1.1 GiB actual)
Total: 10.4 GiB (graph: 1.28 GiB total)
Result: Split across 2 GPUs (slower inference)
Root Cause
File: llm/memory.go
Lines: 289-298
Issue: Graph memory is allocated to ALL GPUs at 100%, but only the primary GPU needs full graph size.
// Current problematic code:
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }
    if fullyLoaded {
        gpuAllocations[i] += graphFullOffload // ← 1.3 GiB added to EACH GPU
    } else {
        gpuAllocations[i] += graphPartialOffload // ← 1.3 GiB added to EACH GPU
    }
}
Empirical Findings
From actual Tesla K80 measurements:
- Primary GPU (GPU 1 - last in the split, holds the output layer): Needs the full graph (1.1 GiB ≈ 100% of estimate)
- Secondary GPU (GPU 0 - intermediate layers only): Needs minimal graph (181.3 MiB ≈ 14% of estimate)
- Ratio: Secondary GPU uses ~1/7 (14%) of the estimated graph size (a quick arithmetic check follows below)
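The ratio can be double-checked with a short standalone Go snippet (illustrative only; it simply reproduces the measured values quoted above):
package main

import "fmt"

func main() {
    // Measured values from the 2025-10-29 logs (secondary GPU, gemma3:12b):
    const actualGraphMiB = 181.3          // nvidia-smi measurement on GPU 0
    const estimatedGraphMiB = 1.3 * 1024  // the 1.3 GiB per-GPU estimate, in MiB

    ratio := actualGraphMiB / estimatedGraphMiB
    fmt.Printf("measured ratio: %.1f%% of the estimate\n", ratio*100) // prints ~13.6%, i.e. roughly the 14% quoted above
    fmt.Printf("1/7 divisor:    %.1f%%\n", 100.0/7)                   // ~14.3%
}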
🔧 Implementation Instructions
Step 1: Locate the Target Code
File: /home/jack/Documents/ollama37/llm/memory.go
Target Lines: 289-298
Original code block:
// Add the applicable (full or partial) graph allocations
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }
    if fullyLoaded {
        gpuAllocations[i] += graphFullOffload
    } else {
        gpuAllocations[i] += graphPartialOffload
    }
}
Step 2: Replace with Optimized Code
Action: Replace lines 289-298 with the following:
// Add the applicable (full or partial) graph allocations
// ollama37: Multi-GPU optimization for Tesla K80
// Primary GPU (last GPU with most layers) needs full graph memory
// Secondary GPUs only need ~15% of graph size based on empirical measurements
for i := range gpus {
    if layerCounts[i] <= 0 {
        continue
    }
    var graphAlloc uint64
    // Determine which GPU gets full graph vs reduced graph
    if len(gpus) > 1 && i < len(gpus)-1 {
        // Secondary GPU: Use 15% of graph size
        // Empirical data: GPU 0 used 181.3 MiB vs 1.3 GiB estimate = ~14% ratio
        // Using 1/7 ratio (14.3%) provides conservative buffer
        if fullyLoaded {
            graphAlloc = graphFullOffload / 7
        } else {
            graphAlloc = graphPartialOffload / 7
        }
    } else {
        // Primary GPU (or single GPU): Full graph allocation
        if fullyLoaded {
            graphAlloc = graphFullOffload
        } else {
            graphAlloc = graphPartialOffload
        }
    }
    gpuAllocations[i] += graphAlloc
}
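To sanity-check the new allocation rule in isolation, here is a minimal test sketch. It assumes the decision were factored out into a small helper (graphAllocForGPU is a hypothetical name; no such function exists in memory.go today), so treat it as a verification aid rather than code to paste in:
package llm

import "testing"

// graphAllocForGPU mirrors the allocation rule from the loop above, factored
// out so it can be unit-tested. Hypothetical helper - not part of memory.go.
func graphAllocForGPU(i, numGPUs int, graphSize uint64) uint64 {
    if numGPUs > 1 && i < numGPUs-1 {
        return graphSize / 7 // secondary GPU: ~14% of the graph estimate
    }
    return graphSize // primary (last) GPU, or the only GPU: full graph
}

func TestGraphAllocSplit(t *testing.T) {
    const graph = uint64(1300) << 20 // ~1.3 GiB, matching the gemma3:12b estimate

    // Dual-GPU split: GPU 0 is secondary, GPU 1 (last index) is primary.
    if got := graphAllocForGPU(0, 2, graph); got != graph/7 {
        t.Errorf("secondary GPU: got %d, want %d", got, graph/7)
    }
    if got := graphAllocForGPU(1, 2, graph); got != graph {
        t.Errorf("primary GPU: got %d, want %d", got, graph)
    }

    // Single-GPU systems must be unaffected by the change.
    if got := graphAllocForGPU(0, 1, graph); got != graph {
        t.Errorf("single GPU: got %d, want %d", got, graph)
    }
}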
Step 3: Verification
After making the change, verify the code compiles:
# Navigate to project root
cd /home/jack/Documents/ollama37
# Build Go binary
go build -o ollama .
# Should complete without errors
echo $? # Should output: 0
🧪 Testing Procedure
Test 1: Memory Estimation Check
Objective: Verify the estimated memory now fits in single GPU
# Start ollama server
./ollama serve &
# Wait for server to start
sleep 2
# Load gemma3:12b and watch logs
./ollama run gemma3:12b
# Expected log output should show:
# memory.required.allocations="[X.X GiB Y.Y GiB]"
# Where total (X.X + Y.Y) ≈ 10.8 GiB (down from 13.0 GiB)
#
# With the fix:
# - GPU 0: ~5.5 GiB (down from 7.7 GiB; the full 1.3 GiB graph reservation is replaced by a ~0.19 GiB share)
# - GPU 1: ~5.3 GiB (unchanged)
# - Total: ~10.8 GiB (down from 13.0 GiB)
Test 2: Single GPU Loading
Objective: Verify model now loads on single GPU instead of splitting
# Monitor GPU memory during load
watch -n 1 nvidia-smi
# Expected behavior:
# BEFORE FIX: Model splits across GPU 0 (4.1 GiB) + GPU 1 (6.3 GiB)
# AFTER FIX: Model loads on single GPU (likely GPU 1: ~10.4 GiB)
Test 3: Inference Performance
Objective: Verify inference still works correctly
# Run inference test
./ollama run gemma3:12b "Explain quantum computing in one sentence."
# Expected:
# - Response should generate successfully
# - Check nvidia-smi during inference
# - Verify GPU utilization is normal (>80%)
Test 4: Multi-GPU Models Still Work
Objective: Ensure models that TRULY need multi-GPU still split correctly
# Test with a larger model that requires >11 GiB
# (If you have one available)
./ollama run [larger-model]
# Expected:
# - Should still split across both GPUs
# - Primary GPU should still get full graph
# - Secondary GPUs should still get reduced graph
📈 Expected Results
Memory Allocation Improvements
| Metric | Before Fix | After Fix | Improvement |
|---|---|---|---|
| GPU 0 allocation | 7.7 GiB | 5.5 GiB | -2.2 GiB (29%) |
| GPU 1 allocation | 5.3 GiB | 5.3 GiB | unchanged |
| Total estimate | 13.0 GiB | 10.8 GiB | -2.2 GiB (17%) |
| Fits single K80? | ❌ No | ✅ Yes | ✓ |
| Graph overhead | 2.6 GiB | 1.49 GiB | -1.11 GiB (43%) |
Performance Expectations
Single-GPU Mode (NEW):
- ✅ Faster inference (no cross-GPU communication)
- ✅ Simpler memory management
- ✅ Better VRAM utilization
- ✅ Models up to ~10.5 GiB can run on single GPU
Multi-GPU Mode (for larger models):
- ✅ Still works correctly for models >11 GiB
- ✅ More accurate memory estimation
- ✅ Reduced wasted VRAM on secondary GPUs
🔍 How It Works
Memory Allocation Logic
Key Insight: In multi-GPU splits, the last GPU (highest index) handles the final layers, including the output layer, and therefore needs full graph memory. Earlier GPUs handle only intermediate layers and need minimal graph memory.
Layer Distribution Example (gemma3:12b):
layers.split="25,24"
GPU 0: 25 layers (intermediate layers only)
GPU 1: 24 layers (includes output layer)
GPU 1 needs full graph for final computations
GPU 0 only needs small graph for intermediate passes
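Tracing the same rule over this example split with a small standalone program (illustrative values only, not actual memory.go code) makes the asymmetry explicit:
package main

import "fmt"

func main() {
    // gemma3:12b example: layers.split="25,24"
    layerCounts := []int{25, 24} // GPU 0, GPU 1
    const graphGiB = 1.3         // estimated full-offload graph size

    for i, layers := range layerCounts {
        alloc := graphGiB // primary (last) GPU keeps the full graph
        if len(layerCounts) > 1 && i < len(layerCounts)-1 {
            alloc = graphGiB / 7 // secondary GPU gets the reduced share
        }
        fmt.Printf("GPU %d: %d layers, graph ≈ %.2f GiB\n", i, layers, alloc)
    }
    // Prints:
    // GPU 0: 25 layers, graph ≈ 0.19 GiB
    // GPU 1: 24 layers, graph ≈ 1.30 GiB
}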
Graph Memory Breakdown
Full Graph Memory: 1.3 GiB (from model.graphFullOffload)
Multi-GPU Allocation:
GPU 0 (secondary): 1.3 GiB / 7 = 186 MiB (~14% - matches empirical 181.3 MiB)
GPU 1 (primary): 1.3 GiB / 1 = 1.3 GiB (100% - matches empirical 1.1 GiB)
Total Graph: 1.49 GiB (vs 2.6 GiB before) = 43% reduction
Ratio Selection Rationale
Why 1/7 (14.3%)?
- Empirical measurement: GPU 0 used 181.3 MiB / 1.3 GiB = 14.0%
- Conservative buffer: 1/7 = 14.3% provides slight headroom
- Simple integer division: Easy to compute, no floating point
- Validated: Matches real-world usage within 3%
Alternative ratios to consider (if needed):
- /8 = 12.5% (more aggressive)
- /6 = 16.7% (more conservative)
- /5 = 20.0% (very conservative)
The current choice of /7 provides the best balance of accuracy and safety margin.
🐛 Troubleshooting
Issue: Model still splits across GPUs after fix
Diagnosis:
# Check the log output for memory.required.allocations
grep "memory.required.allocations" /path/to/ollama.log
Possible causes:
- Code change not applied correctly: verify lines 289-298
- Binary not rebuilt: run go build -o ollama . again
- Old process still running: pkill ollama and restart
Issue: Compilation errors after change
Error: undefined: graphAlloc
Solution: Ensure the entire for-loop block (lines 289-298) is replaced, not just part of it.
Issue: Out of memory errors during inference
Symptoms: Model loads but fails during generation
Solution: The 1/7 ratio may be too aggressive. Edit memory.go and change:
graphAlloc = graphFullOffload / 7 // Change to /6 for more headroom
Issue: Model loads but inference is slow
Diagnosis: Check if model actually loaded on single GPU:
nvidia-smi # During inference
Expected: One GPU should show ~10-11 GiB usage, the other GPU minimal.
If both GPUs are active: the model may still be splitting (check logs).
📝 Additional Notes
Preserving Multi-GPU Functionality
This optimization ONLY affects multi-GPU systems. Single-GPU systems are unaffected because:
if len(gpus) > 1 && i < len(gpus)-1 {
    // Only executes when multiple GPUs present
}
Future Enhancements (Optional)
If more fine-tuning is needed, consider:
- Model-specific ratios: Larger models might need different ratios
- Layer-count based calculation: Scale ratio based on layer distribution
- Environment variable: OLLAMA_SECONDARY_GPU_GRAPH_RATIO for user control (a sketch follows below)
For now, the hardcoded 1/7 ratio provides the best results for the Tesla K80 based on empirical data.
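If the environment-variable idea is ever pursued, a minimal sketch could look like the following. OLLAMA_SECONDARY_GPU_GRAPH_RATIO is only the proposed name from the list above; neither the variable nor this helper exists in memory.go today:
package main

import (
    "fmt"
    "os"
    "strconv"
)

// secondaryGraphRatio returns the fraction of the graph estimate to reserve on
// secondary GPUs, honoring the proposed OLLAMA_SECONDARY_GPU_GRAPH_RATIO
// variable and falling back to the empirically derived 1/7.
func secondaryGraphRatio() float64 {
    const defaultRatio = 1.0 / 7.0
    v := os.Getenv("OLLAMA_SECONDARY_GPU_GRAPH_RATIO")
    if v == "" {
        return defaultRatio
    }
    r, err := strconv.ParseFloat(v, 64)
    if err != nil || r <= 0 || r > 1 {
        return defaultRatio // ignore unparsable or out-of-range values
    }
    return r
}

func main() {
    graphFullOffload := uint64(1300) << 20 // ~1.3 GiB example estimate
    alloc := uint64(float64(graphFullOffload) * secondaryGraphRatio())
    fmt.Printf("secondary GPU graph allocation: %d MiB\n", alloc>>20)
}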
✅ Completion Checklist
- Code modified: Lines 289-298 in llm/memory.go replaced with optimized version
- Build successful: go build -o ollama . completes without errors
- Test 1 passed: Memory estimation reduced to ~10.8 GiB
- Test 2 passed: Model loads on single GPU
- Test 3 passed: Inference works correctly
- Test 4 passed: Large models still split when needed
- Documentation updated: Root CLAUDE.md status changed from "IN PROGRESS" to "COMPLETED"
- Performance verified: Single-GPU inference faster than multi-GPU split
Once all items are checked, update the status in /home/jack/Documents/ollama37/CLAUDE.md:
**Status**: ✅ **COMPLETED** - Single-GPU preference optimization deployed
📚 Reference
Related Files:
- llm/memory.go - Memory estimation logic (THIS FILE)
- llm/server.go - LLM server process management
- server/sched.go - GPU scheduler
- discover/gpu.go - GPU detection and capabilities
Key Functions:
- EstimateGPULayers() - Main memory estimation function (line 74)
- PredictServerFit() - Determines if model fits in available VRAM (line 18)
Empirical Data Source:
- User logs from 2025-10-29 showing gemma3:12b memory usage
- nvidia-smi measurements during actual inference
- Ollama server logs with detailed memory allocations