mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-10 15:57:04 +00:00
Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:

1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
✅ gemma3:4b: Single GPU (no split, was already working)
✅ gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
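The two changes described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in llm/memory.go: the function name `distributeLayers`, the uniform per-layer size, and the 11469 MiB free-memory figure (about 11.2 GiB) are assumptions made for the example. Only the 190 MiB and 1.3 GiB graph reservations come from the commit message.

```go
package main

import "fmt"

// Graph reservations in MiB, taken from the commit message.
const (
	primaryGraphMiB   = 1331 // about 1.3 GiB, full graph on the primary (last) GPU
	secondaryGraphMiB = 190  // empirically measured figure for secondary GPUs
)

// distributeLayers is a hypothetical helper. It reserves graph memory per GPU
// up front, then places each layer on the last GPU that still has room,
// walking toward GPU 0 only when the later GPUs are full. It returns the
// number of layers placed on each GPU. Layers that fit nowhere are left
// unplaced (a real scheduler would fall back to the CPU).
func distributeLayers(freeMiB []uint64, layerMiB uint64, numLayers int) []int {
	counts := make([]int, len(freeMiB))
	used := make([]uint64, len(freeMiB))
	primary := len(freeMiB) - 1

	// Charge the graph reservation during distribution, not afterwards.
	for g := range used {
		if g == primary {
			used[g] = primaryGraphMiB
		} else {
			used[g] = secondaryGraphMiB
		}
	}

	for l := 0; l < numLayers; l++ {
		// Reverse order: try the primary GPU first, then lower indices.
		for g := primary; g >= 0; g-- {
			if used[g]+layerMiB <= freeMiB[g] {
				used[g] += layerMiB
				counts[g]++
				break
			}
		}
	}
	return counts
}

func main() {
	// Two GPUs with an assumed 11469 MiB free each, 49 layers of an
	// assumed 210 MiB apiece (both numbers are made up for illustration).
	fmt.Println(distributeLayers([]uint64{11469, 11469}, 210, 49)) // prints [1 48]
}
```

With these assumed sizes the primary GPU holds 48 layers and only the overflow layer lands on GPU 0. A round-robin placement (`i%j`) would instead alternate layers between the two GPUs regardless of how much room the primary had left.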
15
CLAUDE.md
@@ -110,7 +110,7 @@ These files contain specific line numbers, code blocks, and commands to execute

 ### Memory Estimation Optimization for Single-GPU Preference

-**Status**: ⚠️ **IN PROGRESS** - Design complete, implementation pending
+**Status**: ✅ **COMPLETED** - Implemented and tested successfully

 **Goal**: Reduce unnecessary multi-GPU splits by fixing graph memory overestimation for Tesla K80 dual-GPU systems.

@@ -131,10 +131,15 @@ Analysis of real-world usage (gemma3:12b) revealed a **2.6 GiB memory overestima
 - Cross-GPU communication overhead reduces inference speed
 - Wasted VRAM reserves space that's never used

-**Solution**: Modify graph allocation logic to use empirically-measured ratios:
-- Primary GPU (last GPU with most layers): 100% of graph size (1.3 GiB)
-- Secondary GPUs: 15% of graph size (~186 MiB)
-- Expected reduction: 13.0 GiB → 10.8 GiB (fits in single K80)
+**Solution Implemented**:
+1. Per-GPU graph allocations (190 MiB for secondary GPUs vs 1.3 GiB for primary)
+2. Reverse-order layer distribution (prefer loading on last GPU first)
+
+**Results Achieved**:
+- **gemma3:4b**: Single GPU (no split) ✅
+- **gemma3:12b**: 1,48 layer split (down from 25,24) - 98% on primary GPU ✅
+- **Memory estimate**: Reduced from 13.0 GiB → 11.9 GiB
+- **Actual usage**: 10.4-10.5 GiB total (fits on single K80)

 **Implementation Details**: See `llm/CLAUDE.md` for specific code changes and testing procedures.

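As a rough cross-check of the figures in this diff, the arithmetic can be written out in a few lines. The helper names are hypothetical, and the first calculation assumes that the whole drop in the estimate comes from the secondary GPU reserving 190 MiB instead of the full 1.3 GiB graph, which the source does not state explicitly.

```go
package main

import "fmt"

const mibPerGiB = 1024.0

// graphSavingGiB returns how much an estimate shrinks when one secondary GPU
// reserves secondaryMiB instead of the full graph (fullGraphGiB).
func graphSavingGiB(fullGraphGiB, secondaryMiB float64) float64 {
	return fullGraphGiB - secondaryMiB/mibPerGiB
}

// primaryShare returns the fraction of layers placed on the primary GPU.
func primaryShare(primaryLayers, secondaryLayers int) float64 {
	return float64(primaryLayers) / float64(primaryLayers+secondaryLayers)
}

func main() {
	// 1.3 GiB full graph vs 190 MiB on the secondary GPU.
	fmt.Printf("graph saving: %.2f GiB\n", graphSavingGiB(1.3, 190)) // prints graph saving: 1.11 GiB

	// 1,48 layer split: 48 layers on the primary GPU, 1 on the secondary.
	fmt.Printf("primary share: %.0f%%\n", 100*primaryShare(48, 1)) // prints primary share: 98%
}
```

Under that assumption the saving of about 1.11 GiB lines up with the reported estimate change from 13.0 GiB to 11.9 GiB, and 48 of 49 layers rounds to the 98% quoted for the primary GPU.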