ollama37

matt/ollama37

Fork 0

mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-10 07:46:59 +00:00

Commit Graph

Author	SHA1	Message	Date
Shang Chieh Tseng	241a03402e	Optimize GPU memory estimation for single-GPU preference on Tesla K80 Implemented multi-GPU memory optimization to reduce unnecessary model splits across dual Tesla K80 GPUs by fixing graph memory overestimation. Changes: 1. Per-GPU graph allocation strategy - Secondary GPUs: 190 MiB (empirically measured) - Primary GPU: Full 1.3 GiB graph allocation - Applied during layer distribution, not just final allocation 2. Reverse-order layer distribution - Prefer loading all layers on last GPU (GPU 1) first - Only use secondary GPUs when primary is full - Changed from round-robin to reverse-order (j-1 instead of i%j) Results: ✅ gemma3:4b: Single GPU (no split, was already working) ✅ gemma3:12b: 1,48 layer split (improved from 25,24 split) - GPU 0: 1 layer, 610 MiB (down from 4156 MiB) - GPU 1: 48 layers, 9857 MiB (primary) - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB) Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models to run on single GPU with better performance (no cross-GPU overhead). Files modified: - llm/memory.go: Core allocation logic (lines 230-288) - llm/CLAUDE.md: Detailed implementation guide - CLAUDE.md: Project status and results summary 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 19:58:20 +08:00
Shang Chieh Tseng	5077ab3fb4	Document Phase 9 completion: Fix CUDA backend loading for CC 3.7 Phase 9 successfully resolved runtime loading issues where CUDA backend failed to load due to undefined Flash Attention symbols. Solution: - Disabled flash attention helper functions (lines 126-274 in fattn.cu) - Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7 - Added GGML_UNUSED macros to prevent compiler warnings - Added ggml_backend_cuda_score() function for backend selection Testing Results: ✅ CUDA backend loads without undefined symbol errors ✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b) ✅ Fast GPU inference confirmed working Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores). If attempted, gracefully aborts with clear error message. All 9 phases of CC 3.7-only optimization now complete and tested. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 17:44:36 +08:00

Author

SHA1

Message

Date

Shang Chieh Tseng

241a03402e

Optimize GPU memory estimation for single-GPU preference on Tesla K80

Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
✅ gemma3:4b: Single GPU (no split, was already working)
✅ gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-29 19:58:20 +08:00

Shang Chieh Tseng

5077ab3fb4

Document Phase 9 completion: Fix CUDA backend loading for CC 3.7

Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection

Testing Results:
✅ CUDA backend loads without undefined symbol errors
✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
✅ Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores).
If attempted, gracefully aborts with clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-29 17:44:36 +08:00

2 Commits