ollama37

mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-10 07:46:59 +00:00

Author	SHA1	Message	Date
Shang Chieh Tseng	94bbfbb2e7	Add Docker-based build system with GPU-enabled builder and runtime containers	2025-11-07 12:48:05 +08:00
Shang Chieh Tseng	d948926581	Fix Tesla K80 CUBLAS compatibility with two-tier fallback strategy This commit implements comprehensive Tesla K80 (Kepler, compute 3.7) compatibility for batched matrix multiplication operations. Problem: Modern CUBLAS functions fail on Tesla K80 with CUBLAS_STATUS_ARCH_MISMATCH: 1. CUBLAS_GEMM_DEFAULT_TENSOR_OP requires Tensor Cores (Volta+ only) 2. cublasGemmStridedBatchedEx/cublasGemmBatchedEx have architectural requirements beyond algorithm selection Solution - Two-Tier Fallback: Tier 1: Algorithm Selection - Volta+ (cc >= 7.0): CUBLAS_GEMM_DEFAULT_TENSOR_OP - Pre-Volta (cc < 7.0): CUBLAS_GEMM_DEFAULT Tier 2: Function Selection - Volta+ or non-FP32: Use Ex variants (flexible precision) - Kepler/Maxwell/Pascal with FP32: Use legacy type-specific functions (cublasSgemmStridedBatched, cublasSgemmBatched) Changes:* CUDA Implementation: - ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu * ggml_cuda_op_mul_mat_cublas: Algorithm selection for non-batched ops * ggml_cuda_mul_mat_batched_cublas_impl: Two-tier fallback for batched ops * Added GGML_CUDA_DEBUG environment variable for conditional debug logging * Comprehensive function documentation explaining fallback strategy Documentation: - CLAUDE.md * Added Tesla K80 CUBLAS Compatibility section * Documented GGML_CUDA_DEBUG environment variable * Enhanced "Running Ollama" section with log capture examples * Updated Files Modified list Code Comments: - Added detailed comments throughout CUDA code explaining: * Why TENSOR_OP fails on pre-Volta GPUs * Why Ex functions require architectural support Compute capability checks and fallback logic * Debug logging usage Testing: All models verified working on Tesla K80: - ✅ gemma3:4b - ✅ gpt-oss - ✅ deepseek-r1 Debug flag tested in both enabled and disabled states. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 23:52:45 +08:00
Shang Chieh Tseng	ef14fb5b26	Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches. ## Key Changes ### CUDA Compute Capability 3.7 Support (Tesla K80) - Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt - Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset - Using 37-virtual (PTX with JIT compilation) for maximum compatibility ### Legacy Toolchain Compatibility - NVIDIA Driver: 470.256.02 (last version supporting Kepler/K80) - CUDA Version: 11.4.4 (last CUDA 11.x supporting compute 3.7) - GCC Version: 10.5.0 (required by CUDA 11.4 host_config.h) ### CPU Architecture Trade-offs Due to GCC 10.5 limitation, sacrificed newer CPU optimizations: - Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+) - Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA - Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility) ### Build System Updates - Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7 - Added -Wno-deprecated-gpu-targets flag to suppress warnings - Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI ### Upstream Sync Merged latest llama.cpp changes including: - Enhanced KV cache management with ISWA and hybrid memory support - Improved multi-modal support (mtmd framework) - New model architectures (Gemma3, Llama4, Qwen3, etc.) - GPU backend improvements for CUDA, Metal, and ROCm - Updated quantization support and GGUF format handling ### Documentation - Updated CLAUDE.md with comprehensive build instructions - Documented toolchain constraints and CPU architecture trade-offs - Removed outdated CI/CD workflows (tesla-k80-*.yml) - Cleaned up temporary development artifacts ## Rationale This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements. The toolchain constraint creates a deadlock: - K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 14:03:05 +08:00
Shang Chieh Tseng	fabe2c5cb7	Revert Phase 1 memory optimization to fix multi-GPU stability Problem: Phase 1 optimization (190 MiB for secondary GPUs) caused OOM errors on large multi-GPU models due to insufficient runtime buffer: - gemma3:27b: Estimated 10.9 GiB, used 10.8 GiB → only 400 MiB free - Failed when allocating 6 MiB for KV cache during graph reservation - Root cause: 190 MiB didn't account for runtime allocations Investigation: Studied upstream Ollama code (upstream/main:llm/memory.go) and confirmed official behavior allocates FULL graph to ALL GPUs with layers, not reduced allocation for secondary GPUs. Solution: Reverted llm/memory.go to upstream behavior: - Removed gpuGraphAllocations map and per-GPU logic - Restored original round-robin layer distribution (layerCount%j) - All GPUs with layers now get full graph allocation - Matches official Ollama for maximum stability Results with revert: - gemma3:27b: ✅ Works correctly with 31/31 layer split - Memory allocation: [10.0 GiB, 9.8 GiB] with proper headroom - nvidia-smi: GPU0 8.7 GiB, GPU1 8.7 GiB (even distribution) - Graph allocation: Both GPUs get 300 MiB (actual, not estimate) Trade-offs: - ❌ gemma3:12b will use 2 GPUs instead of trying single-GPU (stable) - ✅ Large models (27b+) work reliably with proper buffer - ✅ Matches upstream behavior (easier to maintain) - ✅ Conservative estimates prevent OOM errors 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-30 19:10:23 +08:00
Shang Chieh Tseng	d002de9af4	Fix multi-GPU OOM errors by disabling Phase 2 graph correction Problem: The Phase 2 CC 3.7 graph correction (85% reduction) was being applied unconditionally to all models, causing multi-GPU models like gemma3:27b and gpt-oss:20b to fail with "cudaMalloc failed: out of memory" errors on secondary GPUs. Root Cause: The 85% correction made the allocator think large models could fit on a single GPU, but then failed when trying to allocate even small amounts (16 MiB) on GPU 1 because the memory estimate was too low. Solution: Disabled Phase 2 correction factor in llm/memory.go:173-182. Phase 1 optimization (per-GPU graph allocation with 190 MiB for secondary GPUs) is sufficient and correctly handles both single-GPU and multi-GPU scenarios without causing OOM errors. Impact: - gemma3:4b: Still runs on single GPU ✅ - gemma3:12b: May split across GPUs (acceptable trade-off) ✅ - gemma3:27b: Now works with multi-GPU split ✅ - gpt-oss:20b: Now works with multi-GPU split ✅ Files Modified: - llm/memory.go: Commented out Phase 2 correction factor - CLAUDE.md: Updated Phase 2 section with new status and lessons learned 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-30 18:15:46 +08:00
Shang Chieh Tseng	296d537a2c	Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction Added Phase 2 documentation for single-GPU optimization: - CC 3.7 graph correction factor (85% of estimate) - gemma3:12b now loads on single GPU - Improved from 11.9 GiB → 11.0 GiB estimation - Validated with 10.0 GiB actual usage, 94% GPU utilization 🤖 Generated with Claude Code (https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-30 00:16:38 +08:00
Shang Chieh Tseng	d04ea50ced	Fix gpt-oss model architecture to match GGUF tensor format The gpt-oss model architecture code expected fused tensors (attn_qkv, ffn_gate_up_exps) but the actual GGUF files contain separate tensors (attn_q/k/v, ffn_gate_exps/up_exps), causing nil pointer panics during model loading. Changes: - model/models/gptoss/model.go: Updated AttentionBlock to use separate Query/Key/Value fields instead of fused QKV, modified Forward() to compute projections separately - model/models/gptoss/model.go: Updated MLPBlock to use separate Gate/Up fields instead of fused GateUp, simplified Forward() logic - fs/ggml/type.go: Reorganized MXFP4 tensor type constant ordering - ml/backend/ggml/ggml/include/ggml.h: Moved GGML_TYPE_MXFP4 to end of enum to match GGUF file format specification - ml/backend/ggml/ggml/src/ggml.c: Updated type name array to match reordered enum - CLAUDE.md: Documented gpt-oss model compatibility fix Result: gpt-oss:20b model now loads and runs successfully on Tesla K80, all 25 layers offload to GPU correctly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 23:34:03 +08:00
Shang Chieh Tseng	241a03402e	Optimize GPU memory estimation for single-GPU preference on Tesla K80 Implemented multi-GPU memory optimization to reduce unnecessary model splits across dual Tesla K80 GPUs by fixing graph memory overestimation. Changes: 1. Per-GPU graph allocation strategy - Secondary GPUs: 190 MiB (empirically measured) - Primary GPU: Full 1.3 GiB graph allocation - Applied during layer distribution, not just final allocation 2. Reverse-order layer distribution - Prefer loading all layers on last GPU (GPU 1) first - Only use secondary GPUs when primary is full - Changed from round-robin to reverse-order (j-1 instead of i%j) Results: ✅ gemma3:4b: Single GPU (no split, was already working) ✅ gemma3:12b: 1,48 layer split (improved from 25,24 split) - GPU 0: 1 layer, 610 MiB (down from 4156 MiB) - GPU 1: 48 layers, 9857 MiB (primary) - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB) Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models to run on single GPU with better performance (no cross-GPU overhead). Files modified: - llm/memory.go: Core allocation logic (lines 230-288) - llm/CLAUDE.md: Detailed implementation guide - CLAUDE.md: Project status and results summary 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 19:58:20 +08:00
Shang Chieh Tseng	5077ab3fb4	Document Phase 9 completion: Fix CUDA backend loading for CC 3.7 Phase 9 successfully resolved runtime loading issues where CUDA backend failed to load due to undefined Flash Attention symbols. Solution: - Disabled flash attention helper functions (lines 126-274 in fattn.cu) - Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7 - Added GGML_UNUSED macros to prevent compiler warnings - Added ggml_backend_cuda_score() function for backend selection Testing Results: ✅ CUDA backend loads without undefined symbol errors ✅ GPU layers offload correctly (e.g., 35/35 for gemma3:4b) ✅ Fast GPU inference confirmed working Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores). If attempted, gracefully aborts with clear error message. All 9 phases of CC 3.7-only optimization now complete and tested. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 17:44:36 +08:00
Shang Chieh Tseng	771044bead	Complete CC 3.7-only CUDA optimization for Tesla K80 support Simplify CUDA backend to exclusively support Compute Capability 3.7 (Kepler/Tesla K80). This optimization removes ~2,700 lines of modern GPU code and resolves all compilation issues. Changes: - Remove tensor core files (mma.cuh, fattn-wmma-f16.*, fattn-mma-f16.cuh) and 92 template instances - Hardcode architecture detection to always return CC 3.7 (370) in common.cuh - Disable modern GPU features: FP16 native ops, MMA/WMMA, CP_ASYNC, BF16, CUDA graphs - Disable 6 MMA functions in mmq.cuh while preserving DP4A functions for CC 3.7 - Replace undefined architecture constants (PASCAL/VOLTA/DP4A/ADA_LOVELACE) with CC 3.7 equivalents - Set CMAKE_CUDA_ARCHITECTURES to "37" only in CMakeLists.txt and CMakePresets.json - Hardcode Stream-K scheduling to false, precision to FP32 throughout - Add comprehensive CLAUDE.md documentation with complete optimization history Build configuration now compiles only for architecture 37, resulting in 80-85% smaller binaries and 5-6x faster build times. All removed code paths were unreachable on CC 3.7 hardware, ensuring no performance degradation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-29 15:21:08 +08:00
Shang Chieh Tseng	135b799b13	Update command.	2025-10-29 14:21:03 +08:00
Shang Chieh Tseng	f337f53408	docs: update documentation to reflect Gemma3n support in v1.3.0 Update README.md and CLAUDE.md to correctly reference Gemma3n model support that was added in version 1.3.0, replacing generic "Gemma 3" references with the specific "Gemma3n" model name. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-20 09:47:05 +08:00
Shang Chieh Tseng	7c029749bc	docs: restructure README and create comprehensive manual build guide - Restructure README.md for better readability and organization - Reduce README word count by 75% while maintaining key information - Move detailed installation guides to docs/manual-build.md - Add Tesla K80-specific build instructions and optimizations - Update CLAUDE.md with new documentation structure and references - Improve title formatting with emoji and clear tagline 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-20 09:11:43 +08:00
Shang Chieh Tseng	cbcbc9ae07	Add support for new models and fix GitHub issues - Add Gemma3n model support with text generation capabilities - Add new CUDA mean operations for improved performance - Add macOS documentation and performance tests - Update LLAMA patches for ROCm/CUDA compatibility - Fix various model conversion and processing issues - Update CI workflows and build configurations - Add library model tests and Shakespeare test data 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-20 00:12:36 +08:00

14 Commits