ollama37

mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-18 19:56:59 +00:00

Author	SHA1	Message	Date
Shang Chieh Tseng	92ba15bcb1	Fix multi-GPU memory allocation for large models (deepseek-r1:14b)

This commit fixes the issue where large models (>10B parameters) fail to load due to underestimated compute buffer memory requirements, causing allocation failures when the model should use multiple GPUs.

Problem:
- deepseek-r1:14b (14B, qwen2 architecture) failed with "failed to allocate compute buffers" error
- System has 2×Tesla K80 GPUs (24GB total) but tried to fit 12GB model in 1×11GB GPU
- Root cause: Memory estimation underestimated compute buffers by 3-4× (estimated 916 MB, actual requirement ~3-4 GB)

Solution:
1. Added model-family-specific batch size defaults (llm/memory.go)
   - Different architectures have different optimal batch sizes
   - deepseek2: 2048/256, qwen2: 512/512, llama: 512/512, etc.
   - Ensures accurate memory estimation based on architecture
2. Updated server to use architecture-specific batch sizes (llm/server.go)
   - Detects model architecture from GGUF metadata
   - Uses family defaults when user doesn't specify
   - Ensures consistency between estimation and allocation
3. Applied 3.5× safety margin to compute buffer estimates (llm/memory.go)
   - Accounts for temporary tensors not captured in GraphSize formulas
   - Conservative approach prevents allocation failures
   - Documented with detailed analysis of underestimation causes
4. Implemented measurement API for future use (llama-context.cpp, llama.go)
   - C++ function to measure actual memory requirements
   - Go wrapper for integration into GPU selection
   - Foundation for future measurement-based approach
   - Currently unused but documented for future improvement

Results:
- deepseek-r1:14b now loads successfully using both GPUs
- Proper distribution: 25 layers on GPU0, 24 layers on GPU1
- Total memory: 16.2 GB across 2×11 GB GPUs (8.4 + 7.8 GB)
- Compute buffers: 3.1 GB per GPU (with safety margin applied)
- All other models continue to work correctly

Comprehensive documentation added to all modified code explaining:
- Problem analysis with real examples
- Solution rationale and trade-offs
- Future improvement paths

Tested with: deepseek-r1:14b, deepseek-r1:8b, gemma3:4b, gpt-oss

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-06 14:13:29 +08:00
Shang Chieh Tseng	ef14fb5b26	Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support

This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches.

## Key Changes

### CUDA Compute Capability 3.7 Support (Tesla K80)
- Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt
- Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset
- Using 37-virtual (PTX with JIT compilation) for maximum compatibility

### Legacy Toolchain Compatibility
- NVIDIA Driver: 470.256.02 (last version supporting Kepler/K80)
- CUDA Version: 11.4.4 (last CUDA 11.x supporting compute 3.7)
- GCC Version: 10.5.0 (required by CUDA 11.4 host_config.h)

### CPU Architecture Trade-offs
Due to GCC 10.5 limitation, sacrificed newer CPU optimizations:
- Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+)
- Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA
- Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility)

### Build System Updates
- Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7
- Added -Wno-deprecated-gpu-targets flag to suppress warnings
- Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI

### Upstream Sync
Merged latest llama.cpp changes including:
- Enhanced KV cache management with ISWA and hybrid memory support
- Improved multi-modal support (mtmd framework)
- New model architectures (Gemma3, Llama4, Qwen3, etc.)
- GPU backend improvements for CUDA, Metal, and ROCm
- Updated quantization support and GGUF format handling

### Documentation
- Updated CLAUDE.md with comprehensive build instructions
- Documented toolchain constraints and CPU architecture trade-offs
- Removed outdated CI/CD workflows (tesla-k80-*.yml)
- Cleaned up temporary development artifacts

## Rationale

This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements. The toolchain constraint creates a deadlock:
- K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI

We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 14:03:05 +08:00
Michael Yang	23125648b8	chore: update mllama to use ollama engine (#10637)	2025-05-13 17:36:02 -07:00
Jeffrey Morgan	f46df4e5d2	llama: fix defrag patch to defragment when no slots are available (#10695)	2025-05-13 14:02:08 -07:00
Jeffrey Morgan	4b903f088a	llama: fix crash on snowflake embedding model (#10690)	2025-05-13 13:11:11 -07:00
Jeffrey Morgan	0cefd46f23	llama: update to commit de4c07f93 (#10655)	2025-05-12 12:17:26 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363)

* Move quantization logic to GGML via new backend

  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

  This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Jeffrey Morgan	8dd12c873d	llama: update to commit e1e8e099 (#10513)	2025-05-01 18:24:09 -07:00
Jeffrey Morgan	e9e5f61c45	llama: update to commit 2016f07b (#10352)	2025-04-24 17:26:02 -07:00
Parth Sareen	a53d744b01	llama: remove model loading for grammar (#10096)	2025-04-24 11:51:19 -07:00
Jeffrey Morgan	943464ccb8	llama: update to commit 71e90e88 (#10192)	2025-04-16 15:14:01 -07:00
Bruce MacDonald	6bd0a983cd	model: support for mistral-small in the ollama runner

Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.	2025-04-03 16:57:36 -07:00
Patrick Devine	ef378ad673	gemma3 quantization (#9776)	2025-03-14 17:41:07 -07:00
Jeffrey Morgan	4289c74359	llama: fix kv loading on snowflake-arctic-embed models (#9536)	2025-03-07 09:25:34 -08:00
Jeffrey Morgan	98d44fa39d	llama: add phi4 mini support (#9403)	2025-02-27 19:30:32 -08:00
Jeffrey Morgan	d7d7e99662	llama: update llama.cpp vendor code to commit d7cfe1ff (#9356)	2025-02-26 20:34:44 -08:00
Michael Yang	548a9f56a6	Revert "cgo: use O3"

This reverts commit `bea1f1fac6`.	2025-01-31 10:25:39 -08:00
Michael Yang	bea1f1fac6	cgo: use O3	2025-01-30 12:21:50 -08:00
Michael Yang	dcfb7a105c	next build (#8539)

* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner
  This ensures the GPU specific environment variables are set properly
* explicitly set CXX compiler for HIP
* Update build_windows.ps1
  This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unnecessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-01-29 15:03:38 -08:00

19 Commits
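
The batch-size defaults and the 3.5× safety margin described in commit 92ba15bcb1 can be illustrated with a short Go sketch. This is not the code from llm/memory.go: the type and function names, the reading of each pair as batch/micro-batch, and the 512/512 fallback for unlisted architectures are assumptions made for illustration. Only the per-family numbers, the 3.5× factor, and the 916 MB estimate come from the commit message.

```go
package main

import "fmt"

// batchSizes is a hypothetical type. The commit lists pairs such as
// "2048/256" without naming the fields, so Batch/MicroBatch is an assumption.
type batchSizes struct {
	Batch      int
	MicroBatch int
}

// familyDefaults mirrors the per-architecture values quoted in the commit message.
var familyDefaults = map[string]batchSizes{
	"deepseek2": {Batch: 2048, MicroBatch: 256},
	"qwen2":     {Batch: 512, MicroBatch: 512},
	"llama":     {Batch: 512, MicroBatch: 512},
}

// fallback is an assumed default for architectures not listed above.
var fallback = batchSizes{Batch: 512, MicroBatch: 512}

// computeSafetyMargin is the 3.5x factor named in the commit message.
const computeSafetyMargin = 3.5

// batchFor returns the family default for arch, overriding Batch with the
// user's value when one was specified (userBatch > 0).
func batchFor(arch string, userBatch int) batchSizes {
	sizes, ok := familyDefaults[arch]
	if !ok {
		sizes = fallback
	}
	if userBatch > 0 {
		sizes.Batch = userBatch
	}
	return sizes
}

// paddedComputeBufferMB scales a formula-based compute buffer estimate
// by the safety margin.
func paddedComputeBufferMB(estimatedMB float64) float64 {
	return estimatedMB * computeSafetyMargin
}

func main() {
	sizes := batchFor("qwen2", 0)
	fmt.Printf("qwen2 batch sizes: %d/%d\n", sizes.Batch, sizes.MicroBatch)

	padded := paddedComputeBufferMB(916)
	fmt.Printf("padded compute buffer: %.0f MB (%.2f GiB)\n", padded, padded/1024)
}
```

With the 916 MB estimate quoted in the commit, the padded value is 916 × 3.5 = 3206 MB, or about 3.1 GiB, which appears consistent with the "3.1 GB per GPU" compute buffer figure reported in the results.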