Commit Graph

47 Commits

Author SHA1 Message Date
Shang Chieh Tseng
92ba15bcb1 Fix multi-GPU memory allocation for large models (deepseek-r1:14b)
This commit fixes the issue where large models (>10B parameters) fail to
load due to underestimated compute buffer memory requirements, causing
allocation failures when the model should use multiple GPUs.

Problem:
- deepseek-r1:14b (14B, qwen2 architecture) failed with "failed to allocate
  compute buffers" error
- System has 2×Tesla K80 GPUs (24GB total) but tried to fit 12GB model in
  1×11GB GPU
- Root cause: Memory estimation underestimated compute buffers by 3-4×
  (estimated 916 MB, actual requirement ~3-4 GB)

Solution:
1. Added model-family-specific batch size defaults (llm/memory.go)
   - Different architectures have different optimal batch sizes
   - deepseek2: 2048/256, qwen2: 512/512, llama: 512/512, etc.
   - Ensures accurate memory estimation based on architecture

2. Updated server to use architecture-specific batch sizes (llm/server.go)
   - Detects model architecture from GGUF metadata
   - Uses family defaults when user doesn't specify
   - Ensures consistency between estimation and allocation

3. Applied 3.5× safety margin to compute buffer estimates (llm/memory.go)
   - Accounts for temporary tensors not captured in GraphSize formulas
   - Conservative approach prevents allocation failures
   - Documented with detailed analysis of underestimation causes

4. Implemented measurement API for future use (llama-context.cpp, llama.go)
   - C++ function to measure actual memory requirements
   - Go wrapper for integration into GPU selection
   - Foundation for future measurement-based approach
   - Currently unused but documented for future improvement

Results:
- deepseek-r1:14b now loads successfully using both GPUs
- Proper distribution: 25 layers on GPU0, 24 layers on GPU1
- Total memory: 16.2 GB across 2×11 GB GPUs (8.4 + 7.8 GB)
- Compute buffers: 3.1 GB per GPU (with safety margin applied)
- All other models continue to work correctly

Comprehensive documentation added to all modified code explaining:
- Problem analysis with real examples
- Solution rationale and trade-offs
- Future improvement paths

Tested with: deepseek-r1:14b, deepseek-r1:8b, gemma3:4b, gpt-oss

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 14:13:29 +08:00
Shang Chieh Tseng
ef14fb5b26 Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support
This commit represents a complete rework after pulling the latest changes from
official ollama/ollama repository and re-applying Tesla K80 compatibility patches.

## Key Changes

### CUDA Compute Capability 3.7 Support (Tesla K80)
- Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt
- Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset
- Using 37-virtual (PTX with JIT compilation) for maximum compatibility

### Legacy Toolchain Compatibility
- **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80)
- **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7)
- **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h)

### CPU Architecture Trade-offs
Due to GCC 10.5 limitation, sacrificed newer CPU optimizations:
- Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+)
- Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA
- Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility)

### Build System Updates
- Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7
- Added -Wno-deprecated-gpu-targets flag to suppress warnings
- Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI

### Upstream Sync
Merged latest llama.cpp changes including:
- Enhanced KV cache management with ISWA and hybrid memory support
- Improved multi-modal support (mtmd framework)
- New model architectures (Gemma3, Llama4, Qwen3, etc.)
- GPU backend improvements for CUDA, Metal, and ROCm
- Updated quantization support and GGUF format handling

### Documentation
- Updated CLAUDE.md with comprehensive build instructions
- Documented toolchain constraints and CPU architecture trade-offs
- Removed outdated CI/CD workflows (tesla-k80-*.yml)
- Cleaned up temporary development artifacts

## Rationale

This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in
official Ollama due to legacy driver/CUDA requirements. The toolchain constraint
creates a deadlock:
- K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI

We accept the loss of cutting-edge CPU optimizations to enable running modern
LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 14:03:05 +08:00
Shang Chieh Tseng
fabe2c5cb7 Revert Phase 1 memory optimization to fix multi-GPU stability
Problem: Phase 1 optimization (190 MiB for secondary GPUs) caused OOM
errors on large multi-GPU models due to insufficient runtime buffer:
- gemma3:27b: Estimated 10.9 GiB, used 10.8 GiB → only 400 MiB free
- Failed when allocating 6 MiB for KV cache during graph reservation
- Root cause: 190 MiB didn't account for runtime allocations

Investigation: Studied upstream Ollama code (upstream/main:llm/memory.go)
and confirmed official behavior allocates FULL graph to ALL GPUs with
layers, not reduced allocation for secondary GPUs.

Solution: Reverted llm/memory.go to upstream behavior:
- Removed gpuGraphAllocations map and per-GPU logic
- Restored original round-robin layer distribution (layerCount%j)
- All GPUs with layers now get full graph allocation
- Matches official Ollama for maximum stability

Results with revert:
- gemma3:27b:  Works correctly with 31/31 layer split
- Memory allocation: [10.0 GiB, 9.8 GiB] with proper headroom
- nvidia-smi: GPU0 8.7 GiB, GPU1 8.7 GiB (even distribution)
- Graph allocation: Both GPUs get 300 MiB (actual, not estimate)

Trade-offs:
-  gemma3:12b will use 2 GPUs instead of trying single-GPU (stable)
-  Large models (27b+) work reliably with proper buffer
-  Matches upstream behavior (easier to maintain)
-  Conservative estimates prevent OOM errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 19:10:23 +08:00
Shang Chieh Tseng
d002de9af4 Fix multi-GPU OOM errors by disabling Phase 2 graph correction
Problem: The Phase 2 CC 3.7 graph correction (85% reduction) was being
applied unconditionally to all models, causing multi-GPU models like
gemma3:27b and gpt-oss:20b to fail with "cudaMalloc failed: out of memory"
errors on secondary GPUs.

Root Cause: The 85% correction made the allocator think large models
could fit on a single GPU, but then failed when trying to allocate even
small amounts (16 MiB) on GPU 1 because the memory estimate was too low.

Solution: Disabled Phase 2 correction factor in llm/memory.go:173-182.
Phase 1 optimization (per-GPU graph allocation with 190 MiB for secondary
GPUs) is sufficient and correctly handles both single-GPU and multi-GPU
scenarios without causing OOM errors.

Impact:
- gemma3:4b: Still runs on single GPU 
- gemma3:12b: May split across GPUs (acceptable trade-off) 
- gemma3:27b: Now works with multi-GPU split 
- gpt-oss:20b: Now works with multi-GPU split 

Files Modified:
- llm/memory.go: Commented out Phase 2 correction factor
- CLAUDE.md: Updated Phase 2 section with new status and lessons learned

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 18:15:46 +08:00
Shang Chieh Tseng
6d87524e22 Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing:  gemma3:12b loads on single GPU with correct inference

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:15:59 +08:00
Shang Chieh Tseng
241a03402e Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
 gemma3:4b: Single GPU (no split, was already working)
 gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 19:58:20 +08:00
Shang Chieh Tseng
cbcbc9ae07 Add support for new models and fix GitHub issues
- Add Gemma3n model support with text generation capabilities
- Add new CUDA mean operations for improved performance
- Add macOS documentation and performance tests
- Update LLAMA patches for ROCm/CUDA compatibility
- Fix various model conversion and processing issues
- Update CI workflows and build configurations
- Add library model tests and Shakespeare test data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-20 00:12:36 +08:00
Jesse Gross
3fe74fba42 llm: Use first layer as memory buffer in estimation
This is a partial revert of 0478d44 "Fixed over vram allcation dure to
small initial layer sizes."

Previously we used the size of the first layer as an extra reserved
amount of space to buffer our memory estimates. The above commit
changed this to use the largest layer. However, this had performance
impacts on more models than the original commit was trying to fix.

There is just a heuristic without an ideal solution so this goes back
to the historic behavior.

Fixes: #10765, #10756, #10752, #10726
2025-05-19 14:03:34 -07:00
Jesse Gross
94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Jesse Gross
d755577473 llm: Estimate projector memory correctly for Ollama engine
The Llama engine always places vision projectors on the first GPU
if one exists. However, the Ollama engine groups it with the output
layer, which means the projector is only offloaded if all other layers
are offloaded. The memory estimation code always assumes the former
layout - this changes it to use the correct layout based on the engine.

This addresses two impacts of the current behavior:
 - In multi-GPU setups, we can crash with OOM errors when we try to
   allocate memory on a full GPU while another still has space.
 - If the vision projector is large, it may prevent us from offloading
   anything when we could have fit some of the text layers.
2025-05-19 09:52:48 -07:00
Jesse Gross
a2cc8571c5 llm: Consistently track unassigned model data
In some cases, if we fail to assign a piece of the model to a GPU then
we lose track of this data. Although it doesn't change the memory
allocation, it does affect the total size of the model reported by
tools such as ollama ps (and also the percent offloaded).

This makes it look like setting num_gpu isn't reflected in ollama ps,
which isn't true but the offloading percent may appear to not change.

Spreading the model across more GPUs will continue to impact the
reported total size of the model.
2025-05-19 09:52:48 -07:00
Michael Yang
23125648b8 chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
tej
0478d440f0 Fixed over vram allcation dure to small initial layer sizes.
Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmailcom>
2025-05-13 16:42:39 -07:00
Michael Yang
340448d2d1 explicitly decode maxarraysize 1024 2025-04-25 16:59:01 -07:00
Jesse Gross
f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
Jesse Gross
f4f0992b6e llm: Fix debug logging for memory estimates 2025-03-26 13:16:03 -07:00
Michael Yang
033cec232a count gemma3 vision tensors 2025-03-13 16:34:42 -07:00
Daniel Hiltgen
1fdb351c37 New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Michael Yang
58245413f4 next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00
frob
63269668c0 Prevent underflow when FreeMemory < overhead (#8014)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2024-12-10 09:10:40 -08:00
Sam
539be43640 llm: normalise kvct parameter handling (#7926) 2024-12-03 16:30:40 -08:00
Sam
1bdab9fdb1 llm: introduce k/v context quantization (vRAM improvements) (#6279) 2024-12-03 15:57:19 -08:00
Michael Yang
d07cf41a97 refactor kv estimation 2024-11-01 16:23:55 -07:00
Patrick Devine
c7cb0f0602 image processing for llama3.2 (#6963)
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
2024-10-18 16:12:35 -07:00
Daniel Hiltgen
05cd82ef94 Rename gpu package discover (#7143)
Cleaning up go package naming
2024-10-16 17:45:00 -07:00
Daniel Hiltgen
56318fb365 Improve logging on GPU too small (#6666)
When we determine a GPU is too small for any layers, it's not always clear why.
This will help troubleshoot those scenarios.
2024-09-06 08:29:36 -07:00
Daniel Hiltgen
b05c9e83d9 Introduce GPU Overhead env var (#5922)
Provide a mechanism for users to set aside an amount of VRAM on each GPU
to make room for other applications they want to start after Ollama, or workaround
memory prediction bugs
2024-09-05 13:46:35 -07:00
Michael Yang
8e0641a9bf handle asymmetric embedding KVs 2024-06-20 09:57:27 -07:00
Daniel Hiltgen
359b15a597 Handle models with divergent layer sizes
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
2024-06-18 11:05:34 -07:00
Daniel Hiltgen
7784ca33ce Tighten up memory prediction logging
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen
17df6520c8 Remove mmap related output calc logic 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6f351bf586 review comments and coverage 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6fd04ca922 Improve multi-gpu handling at the limit
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
2024-06-14 14:51:40 -07:00
Michael Yang
6297f85606 gofmt, goimports 2024-06-04 13:20:24 -07:00
Michael Yang
e40145a39d lint 2024-06-04 11:13:30 -07:00
Patrick Devine
4cc3be3035 Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
Michael Yang
1d359e737e typo 2024-05-13 14:18:34 -07:00
Michael Yang
50b9056e09 count memory up to NumGPU 2024-05-13 14:13:10 -07:00
Jeffrey Morgan
bb6fd02298 Don't clamp ctx size in PredictServerFit (#4317)
* dont clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
Daniel Hiltgen
bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Michael Yang
4736391bfb llm: add minimum based on layer size 2024-05-06 17:04:19 -07:00
Daniel Hiltgen
f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
2024-05-05 16:49:50 -07:00
Jeffrey Morgan
f0c454ab57 gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) 2024-05-01 11:46:03 -04:00
Michael Yang
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
Michael Yang
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
Daniel Hiltgen
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround.  Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00