Files
ollama37/llama
Shang Chieh Tseng 92ba15bcb1 Fix multi-GPU memory allocation for large models (deepseek-r1:14b)
This commit fixes the issue where large models (>10B parameters) fail to
load due to underestimated compute buffer memory requirements, causing
allocation failures when the model should use multiple GPUs.

Problem:
- deepseek-r1:14b (14B, qwen2 architecture) failed with "failed to allocate
  compute buffers" error
- System has 2×Tesla K80 GPUs (24GB total) but tried to fit 12GB model in
  1×11GB GPU
- Root cause: Memory estimation underestimated compute buffers by 3-4×
  (estimated 916 MB, actual requirement ~3-4 GB)

Solution:
1. Added model-family-specific batch size defaults (llm/memory.go)
   - Different architectures have different optimal batch sizes
   - deepseek2: 2048/256, qwen2: 512/512, llama: 512/512, etc.
   - Ensures accurate memory estimation based on architecture

2. Updated server to use architecture-specific batch sizes (llm/server.go)
   - Detects model architecture from GGUF metadata
   - Uses family defaults when user doesn't specify
   - Ensures consistency between estimation and allocation

3. Applied 3.5× safety margin to compute buffer estimates (llm/memory.go)
   - Accounts for temporary tensors not captured in GraphSize formulas
   - Conservative approach prevents allocation failures
   - Documented with detailed analysis of underestimation causes

4. Implemented measurement API for future use (llama-context.cpp, llama.go)
   - C++ function to measure actual memory requirements
   - Go wrapper for integration into GPU selection
   - Foundation for future measurement-based approach
   - Currently unused but documented for future improvement

Results:
- deepseek-r1:14b now loads successfully using both GPUs
- Proper distribution: 25 layers on GPU0, 24 layers on GPU1
- Total memory: 16.2 GB across 2×11 GB GPUs (8.4 + 7.8 GB)
- Compute buffers: 3.1 GB per GPU (with safety margin applied)
- All other models continue to work correctly

Comprehensive documentation added to all modified code explaining:
- Problem analysis with real examples
- Solution rationale and trade-offs
- Future improvement paths

Tested with: deepseek-r1:14b, deepseek-r1:8b, gemma3:4b, gpt-oss

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 14:13:29 +08:00
..

llama

This package provides Go bindings to llama.cpp.

Vendoring

Ollama vendors llama.cpp and ggml. While we generally strive to contribute changes back upstream to avoid drift, we carry a small set of patches which are applied to the tracking commit.

If you update the vendoring code, start by running the following command to establish the tracking llama.cpp repo in the ./vendor/ directory.

make -f Makefile.sync apply-patches

Updating Base Commit

Pin to new base commit

To change the base commit, update FETCH_HEAD in Makefile.sync.

When updating to a newer base commit, the existing patches may not apply cleanly and require manual merge resolution.

Start by applying the patches. If any of the patches have conflicts, the git am will stop at the first failure.

make -f Makefile.sync apply-patches

If there are conflicts, you will see an error message. Resolve the conflicts in ./vendor/, and continue the patch series with git am --continue and rerun make -f Makefile.sync apply-patches. Repeat until all patches are successfully applied.

Once all patches are applied, commit the changes to the tracking repository.

make -f Makefile.sync format-patches sync

Generating Patches

When working on new fixes or features that impact vendored code, use the following model. First get a clean tracking repo with all current patches applied:

make -f Makefile.sync clean apply-patches

Iterate until you're ready to submit PRs. Once your code is ready, commit a change in the ./vendor/ directory, then generate the patches for ollama with

make -f Makefile.sync format-patches

In your ./vendor/ directory, create a branch, and cherry-pick the new commit to that branch, then submit a PR upstream to llama.cpp.

Commit the changes in the ollama repo and submit a PR to Ollama, which will include the vendored code update with your change, along with the patches.

After your PR upstream is merged, follow the Updating Base Commit instructions above, however first remove your patch before running apply-patches since the new base commit contains your change already.