Fix multi-GPU memory allocation for large models (deepseek-r1:14b)

This commit fixes an issue where large models (>10B parameters) fail to
load because compute buffer memory requirements are underestimated,
causing allocation failures when the model should be split across
multiple GPUs.

Problem:
- deepseek-r1:14b (14B parameters, qwen2 architecture) failed with a
  "failed to allocate compute buffers" error
- The system has 2× Tesla K80 GPUs (24 GB total, ~11 GB usable per GPU),
  but the loader tried to fit the ~12 GB model onto a single 11 GB GPU
- Root cause: memory estimation underestimated compute buffers by 3-4×
  (estimated 916 MB, actual requirement ~3-4 GB)

Solution:
1. Added model-family-specific batch size defaults (llm/memory.go)
   - Different architectures have different optimal batch sizes
   - deepseek2: 2048/256, qwen2: 512/512, llama: 512/512, etc.
   - Ensures accurate memory estimation based on architecture (see the
     sketch after this list)

2. Updated server to use architecture-specific batch sizes (llm/server.go)
   - Detects model architecture from GGUF metadata
   - Uses family defaults when user doesn't specify
   - Ensures consistency between estimation and allocation

3. Applied 3.5× safety margin to compute buffer estimates (llm/memory.go)
   - Accounts for temporary tensors not captured in GraphSize formulas
   - Conservative approach prevents allocation failures
   - Documented with detailed analysis of the underestimation causes;
     the margin is also illustrated in the sketch after this list

4. Implemented measurement API for future use (llama-context.cpp, llama.go)
   - C++ function to measure actual memory requirements
   - Go wrapper for integration into GPU selection
   - Foundation for future measurement-based approach
   - Currently unused but documented for future improvement
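
Illustrative Go sketch of items 1-3 above. Identifier names such as
familyBatchDefaults and computeBufferSafetyMargin are hypothetical, not
the exact code added to llm/memory.go and llm/server.go, and the value
pairs are read here as logical batch / physical micro-batch sizes:

    package llm

    // familyBatchDefaults maps a GGUF architecture name to its default
    // (batch, ubatch) sizes, mirroring the values listed above.
    var familyBatchDefaults = map[string][2]int{
        "deepseek2": {2048, 256},
        "qwen2":     {512, 512},
        "llama":     {512, 512},
    }

    // defaultBatchSizes returns architecture-specific defaults and falls
    // back to 512/512 for families without an explicit entry, so that
    // estimation and allocation work from the same numbers.
    func defaultBatchSizes(arch string) (batch, ubatch int) {
        if d, ok := familyBatchDefaults[arch]; ok {
            return d[0], d[1]
        }
        return 512, 512
    }

    // computeBufferSafetyMargin covers temporary tensors that the
    // GraphSize formulas do not capture. With the numbers above,
    // 916 MB * 3.5 is roughly 3.1-3.2 GB, in line with the observed
    // per-GPU compute buffers.
    const computeBufferSafetyMargin = 3.5

    // adjustedComputeBufferSize applies the safety margin to a raw
    // compute-buffer estimate in bytes.
    func adjustedComputeBufferSize(estimated uint64) uint64 {
        return uint64(float64(estimated) * computeBufferSafetyMargin)
    }

The 512/512 fallback keeps architectures without an explicit entry on
the previous behaviour, which is why other models continue to load as
before.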

Results:
- deepseek-r1:14b now loads successfully using both GPUs
- Proper distribution: 25 layers on GPU0, 24 layers on GPU1
- Total memory: 16.2 GB across 2×11 GB GPUs (8.4 + 7.8 GB)
- Compute buffers: 3.1 GB per GPU (with safety margin applied)
- All other models continue to work correctly

Comprehensive documentation added to all modified code explaining:
- Problem analysis with real examples
- Solution rationale and trade-offs
- Future improvement paths

Tested with: deepseek-r1:14b, deepseek-r1:8b, gemma3:4b, gpt-oss

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

commit 92ba15bcb1 (parent d948926581)
Author: Shang Chieh Tseng
Date:   2025-11-06 14:13:29 +08:00

5 changed files with 415 additions and 1 deletion

@@ -1370,6 +1370,28 @@ extern "C" {
     // print a breakdown of per-device memory use via LLAMA_LOG:
     LLAMA_API void llama_memory_breakdown_print(const struct llama_context * ctx);
 
+    // Memory measurement for GPU selection:
+    // This struct holds measured memory requirements per backend device.
+    // Used by the Go layer to select an appropriate GPU configuration before actual model loading.
+    struct llama_memory_measurement {
+        char   backend_name[128]; // Backend device name (e.g., "CUDA0", "CUDA1", "CPU")
+        size_t model_bytes;       // Model weights memory
+        size_t context_bytes;     // KV cache memory
+        size_t compute_bytes;     // Compute buffer memory (temp tensors during inference)
+        size_t total_bytes;       // Total memory requirement
+        bool   is_host;           // True if this is a host (CPU) backend
+    };
+
+    // Measure memory requirements without fully initializing a context.
+    // This allows the Go layer to make informed GPU selection decisions.
+    // Returns the number of backends and fills the measurements array (caller must allocate).
+    // If measurement fails, returns -1.
+    LLAMA_API int32_t llama_measure_memory_requirements(
+            struct llama_model * model,
+            struct llama_context_params params,
+            struct llama_memory_measurement * measurements,
+            int32_t max_measurements);
+
     //
     // training
     //
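
For reference, the Go wrapper mentioned in item 4 could look roughly
like the sketch below. This is not the actual code in llama.go: the
function name, the 16-backend cap, and the assumption that llama.h is
reachable through cgo are all illustrative.

    package llama

    /*
    #include "llama.h"
    */
    import "C"

    import "errors"

    // MemoryMeasurement mirrors the C struct llama_memory_measurement.
    type MemoryMeasurement struct {
        BackendName  string
        ModelBytes   uint64
        ContextBytes uint64
        ComputeBytes uint64
        TotalBytes   uint64
        IsHost       bool
    }

    // measureMemoryRequirements calls llama_measure_memory_requirements
    // and converts the per-backend results into Go values that the
    // scheduler could use for GPU selection.
    func measureMemoryRequirements(model *C.struct_llama_model, params C.struct_llama_context_params) ([]MemoryMeasurement, error) {
        const maxBackends = 16
        var raw [maxBackends]C.struct_llama_memory_measurement

        n := C.llama_measure_memory_requirements(model, params, &raw[0], C.int32_t(maxBackends))
        if n < 0 {
            return nil, errors.New("llama_measure_memory_requirements failed")
        }

        out := make([]MemoryMeasurement, 0, int(n))
        for i := 0; i < int(n); i++ {
            m := raw[i]
            out = append(out, MemoryMeasurement{
                BackendName:  C.GoString(&m.backend_name[0]),
                ModelBytes:   uint64(m.model_bytes),
                ContextBytes: uint64(m.context_bytes),
                ComputeBytes: uint64(m.compute_bytes),
                TotalBytes:   uint64(m.total_bytes),
                IsHost:       bool(m.is_host),
            })
        }
        return out, nil
    }

As noted above, the wrapper is currently unused; it only becomes useful
once GPU selection switches from formula-based estimates to measured
requirements.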