mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-10 07:46:59 +00:00
Improve GPU detection and add detailed model loading logs
1. Fix binary path resolution using symlink (docker/runtime/Dockerfile)
   - Build binary to source directory (./ollama)
   - Create symlink from /usr/local/bin/ollama to /usr/local/src/ollama37/ollama
   - Allows ml/path.go to resolve libraries via filepath.EvalSymlinks()
   - Fixes "total vram=0 B" issue without requiring -w flag

2. Add comprehensive logging for model loading phases (llm/server.go)
   - Log runner subprocess startup and readiness
   - Log each memory allocation phase (FIT, ALLOC, COMMIT)
   - Log layer allocation adjustments during convergence
   - Log when model weights are being loaded (slowest phase)
   - Log progress during waitUntilRunnerLaunched (every 1s)
   - Improves visibility during 1-2 minute first-time model loads

3. Fix flash attention compute capability check (ml/device.go)
   - Changed DriverMajor to ComputeMajor for correct capability detection
   - Flash attention requires compute capability >= 7.0, not driver version

These changes improve the user experience during model loading by providing clear feedback at each stage, especially during the slow COMMIT phase, where GGUF weights are loaded and CUDA kernels are compiled.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
```diff
@@ -431,7 +431,7 @@ func FlashAttentionSupported(l []DeviceInfo) bool {
 	for _, gpu := range l {
 		supportsFA := gpu.Library == "cpu" ||
 			gpu.Name == "Metal" || gpu.Library == "Metal" ||
-			(gpu.Library == "CUDA" && gpu.DriverMajor >= 7 && !(gpu.ComputeMajor == 7 && gpu.ComputeMinor == 2)) ||
+			(gpu.Library == "CUDA" && gpu.ComputeMajor >= 7 && !(gpu.ComputeMajor == 7 && gpu.ComputeMinor == 2)) ||
			gpu.Library == "ROCm"

		if !supportsFA {
```