Add timing instrumentation and user progress messages for model loading

Problem: Model loading takes 2-3 minutes on first load with no user feedback, causing confusion about whether the system is frozen or working.

Root Cause: GPU initialization (reserveWorstCaseGraph) takes ~164 seconds on Tesla K80 GPUs due to CUDA kernel compilation (PTX JIT for compute 3.7). This is by design - it validates GPU compatibility before committing to full load.

Solution:
1. Add comprehensive timing instrumentation to identify bottlenecks
2. Add user-facing progress messages explaining the delay

Changes:
- cmd/cmd.go: Update spinner with informative message for users
- llama/llama.go: Add timing logs for CGO model loading
- runner/llamarunner/runner.go: Add detailed timing for llama runner
- runner/ollamarunner/runner.go: Add timing + stderr messages for new engine
- server/sched.go: Add timing for scheduler load operation

User Experience:
Before: Silent wait with blinking cursor for 2-3 minutes
After: Rotating spinner with message "loading model (may take 1-3 min on first load)"

Performance Metrics Captured:
- GGUF file reading: ~0.4s
- GPU kernel compilation: ~164s (bottleneck identified)
- Model weight loading: ~0.002s
- Total end-to-end: ~165s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-10 07:46:59 +00:00 · 2025-11-12 19:09:37 +08:00
parent 84210db18a
commit 68f9b1580e
5 changed files with 63 additions and 6 deletions
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -272,7 +272,9 @@ func loadOrUnloadModel(cmd *cobra.Command, opts *runOptions) error {
 	p := progress.NewProgress(os.Stderr)
 	defer p.StopAndClear()

-	spinner := progress.NewSpinner("")
+	// Show a message explaining potential delays on first load
+	// For older GPUs (Tesla K80, etc.), GPU initialization can take 1-3 minutes
+	spinner := progress.NewSpinner("loading model (may take 1-3 min on first load)")
 	p.Add("", spinner)

 	client, err := api.ClientFromEnvironment()