Add Claude AI-powered response validation and update test model

Changes:

1. Update quick test to use gemma3:4b (was gemma2:2b)
   - Increased timeout to 60s for larger model

2. Implement Claude headless validation (validate.go)
   - Hybrid approach: simple checks first, then Claude validation ALWAYS runs
   - Claude validates response quality, coherence, relevance
   - Detects gibberish, errors, and malformed responses
   - Falls back to simple validation if Claude CLI unavailable
   - Verbose logging shows Claude validation results

3. Validation flow:
   - Step 1: Fast checks (empty response, token count)
   - Step 2: Claude AI analysis (runs regardless of simple check)
   - Claude result overrides simple checks
   - If Claude unavailable, uses simple validation only

4. Workflow improvements:
   - Remove useless GPU memory check step (server already stopped)
   - Cleaner workflow output

Benefits:
- Intelligent response quality validation
- Catches subtle issues (gibberish, off-topic responses)
- Better than hardcoded pattern matching
- Graceful degradation when Claude unavailable
2025-12-12 08:47:01 +00:00 · 2025-10-30 11:42:10 +08:00
parent d59284d30a
commit 4de7dd453b
4 changed files with 148 additions and 27 deletions
--- a/cmd/test-runner/main.go
+++ b/cmd/test-runner/main.go
@@ -148,7 +148,7 @@ func runTests(configPath, profileName, ollamaBin, outputPath string, verbose, ke
 	// Run tests
 	startTime := time.Now()
 	tester := NewModelTester(server.BaseURL())
-	validator := NewValidator(config.Validation, monitor)
+	validator := NewValidator(config.Validation, monitor, verbose)

 	results := make([]TestResult, 0, len(profile.Models))