Add Claude AI-powered response validation and update test model

Changes:
1. Update quick test to use gemma3:4b (was gemma2:2b)
   - Increase timeout to 60s to accommodate the larger model

2. Implement Claude headless validation (validate.go)
   - Hybrid approach: simple checks first, then Claude validation ALWAYS runs
   - Claude validates response quality, coherence, relevance
   - Detects gibberish, errors, and malformed responses
   - Falls back to simple validation if Claude CLI unavailable
   - Verbose logging shows Claude validation results

3. Validation flow:
   - Step 1: Fast checks (empty response, token count)
   - Step 2: Claude AI analysis (runs regardless of simple check)
   - Claude result overrides simple checks
   - If Claude unavailable, uses simple validation only
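
The validation flow above can be sketched roughly as follows. This is a minimal illustration, not the contents of validate.go: the names Verdict, AIValidator, simpleCheck and Validate are hypothetical, the token threshold is an assumption, and the Claude call is abstracted behind a function value so the override and fallback logic can be exercised without the CLI.

```go
package main

import (
	"errors"
	"strings"
)

// Verdict is a hypothetical result type; the real validate.go may differ.
type Verdict struct {
	OK     bool
	Source string // "simple" or "claude"
	Reason string
}

// ErrUnavailable is an assumed sentinel an AI hook can return when the
// Claude CLI cannot be found or executed.
var ErrUnavailable = errors.New("AI validator unavailable")

// AIValidator is an assumed hook that asks Claude to judge a response.
// Any non-nil error is treated as "Claude unavailable".
type AIValidator func(prompt, response string) (ok bool, reason string, err error)

// simpleCheck performs the fast checks named in step 1.
// The "tokens <= 0" threshold is an assumption for illustration.
func simpleCheck(response string, tokens int) (bool, string) {
	if strings.TrimSpace(response) == "" {
		return false, "empty response"
	}
	if tokens <= 0 {
		return false, "no tokens generated"
	}
	return true, "passed simple checks"
}

// Validate runs the simple checks, then always consults the AI hook.
// The AI verdict overrides the simple result; if the hook is nil or
// returns an error, the simple result is used instead.
func Validate(prompt, response string, tokens int, ai AIValidator) Verdict {
	ok, reason := simpleCheck(response, tokens)
	simple := Verdict{OK: ok, Source: "simple", Reason: reason}
	if ai == nil {
		return simple
	}
	aiOK, aiReason, err := ai(prompt, response)
	if err != nil {
		return simple // graceful degradation
	}
	return Verdict{OK: aiOK, Source: "claude", Reason: aiReason}
}
```

In this sketch a caller would pass a hook that shells out to the Claude CLI and returns ErrUnavailable when the binary is missing (for example, after a failed exec.LookPath lookup); passing nil yields simple validation only.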

4. Workflow improvements:
   - Remove GPU memory check step, which reported nothing useful because the server had already stopped
   - Cleaner workflow output

Benefits:
- Intelligent response quality validation
- Catches subtle issues (gibberish, off-topic responses)
- More robust than hardcoded pattern matching
- Graceful degradation when Claude unavailable
Shang Chieh Tseng
2025-10-30 11:42:10 +08:00
parent d59284d30a
commit 4de7dd453b
4 changed files with 148 additions and 27 deletions


@@ -85,9 +85,3 @@ jobs:
 test-report-full.md
 ollama.log
 retention-days: 14
-- name: Check GPU memory usage
-if: always()
-run: |
-echo "=== GPU Memory Status ==="
-nvidia-smi --query-gpu=memory.used,memory.total --format=csv