The log monitor was calling Reset() before each model test, which cleared
all GPU detection events recorded during server startup. This caused
validation to fail with 'GPU acceleration not detected' even though the
GPU was being used successfully.
Root cause: GPU detection logs are written during server startup
(lines like 'offloaded 35/35 layers to GPU'), but monitor.Reset() was
clearing these events before validation could check them.
Solution: Comment out the monitor.Reset() call to preserve GPU detection
events from server startup. These events are still relevant for validating
that the model is using GPU acceleration.
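For illustration, a minimal sketch of the test loop after the change; the Monitor type and loop are simplified stand-ins for the real code in monitor.go and test.go:

```go
package runner

import "log"

// Monitor is a stand-in for the real log monitor in monitor.go.
type Monitor struct{ events []string }

func (m *Monitor) Reset()           { m.events = m.events[:0] }
func (m *Monitor) Events() []string { return m.events }

func runAll(monitor *Monitor, models []string) {
	for _, model := range models {
		// monitor.Reset() // removed: this cleared the GPU detection
		// events ("offloaded 35/35 layers to GPU") logged during server
		// startup, so validation always reported zero GPU events.
		log.Printf("testing %s with %d log events retained",
			model, len(monitor.Events()))
	}
}
```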
The test-runner was starting the ollama server subprocess without inheriting
environment variables, causing the GGML CUDA backend to fail to load even
though LD_LIBRARY_PATH was set in the GitHub Actions workflow.
Changes:
- Added s.cmd.Env = os.Environ() to inherit all environment variables
- This ensures LD_LIBRARY_PATH is passed to the ollama server subprocess
- Fixes the GPU offloading failure where no layers were being loaded onto the GPU
Root cause analysis from logs:
- GPUs were detected: Tesla K80 with 11.1 GiB available
- Server scheduled 35 layers for GPU offload
- But the actual offload was 0/35 layers (all stayed on the CPU)
- Runner subprocess couldn't find CUDA libraries without LD_LIBRARY_PATH
This fix ensures the runner subprocess can dynamically load libggml-cuda.so
by inheriting the CUDA library paths from the parent process.
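A sketch of the fix in server.go, assuming the s.cmd field named in the commit; the surrounding types are illustrative:

```go
package runner

import (
	"os"
	"os/exec"
)

// Server is a stand-in for the lifecycle manager in server.go.
type Server struct{ cmd *exec.Cmd }

func (s *Server) Start() error {
	s.cmd = exec.Command("ollama", "serve")
	// Note: a nil cmd.Env already inherits the parent environment, but
	// once Env is assigned for any reason (e.g. to add OLLAMA_* vars) it
	// must carry everything the child needs. Starting from os.Environ()
	// keeps LD_LIBRARY_PATH intact so the runner subprocess can load
	// libggml-cuda.so.
	s.cmd.Env = os.Environ()
	s.cmd.Stdout = os.Stdout
	s.cmd.Stderr = os.Stderr
	return s.cmd.Start()
}
```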
The Claude AI validator was receiving detailed explanations with markdown
formatting (e.g., '**PASS**') instead of the expected simple format.
Updated the validation prompt to explicitly require responses to start
with either 'PASS' or 'FAIL: <reason>' without any additional formatting,
explanations, or markdown before the verdict.
This fixes the 'Warning: Unexpected Claude response format' error that
was causing valid test results to be incorrectly marked as unclear.
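An illustrative version of the stricter prompt suffix and the check that consumes it; the exact wording in the real validator may differ:

```go
package runner

import (
	"fmt"
	"strings"
)

const verdictFormat = `Respond with exactly one line. Start with either
"PASS" or "FAIL: <reason>". Do not add markdown, bold text, or any
explanation before the verdict.`

func parseVerdict(resp string) (bool, string, error) {
	resp = strings.TrimSpace(resp)
	switch {
	case strings.HasPrefix(resp, "PASS"):
		return true, "", nil
	case strings.HasPrefix(resp, "FAIL:"):
		return false, strings.TrimSpace(strings.TrimPrefix(resp, "FAIL:")), nil
	default:
		// This branch used to fire on answers like "**PASS** because...".
		return false, "", fmt.Errorf("unexpected Claude response format: %q", resp)
	}
}
```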
- Change temp directory from /tmp/test-runner-claude to .test-runner-temp
- Keeps temporary files within project bounds for Claude Code access
- Add .test-runner-temp to .gitignore to exclude from version control
- Fixes the Claude AI validation permission issue (directory handling sketched below)
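A sketch of the directory handling, with a hypothetical helper name:

```go
package runner

import (
	"os"
	"path/filepath"
)

// tempDir (hypothetical name) resolves the project-local scratch
// directory. Keeping it under the repository root instead of /tmp means
// Claude Code, whose file access is scoped to the project tree, can read
// the prompts and transcripts written there.
func tempDir(projectRoot string) (string, error) {
	dir := filepath.Join(projectRoot, ".test-runner-temp")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	return dir, nil
}
```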
- Function in main.go renamed from validateConfig to validateConfigFile
- Resolves the redeclaration error with validateConfig in config.go
- config.go has validateConfig(*Config) for internal validation
- main.go has validateConfigFile(string) for CLI command
- Rename the validateConfig flag variable to validateConfigPath
- Resolves a compilation error: validateConfig was both a *string variable and a function name
- Function calls now use the correct variable name (both renames sketched below)
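Putting both renames together, this is the shape of the resolution; loadConfig, the Config fields, and the flag registration string are stand-ins reconstructed from the commit text:

```go
package main

import (
	"flag"
	"log"
)

type Config struct{ Models []string }

func loadConfig(path string) (*Config, error) { return &Config{}, nil }

// config.go: validates the parsed struct (unchanged).
func validateConfig(cfg *Config) error { return nil }

// main.go: renamed from validateConfig so it no longer redeclares the
// function above.
func validateConfigFile(path string) error {
	cfg, err := loadConfig(path)
	if err != nil {
		return err
	}
	return validateConfig(cfg)
}

// The flag variable got a distinct name too, so the *string no longer
// collides with either function.
var validateConfigPath = flag.String("validate-config", "", "config file to validate")

func main() {
	flag.Parse()
	if *validateConfigPath != "" {
		if err := validateConfigFile(*validateConfigPath); err != nil {
			log.Fatal(err)
		}
	}
}
```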
Add comprehensive test orchestration framework:
Test Runner (cmd/test-runner/):
- config.go: YAML configuration loading and validation
- server.go: Ollama server lifecycle management (start/stop/health checks)
- monitor.go: Real-time log monitoring with pattern matching
- test.go: Model testing via Ollama API (pull, chat, validation)
- validate.go: Test result validation (GPU usage, response quality, log analysis)
- report.go: Structured reporting (JSON and Markdown formats)
- main.go: CLI interface with run/validate/list commands
Test Configurations (test/config/):
- models.yaml: Full test suite with quick/full/stress profiles
- quick.yaml: Fast smoke test with gemma2:2b (a hypothetical shape is sketched below)
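A hypothetical shape for quick.yaml; the real schema lives in config.go and the field names here are guesses:

```yaml
profile: quick
models:
  - name: gemma2:2b
    prompt: "Why is the sky blue?"
    expect:
      gpu_offload: required   # validation fails unless layers reach the GPU
      min_response_chars: 50
timeouts:
  pull: 10m
  chat: 2m
```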
Updated Workflow:
- tesla-k80-tests.yml: Use test-runner instead of shell scripts
- Run quick tests first, then the full suite only if they pass (steps sketched below)
- Generate structured JSON reports for pass/fail checking
- Upload test results as artifacts
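The corresponding workflow steps could look like this; the test-runner flags shown are illustrative, not confirmed:

```yaml
- name: Quick smoke test
  run: ./test-runner run --config test/config/quick.yaml --report results/quick.json

- name: Full test suite
  if: success()   # only when the quick suite passed
  run: ./test-runner run --config test/config/models.yaml --report results/full.json

- name: Upload test results
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: results/
```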
Features:
- Multi-model testing with configurable profiles
- API-based testing (not CLI commands)
- Real-time log monitoring for GPU events and errors
- Automatic validation of GPU loading and response quality
- Structured JSON and Markdown reports
- Graceful server lifecycle management
- Interrupt handling (Ctrl+C cleanup)
Addresses limitations of shell-based testing by providing:
- Better error handling and reporting
- Programmatic test orchestration
- Reusable test framework
- Clear pass/fail criteria
- Detailed test metrics and timing