2 Commits

Author SHA1 Message Date
Shang Chieh Tseng
d8ea75a3e2 Fix test-runner to inherit LD_LIBRARY_PATH for CUDA backend loading
The test-runner was starting the ollama server subprocess without inheriting
environment variables, causing the GGML CUDA backend to fail loading even
though LD_LIBRARY_PATH was set in the GitHub Actions workflow.

Changes:
- Set s.cmd.Env = os.Environ() so the ollama server subprocess inherits all
  environment variables from the test-runner process
- LD_LIBRARY_PATH is now passed through to the ollama server subprocess
- Fixes GPU offloading failure where layers were not being loaded to GPU

Root cause analysis from logs:
- GPUs were detected: Tesla K80 with 11.1 GiB available
- Server scheduled 35 layers for GPU offload
- But actual offload was 0/35 layers (all stayed on CPU)
- Runner subprocess couldn't find CUDA libraries without LD_LIBRARY_PATH
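
The scheduled-versus-actual mismatch above (0/35) is the kind of condition a log check can catch automatically. The sketch below assumes a log line containing "offloaded N/M layers to GPU"; that format, and the names parseOffload and checkOffload, are assumptions for illustration rather than the test-runner's real patterns.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// offloadRe matches an assumed log fragment such as
// "offloaded 0/35 layers to GPU".
var offloadRe = regexp.MustCompile(`offloaded (\d+)/(\d+) layers to GPU`)

// offload records how many layers reached the GPU out of the total.
type offload struct {
	Loaded int
	Total  int
}

// parseOffload extracts the offload counts from a log line.
// The boolean is false when the line does not contain the pattern.
func parseOffload(line string) (offload, bool) {
	m := offloadRe.FindStringSubmatch(line)
	if m == nil {
		return offload{}, false
	}
	loaded, err1 := strconv.Atoi(m[1])
	total, err2 := strconv.Atoi(m[2])
	if err1 != nil || err2 != nil {
		return offload{}, false
	}
	return offload{Loaded: loaded, Total: total}, true
}

// checkOffload reports an error when layers exist but none were placed
// on the GPU, which is the symptom described in the analysis above.
func checkOffload(o offload) error {
	if o.Total > 0 && o.Loaded == 0 {
		return fmt.Errorf("no layers offloaded to GPU (0/%d)", o.Total)
	}
	return nil
}
```

A stricter check could also flag partial offload, but whether that counts as a failure depends on the model size and available GPU memory.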

This fix ensures the runner subprocess can dynamically load libggml-cuda.so
by inheriting the CUDA library paths from the parent process.
2025-10-30 14:08:24 +08:00
Shang Chieh Tseng
d59284d30a Implement Go-based test runner framework for Tesla K80 testing
Add comprehensive test orchestration framework:

Test Runner (cmd/test-runner/):
- config.go: YAML configuration loading and validation
- server.go: Ollama server lifecycle management (start/stop/health checks)
- monitor.go: Real-time log monitoring with pattern matching
- test.go: Model testing via Ollama API (pull, chat, validation)
- validate.go: Test result validation (GPU usage, response quality, log analysis)
- report.go: Structured reporting (JSON and Markdown formats)
- main.go: CLI interface with run/validate/list commands

Test Configurations (test/config/):
- models.yaml: Full test suite with quick/full/stress profiles
- quick.yaml: Fast smoke test with gemma2:2b

Updated Workflow:
- tesla-k80-tests.yml: Use test-runner instead of shell scripts
- Run quick tests first, then full tests if passing
- Generate structured JSON reports for pass/fail checking
- Upload test results as artifacts
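
To show how a structured JSON report can drive pass/fail checking, the sketch below reads a report and lists the failing models. The report schema (a "results" array with "model" and "passed" fields) and the name failedModels are assumptions for illustration; the actual format written by report.go is not shown in this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// testResult is one entry in the assumed report schema.
type testResult struct {
	Model  string `json:"model"`
	Passed bool   `json:"passed"`
}

// report is the assumed top-level report structure.
type report struct {
	Results []testResult `json:"results"`
}

// failedModels parses a JSON report and returns the names of models whose
// tests did not pass. An empty result means every listed test passed.
func failedModels(data []byte) ([]string, error) {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, fmt.Errorf("parse report: %w", err)
	}
	var failed []string
	for _, res := range r.Results {
		if !res.Passed {
			failed = append(failed, res.Model)
		}
	}
	return failed, nil
}
```

A workflow step could exit non-zero when the returned slice is non-empty, which replaces text scraping of shell output with a check on typed fields.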

Features:
- Multi-model testing with configurable profiles
- API-based testing (not CLI commands)
- Real-time log monitoring for GPU events and errors
- Automatic validation of GPU loading and response quality
- Structured JSON and Markdown reports
- Graceful server lifecycle management
- Interrupt handling (Ctrl+C cleanup)

Addresses limitations of shell-based testing by providing:
- Better error handling and reporting
- Programmatic test orchestration
- Reusable test framework
- Clear pass/fail criteria
- Detailed test metrics and timing
2025-10-30 11:04:48 +08:00