Rewrite CICD.md to focus on design and philosophy

Replace command-heavy documentation with conceptual explanation:
- Project goal and infrastructure rationale
- Test framework philosophy (exit codes lie, logs tell truth)
- Dual-judge architecture design
- Log collection problem and solution
- Test execution flow
- Model unload strategy
- Design decisions with reasoning
- Known limitations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Author: Shang Chieh Tseng
Date: 2025-12-17 17:58:41 +08:00
Parent: bf2c321626
Commit: e80f226507

# CI/CD Pipeline for Ollama37

## Project Goal

Enable Ollama to run on Tesla K80 GPUs (CUDA compute capability 3.7), which are no longer supported by mainstream Ollama builds. This requires custom compilation with CUDA 11.4 and legacy CUBLAS function fallbacks.

## Infrastructure

The CI/CD pipeline runs on a self-hosted GitHub Actions runner with Tesla K80 hardware. This is necessary because:

1. K80 requires specific NVIDIA driver (470.x) and CUDA version (11.4)
2. Cloud runners don't provide legacy GPU hardware
3. Real GPU testing is essential - emulation cannot catch CUBLAS compatibility issues
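
A quick way to sanity-check the runner before a pipeline run, using standard NVIDIA and Docker tooling (the exact log line format is Ollama's and may vary between versions):

```bash
# Check the driver on the runner host (470.x is required for the K80).
nvidia-smi --query-gpu=name,driver_version --format=csv

# Bring up the container and confirm Ollama detected the GPU rather than
# falling back to CPU (the "id=cpu library=cpu" pattern described later
# in this document would indicate failed detection).
cd docker && docker compose up -d
docker compose logs | grep -E "library=(cuda|cpu)"
```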

## Build Strategy

The build uses a two-stage Docker approach:

**Stage 1 (Builder)**: A cached base image containing the complete toolchain - Rocky Linux 8, CUDA 11.4, GCC 10 (compiled from source due to RHEL limitations), CMake, and Go. This image takes ~90 minutes to build but is reused across all builds.

**Stage 2 (Runtime)**: Built on each commit, this stage clones the source, compiles with K80-specific CMake presets, and packages the binary. Build time is ~10 minutes.

This separation means routine builds are fast while toolchain changes are rare and cached.
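
The two stages correspond to separate Makefile targets under `docker/` (target names as used by the project's build tooling; shown here as a usage sketch):

```bash
cd docker

# Stage 1: build the cached toolchain image (slow, ~90 minutes, rarely needed).
make build-builder

# Stage 2: build the runtime image from the current source (~10 minutes).
make build-runtime
```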

## Test Framework Design

### Philosophy

The test framework is designed around two key insights:

1. **Exit codes lie**: A CUDA operation can fail silently, returning success while producing garbage output. Traditional pass/fail based on exit codes misses these failures.

2. **Logs tell truth**: The real story is in the execution logs - CUBLAS errors, memory allocation failures, and GPU fallback warnings appear there even when commands "succeed".
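
A concrete illustration of both points, using a command from the inference tests: the command can exit 0 while the interesting failure only shows up in the collected server logs.

```bash
# "Exit codes lie": the run can report success...
docker exec ollama37 ollama run gemma3:4b "What is 2+2?"
echo "exit code: $?"          # often 0 even when the GPU path misbehaved

# "Logs tell truth": the collected logs carry the real signal.
grep -E "CUBLAS_STATUS_|CUDA error" "/tmp/test-${TEST_ID}-logs.txt" && echo "hidden GPU failure"
```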

### Judge System

The framework implements a dual-judge architecture:

**Simple Judge**: Fast, deterministic verification based on exit codes. Catches obvious failures like command not found, timeout, or explicit error exits.

**LLM Judge**: Semantic analysis of test execution logs using a language model. The judge receives test criteria and logs, then evaluates whether the actual behavior matches expected behavior. This catches:
- CUDA errors that don't cause exit failures
- Subtle GPU fallback to CPU mode
- Memory allocation warnings
- Incorrect but non-crashing output
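
Conceptually, the LLM judge is a prompt built from the test's criteria plus its captured logs, sent to a separate judge instance. The real implementation lives in `tests/src/judge.ts`; the hand-rolled sketch below only illustrates the shape of the request (the criteria wording is invented; the judge URL and model are the documented defaults):

```bash
# Hypothetical hand-rolled judge call: criteria + logs in, PASS/FAIL verdict out.
LOGS=$(cat "/tmp/test-${TEST_ID}-logs.txt")
CRITERIA="The answer should be 4 and the logs must contain no CUDA or CUBLAS errors."
jq -n --arg criteria "$CRITERIA" --arg logs "$LOGS" \
  '{model: "gemma3:12b", stream: false,
    prompt: ("You are a test judge.\nCriteria: " + $criteria + "\nLogs:\n" + $logs + "\nReply PASS or FAIL with one reason.")}' \
  | curl -s http://localhost:11435/api/generate -d @-
```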

**Dual Mode** (default): Both judges must pass. This combines the speed of simple checking with the depth of semantic analysis.
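
At the CLI level, the judge mode is selected with the test runner's flags (defaults: dual judging with `gemma3:12b`):

```bash
cd tests

# Default: dual judging (simple + LLM) with the default judge model.
npm run dev -- run --dual-judge --judge-model gemma3:12b

# Fall back to exit-code checking only, e.g. when no judge instance is available.
npm run dev -- run --no-llm
```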

### Log Collection

A critical problem with container log analysis is temporal precision. Using `docker compose logs --since=5m` creates issues:
- Logs from previous tests contaminate current test analysis
- Long-running tests may exceed the time window
- Fast tests include irrelevant historical logs

The LogCollector solves this by running `docker compose logs --follow` as a background process throughout test execution. It maintains markers for each test's start and end, then extracts precisely the logs generated during that specific test. Each test step receives only its own logs for analysis.
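
The actual collector is a small TypeScript class (`tests/src/log-collector.ts`); the bash sketch below only shows the idea of one long-lived stream plus per-test extraction (file names other than the `/tmp/test-<id>-logs.txt` output are invented):

```bash
# One long-lived stream for the whole run, instead of repeated --since queries.
cd docker && docker compose logs --follow --timestamps > /tmp/ollama37-stream.log &
COLLECTOR_PID=$!

# Per test: remember where this test's output starts...
START_LINE=$(wc -l < /tmp/ollama37-stream.log)
# ... run the test's steps here ...
# ...then hand the steps exactly the lines produced during this test.
tail -n +"$((START_LINE + 1))" /tmp/ollama37-stream.log > "/tmp/test-${TEST_ID}-logs.txt"

kill "$COLLECTOR_PID"
```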

### Test Execution Flow

1. **Load Phase**: YAML test definitions are parsed and sorted by dependency order
2. **Collection Start**: LogCollector begins streaming container logs
3. **Execution Phase**: Tests run sequentially, each step receiving the current test ID via environment
4. **Log Capture**: Before each step, accumulated logs are written to a test-specific file
5. **Judgment Phase**: Both judges evaluate results - simple checks exit codes, LLM analyzes logs
6. **Cleanup**: Models are unloaded from VRAM, log collector stops

### Test Architecture

Tests are organized into three suites that must run in order:

**Build Suite**: Verifies Docker images exist and are correctly configured. No GPU required.

**Runtime Suite**: Starts the container and verifies GPU detection. Checks that Ollama recognizes K80 hardware and loads CUDA libraries. Critical validation that the driver/toolkit/container integration works.

**Inference Suite**: Actually runs models of increasing size. The 4B model tests basic functionality, 12B tests single-GPU capacity, and 27B tests multi-GPU layer splitting. Each model size unloads after testing to free VRAM for the next.
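
The suites map directly onto runner invocations, which is also how they are run locally (commands as documented for the test runner):

```bash
cd tests && npm ci

# Suites in dependency order.
npm run dev -- run --suite build
npm run dev -- run --suite runtime
npm run dev -- run --suite inference

# Re-run a single test case by ID.
npm run dev -- run --id TC-INFERENCE-002
```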

### Model Unload Strategy

K80 has limited VRAM (12GB per GPU). The framework explicitly unloads each model after its tests complete, rather than relying on automatic eviction. This ensures:
- Predictable VRAM state between tests
- No interference from cached models
- Clean baseline for each model size test

Workflow-level cleanup (`if: always()`) provides a safety net if individual test unloads fail.
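
One way to perform the explicit unload, assuming the standard Ollama API's `keep_alive` parameter applies to this fork (a sketch, not necessarily the framework's exact mechanism):

```bash
# Setting keep_alive to 0 asks the server to evict the model from VRAM immediately.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gemma3:4b", "keep_alive": 0}'

# Confirm nothing is left resident before the next model size is tested.
docker exec ollama37 ollama ps
```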

## Error Detection

The framework specifically watches for K80-related failure patterns:
- `CUBLAS_STATUS_*` errors indicate the legacy CUBLAS fallback isn't working
- `CUDA error` messages suggest driver/toolkit mismatch
- `cudaMalloc failed` indicates VRAM exhaustion
- `id=cpu library=cpu` means GPU detection failed entirely

These patterns are checked by both the simple judge (via grep in test steps) and the LLM judge (via semantic log analysis).
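
On the simple-judge side this is an ordinary shell check inside a test step, following the pattern used by the inference test cases (scoped log file first, recent compose logs as a fallback):

```bash
# Prefer the precisely scoped log file written by the LogCollector;
# fall back to recent compose logs if it is missing.
if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
  LOGS=$(cat "/tmp/test-${TEST_ID}-logs.txt")
else
  LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
fi

# Fail the step on K80-specific error signatures.
if echo "$LOGS" | grep -qE "CUBLAS_STATUS_|CUDA error|cudaMalloc failed|id=cpu library=cpu"; then
  echo "FAIL: GPU-level error detected in logs"
  exit 1
fi
```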

## Design Decisions

**Why YAML test cases?** Declarative test definitions separate test logic from execution machinery. Adding a new test requires no code changes.

**Why LLM judging?** Traditional test assertions require anticipating every failure mode. LLM evaluation can recognize novel failures and evaluate fuzzy criteria like "response should mention Paris".
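
Both decisions show up in what it takes to add a test: a single YAML file, whose `criteria` block is exactly the kind of fuzzy statement the LLM judge can evaluate. The test below is hypothetical (ID, file name, and prompt invented for illustration; field names follow the existing test cases):

```bash
# Hypothetical new test case - no runner code changes required.
cat > tests/testcases/inference/TC-INFERENCE-XXX.yaml <<'EOF'
id: TC-INFERENCE-XXX
name: Capital City Question
suite: inference
timeout: 180000
dependencies:
  - TC-INFERENCE-001
steps:
  - name: Ask a simple geography question
    command: docker exec ollama37 ollama run gemma3:4b "What is the capital of France?"
criteria: |
  Expected:
  - Response should mention Paris
  - NO CUBLAS_STATUS_ errors
  - NO CUDA errors
EOF
```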

**Why sequential execution?** Log collection with precise boundaries requires knowing which test is running. Parallel execution would interleave logs unpredictably.

**Why Docker-based builds?** Reproducibility. The exact toolchain that works is captured in the builder image. No "works on my machine" issues.

**Why self-hosted runners?** K80 hardware. No cloud provider offers compute capability 3.7 GPUs for CI/CD.

## Limitations

- Tests must run sequentially for accurate log collection
- LLM judge requires a working Ollama instance (chicken-and-egg for broken builds)
- K80 VRAM limits restrict maximum model size to ~27B parameters
- Build times are significant due to CUDA compilation