mirror of https://github.com/dogkeeper886/ollama37.git
synced 2025-12-20 12:47:00 +00:00
Rewrite CICD.md to focus on design and philosophy
Replace command-heavy documentation with conceptual explanation:

- Project goal and infrastructure rationale
- Test framework philosophy (exit codes lie, logs tell truth)
- Dual-judge architecture design
- Log collection problem and solution
- Test execution flow
- Model unload strategy
- Design decisions with reasoning
- Known limitations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/CICD.md
# CI/CD Pipeline for Ollama37

This document describes the CI/CD pipeline for building and testing Ollama37 with Tesla K80 (CUDA compute capability 3.7) support.

## Project Goal

Enable Ollama to run on Tesla K80 GPUs (CUDA compute capability 3.7), which are no longer supported by mainstream Ollama builds. This requires custom compilation with CUDA 11.4 and legacy CUBLAS function fallbacks.

## Infrastructure Overview
```
┌──────────────────────────────────────────────────────────────────────┐
│                                GITHUB                                │
│                        dogkeeper886/ollama37                         │
│                                                                      │
│  Push to main ────────────────────────────────────────────────────┐ │
└────────────────────────────────────────────────────────────────────│┘
                                                                     │
                                                                     ▼
┌──────────────────────────────────────────────────────────────────────┐
│                              CI/CD NODE                              │
│                                                                      │
│  Hardware:                                                           │
│  - Tesla K80 GPU (compute capability 3.7)                            │
│  - NVIDIA Driver 470.x                                               │
│                                                                      │
│  Software:                                                           │
│  - Rocky Linux 9.7                                                   │
│  - Docker 29.1.3 + Docker Compose 5.0.0                              │
│  - NVIDIA Container Toolkit                                          │
│  - GitHub Actions Runner (self-hosted)                               │
└──────────────────────────────────────────────────────────────────────┘
```
The CI/CD pipeline runs on a self-hosted GitHub Actions runner with Tesla K80 hardware. This is necessary because:

1. The K80 requires a specific NVIDIA driver (470.x) and CUDA version (11.4)
2. Cloud runners don't provide legacy GPU hardware
3. Real GPU testing is essential: emulation cannot catch CUBLAS compatibility issues

## Build Strategy

The build uses a two-stage Docker approach:

**Stage 1 (Builder)**: A cached base image containing the complete toolchain: Rocky Linux 8, CUDA 11.4, GCC 10 (compiled from source due to RHEL limitations), CMake, and Go. This image takes ~90 minutes to build but is reused across all builds.

**Stage 2 (Runtime)**: Built on each commit, this stage clones the source, compiles with K80-specific CMake presets, and packages the binary. Build time is ~10 minutes.

This separation keeps routine builds fast: toolchain changes are rare, and the expensive builder image stays cached.

## Test Framework

### Test Runner CLI

The test runner lives in `tests/src/` and provides a CLI tool:

```bash
cd tests
npm run dev -- run [options]
```

**Commands:**

- `run` - Execute test cases
- `list` - List all available test cases

**Options:**

| Option | Default | Description |
|--------|---------|-------------|
| `-s, --suite <suite>` | all | Filter by suite (build, runtime, inference) |
| `-i, --id <id>` | - | Run a specific test by ID |
| `-w, --workers <n>` | 1 | Parallel worker count |
| `-d, --dry-run` | false | Preview without executing |
| `-o, --output <format>` | console | Output format: console, json, junit |
| `--no-llm` | false | Skip the LLM judge; check exit codes only |
| `--judge-model <model>` | gemma3:12b | Model used for LLM judging |
| `--dual-judge` | true | Run both the simple and LLM judges |
| `--ollama-url <url>` | localhost:11434 | Test-subject server |
| `--judge-url <url>` | localhost:11435 | Separate judge instance |
### Philosophy

The test framework is designed around two key insights:

1. **Exit codes lie**: A CUDA operation can fail silently, returning success while producing garbage output. Traditional pass/fail based on exit codes misses these failures.

2. **Logs tell the truth**: The real story is in the execution logs. CUBLAS errors, memory allocation failures, and GPU fallback warnings appear there even when commands "succeed".

### Judge System

The framework implements a dual-judge architecture and supports three judge modes:

| Mode | Flag | Description |
|------|------|-------------|
| **Simple** | `--no-llm` | Exit-code checking only (exit 0 = pass) |
| **LLM** | `--judge-model` | Semantic analysis of test logs using an LLM |
| **Dual** | `--dual-judge` | Both must pass (default) |

**Simple Judge**: Fast, deterministic verification based on exit codes. Catches obvious failures like command not found, timeout, or explicit error exits. Serves as the fallback when the LLM is unavailable.

**LLM Judge**: Semantic analysis of test execution logs using a configurable language model (default: gemma3:12b). The judge receives the test criteria and logs, then evaluates whether the actual behavior matches the expected behavior, batching tests for efficient judging. This catches:

- CUDA errors that don't cause exit failures
- Subtle GPU fallback to CPU mode
- Memory allocation warnings
- Incorrect but non-crashing output

**Dual Mode** (default): Both judges must pass. This combines the speed of simple checking with the depth of semantic analysis.
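To make the combination concrete, here is a minimal sketch of how a dual verdict could be assembled. The `/api/generate` endpoint and its `model`/`prompt`/`stream` fields are standard Ollama API; the function names, the `StepResult` shape, and the PASS/FAIL prompt are illustrative assumptions, not the actual `tests/src/judge.ts` interface:

```typescript
// Hypothetical sketch of dual-judge combination (not the actual judge.ts API).

interface StepResult {
  exitCode: number;
  logs: string;      // contents of /tmp/test-<id>-logs.txt
  criteria: string;  // the YAML `criteria` block
}

// Simple judge: deterministic exit-code check.
function simpleJudge(r: StepResult): boolean {
  return r.exitCode === 0;
}

// LLM judge: ask the judge model (on the separate --judge-url instance)
// whether the logs satisfy the criteria.
async function llmJudge(
  r: StepResult,
  judgeUrl = "http://localhost:11435",
): Promise<boolean> {
  const prompt = `Criteria:\n${r.criteria}\n\nLogs:\n${r.logs}\n\nAnswer PASS or FAIL.`;
  const res = await fetch(`${judgeUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gemma3:12b", prompt, stream: false }),
  });
  const { response } = (await res.json()) as { response: string };
  return response.trim().toUpperCase().startsWith("PASS");
}

// Dual mode: both judges must agree the test passed.
async function dualJudge(r: StepResult): Promise<boolean> {
  return simpleJudge(r) && (await llmJudge(r));
}
```

Running the judge on a second Ollama instance (port 11435) keeps the evaluator independent of the build under test.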
### Log Collection

A critical problem with container log analysis is temporal precision. Using `docker compose logs --since=5m` creates issues:

- Logs from previous tests contaminate the current test's analysis
- Long-running tests may exceed the time window
- Fast tests include irrelevant historical logs

The LogCollector class solves this by running `docker compose logs --follow` as a background process throughout test execution. It:

1. Marks each test's start and end boundaries
2. Extracts precisely the logs generated during that specific test
3. Writes test-specific logs to `/tmp/test-{testId}-logs.txt`

Test steps access their logs via:

```bash
LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
```

Each test step therefore receives only its own logs for analysis.
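The boundary marking can be pictured as offsets into a continuously growing buffer. A minimal sketch of that idea, assuming a buffer-offset approach; the real implementation is `tests/src/log-collector.ts` and may differ in shape:

```typescript
// Hypothetical LogCollector sketch (class shape and method names are
// illustrative; only the file paths follow the conventions above).
import { spawn, ChildProcess } from "node:child_process";
import { writeFileSync } from "node:fs";

class LogCollector {
  private proc: ChildProcess | null = null;
  private buffer = "";
  private marks = new Map<string, number>(); // testId -> offset at test start

  start(composeDir = "docker"): void {
    // Stream logs continuously so no time window is ever missed.
    this.proc = spawn("docker", ["compose", "logs", "--follow", "--no-color"], {
      cwd: composeDir,
    });
    this.proc.stdout?.on("data", (chunk: Buffer) => {
      this.buffer += chunk.toString();
    });
  }

  markStart(testId: string): void {
    // Everything appended after this offset belongs to this test.
    this.marks.set(testId, this.buffer.length);
  }

  capture(testId: string): string {
    const from = this.marks.get(testId) ?? 0;
    const logs = this.buffer.slice(from);
    writeFileSync(`/tmp/test-${testId}-logs.txt`, logs); // what test steps read
    return logs;
  }

  stop(): void {
    this.proc?.kill();
  }
}
```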
### Test Execution Flow

1. **Load Phase**: YAML test definitions are parsed and sorted by dependency order
2. **Collection Start**: The LogCollector begins streaming container logs
3. **Execution Phase**: Tests run sequentially, each step receiving the current test ID via the environment
4. **Log Capture**: Before each step, accumulated logs are written to a test-specific file
5. **Judgment Phase**: Both judges evaluate results: the simple judge checks exit codes, the LLM judge analyzes logs
6. **Cleanup**: Models are unloaded from VRAM and the log collector stops
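Tying the phases together, a hypothetical orchestration loop might look like the following. This is illustrative only (the real engine is `tests/src/executor.ts`); it reuses the `LogCollector` sketch above, and the `TestCase` shape is an assumption based on the YAML fields shown later:

```typescript
// Illustrative executor loop covering phases 2-6.
import { spawnSync } from "node:child_process";

interface TestCase {
  id: string;
  steps: { name: string; command: string; timeout: number }[];
  dependencies?: string[];
}

function runAll(tests: TestCase[], collector: LogCollector): void {
  collector.start();                        // phase 2: begin streaming logs
  for (const test of tests) {               // phase 3: strictly sequential
    collector.markStart(test.id);
    for (const step of test.steps) {
      collector.capture(test.id);           // phase 4: write logs before the step
      const result = spawnSync("bash", ["-c", step.command], {
        timeout: step.timeout,
        env: { ...process.env, TEST_ID: test.id }, // step sees its own test ID
        encoding: "utf8",
      });
      // phase 5: result.status feeds the simple judge, and the captured
      // log file feeds the LLM judge (judging elided in this sketch).
    }
  }
  collector.stop();                         // phase 6
}
```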
## GitHub Workflows

Workflows are located in `.github/workflows/`:

| Workflow | Purpose |
|----------|---------|
| `build.yml` | Docker image build verification |
| `runtime.yml` | Container startup and GPU detection |
| `inference.yml` | Model inference tests (4b, 12b, 27b) |
| `full-pipeline.yml` | Orchestrates all stages sequentially |

### Workflow Inputs

| Parameter | Default | Options | Description |
|-----------|---------|---------|-------------|
| `judge_mode` | dual | simple, llm, dual | Judge strategy |
| `judge_model` | gemma3:12b | Any model | LLM used for evaluation |
| `use_existing_container` | false | true, false | Reuse a running container |
| `keep_container` | false | true, false | Leave the container running |

### Example: Run Inference Tests

```bash
# Manual trigger via the GitHub Actions UI, or via the gh CLI:
gh workflow run inference.yml \
  -f judge_mode=dual \
  -f judge_model=gemma3:12b
```

## Test Suites

Tests are organized into three suites that must run in order:

### Build Suite (3 tests)

Verifies that the Docker images exist and are correctly configured. No GPU required.

| ID | Name | Timeout | Description |
|----|------|---------|-------------|
| TC-BUILD-001 | Builder Image Verification | 2m | Verify the builder image exists |
| TC-BUILD-002 | Runtime Image Build | 30m | Build the runtime image |
| TC-BUILD-003 | Image Size Validation | 30s | Check image sizes |

### Runtime Suite (3 tests)

Starts the container and verifies GPU detection: that Ollama recognizes the K80 hardware and loads the CUDA libraries. This is the critical validation that the driver/toolkit/container integration works.

| ID | Name | Timeout | Description |
|----|------|---------|-------------|
| TC-RUNTIME-001 | Container Startup | 2m | Start the container with GPU access |
| TC-RUNTIME-002 | GPU Detection | 2m | Verify the K80 is detected |
| TC-RUNTIME-003 | Health Check | 3m | API health verification |

### Inference Suite (5 tests)

Actually runs models of increasing size: the 4b model tests basic functionality, 12b tests single-GPU capacity, and 27b tests multi-GPU layer splitting. Each model unloads after testing to free VRAM for the next.

| ID | Name | Model | Timeout | Description |
|----|------|-------|---------|-------------|
| TC-INFERENCE-001 | Model Pull | gemma3:4b | 10m | Pull and warm up the 4b model |
| TC-INFERENCE-002 | Basic Inference | gemma3:4b | 3m | Simple prompt test |
| TC-INFERENCE-003 | API Endpoint Test | gemma3:4b | 2m | REST API verification |
| TC-INFERENCE-004 | Medium Model | gemma3:12b | 10m | 12b inference (single GPU) |
| TC-INFERENCE-005 | Large Model Dual-GPU | gemma3:27b | 15m | 27b inference (dual GPU) |
### Model Unload Strategy

The K80 has limited VRAM (12 GB per GPU), so the framework explicitly unloads each model after its tests complete rather than relying on automatic eviction:

```
4b tests (001-003) → unload 4b
12b test (004)     → unload 12b
27b test (005)     → unload 27b
```

This ensures:

- A predictable VRAM state between tests
- No interference from cached models
- A clean baseline for each model-size test

Workflow-level cleanup (`if: always()`) provides a safety net if individual test unloads fail.
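Ollama's API offers a direct way to force an eviction: a generate request with `keep_alive: 0` and no prompt unloads the model immediately. That part is standard Ollama behavior; the helper name below is ours, a sketch of what a cleanup step could call:

```typescript
// Unload a model from VRAM by sending an empty generate request with
// keep_alive: 0, which tells Ollama to evict it immediately.
async function unloadModel(
  model: string,
  ollamaUrl = "http://localhost:11434",
): Promise<void> {
  await fetch(`${ollamaUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, keep_alive: 0 }),
  });
}

// e.g. after TC-INFERENCE-003: await unloadModel("gemma3:4b");
```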
## Error Detection

The framework specifically watches for K80-related failure patterns:

- `CUBLAS_STATUS_*` errors indicate the legacy CUBLAS fallback isn't working
- `CUDA error` messages suggest a driver/toolkit mismatch
- `cudaMalloc failed` indicates VRAM exhaustion
- `id=cpu library=cpu` means GPU detection failed entirely

These patterns are checked by both the simple judge (via grep in test steps) and the LLM judge (via semantic log analysis); the Error Patterns table later in this document lists the full set with severities.
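As a sketch of the deterministic half of that check: the real test steps grep for these strings in bash, and the equivalent scan over a captured log file looks roughly like this (the patterns are the ones listed above; the function name is illustrative):

```typescript
// Deterministic pattern scan over a test's captured logs, mirroring
// what the bash test steps do with grep.
const CRITICAL_PATTERNS: RegExp[] = [
  /CUBLAS_STATUS_\w+/,   // legacy CUBLAS fallback not working
  /CUDA error/,          // driver/toolkit mismatch
  /cudaMalloc failed/,   // VRAM exhaustion
  /id=cpu library=cpu/,  // GPU not detected at all
];

function findCriticalErrors(logs: string): string[] {
  return CRITICAL_PATTERNS.filter((p) => p.test(logs)).map((p) => p.source);
}

// Usage:
// const hits = findCriticalErrors(readFileSync(`/tmp/test-${id}-logs.txt`, "utf8"));
```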
## Test Case Structure

Test cases are YAML files in `tests/testcases/{suite}/`:

```yaml
id: TC-INFERENCE-002
name: Basic Inference
suite: inference
priority: 2
timeout: 180000

dependencies:
  - TC-INFERENCE-001

steps:
  - name: Run simple math question
    command: docker exec ollama37 ollama run gemma3:4b "What is 2+2?"
    timeout: 120000

  - name: Check for errors in logs
    command: |
      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
      else
        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
      fi
      # Check for CUDA errors...

criteria: |
  Expected:
  - Model responds with "4" or equivalent
  - NO CUBLAS_STATUS_ errors
  - NO CUDA errors
```
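The Load Phase described earlier parses these files and orders them by `dependencies`. A minimal sketch of that ordering, reusing the `TestCase` shape from the orchestration sketch and assuming the YAML has already been parsed (the real parser is `tests/src/loader.ts`):

```typescript
// Order test cases so every test runs after its dependencies
// (a simple topological sort; assumes the dependency graph is acyclic).
function sortByDependencies(tests: TestCase[]): TestCase[] {
  const byId = new Map(tests.map((t) => [t.id, t]));
  const ordered: TestCase[] = [];
  const visited = new Set<string>();

  const visit = (t: TestCase): void => {
    if (visited.has(t.id)) return;
    visited.add(t.id);
    for (const dep of t.dependencies ?? []) {
      const d = byId.get(dep);
      if (d) visit(d); // dependencies first
    }
    ordered.push(t);
  };

  tests.forEach(visit);
  return ordered;
}
```

With this ordering, TC-INFERENCE-002 above can never run before TC-INFERENCE-001 has pulled and warmed up the model.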
## Design Decisions

**Why YAML test cases?** Declarative test definitions separate test logic from the execution machinery. Adding a new test requires no code changes.

**Why LLM judging?** Traditional test assertions require anticipating every failure mode. LLM evaluation can recognize novel failures and evaluate fuzzy criteria like "response should mention Paris".

**Why sequential execution?** Log collection with precise boundaries requires knowing which test is running. Parallel execution would interleave logs unpredictably.

**Why Docker-based builds?** Reproducibility. The exact toolchain that works is captured in the builder image, so there are no "works on my machine" issues.

**Why self-hosted runners?** K80 hardware. No cloud provider offers compute capability 3.7 GPUs for CI/CD.

## Build System

### Docker Images

**Builder Image:** `ollama37-builder:latest` (~15GB)

- Rocky Linux 8
- CUDA 11.4 toolkit
- GCC 10, CMake 4.0, Go 1.25.3
- Build time: ~90 minutes (cached)

**Runtime Image:** `ollama37:latest` (~18GB)

- Built from GitHub source
- Build time: ~10 minutes

### Build Commands

```bash
cd docker

# Build the base image (first time only)
make build-builder

# Build the runtime from GitHub
make build-runtime

# Build without cache
make build-runtime-no-cache

# Build from local source
make build-runtime-local
```
## Running Tests Locally

### Prerequisites

1. Docker with the NVIDIA runtime
2. Node.js 20+
3. Tesla K80 GPU (or compatible)

### Quick Start

```bash
# Start the container
cd docker && docker compose up -d

# Install the test runner
cd tests && npm ci

# Run all tests with the dual judge
npm run dev -- run --dual-judge

# Run a specific suite
npm run dev -- run --suite inference

# Run a single test
npm run dev -- run --id TC-INFERENCE-002

# Simple mode (no LLM)
npm run dev -- run --no-llm

# JSON output
npm run dev -- run -o json > results.json
```

### Test Output

Results are saved to `/tmp/`:

- `/tmp/build-results.json`
- `/tmp/runtime-results.json`
- `/tmp/inference-results.json`

JSON structure:

```json
{
  "summary": {
    "total": 5,
    "passed": 5,
    "failed": 0,
    "timestamp": "2025-12-17T...",
    "simple": { "passed": 5, "failed": 0 },
    "llm": { "passed": 5, "failed": 0 }
  },
  "results": [...]
}
```
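For consumers parsing these files, the shape maps naturally onto a couple of interfaces. This is a sketch inferred from the JSON example above; the canonical definitions live in `tests/src/types.ts` and may differ:

```typescript
// Result-file shape as suggested by the JSON example above.
interface JudgeTally {
  passed: number;
  failed: number;
}

interface ResultsFile {
  summary: {
    total: number;
    passed: number;
    failed: number;
    timestamp: string;   // ISO 8601
    simple: JudgeTally;  // simple-judge verdicts
    llm: JudgeTally;     // LLM-judge verdicts
  };
  results: unknown[];    // per-test records (elided in the example)
}

// Usage:
// const data = JSON.parse(
//   readFileSync("/tmp/inference-results.json", "utf8"),
// ) as ResultsFile;
```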
## Environment Variables

### Build Environment

| Variable | Value | Description |
|----------|-------|-------------|
| `BUILDER_IMAGE` | ollama37-builder | Builder image name |
| `RUNTIME_IMAGE` | ollama37 | Runtime image name |

### Runtime Environment

| Variable | Value | Description |
|----------|-------|-------------|
| `OLLAMA_HOST` | 0.0.0.0:11434 | Server listen address |
| `NVIDIA_VISIBLE_DEVICES` | all | GPU visibility |
| `OLLAMA_DEBUG` | 1 (optional) | Enable debug logging |
| `GGML_CUDA_DEBUG` | 1 (optional) | Enable CUDA debug output |

### Test Environment

| Variable | Description |
|----------|-------------|
| `TEST_ID` | Current test ID (set by the executor) |
| `OLLAMA_HOST` | Test-subject URL |
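The test-environment contract is small: the executor injects these two variables when it spawns a step, and the step uses them to locate its log file and its server. A hedged sketch of that hand-off (variable names match the table; the function is hypothetical):

```typescript
// How the executor could hand the environment contract to a test step.
import { spawn } from "node:child_process";

function runStep(command: string, testId: string, ollamaHost: string) {
  return spawn("bash", ["-c", command], {
    env: {
      ...process.env,
      TEST_ID: testId,         // lets the step find /tmp/test-<id>-logs.txt
      OLLAMA_HOST: ollamaHost, // points the step at the test-subject server
    },
    stdio: "inherit",
  });
}
```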
## Troubleshooting

### GPU Not Detected in Container

```bash
# Check UVM device files
ls -l /dev/nvidia-uvm*

# Create them if missing
nvidia-modprobe -u -c=0

# Restart the container
docker compose restart
```

### LLM Judge Timeout

```bash
# Use simple mode
npm run dev -- run --no-llm

# Or switch to a smaller, faster judge model
npm run dev -- run --judge-model gemma3:4b
```

### Log Collector Issues

If a test step can't find its logs:

```bash
# Check that the log file exists
ls -l /tmp/test-*-logs.txt

# Fall back to direct logs
docker compose logs --since=5m
```

### Build Failures

```bash
# Clean build
cd docker && make build-runtime-no-cache

# Check the builder image
docker images | grep ollama37-builder
```

## Error Patterns

The test framework checks for these critical errors:

| Pattern | Severity | Description |
|---------|----------|-------------|
| `CUBLAS_STATUS_*` | Critical | CUDA/cuBLAS error (K80-specific) |
| `CUDA error` | Critical | General CUDA failure |
| `cudaMalloc failed` | Critical | GPU memory allocation failure |
| `out of memory` | Critical | VRAM exhausted |
| `level=ERROR` | Warning | Ollama application error |
| `panic`, `fatal` | Critical | Runtime crash |
| `id=cpu library=cpu` | Critical | CPU-only fallback (GPU not detected) |

## File Structure

```
tests/
├── src/
│   ├── cli.ts             # CLI entry point
│   ├── executor.ts        # Test execution engine
│   ├── judge.ts           # LLM/simple judging
│   ├── loader.ts          # YAML test case parser
│   ├── log-collector.ts   # Docker log collector
│   ├── reporter.ts        # Output formatters
│   └── types.ts           # Type definitions
├── testcases/
│   ├── build/             # Build test cases
│   ├── runtime/           # Runtime test cases
│   └── inference/         # Inference test cases
└── package.json

.github/workflows/
├── build.yml              # Build verification
├── runtime.yml            # Container/GPU tests
├── inference.yml          # Model inference tests
└── full-pipeline.yml      # Complete pipeline
```
## Limitations

- Tests must run sequentially for accurate log collection
- The LLM judge requires a working Ollama instance (a chicken-and-egg problem for broken builds)
- K80 VRAM limits restrict the maximum model size to ~27B parameters
- Build times are significant due to CUDA compilation