From 2c5094db924f2f73d3832fd2ba80043526129440 Mon Sep 17 00:00:00 2001 From: Shang Chieh Tseng Date: Wed, 17 Dec 2025 17:46:49 +0800 Subject: [PATCH] Add LogCollector for precise test log boundaries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Problem: Tests used `docker compose logs --since=5m` which caused: - Log overlap between tests - Logs from previous tests included - Missing logs if test exceeded 5 minutes Solution: - New LogCollector class runs `docker compose logs --follow` - Marks test start/end boundaries - Writes test-specific logs to /tmp/test-{testId}-logs.txt - Test steps access via TEST_ID environment variable Changes: - tests/src/log-collector.ts: New LogCollector class - tests/src/executor.ts: Integrate LogCollector, set TEST_ID env - tests/src/cli.ts: Start/stop LogCollector for runtime/inference - All test cases: Use log collector with fallback to docker compose Also updated docs/CICD.md with: - Test runner CLI documentation - Judge modes (simple, llm, dual) - Log collector integration - Updated test case list (12b, 27b models) - Model unload strategy - Troubleshooting guide 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 --- docs/CICD.md | 533 ++++++++++-------- tests/src/cli.ts | 34 +- tests/src/executor.ts | 32 +- tests/src/log-collector.ts | 200 +++++++ .../testcases/inference/TC-INFERENCE-001.yml | 32 +- .../testcases/inference/TC-INFERENCE-002.yml | 24 +- .../testcases/inference/TC-INFERENCE-003.yml | 24 +- .../testcases/inference/TC-INFERENCE-004.yml | 16 +- .../testcases/inference/TC-INFERENCE-005.yml | 16 +- tests/testcases/runtime/TC-RUNTIME-001.yml | 15 +- tests/testcases/runtime/TC-RUNTIME-002.yml | 24 +- tests/testcases/runtime/TC-RUNTIME-003.yml | 24 +- 12 files changed, 702 insertions(+), 272 deletions(-) create mode 100644 tests/src/log-collector.ts diff --git a/docs/CICD.md b/docs/CICD.md index 0c1068c7..07244686 100644 --- a/docs/CICD.md 
+++ b/docs/CICD.md @@ -1,4 +1,4 @@ -# CI/CD Plan for Ollama37 +# CI/CD Pipeline for Ollama37 This document describes the CI/CD pipeline for building and testing Ollama37 with Tesla K80 (CUDA compute capability 3.7) support. @@ -24,185 +24,269 @@ This document describes the CI/CD pipeline for building and testing Ollama37 wit │ - Rocky Linux 9.7 │ │ - Docker 29.1.3 + Docker Compose 5.0.0 │ │ - NVIDIA Container Toolkit │ -│ - GitHub Actions Runner (self-hosted, labels: k80, cuda11) │ -│ │ -│ Services: │ -│ - TestLink (http://localhost:8090) - Test management │ -│ - TestLink MCP - Claude Code integration │ -│ │ -└─────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────┐ -│ SERVE NODE │ -│ │ -│ Services: │ -│ - Ollama (production) │ -│ - Dify (LLM application platform) │ +│ - GitHub Actions Runner (self-hosted) │ │ │ └─────────────────────────────────────────────────────────────────────────┘ ``` -## Build Strategy: Docker-Based +## Test Framework -We use the two-stage Docker build system located in `/docker/`: +### Test Runner CLI -### Stage 1: Builder Image (Cached) +The test runner is located in `tests/src/` and provides a CLI tool: -**Image:** `ollama37-builder:latest` (~15GB) +```bash +cd tests +npm run dev -- run [options] +``` -**Contents:** +**Commands:** +- `run` - Execute test cases +- `list` - List all available test cases + +**Options:** +| Option | Default | Description | +|--------|---------|-------------| +| `-s, --suite ` | all | Filter by suite (build, runtime, inference) | +| `-i, --id ` | - | Run specific test by ID | +| `-w, --workers ` | 1 | Parallel worker count | +| `-d, --dry-run` | false | Preview without executing | +| `-o, --output ` | console | Output format: console, json, junit | +| `--no-llm` | false | Skip LLM, use simple exit code check only | +| `--judge-model ` | gemma3:12b | Model for LLM judging | +| `--dual-judge` | true | Run both 
simple and LLM judge | +| `--ollama-url ` | localhost:11434 | Test subject server | +| `--judge-url ` | localhost:11435 | Separate judge instance | + +### Judge Modes + +The test framework supports three judge modes: + +| Mode | Flag | Description | +|------|------|-------------| +| **Simple** | `--no-llm` | Exit code checking only (exit 0 = pass) | +| **LLM** | `--judge-model` | Semantic analysis of test logs using LLM | +| **Dual** | `--dual-judge` | Both must pass (default) | + +**LLM Judge:** +- Analyzes test execution logs semantically +- Detects hidden issues (e.g., CUDA errors with exit 0) +- Uses configurable model (default: gemma3:12b) +- Batches tests for efficient judging + +**Simple Judge:** +- Fast, deterministic +- Checks exit codes only +- Fallback when LLM unavailable + +### Log Collector + +The test framework includes a log collector that solves log overlap issues: + +**Problem:** `docker compose logs --since=5m` can include logs from previous tests or miss logs if a test exceeds 5 minutes. + +**Solution:** LogCollector class that: +1. Runs `docker compose logs --follow` as background process +2. Marks test start/end boundaries +3. Writes test-specific logs to `/tmp/test-{testId}-logs.txt` +4. 
Provides precise logs for each test + +Test steps access logs via: +```bash +LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt) +``` + +## GitHub Workflows + +Located in `.github/workflows/`: + +| Workflow | Purpose | +|----------|---------| +| `build.yml` | Docker image build verification | +| `runtime.yml` | Container startup and GPU detection | +| `inference.yml` | Model inference tests (4b, 12b, 27b) | +| `full-pipeline.yml` | Orchestrates all stages sequentially | + +### Workflow Inputs + +| Parameter | Default | Options | Description | +|-----------|---------|---------|-------------| +| `judge_mode` | dual | simple, llm, dual | Judge strategy | +| `judge_model` | gemma3:12b | Any model | LLM for evaluation | +| `use_existing_container` | false | true, false | Reuse running container | +| `keep_container` | false | true, false | Leave container running | + +### Example: Run Inference Tests + +```bash +# Manual trigger via GitHub Actions UI +# Or via gh CLI: +gh workflow run inference.yml \ + -f judge_mode=dual \ + -f judge_model=gemma3:12b +``` + +## Test Suites + +### Build Suite (3 tests) + +| ID | Name | Timeout | Description | +|----|------|---------|-------------| +| TC-BUILD-001 | Builder Image Verification | 2m | Verify builder image exists | +| TC-BUILD-002 | Runtime Image Build | 30m | Build runtime image | +| TC-BUILD-003 | Image Size Validation | 30s | Check image sizes | + +### Runtime Suite (3 tests) + +| ID | Name | Timeout | Description | +|----|------|---------|-------------| +| TC-RUNTIME-001 | Container Startup | 2m | Start container with GPU | +| TC-RUNTIME-002 | GPU Detection | 2m | Verify K80 detected | +| TC-RUNTIME-003 | Health Check | 3m | API health verification | + +### Inference Suite (5 tests) + +| ID | Name | Model | Timeout | Description | +|----|------|-------|---------|-------------| +| TC-INFERENCE-001 | Model Pull | gemma3:4b | 10m | Pull and warmup 4b model | +| TC-INFERENCE-002 | Basic Inference | gemma3:4b | 3m | Simple prompt test 
| +| TC-INFERENCE-003 | API Endpoint Test | gemma3:4b | 2m | REST API verification | +| TC-INFERENCE-004 | Medium Model | gemma3:12b | 10m | 12b inference (single GPU) | +| TC-INFERENCE-005 | Large Model Dual-GPU | gemma3:27b | 15m | 27b inference (dual GPU) | + +### Model Unload Strategy + +Each model size test unloads its model after completion: + +``` +4b tests (001-003) → unload 4b +12b test (004) → unload 12b +27b test (005) → unload 27b +``` + +Workflow-level cleanup (`if: always()`) provides safety fallback. + +## Test Case Structure + +Test cases are YAML files in `tests/testcases/{suite}/`: + +```yaml +id: TC-INFERENCE-002 +name: Basic Inference +suite: inference +priority: 2 +timeout: 180000 + +dependencies: + - TC-INFERENCE-001 + +steps: + - name: Run simple math question + command: docker exec ollama37 ollama run gemma3:4b "What is 2+2?" + timeout: 120000 + + - name: Check for errors in logs + command: | + if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then + LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt) + else + LOGS=$(cd docker && docker compose logs --since=5m 2>&1) + fi + # Check for CUDA errors... 
+ +criteria: | + Expected: + - Model responds with "4" or equivalent + - NO CUBLAS_STATUS_ errors + - NO CUDA errors +``` + +## Build System + +### Docker Images + +**Builder Image:** `ollama37-builder:latest` (~15GB) - Rocky Linux 8 - CUDA 11.4 toolkit -- GCC 10 (built from source) -- CMake 4.0 (built from source) -- Go 1.25.3 +- GCC 10, CMake 4.0, Go 1.25.3 +- Build time: ~90 minutes (cached) -**Build time:** ~90 minutes (first time only, then cached) +**Runtime Image:** `ollama37:latest` (~18GB) +- Built from GitHub source +- Build time: ~10 minutes + +### Build Commands -**Build command:** ```bash -cd docker && make build-builder +cd docker + +# Build base image (first time only) +make build-builder + +# Build runtime from GitHub +make build-runtime + +# Build without cache +make build-runtime-no-cache + +# Build from local source +make build-runtime-local ``` -### Stage 2: Runtime Image (Per Build) +## Running Tests Locally -**Image:** `ollama37:latest` (~18GB) +### Prerequisites -**Process:** -1. Clone source from GitHub -2. Configure with CMake ("CUDA 11" preset) -3. Build C/C++/CUDA libraries -4. Build Go binary -5. Package runtime environment +1. Docker with NVIDIA runtime +2. Node.js 20+ +3. 
Tesla K80 GPU (or compatible) -**Build time:** ~10 minutes +### Quick Start -**Build command:** ```bash -cd docker && make build-runtime +# Start the container +cd docker && docker compose up -d + +# Install test runner +cd tests && npm ci + +# Run all tests with dual judge +npm run dev -- run --dual-judge + +# Run specific suite +npm run dev -- run --suite inference + +# Run single test +npm run dev -- run --id TC-INFERENCE-002 + +# Simple mode (no LLM) +npm run dev -- run --no-llm + +# JSON output +npm run dev -- run -o json > results.json ``` -## Pipeline Stages +### Test Output -### Stage 1: Docker Build +Results are saved to `/tmp/`: +- `/tmp/build-results.json` +- `/tmp/runtime-results.json` +- `/tmp/inference-results.json` -**Trigger:** Push to `main` branch - -**Steps:** -1. Checkout repository -2. Ensure builder image exists (build if not) -3. Build runtime image: `make build-runtime` -4. Verify image created successfully - -**Test Cases:** -- TC-BUILD-001: Builder Image Verification -- TC-BUILD-002: Runtime Image Build -- TC-BUILD-003: Image Size Validation - -### Stage 2: Container Startup - -**Steps:** -1. Start container with GPU: `docker compose up -d` -2. Wait for health check to pass -3. Verify Ollama server is responding - -**Test Cases:** -- TC-RUNTIME-001: Container Startup -- TC-RUNTIME-002: GPU Detection -- TC-RUNTIME-003: Health Check - -### Stage 3: Inference Tests - -**Steps:** -1. Pull test model (gemma3:4b) -2. Run inference tests -3. Verify CUBLAS legacy fallback - -**Test Cases:** -- TC-INFERENCE-001: Model Pull -- TC-INFERENCE-002: Basic Inference -- TC-INFERENCE-003: API Endpoint Test -- TC-INFERENCE-004: CUBLAS Fallback Verification - -### Stage 4: Cleanup & Report - -**Steps:** -1. Stop container: `docker compose down` -2. Report results to TestLink -3. 
Clean up resources - -## Test Case Design - -### Build Tests (Suite: Build Tests) - -| ID | Name | Type | Description | -|----|------|------|-------------| -| TC-BUILD-001 | Builder Image Verification | Automated | Verify builder image exists with correct tools | -| TC-BUILD-002 | Runtime Image Build | Automated | Build runtime image from GitHub source | -| TC-BUILD-003 | Image Size Validation | Automated | Verify image sizes are within expected range | - -### Runtime Tests (Suite: Runtime Tests) - -| ID | Name | Type | Description | -|----|------|------|-------------| -| TC-RUNTIME-001 | Container Startup | Automated | Start container with GPU passthrough | -| TC-RUNTIME-002 | GPU Detection | Automated | Verify Tesla K80 detected inside container | -| TC-RUNTIME-003 | Health Check | Automated | Verify Ollama health check passes | - -### Inference Tests (Suite: Inference Tests) - -| ID | Name | Type | Description | -|----|------|------|-------------| -| TC-INFERENCE-001 | Model Pull | Automated | Pull gemma3:4b model | -| TC-INFERENCE-002 | Basic Inference | Automated | Run simple prompt and verify response | -| TC-INFERENCE-003 | API Endpoint Test | Automated | Test /api/generate endpoint | -| TC-INFERENCE-004 | CUBLAS Fallback Verification | Automated | Verify legacy CUBLAS functions used | - -## GitHub Actions Workflow - -**File:** `.github/workflows/build-test.yml` - -**Triggers:** -- Push to `main` branch -- Pull request to `main` branch -- Manual trigger (workflow_dispatch) - -**Runner:** Self-hosted with labels `[self-hosted, k80, cuda11]` - -**Jobs:** -1. `build` - Build Docker runtime image -2. `test` - Run inference tests in container -3. 
`report` - Report results to TestLink - -## TestLink Integration - -**URL:** http://localhost:8090 - -**Project:** ollama37 - -**Test Suites:** -- Build Tests -- Runtime Tests -- Inference Tests - -**Test Plan:** Created per release/sprint - -**Builds:** Created per CI run (commit SHA) - -**Execution Recording:** -- Each test case result recorded via TestLink API -- Pass/Fail status with notes -- Linked to specific build/commit - -## Makefile Targets for CI - -| Target | Description | When to Use | -|--------|-------------|-------------| -| `make build-builder` | Build base image | First time setup | -| `make build-runtime` | Build from GitHub | Normal CI builds | -| `make build-runtime-no-cache` | Fresh GitHub clone | When cache is stale | -| `make build-runtime-local` | Build from local | Local testing | +JSON structure: +```json +{ + "summary": { + "total": 5, + "passed": 5, + "failed": 0, + "timestamp": "2025-12-17T...", + "simple": { "passed": 5, "failed": 0 }, + "llm": { "passed": 5, "failed": 0 } + }, + "results": [...] +} +``` ## Environment Variables @@ -222,76 +306,19 @@ cd docker && make build-runtime | `OLLAMA_DEBUG` | 1 (optional) | Enable debug logging | | `GGML_CUDA_DEBUG` | 1 (optional) | Enable CUDA debug | -### TestLink Environment +### Test Environment -| Variable | Value | Description | -|----------|-------|-------------| -| `TESTLINK_URL` | http://localhost:8090 | TestLink server URL | -| `TESTLINK_API_KEY` | (configured) | API key for automation | - -## Prerequisites - -### One-Time Setup on CI/CD Node - -1. 
**Install GitHub Actions Runner:**
-   ```bash
-   mkdir -p ~/actions-runner && cd ~/actions-runner
-   curl -o actions-runner-linux-x64-2.321.0.tar.gz -L \
-     https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
-   tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz
-   ./config.sh --url https://github.com/dogkeeper886/ollama37 --token YOUR_TOKEN --labels k80,cuda11
-   sudo ./svc.sh install && sudo ./svc.sh start
-   ```
-
-2. **Build Builder Image (one-time, ~90 min):**
-   ```bash
-   cd /home/jack/src/ollama37/docker
-   make build-builder
-   ```
-
-3. **Verify GPU Access in Docker:**
-   ```bash
-   docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
-   ```
-
-4. **Start TestLink:**
-   ```bash
-   cd /home/jack/src/testlink-code
-   docker compose up -d
-   ```
-
-## Monitoring & Logs
-
-### View CI/CD Logs
-
-```bash
-# GitHub Actions Runner logs
-journalctl -u actions.runner.* -f
-
-# Docker build logs
-docker compose logs -f
-
-# TestLink logs
-cd /home/jack/src/testlink-code && docker compose logs -f
-```
-
-### Test Results
-
-- **TestLink Dashboard:** http://localhost:8090
-- **GitHub Actions:** https://github.com/dogkeeper886/ollama37/actions
+| Variable | Description |
+|----------|-------------|
+| `TEST_ID` | Current test ID (set by executor) |
+| `OLLAMA_HOST` | Test subject URL |
 
 ## Troubleshooting
 
-### Builder Image Missing
-
-```bash
-cd docker && make build-builder
-```
-
 ### GPU Not Detected in Container
 
 ```bash
-# Check UVM device files on host
+# Check UVM device files
 ls -l /dev/nvidia-uvm*
 
 # Create if missing
@@ -301,18 +328,72 @@ nvidia-modprobe -u -c=0
 docker compose restart
 ```
 
-### Build Cache Stale
+### LLM Judge Timeout
 
 ```bash
+# Use simple mode
+npm run dev -- run --no-llm
+
+# Or use a smaller, faster judge model
+npm run dev -- run --judge-model gemma3:4b
+```
+
+### Log Collector Issues
+
+If a test step can't find logs:
+```bash
+# Check log file exists
+ls -l /tmp/test-*-logs.txt
+
+# Fallback to direct logs
+docker compose logs --since=5m +``` + +### Build Failures + +```bash +# Clean build cd docker && make build-runtime-no-cache + +# Check builder image +docker images | grep ollama37-builder ``` -### TestLink Connection Failed +## Error Patterns -```bash -# Check TestLink is running -curl http://localhost:8090 +The test framework checks for these critical errors: + +| Pattern | Severity | Description | +|---------|----------|-------------| +| `CUBLAS_STATUS_*` | Critical | CUDA/cuBLAS error (K80-specific) | +| `CUDA error` | Critical | General CUDA failure | +| `cudaMalloc failed` | Critical | GPU memory allocation failure | +| `out of memory` | Critical | VRAM exhausted | +| `level=ERROR` | Warning | Ollama application error | +| `panic`, `fatal` | Critical | Runtime crash | +| `id=cpu library=cpu` | Critical | CPU-only fallback (GPU not detected) | + +## File Structure -# Restart if needed -cd /home/jack/src/testlink-code && docker compose restart +``` +tests/ +├── src/ +│ ├── cli.ts # CLI entry point +│ ├── executor.ts # Test execution engine +│ ├── judge.ts # LLM/simple judging +│ ├── loader.ts # YAML test case parser +│ ├── log-collector.ts # Docker log collector +│ ├── reporter.ts # Output formatters +│ └── types.ts # Type definitions +├── testcases/ +│ ├── build/ # Build test cases +│ ├── runtime/ # Runtime test cases +│ └── inference/ # Inference test cases +└── package.json + +.github/workflows/ +├── build.yml # Build verification +├── runtime.yml # Container/GPU tests +├── inference.yml # Model inference tests +└── full-pipeline.yml # Complete pipeline ``` diff --git a/tests/src/cli.ts b/tests/src/cli.ts index beb79a40..b7c2b370 100644 --- a/tests/src/cli.ts +++ b/tests/src/cli.ts @@ -9,6 +9,7 @@ import { TestExecutor } from "./executor.js"; import { LLMJudge } from "./judge.js"; import { Reporter, TestLinkReporter } from "./reporter.js"; import { RunnerOptions, Judgment } from "./types.js"; +import { LogCollector } from "./log-collector.js"; const 
__dirname = path.dirname(fileURLToPath(import.meta.url)); const defaultTestcasesDir = path.join(__dirname, "..", "testcases"); @@ -68,9 +69,12 @@ program log("=".repeat(60)); const loader = new TestLoader(options.testcasesDir); - const executor = new TestExecutor(path.join(__dirname, "..", "..")); + const projectRoot = path.join(__dirname, "..", ".."); const judge = new LLMJudge(options.judgeUrl, options.judgeModel); + // LogCollector will be created if running runtime or inference tests + let logCollector: LogCollector | undefined; + // Load test cases log("\nLoading test cases..."); let testCases = await loader.loadAll(); @@ -107,10 +111,38 @@ program process.exit(0); } + // Check if we need log collector (runtime or inference tests) + const needsLogCollector = testCases.some( + (tc) => tc.suite === "runtime" || tc.suite === "inference" + ); + + if (needsLogCollector) { + log("\nStarting log collector..."); + logCollector = new LogCollector(projectRoot); + try { + await logCollector.start(); + log(" Log collector started"); + } catch (error) { + log(` Warning: Log collector failed to start: ${error}`); + log(" Tests will run without precise log collection"); + logCollector = undefined; + } + } + + // Create executor with optional log collector + const executor = new TestExecutor(projectRoot, logCollector); + // Execute tests (progress goes to stderr via executor) const workers = parseInt(options.workers); const results = await executor.executeAll(testCases, workers); + // Stop log collector if running + if (logCollector) { + log("\nStopping log collector..."); + await logCollector.stop(); + log(" Log collector stopped"); + } + // Judge results log("\nJudging results..."); let judgments: Judgment[]; diff --git a/tests/src/executor.ts b/tests/src/executor.ts index d25e6429..73597c28 100644 --- a/tests/src/executor.ts +++ b/tests/src/executor.ts @@ -1,6 +1,7 @@ import { exec } from 'child_process' import { promisify } from 'util' import { TestCase, TestResult, 
StepResult } from './types.js' +import { LogCollector } from './log-collector.js' const execAsync = promisify(exec) @@ -17,9 +18,12 @@ export class TestExecutor { private workingDir: string private totalTests: number = 0 private currentTest: number = 0 + private logCollector: LogCollector | null = null + private currentTestId: string | null = null - constructor(workingDir: string = process.cwd()) { + constructor(workingDir: string = process.cwd(), logCollector?: LogCollector) { this.workingDir = workingDir + this.logCollector = logCollector || null } // Progress output goes to stderr (visible in console) @@ -34,12 +38,24 @@ export class TestExecutor { let exitCode = 0 let timedOut = false + // Build environment with TEST_ID for log access + const env = { ...process.env } + if (this.currentTestId) { + env.TEST_ID = this.currentTestId + } + + // Update log file before each step so test can access current logs + if (this.logCollector && this.currentTestId) { + this.logCollector.writeCurrentLogs(this.currentTestId) + } + try { const result = await execAsync(command, { cwd: this.workingDir, timeout, maxBuffer: 10 * 1024 * 1024, // 10MB buffer - shell: '/bin/bash' + shell: '/bin/bash', + env }) stdout = result.stdout stderr = result.stderr @@ -77,8 +93,14 @@ export class TestExecutor { const timestamp = new Date().toISOString().substring(11, 19) this.currentTest++ + this.currentTestId = testCase.id this.progress(`[${timestamp}] [${this.currentTest}/${this.totalTests}] ${testCase.id}: ${testCase.name}`) + // Mark test start for log collection + if (this.logCollector) { + this.logCollector.markTestStart(testCase.id) + } + for (let i = 0; i < testCase.steps.length; i++) { const step = testCase.steps[i] const stepTimestamp = new Date().toISOString().substring(11, 19) @@ -106,6 +128,12 @@ export class TestExecutor { const totalDuration = Date.now() - startTime + // Mark test end for log collection + if (this.logCollector) { + this.logCollector.markTestEnd(testCase.id) + } + 
this.currentTestId = null
+
   // Combine all logs
   const logs = stepResults.map(r => {
     return `=== Step: ${r.name} ===
diff --git a/tests/src/log-collector.ts b/tests/src/log-collector.ts
new file mode 100644
index 00000000..b7eb8ca9
--- /dev/null
+++ b/tests/src/log-collector.ts
@@ -0,0 +1,200 @@
+import { spawn, ChildProcess } from "child_process";
+import { writeFileSync } from "fs";
+import path from "path";
+
+interface LogEntry {
+  timestamp: Date;
+  line: string;
+}
+
+interface TestMarker {
+  start: number;
+  end?: number;
+}
+
+/**
+ * LogCollector runs `docker compose logs --follow` as a background process
+ * and captures logs with precise test boundaries.
+ *
+ * This solves the log overlap problem where `docker compose logs --since=5m`
+ * could include logs from previous tests or miss logs if a test exceeds 5 minutes.
+ */
+export class LogCollector {
+  private process: ChildProcess | null = null;
+  private logs: LogEntry[] = [];
+  private testMarkers: Map<string, TestMarker> = new Map();
+  private workingDir: string;
+  private isRunning: boolean = false;
+
+  constructor(workingDir: string) {
+    // workingDir should be the project root, docker-compose.yml is in docker/
+    this.workingDir = path.join(workingDir, "docker");
+  }
+
+  /**
+   * Start the log collector background process
+   */
+  async start(): Promise<void> {
+    if (this.isRunning) {
+      return;
+    }
+
+    return new Promise<void>((resolve, reject) => {
+      // Spawn docker compose logs --follow
+      this.process = spawn("docker", ["compose", "logs", "--follow", "--timestamps"], {
+        cwd: this.workingDir,
+        stdio: ["ignore", "pipe", "pipe"],
+      });
+
+      this.isRunning = true;
+
+      // Handle stdout (main log output)
+      this.process.stdout?.on("data", (data: Buffer) => {
+        const lines = data.toString().split("\n").filter((l) => l.trim());
+        for (const line of lines) {
+          this.logs.push({
+            timestamp: new Date(),
+            line: line,
+          });
+        }
+      });
+
+      // Handle stderr (docker compose messages)
+      this.process.stderr?.on("data", (data: Buffer) => {
+        const lines = data.toString().split("\n").filter((l) => l.trim());
+        for (const line of lines) {
+          this.logs.push({
+            timestamp: new Date(),
+            line: `[stderr] ${line}`,
+          });
+        }
+      });
+
+      this.process.on("error", (err) => {
+        this.isRunning = false;
+        reject(err);
+      });
+
+      this.process.on("close", (code) => {
+        this.isRunning = false;
+      });
+
+      // Give it a moment to start
+      setTimeout(() => {
+        if (this.isRunning) {
+          resolve();
+        } else {
+          reject(new Error("Log collector failed to start"));
+        }
+      }, 500);
+    });
+  }
+
+  /**
+   * Stop the log collector background process
+   */
+  async stop(): Promise<void> {
+    if (!this.process || !this.isRunning) {
+      return;
+    }
+
+    return new Promise<void>((resolve) => {
+      this.process!.on("close", () => {
+        this.isRunning = false;
+        resolve();
+      });
+
+      this.process!.kill("SIGTERM");
+
+      // Force kill after 5 seconds
+      setTimeout(() => {
+        if (this.isRunning && this.process) {
+          this.process.kill("SIGKILL");
+        }
+        resolve();
+      }, 5000);
+    });
+  }
+
+  /**
+   * Mark the start of a test - records current log index
+   */
+  markTestStart(testId: string): void {
+    this.testMarkers.set(testId, {
+      start: this.logs.length,
+    });
+  }
+
+  /**
+   * Mark the end of a test - records current log index and writes logs to file
+   */
+  markTestEnd(testId: string): void {
+    const marker = this.testMarkers.get(testId);
+    if (marker) {
+      marker.end = this.logs.length;
+
+      // Write logs to temp file for test steps to access
+      const logs = this.getLogsForTest(testId);
+      const filePath = `/tmp/test-${testId}-logs.txt`;
+      writeFileSync(filePath, logs);
+    }
+  }
+
+  /**
+   * Get logs for a specific test (between start and end markers)
+   */
+  getLogsForTest(testId: string): string {
+    const marker = this.testMarkers.get(testId);
+    if (!marker) {
+      return "";
+    }
+
+    const endIndex = marker.end ??
this.logs.length; + const testLogs = this.logs.slice(marker.start, endIndex); + + return testLogs.map((entry) => entry.line).join("\n"); + } + + /** + * Get all logs collected so far + */ + getAllLogs(): string { + return this.logs.map((entry) => entry.line).join("\n"); + } + + /** + * Get logs since a specific index + */ + getLogsSince(startIndex: number): string { + return this.logs.slice(startIndex).map((entry) => entry.line).join("\n"); + } + + /** + * Get current log count (useful for tracking) + */ + getLogCount(): number { + return this.logs.length; + } + + /** + * Check if the collector is running + */ + isActive(): boolean { + return this.isRunning; + } + + /** + * Write current test logs to file (call during test execution) + * This allows test steps to access logs accumulated so far + */ + writeCurrentLogs(testId: string): void { + const marker = this.testMarkers.get(testId); + if (!marker) { + return; + } + + const testLogs = this.logs.slice(marker.start); + const filePath = `/tmp/test-${testId}-logs.txt`; + writeFileSync(filePath, testLogs.map((entry) => entry.line).join("\n")); + } +} diff --git a/tests/testcases/inference/TC-INFERENCE-001.yml b/tests/testcases/inference/TC-INFERENCE-001.yml index 51f212c3..e23efbe6 100644 --- a/tests/testcases/inference/TC-INFERENCE-001.yml +++ b/tests/testcases/inference/TC-INFERENCE-001.yml @@ -27,8 +27,12 @@ steps: - name: Verify model loading in logs command: | - cd docker - LOGS=$(docker compose logs 2>&1) + # Use log collector file if available, fallback to docker compose logs + if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then + LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt) + else + LOGS=$(cd docker && docker compose logs 2>&1) + fi echo "=== Model Loading Check ===" @@ -58,8 +62,12 @@ steps: - name: Verify llama runner started command: | - cd docker - LOGS=$(docker compose logs 2>&1) + # Use log collector file if available, fallback to docker compose logs + if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then + 
        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== Llama Runner Check ==="
@@ -74,8 +82,12 @@ steps:

   - name: Check for model loading errors
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== Model Loading Error Check ==="
@@ -97,8 +109,12 @@ steps:

   - name: Display model memory allocation from logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== Model Memory Allocation ==="
       echo "$LOGS" | grep -E '(model weights|kv cache|compute graph|total memory).*device=' | tail -8
diff --git a/tests/testcases/inference/TC-INFERENCE-002.yml b/tests/testcases/inference/TC-INFERENCE-002.yml
index 52d587a7..efad80cf 100644
--- a/tests/testcases/inference/TC-INFERENCE-002.yml
+++ b/tests/testcases/inference/TC-INFERENCE-002.yml
@@ -17,8 +17,12 @@ steps:

   - name: Check for inference errors in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== Inference Error Check ==="
@@ -47,8 +51,12 @@ steps:

   - name: Verify inference request in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== Inference Request Verification ==="
@@ -69,8 +77,12 @@ steps:

   - name: Display recent CUDA activity from logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== Recent CUDA Activity ==="
       echo "$LOGS" | grep -iE "(CUDA|cuda|device=CUDA)" | tail -5 || echo "No recent CUDA activity logged"
diff --git a/tests/testcases/inference/TC-INFERENCE-003.yml b/tests/testcases/inference/TC-INFERENCE-003.yml
index 81616e01..9fe5533d 100644
--- a/tests/testcases/inference/TC-INFERENCE-003.yml
+++ b/tests/testcases/inference/TC-INFERENCE-003.yml
@@ -22,8 +22,12 @@ steps:

   - name: Verify API requests logged successfully
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== API Request Log Verification ==="
@@ -40,8 +44,12 @@ steps:

   - name: Check for API errors in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== API Error Check ==="
@@ -64,8 +72,12 @@ steps:

   - name: Display API response times from logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== API Response Times ==="
diff --git a/tests/testcases/inference/TC-INFERENCE-004.yml b/tests/testcases/inference/TC-INFERENCE-004.yml
index 5898895d..347c9225 100644
--- a/tests/testcases/inference/TC-INFERENCE-004.yml
+++ b/tests/testcases/inference/TC-INFERENCE-004.yml
@@ -27,8 +27,12 @@ steps:

   - name: Verify model loaded to GPU
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== Model Loading Check for gemma3:12b ==="
@@ -63,8 +67,12 @@ steps:

   - name: Check for inference errors
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=5m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=5m 2>&1)
+      fi

       echo "=== Inference Error Check ==="
diff --git a/tests/testcases/inference/TC-INFERENCE-005.yml b/tests/testcases/inference/TC-INFERENCE-005.yml
index 43ddbb07..e0233cec 100644
--- a/tests/testcases/inference/TC-INFERENCE-005.yml
+++ b/tests/testcases/inference/TC-INFERENCE-005.yml
@@ -39,8 +39,12 @@ steps:

   - name: Verify model loaded across GPUs
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=10m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=10m 2>&1)
+      fi

       echo "=== Model Loading Check for gemma3:27b ==="
@@ -96,8 +100,12 @@ steps:

   - name: Check for inference errors
     command: |
-      cd docker
-      LOGS=$(docker compose logs --since=10m 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs --since=10m 2>&1)
+      fi

       echo "=== Inference Error Check ==="
diff --git a/tests/testcases/runtime/TC-RUNTIME-001.yml b/tests/testcases/runtime/TC-RUNTIME-001.yml
index 3d0b389e..30a08e92 100644
--- a/tests/testcases/runtime/TC-RUNTIME-001.yml
+++ b/tests/testcases/runtime/TC-RUNTIME-001.yml
@@ -22,12 +22,21 @@ steps:

   - name: Capture startup logs
     command: |
-      cd docker && docker compose logs 2>&1 | head -100
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        head -100 /tmp/test-${TEST_ID}-logs.txt
+      else
+        cd docker && docker compose logs 2>&1 | head -100
+      fi

   - name: Check for startup errors in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       # Check for critical errors
       if echo "$LOGS" | grep -qE "(level=ERROR|CUBLAS_STATUS_|CUDA error|cudaMalloc failed)"; then
diff --git a/tests/testcases/runtime/TC-RUNTIME-002.yml b/tests/testcases/runtime/TC-RUNTIME-002.yml
index 46e16eff..cfe232a3 100644
--- a/tests/testcases/runtime/TC-RUNTIME-002.yml
+++ b/tests/testcases/runtime/TC-RUNTIME-002.yml
@@ -30,8 +30,12 @@ steps:

   - name: Verify GPU detection in Ollama logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== GPU Detection Check ==="
@@ -60,8 +64,12 @@ steps:

   - name: Check for GPU-related errors in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== GPU Error Check ==="
@@ -82,8 +90,12 @@ steps:

   - name: Display GPU memory status from logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== GPU Memory Status ==="
       echo "$LOGS" | grep -E "gpu memory.*library=CUDA" | tail -4
diff --git a/tests/testcases/runtime/TC-RUNTIME-003.yml b/tests/testcases/runtime/TC-RUNTIME-003.yml
index 9b03a968..b462bee3 100644
--- a/tests/testcases/runtime/TC-RUNTIME-003.yml
+++ b/tests/testcases/runtime/TC-RUNTIME-003.yml
@@ -30,8 +30,12 @@ steps:

   - name: Verify server listening in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== Server Status Check ==="
@@ -46,8 +50,12 @@ steps:

   - name: Check for runtime errors in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== Runtime Error Check ==="
@@ -71,8 +79,12 @@ steps:

   - name: Verify API request handling in logs
     command: |
-      cd docker
-      LOGS=$(docker compose logs 2>&1)
+      # Use log collector file if available, fallback to docker compose logs
+      if [ -f "/tmp/test-${TEST_ID}-logs.txt" ]; then
+        LOGS=$(cat /tmp/test-${TEST_ID}-logs.txt)
+      else
+        LOGS=$(cd docker && docker compose logs 2>&1)
+      fi

       echo "=== API Request Logs ==="