Commit Graph

4573 Commits

Author SHA1 Message Date
Shang Chieh Tseng
6383a2e036 Fix template literal syntax error in sed command
The dollar-brace in sed end-of-file pattern was interpreted as
TypeScript template literal interpolation. Use escaped syntax
to insert a literal dollar sign.
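The escape described above can be sketched in TypeScript (the file path and marker text here are illustrative, not the project's actual values). Inside a template literal, `${` starts interpolation, so sed's end-of-file address `$` must be written as `\$`:

```typescript
// Hypothetical sketch: building a sed command inside a TypeScript template
// literal. A bare `${a ...}` would be parsed as interpolation (a syntax
// error here), so the dollar sign is escaped with `\$`.
const file = "/tmp/session.log";
const marker = "===MARKER:END===";

// `\$` yields a literal `$`; sed's `$a text` appends after the last line.
const cmd = `sed -i '\$a ${marker}' ${file}`;

console.log(cmd); // sed -i '$a ===MARKER:END===' /tmp/session.log
```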

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 20:07:15 +08:00
Shang Chieh Tseng
6c97e5cd61 Rewrite LogCollector with file-based markers for crash resilience
Replace in-memory array storage with persistent file-based logging:
- Logs written to /tmp/ollama37-session-{timestamp}.log
- Text markers: ===MARKER:START:{ID}:{TIMESTAMP}=== for test boundaries
- Extract test-specific logs via sed to /tmp/test-{ID}-logs.txt
- Write queue prevents race conditions between log data and markers
- Line buffering prevents marker injection mid-line
- Auto-cleanup of session files older than 24 hours

Benefits:
- Crash resilient: logs persist even if test process dies
- Bounded memory: no array growth, only file I/O
- Precise boundaries: text markers unaffected by buffering delays

API unchanged - all existing integrations (executor, cli, test cases) work without modification.
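The marker scheme above can be sketched as follows (names and file layout are illustrative, not the project's actual API): log lines stream into one session file, each test is bracketed by text markers, and the test's slice is extracted afterwards.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// One session file per run; survives a crash of the test process.
const sessionFile = path.join(os.tmpdir(), `session-${Date.now()}.log`);

function writeLine(line: string): void {
  fs.appendFileSync(sessionFile, line + "\n"); // whole lines only: markers never land mid-line
}

function marker(kind: "START" | "END", id: string): void {
  writeLine(`===MARKER:${kind}:${id}:${Date.now()}===`);
}

// Pure-TS stand-in for the sed extraction step described above.
function extract(id: string): string[] {
  const lines = fs.readFileSync(sessionFile, "utf8").split("\n");
  const start = lines.findIndex((l) => l.startsWith(`===MARKER:START:${id}:`));
  const end = lines.findIndex((l) => l.startsWith(`===MARKER:END:${id}:`));
  return lines.slice(start + 1, end); // everything between the two markers
}

marker("START", "TC-001");
writeLine("model loaded");
marker("END", "TC-001");
console.log(extract("TC-001")); // [ 'model loaded' ]
```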

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 20:02:14 +08:00
Shang Chieh Tseng
e80f226507 Rewrite CICD.md to focus on design and philosophy
Replace command-heavy documentation with conceptual explanation:
- Project goal and infrastructure rationale
- Test framework philosophy (exit codes lie, logs tell truth)
- Dual-judge architecture design
- Log collection problem and solution
- Test execution flow
- Model unload strategy
- Design decisions with reasoning
- Known limitations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 17:58:41 +08:00
Shang Chieh Tseng
bf2c321626 Fix double-resolve bug in LogCollector.stop()
- stop() could resolve promise twice (from close event AND timeout)
- Add resolved flag to ensure single resolution
- Add comment about parallel execution limitation in executor
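The single-resolution guard described above is a standard pattern when a promise can be settled by either of two sources (here a close event or a fallback timer). This is a minimal sketch, not the project's actual `stop()`; the flag matters chiefly when side effects accompany resolution, since the promise itself ignores a second `resolve`.

```typescript
// Sketch: a promise that may be settled by a 'close' event OR a timeout
// must run its settlement logic exactly once.
function stop(closeAfterMs: number, timeoutMs: number): Promise<string> {
  return new Promise((resolve) => {
    let resolved = false;
    const settle = (reason: string) => {
      if (resolved) return; // second caller is ignored
      resolved = true;
      resolve(reason);
    };
    setTimeout(() => settle("close"), closeAfterMs); // stands in for the 'close' event
    setTimeout(() => settle("timeout"), timeoutMs);  // fallback timer
  });
}

stop(10, 50).then((r) => console.log(r)); // "close" fires first; the timeout is ignored
```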

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 17:48:20 +08:00
Shang Chieh Tseng
2c5094db92 Add LogCollector for precise test log boundaries
Problem: Tests used `docker compose logs --since=5m` which caused:
- Log overlap between tests
- Logs from previous tests included
- Missing logs if test exceeded 5 minutes

Solution:
- New LogCollector class runs `docker compose logs --follow`
- Marks test start/end boundaries
- Writes test-specific logs to /tmp/test-{testId}-logs.txt
- Test steps access via TEST_ID environment variable

Changes:
- tests/src/log-collector.ts: New LogCollector class
- tests/src/executor.ts: Integrate LogCollector, set TEST_ID env
- tests/src/cli.ts: Start/stop LogCollector for runtime/inference
- All test cases: Use log collector with fallback to docker compose

Also updated docs/CICD.md with:
- Test runner CLI documentation
- Judge modes (simple, llm, dual)
- Log collector integration
- Updated test case list (12b, 27b models)
- Model unload strategy
- Troubleshooting guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 17:46:49 +08:00
Shang Chieh Tseng
82ab6cc96e Refactor model unload: each test cleans up its own model
- TC-INFERENCE-003: Add unload step for gemma3:4b at end
- TC-INFERENCE-004: Remove redundant 4b unload at start
- TC-INFERENCE-005: Remove redundant 12b unload at start

Each model size test now handles its own VRAM cleanup.
Workflow-level unload remains as safety fallback for failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 17:20:44 +08:00
Shang Chieh Tseng
806232d95f Add multi-model inference tests for gemma3 12b and 27b
- TC-INFERENCE-004: gemma3:12b single GPU test
- TC-INFERENCE-005: gemma3:27b dual-GPU test (K80 layer split)
- Each test unloads previous model before loading next
- Workflows unload all 3 model sizes after inference suite
- 27b test verifies both GPUs have memory allocated
2025-12-17 17:01:25 +08:00
Shang Chieh Tseng
22e77e0dde Unload models from VRAM after use to free GPU memory
- Add unloadModel() method to LLMJudge class
- CLI calls unloadModel() after judging completes
- Workflows unload gemma3:4b after inference tests
- Uses Ollama API with keep_alive:0 to trigger unload
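The unload mechanism above relies on Ollama's `/api/generate` endpoint treating `keep_alive: 0` as "evict this model from VRAM now". A minimal sketch (URL and model name are examples, not the project's configuration):

```typescript
// Request body for an unload: no prompt, keep_alive 0.
function unloadBody(model: string): string {
  return JSON.stringify({ model, keep_alive: 0 });
}

// POST to the Ollama API; the server unloads the model instead of generating.
async function unloadModel(baseUrl: string, model: string): Promise<void> {
  const res = await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: unloadBody(model),
  });
  if (!res.ok) throw new Error(`unload failed: ${res.status}`);
}

// e.g. await unloadModel("http://localhost:11434", "gemma3:4b");
```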
2025-12-17 16:51:12 +08:00
Shang Chieh Tseng
7bb050f146 Change workflow defaults: judge_mode=dual, judge_model=gemma3:12b 2025-12-17 16:43:38 +08:00
Shang Chieh Tseng
b0c2a07190 Fix JSON output contamination in test runner
Change judge.ts progress message from console.log (stdout) to
process.stderr.write (stderr) to prevent 'Judging batch...' message
from contaminating JSON output when using --output json flag.
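The stream separation behind this fix can be sketched as follows (function names are illustrative): progress goes to stderr, machine-readable JSON goes to stdout, so `--output json > results.json` captures clean JSON while progress still reaches the terminal.

```typescript
// Human-facing progress: stderr, never mixed into the JSON stream.
function progress(msg: string): void {
  process.stderr.write(msg + "\n");
}

// Machine-facing results: stdout only.
function emitResults(results: unknown): string {
  const line = JSON.stringify(results);
  process.stdout.write(line + "\n");
  return line;
}

progress("Judging batch 1/3...");
emitResults({ passed: 3, failed: 0 });
```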
2025-12-17 16:08:01 +08:00
Shang Chieh Tseng
e06deff40f Enhance LLM judge prompt and add separate verdict display
- Add step results, timing context, and build notes to LLM prompt
- LLM now sees exit codes, durations, and simple judge result
- Add guidance that long build times within timeout are acceptable

- Add separate simple/LLM verdict tracking in dual-judge mode
- Console output shows both Simple and LLM pass/fail status
- JSON summary includes separate simple/llm breakdown
- Each test report includes simplePass/llmPass fields

This helps distinguish between simple judge failures (exit code != 0)
and LLM judge failures (semantic analysis), making debugging easier.
2025-12-17 15:04:05 +08:00
Shang Chieh Tseng
1e99c1bb50 Fix version injection for docker builds
- Add OLLAMA_VERSION build arg to Dockerfiles
- Update Makefile to pass version via --build-arg
- Add .env.example as local development reference
- Update build.yml to use cicd-1 environment for vars.OLLAMA_VERSION

Fixes #8

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 00:42:04 +08:00
Shang Chieh Tseng
f25016b7d5 Set version to 2.0.1 in build instructions
Update CLAUDE.md build command to include -ldflags that sets
the version properly instead of defaulting to 0.0.0.

Fixes #8

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 00:17:32 +08:00
Shang Chieh Tseng
ce2882b757 Fix runtime test log checks that require model loading
- Remove CUDA initialization checks from TC-RUNTIME-002 (ggml_cuda_init,
  load_backend only appear when a model is loaded, not at startup)
- Fix bash integer comparison error in TC-RUNTIME-003

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 00:00:24 +08:00
Shang Chieh Tseng
11329c5ee8 Replace README with Docker build documentation for Tesla K80 support 2025-12-16 23:31:43 +08:00
Shang Chieh Tseng
1a185f7926 Add comprehensive Ollama log checking and configurable LLM judge mode
Test case enhancements:
- TC-RUNTIME-001: Add startup log error checking (CUDA, CUBLAS, CPU fallback)
- TC-RUNTIME-002: Add GPU detection verification, CUDA init checks, error detection
- TC-RUNTIME-003: Add server listening verification, runtime error checks
- TC-INFERENCE-001: Add model loading logs, layer offload verification
- TC-INFERENCE-002: Add inference error checking (CUBLAS/CUDA errors)
- TC-INFERENCE-003: Add API request log verification, response time display

Workflow enhancements:
- Add judge_mode input (simple/llm/dual) to all workflows
- Add judge_model input to specify LLM model for judging
- Configurable via GitHub Actions UI without code changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 23:27:57 +08:00
Shang Chieh Tseng
143e6fa8e4 Improve UVM device check messaging in TC-RUNTIME-002
- Rename step to "Verify UVM device files" for clarity
- Add "WARNING:" prefix when UVM device is missing
- Add "SUCCESS:" prefix when device is present
- Add confirmation message after UVM fix is applied
- Separate ls command for cleaner output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 22:57:37 +08:00
Shang Chieh Tseng
c2f4f378cc Add dual-judge mode to test runner
New options:
- --dual-judge: Run both simple and LLM judge, fail if either fails
- --judge-url: Separate LLM Judge server URL (default: localhost:11435)
- --judge-model: Model for LLM judging (default: gemma3:4b)

Dual judge logic:
- Simple judge checks exit codes
- LLM judge analyzes logs semantically
- Final result: FAIL if either judge says FAIL
- Combines reasons from both judges on failure
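The combination rule above can be sketched like this (the `Verdict` type is illustrative, not the project's actual interface): fail if either judge fails, and merge the failure reasons.

```typescript
interface Verdict {
  pass: boolean;
  reason?: string;
}

// FAIL if either judge says FAIL; reasons from failing judges are joined.
function combine(simple: Verdict, llm: Verdict): Verdict {
  const pass = simple.pass && llm.pass;
  const reasons = [simple, llm]
    .filter((v) => !v.pass && v.reason)
    .map((v) => v.reason as string);
  return { pass, reason: reasons.length ? reasons.join("; ") : undefined };
}

console.log(combine({ pass: true }, { pass: false, reason: "CUDA error in logs" }));
// → { pass: false, reason: 'CUDA error in logs' }
```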

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 22:58:28 +08:00
Shang Chieh Tseng
6b84acd7d7 Add LLM Judge container infrastructure
- Add cicd/docker-compose.judge.yml for stable reference Ollama
- Runs on port 11435 (separate from test subject on 11434)
- Uses dogkeeper886/ollama37:latest from DockerHub
- Add cicd/README.md documenting CI infrastructure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 22:25:13 +08:00
Shang Chieh Tseng
0e66cc6f93 Fix workflows to fail on test failures
The '|| true' was swallowing test runner exit codes, causing workflows
to pass even when tests failed. Added separate 'Check test results'
step that reads JSON summary and fails workflow if any tests failed.

Affected workflows:
- build.yml
- runtime.yml
- inference.yml
- full-pipeline.yml
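The result-check step described above can be sketched as a small helper (field names in the summary are assumed, not taken from the project): instead of letting `|| true` mask the runner's exit code, a separate step reads the JSON summary and fails explicitly when any test failed.

```typescript
// Returns the exit code the workflow step should use: nonzero iff any
// test in the JSON summary failed.
function exitCodeFor(summaryJson: string): number {
  const summary = JSON.parse(summaryJson) as { failed: number };
  return summary.failed > 0 ? 1 : 0;
}

console.log(exitCodeFor('{"passed": 5, "failed": 1}')); // 1 → step fails the workflow
```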

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 21:48:40 +08:00
Shang Chieh Tseng
f59834c531 Improve test runner logging
- Strip ANSI escape codes from stdout/stderr to reduce log size
  (spinner animations were ~95% of inference log size)
- Add [TIMEOUT] indicator when commands are killed due to timeout
  for clearer failure diagnosis
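The ANSI stripping above can be sketched with a single regex (a minimal version covering CSI sequences; the project's pattern may differ): spinner animations emit sequences like `\x1b[2K\x1b[1G` on every frame, which dominate captured logs until removed.

```typescript
// Matches CSI escape sequences: ESC '[' parameters final-byte.
const ANSI_RE = /\x1b\[[0-9;]*[A-Za-z]/g;

function stripAnsi(s: string): string {
  return s.replace(ANSI_RE, "");
}

console.log(stripAnsi("\x1b[2K\x1b[1Gloading\x1b[31m!\x1b[0m")); // "loading!"
```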

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 21:45:33 +08:00
Shang Chieh Tseng
ebcca9f483 Add model warmup step to TC-INFERENCE-001
Tesla K80 needs ~60-180s to load model into VRAM on first inference.
Add warmup step with 5-minute timeout to preload model before
subsequent inference tests run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 21:38:09 +08:00
Shang Chieh Tseng
3f3f68f08d Remove TC-INFERENCE-004: CUBLAS Fallback Verification
Redundant test - if TC-INFERENCE-002 (Basic Inference) passes,
CUBLAS fallback is already working. Any errors would cause
inference to fail, making a separate error-check test unnecessary.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 20:38:35 +08:00
Shang Chieh Tseng
8d65fd4211 Update TC-RUNTIME-002 to handle UVM device workaround
- Add step to check/create /dev/nvidia-uvm device files
- Use nvidia-modprobe -u -c=0 if UVM devices missing
- Restart container after creating UVM devices
- Update criteria to clarify GPU detection requirements
- Increase timeout to 120s for container restart

Fixes issue where nvidia-smi works but Ollama only detects CPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 19:58:16 +08:00
Shang Chieh Tseng
23c92954d7 Fix Unicode encoding for CI compatibility
Replace Unicode characters with ASCII equivalents:
- Line separators: '─' -> '-'
- Pass indicator: '✓' -> '[PASS]'
- Fail indicator: '✗' -> '[FAIL]'

GitHub Actions terminal has encoding issues with UTF-8 chars.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 19:06:07 +08:00
Shang Chieh Tseng
52ccb96a01 Reduce TC-BUILD-002 timeout to 30 minutes
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 19:02:27 +08:00
Shang Chieh Tseng
03da57629e Increase TC-BUILD-002 timeout to 60 minutes and improve logging
- Timeout: 900s -> 3600s (60 min) for runtime image build
- Add tee to capture full build log to /tmp/build-runtime.log
- Add step to show last 200 lines of build log for debugging
- Helps diagnose build failures with proper log capture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 18:31:28 +08:00
Shang Chieh Tseng
fb01b8b1ca Split monolithic workflow into modular components
Separate workflows for flexibility:
- build.yml: Build verification (standalone + reusable)
- runtime.yml: Container & runtime tests with container lifecycle
- inference.yml: Inference tests with optional container management
- full-pipeline.yml: Orchestrates all stages with LLM judge

Each workflow can be triggered independently for targeted testing,
or run the full pipeline for complete validation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 17:57:11 +08:00
Shang Chieh Tseng
54248f42b0 Improve CI test transparency with dual-stream output
- Separate progress output (stderr) from JSON results (stdout)
- Add timestamps, test counters, and step progress to executor
- Update CLI to use stderr for progress messages
- Update workflow to capture JSON to file while showing progress
- Add --silent flag to suppress npm banner noise

This allows real-time visibility into test execution during CI runs
while preserving clean JSON output for artifact collection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 17:50:32 +08:00
Shang Chieh Tseng
45e1e6c8b7 Change workflow to manual trigger only
Remove automatic triggers on push/PR to avoid slow builds running
unexpectedly. Use workflow_dispatch for manual control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 14:38:09 +08:00
Shang Chieh Tseng
313642d6f1 Fix runner labels and improve job names
- Change runs-on from [self-hosted, k80, cuda11] to self-hosted
- Rename job names for clarity:
  - Build Docker Images -> Build Verification
  - Runtime Tests -> Container & Runtime Tests
  - Cleanup -> Cleanup & Summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 14:23:00 +08:00
Shang Chieh Tseng
d11140c016 Add GitHub Actions CI/CD pipeline and test framework
- Add .github/workflows/build-test.yml for automated testing
- Add tests/ directory with TypeScript test runner
- Add docs/CICD.md documentation
- Remove .gitlab-ci.yml (migrated to GitHub Actions)
- Update .gitignore for test artifacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 14:06:44 +08:00
Shang Chieh Tseng
2b5aeaf86b Update .gitlab-ci.yml file 2025-12-14 09:06:23 +00:00
Shang Chieh Tseng
b5dac79d2c Update README.md 2025-11-13 18:14:16 +08:00
Shang Chieh Tseng
68f9b1580e Add timing instrumentation and user progress messages for model loading
Problem: Model loading takes 2-3 minutes on first load with no user feedback,
causing confusion about whether the system is frozen or working.

Root Cause: GPU initialization (reserveWorstCaseGraph) takes ~164 seconds on
Tesla K80 GPUs due to CUDA kernel compilation (PTX JIT for compute 3.7). This
is by design - it validates GPU compatibility before committing to full load.

Solution:
1. Add comprehensive timing instrumentation to identify bottlenecks
2. Add user-facing progress messages explaining the delay

Changes:
- cmd/cmd.go: Update spinner with informative message for users
- llama/llama.go: Add timing logs for CGO model loading
- runner/llamarunner/runner.go: Add detailed timing for llama runner
- runner/ollamarunner/runner.go: Add timing + stderr messages for new engine
- server/sched.go: Add timing for scheduler load operation

User Experience:
Before: Silent wait with blinking cursor for 2-3 minutes
After: Rotating spinner with message "loading model (may take 1-3 min on first load)"

Performance Metrics Captured:
- GGUF file reading: ~0.4s
- GPU kernel compilation: ~164s (bottleneck identified)
- Model weight loading: ~0.002s
- Total end-to-end: ~165s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-12 19:09:37 +08:00
Shang Chieh Tseng
84210db18a Delete useless files 2025-11-12 12:52:36 +08:00
Shang Chieh Tseng
4cf745b40a Update README.md 2025-11-12 12:50:13 +08:00
Shang Chieh Tseng
8d376e0f9b Add local development build support to Docker build system
Extends the Docker Makefile with targets for building from local source code without pushing to GitHub, enabling faster iteration during development.

New build targets:
- build-runtime-local: Build from local source with cache
- build-runtime-local-no-cache: Full rebuild from local source
- build-runtime-no-cache: Force fresh GitHub clone without cache

Added docker/runtime/Dockerfile.local for local source builds, mirroring the GitHub-based Dockerfile structure but using COPY instead of git clone.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-12 06:51:05 +08:00
Shang Chieh Tseng
7d9b59c520 Improve GPU detection and add detailed model loading logs
1. Fix binary path resolution using symlink (docker/runtime/Dockerfile)
   - Build binary to source directory (./ollama)
   - Create symlink from /usr/local/bin/ollama to /usr/local/src/ollama37/ollama
   - Allows ml/path.go to resolve libraries via filepath.EvalSymlinks()
   - Fixes "total vram=0 B" issue without requiring -w flag

2. Add comprehensive logging for model loading phases (llm/server.go)
   - Log runner subprocess startup and readiness
   - Log each memory allocation phase (FIT, ALLOC, COMMIT)
   - Log layer allocation adjustments during convergence
   - Log when model weights are being loaded (slowest phase)
   - Log progress during waitUntilRunnerLaunched (every 1s)
   - Improves visibility during 1-2 minute first-time model loads

3. Fix flash attention compute capability check (ml/device.go)
   - Changed DriverMajor to ComputeMajor for correct capability detection
   - Flash attention requires compute capability >= 7.0, not driver version

These changes improve user experience during model loading by providing
clear feedback at each stage, especially during the slow COMMIT phase
where GGUF weights are loaded and CUDA kernels compile.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 23:28:00 +08:00
Shang Chieh Tseng
db00f2d5f4 Create dockerhub-readme.md 2025-11-10 20:35:43 +08:00
Shang Chieh Tseng
738a8ba2da Improve Docker runtime Dockerfile documentation and accuracy
Corrects misleading architecture description and enhances code comments:
- Fix header: change "two-stage build" to accurate "single-stage build"
- Remove obsolete multi-stage build artifacts (builder/runtime aliases)
- Clarify LD_LIBRARY_PATH purpose during CMake configuration
- Document parallel compilation benefit (-j flag)
- Explain health check validation scope (API + model registry)
- Add specific library path location to header comments

This aligns with the CLAUDE.md documentation policy of adding helpful
comments to improve code maintainability and debugging experience.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 14:18:08 +08:00
Shang Chieh Tseng
4810471b33 Redesign Docker build system to two-stage architecture with builder/runtime separation
Redesigned the Docker build system from a single-stage monolithic design to a clean
two-stage architecture that separates build environment from compilation process while
maintaining library path compatibility.

## Architecture Changes

### Builder Image (docker/builder/Dockerfile)
- Provides base environment: CUDA 11.4, GCC 10, CMake 4, Go 1.25.3
- Built once, cached for subsequent builds (~90 min first time)
- Removed config file copying (cuda-11.4.sh, gcc-10.conf, go.sh)
- Added comprehensive comments explaining each build step
- Added git installation for runtime stage source cloning

### Runtime Image (docker/runtime/Dockerfile)
- Two-stage build using ollama37-builder as base for BOTH stages
- Stage 1 (compile): Clone source from GitHub → CMake configure → Build C/C++/CUDA → Build Go
- Stage 2 (runtime): Copy artifacts from stage 1 → Setup environment → Configure server
- Both stages use identical base image to ensure library path compatibility
- Removed -buildvcs=false flag (VCS info embedded from git clone)
- Comprehensive comments documenting library paths and design rationale

### Makefile (docker/Makefile)
- Simplified from 289 to 145 lines (-50% complexity)
- Removed: run, stop, logs, shell, test targets (use docker-compose instead)
- Removed: build orchestration targets (start-builder, copy-source, run-cmake, etc.)
- Removed: artifact copying (handled internally by multi-stage build)
- Focus: Build images only (build, build-builder, build-runtime, clean, help)
- All runtime operations delegated to docker-compose.yml

### Documentation (docker/README.md)
- Completely rewritten for new two-stage architecture
- Added "Build System Components" section with file structure
- Documented why both runtime stages use builder base (library path compatibility)
- Updated build commands to use Makefile
- Updated runtime commands to use docker-compose
- Added comprehensive troubleshooting section
- Added build time and image size tables
- Reference to archived single-stage design

## Key Design Decision

**Problem**: Compiled binaries have hardcoded library paths
**Solution**: Use ollama37-builder as base for BOTH compile and runtime stages
**Trade-off**: Larger image (~18GB) vs guaranteed library compatibility

## Benefits

- Cleaner separation of concerns (builder env vs compilation vs runtime)
- Builder image cached after first build (90 min → <1 min rebuilds)
- Runtime rebuilds only take ~10 min (pulls latest code from GitHub)
- No library path mismatches (identical base images)
- No complex artifact extraction (multi-stage COPY)
- Simpler Makefile focused on image building
- Runtime management via docker-compose (industry standard)

## Files Changed

Modified:
- docker/builder/Dockerfile - Added comments, removed COPY config files
- docker/runtime/Dockerfile - Converted to two-stage build
- docker/Makefile - Simplified to focus on image building only
- docker/README.md - Comprehensive rewrite for new architecture

Deleted:
- docker/builder/README.md - No longer needed
- docker/builder/cuda-11.4.sh - Generated in Dockerfile
- docker/builder/gcc-10.conf - Generated in Dockerfile
- docker/builder/go.sh - Generated in Dockerfile

Archived:
- docker/Dockerfile → docker/Dockerfile.single-stage.archived

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 13:14:49 +08:00
Shang Chieh Tseng
6dbd8ed44e Redesign Docker build system to single-stage architecture for reliable model loading
Replaced complex two-stage build (builder → runtime) with single-stage
Dockerfile that builds and runs Ollama in one image. This fixes model
loading issues caused by missing CUDA libraries and LD_LIBRARY_PATH
mismatches in the previous multi-stage design.

Changes:
- Add docker/Dockerfile: Single-stage build with GCC 10, CMake 4, Go 1.25.3, CUDA 11.4
- Clone source from https://github.com/dogkeeper886/ollama37
- Compile Ollama with "CUDA 11" preset for Tesla K80 (compute capability 3.7)
- Keep complete CUDA toolkit and all libraries in final image (~20GB)
- Update docker-compose.yml: Simplified config, use ollama37:latest image
- Update docker/README.md: New build instructions and architecture docs

Trade-off: Larger image size (~20GB vs ~3GB) for guaranteed compatibility
and reliable GPU backend operation. All libraries remain accessible with
correct paths, ensuring models load properly on Tesla K80.

Tested: Successfully runs gemma3:1b on Tesla K80

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 09:19:22 +08:00
Shang Chieh Tseng
0293c53746 Fix Docker container to run as host user and use host .ollama directory
This change prevents permission issues when using Ollama both locally and
in Docker by:
- Running container as host user (UID/GID) instead of root
- Mounting host's $HOME/.ollama directory using environment variables
- Setting HOME environment variable in container

This allows both the local binary and Docker container to share the same
model data without permission conflicts or duplication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-09 18:00:42 +08:00
Shang Chieh Tseng
8380ca93f8 Fix Docker build system: add library paths, GCC 10 runtime libs, and Go build flags
- Add LD_LIBRARY_PATH to CMake and build steps for GCC 10 libraries
- Copy GCC 10 runtime libraries (libstdc++.so.6, libgcc_s.so.1) to output
- Update runtime Dockerfile to use minimal CUDA runtime packages
- Add -buildvcs=false flag to Go build to avoid Git VCS errors
- Simplify runtime container to only include necessary CUDA libraries
- Fix library path configuration for proper runtime library loading
2025-11-09 00:05:12 +08:00
Shang Chieh Tseng
6237498297 Fix Makefile to use custom-built GCC 10 instead of non-existent gcc-toolset-10
- Replace 'scl enable gcc-toolset-10' with 'bash -l' (login shell)
- Login shell sources /etc/profile.d/cuda-11.4.sh and go.sh for PATH
- Explicitly set CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ (custom-built GCC 10)
- Fix run-cmake, run-build, run-go-build, and shell targets
- Enables CMake to find nvcc and use correct compiler toolchain
2025-11-08 21:20:26 +08:00
Shang Chieh Tseng
f2c94bb9af Add Docker builder image with CUDA 11.4, GCC 10, CMake 4, and Go 1.25.3
- Build CUDA 11.4 toolkit from NVIDIA repository (for K80 compute 3.7 support)
- Build GCC 10 from source (required for CUDA 11.4 compatibility)
- Build CMake 4.0.0 from source (latest version)
- Install Go 1.25.3 from official tarball
- Configure library paths via /etc/ld.so.conf.d/gcc-10.conf and ldconfig
- Add /etc/profile.d scripts for interactive shell PATH setup
- Use ENV statements for Docker build-time and runtime PATH configuration
- Switch from nvidia/cuda base image to rockylinux:8 for full control
2025-11-08 21:03:38 +08:00
Shang Chieh Tseng
71fc994a63 Fix Docker build: clean host artifacts after copy to prevent conflicts
- Add cleanup step in copy-source target to remove build/, ollama, and dist/
- Prevents host build artifacts from interfering with container builds
- Ensures clean build environment when switching between host and Docker workflows
- docker cp doesn't respect .dockerignore, so explicit cleanup is needed
2025-11-08 17:16:46 +08:00
Shang Chieh Tseng
94bbfbb2e7 Add Docker-based build system with GPU-enabled builder and runtime containers 2025-11-07 12:48:05 +08:00
Shang Chieh Tseng
5744fb792a Remove hardcoded compiler paths from CMakePresets.json for portability
- Remove CMAKE_C_COMPILER and CMAKE_CXX_COMPILER from CUDA 11 presets
- Allows CMake to auto-detect system GCC instead of hardcoding /usr/local/bin/gcc
- Improves portability across different systems (host, Docker containers, etc.)
- Users can still override compiler via CC/CXX environment variables if needed
2025-11-06 23:38:46 +08:00