Add GitHub Actions workflow for Tesla K80 CI/CD

- Tesla K80 build and test workflow with self-hosted runner
- Build using GCC 10 and CUDA 11.4 for Compute Capability 3.7
- Run unit tests, integration tests, and model inference tests
- Test gemma2:2b model loading and GPU acceleration
- Use Claude headless mode to analyze server logs and verify proper GPU initialization
- Upload logs, analysis results, and binary artifacts
- Comprehensive documentation in workflows README
Author: Shang Chieh Tseng
Date: 2025-10-28 18:09:49 +08:00
Commit: 92acf0f91e (parent: fe0fd5b494)
2 changed files with 335 additions and 0 deletions

.github/workflows/README.md (new file, 150 lines)

# GitHub Actions Workflows
## Tesla K80 CI Workflow
The `tesla-k80-ci.yml` workflow builds and tests ollama with CUDA Compute Capability 3.7 support using a self-hosted runner.
### Prerequisites
#### Self-Hosted Runner Setup
1. **Install GitHub Actions Runner on your Tesla K80 machine**:
```bash
# Navigate to your repository on GitHub:
# Settings > Actions > Runners > New self-hosted runner
# Follow the provided instructions to download and configure the runner
mkdir -p ~/actions-runner && cd ~/actions-runner
curl -o actions-runner-linux-x64-2.XXX.X.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.XXX.X/actions-runner-linux-x64-2.XXX.X.tar.gz
tar xzf ./actions-runner-linux-x64-2.XXX.X.tar.gz
# Configure (use token from GitHub)
./config.sh --url https://github.com/YOUR_USERNAME/ollama37 --token YOUR_TOKEN
# Install and start as a service
sudo ./svc.sh install
sudo ./svc.sh start
```
2. **Verify that the runner environment has** (a quick check script is sketched after this list):
- CUDA 11.4+ toolkit installed
- GCC 10 at `/usr/local/bin/gcc` and `/usr/local/bin/g++`
- CMake 3.24+
- Go 1.24+ (or let the workflow install it)
- NVIDIA driver with Tesla K80 support
- Network access to download Go dependencies and models
- **Claude CLI** installed and configured (`claude -p` must be available)
- Install: Follow instructions at https://docs.claude.com/en/docs/claude-code/installation
- The runner needs API access to use Claude for log analysis
3. **Optional: Add runner labels**:
- You can add custom labels like `tesla-k80`, `cuda`, `gpu` during runner configuration
- Then target specific runners by uncommenting the labeled `runs-on` line in the workflow
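Before wiring the runner into CI, it can help to confirm the prerequisites above directly on the machine. The following is only a sketch and assumes the toolchain paths listed in step 2; adjust it to your setup:
```bash
# Prerequisite check for the self-hosted runner (sketch; adjust paths to your setup)
nvcc --version                       # CUDA toolkit, expect 11.4+
/usr/local/bin/gcc --version         # GCC 10 at the path the workflow expects
/usr/local/bin/g++ --version
cmake --version                      # expect 3.24+
go version || echo "Go not found (the workflow can install it)"
nvidia-smi -L                        # Tesla K80 should be listed
command -v claude >/dev/null || echo "Claude CLI not found; 'claude -p' is required"
```
Custom labels such as `tesla-k80`, `cuda`, and `gpu` can be supplied when configuring the runner, e.g. via `./config.sh ... --labels tesla-k80,cuda,gpu`.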
#### Environment Variables (Optional)
You can set repository secrets or environment variables for:
- `OLLAMA_DEBUG=1` - Enable debug logging
- `OLLAMA_MODELS` - Custom model storage path
- Any other ollama configuration
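When debugging the server by hand on the runner (outside the workflow), the same settings can simply be exported in the shell; the model path below is only an example:
```bash
# Sketch: start the server manually with the same settings the workflow would use
export OLLAMA_DEBUG=1                       # verbose server logging
export OLLAMA_MODELS="$HOME/ollama-models"  # example custom model storage path
./ollama serve > ollama.log 2>&1 &
```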
### Workflow Triggers
The workflow runs on:
- **Push** to `main` or `develop` branches
- **Pull requests** to `main` branch
- **Manual dispatch** via GitHub Actions UI
### Workflow Steps
1. **Environment Setup**: Checkout code, install Go, display system info
2. **Build**: Clean previous builds, configure CMake with GCC 10, build C++/CUDA components and Go binary
3. **Unit Tests**: Run Go unit tests with race detector
4. **Integration Tests**: Start ollama server, wait for ready, run integration tests
5. **Model Tests**: Pull gemma2:2b, run inference, verify GPU acceleration
6. **Log Analysis**: Use Claude headless mode to validate that the model loaded properly on the Tesla K80
7. **Cleanup**: Stop server, upload logs/artifacts
### Artifacts
- **ollama-logs-and-analysis** (always): Server logs, Claude analysis prompt, and analysis result
- **ollama-binary-{sha}** (on success): Compiled ollama binary for the commit
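The artifacts can also be fetched from the command line with the GitHub CLI; the run ID and commit SHA below are placeholders:
```bash
# Sketch: download artifacts for a recent run using the GitHub CLI (gh)
gh run list --workflow tesla-k80-ci.yml --limit 5      # find the run ID
gh run download <run-id> --name ollama-logs-and-analysis
gh run download <run-id> --name ollama-binary-<sha>
```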
### Log Analysis with Claude
The workflow uses Claude in headless mode (`claude -p`) to intelligently analyze ollama server logs and verify proper Tesla K80 GPU initialization. This provides automated validation that:
1. **Model Loading**: Gemma2:2b loaded without errors
2. **GPU Acceleration**: CUDA properly detected and initialized for Compute 3.7
3. **No CPU Fallback**: Model is running on GPU, not falling back to CPU
4. **No Compatibility Issues**: No CUDA version warnings or errors
5. **Memory Allocation**: Successful GPU memory allocation
6. **Inference Success**: Model inference completed without errors
**Analysis Results**:
- `PASS`: All checks passed, model working correctly with GPU
- `WARN: <reason>`: Model works but has warnings worth reviewing
- `FAIL: <reason>`: Critical issues detected, workflow fails
This approach is superior to simple grep/pattern matching because Claude can:
- Understand context and correlate multiple log entries
- Distinguish between critical errors and benign warnings
- Identify subtle issues like silent CPU fallback
- Provide human-readable explanations of problems
**Example**: If logs show "CUDA initialization successful" but later "using CPU backend", Claude will catch this inconsistency and fail the test, while simple pattern matching might miss it.
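The analysis can also be reproduced by hand against the `log_analysis_prompt.txt` saved in the logs artifact (which already has the server logs appended). A rough sketch, assuming the Claude CLI is configured locally:
```bash
# Sketch: re-run the log analysis locally on the prompt file saved by the workflow
claude -p "$(cat log_analysis_prompt.txt)" | tee log_analysis_result.txt
grep -q "^PASS" log_analysis_result.txt && echo "GPU checks passed" || echo "review the analysis above"
```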
### Customization
#### Testing different models
Add a model-test step (or adapt the existing pull and inference steps) along these lines:
```yaml
- name: Test model operations
  run: |
    ./ollama pull llama3.2:1b
    ./ollama run llama3.2:1b "test prompt" --verbose
    nvidia-smi # Verify GPU was used
```
#### Running on specific branches
Modify the `on` section:
```yaml
on:
  push:
    branches: [ main, develop, feature/* ]
```
#### Scheduled runs
Add cron schedule for nightly builds:
```yaml
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily
```
### Troubleshooting
**Runner offline**: Check runner service status
```bash
sudo systemctl status actions.runner.*
```
**Build failures**: Check uploaded logs in Actions > workflow run > Artifacts
**GPU not detected**: Verify `nvidia-smi` works on the runner machine
**Permissions**: Ensure runner user has access to CUDA libraries and can bind to port 11434
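A few quick commands on the runner machine usually narrow these down (a sketch; adjust to your environment):
```bash
# Sketch: quick checks for GPU visibility and permissions
nvidia-smi                                            # GPU visible to the runner user?
ldconfig -p | grep -i libcudart                       # CUDA runtime libraries on the loader path?
ss -ltnp | grep 11434 || echo "port 11434 is free"    # anything else bound to ollama's port?
```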
### Security Considerations
- Self-hosted runners should be on a secure, isolated machine
- Consider using runner groups to restrict which repositories can use the runner
- Do not use self-hosted runners for public repositories (untrusted PRs)
- Keep the runner software updated

.github/workflows/tesla-k80-ci.yml (new file, 185 lines)

name: Tesla K80 Build and Test

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  workflow_dispatch: # Allow manual trigger

jobs:
  build-and-test:
    runs-on: self-hosted
    # Use specific labels if you want to target a particular self-hosted runner
    # runs-on: [self-hosted, linux, cuda, tesla-k80]
    timeout-minutes: 60 # Prevent hung jobs
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history for accurate versioning
      - name: Clean previous build
        run: |
          rm -rf build
          rm -f ollama

      - name: Configure CMake
        run: |
          CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake -B build
        env:
          CMAKE_BUILD_TYPE: Release

      - name: Build C++/CUDA components
        run: |
          CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --build build --config Release
        timeout-minutes: 30

      - name: Build Go binary
        run: |
          go build -v -o ollama .

      - name: Verify binary
        run: |
          ls -lh ollama
          file ollama
          ./ollama --version

      - name: Run Go unit tests
        run: |
          go test -v -race -timeout 10m ./...
        continue-on-error: false
      - name: Start ollama server (background)
        run: |
          ./ollama serve > ollama.log 2>&1 &
          echo $! > ollama.pid
          echo "Ollama server started with PID $(cat ollama.pid)"

      - name: Wait for server to be ready
        run: |
          for i in {1..30}; do
            if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
              echo "Server is ready!"
              exit 0
            fi
            echo "Waiting for server... attempt $i/30"
            sleep 2
          done
          echo "Server failed to start"
          cat ollama.log
          exit 1

      - name: Run integration tests
        run: |
          go test -v -timeout 20m ./integration/...
        continue-on-error: false

      - name: Clear server logs for model test
        run: |
          # Truncate log file to start fresh for model loading test
          > ollama.log
      - name: Pull gemma2:2b model
        run: |
          echo "Pulling gemma2:2b model..."
          ./ollama pull gemma2:2b
          echo "Model pull completed"
        timeout-minutes: 15

      - name: Run inference with gemma2:2b
        run: |
          echo "Running inference test..."
          ./ollama run gemma2:2b "Hello, this is a test. Please respond with a short greeting." --verbose
          echo "Inference completed"
        timeout-minutes: 5

      - name: Wait for logs to flush
        run: sleep 3
      - name: Analyze server logs with Claude
        run: |
          echo "Analyzing ollama server logs for proper model loading..."

          # Create analysis prompt
          cat > log_analysis_prompt.txt << 'EOF'
          Analyze the following Ollama server logs from a Tesla K80 (CUDA Compute Capability 3.7) system.

          Verify that:
          1. The model loaded successfully without errors
          2. CUDA/GPU acceleration was properly detected and initialized
          3. The model is using the Tesla K80 GPU (not CPU fallback)
          4. There are no CUDA compatibility warnings or errors
          5. Memory allocation was successful
          6. Inference completed without errors

          Respond with:
          - "PASS" if all checks pass and model loaded properly with GPU acceleration
          - "FAIL: <reason>" if there are critical issues
          - "WARN: <reason>" if there are warnings but model works

          Be specific about what succeeded or failed. Look for CUDA errors, memory issues, or CPU fallback warnings.

          Server logs:
          ---
          EOF
          cat ollama.log >> log_analysis_prompt.txt

          # Run Claude in headless mode to analyze (pass the prompt text itself, not the file name)
          claude -p "$(cat log_analysis_prompt.txt)" > log_analysis_result.txt

          echo "=== Claude Analysis Result ==="
          cat log_analysis_result.txt

          # Check if analysis passed
          if grep -q "^PASS" log_analysis_result.txt; then
            echo "✓ Log analysis PASSED - Model loaded correctly on Tesla K80"
            exit 0
          elif grep -q "^WARN" log_analysis_result.txt; then
            echo "⚠ Log analysis has WARNINGS - Review needed"
            cat log_analysis_result.txt
            exit 0 # Don't fail on warnings, but they're visible
          else
            echo "✗ Log analysis FAILED - Model loading issues detected"
            cat log_analysis_result.txt
            exit 1
          fi
      - name: Check GPU memory usage
        if: always()
        run: |
          echo "=== GPU Memory Status ==="
          nvidia-smi --query-gpu=memory.used,memory.total --format=csv

      - name: Stop ollama server
        if: always()
        run: |
          if [ -f ollama.pid ]; then
            kill $(cat ollama.pid) || true
            rm ollama.pid
          fi
          pkill -f "ollama serve" || true

      - name: Upload logs and analysis
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ollama-logs-and-analysis
          path: |
            ollama.log
            log_analysis_prompt.txt
            log_analysis_result.txt
            build/**/*.log
          retention-days: 7

      - name: Upload binary artifact
        if: success()
        uses: actions/upload-artifact@v4
        with:
          name: ollama-binary-${{ github.sha }}
          path: ollama
          retention-days: 14