Add GitHub Actions workflow for Tesla K80 CI/CD
- Tesla K80 build and test workflow with self-hosted runner
- Build using GCC 10 and CUDA 11.4 for Compute Capability 3.7
- Run unit tests, integration tests, and model inference tests
- Test gemma2:2b model loading and GPU acceleration
- Use Claude headless mode to analyze server logs and verify proper GPU initialization
- Upload logs, analysis results, and binary artifacts
- Comprehensive documentation in workflows README
.github/workflows/README.md (new file, vendored, 150 lines)
# GitHub Actions Workflows

## Tesla K80 CI Workflow

The `tesla-k80-ci.yml` workflow builds and tests ollama with CUDA Compute Capability 3.7 support using a self-hosted runner.

### Prerequisites

#### Self-Hosted Runner Setup
1. **Install the GitHub Actions runner on your Tesla K80 machine**:

   ```bash
   # Navigate to your repository on GitHub:
   # Settings > Actions > Runners > New self-hosted runner

   # Follow the provided instructions to download and configure the runner
   mkdir -p ~/actions-runner && cd ~/actions-runner
   curl -o actions-runner-linux-x64-2.XXX.X.tar.gz -L \
     https://github.com/actions/runner/releases/download/vX.XXX.X/actions-runner-linux-x64-2.XXX.X.tar.gz
   tar xzf ./actions-runner-linux-x64-2.XXX.X.tar.gz

   # Configure (use the token from GitHub)
   ./config.sh --url https://github.com/YOUR_USERNAME/ollama37 --token YOUR_TOKEN

   # Install and start as a service
   sudo ./svc.sh install
   sudo ./svc.sh start
   ```
2. **Verify the runner environment has**:
   - CUDA 11.4+ toolkit installed
   - GCC 10 at `/usr/local/bin/gcc` and `/usr/local/bin/g++`
   - CMake 3.24+
   - Go 1.24+ (or let the workflow install it)
   - NVIDIA driver with Tesla K80 support
   - Network access to download Go dependencies and models
   - **Claude CLI** installed and configured (`claude -p` must be available)
     - Install: follow the instructions at https://docs.claude.com/en/docs/claude-code/installation
     - The runner needs API access to use Claude for log analysis
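   A quick way to verify this checklist on the runner machine (a minimal sketch; it assumes `nvcc` and the Claude CLI are on the `PATH`):

   ```bash
   # Sanity-check the runner prerequisites
   nvidia-smi                    # NVIDIA driver sees the Tesla K80
   nvcc --version                # CUDA toolkit 11.4+
   /usr/local/bin/gcc --version  # GCC 10
   cmake --version               # CMake 3.24+
   go version                    # Go 1.24+ (optional if the workflow installs it)
   command -v claude             # Claude CLI available for log analysis
   ```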
3. **Optional: Add runner labels**:
   - You can add custom labels like `tesla-k80`, `cuda`, `gpu` during runner configuration
   - Then target specific runners by uncommenting the labeled `runs-on` line in the workflow
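   With labels configured, the job can target them like this (mirroring the commented-out `runs-on` line in `tesla-k80-ci.yml`):

   ```yaml
   jobs:
     build-and-test:
       runs-on: [self-hosted, linux, cuda, tesla-k80]
   ```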
#### Environment Variables (Optional)

You can set repository secrets or environment variables for:
- `OLLAMA_DEBUG=1` - Enable debug logging
- `OLLAMA_MODELS` - Custom model storage path
- Any other ollama configuration
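For example, these can be wired in as a job-level `env` block (a sketch; the `OLLAMA_MODELS` path is illustrative, not part of the committed workflow):

```yaml
jobs:
  build-and-test:
    env:
      OLLAMA_DEBUG: "1"
      OLLAMA_MODELS: /data/ollama-models # hypothetical storage path
```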
### Workflow Triggers

The workflow runs on:
- **Push** to `main` or `develop` branches
- **Pull requests** to the `main` branch
- **Manual dispatch** via the GitHub Actions UI
### Workflow Steps

1. **Environment Setup**: Check out code, install Go, display system info
2. **Build**: Clean previous builds, configure CMake with GCC 10, build the C++/CUDA components and the Go binary
3. **Unit Tests**: Run Go unit tests with the race detector
4. **Integration Tests**: Start the ollama server, wait for it to be ready, run the integration tests
5. **Model Tests**: Pull gemma2:2b, run inference, verify GPU acceleration
6. **Log Analysis**: Use Claude headless mode to validate that the model loaded properly on the Tesla K80
7. **Cleanup**: Stop the server, upload logs/artifacts
### Artifacts

- **ollama-logs-and-analysis** (always): Server logs, the Claude analysis prompt, and the analysis result
- **ollama-binary-{sha}** (on success): Compiled ollama binary for the commit
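Artifacts can be fetched after a run, for example with the GitHub CLI (assuming `gh` is installed and authenticated; `<run-id>` comes from `gh run list`):

```bash
# Download the logs-and-analysis bundle from a given workflow run
gh run download <run-id> --name ollama-logs-and-analysis
```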
### Log Analysis with Claude

The workflow uses Claude in headless mode (`claude -p`) to intelligently analyze the ollama server logs and verify proper Tesla K80 GPU initialization. This provides automated validation that:

1. **Model Loading**: gemma2:2b loaded without errors
2. **GPU Acceleration**: CUDA was properly detected and initialized for Compute 3.7
3. **No CPU Fallback**: The model is running on the GPU, not falling back to the CPU
4. **No Compatibility Issues**: No CUDA version warnings or errors
5. **Memory Allocation**: GPU memory was allocated successfully
6. **Inference Success**: Model inference completed without errors

**Analysis Results**:
- `PASS`: All checks passed; the model is working correctly on the GPU
- `WARN: <reason>`: The model works but there are warnings worth reviewing
- `FAIL: <reason>`: Critical issues were detected; the workflow fails

This approach is superior to simple grep/pattern matching because Claude can:
- Understand context and correlate multiple log entries
- Distinguish between critical errors and benign warnings
- Identify subtle issues like silent CPU fallback
- Provide human-readable explanations of problems

**Example**: If the logs show "CUDA initialization successful" but later "using CPU backend", Claude will catch the inconsistency and fail the test, while simple pattern matching might miss it.
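Stripped of the prompt construction, the analysis step reduces to a headless call plus a verdict check (a simplified sketch; the actual step in the workflow below also tolerates `WARN` results):

```bash
# Send the assembled prompt (instructions + appended server log) to Claude headless mode
claude -p "$(cat log_analysis_prompt.txt)" > log_analysis_result.txt

# Fail the job unless the verdict line starts with PASS
grep -q '^PASS' log_analysis_result.txt || exit 1
```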
### Customization

#### Testing different models

Uncomment and expand the "Test model operations" step:

```yaml
- name: Test model operations
  run: |
    ./ollama pull llama3.2:1b
    ./ollama run llama3.2:1b "test prompt" --verbose
    nvidia-smi # Verify the GPU was used
```
#### Running on specific branches

Modify the `on` section:

```yaml
on:
  push:
    branches: [ main, develop, feature/* ]
```
#### Scheduled runs

Add a cron schedule for nightly builds:

```yaml
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily
```
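Note that an `on:` block replaces the existing triggers rather than extending them, so in practice `schedule` is added alongside the current entries (Actions cron schedules run in UTC):

```yaml
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *' # 2 AM daily (UTC)
```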
### Troubleshooting

**Runner offline**: Check the runner service status:
```bash
sudo systemctl status actions.runner.*
```

**Build failures**: Check the uploaded logs under Actions > workflow run > Artifacts

**GPU not detected**: Verify that `nvidia-smi` works on the runner machine

**Permissions**: Ensure the runner user has access to the CUDA libraries and can bind to port 11434
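One way to check both permission conditions on the runner host (a sketch; it assumes `ss` and `ldconfig` are available):

```bash
# Is anything already bound to ollama's default port?
ss -ltn 'sport = :11434'

# Can the dynamic linker resolve the CUDA runtime?
ldconfig -p | grep libcudart
```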
### Security Considerations

- Self-hosted runners should be on a secure, isolated machine
- Consider using runner groups to restrict which repositories can use the runner
- Do not use self-hosted runners for public repositories (untrusted PRs)
- Keep the runner software updated
.github/workflows/tesla-k80-ci.yml (new file, vendored, 185 lines)
name: Tesla K80 Build and Test

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  workflow_dispatch: # Allow manual trigger

jobs:
  build-and-test:
    runs-on: self-hosted

    # Use specific labels if you want to target a particular self-hosted runner
    # runs-on: [self-hosted, linux, cuda, tesla-k80]

    timeout-minutes: 60 # Prevent hung jobs
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history for accurate versioning

      - name: Clean previous build
        run: |
          rm -rf build
          rm -f ollama

      - name: Configure CMake
        run: |
          CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake -B build
        env:
          CMAKE_BUILD_TYPE: Release

      - name: Build C++/CUDA components
        run: |
          CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --build build --config Release
        timeout-minutes: 30

      - name: Build Go binary
        run: |
          go build -v -o ollama .

      - name: Verify binary
        run: |
          ls -lh ollama
          file ollama
          ./ollama --version

      - name: Run Go unit tests
        run: |
          go test -v -race -timeout 10m ./...
        continue-on-error: false

      - name: Start ollama server (background)
        run: |
          ./ollama serve > ollama.log 2>&1 &
          echo $! > ollama.pid
          echo "Ollama server started with PID $(cat ollama.pid)"

      - name: Wait for server to be ready
        run: |
          for i in {1..30}; do
            if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
              echo "Server is ready!"
              exit 0
            fi
            echo "Waiting for server... attempt $i/30"
            sleep 2
          done
          echo "Server failed to start"
          cat ollama.log
          exit 1

      - name: Run integration tests
        run: |
          go test -v -timeout 20m ./integration/...
        continue-on-error: false

      - name: Clear server logs for model test
        run: |
          # Truncate the log file so the model-loading test starts fresh
          > ollama.log
      - name: Pull gemma2:2b model
        run: |
          echo "Pulling gemma2:2b model..."
          ./ollama pull gemma2:2b
          echo "Model pull completed"
        timeout-minutes: 15

      - name: Run inference with gemma2:2b
        run: |
          echo "Running inference test..."
          ./ollama run gemma2:2b "Hello, this is a test. Please respond with a short greeting." --verbose
          echo "Inference completed"
        timeout-minutes: 5
      - name: Wait for logs to flush
        run: sleep 3
      - name: Analyze server logs with Claude
        run: |
          echo "Analyzing ollama server logs for proper model loading..."

          # Create analysis prompt
          cat > log_analysis_prompt.txt << 'EOF'
          Analyze the following Ollama server logs from a Tesla K80 (CUDA Compute Capability 3.7) system.

          Verify that:
          1. The model loaded successfully without errors
          2. CUDA/GPU acceleration was properly detected and initialized
          3. The model is using the Tesla K80 GPU (not CPU fallback)
          4. There are no CUDA compatibility warnings or errors
          5. Memory allocation was successful
          6. Inference completed without errors

          Respond with:
          - "PASS" if all checks pass and the model loaded properly with GPU acceleration
          - "FAIL: <reason>" if there are critical issues
          - "WARN: <reason>" if there are warnings but the model works

          Be specific about what succeeded or failed. Look for CUDA errors, memory issues, or CPU fallback warnings.

          Server logs:
          ---
          EOF

          cat ollama.log >> log_analysis_prompt.txt

          # Run Claude in headless mode to analyze; -p expects the prompt text
          # itself, so pass the file contents rather than the filename
          claude -p "$(cat log_analysis_prompt.txt)" > log_analysis_result.txt

          echo "=== Claude Analysis Result ==="
          cat log_analysis_result.txt

          # Check if analysis passed
          if grep -q "^PASS" log_analysis_result.txt; then
            echo "✓ Log analysis PASSED - Model loaded correctly on Tesla K80"
            exit 0
          elif grep -q "^WARN" log_analysis_result.txt; then
            echo "⚠ Log analysis has WARNINGS - Review needed"
            cat log_analysis_result.txt
            exit 0 # Don't fail on warnings, but they're visible
          else
            echo "✗ Log analysis FAILED - Model loading issues detected"
            cat log_analysis_result.txt
            exit 1
          fi
      - name: Check GPU memory usage
        if: always()
        run: |
          echo "=== GPU Memory Status ==="
          nvidia-smi --query-gpu=memory.used,memory.total --format=csv

      - name: Stop ollama server
        if: always()
        run: |
          if [ -f ollama.pid ]; then
            kill $(cat ollama.pid) || true
            rm ollama.pid
          fi
          pkill -f "ollama serve" || true

      - name: Upload logs and analysis
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ollama-logs-and-analysis
          path: |
            ollama.log
            log_analysis_prompt.txt
            log_analysis_result.txt
            build/**/*.log
          retention-days: 7

      - name: Upload binary artifact
        if: success()
        uses: actions/upload-artifact@v4
        with:
          name: ollama-binary-${{ github.sha }}
          path: ollama
          retention-days: 14
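With `workflow_dispatch` enabled, a run can also be triggered from the command line (assuming the GitHub CLI is installed and authenticated for this repository):

```bash
# Trigger a manual run of the Tesla K80 workflow, then follow its progress
gh workflow run tesla-k80-ci.yml --ref main
gh run watch
```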