Add GitHub Actions CI/CD pipeline and test framework

- Add .github/workflows/build-test.yml for automated testing
- Add tests/ directory with TypeScript test runner
- Add docs/CICD.md documentation
- Remove .gitlab-ci.yml (migrated to GitHub Actions)
- Update .gitignore for test artifacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Shang Chieh Tseng
Date: 2025-12-15 14:06:44 +08:00
parent 2b5aeaf86b
commit d11140c016
23 changed files with 3014 additions and 50 deletions
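
The diffs below add the inference test suite. Each test case is a standalone YAML file with an id, name, suite, priority, timeout, optional dependencies, shell-command steps, and a free-text criteria block. As a rough sketch of how the TypeScript runner mentioned above might load and execute one such file — the file layout, field handling, and helper names here are assumptions, not code from this commit:

```typescript
// Hypothetical sketch of a YAML test-case loader/executor; not the actual runner in this commit.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { load } from "js-yaml"; // assumes the js-yaml package is available

interface TestStep {
  name: string;
  command: string;
  timeout?: number; // optional per-step override, in milliseconds
}

interface TestCase {
  id: string;
  name: string;
  suite: string;
  priority: number;
  timeout: number; // case-level default, in milliseconds
  dependencies?: string[];
  steps: TestStep[];
  criteria: string; // free-text expectations checked after the steps run
}

// Load one test-case file and run its steps sequentially, capturing each step's output.
function runTestCase(path: string): { step: string; output: string }[] {
  const tc = load(readFileSync(path, "utf8")) as TestCase;
  return tc.steps.map((step) => {
    const output = execSync(step.command, {
      shell: "/bin/bash",
      timeout: step.timeout ?? tc.timeout,
      encoding: "utf8",
    });
    return { step: step.name, output };
  });
}
```

The captured step output, together with the criteria text, would then feed whatever pass/fail logic the framework applies (not shown here).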


@@ -0,0 +1,30 @@
id: TC-INFERENCE-001
name: Model Pull
suite: inference
priority: 1
timeout: 600000
dependencies:
  - TC-RUNTIME-003
steps:
  - name: Check if model exists
    command: docker exec ollama37 ollama list | grep -q "gemma3:4b" && echo "Model exists" || echo "Model not found"
  - name: Pull model if needed
    command: docker exec ollama37 ollama list | grep -q "gemma3:4b" || docker exec ollama37 ollama pull gemma3:4b
    timeout: 600000
  - name: Verify model available
    command: docker exec ollama37 ollama list
criteria: |
  The gemma3:4b model should be available for inference.
  Expected:
  - Model is either already present or successfully downloaded
  - "ollama list" shows gemma3:4b in the output
  - No download errors
  Accept if model already exists (skip download).
  Model size is ~3GB, download may take time.
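
TC-INFERENCE-001 declares a dependency on TC-RUNTIME-003, a runtime-suite case not shown here, so the runner has to order cases before executing them. A rough dependency-aware ordering sketch, purely illustrative and not this commit's implementation:

```typescript
// Hypothetical scheduling sketch: run a case only after its declared dependencies,
// falling back to priority order. Types and names are assumptions, not this commit's code.
type CaseMeta = { id: string; priority: number; dependencies?: string[] };

function orderCases(cases: CaseMeta[]): CaseMeta[] {
  const known = new Set(cases.map((c) => c.id));
  const done = new Set<string>();
  const ordered: CaseMeta[] = [];
  const queue = [...cases].sort((a, b) => a.priority - b.priority);
  while (ordered.length < cases.length) {
    const next = queue.find(
      (c) =>
        !done.has(c.id) &&
        // Dependencies outside this run (e.g. a case from another suite) are assumed satisfied.
        (c.dependencies ?? []).every((d) => done.has(d) || !known.has(d)),
    );
    if (!next) break; // circular or unmet dependency: stop rather than loop forever
    done.add(next.id);
    ordered.push(next);
  }
  return ordered;
}
```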


@@ -0,0 +1,28 @@
id: TC-INFERENCE-002
name: Basic Inference
suite: inference
priority: 2
timeout: 180000
dependencies:
  - TC-INFERENCE-001
steps:
  - name: Run simple math question
    command: docker exec ollama37 ollama run gemma3:4b "What is 2+2? Answer with just the number." 2>&1
    timeout: 120000
  - name: Check GPU memory usage
    command: docker exec ollama37 nvidia-smi --query-compute-apps=pid,used_memory --format=csv 2>/dev/null || echo "No GPU processes"
criteria: |
  Basic inference should work on Tesla K80.
  Expected:
  - Model responds to the math question
  - Response should indicate "4" (accept variations: "4", "four", "The answer is 4", etc.)
  - GPU memory should be allocated during inference
  - No CUDA errors in output
  This is AI-generated output - accept reasonable variations.
  Focus on the model producing a coherent response.
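
The criteria above deliberately accept wording variations in the model's answer. Purely as an illustration of that lenient matching — not the framework's actual verdict logic, which isn't shown in this file — a check might look like:

```typescript
// Illustrative only: a lenient check in the spirit of the criteria above.
function answersFour(response: string): boolean {
  // Accept "4", "four", "The answer is 4", etc.; reject outputs that report CUDA failures.
  const hasFour = /\b(4|four)\b/i.test(response);
  const hasCudaError = /CUDA error|CUBLAS_STATUS_(?!SUCCESS)/i.test(response);
  return hasFour && !hasCudaError;
}
```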


@@ -0,0 +1,34 @@
id: TC-INFERENCE-003
name: API Endpoint Test
suite: inference
priority: 3
timeout: 120000
dependencies:
  - TC-INFERENCE-001
steps:
  - name: Test generate endpoint (non-streaming)
    command: |
      curl -s http://localhost:11434/api/generate \
        -d '{"model":"gemma3:4b","prompt":"Say hello in one word","stream":false}' \
        | head -c 500
  - name: Test generate endpoint (streaming)
    command: |
      curl -s http://localhost:11434/api/generate \
        -d '{"model":"gemma3:4b","prompt":"Count from 1 to 3","stream":true}' \
        | head -5
criteria: |
  Ollama REST API should handle inference requests.
  Expected for non-streaming:
  - Returns JSON with "response" field
  - Response contains some greeting (hello, hi, etc.)
  Expected for streaming:
  - Returns multiple JSON lines
  - Each line contains partial response
  Accept any valid JSON response. Content may vary.
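
The non-streaming request that the first step issues with curl maps directly onto a plain HTTP POST; a minimal TypeScript sketch of the same call (helper name and error handling are assumptions; requires Node 18+ for global fetch):

```typescript
// Hypothetical helper mirroring the curl call in the non-streaming step above.
async function generateOnce(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model: "gemma3:4b", prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  // Non-streaming responses are a single JSON object with a "response" field.
  const body = (await res.json()) as { response: string };
  return body.response;
}
```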


@@ -0,0 +1,32 @@
id: TC-INFERENCE-004
name: CUBLAS Fallback Verification
suite: inference
priority: 4
timeout: 120000
dependencies:
  - TC-INFERENCE-002
steps:
  - name: Check for CUBLAS errors in logs
    command: cd docker && docker compose logs 2>&1 | grep -i "CUBLAS_STATUS" | grep -v "SUCCESS" | head -10 || echo "No CUBLAS errors"
  - name: Check compute capability detection
    command: cd docker && docker compose logs 2>&1 | grep -iE "compute|capability|cc.*3" | head -10 || echo "No compute capability logs"
  - name: Verify no GPU errors
    command: cd docker && docker compose logs 2>&1 | grep -iE "error|fail" | grep -i gpu | head -10 || echo "No GPU errors"
criteria: |
  CUBLAS should work correctly on Tesla K80 using legacy fallback.
  Expected:
  - No CUBLAS_STATUS_ARCH_MISMATCH errors
  - No CUBLAS_STATUS_NOT_SUPPORTED errors
  - Compute capability 3.7 may be mentioned in debug logs
  - No fatal GPU-related errors
  The K80 uses legacy CUBLAS functions (cublasSgemmBatched)
  instead of modern Ex variants. This should work transparently.
  Accept warnings. Only fail on actual CUBLAS errors.
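
The last line of the criteria — accept warnings, fail only on actual CUBLAS errors — could be approximated mechanically; a rough illustration (the pattern choice is an assumption, not this commit's rule):

```typescript
// Illustrative classification of docker compose log lines per the criteria above:
// any CUBLAS_STATUS_* other than SUCCESS counts as a failure; warnings are tolerated.
function hasCublasFailure(logLines: string[]): boolean {
  return logLines.some(
    (line) => /CUBLAS_STATUS_[A-Z_]+/.test(line) && !line.includes("CUBLAS_STATUS_SUCCESS"),
  );
}
```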