Commit Graph

4517 Commits

Shang Chieh Tseng
40b956b23c Fix false positive CPU backend error in test configuration
The test configuration was treating 'CPU backend' as a failure pattern,
but this is incorrect. Loading the CPU backend library is normal - ollama
loads both CUDA and CPU backends for fallback operations.

The log line 'load_backend: loaded CPU backend from libggml-cpu-*.so'
is a success message, not an error.

Changed failure patterns from:
- 'CPU backend' (too broad, matches normal loading)
- 'failed to load.*CUDA' (too specific)

To more accurate patterns:
- 'failed to load.*backend' (matches actual load failures)
- 'backend.*failed' (matches failure messages)

This prevents false positives while still catching real backend failures.
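As a hedged illustration of why the old pattern misfired, here is a minimal Go sketch (the pattern lists are normally data in the test configuration, not Go literals):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The log line the old configuration wrongly flagged as a failure.
	line := "load_backend: loaded CPU backend from libggml-cpu-*.so"

	oldFailure := regexp.MustCompile(`CPU backend`) // too broad
	newFailures := []*regexp.Regexp{
		regexp.MustCompile(`failed to load.*backend`), // actual load failures
		regexp.MustCompile(`backend.*failed`),         // failure messages
	}

	fmt.Println("old pattern (false positive):", oldFailure.MatchString(line)) // true
	for _, p := range newFailures {
		fmt.Printf("%-30s %v\n", p.String(), p.MatchString(line)) // both false
	}
}
```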
2025-10-30 16:00:20 +08:00
Shang Chieh Tseng
1906882ce6 Fix test-runner log monitor to properly follow log file
The log monitor was using bufio.Scanner which doesn't automatically
follow file growth like 'tail -f'. When scanner reached EOF, it would
stay at EOF even as new lines were written to the log file.

This caused GPU detection to fail because the GPU-related log lines
were written after the scanner reached EOF, so they were never processed.

Solution: Switch to bufio.Reader.ReadString() which properly handles
reading from a growing file by returning io.EOF when no data is available,
allowing us to wait and retry while keeping the file position.
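A minimal sketch of that follow-read loop, assuming an already-opened log file and an illustrative 100 ms poll interval:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"time"
)

func followLog(f *os.File) error {
	r := bufio.NewReader(f)
	pending := "" // holds a partial line returned together with io.EOF
	for {
		chunk, err := r.ReadString('\n')
		pending += chunk
		if err == io.EOF {
			// No complete line yet. Unlike bufio.Scanner, the reader keeps
			// its position, so wait for the file to grow and read again.
			time.Sleep(100 * time.Millisecond)
			continue
		}
		if err != nil {
			return err
		}
		fmt.Print(pending) // one complete line, newline included
		pending = ""
	}
}

func main() {
	f, err := os.Open("ollama-server.log") // illustrative path
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := followLog(f); err != nil {
		panic(err)
	}
}
```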
2025-10-30 15:55:20 +08:00
Shang Chieh Tseng
f1d4c7f969 Fix test config: don't treat CPU backend loading as failure
The failure pattern 'CPU backend' was incorrectly flagging the normal log
message 'load_backend: loaded CPU backend from...' as an error. This is
expected behavior - both CUDA and CPU backends are loaded, but GPU is
actually used for computation (as shown by 'offloaded 35/35 layers to GPU').

Changed failure patterns to detect actual GPU failures:
- Removed: 'CPU backend' (too broad, catches normal backend loading)
- Added: 'failed to load.*CUDA' (actual load failures)
- Added: 'no GPU detected' (GPU not available)

Root cause: monitor.go processes failure patterns first (highest priority),
so the 'CPU backend' pattern was creating EventError events before success
patterns could be checked, causing tests to fail despite GPU working.
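A schematic of that priority ordering (the classify helper and the non-error event names are assumptions, not the actual monitor.go API):

```go
package monitor

import "regexp"

// classify mirrors the ordering described above: failure patterns are
// checked first, so one over-broad failure pattern turns a normal startup
// line into an error event before any success pattern is consulted.
func classify(line string, failures, successes []*regexp.Regexp) string {
	for _, p := range failures { // highest priority
		if p.MatchString(line) {
			return "EventError"
		}
	}
	for _, p := range successes {
		if p.MatchString(line) {
			return "EventSuccess" // assumed name
		}
	}
	return "EventInfo" // assumed name
}
```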
2025-10-30 15:39:17 +08:00
Shang Chieh Tseng
6bbdf3e148 Fix test-runner GPU detection by preserving startup events
The log monitor was calling Reset() before each model test, which cleared
all GPU detection events that occurred during server startup. This caused
the validation to fail with 'GPU acceleration not detected' even though
GPU was being used successfully.

Root cause: GPU detection logs are written during server startup
(lines like 'offloaded 35/35 layers to GPU'), but monitor.Reset() was
clearing these events before validation could check them.

Solution: Comment out the monitor.Reset() call to preserve GPU detection
events from server startup. These events are still relevant for validating
that the model is using GPU acceleration.
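A rough sketch of the lifecycle problem, with all type and method names assumed for illustration:

```go
package monitor

import "strings"

type Monitor struct{ events []string }

func (m *Monitor) Record(line string) { m.events = append(m.events, line) }

// Reset clears everything recorded so far. Because lines like
// "offloaded 35/35 layers to GPU" only appear during server startup,
// calling Reset before each model test erases the only evidence the
// GPU check can look at; skipping the call preserves it.
func (m *Monitor) Reset() { m.events = nil }

func (m *Monitor) GPUDetected() bool {
	for _, line := range m.events {
		if strings.Contains(line, "layers to GPU") {
			return true
		}
	}
	return false
}
```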
2025-10-30 15:27:40 +08:00
Shang Chieh Tseng
d9d3f7b0b4 Fix GitHub Actions workflows to upload build libraries and remove LD_LIBRARY_PATH
Changes:
- Update tesla-k80-ci.yml to upload build/lib/ollama/ containing CUDA backend
- Remove all LD_LIBRARY_PATH environment variables (no longer needed with RPATH)
- Test workflows now receive libggml-cuda.so enabling GPU offload

This fixes the issue where test workflows couldn't offload to GPU because the
CUDA backend library wasn't included in the artifact.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 15:08:34 +08:00
Shang Chieh Tseng
d8ea75a3e2 Fix test-runner to inherit LD_LIBRARY_PATH for CUDA backend loading
The test-runner was starting the ollama server subprocess without inheriting
environment variables, causing the GGML CUDA backend to fail loading even
though LD_LIBRARY_PATH was set in the GitHub Actions workflow.

Changes:
- Added s.cmd.Env = os.Environ() to inherit all environment variables
- This ensures LD_LIBRARY_PATH is passed to the ollama server subprocess
- Fixes GPU offloading failure where layers were not being loaded to GPU

Root cause analysis from logs:
- GPUs were detected: Tesla K80 with 11.1 GiB available
- Server scheduled 35 layers for GPU offload
- But actual offload was 0/35 layers (all stayed on CPU)
- Runner subprocess couldn't find CUDA libraries without LD_LIBRARY_PATH

This fix ensures the runner subprocess can dynamically load libggml-cuda.so
by inheriting the CUDA library paths from the parent process.
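A minimal sketch of the fix described above (the serve invocation is illustrative):

```go
package main

import (
	"os"
	"os/exec"
)

// startServer launches the ollama server subprocess. Assigning os.Environ()
// to cmd.Env explicitly forwards the parent's environment, including
// LD_LIBRARY_PATH, so the runner can load libggml-cuda.so. (In os/exec a nil
// Env also inherits the environment, so this matters as soon as Env is
// populated with anything else.)
func startServer() (*exec.Cmd, error) {
	cmd := exec.Command("ollama", "serve") // illustrative invocation
	cmd.Env = os.Environ()
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd, cmd.Start()
}

func main() {
	cmd, err := startServer()
	if err != nil {
		panic(err)
	}
	_ = cmd.Wait()
}
```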
2025-10-30 14:08:24 +08:00
Shang Chieh Tseng
c022e79e77 Add LD_LIBRARY_PATH to GitHub Actions workflows for CUDA library discovery
Set LD_LIBRARY_PATH in all workflow steps to ensure CUDA 11.4 libraries
are found during both compile time and runtime. This fixes the issue where
the GGML CUDA backend (libggml-cuda.so) fails to load when running
'ollama serve'.

Library paths added:
- /usr/local/cuda-11.4/lib64
- /usr/local/cuda-11.4/targets/x86_64-linux/lib
- /usr/lib64
- /usr/local/lib64

Updated workflows:
- tesla-k80-ci.yml: CMake configure, C++/CUDA build, Go build, binary verify
- tesla-k80-single-gpu-tests.yml: All test execution steps
- tesla-k80-multi-gpu-tests.yml: All test execution steps
2025-10-30 13:28:44 +08:00
Shang Chieh Tseng
bc8992d014 Add RPATH for CUDA libraries in Linux builds
- Configure CMakeLists.txt to embed RPATH for CUDA library paths
- Includes $ORIGIN, CUDA 11.4 paths, and system library paths
- Eliminates need for LD_LIBRARY_PATH at runtime
- Binary can now find CUDA libraries automatically
2025-10-30 12:51:59 +08:00
Shang Chieh Tseng
46f1038724 Fix Claude validation response format parsing
The Claude AI validator was receiving detailed explanations with markdown
formatting (e.g., '**PASS**') instead of the expected simple format.

Updated the validation prompt to explicitly require responses to start
with either 'PASS' or 'FAIL: <reason>' without any additional formatting,
explanations, or markdown before the verdict.

This fixes the 'Warning: Unexpected Claude response format' error that
was causing valid test results to be incorrectly marked as unclear.
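A hedged sketch of the parsing the stricter format allows (parseVerdict is an illustrative helper, not the actual validate.go code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseVerdict expects a response that starts with "PASS" or "FAIL: <reason>".
// Anything else (for example markdown like "**PASS**") is reported as an
// unexpected format instead of being guessed at.
func parseVerdict(resp string) (pass bool, reason string, err error) {
	resp = strings.TrimSpace(resp)
	switch {
	case strings.HasPrefix(resp, "PASS"):
		return true, "", nil
	case strings.HasPrefix(resp, "FAIL:"):
		return false, strings.TrimSpace(strings.TrimPrefix(resp, "FAIL:")), nil
	default:
		return false, "", fmt.Errorf("unexpected Claude response format: %q", resp)
	}
}

func main() {
	fmt.Println(parseVerdict("PASS"))
	fmt.Println(parseVerdict("**PASS** - coherent greeting")) // rejected as unexpected
}
```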
2025-10-30 12:34:02 +08:00
Shang Chieh Tseng
c8b7015a2c Move test-runner temp directory into project
- Change temp directory from /tmp/test-runner-claude to .test-runner-temp
- Keeps temporary files within project bounds for Claude Code access
- Add .test-runner-temp to .gitignore to exclude from version control
- Fixes Claude AI validation permission issue
2025-10-30 12:25:25 +08:00
Shang Chieh Tseng
9b487aa5f5 Rename validateConfig function to validateConfigFile to avoid conflict
- Function in main.go renamed from validateConfig to validateConfigFile
- Resolves redeclaration error with validateConfig in config.go
- config.go has validateConfig(*Config) for internal validation
- main.go has validateConfigFile(string) for CLI command
2025-10-30 12:16:55 +08:00
Shang Chieh Tseng
a7b3f6eda5 Fix test-runner variable name conflict
- Rename validateConfig flag variable to validateConfigPath
- Resolves compilation error: validateConfig was both a *string variable and function name
- Function call now uses correct variable name
2025-10-30 12:15:12 +08:00
Shang Chieh Tseng
5895b414f4 Fix cross-workflow artifact download using dawidd6/action-download-artifact
- Replace actions/download-artifact@v4 with dawidd6/action-download-artifact@v6
- The default download-artifact action only works within same workflow run
- Third-party action enables downloading artifacts from different workflow
- Both test workflows now download from latest successful tesla-k80-ci.yml run
2025-10-30 12:12:59 +08:00
Shang Chieh Tseng
a171c8a087 Fix test workflows to use build artifacts instead of local binary
- Build workflow now uploads ollama binary as artifact with 7-day retention
- Test workflows download artifact instead of expecting local binary
- Eliminates 'ollama binary not found' error when running tests
- Enables build-once, test-multiple-times workflow pattern
- Added binary verification step to confirm artifact download
2025-10-30 12:07:28 +08:00
Shang Chieh Tseng
6c3876a30d Add multi-GPU test workflow and rename single-GPU workflow
- Rename tesla-k80-tests.yml to tesla-k80-single-gpu-tests.yml for clarity
- Add new tesla-k80-multi-gpu-tests.yml workflow for large models
- Add multi-gpu profile to test/config/models.yaml with gemma3:27b and gpt-oss:20b
- Multi-GPU workflow includes GPU count verification and weekly schedule
- Profile-specific validation allows multi-GPU splits for large models
- Separate workflows optimize CI efficiency: quick tests vs. thorough tests
2025-10-30 12:04:50 +08:00
Shang Chieh Tseng
1aa80e9411 Simplify test profiles to focus on Tesla K80 capabilities
Changes to test/config/models.yaml:

Quick profile:
- Use gemma3:4b (was gemma2:2b)
- Single prompt: 'Hello, respond with a brief greeting.'
- Timeout: 60s
- Purpose: Fast smoke test (~5 min)

Full profile:
- REMOVED: gemma2:2b, gemma3:4b (redundant with quick test)
- ONLY gemma3:12b (largest model for single K80)
- Single prompt: 'Hello, respond with a brief greeting.' (same as quick)
- Timeout: 120s (sufficient - loads in ~24s)
- Purpose: Validate Phase 2 memory optimization for large models

Rationale:
- Quick test validates basic functionality with gemma3:4b
- Full test validates single-GPU capability with gemma3:12b
- No need to test multiple sizes if both work
- Consistent prompts make comparison easier
- Tests the critical optimization: 12B model on single K80
2025-10-30 11:57:30 +08:00
Shang Chieh Tseng
4de7dd453b Add Claude AI-powered response validation and update test model
Changes:
1. Update quick test to use gemma3:4b (was gemma2:2b)
   - Increased timeout to 60s for larger model

2. Implement Claude headless validation (validate.go)
   - Hybrid approach: simple checks first, then Claude validation ALWAYS runs
   - Claude validates response quality, coherence, relevance
   - Detects gibberish, errors, and malformed responses
   - Falls back to simple validation if Claude CLI unavailable
   - Verbose logging shows Claude validation results

3. Validation flow:
   - Step 1: Fast checks (empty response, token count)
   - Step 2: Claude AI analysis (runs regardless of simple check)
   - Claude result overrides simple checks
   - If Claude unavailable, uses simple validation only

4. Workflow improvements:
   - Remove useless GPU memory check step (server already stopped)
   - Cleaner workflow output

Benefits:
- Intelligent response quality validation
- Catches subtle issues (gibberish, off-topic responses)
- Better than hardcoded pattern matching
- Graceful degradation when Claude unavailable
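A rough sketch of the hybrid flow in items 2-3 above (the claude CLI lookup and function names are assumptions about validate.go, not its actual code):

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateResponse runs the fast checks first, then asks Claude when the CLI
// is available; Claude's verdict overrides the simple result, and the simple
// result stands when Claude cannot be reached.
func validateResponse(response string) (bool, string) {
	if len(response) == 0 {
		return false, "empty response" // fast check
	}
	if _, err := exec.LookPath("claude"); err != nil {
		return true, "simple validation only (Claude CLI unavailable)"
	}
	return askClaude(response)
}

// askClaude stands in for invoking Claude in headless mode and parsing a
// "PASS" / "FAIL: <reason>" verdict.
func askClaude(response string) (bool, string) { return true, "claude: PASS" }

func main() {
	fmt.Println(validateResponse("Hello! Nice to meet you."))
}
```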
2025-10-30 11:42:10 +08:00
Shang Chieh Tseng
d59284d30a Implement Go-based test runner framework for Tesla K80 testing
Add comprehensive test orchestration framework:

Test Runner (cmd/test-runner/):
- config.go: YAML configuration loading and validation
- server.go: Ollama server lifecycle management (start/stop/health checks)
- monitor.go: Real-time log monitoring with pattern matching
- test.go: Model testing via Ollama API (pull, chat, validation)
- validate.go: Test result validation (GPU usage, response quality, log analysis)
- report.go: Structured reporting (JSON and Markdown formats)
- main.go: CLI interface with run/validate/list commands

Test Configurations (test/config/):
- models.yaml: Full test suite with quick/full/stress profiles
- quick.yaml: Fast smoke test with gemma2:2b

Updated Workflow:
- tesla-k80-tests.yml: Use test-runner instead of shell scripts
- Run quick tests first, then full tests if passing
- Generate structured JSON reports for pass/fail checking
- Upload test results as artifacts

Features:
- Multi-model testing with configurable profiles
- API-based testing (not CLI commands)
- Real-time log monitoring for GPU events and errors
- Automatic validation of GPU loading and response quality
- Structured JSON and Markdown reports
- Graceful server lifecycle management
- Interrupt handling (Ctrl+C cleanup)

Addresses limitations of shell-based testing by providing:
- Better error handling and reporting
- Programmatic test orchestration
- Reusable test framework
- Clear pass/fail criteria
- Detailed test metrics and timing
2025-10-30 11:04:48 +08:00
Shang Chieh Tseng
aaaf334e7f Update tesla-k80-ci.yml 2025-10-30 11:02:14 +08:00
Shang Chieh Tseng
b402b073c5 Split Tesla K80 workflows into build and test; add test framework plan
- Changed tesla-k80-ci.yml to manual trigger only, simplified to build-only workflow
- Created tesla-k80-tests.yml for separate test execution (manual trigger)
- Added .github/workflows/CLAUDE.md with comprehensive test framework design
- Removed binary artifact upload (not needed for single self-hosted runner)
- Replaced README.md with CLAUDE.md for better documentation structure

Test framework plan:
- Go-based test runner at cmd/test-runner/
- YAML configuration for multi-model testing
- Server lifecycle management with log monitoring
- API-based testing with structured reporting
- Support for test profiles (quick/full/stress)
2025-10-30 10:59:52 +08:00
Shang Chieh Tseng
7e317fdd74 Add Phase 2 summary documentation for CC 3.7 graph correction
Documents the complete Tesla K80 memory estimation optimization journey,
including per-GPU graph allocation and empirical correction factor implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 10:27:25 +08:00
Shang Chieh Tseng
296d537a2c Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction
Added Phase 2 documentation for single-GPU optimization:
- CC 3.7 graph correction factor (85% of estimate)
- gemma3:12b now loads on single GPU
- Memory estimation improved from 11.9 GiB → 11.0 GiB
- Validated with 10.0 GiB actual usage, 94% GPU utilization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:16:38 +08:00
Shang Chieh Tseng
6d87524e22 Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on single GPU with correct inference
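A minimal sketch of the correction, assuming a helper in the memory-estimation path (names and placement are illustrative):

```go
package llm

// cc37GraphCorrection reflects the measurement above: graph estimates for
// Compute Capability 3.7 ran roughly 15-20% high, so 85% of the estimate is
// used for Tesla K80.
const cc37GraphCorrection = 0.85

// correctedGraphSize is an illustrative helper, not the actual llm/memory.go
// code path.
func correctedGraphSize(estimate uint64, ccMajor, ccMinor int) uint64 {
	if ccMajor == 3 && ccMinor == 7 {
		return uint64(float64(estimate) * cc37GraphCorrection)
	}
	return estimate
}
```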

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:15:59 +08:00
Shang Chieh Tseng
d04ea50ced Fix gpt-oss model architecture to match GGUF tensor format
The gpt-oss model architecture code expected fused tensors (attn_qkv,
ffn_gate_up_exps) but the actual GGUF files contain separate tensors
(attn_q/k/v, ffn_gate_exps/up_exps), causing nil pointer panics during
model loading.

Changes:
- model/models/gptoss/model.go: Updated AttentionBlock to use separate
  Query/Key/Value fields instead of fused QKV, modified Forward() to
  compute projections separately
- model/models/gptoss/model.go: Updated MLPBlock to use separate Gate/Up
  fields instead of fused GateUp, simplified Forward() logic
- fs/ggml/type.go: Reorganized MXFP4 tensor type constant ordering
- ml/backend/ggml/ggml/include/ggml.h: Moved GGML_TYPE_MXFP4 to end of
  enum to match GGUF file format specification
- ml/backend/ggml/ggml/src/ggml.c: Updated type name array to match
  reordered enum
- CLAUDE.md: Documented gpt-oss model compatibility fix

Result: gpt-oss:20b model now loads and runs successfully on Tesla K80,
all 25 layers offload to GPU correctly.
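A schematic of the AttentionBlock change (the tensor type and gguf tags are placeholders; the real definitions are in model/models/gptoss/model.go):

```go
package gptoss

// Tensor is a stand-in for the backend tensor type used by the model package.
type Tensor interface{}

// Before the fix the block expected a fused projection the GGUF files do not
// contain:
//
//	type AttentionBlock struct {
//		QKV Tensor `gguf:"attn_qkv"`
//	}
//
// After the fix the projections are separate, matching the attn_q/attn_k/attn_v
// tensors actually present in the files, and Forward computes them one by one.
type AttentionBlock struct {
	Query Tensor `gguf:"attn_q"`
	Key   Tensor `gguf:"attn_k"`
	Value Tensor `gguf:"attn_v"`
}
```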

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 23:34:03 +08:00
Shang Chieh Tseng
241a03402e Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)
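A minimal sketch of the reverse-order selection in change 2 (the fits callback is a stand-in; the real accounting lives in llm/memory.go):

```go
package llm

// pickGPU illustrates the switch from round-robin (layer i on GPU i%j) to
// reverse order: try the last GPU (the primary) first and only spill onto
// earlier GPUs when it no longer fits.
func pickGPU(fits func(gpu int) bool, j int) int {
	for g := j - 1; g >= 0; g-- {
		if fits(g) {
			return g
		}
	}
	return -1 // nothing fits; the layer stays on the CPU
}
```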

Results:
- gemma3:4b: Single GPU (no split, was already working)
- gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 19:58:20 +08:00
Shang Chieh Tseng
5077ab3fb4 Document Phase 9 completion: Fix CUDA backend loading for CC 3.7
Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection

Testing Results:
- CUDA backend loads without undefined symbol errors
- GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
- Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores).
If attempted, it aborts gracefully with a clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 17:44:36 +08:00
Shang Chieh Tseng
66fca1b685 Remove remaining MMA/WMMA template instances for CC 3.7 optimization
Delete 24 tensor core template instance files that were missed in the initial optimization:
- 19 fattn-mma-f16 template instances (various ncols1/ncols2 combinations)
- 5 fattn-wmma-f16 template instances (kqfloat and kqhalf variants)

These files implement tensor core operations (MMA/WMMA) which require Compute Capability 7.0+
and are not available on Tesla K80 (CC 3.7). Removing them completes the CC 3.7-only optimization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:24:08 +08:00
Shang Chieh Tseng
771044bead Complete CC 3.7-only CUDA optimization for Tesla K80 support
Simplify CUDA backend to exclusively support Compute Capability 3.7 (Kepler/Tesla K80).
This optimization removes ~2,700 lines of modern GPU code and resolves all compilation issues.

Changes:
- Remove tensor core files (mma.cuh, fattn-wmma-f16.*, fattn-mma-f16.cuh) and 92 template instances
- Hardcode architecture detection to always return CC 3.7 (370) in common.cuh
- Disable modern GPU features: FP16 native ops, MMA/WMMA, CP_ASYNC, BF16, CUDA graphs
- Disable 6 MMA functions in mmq.cuh while preserving DP4A functions for CC 3.7
- Replace undefined architecture constants (PASCAL/VOLTA/DP4A/ADA_LOVELACE) with CC 3.7 equivalents
- Set CMAKE_CUDA_ARCHITECTURES to "37" only in CMakeLists.txt and CMakePresets.json
- Hardcode Stream-K scheduling to false, precision to FP32 throughout
- Add comprehensive CLAUDE.md documentation with complete optimization history

Build configuration now compiles only for architecture 37, resulting in 80-85% smaller
binaries and 5-6x faster build times. All removed code paths were unreachable on CC 3.7
hardware, ensuring no performance degradation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:21:08 +08:00
Shang Chieh Tseng
135b799b13 Update command. 2025-10-29 14:21:03 +08:00
Shang Chieh Tseng
6024408ea5 Update command. 2025-10-28 18:42:49 +08:00
Shang Chieh Tseng
92acf0f91e Add GitHub Actions workflow for Tesla K80 CI/CD
- Tesla K80 build and test workflow with self-hosted runner
- Build using GCC 10 and CUDA 11.4 for Compute Capability 3.7
- Run unit tests, integration tests, and model inference tests
- Test gemma2:2b model loading and GPU acceleration
- Use Claude headless mode to analyze server logs and verify proper GPU initialization
- Upload logs, analysis results, and binary artifacts
- Comprehensive documentation in workflows README
2025-10-28 18:09:49 +08:00
Shang Chieh Tseng
fe0fd5b494 Update manual-build.md 2025-10-28 17:20:03 +08:00
Shang Chieh Tseng
e6e91af024 Separate NVIDIA driver and CUDA toolkit installation steps
- Split Step 3 into two distinct steps:
  - Step 3: NVIDIA Driver 470 installation via .run file
  - Step 4: CUDA 11.4 Toolkit installation via local installer
- Add libglvnd-devel dependency requirement
- Add text mode (init 3) requirement for driver installation
- Specify exact driver version (470.256.02) and download URL
- Specify exact CUDA installer (11.4.0 with 470.42.01 driver)
- Add note to deselect driver during CUDA installation
- Separate environment configuration:
  - PATH in /etc/profile.d/cuda-11.4.sh
  - Dynamic linker in /etc/ld.so.conf.d/cuda-11-4.conf
- Update all subsequent step numbers (5-7)
- Update all cross-references throughout document
2025-10-28 16:55:38 +08:00
Shang Chieh Tseng
35c4d078f7 Fix step reference in troubleshooting: GCC 10 is Step 1, not Step 5 2025-10-28 15:56:49 +08:00
Shang Chieh Tseng
417b451af1 Add system compiler symlink updates to use GCC 10 by default 2025-10-28 15:53:49 +08:00
Shang Chieh Tseng
c788de5f8b Fix GCC 10 dynamic linker config to include both /usr/lib64 and /usr/local/lib64 2025-10-28 15:51:41 +08:00
Shang Chieh Tseng
e549dcb710 Reorganize installation steps: Move GCC 10 to Step 1 before kernel compilation 2025-10-28 15:35:10 +08:00
Shang Chieh Tseng
29706d14d7 Consolidate GCC 10 installation steps into single script format 2025-10-28 15:28:11 +08:00
Shang Chieh Tseng
85d98064d1 Fix kernel config copy path to use /usr/src/kernels for Rocky Linux 9 2025-10-28 15:27:40 +08:00
Shang Chieh Tseng
8dc4ca7ccc Reorganize Docker build infrastructure for better maintainability
- Restructure from ollama37/ to docker/ with clear separation
- Separate builder and runtime images into dedicated directories
- Group environment scripts in builder/scripts/ subdirectory
- Add comprehensive root-level README.md (257 lines)
- Add .dockerignore files for optimized build contexts
- Enhance shell scripts with shebangs and documentation headers
- Update docker-compose.yml to build locally instead of pulling
- Add environment variables for GPU and host configuration
- Remove duplicate Dockerfile and confusing nested structure

New structure:
  docker/
  ├── README.md (comprehensive documentation)
  ├── docker-compose.yml (local build support)
  ├── builder/ (build environment: CUDA 11.4 + GCC 10 + Go 1.24)
  │   ├── Dockerfile
  │   ├── README.md
  │   ├── .dockerignore
  │   └── scripts/ (organized environment setup)
  └── runtime/ (production image)
      ├── Dockerfile
      ├── README.md
      └── .dockerignore

This reorganization eliminates confusion, removes duplication, and
provides a professional, maintainable structure for Tesla K80 builds.
2025-10-28 14:47:39 +08:00
Shang Chieh Tseng
736cbdf52a Remove unused file. 2025-10-22 22:35:41 +08:00
Shang Chieh Tseng
29cb9d3a27 Remove GitHub Actions workflows from fork
Removed all GitHub Actions workflows (.github/workflows/) as they're not needed
for this Tesla K80 support fork. The workflows were designed for the official
Ollama repository's CI/CD pipeline and would fail in a fork since they:
- Attempt to push to Ollama's Docker Hub
- Run automated tests on PRs (not needed for personal fork)
- Handle official release process

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 19:22:12 +08:00
Shang Chieh Tseng
c61e0ce554 Update README.md for v1.4.0: GPT-OSS support and Tesla K80 memory improvements
- Added GPT-OSS model to supported models list with multi-GPU optimization notes
- Documented Tesla K80 Multi-GPU usage example with nvidia-smi monitoring
- Added comprehensive Tesla K80 Memory Improvements section covering:
  * VMM pool crash fixes with granularity alignment
  * Multi-GPU model switching scheduler improvements
  * Silent inference failure resolution
- Updated recent updates section for v1.4.0 release
- Enhanced technical details with multi-GPU optimization specs

These improvements enable robust production use of Tesla K80 hardware
for LLM inference with seamless model switching capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
v1.4.0
2025-08-10 01:42:38 +08:00
Shang Chieh Tseng
08f38b19ea Fix Tesla K80 multi-GPU model switching deadlocks and silent failures
Resolves two critical issues preventing robust model switching:

1. Scheduler deadlock: Fixed improper loop control flow that prevented
   model unloading from triggering after conflict detection. Added proper
   multi-GPU conflict detection and unload sequencing.

2. Silent inference failures: Changed critical cudaSetDevice() calls from
   graceful error handling back to CUDA_CHECK to prevent models from
   appearing to load successfully but failing silently during inference.

Result: Robust Tesla K80 dual-GPU model switching with self-healing
recovery instead of requiring system reboots.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 01:30:10 +08:00
Shang Chieh Tseng
46213c5880 Fix Tesla K80 VMM pool crash by aligning to granularity
- Fix CUDA_ERROR_INVALID_VALUE from cuMemAddressReserve by aligning max_pool_size to GPU granularity
- Set dynamic max_pool_size based on 90% of actual GPU memory instead of static 32GB
- Add memory availability check before allocation to prevent OOM
- Tested on Tesla K80 dual GPU setup with successful model loading and chat completions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 17:48:31 +08:00
Shang Chieh Tseng
e4113f080a Merge upstream ollama/ollama with Tesla K80 support preserved
Successfully synced with upstream ollama/ollama main branch while maintaining:
- CUDA Compute Capability 3.7 support for Tesla K80 GPUs
- CUDA 11 build configuration with architecture 37
- BF16 compatibility fallback for older GPUs

New features from upstream:
- gpt-oss model support (tested working on Tesla K80)
- Various performance improvements and bug fixes
- Updated model architectures and optimizations

All Tesla K80 optimizations and documentation preserved.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:17:24 +08:00
Shang Chieh Tseng
2be9575694 Fix BF16 compatibility for Tesla K80 (Compute Capability 3.7)
Add runtime check for BF16 support which requires Compute Capability 8.0+.
Tesla K80 and other CC 3.7 GPUs will fallback to FP16/FP32 operations.
This ensures the upstream BF16 optimizations work on newer GPUs while
maintaining compatibility with legacy hardware.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:15:49 +08:00
Shang Chieh Tseng
83973336d6 Optimize Docker build performance with parallel compilation
- Add -j$(nproc) flag to cmake build in ollama37.Dockerfile
- Use all available CPU cores for faster compilation
- Add sync-upstream.md documentation for future maintenance

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 11:44:59 +08:00
Shang Chieh Tseng
0cd81c838a Merge upstream ollama/ollama main branch while preserving CUDA 3.7 support
- Added support for new gpt-oss model from upstream
- Preserved CUDA Compute Capability 3.7 (Tesla K80) support
- Kept CUDA 11 configuration alongside CUDA 12
- Maintained all documentation specific to ollama37 fork
- Integrated new tool parsing improvements
- Added new backend methods and patches from upstream
2025-08-08 10:43:29 +08:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00