mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-09 23:37:06 +00:00
This commit implements comprehensive Tesla K80 (Kepler, compute 3.7) compatibility for batched matrix multiplication operations.

**Problem:** Modern CUBLAS functions fail on Tesla K80 with CUBLAS_STATUS_ARCH_MISMATCH:
1. CUBLAS_GEMM_DEFAULT_TENSOR_OP requires Tensor Cores (Volta+ only)
2. cublasGemmStridedBatchedEx/cublasGemmBatchedEx have architectural requirements beyond algorithm selection

**Solution - Two-Tier Fallback:**
Tier 1: Algorithm Selection
- Volta+ (cc >= 7.0): CUBLAS_GEMM_DEFAULT_TENSOR_OP
- Pre-Volta (cc < 7.0): CUBLAS_GEMM_DEFAULT
Tier 2: Function Selection
- Volta+ or non-FP32: Use *Ex variants (flexible precision)
- Kepler/Maxwell/Pascal with FP32: Use legacy type-specific functions (cublasSgemmStridedBatched, cublasSgemmBatched)

**Changes:**
CUDA Implementation:
- ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu
  * ggml_cuda_op_mul_mat_cublas: Algorithm selection for non-batched ops
  * ggml_cuda_mul_mat_batched_cublas_impl: Two-tier fallback for batched ops
  * Added GGML_CUDA_DEBUG environment variable for conditional debug logging
  * Comprehensive function documentation explaining fallback strategy
Documentation:
- CLAUDE.md
  * Added Tesla K80 CUBLAS Compatibility section
  * Documented GGML_CUDA_DEBUG environment variable
  * Enhanced "Running Ollama" section with log capture examples
  * Updated Files Modified list
Code Comments:
- Added detailed comments throughout CUDA code explaining:
  * Why TENSOR_OP fails on pre-Volta GPUs
  * Why *Ex functions require architectural support
  * Compute capability checks and fallback logic
  * Debug logging usage

**Testing:** All models verified working on Tesla K80:
- ✅ gemma3:4b
- ✅ gpt-oss
- ✅ deepseek-r1
Debug flag tested in both enabled and disabled states.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Claude Code Development Notes
This document tracks development goals and notes for this Ollama repository fork.
Project Goals
1. CUDA Compute Capability 3.7 Support (Tesla K80)
- Objective: Add support for CUDA compute capability 3.7 to enable running on Tesla K80 GPUs
- Environment:
- GCC version: 10.5
- CUDA version: 11.4.4
- NVIDIA driver: 470
- Target GPU: Tesla K80 (compute capability 3.7)
- Status: ✅ Complete
2. Code Documentation Policy
- Issue: This repo is forked from the official Ollama repository, whose code has few comments, which makes debugging difficult
- Policy: Add helpful comments when figuring out code functionality
- Rationale: Improve code maintainability and debugging experience
Implementation Summary
Files Modified
- `ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt` - Added 3.7 compute capability to default architecture list
- `CMakePresets.json` - Added compute 3.7 to "CUDA 11" preset and created dedicated "CUDA 11 K80" preset
- `ml/backend/ggml/ggml/src/CMakeLists.txt` - Enabled Alderlake CPU variant without AVX_VNNI
- `ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu` - Added CUBLAS legacy function fallback for Kepler GPU compatibility
Key Changes
- Added `37-virtual` to CMAKE_CUDA_ARCHITECTURES (using PTX with JIT compilation for better compatibility)
- Updated "CUDA 11" preset to include compute 3.7 alongside other supported architectures
- Created "CUDA 11 K80" preset for K80-only optimized builds
- Enabled Alderlake CPU variant without AVX_VNNI (GCC 10 limitation)
- Added `-Wno-deprecated-gpu-targets` flag to suppress warnings
CUDA Version Compatibility
- CUDA 11.4.4 supports: 37, 50, 52, 60, 61, 70, 75, 80, 86
- CUDA 11.4.4 does NOT support: 87 (requires 11.7+), 89 (requires 11.8+), 90 (requires 12.0+)
- CUDA 12+ dropped Kepler support entirely
Tesla K80 CUBLAS Compatibility
Challenge: Tesla K80 (Kepler, compute 3.7) requires special handling for batched matrix multiplication due to:
- Lack of Tensor Cores (introduced in Volta, compute 7.0+)
- Architectural limitations with modern CUBLAS `*Ex` function variants
Solution - Two-Tier Fallback Strategy:
Tier 1: GEMM Algorithm Selection
- Volta+ (cc >= 7.0): Use `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (value 99)
- Pre-Volta (cc < 7.0): Use `CUBLAS_GEMM_DEFAULT` (value -1)
Tier 2: CUBLAS Function Selection
- Modern GPUs (Volta+): Use `cublasGemmStridedBatchedEx` / `cublasGemmBatchedEx`
  - Support mixed precision, flexible compute types, algorithm selection
- Legacy GPUs (Kepler/Maxwell/Pascal with FP32): Use `cublasSgemmStridedBatched` / `cublasSgemmBatched`
  - The `*Ex` variants have architectural requirements beyond algorithm selection
  - Even with `CUBLAS_GEMM_DEFAULT`, `*Ex` functions fail with `CUBLAS_STATUS_ARCH_MISMATCH`
  - Legacy functions only support FP32, but work reliably on older architectures
Modified Function: ggml_cuda_mul_mat_batched_cublas_impl in ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:1986
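The two tiers can be read as a small decision function. The sketch below is only an illustration of that logic, not the code in `ggml-cuda.cu`: `GemmAlgo`, `GemmPath`, `GemmPlan`, and `plan_batched_gemm` are hypothetical names, and the enum values stand in for the cuBLAS constants quoted above so that the example compiles without the CUDA toolkit.

```cpp
// Stand-ins for the cuBLAS algorithm constants quoted above
// (CUBLAS_GEMM_DEFAULT = -1, CUBLAS_GEMM_DEFAULT_TENSOR_OP = 99).
enum GemmAlgo { GEMM_DEFAULT = -1, GEMM_DEFAULT_TENSOR_OP = 99 };

// Hypothetical labels for the two families of batched GEMM entry points:
// Ex          -> cublasGemmStridedBatchedEx / cublasGemmBatchedEx
// LegacySgemm -> cublasSgemmStridedBatched  / cublasSgemmBatched
enum class GemmPath { Ex, LegacySgemm };

struct GemmPlan {
    GemmAlgo algo;
    GemmPath path;
};

// cc_major: major compute capability (3 for Kepler, 7 for Volta).
// all_fp32: true when both inputs and the output are FP32.
GemmPlan plan_batched_gemm(int cc_major, bool all_fp32) {
    const bool volta_or_newer = cc_major >= 7;

    GemmPlan plan;
    // Tier 1: only request Tensor Core algorithms where Tensor Cores exist.
    plan.algo = volta_or_newer ? GEMM_DEFAULT_TENSOR_OP : GEMM_DEFAULT;
    // Tier 2: pre-Volta FP32 work goes to the legacy type-specific functions;
    // everything else keeps the flexible *Ex variants.
    plan.path = (!volta_or_newer && all_fp32) ? GemmPath::LegacySgemm
                                              : GemmPath::Ex;
    return plan;
}
```

Under these assumptions, `plan_batched_gemm(3, true)` (a Tesla K80 running FP32) yields `GEMM_DEFAULT` with the legacy path, while `plan_batched_gemm(7, true)` keeps the `*Ex` path with `GEMM_DEFAULT_TENSOR_OP`.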
Tested Models (verified on Tesla K80):
- ✅ gemma3:4b
- ✅ gpt-oss
- ✅ deepseek-r1
Build Instructions
Complete Build from Scratch
# Clean any previous build artifacts
rm -rf build
go clean -cache
# Configure the build (specify GCC 10.5 explicitly)
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --preset "CUDA 11"
# Build the C/C++/CUDA libraries
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --build build -j 48
# Build the Go binary
go build -o ollama .
Running Ollama
Basic Server Start
# Start the Ollama server
./ollama serve
# Check GPU detection
nvidia-smi
Debug and Logging Options
Environment Variables:
- `OLLAMA_DEBUG=1` - Enable verbose Ollama server logging
- `GGML_CUDA_DEBUG=1` - Enable detailed CUDA/CUBLAS operation logging (batched matrix multiplication)
# Run with Ollama verbose logging only
OLLAMA_DEBUG=1 ./ollama serve
# Run with both Ollama and CUDA debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve
# Capture all output to file
./ollama serve 2>&1 | tee /tmp/ollama_serve.log
# Capture only stderr (warnings/errors) to file
./ollama serve 2> /tmp/ollama_errors.log
# Run in background with full logging
OLLAMA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_full.log &
# Run in background with debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_debug.log &
# Monitor a running background server
tail -f /tmp/ollama_full.log
# Tail recent log entries
tail -100 /tmp/ollama_full.log
# Stop all ollama processes
pkill ollama
When to Use GGML_CUDA_DEBUG:
- Debugging CUBLAS errors on Tesla K80 or other legacy GPUs
- Verifying compute capability detection
- Troubleshooting batched matrix multiplication issues
- Understanding which CUBLAS functions are being used (legacy vs Ex variants)
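For readers curious how an environment-gated debug flag like this is typically wired up, here is a minimal sketch. It is not the implementation in `ggml-cuda.cu`: the helper names and the `CUDA_DEBUG_LOG` macro are hypothetical, and the parsing rule (unset, empty, or `0` means off) is an assumption made for the example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Interpret the raw environment value.
// Assumption for this sketch: unset, empty, or "0" disables logging;
// any other value enables it.
bool debug_flag_from_value(const char * value) {
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Read GGML_CUDA_DEBUG once and cache the answer, so the check on hot
// paths costs a single bool load instead of a getenv call.
bool cuda_debug_enabled() {
    static const bool enabled = debug_flag_from_value(std::getenv("GGML_CUDA_DEBUG"));
    return enabled;
}

// Hypothetical logging helper: prints to stderr only when the flag is set.
#define CUDA_DEBUG_LOG(...)                    \
    do {                                       \
        if (cuda_debug_enabled()) {            \
            std::fprintf(stderr, __VA_ARGS__); \
        }                                      \
    } while (0)
```

A call site would then look like `CUDA_DEBUG_LOG("cc=%d using legacy Sgemm path\n", cc);`, which stays silent unless the server was started with `GGML_CUDA_DEBUG=1`.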
CPU Architecture Compatibility
The GCC/CUDA/Alderlake Constraint
This build faces a fundamental compatibility constraint:
The Constraint Chain:
- Tesla K80 (compute 3.7) → Last supported by Driver 470.xx
- Driver 470.256.02 → Maximum CUDA version is CUDA 11.4
- CUDA 11.4 → Maximum GCC version is GCC 10 (enforced in `host_config.h`)
- AVX_VNNI (Alderlake CPUs) → Requires GCC 11+ for `-mavxvnni` flag
Result: Cannot have both K80 GPU support AND full Alderlake CPU optimization.
Solution: Alderlake Without AVX_VNNI
Implementation:
- Alderlake CPU variant is enabled in the build
- AVX_VNNI instruction set is excluded (requires GCC 11+)
- Alderlake still gets: SSE4.2, AVX, F16C, AVX2, BMI2, FMA optimizations
- Code falls back to `_mm256_maddubs_epi16()` for operations that would use VNNI
Modified file: ml/backend/ggml/ggml/src/CMakeLists.txt line 338
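To make the fallback concrete, the following scalar model compares one 32-bit lane of a VNNI dot product with the AVX2 route. It is a sketch rather than the ggml code: the function names are hypothetical, and it assumes the 16-bit pair sums produced by `_mm256_maddubs_epi16()` are then widened and added (for example with `_mm256_madd_epi16()` against a vector of ones).

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of a non-saturating VNNI dot product:
// four unsigned-by-signed byte products accumulated directly in 32 bits.
int32_t dot4_vnni(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Scalar model of one 16-bit lane of _mm256_maddubs_epi16: two
// unsigned-by-signed byte products added with signed 16-bit saturation.
int16_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t sum = int32_t(a0) * int32_t(b0) + int32_t(a1) * int32_t(b1);
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return int16_t(sum);
}

// Fallback route: two saturated pair sums, widened to 32 bits and added
// to the accumulator (the extra step that VNNI folds into one instruction).
int32_t dot4_fallback(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    const int32_t lo = maddubs_pair(a[0], b[0], a[1], b[1]);
    const int32_t hi = maddubs_pair(a[2], b[2], a[3], b[3]);
    return acc + lo + hi;
}
```

The two routes agree whenever each pair sum fits in a signed 16-bit value. The fallback simply needs more instructions per dot product, which is where the INT8 slowdown comes from.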
Performance Impact:
- Most operations: No impact (still uses AVX2, FMA, BMI2)
- INT8 dot products: ~10-20% slower than native AVX_VNNI
- Overall model inference: ~3-7% slower (depends on quantization)
CPU Support Matrix
| CPU Generation | Variant Used | Full Optimization | Notes |
|---|---|---|---|
| Haswell (2013) | haswell | ✅ Yes | Xeon E5-2676 v3 |
| Skylake-X (2017) | skylakex | ✅ Yes | Includes AVX512 |
| Icelake (2019) | icelake | ✅ Yes | Includes AVX512_VNNI |
| Alderlake (2021) | alderlake | ⚠️ Partial | Missing AVX_VNNI only |
| Raptor Lake (2022) | alderlake | ⚠️ Partial | Missing AVX_VNNI only |