ollama37/CLAUDE.md
Commit d948926581 by Shang Chieh Tseng (2025-11-05 23:52:45 +08:00): Fix Tesla K80 CUBLAS compatibility with two-tier fallback strategy
This commit implements comprehensive Tesla K80 (Kepler, compute 3.7)
compatibility for batched matrix multiplication operations.

**Problem:**
Modern CUBLAS functions fail on Tesla K80 with CUBLAS_STATUS_ARCH_MISMATCH:
1. CUBLAS_GEMM_DEFAULT_TENSOR_OP requires Tensor Cores (Volta+ only)
2. cublasGemmStridedBatchedEx/cublasGemmBatchedEx have architectural
   requirements beyond algorithm selection

**Solution - Two-Tier Fallback:**

Tier 1: Algorithm Selection
- Volta+ (cc >= 7.0): CUBLAS_GEMM_DEFAULT_TENSOR_OP
- Pre-Volta (cc < 7.0): CUBLAS_GEMM_DEFAULT

Tier 2: Function Selection
- Volta+ or non-FP32: Use *Ex variants (flexible precision)
- Kepler/Maxwell/Pascal with FP32: Use legacy type-specific functions
  (cublasSgemmStridedBatched, cublasSgemmBatched)

**Changes:**

CUDA Implementation:
- ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu
  * ggml_cuda_op_mul_mat_cublas: Algorithm selection for non-batched ops
  * ggml_cuda_mul_mat_batched_cublas_impl: Two-tier fallback for batched ops
  * Added GGML_CUDA_DEBUG environment variable for conditional debug logging
  * Comprehensive function documentation explaining fallback strategy

Documentation:
- CLAUDE.md
  * Added Tesla K80 CUBLAS Compatibility section
  * Documented GGML_CUDA_DEBUG environment variable
  * Enhanced "Running Ollama" section with log capture examples
  * Updated Files Modified list

Code Comments:
- Added detailed comments throughout CUDA code explaining:
  * Why TENSOR_OP fails on pre-Volta GPUs
  * Why *Ex functions require architectural support
  * Compute capability checks and fallback logic
  * Debug logging usage

**Testing:**
All models verified working on Tesla K80:
- gemma3:4b
- gpt-oss
- deepseek-r1

Debug flag tested in both enabled and disabled states.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>


Claude Code Development Notes

This document tracks development goals and notes for this Ollama repository fork.

Project Goals

1. CUDA Compute Capability 3.7 Support (Tesla K80)

  • Objective: Add support for CUDA compute capability 3.7 to enable running on Tesla K80 GPUs
  • Environment:
    • GCC version: 10.5
    • CUDA version: 11.4.4
    • NVIDIA driver: 470
    • Target GPU: Tesla K80 (compute capability 3.7)
  • Status: Complete

2. Code Documentation Policy

  • Issue: This repo is cloned from the official Ollama repository, whose code lacks comments, making debugging difficult
  • Policy: Add helpful comments when figuring out code functionality
  • Rationale: Improve code maintainability and debugging experience

Implementation Summary

Files Modified

  1. ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt - Added 3.7 compute capability to default architecture list
  2. CMakePresets.json - Added compute 3.7 to "CUDA 11" preset and created dedicated "CUDA 11 K80" preset
  3. ml/backend/ggml/ggml/src/CMakeLists.txt - Enabled Alderlake CPU variant without AVX_VNNI
  4. ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu - Added CUBLAS legacy function fallback for Kepler GPU compatibility

Key Changes

  • Added 37-virtual to CMAKE_CUDA_ARCHITECTURES (emits PTX only, which the driver JIT-compiles at load time for broader compatibility)
  • Updated "CUDA 11" preset to include compute 3.7 alongside other supported architectures
  • Created "CUDA 11 K80" preset for K80-only optimized builds
  • Enabled Alderlake CPU variant without AVX_VNNI (GCC 10 limitation)
  • Added -Wno-deprecated-gpu-targets flag to suppress warnings

CUDA Version Compatibility

  • CUDA 11.4.4 supports: 37, 50, 52, 60, 61, 70, 75, 80, 86
  • CUDA 11.4.4 does NOT support: 87 (requires 11.7+), 89 (requires 11.8+), 90 (requires 12.0+)
  • CUDA 12+ dropped Kepler support entirely
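
As a quick runtime sanity check that the K80 is visible as compute 3.7, a minimal standalone probe (illustrative helper, not part of this repo) can query the device properties:

// probe.cu - illustrative compute capability probe (not part of this repo)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // A Tesla K80 should report compute 3.7 here
        printf("Device %d: %s (compute %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}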

Tesla K80 CUBLAS Compatibility

Challenge: Tesla K80 (Kepler, compute 3.7) requires special handling for batched matrix multiplication due to:

  1. Lack of Tensor Cores (introduced in Volta, compute 7.0+)
  2. Architectural limitations with modern CUBLAS *Ex function variants

Solution - Two-Tier Fallback Strategy:

Tier 1: GEMM Algorithm Selection

  • Volta+ (cc >= 7.0): Use CUBLAS_GEMM_DEFAULT_TENSOR_OP (value 99)
  • Pre-Volta (cc < 7.0): Use CUBLAS_GEMM_DEFAULT (value -1)

Tier 2: CUBLAS Function Selection

  • Modern GPUs (Volta+): Use cublasGemmStridedBatchedEx / cublasGemmBatchedEx
    • Support mixed precision, flexible compute types, algorithm selection
  • Legacy GPUs (Kepler/Maxwell/Pascal with FP32): Use cublasSgemmStridedBatched / cublasSgemmBatched
    • The *Ex variants have architectural requirements beyond algorithm selection
    • Even with CUBLAS_GEMM_DEFAULT, *Ex functions fail with CUBLAS_STATUS_ARCH_MISMATCH
    • Legacy functions only support FP32, but work reliably on older architectures

Modified Function: ggml_cuda_mul_mat_batched_cublas_impl in ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:1986
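
The following condensed sketch shows the shape of the selection logic for the FP32 path; the function name and parameter plumbing here are illustrative, not the verbatim repo code:

// Illustrative two-tier fallback for FP32 batched GEMM (simplified sketch).
#include <cublas_v2.h>

static cublasStatus_t batched_sgemm_with_fallback(
        cublasHandle_t handle, int m, int n, int k,
        const float * A, const float * B, float * C,
        long long strideA, long long strideB, long long strideC,
        int batch_count, int cc_major) {
    const float alpha = 1.0f, beta = 0.0f;

    // Tier 1: algorithm selection. TENSOR_OP needs Tensor Cores (Volta+).
    // Pre-Volta non-FP32 inputs (not shown here) still go through the *Ex
    // entry point, but with CUBLAS_GEMM_DEFAULT.
    const cublasGemmAlgo_t algo = cc_major >= 7 ? CUBLAS_GEMM_DEFAULT_TENSOR_OP
                                                : CUBLAS_GEMM_DEFAULT;

    if (cc_major >= 7) {
        // Tier 2a: modern GPUs use the flexible *Ex entry point.
        return cublasGemmStridedBatchedEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, A, CUDA_R_32F, k, strideA,
                        B, CUDA_R_32F, k, strideB,
                &beta,  C, CUDA_R_32F, m, strideC,
                batch_count, CUBLAS_COMPUTE_32F, algo);
    }

    // Tier 2b: on Kepler, even CUBLAS_GEMM_DEFAULT through the *Ex entry
    // point returns CUBLAS_STATUS_ARCH_MISMATCH, so FP32 falls back to the
    // legacy type-specific function.
    return cublasSgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
            &alpha, A, k, strideA,
                    B, k, strideB,
            &beta,  C, m, strideC,
            batch_count);
}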

Tested Models (verified on Tesla K80):

  • gemma3:4b
  • gpt-oss
  • deepseek-r1

Build Instructions

Complete Build from Scratch

# Clean any previous build artifacts
rm -rf build
go clean -cache

# Configure the build (specify GCC 10.5 explicitly)
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --preset "CUDA 11"

# Build the C/C++/CUDA libraries
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --build build -j 48

# Build the Go binary
go build -o ollama .

Running Ollama

Basic Server Start

# Start the Ollama server
./ollama serve

# Check GPU detection
nvidia-smi

Debug and Logging Options

Environment Variables:

  • OLLAMA_DEBUG=1 - Enable verbose Ollama server logging
  • GGML_CUDA_DEBUG=1 - Enable detailed CUDA/CUBLAS operation logging (batched matrix multiplication)

# Run with Ollama verbose logging only
OLLAMA_DEBUG=1 ./ollama serve

# Run with both Ollama and CUDA debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve

# Capture all output to file
./ollama serve 2>&1 | tee /tmp/ollama_serve.log

# Capture only stderr (warnings/errors) to file
./ollama serve 2> /tmp/ollama_errors.log

# Run in background with full logging
OLLAMA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_full.log &

# Run in background with debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_debug.log &

# Monitor a running background server
tail -f /tmp/ollama_full.log

# Tail recent log entries
tail -100 /tmp/ollama_full.log

# Stop all ollama processes
pkill ollama

When to Use GGML_CUDA_DEBUG:

  • Debugging CUBLAS errors on Tesla K80 or other legacy GPUs
  • Verifying compute capability detection
  • Troubleshooting batched matrix multiplication issues
  • Understanding which CUBLAS functions are being used (legacy vs Ex variants)
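
A hypothetical sketch of how such an env-gated debug log can be wired up (the repo's actual helper and macro names may differ):

#include <cstdio>
#include <cstdlib>

// Read GGML_CUDA_DEBUG once and cache the result.
static bool ggml_cuda_debug_enabled() {
    static const bool enabled = [] {
        const char * v = std::getenv("GGML_CUDA_DEBUG");
        return v != nullptr && v[0] != '\0' && v[0] != '0';
    }();
    return enabled;
}

#define CUDA_DEBUG_LOG(...)                  \
    do {                                     \
        if (ggml_cuda_debug_enabled()) {     \
            fprintf(stderr, __VA_ARGS__);    \
        }                                    \
    } while (0)

// Example use at the fallback decision point:
//   CUDA_DEBUG_LOG("cc %d.%d -> %s\n", major, minor,
//                  major >= 7 ? "cublasGemmStridedBatchedEx"
//                             : "cublasSgemmStridedBatched");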

CPU Architecture Compatibility

The GCC/CUDA/Alderlake Constraint

This build faces a fundamental compatibility constraint:

The Constraint Chain:

  1. Tesla K80 (compute 3.7) → Last supported by Driver 470.xx
  2. Driver 470.256.02 → Maximum CUDA version is CUDA 11.4
  3. CUDA 11.4 → Maximum GCC version is GCC 10 (enforced in host_config.h)
  4. AVX_VNNI (Alderlake CPUs) → Requires GCC 11+ for -mavxvnni flag

Result: Cannot have both K80 GPU support AND full Alderlake CPU optimization.

Solution: Alderlake Without AVX_VNNI

Implementation:

  • Alderlake CPU variant is enabled in the build
  • AVX_VNNI instruction set is excluded (requires GCC 11+)
  • Alderlake still gets: SSE4.2, AVX, F16C, AVX2, BMI2, FMA optimizations
  • Code falls back to _mm256_maddubs_epi16() for operations that would use VNNI (see the sketch below)

Modified file: ml/backend/ggml/ggml/src/CMakeLists.txt line 338
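
A simplified sketch of this fallback pattern (function and variable names are illustrative, not verbatim from the repo):

#include <immintrin.h>

// Accumulate dot products of unsigned 8-bit x signed 8-bit lanes into acc.
static inline __m256i mul_sum_u8s8(__m256i acc, __m256i u8, __m256i s8) {
#if defined(__AVXVNNI__)
    // Single fused instruction, available with GCC 11+ (-mavxvnni).
    return _mm256_dpbusd_avx_epi32(acc, u8, s8);
#else
    // GCC 10 fallback: widen u8*s8 pairs to 16-bit, then sum pairs to 32-bit.
    const __m256i prod16 = _mm256_maddubs_epi16(u8, s8);
    const __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
    return _mm256_add_epi32(acc, prod32);
#endif
}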

Performance Impact:

  • Most operations: No impact (still uses AVX2, FMA, BMI2)
  • INT8 dot products: ~10-20% slower than native AVX_VNNI
  • Overall model inference: ~3-7% slower (depends on quantization)
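
For intuition: if INT8 dot products account for roughly a third of inference time (an assumed fraction for illustration; it varies by quantization), a 15% slowdown on that portion yields about 0.33 × 15% ≈ 5% overall, consistent with the 3-7% range above.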

CPU Support Matrix

CPU Generation      Variant Used  Full Optimization  Notes
Haswell (2013)      haswell       Yes                Xeon E5-2676 v3
Skylake-X (2017)    skylakex      Yes                Includes AVX512
Icelake (2019)      icelake       Yes                Includes AVX512_VNNI
Alderlake (2021)    alderlake     ⚠️ Partial         Missing AVX_VNNI only
Raptor Lake (2022)  alderlake     ⚠️ Partial         Missing AVX_VNNI only