# Claude Code Development Notes

This document tracks development goals and notes for this Ollama repository fork.

## Project Goals

### 1. CUDA Compute Capability 3.7 Support (Tesla K80)
- **Objective**: Add support for CUDA compute capability 3.7 to enable running on Tesla K80 GPUs
- **Environment**:
  - GCC version: 10.5
  - CUDA version: 11.4.4
  - NVIDIA driver: 470
  - Target GPU: Tesla K80 (compute capability 3.7)
- **Status**: ✅ Complete

### 2. Code Documentation Policy
- **Issue**: This repo is cloned from official Ollama, which lacks code comments, making debugging difficult
- **Policy**: Add helpful comments when figuring out code functionality
- **Rationale**: Improve code maintainability and the debugging experience

## Implementation Summary

### Files Modified
1. `ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt` - Added the 3.7 compute capability to the default architecture list
2. `CMakePresets.json` - Added compute 3.7 to the "CUDA 11" preset and created a dedicated "CUDA 11 K80" preset
3. `ml/backend/ggml/ggml/src/CMakeLists.txt` - Enabled the Alderlake CPU variant without AVX_VNNI
4. `ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu` - Added a CUBLAS legacy-function fallback for Kepler GPU compatibility

### Key Changes
- Added `37-virtual` to `CMAKE_CUDA_ARCHITECTURES` (PTX with JIT compilation, for better compatibility)
- Updated the "CUDA 11" preset to include compute 3.7 alongside the other supported architectures
- Created the "CUDA 11 K80" preset for K80-only optimized builds
- Enabled the Alderlake CPU variant without AVX_VNNI (a GCC 10 limitation)
- Added the `-Wno-deprecated-gpu-targets` flag to suppress deprecation warnings

### CUDA Version Compatibility
- **CUDA 11.4.4 supports**: 37, 50, 52, 60, 61, 70, 75, 80, 86
- **CUDA 11.4.4 does NOT support**: 87 (requires 11.7+), 89 (requires 11.8+), 90 (requires 12.0+)
- CUDA 12+ dropped Kepler support entirely

### Tesla K80 CUBLAS Compatibility

**Challenge**: Tesla K80 (Kepler, compute 3.7) requires special handling for batched matrix multiplication due to:
1. Lack of Tensor Cores (introduced in Volta, compute 7.0+)
2. Architectural limitations with the modern CUBLAS `*Ex` function variants

**Solution - Two-Tier Fallback Strategy**:

**Tier 1: GEMM Algorithm Selection**
- Volta+ (cc >= 7.0): Use `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (value 99)
- Pre-Volta (cc < 7.0): Use `CUBLAS_GEMM_DEFAULT` (value -1)

**Tier 2: CUBLAS Function Selection**
- **Modern GPUs** (Volta+): Use `cublasGemmStridedBatchedEx` / `cublasGemmBatchedEx`
  - Support mixed precision, flexible compute types, and algorithm selection
- **Legacy GPUs** (Kepler/Maxwell/Pascal with FP32): Use `cublasSgemmStridedBatched` / `cublasSgemmBatched`
  - The `*Ex` variants have architectural requirements beyond algorithm selection
  - Even with `CUBLAS_GEMM_DEFAULT`, the `*Ex` functions fail with `CUBLAS_STATUS_ARCH_MISMATCH`
  - The legacy functions only support FP32, but work reliably on older architectures

**Modified Function**: `ggml_cuda_mul_mat_batched_cublas_impl` in `ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:1986`
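To make the two tiers concrete, here is a minimal host-side C++ sketch of the dispatch. It is illustrative only: `launch_batched_sgemm` and its parameter layout are made-up names for this note, not the actual `ggml_cuda_mul_mat_batched_cublas_impl` code, and it assumes FP32 data, no transposes, and column-major layout.

```cpp
#include <cublas_v2.h>

// Hedged sketch of the two-tier fallback described above. cc is the device
// compute capability encoded as major*10 + minor (e.g. 37 for a Tesla K80).
cublasStatus_t launch_batched_sgemm(cublasHandle_t handle, int cc,
                                    int m, int n, int k,
                                    const float *A, long long strideA,
                                    const float *B, long long strideB,
                                    float *C, long long strideC, int batch) {
    const float alpha = 1.0f, beta = 0.0f;

    if (cc >= 70) {
        // Modern path (Tier 1 + Tier 2): *Ex entry point with the
        // Tensor-Core algorithm available on Volta and newer.
        return cublasGemmStridedBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                          &alpha, A, CUDA_R_32F, m, strideA,
                                                  B, CUDA_R_32F, k, strideB,
                                          &beta,  C, CUDA_R_32F, m, strideC,
                                          batch, CUBLAS_COMPUTE_32F,
                                          CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    // Legacy path: on Kepler even CUBLAS_GEMM_DEFAULT through the *Ex entry
    // point can return CUBLAS_STATUS_ARCH_MISMATCH, so fall back to the
    // FP32-only legacy function (which takes no algorithm argument at all).
    return cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                     &alpha, A, m, strideA,
                                             B, k, strideB,
                                     &beta,  C, m, strideC, batch);
}
```

The actual modified function also covers the non-strided `cublasGemmBatchedEx` / `cublasSgemmBatched` pair mentioned above, but the branch structure follows the same two-tier idea.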
**Tested Models** (verified on Tesla K80):
- ✅ gemma3:4b
- ✅ gpt-oss
- ✅ deepseek-r1

## Build Instructions

### Complete Build from Scratch

```bash
# Clean any previous build artifacts
rm -rf build
go clean -cache

# Configure the build: run ONE of the two presets below
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --preset "CUDA 11"      # all CUDA 11.4 architectures
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --preset "CUDA 11 K80"  # K80-only optimized build

# Build the C/C++/CUDA libraries
CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++ cmake --build build -j$(nproc)

# Build the Go binary
go build -o ollama .
```

## Running Ollama

### Basic Server Start

```bash
# Start the Ollama server
./ollama serve

# Check GPU detection
nvidia-smi
```

### Debug and Logging Options

**Environment Variables**:
- `OLLAMA_DEBUG=1` - Enable verbose Ollama server logging
- `GGML_CUDA_DEBUG=1` - Enable detailed CUDA/CUBLAS operation logging (batched matrix multiplication)

```bash
# Run with Ollama verbose logging only
OLLAMA_DEBUG=1 ./ollama serve

# Run with both Ollama and CUDA debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve

# Capture all output to a file
./ollama serve 2>&1 | tee /tmp/ollama_serve.log

# Capture only stderr (warnings/errors) to a file
./ollama serve 2> /tmp/ollama_errors.log

# Run in the background with full logging
OLLAMA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_full.log &

# Run in the background with debug logging
OLLAMA_DEBUG=1 GGML_CUDA_DEBUG=1 ./ollama serve 2>&1 | tee /tmp/ollama_debug.log &

# Monitor a running background server
tail -f /tmp/ollama_full.log

# Tail recent log entries
tail -100 /tmp/ollama_full.log

# Stop all ollama processes
pkill ollama
```

**When to Use GGML_CUDA_DEBUG**:
- Debugging CUBLAS errors on Tesla K80 or other legacy GPUs
- Verifying compute capability detection
- Troubleshooting batched matrix multiplication issues
- Understanding which CUBLAS functions are being used (legacy vs. `*Ex` variants)

## CPU Architecture Compatibility

### The GCC/CUDA/Alderlake Constraint

This build faces a fundamental compatibility constraint.

**The Constraint Chain:**
1. **Tesla K80** (compute 3.7) → last supported by **driver 470.xx**
2. **Driver 470.256.02** → maximum CUDA version is **CUDA 11.4**
3. **CUDA 11.4** → maximum GCC version is **GCC 10** (enforced in `host_config.h`)
4. **AVX_VNNI** (Alderlake CPUs) → requires **GCC 11+** for the `-mavxvnni` flag

**Result:** Cannot have both K80 GPU support AND full Alderlake CPU optimization.
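Step 3 of the chain is a hard compile-time stop, not just a documentation note. The guard in CUDA 11.4's `host_config.h` looks roughly like the following (paraphrased, not verbatim; check the installed header for the exact wording):

```cpp
// Paraphrased sketch of the GCC version guard in CUDA 11.4's host_config.h.
// nvcc pulls this header into every host compilation, so building with
// GCC 11+ aborts immediately instead of producing a questionable binary.
#if defined(__GNUC__) && __GNUC__ > 10
#error -- unsupported GNU version! gcc versions later than 10 are not supported!
#endif
```

This is also why the build commands above pin `CC`/`CXX` to `/usr/local/bin/gcc` (GCC 10.5).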
### Solution: Alderlake Without AVX_VNNI

**Implementation:**
- The Alderlake CPU variant is **enabled** in the build
- The AVX_VNNI instruction set is **excluded** (requires GCC 11+)
- Alderlake still gets: SSE4.2, AVX, F16C, AVX2, BMI2, FMA optimizations
- Code falls back to `_mm256_maddubs_epi16()` for operations that would use VNNI (see the sketch after the support matrix below)

**Modified file:** `ml/backend/ggml/ggml/src/CMakeLists.txt` line 338

**Performance Impact:**
- Most operations: **No impact** (still uses AVX2, FMA, BMI2)
- INT8 dot products: **~10-20% slower** than native AVX_VNNI
- Overall model inference: **~3-7% slower** (depends on quantization)

### CPU Support Matrix

| CPU Generation | Variant Used | Full Optimization | Notes |
|----------------|--------------|-------------------|-------|
| Haswell (2013) | haswell | ✅ Yes | Xeon E5-2676 v3 |
| Skylake-X (2017) | skylakex | ✅ Yes | Includes AVX512 |
| Icelake (2019) | icelake | ✅ Yes | Includes AVX512_VNNI |
| Alderlake (2021) | alderlake | ⚠️ Partial | Missing AVX_VNNI only |
| Raptor Lake (2022) | alderlake | ⚠️ Partial | Missing AVX_VNNI only |
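For reference, here is a hedged sketch of what the VNNI-vs-AVX2 split looks like at the intrinsics level. `dot_u8i8()` is a made-up helper for illustration, not a ggml function; the pattern (multiply unsigned by signed bytes, widen, accumulate) is the standard one.

```cpp
// Compile with: g++ -mavx2 [-mavxvnni] -c dot_u8i8.cpp
#include <immintrin.h>

// Illustrative helper (not a ggml function): multiply-accumulate 32 pairs of
// unsigned-int8 x signed-int8 values into 8 int32 accumulator lanes.
static inline __m256i dot_u8i8(__m256i acc, __m256i u8, __m256i s8) {
#if defined(__AVXVNNI__)
    // GCC 11+ with -mavxvnni: one VNNI instruction does the whole step
    // (note the _avx_ spelling GCC uses for the VEX-encoded variant).
    return _mm256_dpbusd_avx_epi32(acc, u8, s8);
#else
    // GCC 10 fallback used in this build: u8*s8 -> saturating int16 pair sums,
    // then pairwise int16 -> int32 via multiply-by-1. The extra instructions
    // are the source of the ~10-20% INT8 dot-product slowdown noted above.
    __m256i p16 = _mm256_maddubs_epi16(u8, s8);
    __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
    return _mm256_add_epi32(acc, p32);
#endif
}
```

One behavioral caveat: `_mm256_maddubs_epi16` saturates at int16 while VNNI accumulates in int32, so the two paths can differ for extreme inputs, not just in speed.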