Ollama37 Docker Build System
Two-stage Docker build for Ollama with CUDA 11.4 and Compute Capability 3.7 support (Tesla K80)
Overview
This Docker build system uses a two-stage architecture to build and run Ollama with Tesla K80 (compute capability 3.7) support:
- Builder Image (builder/Dockerfile) - Base environment with build tools
  - Rocky Linux 8
  - CUDA 11.4 toolkit (required for Tesla K80)
  - GCC 10 (built from source, required by CUDA 11.4)
  - CMake 4.0 (built from source)
  - Go 1.25.3
- Runtime Image (runtime/Dockerfile) - Two-stage build process
  - Stage 1 (compile): Clone source → Configure CMake → Build C/C++/CUDA → Build Go binary
  - Stage 2 (runtime): Copy artifacts → Setup runtime environment
The runtime uses the builder image as its base to ensure library path compatibility between build and runtime environments.
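As a quick sanity check (a sketch, assuming both images are built locally under the names used in this README), the runtime image's layer list should begin with the builder's layers, since its final stage starts FROM ollama37-builder:
# The builder's layer list should be a prefix of the runtime's
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' ollama37-builder:latest > /tmp/builder.layers
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' ollama37:latest > /tmp/runtime.layers
head -n "$(wc -l < /tmp/builder.layers)" /tmp/runtime.layers | diff - /tmp/builder.layers && echo "runtime shares builder base"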
Prerequisites
- Docker with NVIDIA Container Runtime
- Docker Compose
- NVIDIA GPU drivers (470+ for Tesla K80)
- Verify GPU access:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.4.3-base-rockylinux8 nvidia-smi
Quick Start
1. Build Images
cd /home/jack/Documents/ollama37/docker
make build
This will:
- Build the builder image (if not present) - ~90 minutes first time
- Build the runtime image - ~10 minutes
First-time build: ~100 minutes total (includes building GCC 10 and CMake 4 from source)
Subsequent builds: ~10 minutes (builder image is cached)
2. Run with Docker Compose (Recommended)
docker-compose up -d
Check logs:
docker-compose logs -f
Stop the server:
docker-compose down
3. Run Manually
docker run -d \
--name ollama37 \
--runtime=nvidia \
--gpus all \
-p 11434:11434 \
-v ollama-data:/root/.ollama \
ollama37:latest
Usage
Using the API
# List models
curl http://localhost:11434/api/tags
# Pull a model
curl http://localhost:11434/api/pull -d '{"name": "gemma3:4b"}'
# Run inference
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Using the CLI
# List models
docker exec ollama37 ollama list
# Pull a model
docker exec ollama37 ollama pull gemma3:4b
# Run a model
docker exec ollama37 ollama run gemma3:4b "Hello!"
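When scripting against a freshly started container, the server may not be ready immediately. A minimal smoke-test sketch (host, port, and model are just the defaults used in this README):
# Wait until the API answers, then run one generation
until curl -sf http://localhost:11434/api/tags >/dev/null; do sleep 1; done
curl -s http://localhost:11434/api/generate -d '{"model": "gemma3:4b", "prompt": "ping", "stream": false}'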
Architecture
Build System Components
docker/
├── builder/
│ └── Dockerfile # Base image: CUDA 11.4, GCC 10, CMake 4, Go 1.25.3
├── runtime/
│ └── Dockerfile # Two-stage: compile ollama37, package runtime
├── Makefile # Build orchestration (images only)
├── docker-compose.yml # Runtime orchestration
└── README.md # This file
Two-Stage Build Process
Stage 1: Builder Image (builder/Dockerfile)
Purpose: Provide consistent build environment
Contents:
- Rocky Linux 8 base
- CUDA 11.4 toolkit (compilation only, no driver)
- GCC 10 from source (~60 min build time)
- CMake 4.0 from source (~8 min build time)
- Go 1.25.3 binary
- All build dependencies
Build time: ~90 minutes (first time), cached thereafter
Image size: ~15GB
Stage 2: Runtime Image (runtime/Dockerfile)
Stage 2.1 - Compile (FROM ollama37-builder)
- Clone ollama37 source from GitHub
- Configure with CMake ("CUDA 11" preset for compute 3.7)
- Build C/C++/CUDA libraries
- Build Go binary
Stage 2.2 - Runtime (FROM ollama37-builder)
- Copy entire source tree (includes compiled artifacts)
- Copy binary to /usr/local/bin/ollama
- Setup LD_LIBRARY_PATH for runtime libraries
- Configure server, expose ports, setup volumes
Build time: ~10 minutes
Image size: ~18GB (includes build environment + compiled Ollama)
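To spot-check that the compiled artifacts landed where the runtime expects them (paths taken from the LD_LIBRARY_PATH default in the Configuration section; a diagnostic sketch, not part of the build):
# Confirm the binary and the compiled libraries are in place
docker run --rm ollama37:latest bash -c 'ls -l /usr/local/bin/ollama && ls /usr/local/src/ollama37/build/lib/ollama | head'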
Why Do Both Stages Use the Builder Base?
Problem: Compiled binaries have hardcoded library paths (via rpath/LD_LIBRARY_PATH)
Solution: Use identical base images for compile and runtime stages
Benefits:
- ✅ Library paths match between build and runtime
- ✅ All GCC 10 runtime libraries present
- ✅ All CUDA libraries at expected paths
- ✅ No complex artifact extraction/copying
- ✅ Guaranteed compatibility
Trade-off: a larger runtime image (~18GB) in exchange for avoiding the complexity and reliability issues of extracting build artifacts into a slimmer base.
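You can verify this compatibility from inside a running container; a diagnostic sketch, assuming binutils is available (it ships with the build toolchain):
# Show embedded search paths, then confirm every shared library resolves
docker exec ollama37 bash -c 'readelf -d /usr/local/bin/ollama | grep -E "RPATH|RUNPATH"; ldd /usr/local/bin/ollama | grep "not found" || echo "all libraries resolved"'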
Alternative: Single-Stage Build
See Dockerfile.single-stage.archived for the original single-stage design that inspired this architecture.
Build Commands
Using the Makefile
# Build both builder and runtime images
make build
# Build only builder image
make build-builder
# Build only runtime image (will auto-build builder if needed)
make build-runtime
# Remove all images
make clean
# Show help
make help
Direct Docker Commands
# Build builder image
docker build -f builder/Dockerfile -t ollama37-builder:latest builder/
# Build runtime image
docker build -f runtime/Dockerfile -t ollama37:latest .
Runtime Management
Using Docker Compose (Recommended)
# Start server
docker-compose up -d
# View logs (live tail)
docker-compose logs -f
# Stop server
docker-compose down
# Stop and remove volumes
docker-compose down -v
# Restart server
docker-compose restart
Manual Docker Commands
# Start container
docker run -d \
--name ollama37 \
--runtime=nvidia \
--gpus all \
-p 11434:11434 \
-v ollama-data:/root/.ollama \
ollama37:latest
# View logs
docker logs -f ollama37
# Stop container
docker stop ollama37
docker rm ollama37
# Shell access
docker exec -it ollama37 bash
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 0.0.0.0:11434 | Server listen address |
| LD_LIBRARY_PATH | /usr/local/src/ollama37/build/lib/ollama:/usr/local/lib64:/usr/local/cuda-11.4/lib64:/usr/lib64 | Library search path |
| NVIDIA_VISIBLE_DEVICES | all | Which GPUs to use |
| NVIDIA_DRIVER_CAPABILITIES | compute,utility | GPU capabilities |
| OLLAMA_DEBUG | (unset) | Enable verbose Ollama logging |
| GGML_CUDA_DEBUG | (unset) | Enable CUDA/cuBLAS debug logging |
Volume Mounts
/root/.ollama - Model storage (use Docker volume ollama-data)
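To back up the model store, the standard Docker volume pattern works (a sketch; the alpine image and archive name are arbitrary choices):
# Archive the ollama-data volume to the current directory
docker run --rm -v ollama-data:/data -v "$(pwd)":/backup alpine tar czf /backup/ollama-data.tar.gz -C /data .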
Customizing docker-compose.yml
# Change port
ports:
- "11435:11434" # Host:Container
# Use specific GPU
environment:
- NVIDIA_VISIBLE_DEVICES=0 # Use GPU 0 only
# Enable debug logging
environment:
- OLLAMA_DEBUG=1
- GGML_CUDA_DEBUG=1
GPU Support
Supported Compute Capabilities
- 3.7 - Tesla K80 (primary target)
- 5.0-5.2 - Maxwell (GTX 900 series)
- 6.0-6.1 - Pascal (GTX 10 series)
- 7.0-7.5 - Volta, Turing (RTX 20 series)
- 8.0-8.6 - Ampere (RTX 30 series)
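To confirm what your card reports (the compute_cap query field requires a reasonably recent driver; if it is missing, look the card up in NVIDIA's CUDA GPU tables instead):
# Report name and compute capability for each visible GPU
docker exec ollama37 nvidia-smi --query-gpu=name,compute_cap --format=csv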
Tesla K80 Recommendations
VRAM: 12GB per GPU (24GB for dual-GPU K80)
Model sizes:
- Small (1-4B): Full precision or Q8 quantization
- Medium (7-8B): Q4_K_M quantization
- Large (13B+): Q4_0 quantization or multi-GPU
Tested models:
- ✅ gemma3:4b
- ✅ gpt-oss
- ✅ deepseek-r1
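To see how a model actually fits (gemma3:4b here is just one of the tested models; ollama ps reports model size and the GPU/CPU split):
# Load a model, then inspect placement and per-GPU memory
docker exec ollama37 ollama run gemma3:4b "hi" >/dev/null
docker exec ollama37 ollama ps
docker exec ollama37 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv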
Multi-GPU:
# Use all GPUs
docker run --gpus all ...
# Use specific GPU
docker run --gpus '"device=0"' ...
# Use multiple specific GPUs
docker run --gpus '"device=0,1"' ...
Troubleshooting
GPU not detected
# Check GPU visibility in container
docker exec ollama37 nvidia-smi
# Check CUDA libraries
docker exec ollama37 ldconfig -p | grep cuda
# Check NVIDIA runtime
docker info | grep -i runtime
Model fails to load
# Check logs with CUDA debug
docker run --rm --runtime=nvidia --gpus all \
-e OLLAMA_DEBUG=1 \
-e GGML_CUDA_DEBUG=1 \
-p 11434:11434 \
ollama37:latest
# Check library paths
docker exec ollama37 bash -c 'echo $LD_LIBRARY_PATH'
# Verify CUBLAS functions
docker exec ollama37 bash -c 'ldd /usr/local/bin/ollama | grep cublas'
Build fails with "out of memory"
# Edit runtime/Dockerfile line for cmake build
# Change: cmake --build build -j$(nproc)
# To: cmake --build build -j2
# Or set Docker memory limit
docker build --memory=8g ...
Port already in use
# Find process using port 11434
sudo lsof -i :11434
# Kill the process or change port in docker-compose.yml
ports:
- "11435:11434"
Build cache issues
# Rebuild runtime image without cache
docker build --no-cache -f runtime/Dockerfile -t ollama37:latest .
# Rebuild builder image without cache
docker build --no-cache -f builder/Dockerfile -t ollama37-builder:latest builder/
# Remove all images and rebuild
make clean
make build
Rebuilding
Rebuild with latest code
# Runtime Dockerfile clones from GitHub, so rebuild to get the latest code
# (add --no-cache if Docker's layer cache is reusing an old clone)
make build-runtime
# Restart container
docker-compose restart
Rebuild everything from scratch
# Stop and remove containers
docker-compose down -v
# Remove images
make clean
# Rebuild all
make build
# Start fresh
docker-compose up -d
Rebuild only builder (rare)
# Only needed if you change CUDA/GCC/CMake/Go versions
make clean
make build-builder
make build-runtime
Development
Modifying the build
- Change build tools - Edit builder/Dockerfile
- Change Ollama build process - Edit runtime/Dockerfile
- Change build orchestration - Edit Makefile
- Change runtime config - Edit docker-compose.yml
Testing changes
# Build with your changes
make build
# Run and test
docker-compose up -d
docker-compose logs -f
# If issues, check inside container
docker exec -it ollama37 bash
Shell access for debugging
# Enter running container
docker exec -it ollama37 bash
# Check GPU
nvidia-smi
# Check libraries
ldd /usr/local/bin/ollama
ldconfig -p | grep -E "cuda|cublas"
# Test binary
/usr/local/bin/ollama --version
Image Sizes
| Image | Size | Contents |
|---|---|---|
| ollama37-builder:latest | ~15GB | CUDA, GCC, CMake, Go, build deps |
| ollama37:latest | ~18GB | Builder + Ollama binary + libraries |
Note: The large image size is the cost of keeping the full build environment, which ensures all runtime dependencies are present and properly linked.
Build Times
| Task | First Build | Cached Build |
|---|---|---|
| Builder image | ~90 min | <1 min |
| Runtime image | ~10 min | ~10 min |
| Total | ~100 min | ~10 min |
Breakdown (first build):
- GCC 10: ~60 min
- CMake 4: ~8 min
- CUDA toolkit: ~10 min
- Go install: ~1 min
- Ollama build: ~10 min
Documentation
- ../CLAUDE.md - Project goals, implementation details, and technical notes
- Upstream Ollama - Original Ollama project
- dogkeeper886/ollama37 - This fork with K80 support
License
MIT (same as upstream Ollama)