Ollama37 Docker Build System

A two-stage Docker build for Ollama with CUDA 11.4 and compute capability 3.7 (Tesla K80) support.

Overview

This Docker build system uses a two-stage architecture to build and run Ollama with Tesla K80 (compute capability 3.7) support:

  1. Builder Image (builder/Dockerfile) - Base environment with build tools

    • Rocky Linux 8
    • CUDA 11.4 toolkit (required for Tesla K80)
    • GCC 10 (built from source, required by CUDA 11.4)
    • CMake 4.0 (built from source)
    • Go 1.25.3
  2. Runtime Image (runtime/Dockerfile) - Two-stage build process

    • Stage 1 (compile): Clone source → Configure CMake → Build C/C++/CUDA → Build Go binary
    • Stage 2 (runtime): Copy artifacts → Setup runtime environment

The runtime uses the builder image as its base to ensure library path compatibility between build and runtime environments.
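
In Dockerfile terms, runtime/Dockerfile has roughly this shape (an illustrative sketch, assuming the builder image is tagged ollama37-builder:latest; the real file pins the exact paths, flags, and clone URL):

# Stage 2.1 - compile on the builder base
FROM ollama37-builder:latest AS compile
# clone the ollama37 source, configure CMake ("CUDA 11" preset),
# build the C/C++/CUDA libraries, then the Go binary

# Stage 2.2 - runtime on the same base, so library paths line up
FROM ollama37-builder:latest
COPY --from=compile /usr/local/src/ollama37 /usr/local/src/ollama37
# (the real file also places the binary at /usr/local/bin/ollama and sets LD_LIBRARY_PATH)
EXPOSE 11434
CMD ["/usr/local/bin/ollama", "serve"]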

Prerequisites

  • Docker with NVIDIA Container Runtime
  • Docker Compose
  • NVIDIA GPU drivers (470+ for Tesla K80)
  • Verify GPU access:
    docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.4.3-base-rockylinux8 nvidia-smi
    

Quick Start

1. Build Images

cd ollama37/docker
make build

This will:

  1. Build the builder image (if not present) - ~90 minutes first time
  2. Build the runtime image - ~10 minutes

First-time build: ~100 minutes total (includes building GCC 10 and CMake 4 from source)

Subsequent builds: ~10 minutes (builder image is cached)
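
To confirm both images exist after the build:

docker images | grep ollama37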

2. Start with Docker Compose

docker-compose up -d

Check logs:

docker-compose logs -f

Stop the server:

docker-compose down

3. Run Manually

docker run -d \
  --name ollama37 \
  --runtime=nvidia \
  --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama37:latest

Usage

Using the API

# List models
curl http://localhost:11434/api/tags

# Pull a model
curl http://localhost:11434/api/pull -d '{"name": "gemma3:4b"}'

# Run inference
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Using the CLI

# List models
docker exec ollama37 ollama list

# Pull a model
docker exec ollama37 ollama pull gemma3:4b

# Run a model
docker exec ollama37 ollama run gemma3:4b "Hello!"

Architecture

Build System Components

docker/
├── builder/
│   └── Dockerfile          # Base image: CUDA 11.4, GCC 10, CMake 4, Go 1.25.3
├── runtime/
│   └── Dockerfile          # Two-stage: compile ollama37, package runtime
├── Makefile                # Build orchestration (images only)
├── docker-compose.yml      # Runtime orchestration
└── README.md               # This file

Two-Stage Build Process

Stage 1: Builder Image (builder/Dockerfile)

Purpose: Provide consistent build environment

Contents:

  • Rocky Linux 8 base
  • CUDA 11.4 toolkit (compilation only, no driver)
  • GCC 10 from source (~60 min build time)
  • CMake 4.0 from source (~8 min build time)
  • Go 1.25.3 binary
  • All build dependencies

Build time: ~90 minutes (first time), cached thereafter

Image size: ~15GB

Stage 2: Runtime Image (runtime/Dockerfile)

Stage 2.1 - Compile (FROM ollama37-builder)

  1. Clone ollama37 source from GitHub
  2. Configure with CMake ("CUDA 11" preset for compute 3.7)
  3. Build C/C++/CUDA libraries
  4. Build Go binary
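
In shell terms, the compile stage amounts to roughly this (an illustrative sketch; the repository URL and exact flags are whatever runtime/Dockerfile actually pins):

git clone <ollama37-repo-url> /usr/local/src/ollama37   # URL as pinned in the Dockerfile
cd /usr/local/src/ollama37
cmake --preset "CUDA 11"          # preset targeting compute capability 3.7
cmake --build build -j$(nproc)    # C/C++/CUDA libraries
go build -o ollama .              # Go binary, built into the source tree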

Stage 2.2 - Runtime (FROM ollama37-builder)

  1. Copy entire source tree (includes compiled artifacts)
  2. Copy binary to /usr/local/bin/ollama
  3. Setup LD_LIBRARY_PATH for runtime libraries
  4. Configure server, expose ports, setup volumes

Build time: ~10 minutes

Image size: ~18GB (includes build environment + compiled Ollama)

Why Do Both Stages Use the Builder Base?

Problem: Compiled binaries have hardcoded library paths (via rpath/LD_LIBRARY_PATH)

Solution: Use identical base images for compile and runtime stages

Benefits:

  • Library paths match between build and runtime
  • All GCC 10 runtime libraries present
  • All CUDA libraries at expected paths
  • No complex artifact extraction/copying
  • Guaranteed compatibility

Trade-off: a larger runtime image (~18GB) in exchange for avoiding artifact-extraction complexity and library-compatibility issues
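
A quick way to verify the compatibility claim is to check that the shipped binary resolves every shared library inside the runtime image:

# prints nothing if all libraries resolve; any "not found" line indicates a path mismatch
docker run --rm ollama37:latest bash -c 'ldd /usr/local/bin/ollama | grep "not found"'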

Alternative: Single-Stage Build

See Dockerfile.single-stage.archived for the original single-stage design that inspired this architecture.

Build Commands

Using the Makefile

# Build both builder and runtime images
make build

# Build only builder image
make build-builder

# Build only runtime image (will auto-build builder if needed)
make build-runtime

# Remove all images
make clean

# Show help
make help

Direct Docker Commands

# Build builder image
docker build -f builder/Dockerfile -t ollama37-builder:latest builder/

# Build runtime image
docker build -f runtime/Dockerfile -t ollama37:latest .

Runtime Management

# Start server
docker-compose up -d

# View logs (live tail)
docker-compose logs -f

# Stop server
docker-compose down

# Stop and remove volumes
docker-compose down -v

# Restart server
docker-compose restart

Manual Docker Commands

# Start container
docker run -d \
  --name ollama37 \
  --runtime=nvidia \
  --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama37:latest

# View logs
docker logs -f ollama37

# Stop container
docker stop ollama37
docker rm ollama37

# Shell access
docker exec -it ollama37 bash

Configuration

Environment Variables

  • OLLAMA_HOST - Server listen address (default: 0.0.0.0:11434)
  • LD_LIBRARY_PATH - Library search path (default: /usr/local/src/ollama37/build/lib/ollama:/usr/local/lib64:/usr/local/cuda-11.4/lib64:/usr/lib64)
  • NVIDIA_VISIBLE_DEVICES - Which GPUs to use (default: all)
  • NVIDIA_DRIVER_CAPABILITIES - GPU capabilities (default: compute,utility)
  • OLLAMA_DEBUG - Enable verbose Ollama logging (unset by default)
  • GGML_CUDA_DEBUG - Enable CUDA/cuBLAS debug logging (unset by default)
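
To see which values are actually in effect inside a running container:

docker exec ollama37 env | grep -E 'OLLAMA|NVIDIA|LD_LIBRARY_PATH|GGML'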

Volume Mounts

  • /root/.ollama - Model storage (use Docker volume ollama-data)

Customizing docker-compose.yml

# Change port
ports:
  - "11435:11434"  # Host:Container

# Use specific GPU
environment:
  - NVIDIA_VISIBLE_DEVICES=0  # Use GPU 0 only

# Enable debug logging
environment:
  - OLLAMA_DEBUG=1
  - GGML_CUDA_DEBUG=1
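
The same overrides as a manual docker run, for setups without Compose (mirrors the snippets above):

docker run -d \
  --name ollama37 \
  --runtime=nvidia \
  --gpus all \
  -p 11435:11434 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e OLLAMA_DEBUG=1 \
  -e GGML_CUDA_DEBUG=1 \
  -v ollama-data:/root/.ollama \
  ollama37:latest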

GPU Support

Supported Compute Capabilities

  • 3.7 - Tesla K80 (primary target)
  • 5.0-5.2 - Maxwell (GTX 900 series)
  • 6.0-6.1 - Pascal (GTX 10 series)
  • 7.0-7.5 - Volta, Turing (RTX 20 series)
  • 8.0-8.6 - Ampere (RTX 30 series)

Tesla K80 Recommendations

VRAM: 12GB per GPU (the K80 is a dual-GPU board, 24GB total)

Model sizes:

  • Small (1-4B): Full precision or Q8 quantization
  • Medium (7-8B): Q4_K_M quantization
  • Large (13B+): Q4_0 quantization or multi-GPU
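
As a rough sanity check (ballpark figures, assuming ~0.6 bytes per parameter for Q4 quantization plus 1-2GB for KV cache and CUDA overhead): an 8B model at Q4_K_M needs on the order of 5GB for weights, which fits a single 12GB K80 GPU with headroom, while a 13B model at Q4_0 lands around 8GB and still fits on one GPU; anything much larger should be split across both GPUs.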

Tested models:

  • gemma3:4b
  • gpt-oss
  • deepseek-r1

Multi-GPU:

# Use all GPUs
docker run --gpus all ...

# Use specific GPU
docker run --gpus '"device=0"' ...

# Use multiple specific GPUs
docker run --gpus '"device=0,1"' ...

Troubleshooting

GPU not detected

# Check GPU visibility in container
docker exec ollama37 nvidia-smi

# Check CUDA libraries
docker exec ollama37 ldconfig -p | grep cuda

# Check NVIDIA runtime
docker info | grep -i runtime

Model fails to load

# Check logs with CUDA debug
docker run --rm --runtime=nvidia --gpus all \
  -e OLLAMA_DEBUG=1 \
  -e GGML_CUDA_DEBUG=1 \
  -p 11434:11434 \
  ollama37:latest

# Check library paths
docker exec ollama37 bash -c 'echo $LD_LIBRARY_PATH'

# Verify CUBLAS functions
docker exec ollama37 bash -c 'ldd /usr/local/bin/ollama | grep cublas'

Build fails with "out of memory"

# Edit runtime/Dockerfile line for cmake build
# Change: cmake --build build -j$(nproc)
# To: cmake --build build -j2

# Or set Docker memory limit
docker build --memory=8g ...

Port already in use

# Find process using port 11434
sudo lsof -i :11434

# Kill the process or change port in docker-compose.yml
ports:
  - "11435:11434"

Build cache issues

# Rebuild runtime image without cache
docker build --no-cache -f runtime/Dockerfile -t ollama37:latest .

# Rebuild builder image without cache
docker build --no-cache -f builder/Dockerfile -t ollama37-builder:latest builder/

# Remove all images and rebuild
make clean
make build

Rebuilding

Rebuild with latest code

# Runtime Dockerfile clones from GitHub, so rebuild to get latest
make build-runtime

# Restart container
docker-compose restart

Rebuild everything from scratch

# Stop and remove containers
docker-compose down -v

# Remove images
make clean

# Rebuild all
make build

# Start fresh
docker-compose up -d

Rebuild only builder (rare)

# Only needed if you change CUDA/GCC/CMake/Go versions
make clean
make build-builder
make build-runtime

Development

Modifying the build

  1. Change build tools - Edit builder/Dockerfile
  2. Change Ollama build process - Edit runtime/Dockerfile
  3. Change build orchestration - Edit Makefile
  4. Change runtime config - Edit docker-compose.yml

Testing changes

# Build with your changes
make build

# Run and test
docker-compose up -d
docker-compose logs -f

# If issues, check inside container
docker exec -it ollama37 bash

Shell access for debugging

# Enter running container
docker exec -it ollama37 bash

# Check GPU
nvidia-smi

# Check libraries
ldd /usr/local/bin/ollama
ldconfig -p | grep -E "cuda|cublas"

# Test binary
/usr/local/bin/ollama --version

Image Sizes

Image                     Size    Contents
ollama37-builder:latest   ~15GB   CUDA, GCC, CMake, Go, build deps
ollama37:latest           ~18GB   Builder + Ollama binary + libraries

Note: The large images guarantee that all runtime dependencies are present and properly linked.

Build Times

Task            First Build   Cached Build
Builder image   ~90 min       <1 min
Runtime image   ~10 min       ~10 min
Total           ~100 min      ~10 min

Breakdown (first build):

  • GCC 10: ~60 min
  • CMake 4: ~8 min
  • CUDA toolkit: ~10 min
  • Go install: ~1 min
  • Ollama build: ~10 min

License

MIT (same as upstream Ollama)