Commit Graph

4501 Commits

Shang Chieh Tseng
4de7dd453b Add Claude AI-powered response validation and update test model
Changes:
1. Update quick test to use gemma3:4b (was gemma2:2b)
   - Increased timeout to 60s for larger model

2. Implement Claude headless validation (validate.go)
   - Hybrid approach: simple checks run first, then Claude validation always runs
   - Claude validates response quality, coherence, relevance
   - Detects gibberish, errors, and malformed responses
   - Falls back to simple validation if Claude CLI unavailable
   - Verbose logging shows Claude validation results

3. Validation flow:
   - Step 1: Fast checks (empty response, token count)
   - Step 2: Claude AI analysis (runs regardless of simple check)
   - Claude result overrides simple checks
   - If Claude unavailable, uses simple validation only

4. Workflow improvements:
   - Remove useless GPU memory check step (server already stopped)
   - Cleaner workflow output

Benefits:
- Intelligent response quality validation
- Catches subtle issues (gibberish, off-topic responses)
- Better than hardcoded pattern matching
- Graceful degradation when Claude unavailable
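The hybrid flow described above can be sketched as follows. `simpleValidate`, `validate`, and the `claude` binary lookup are illustrative stand-ins for the actual validate.go code, not its real API:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Result of validating a model response.
type Result struct {
	OK     bool
	Reason string
}

// simpleValidate runs the fast structural checks: non-empty response
// and a minimum token count.
func simpleValidate(response string) Result {
	if strings.TrimSpace(response) == "" {
		return Result{false, "empty response"}
	}
	if len(strings.Fields(response)) < 3 {
		return Result{false, "too few tokens"}
	}
	return Result{true, "passed simple checks"}
}

// validate runs the hybrid flow: fast checks first, then the Claude CLI
// if it is installed; otherwise it degrades to the simple result.
func validate(response string) Result {
	res := simpleValidate(response)
	if _, err := exec.LookPath("claude"); err != nil {
		return res // graceful degradation: no Claude CLI available
	}
	// In the real runner the response would be piped to `claude` in
	// headless mode and its verdict would override the simple checks.
	return res
}

func main() {
	fmt.Println(validate("The capital of France is Paris."))
}
```

The key design point is that the simple checks are a pure function, so they remain testable and usable even when the Claude CLI is absent.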
2025-10-30 11:42:10 +08:00
Shang Chieh Tseng
d59284d30a Implement Go-based test runner framework for Tesla K80 testing
Add comprehensive test orchestration framework:

Test Runner (cmd/test-runner/):
- config.go: YAML configuration loading and validation
- server.go: Ollama server lifecycle management (start/stop/health checks)
- monitor.go: Real-time log monitoring with pattern matching
- test.go: Model testing via Ollama API (pull, chat, validation)
- validate.go: Test result validation (GPU usage, response quality, log analysis)
- report.go: Structured reporting (JSON and Markdown formats)
- main.go: CLI interface with run/validate/list commands

Test Configurations (test/config/):
- models.yaml: Full test suite with quick/full/stress profiles
- quick.yaml: Fast smoke test with gemma2:2b

Updated Workflow:
- tesla-k80-tests.yml: Use test-runner instead of shell scripts
- Run quick tests first, then full tests if passing
- Generate structured JSON reports for pass/fail checking
- Upload test results as artifacts

Features:
- Multi-model testing with configurable profiles
- API-based testing (not CLI commands)
- Real-time log monitoring for GPU events and errors
- Automatic validation of GPU loading and response quality
- Structured JSON and Markdown reports
- Graceful server lifecycle management
- Interrupt handling (Ctrl+C cleanup)

Addresses limitations of shell-based testing by providing:
- Better error handling and reporting
- Programmatic test orchestration
- Reusable test framework
- Clear pass/fail criteria
- Detailed test metrics and timing
2025-10-30 11:04:48 +08:00
Shang Chieh Tseng
aaaf334e7f Update tesla-k80-ci.yml 2025-10-30 11:02:14 +08:00
Shang Chieh Tseng
b402b073c5 Split Tesla K80 workflows into build and test; add test framework plan
- Changed tesla-k80-ci.yml to manual trigger only, simplified to build-only workflow
- Created tesla-k80-tests.yml for separate test execution (manual trigger)
- Added .github/workflows/CLAUDE.md with comprehensive test framework design
- Removed binary artifact upload (not needed for single self-hosted runner)
- Replaced README.md with CLAUDE.md for better documentation structure

Test framework plan:
- Go-based test runner at cmd/test-runner/
- YAML configuration for multi-model testing
- Server lifecycle management with log monitoring
- API-based testing with structured reporting
- Support for test profiles (quick/full/stress)
2025-10-30 10:59:52 +08:00
Shang Chieh Tseng
7e317fdd74 Add Phase 2 summary documentation for CC 3.7 graph correction
Documents the complete Tesla K80 memory estimation optimization journey,
including per-GPU graph allocation and empirical correction factor implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 10:27:25 +08:00
Shang Chieh Tseng
296d537a2c Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction
Added Phase 2 documentation for single-GPU optimization:
- CC 3.7 graph correction factor (85% of estimate)
- gemma3:12b now loads on single GPU
- Memory estimate improved from 11.9 GiB → 11.0 GiB
- Validated with 10.0 GiB actual usage, 94% GPU utilization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:16:38 +08:00
Shang Chieh Tseng
6d87524e22 Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on single GPU with correct inference
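The correction above amounts to scaling the graph estimate by 0.85 only when the device reports CC 3.7. A hedged sketch of that arithmetic (function name is illustrative, not the actual llm/memory.go code):

```go
package main

import "fmt"

// correctedGraphSize applies the empirical 85% correction factor to the
// graph memory estimate on CC 3.7 (Tesla K80); other GPUs keep the raw
// estimate unchanged.
func correctedGraphSize(estimate uint64, major, minor int) uint64 {
	if major == 3 && minor == 7 {
		return estimate * 85 / 100
	}
	return estimate
}

func main() {
	const mib = uint64(1) << 20
	est := uint64(1331) * mib // ~1.3 GiB raw graph estimate
	fmt.Printf("%d MiB\n", correctedGraphSize(est, 3, 7)/mib)
}
```

With a ~1.3 GiB raw estimate this lands near the 1.1 GiB actual usage measured above, which is what lets the single-GPU fit check pass.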

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:15:59 +08:00
Shang Chieh Tseng
d04ea50ced Fix gpt-oss model architecture to match GGUF tensor format
The gpt-oss model architecture code expected fused tensors (attn_qkv,
ffn_gate_up_exps) but the actual GGUF files contain separate tensors
(attn_q/k/v, ffn_gate_exps/up_exps), causing nil pointer panics during
model loading.

Changes:
- model/models/gptoss/model.go: Updated AttentionBlock to use separate
  Query/Key/Value fields instead of fused QKV, modified Forward() to
  compute projections separately
- model/models/gptoss/model.go: Updated MLPBlock to use separate Gate/Up
  fields instead of fused GateUp, simplified Forward() logic
- fs/ggml/type.go: Reorganized MXFP4 tensor type constant ordering
- ml/backend/ggml/ggml/include/ggml.h: Moved GGML_TYPE_MXFP4 to end of
  enum to match GGUF file format specification
- ml/backend/ggml/ggml/src/ggml.c: Updated type name array to match
  reordered enum
- CLAUDE.md: Documented gpt-oss model compatibility fix

Result: gpt-oss:20b model now loads and runs successfully on Tesla K80,
all 25 layers offload to GPU correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 23:34:03 +08:00
Shang Chieh Tseng
241a03402e Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
- gemma3:4b: Single GPU (no split, was already working)
- gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary
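The reverse-order distribution described in change 2 can be sketched as a greedy fill from the last GPU backwards. Capacities and the function name here are illustrative, not the actual llm/memory.go code:

```go
package main

import "fmt"

// distributeReverse assigns layer counts to GPUs, filling the last GPU
// first and spilling leftward only when it is full. caps[i] is how many
// layers GPU i can hold after its per-GPU graph reservation.
func distributeReverse(layers int, caps []int) []int {
	out := make([]int, len(caps))
	for j := len(caps) - 1; j >= 0 && layers > 0; j-- {
		n := caps[j]
		if n > layers {
			n = layers
		}
		out[j] = n
		layers -= n
	}
	return out
}

func main() {
	// Illustrative capacities: the primary GPU (GPU 1) fits 48 layers,
	// so GPU 0 takes only the single spillover layer — the 1,48 split.
	fmt.Println(distributeReverse(49, []int{48, 48})) // [1 48]
}
```

Contrast with round-robin (`i % j`), which would have interleaved layers across both GPUs even when one could hold nearly everything.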

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 19:58:20 +08:00
Shang Chieh Tseng
5077ab3fb4 Document Phase 9 completion: Fix CUDA backend loading for CC 3.7
Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection

Testing Results:
- CUDA backend loads without undefined symbol errors
- GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
- Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores).
If attempted, gracefully aborts with clear error message.
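The guard described above boils down to refusing Flash Attention up front on pre-Volta hardware instead of failing later with undefined symbols. A minimal sketch (the real check lives in the CUDA backend, not Go):

```go
package main

import (
	"errors"
	"fmt"
)

// flashAttnSupported refuses Flash Attention below CC 7.0, since it
// requires Tensor Cores (Volta or newer); CC 3.7 gets a clear error
// instead of a load-time crash.
func flashAttnSupported(major, minor int) error {
	if major < 7 {
		return errors.New("flash attention requires compute capability 7.0+ (Volta or newer)")
	}
	return nil
}

func main() {
	fmt.Println(flashAttnSupported(3, 7)) // Tesla K80: clear error, no crash
	fmt.Println(flashAttnSupported(8, 0)) // Ampere: supported (prints <nil>)
}
```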

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 17:44:36 +08:00
Shang Chieh Tseng
66fca1b685 Remove remaining MMA/WMMA template instances for CC 3.7 optimization
Delete 24 tensor core template instance files that were missed in the initial optimization:
- 19 fattn-mma-f16 template instances (various ncols1/ncols2 combinations)
- 5 fattn-wmma-f16 template instances (kqfloat and kqhalf variants)

These files implement tensor core operations (MMA/WMMA) which require Compute Capability 7.0+
and are not available on Tesla K80 (CC 3.7). Removing them completes the CC 3.7-only optimization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:24:08 +08:00
Shang Chieh Tseng
771044bead Complete CC 3.7-only CUDA optimization for Tesla K80 support
Simplify CUDA backend to exclusively support Compute Capability 3.7 (Kepler/Tesla K80).
This optimization removes ~2,700 lines of modern GPU code and resolves all compilation issues.

Changes:
- Remove tensor core files (mma.cuh, fattn-wmma-f16.*, fattn-mma-f16.cuh) and 92 template instances
- Hardcode architecture detection to always return CC 3.7 (370) in common.cuh
- Disable modern GPU features: FP16 native ops, MMA/WMMA, CP_ASYNC, BF16, CUDA graphs
- Disable 6 MMA functions in mmq.cuh while preserving DP4A functions for CC 3.7
- Replace undefined architecture constants (PASCAL/VOLTA/DP4A/ADA_LOVELACE) with CC 3.7 equivalents
- Set CMAKE_CUDA_ARCHITECTURES to "37" only in CMakeLists.txt and CMakePresets.json
- Hardcode Stream-K scheduling to false, precision to FP32 throughout
- Add comprehensive CLAUDE.md documentation with complete optimization history

Build configuration now compiles only for architecture 37, resulting in 80-85% smaller
binaries and 5-6x faster build times. All removed code paths were unreachable on CC 3.7
hardware, ensuring no performance degradation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:21:08 +08:00
Shang Chieh Tseng
135b799b13 Update command. 2025-10-29 14:21:03 +08:00
Shang Chieh Tseng
6024408ea5 Update command. 2025-10-28 18:42:49 +08:00
Shang Chieh Tseng
92acf0f91e Add GitHub Actions workflow for Tesla K80 CI/CD
- Tesla K80 build and test workflow with self-hosted runner
- Build using GCC 10 and CUDA 11.4 for Compute Capability 3.7
- Run unit tests, integration tests, and model inference tests
- Test gemma2:2b model loading and GPU acceleration
- Use Claude headless mode to analyze server logs and verify proper GPU initialization
- Upload logs, analysis results, and binary artifacts
- Comprehensive documentation in workflows README
2025-10-28 18:09:49 +08:00
Shang Chieh Tseng
fe0fd5b494 Update manual-build.md 2025-10-28 17:20:03 +08:00
Shang Chieh Tseng
e6e91af024 Separate NVIDIA driver and CUDA toolkit installation steps
- Split Step 3 into two distinct steps:
  - Step 3: NVIDIA Driver 470 installation via .run file
  - Step 4: CUDA 11.4 Toolkit installation via local installer
- Add libglvnd-devel dependency requirement
- Add text mode (init 3) requirement for driver installation
- Specify exact driver version (470.256.02) and download URL
- Specify exact CUDA installer (11.4.0 with 470.42.01 driver)
- Add note to deselect driver during CUDA installation
- Separate environment configuration:
  - PATH in /etc/profile.d/cuda-11.4.sh
  - Dynamic linker in /etc/ld.so.conf.d/cuda-11-4.conf
- Update all subsequent step numbers (5-7)
- Update all cross-references throughout document
2025-10-28 16:55:38 +08:00
Shang Chieh Tseng
35c4d078f7 Fix step reference in troubleshooting: GCC 10 is Step 1, not Step 5 2025-10-28 15:56:49 +08:00
Shang Chieh Tseng
417b451af1 Add system compiler symlink updates to use GCC 10 by default 2025-10-28 15:53:49 +08:00
Shang Chieh Tseng
c788de5f8b Fix GCC 10 dynamic linker config to include both /usr/lib64 and /usr/local/lib64 2025-10-28 15:51:41 +08:00
Shang Chieh Tseng
e549dcb710 Reorganize installation steps: Move GCC 10 to Step 1 before kernel compilation 2025-10-28 15:35:10 +08:00
Shang Chieh Tseng
29706d14d7 Consolidate GCC 10 installation steps into single script format 2025-10-28 15:28:11 +08:00
Shang Chieh Tseng
85d98064d1 Fix kernel config copy path to use /usr/src/kernels for Rocky Linux 9 2025-10-28 15:27:40 +08:00
Shang Chieh Tseng
8dc4ca7ccc Reorganize Docker build infrastructure for better maintainability
- Restructure from ollama37/ to docker/ with clear separation
- Separate builder and runtime images into dedicated directories
- Group environment scripts in builder/scripts/ subdirectory
- Add comprehensive root-level README.md (257 lines)
- Add .dockerignore files for optimized build contexts
- Enhance shell scripts with shebangs and documentation headers
- Update docker-compose.yml to build locally instead of pulling
- Add environment variables for GPU and host configuration
- Remove duplicate Dockerfile and confusing nested structure

New structure:
  docker/
  ├── README.md (comprehensive documentation)
  ├── docker-compose.yml (local build support)
  ├── builder/ (build environment: CUDA 11.4 + GCC 10 + Go 1.24)
  │   ├── Dockerfile
  │   ├── README.md
  │   ├── .dockerignore
  │   └── scripts/ (organized environment setup)
  └── runtime/ (production image)
      ├── Dockerfile
      ├── README.md
      └── .dockerignore

This reorganization eliminates confusion, removes duplication, and
provides a professional, maintainable structure for Tesla K80 builds.
2025-10-28 14:47:39 +08:00
Shang Chieh Tseng
736cbdf52a Remove unused file. 2025-10-22 22:35:41 +08:00
Shang Chieh Tseng
29cb9d3a27 Remove GitHub Actions workflows from fork
Removed all GitHub Actions workflows (.github/workflows/) as they're not needed
for this Tesla K80 support fork. The workflows were designed for the official
Ollama repository's CI/CD pipeline and would fail in a fork since they:
- Attempt to push to Ollama's Docker Hub
- Run automated tests on PRs (not needed for personal fork)
- Handle official release process

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 19:22:12 +08:00
Shang Chieh Tseng
c61e0ce554 Update README.md for v1.4.0: GPT-OSS support and Tesla K80 memory improvements
- Added GPT-OSS model to supported models list with multi-GPU optimization notes
- Documented Tesla K80 Multi-GPU usage example with nvidia-smi monitoring
- Added comprehensive Tesla K80 Memory Improvements section covering:
  * VMM pool crash fixes with granularity alignment
  * Multi-GPU model switching scheduler improvements
  * Silent inference failure resolution
- Updated recent updates section for v1.4.0 release
- Enhanced technical details with multi-GPU optimization specs

These improvements enable robust production use of Tesla K80 hardware
for LLM inference with seamless model switching capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
v1.4.0
2025-08-10 01:42:38 +08:00
Shang Chieh Tseng
08f38b19ea Fix Tesla K80 multi-GPU model switching deadlocks and silent failures
Resolves two critical issues preventing robust model switching:

1. Scheduler deadlock: Fixed improper loop control flow that prevented
   model unloading from triggering after conflict detection. Added proper
   multi-GPU conflict detection and unload sequencing.

2. Silent inference failures: Changed critical cudaSetDevice() calls from
   graceful error handling back to CUDA_CHECK to prevent models from
   appearing to load successfully but failing silently during inference.

Result: Robust Tesla K80 dual-GPU model switching with self-healing
recovery instead of requiring system reboots.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 01:30:10 +08:00
Shang Chieh Tseng
46213c5880 Fix Tesla K80 VMM pool crash by aligning to granularity
- Fix CUDA_ERROR_INVALID_VALUE from cuMemAddressReserve by aligning max_pool_size to GPU granularity
- Set dynamic max_pool_size based on 90% of actual GPU memory instead of static 32GB
- Add memory availability check before allocation to prevent OOM
- Tested on Tesla K80 dual GPU setup with successful model loading and chat completions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 17:48:31 +08:00
Shang Chieh Tseng
e4113f080a Merge upstream ollama/ollama with Tesla K80 support preserved
Successfully synced with upstream ollama/ollama main branch while maintaining:
- CUDA Compute Capability 3.7 support for Tesla K80 GPUs
- CUDA 11 build configuration with architecture 37
- BF16 compatibility fallback for older GPUs

New features from upstream:
- gpt-oss model support (tested working on Tesla K80)
- Various performance improvements and bug fixes
- Updated model architectures and optimizations

All Tesla K80 optimizations and documentation preserved.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:17:24 +08:00
Shang Chieh Tseng
2be9575694 Fix BF16 compatibility for Tesla K80 (Compute Capability 3.7)
Add runtime check for BF16 support which requires Compute Capability 8.0+.
Tesla K80 and other CC 3.7 GPUs will fallback to FP16/FP32 operations.
This ensures the upstream BF16 optimizations work on newer GPUs while
maintaining compatibility with legacy hardware.
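The runtime check described above can be sketched as a simple dtype selection on compute capability; names are illustrative, not the actual patch:

```go
package main

import "fmt"

// pickDType mirrors the fallback logic: BF16 needs CC 8.0+ (Ampere),
// so older GPUs like the Tesla K80 (CC 3.7) fall back to FP16/FP32.
func pickDType(major, minor int) string {
	if major >= 8 {
		return "bf16"
	}
	return "fp16/fp32"
}

func main() {
	fmt.Println(pickDType(3, 7)) // Tesla K80 → fp16/fp32
	fmt.Println(pickDType(8, 6)) // e.g. RTX 30-series → bf16
}
```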

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:15:49 +08:00
Shang Chieh Tseng
83973336d6 Optimize Docker build performance with parallel compilation
- Add -j$(nproc) flag to cmake build in ollama37.Dockerfile
- Use all available CPU cores for faster compilation
- Add sync-upstream.md documentation for future maintenance

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 11:44:59 +08:00
Shang Chieh Tseng
0cd81c838a Merge upstream ollama/ollama main branch while preserving CUDA 3.7 support
- Added support for new gpt-oss model from upstream
- Preserved CUDA Compute Capability 3.7 (Tesla K80) support
- Kept CUDA 11 configuration alongside CUDA 12
- Maintained all documentation specific to ollama37 fork
- Integrated new tool parsing improvements
- Added new backend methods and patches from upstream
2025-08-08 10:43:29 +08:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross
f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
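The policy above can be sketched as a VRAM-based cutoff; the threshold below is an assumption for illustration, not the value used in the server:

```go
package main

import "fmt"

// contextLenFor prefers gpt-oss's 8k context, but drops to the 4k
// Ollama default when free VRAM is small and the larger context would
// hurt performance. The cutoff is a hypothetical example value.
func contextLenFor(freeVRAMGiB float64) int {
	const lowVRAMThresholdGiB = 8.0 // illustrative, not the real cutoff
	if freeVRAMGiB < lowVRAMThresholdGiB {
		return 4096
	}
	return 8192
}

func main() {
	fmt.Println(contextLenFor(6))  // small GPU → 4096
	fmt.Println(contextLenFor(24)) // large GPU → 8192
}
```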
2025-08-07 14:23:55 -07:00
Devon Rifkin
aa9d889522 Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
2025-08-06 19:02:23 -07:00
Devon Rifkin
735c41f9ca openai: always provide reasoning
We were not passing thinking along when content was nil (as opposed
to an empty string)

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and empty string are distinct
2025-08-06 18:54:20 -07:00
Devon Rifkin
223a619468 Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00
Devon Rifkin
759dd78dd6 openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name

Issue: https://github.com/ollama/ollama/issues/11704
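The lookup described above can be sketched with minimal message shapes; the real types in Ollama's OpenAI compat layer differ in detail:

```go
package main

import "fmt"

// Minimal shapes for the conversion sketch.
type ToolCall struct {
	ID   string
	Name string
}

type Message struct {
	Role       string
	Name       string // legacy field from the OpenAI completions API
	ToolCallID string
	ToolCalls  []ToolCall
}

// toolName resolves the tool name for a role=tool message: use the
// legacy Name field if set, otherwise scan earlier messages for the
// assistant tool call whose ID matches and take its name.
func toolName(msg Message, history []Message) string {
	if msg.Name != "" {
		return msg.Name
	}
	for _, prev := range history {
		for _, tc := range prev.ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}

func main() {
	history := []Message{{
		Role:      "assistant",
		ToolCalls: []ToolCall{{ID: "call_1", Name: "get_weather"}},
	}}
	msg := Message{Role: "tool", ToolCallID: "call_1"}
	fmt.Println(toolName(msg, history)) // "get_weather"
}
```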
2025-08-06 17:00:24 -07:00
Patrick Devine
44bc36d063 docs: update the faq (#11760) 2025-08-06 16:55:57 -07:00
Devon Rifkin
8f14e1f5f6 Merge pull request #11759 from ollama/drifkin/oai-tool-calling
openai: allow for content _and_ tool calls in the same message
2025-08-06 16:11:31 -07:00
Devon Rifkin
203c137810 openai: allow for content _and_ tool calls in the same message
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together

Fixes: https://github.com/ollama/ollama/issues/11704
2025-08-06 15:50:30 -07:00
Daniel Hiltgen
fa8be9e35c clean up debugging (#11756) 2025-08-06 13:31:22 -07:00
Gao feng
8a75e9ee15 Update downloading to pulling in api.md (#11170)
Update api.md to make it consistent with the code.
https://github.com/ollama/ollama/blob/main/server/download.go#L447
2025-08-06 11:33:09 -07:00
Parth Sareen
4742e12c23 docs: update turbo model name (#11707) 2025-08-05 17:29:08 -07:00
Devon Rifkin
2d06977ade Merge pull request #11705 from ollama/drifkin/fn-schema
tools: support anyOf types
2025-08-05 17:02:42 -07:00
Devon Rifkin
30f8a68c4c tools: support anyOf types
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
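The recursive conversion described above can be sketched over a tiny slice of JSON Schema; this mirrors the idea of `toTypeScriptType()`, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Schema covers just enough of JSON Schema for the sketch: either a
// plain `type` or a recursive `anyOf` union.
type Schema struct {
	Type  string
	AnyOf []Schema
}

// toTypeScriptType converts a schema to a TypeScript type string:
// anyOf becomes a union, plain types map through directly. anyOf is
// fully recursive, so the function recurses into each branch.
func toTypeScriptType(s Schema) string {
	if len(s.AnyOf) > 0 {
		parts := make([]string, len(s.AnyOf))
		for i, sub := range s.AnyOf {
			parts[i] = toTypeScriptType(sub)
		}
		return strings.Join(parts, " | ")
	}
	switch s.Type {
	case "integer", "number":
		return "number"
	case "string":
		return "string"
	case "boolean":
		return "boolean"
	default:
		return "any"
	}
}

func main() {
	s := Schema{AnyOf: []Schema{{Type: "string"}, {Type: "integer"}}}
	fmt.Println(toTypeScriptType(s)) // "string | number"
}
```

Doing the recursion in Go keeps the model templates flat: they render the precomputed string instead of walking the schema themselves.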
2025-08-05 16:46:24 -07:00
Daniel Hiltgen
e378e33421 win: static link msvc libs (#11612)
This should help reduce the runtime dependencies on windows.
2025-08-05 16:10:42 -07:00
Michael Yang
fcec04bf42 gptoss: fix memory calc (#11700) 2025-08-05 15:56:12 -07:00
Jeffrey Morgan
ee92ca3e1d docs: add docs for Ollama Turbo (#11687) 2025-08-05 13:09:10 -07:00