Commit Graph

4509 Commits

Shang Chieh Tseng
46f1038724 Fix Claude validation response format parsing
The Claude AI validator was receiving detailed explanations with markdown
formatting (e.g., '**PASS**') instead of the expected simple format.

Updated the validation prompt to explicitly require responses to start
with either 'PASS' or 'FAIL: <reason>' without any additional formatting,
explanations, or markdown before the verdict.

This fixes the 'Warning: Unexpected Claude response format' error that
was causing valid test results to be incorrectly marked as unclear.
2025-10-30 12:34:02 +08:00
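The stricter format makes the verdict trivially machine-parseable. A minimal sketch of that parsing, assuming a hypothetical helper name (parseVerdict is illustrative, not taken from validate.go):

```go
package main

import (
	"fmt"
	"strings"
)

// parseVerdict illustrates the strict format the updated prompt requires:
// the response must begin with "PASS" or "FAIL: <reason>", nothing else.
func parseVerdict(resp string) (passed bool, reason string, err error) {
	s := strings.TrimSpace(resp)
	switch {
	case strings.HasPrefix(s, "PASS"):
		return true, "", nil
	case strings.HasPrefix(s, "FAIL:"):
		return false, strings.TrimSpace(strings.TrimPrefix(s, "FAIL:")), nil
	default:
		// Anything else (markdown, explanations before the verdict) is unclear.
		return false, "", fmt.Errorf("unexpected Claude response format: %q", s)
	}
}

func main() {
	ok, reason, err := parseVerdict("FAIL: response is gibberish")
	fmt.Println(ok, reason, err)
}
```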
Shang Chieh Tseng
c8b7015a2c Move test-runner temp directory into project
- Change temp directory from /tmp/test-runner-claude to .test-runner-temp
- Keeps temporary files within project bounds for Claude Code access
- Add .test-runner-temp to .gitignore to exclude from version control
- Fixes Claude AI validation permission issue
2025-10-30 12:25:25 +08:00
Shang Chieh Tseng
9b487aa5f5 Rename validateConfig function to validateConfigFile to avoid conflict
- Function in main.go renamed from validateConfig to validateConfigFile
- Resolves redeclaration error with validateConfig in config.go
- config.go has validateConfig(*Config) for internal validation
- main.go has validateConfigFile(string) for CLI command
2025-10-30 12:16:55 +08:00
Shang Chieh Tseng
a7b3f6eda5 Fix test-runner variable name conflict
- Rename validateConfig flag variable to validateConfigPath
- Resolves compilation error: validateConfig was both a *string variable and function name
- Function call now uses correct variable name
2025-10-30 12:15:12 +08:00
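This commit and the previous one resolve the same kind of collision: a flag variable and a function sharing the identifier validateConfig. A small sketch of the disambiguated naming; the identifiers mirror the commit messages, but the bodies are placeholders, not the real main.go:

```go
package main

import (
	"flag"
	"fmt"
)

// validateConfigFile is the CLI-facing check, renamed so it no longer
// collides with the internal validateConfig(*Config) in config.go.
func validateConfigFile(path string) error {
	if path == "" {
		return fmt.Errorf("no config file given")
	}
	// ... load and validate the YAML file (placeholder) ...
	return nil
}

func main() {
	// The flag variable is likewise renamed so it does not shadow a function name.
	validateConfigPath := flag.String("validate-config", "", "path of config file to validate")
	flag.Parse()

	if *validateConfigPath != "" {
		if err := validateConfigFile(*validateConfigPath); err != nil {
			fmt.Println("config invalid:", err)
		}
	}
}
```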
Shang Chieh Tseng
5895b414f4 Fix cross-workflow artifact download using dawidd6/action-download-artifact
- Replace actions/download-artifact@v4 with dawidd6/action-download-artifact@v6
- The default download-artifact action only works within same workflow run
- Third-party action enables downloading artifacts from different workflow
- Both test workflows now download from latest successful tesla-k80-ci.yml run
2025-10-30 12:12:59 +08:00
Shang Chieh Tseng
a171c8a087 Fix test workflows to use build artifacts instead of local binary
- Build workflow now uploads ollama binary as artifact with 7-day retention
- Test workflows download artifact instead of expecting local binary
- Eliminates 'ollama binary not found' error when running tests
- Enables build-once, test-multiple-times workflow pattern
- Added binary verification step to confirm artifact download
2025-10-30 12:07:28 +08:00
Shang Chieh Tseng
6c3876a30d Add multi-GPU test workflow and rename single-GPU workflow
- Rename tesla-k80-tests.yml to tesla-k80-single-gpu-tests.yml for clarity
- Add new tesla-k80-multi-gpu-tests.yml workflow for large models
- Add multi-gpu profile to test/config/models.yaml with gemma3:27b and gpt-oss:20b
- Multi-GPU workflow includes GPU count verification and weekly schedule
- Profile-specific validation allows multi-GPU splits for large models
- Separate workflows optimize CI efficiency: quick tests vs. thorough tests
2025-10-30 12:04:50 +08:00
Shang Chieh Tseng
1aa80e9411 Simplify test profiles to focus on Tesla K80 capabilities
Changes to test/config/models.yaml:

Quick profile:
- Use gemma3:4b (was gemma2:2b)
- Single prompt: 'Hello, respond with a brief greeting.'
- Timeout: 60s
- Purpose: Fast smoke test (~5 min)

Full profile:
- REMOVED: gemma2:2b, gemma3:4b (redundant with quick test)
- ONLY gemma3:12b (largest model for single K80)
- Single prompt: 'Hello, respond with a brief greeting.' (same as quick)
- Timeout: 120s (sufficient - loads in ~24s)
- Purpose: Validate Phase 2 memory optimization for large models

Rationale:
- Quick test validates basic functionality with gemma3:4b
- Full test validates single-GPU capability with gemma3:12b
- No need to test multiple sizes if both work
- Consistent prompts make comparison easier
- Tests the critical optimization: 12B model on single K80
2025-10-30 11:57:30 +08:00
Shang Chieh Tseng
4de7dd453b Add Claude AI-powered response validation and update test model
Changes:
1. Update quick test to use gemma3:4b (was gemma2:2b)
   - Increased timeout to 60s for larger model

2. Implement Claude headless validation (validate.go)
   - Hybrid approach: simple checks first, then Claude validation ALWAYS runs
   - Claude validates response quality, coherence, relevance
   - Detects gibberish, errors, and malformed responses
   - Falls back to simple validation if Claude CLI unavailable
   - Verbose logging shows Claude validation results

3. Validation flow:
   - Step 1: Fast checks (empty response, token count)
   - Step 2: Claude AI analysis (runs regardless of simple check)
   - Claude result overrides simple checks
   - If Claude unavailable, uses simple validation only

4. Workflow improvements:
   - Remove useless GPU memory check step (server already stopped)
   - Cleaner workflow output

Benefits:
- Intelligent response quality validation
- Catches subtle issues (gibberish, off-topic responses)
- Better than hardcoded pattern matching
- Graceful degradation when Claude unavailable
2025-10-30 11:42:10 +08:00
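A sketch of the hybrid flow described above: cheap checks run first, the Claude verdict overrides them, and simple validation is the fallback when the CLI is missing. The helper runClaudeValidation and the token-count heuristic are assumptions for illustration, not the actual validate.go code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// validateResponse sketches the hybrid flow: fast checks first, then a Claude
// verdict that overrides them, with graceful fallback when the CLI is missing.
func validateResponse(response string, minTokens int) (bool, string) {
	// Step 1: fast checks (empty response, rough token count).
	simplePass := strings.TrimSpace(response) != "" &&
		len(strings.Fields(response)) >= minTokens

	// Step 2: Claude analysis always runs if the CLI is available.
	if _, err := exec.LookPath("claude"); err != nil {
		return simplePass, "simple validation only (claude CLI unavailable)"
	}
	verdict, err := runClaudeValidation(response) // hypothetical helper
	if err != nil {
		return simplePass, "simple validation only (claude call failed)"
	}
	// Claude's verdict overrides the simple checks.
	return strings.HasPrefix(verdict, "PASS"), verdict
}

// runClaudeValidation is a placeholder for the headless Claude CLI call.
func runClaudeValidation(response string) (string, error) {
	return "PASS", nil
}

func main() {
	ok, detail := validateResponse("Hello! Nice to meet you.", 3)
	fmt.Println(ok, detail)
}
```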
Shang Chieh Tseng
d59284d30a Implement Go-based test runner framework for Tesla K80 testing
Add comprehensive test orchestration framework:

Test Runner (cmd/test-runner/):
- config.go: YAML configuration loading and validation
- server.go: Ollama server lifecycle management (start/stop/health checks)
- monitor.go: Real-time log monitoring with pattern matching
- test.go: Model testing via Ollama API (pull, chat, validation)
- validate.go: Test result validation (GPU usage, response quality, log analysis)
- report.go: Structured reporting (JSON and Markdown formats)
- main.go: CLI interface with run/validate/list commands

Test Configurations (test/config/):
- models.yaml: Full test suite with quick/full/stress profiles
- quick.yaml: Fast smoke test with gemma2:2b

Updated Workflow:
- tesla-k80-tests.yml: Use test-runner instead of shell scripts
- Run quick tests first, then full tests if passing
- Generate structured JSON reports for pass/fail checking
- Upload test results as artifacts

Features:
- Multi-model testing with configurable profiles
- API-based testing (not CLI commands)
- Real-time log monitoring for GPU events and errors
- Automatic validation of GPU loading and response quality
- Structured JSON and Markdown reports
- Graceful server lifecycle management
- Interrupt handling (Ctrl+C cleanup)

Addresses limitations of shell-based testing by providing:
- Better error handling and reporting
- Programmatic test orchestration
- Reusable test framework
- Clear pass/fail criteria
- Detailed test metrics and timing
2025-10-30 11:04:48 +08:00
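As a rough illustration of the server lifecycle handling listed above (server.go), the sketch below launches `ollama serve`, polls until the API answers, and returns a stop function. The health endpoint, port, and timeouts are assumptions; the real implementation may differ:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// startServer launches the Ollama server, waits for it to report healthy,
// and returns a stop function for cleanup (e.g., on Ctrl+C).
func startServer(ctx context.Context, binary string) (stop func(), err error) {
	cmd := exec.CommandContext(ctx, binary, "serve")
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("starting server: %w", err)
	}

	// Poll the API until the server responds or the deadline passes.
	deadline := time.Now().Add(60 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := http.Get("http://127.0.0.1:11434/api/version")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return func() { cmd.Process.Kill(); cmd.Wait() }, nil
			}
		}
		time.Sleep(time.Second)
	}
	cmd.Process.Kill()
	cmd.Wait()
	return nil, fmt.Errorf("server did not become healthy in time")
}

func main() {
	stop, err := startServer(context.Background(), "./ollama")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer stop()
	fmt.Println("server is up")
}
```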
Shang Chieh Tseng
aaaf334e7f Update tesla-k80-ci.yml 2025-10-30 11:02:14 +08:00
Shang Chieh Tseng
b402b073c5 Split Tesla K80 workflows into build and test; add test framework plan
- Changed tesla-k80-ci.yml to manual trigger only, simplified to build-only workflow
- Created tesla-k80-tests.yml for separate test execution (manual trigger)
- Added .github/workflows/CLAUDE.md with comprehensive test framework design
- Removed binary artifact upload (not needed for single self-hosted runner)
- Replaced README.md with CLAUDE.md for better documentation structure

Test framework plan:
- Go-based test runner at cmd/test-runner/
- YAML configuration for multi-model testing
- Server lifecycle management with log monitoring
- API-based testing with structured reporting
- Support for test profiles (quick/full/stress)
2025-10-30 10:59:52 +08:00
Shang Chieh Tseng
7e317fdd74 Add Phase 2 summary documentation for CC 3.7 graph correction
Documents the complete Tesla K80 memory estimation optimization journey,
including per-GPU graph allocation and empirical correction factor implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 10:27:25 +08:00
Shang Chieh Tseng
296d537a2c Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction
Added Phase 2 documentation for single-GPU optimization:
- CC 3.7 graph correction factor (85% of estimate)
- gemma3:12b now loads on single GPU
- Memory estimate improved from 11.9 GiB → 11.0 GiB
- Validated with 10.0 GiB actual usage, 94% GPU utilization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:16:38 +08:00
Shang Chieh Tseng
6d87524e22 Fix gemma3:12b to load on single Tesla K80 GPU
Problem: gemma3:12b (10.2 GiB actual) was splitting across 2 GPUs
despite fitting in single Tesla K80 (11.2 GiB available).

Root Cause: Graph memory estimates for CC 3.7 were 15-20% too high
(estimated 1.3 GiB, actual 1.1 GiB), causing single-GPU fit check
to fail by ~200 MiB margin.

Solution: Apply empirical 85% correction factor to graph estimates
for Tesla K80 (CC 3.7) based on measured actual usage.

Results:
- Memory estimate: 11.9 GiB → 11.0 GiB (-900 MiB)
- GPU split: 1,48 layers → single GPU (no split)
- GPU 0: 10,015 MiB (was 617 MiB)
- GPU 1: 7 MiB (was 9,866 MiB)
- Inference: 94% GPU utilization, no cross-GPU overhead

Testing: gemma3:12b loads on single GPU with correct inference

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:15:59 +08:00
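The fix amounts to scaling the graph estimate before the single-GPU fit check. A hedged sketch of that arithmetic in Go; the 85% factor and CC 3.7 gating come from the commit message, but the surrounding code is illustrative, not the actual llm/memory.go:

```go
package main

import "fmt"

// correctedGraphEstimate illustrates the empirical correction described above:
// on Tesla K80 (CC 3.7) graph memory estimates ran ~15-20% high, so the
// estimate is scaled to 85% before checking whether the model fits one GPU.
func correctedGraphEstimate(graphBytes uint64, computeMajor, computeMinor int) uint64 {
	if computeMajor == 3 && computeMinor == 7 {
		return graphBytes * 85 / 100
	}
	return graphBytes
}

func main() {
	const gib = 1 << 30
	est := uint64(13) * gib / 10 // ~1.3 GiB graph estimate from the commit message
	fmt.Printf("corrected: %.2f GiB\n", float64(correctedGraphEstimate(est, 3, 7))/float64(gib))
}
```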
Shang Chieh Tseng
d04ea50ced Fix gpt-oss model architecture to match GGUF tensor format
The gpt-oss model architecture code expected fused tensors (attn_qkv,
ffn_gate_up_exps) but the actual GGUF files contain separate tensors
(attn_q/k/v, ffn_gate_exps/up_exps), causing nil pointer panics during
model loading.

Changes:
- model/models/gptoss/model.go: Updated AttentionBlock to use separate
  Query/Key/Value fields instead of fused QKV, modified Forward() to
  compute projections separately
- model/models/gptoss/model.go: Updated MLPBlock to use separate Gate/Up
  fields instead of fused GateUp, simplified Forward() logic
- fs/ggml/type.go: Reorganized MXFP4 tensor type constant ordering
- ml/backend/ggml/ggml/include/ggml.h: Moved GGML_TYPE_MXFP4 to end of
  enum to match GGUF file format specification
- ml/backend/ggml/ggml/src/ggml.c: Updated type name array to match
  reordered enum
- CLAUDE.md: Documented gpt-oss model compatibility fix

Result: gpt-oss:20b model now loads and runs successfully on Tesla K80,
all 25 layers offload to GPU correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 23:34:03 +08:00
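A simplified, self-contained sketch of the structural change described above: separate Query/Key/Value projections that match the attn_q/k/v tensors actually present in the GGUF files, instead of a fused QKV. The Tensor and Linear types here are stand-ins, not Ollama's ml/nn packages:

```go
package main

import "fmt"

// Minimal stand-in types; the real code uses Ollama's ml and nn packages.
type Tensor struct{ name string }

type Linear struct{ name string }

func (l *Linear) Forward(x Tensor) Tensor { return Tensor{name: l.name + "(" + x.name + ")"} }

// AttentionBlock after the fix: separate projections that correspond to the
// attn_q / attn_k / attn_v tensors the GGUF files actually contain, rather
// than a single fused attn_qkv tensor that does not exist in the files.
type AttentionBlock struct {
	Query *Linear // gguf: attn_q
	Key   *Linear // gguf: attn_k
	Value *Linear // gguf: attn_v
}

func (b *AttentionBlock) Forward(hidden Tensor) (q, k, v Tensor) {
	// Projections are computed separately instead of slicing a fused QKV tensor.
	return b.Query.Forward(hidden), b.Key.Forward(hidden), b.Value.Forward(hidden)
}

func main() {
	blk := &AttentionBlock{Query: &Linear{"q"}, Key: &Linear{"k"}, Value: &Linear{"v"}}
	q, k, v := blk.Forward(Tensor{name: "h"})
	fmt.Println(q.name, k.name, v.name)
}
```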
Shang Chieh Tseng
241a03402e Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
- gemma3:4b: Single GPU (no split, was already working)
- gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 19:58:20 +08:00
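A toy sketch of the reverse-order distribution described in point 2: iterate GPUs from the last index down instead of round-robin, so layers spill onto earlier GPUs only when the primary is full. Sizes and the accounting are illustrative; the real logic lives in llm/memory.go:

```go
package main

import "fmt"

// distributeLayers sketches the reverse-order strategy: place each layer on
// the last GPU if it fits, and only use earlier GPUs when the primary is full.
func distributeLayers(layerBytes []uint64, gpuFree []uint64) []int {
	assignment := make([]int, len(layerBytes))
	free := append([]uint64(nil), gpuFree...)
	for i, sz := range layerBytes {
		assignment[i] = -1 // -1 means CPU fallback
		for j := len(free); j > 0; j-- { // last GPU (j-1) first, not i%j round-robin
			if free[j-1] >= sz {
				free[j-1] -= sz
				assignment[i] = j - 1
				break
			}
		}
	}
	return assignment
}

func main() {
	const mib = 1 << 20
	layers := make([]uint64, 49)
	for i := range layers {
		layers[i] = 210 * mib // made-up per-layer size
	}
	gpus := []uint64{10 * 1024 * mib, 10 * 1024 * mib}
	fmt.Println(distributeLayers(layers, gpus)) // mostly GPU 1, spilling to GPU 0 only when full
}
```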
Shang Chieh Tseng
5077ab3fb4 Document Phase 9 completion: Fix CUDA backend loading for CC 3.7
Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection

Testing Results:
- CUDA backend loads without undefined symbol errors
- GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
- Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores).
If attempted, it aborts gracefully with a clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 17:44:36 +08:00
Shang Chieh Tseng
66fca1b685 Remove remaining MMA/WMMA template instances for CC 3.7 optimization
Delete 24 tensor core template instance files that were missed in the initial optimization:
- 19 fattn-mma-f16 template instances (various ncols1/ncols2 combinations)
- 5 fattn-wmma-f16 template instances (kqfloat and kqhalf variants)

These files implement tensor core operations (MMA/WMMA) which require Compute Capability 7.0+
and are not available on Tesla K80 (CC 3.7). Removing them completes the CC 3.7-only optimization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:24:08 +08:00
Shang Chieh Tseng
771044bead Complete CC 3.7-only CUDA optimization for Tesla K80 support
Simplify CUDA backend to exclusively support Compute Capability 3.7 (Kepler/Tesla K80).
This optimization removes ~2,700 lines of modern GPU code and resolves all compilation issues.

Changes:
- Remove tensor core files (mma.cuh, fattn-wmma-f16.*, fattn-mma-f16.cuh) and 92 template instances
- Hardcode architecture detection to always return CC 3.7 (370) in common.cuh
- Disable modern GPU features: FP16 native ops, MMA/WMMA, CP_ASYNC, BF16, CUDA graphs
- Disable 6 MMA functions in mmq.cuh while preserving DP4A functions for CC 3.7
- Replace undefined architecture constants (PASCAL/VOLTA/DP4A/ADA_LOVELACE) with CC 3.7 equivalents
- Set CMAKE_CUDA_ARCHITECTURES to "37" only in CMakeLists.txt and CMakePresets.json
- Hardcode Stream-K scheduling to false, precision to FP32 throughout
- Add comprehensive CLAUDE.md documentation with complete optimization history

Build configuration now compiles only for architecture 37, resulting in 80-85% smaller
binaries and 5-6x faster build times. All removed code paths were unreachable on CC 3.7
hardware, ensuring no performance degradation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:21:08 +08:00
Shang Chieh Tseng
135b799b13 Update command. 2025-10-29 14:21:03 +08:00
Shang Chieh Tseng
6024408ea5 Update command. 2025-10-28 18:42:49 +08:00
Shang Chieh Tseng
92acf0f91e Add GitHub Actions workflow for Tesla K80 CI/CD
- Tesla K80 build and test workflow with self-hosted runner
- Build using GCC 10 and CUDA 11.4 for Compute Capability 3.7
- Run unit tests, integration tests, and model inference tests
- Test gemma2:2b model loading and GPU acceleration
- Use Claude headless mode to analyze server logs and verify proper GPU initialization
- Upload logs, analysis results, and binary artifacts
- Comprehensive documentation in workflows README
2025-10-28 18:09:49 +08:00
Shang Chieh Tseng
fe0fd5b494 Update manual-build.md 2025-10-28 17:20:03 +08:00
Shang Chieh Tseng
e6e91af024 Separate NVIDIA driver and CUDA toolkit installation steps
- Split Step 3 into two distinct steps:
  - Step 3: NVIDIA Driver 470 installation via .run file
  - Step 4: CUDA 11.4 Toolkit installation via local installer
- Add libglvnd-devel dependency requirement
- Add text mode (init 3) requirement for driver installation
- Specify exact driver version (470.256.02) and download URL
- Specify exact CUDA installer (11.4.0 with 470.42.01 driver)
- Add note to deselect driver during CUDA installation
- Separate environment configuration:
  - PATH in /etc/profile.d/cuda-11.4.sh
  - Dynamic linker in /etc/ld.so.conf.d/cuda-11-4.conf
- Update all subsequent step numbers (5-7)
- Update all cross-references throughout document
2025-10-28 16:55:38 +08:00
Shang Chieh Tseng
35c4d078f7 Fix step reference in troubleshooting: GCC 10 is Step 1, not Step 5 2025-10-28 15:56:49 +08:00
Shang Chieh Tseng
417b451af1 Add system compiler symlink updates to use GCC 10 by default 2025-10-28 15:53:49 +08:00
Shang Chieh Tseng
c788de5f8b Fix GCC 10 dynamic linker config to include both /usr/lib64 and /usr/local/lib64 2025-10-28 15:51:41 +08:00
Shang Chieh Tseng
e549dcb710 Reorganize installation steps: Move GCC 10 to Step 1 before kernel compilation 2025-10-28 15:35:10 +08:00
Shang Chieh Tseng
29706d14d7 Consolidate GCC 10 installation steps into single script format 2025-10-28 15:28:11 +08:00
Shang Chieh Tseng
85d98064d1 Fix kernel config copy path to use /usr/src/kernels for Rocky Linux 9 2025-10-28 15:27:40 +08:00
Shang Chieh Tseng
8dc4ca7ccc Reorganize Docker build infrastructure for better maintainability
- Restructure from ollama37/ to docker/ with clear separation
- Separate builder and runtime images into dedicated directories
- Group environment scripts in builder/scripts/ subdirectory
- Add comprehensive root-level README.md (257 lines)
- Add .dockerignore files for optimized build contexts
- Enhance shell scripts with shebangs and documentation headers
- Update docker-compose.yml to build locally instead of pulling
- Add environment variables for GPU and host configuration
- Remove duplicate Dockerfile and confusing nested structure

New structure:
  docker/
  ├── README.md (comprehensive documentation)
  ├── docker-compose.yml (local build support)
  ├── builder/ (build environment: CUDA 11.4 + GCC 10 + Go 1.24)
  │   ├── Dockerfile
  │   ├── README.md
  │   ├── .dockerignore
  │   └── scripts/ (organized environment setup)
  └── runtime/ (production image)
      ├── Dockerfile
      ├── README.md
      └── .dockerignore

This reorganization eliminates confusion, removes duplication, and
provides a professional, maintainable structure for Tesla K80 builds.
2025-10-28 14:47:39 +08:00
Shang Chieh Tseng
736cbdf52a Remove unused file. 2025-10-22 22:35:41 +08:00
Shang Chieh Tseng
29cb9d3a27 Remove GitHub Actions workflows from fork
Removed all GitHub Actions workflows (.github/workflows/) as they're not needed
for this Tesla K80 support fork. The workflows were designed for the official
Ollama repository's CI/CD pipeline and would fail in a fork since they:
- Attempt to push to Ollama's Docker Hub
- Run automated tests on PRs (not needed for personal fork)
- Handle official release process

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 19:22:12 +08:00
Shang Chieh Tseng
c61e0ce554 Update README.md for v1.4.0: GPT-OSS support and Tesla K80 memory improvements
- Added GPT-OSS model to supported models list with multi-GPU optimization notes
- Documented Tesla K80 Multi-GPU usage example with nvidia-smi monitoring
- Added comprehensive Tesla K80 Memory Improvements section covering:
  * VMM pool crash fixes with granularity alignment
  * Multi-GPU model switching scheduler improvements
  * Silent inference failure resolution
- Updated recent updates section for v1.4.0 release
- Enhanced technical details with multi-GPU optimization specs

These improvements enable robust production use of Tesla K80 hardware
for LLM inference with seamless model switching capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
v1.4.0
2025-08-10 01:42:38 +08:00
Shang Chieh Tseng
08f38b19ea Fix Tesla K80 multi-GPU model switching deadlocks and silent failures
Resolves two critical issues preventing robust model switching:

1. Scheduler deadlock: Fixed improper loop control flow that prevented
   model unloading from triggering after conflict detection. Added proper
   multi-GPU conflict detection and unload sequencing.

2. Silent inference failures: Changed critical cudaSetDevice() calls from
   graceful error handling back to CUDA_CHECK to prevent models from
   appearing to load successfully but failing silently during inference.

Result: Robust Tesla K80 dual-GPU model switching with self-healing
recovery instead of requiring system reboots.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 01:30:10 +08:00
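A rough, hypothetical sketch of the loop-control fix in point 1: after a conflict is detected, the loop must actually trigger and wait for the unload before retrying, instead of continuing past it and deadlocking. None of these names come from the real scheduler; this only shows the control-flow shape:

```go
package main

import (
	"fmt"
	"time"
)

type runner struct{ model string }

// scheduleLoad sketches the corrected sequencing: conflict detected ->
// unload the conflicting runner -> wait -> retry the fit check.
func scheduleLoad(model string, loaded *runner, gpusFree func() bool, unload func(*runner)) {
	for attempt := 0; attempt < 3; attempt++ {
		if gpusFree() {
			fmt.Println("loading", model)
			return
		}
		if loaded != nil {
			// Conflict detected: sequence the unload, then retry.
			unload(loaded)
			loaded = nil
			time.Sleep(10 * time.Millisecond) // placeholder for waiting on the unload
			continue
		}
	}
	fmt.Println("could not load", model)
}

func main() {
	old := &runner{model: "gemma3:12b"}
	free := false
	scheduleLoad("gpt-oss:20b", old,
		func() bool { return free },
		func(r *runner) { fmt.Println("unloading", r.model); free = true },
	)
}
```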
Shang Chieh Tseng
46213c5880 Fix Tesla K80 VMM pool crash by aligning to granularity
- Fix CUDA_ERROR_INVALID_VALUE from cuMemAddressReserve by aligning max_pool_size to GPU granularity
- Set dynamic max_pool_size based on 90% of actual GPU memory instead of static 32GB
- Add memory availability check before allocation to prevent OOM
- Tested on Tesla K80 dual GPU setup with successful model loading and chat completions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 17:48:31 +08:00
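The alignment fix comes down to making the pool size a multiple of the allocation granularity reported by the driver, since cuMemAddressReserve rejects unaligned sizes. A small sketch of the arithmetic in Go (the actual fix is in the CUDA VMM pool code; the granularity value and whether the pool rounds up or down are assumptions):

```go
package main

import "fmt"

// alignDown and alignUp show the alignment arithmetic behind the fix: the
// reserved size must be a multiple of the GPU's allocation granularity.
func alignDown(size, granularity uint64) uint64 {
	return size - size%granularity
}

func alignUp(size, granularity uint64) uint64 {
	return (size + granularity - 1) / granularity * granularity
}

func main() {
	const mib = 1 << 20
	granularity := uint64(2 * mib)      // assumed VMM granularity
	maxPool := uint64(11441) * mib * 90 / 100 // 90% of a K80's ~11.2 GiB, illustrative
	fmt.Println(alignDown(maxPool, granularity)/mib, "MiB")
	fmt.Println(alignUp(maxPool, granularity)/mib, "MiB")
}
```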
Shang Chieh Tseng
e4113f080a Merge upstream ollama/ollama with Tesla K80 support preserved
Successfully synced with upstream ollama/ollama main branch while maintaining:
- CUDA Compute Capability 3.7 support for Tesla K80 GPUs
- CUDA 11 build configuration with architecture 37
- BF16 compatibility fallback for older GPUs

New features from upstream:
- gpt-oss model support (tested working on Tesla K80)
- Various performance improvements and bug fixes
- Updated model architectures and optimizations

All Tesla K80 optimizations and documentation preserved.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:17:24 +08:00
Shang Chieh Tseng
2be9575694 Fix BF16 compatibility for Tesla K80 (Compute Capability 3.7)
Add runtime check for BF16 support which requires Compute Capability 8.0+.
Tesla K80 and other CC 3.7 GPUs will fallback to FP16/FP32 operations.
This ensures the upstream BF16 optimizations work on newer GPUs while
maintaining compatibility with legacy hardware.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 15:15:49 +08:00
Shang Chieh Tseng
83973336d6 Optimize Docker build performance with parallel compilation
- Add -j$(nproc) flag to cmake build in ollama37.Dockerfile
- Use all available CPU cores for faster compilation
- Add sync-upstream.md documentation for future maintenance

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-08 11:44:59 +08:00
Shang Chieh Tseng
0cd81c838a Merge upstream ollama/ollama main branch while preserving CUDA 3.7 support
- Added support for new gpt-oss model from upstream
- Preserved CUDA Compute Capability 3.7 (Tesla K80) support
- Kept CUDA 11 configuration alongside CUDA 12
- Maintained all documentation specific to ollama37 fork
- Integrated new tool parsing improvements
- Added new backend methods and patches from upstream
2025-08-08 10:43:29 +08:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Jesse Gross
f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with a limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
2025-08-07 14:23:55 -07:00
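A sketch of the trade-off described above: prefer an 8k context for gpt-oss, but fall back to Ollama's 4k default on low-VRAM GPUs. The threshold is an assumption for illustration, not the value used by the server:

```go
package main

import "fmt"

// chooseContextLength sketches the fallback: gpt-oss prefers 8k context, but
// GPUs with little VRAM take a large performance hit, so 4k is used instead.
func chooseContextLength(freeVRAMBytes uint64) int {
	const lowVRAMThreshold = 20 << 30 // assumed cutoff, not the real value
	if freeVRAMBytes < lowVRAMThreshold {
		return 4096
	}
	return 8192
}

func main() {
	fmt.Println(chooseContextLength(11 << 30)) // e.g. a single Tesla K80 -> 4096
	fmt.Println(chooseContextLength(24 << 30)) // larger GPU -> 8192
}
```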
Devon Rifkin
aa9d889522 Merge pull request #11765 from ollama/drifkin/thinking-without-content
openai: always provide reasoning
2025-08-06 19:02:23 -07:00
Devon Rifkin
735c41f9ca openai: always provide reasoning
We were failing to pass along thinking when content was nil (as opposed
to an empty string)

Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and empty string are distinct
2025-08-06 18:54:20 -07:00
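The nil-versus-empty distinction is easiest to see with a pointer-typed field. A hypothetical stand-in for the response type (field names assumed, not Ollama's actual openai package), showing that reasoning is forwarded in both cases:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message is a hypothetical stand-in: with Content typed as a pointer,
// "not provided" (nil) and "" are distinct states, and Reasoning must be
// carried across in both.
type message struct {
	Content   *string `json:"content,omitempty"`
	Reasoning string  `json:"reasoning,omitempty"`
}

func toOpenAI(content *string, thinking string) message {
	// The fix: always pass thinking along, even when content is nil.
	return message{Content: content, Reasoning: thinking}
}

func main() {
	empty := ""
	a, _ := json.Marshal(toOpenAI(nil, "step-by-step plan"))
	b, _ := json.Marshal(toOpenAI(&empty, "step-by-step plan"))
	fmt.Println(string(a)) // content omitted, reasoning kept
	fmt.Println(string(b)) // content present as "", reasoning kept
}
```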
Devon Rifkin
223a619468 Merge pull request #11761 from ollama/drifkin/openai-tool-names
openai: when converting role=tool messages, propagate the tool name
2025-08-06 17:53:25 -07:00
Devon Rifkin
759dd78dd6 openai: when converting role=tool messages, propagate the tool name
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name

Issue: https://github.com/ollama/ollama/issues/11704
2025-08-06 17:00:24 -07:00
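A sketch of the lookup described above, with hypothetical message shapes: use the legacy name field when present, otherwise scan earlier messages for a tool call with a matching ID and reuse its function name. The preference order is an assumption:

```go
package main

import "fmt"

// Hypothetical stand-ins for the OpenAI-compat message shapes.
type toolCall struct {
	ID   string
	Name string
}

type chatMessage struct {
	Role       string
	Name       string // legacy completions-style field
	ToolCallID string
	ToolCalls  []toolCall
}

// resolveToolName sketches the propagation: prefer the legacy name field,
// otherwise find the earlier tool call whose ID matches and take its name.
func resolveToolName(msg chatMessage, history []chatMessage) string {
	if msg.Name != "" {
		return msg.Name
	}
	for _, prev := range history {
		for _, tc := range prev.ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Name
			}
		}
	}
	return ""
}

func main() {
	history := []chatMessage{{Role: "assistant", ToolCalls: []toolCall{{ID: "call_1", Name: "get_weather"}}}}
	toolMsg := chatMessage{Role: "tool", ToolCallID: "call_1"}
	fmt.Println(resolveToolName(toolMsg, history)) // get_weather
}
```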
Patrick Devine
44bc36d063 docs: update the faq (#11760) 2025-08-06 16:55:57 -07:00
Devon Rifkin
8f14e1f5f6 Merge pull request #11759 from ollama/drifkin/oai-tool-calling
openai: allow for content _and_ tool calls in the same message
2025-08-06 16:11:31 -07:00
Devon Rifkin
203c137810 openai: allow for content _and_ tool calls in the same message
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together

Fixes: https://github.com/ollama/ollama/issues/11704
2025-08-06 15:50:30 -07:00