Removed all GitHub Actions workflows (.github/workflows/) as they're not needed
for this Tesla K80 support fork. The workflows were designed for the official
Ollama repository's CI/CD pipeline and would fail in a fork since they:
- Attempt to push to Ollama's Docker Hub
- Run automated tests on PRs (not needed for a personal fork)
- Handle official release process
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Added GPT-OSS model to supported models list with multi-GPU optimization notes
- Documented Tesla K80 Multi-GPU usage example with nvidia-smi monitoring
- Added comprehensive Tesla K80 Memory Improvements section covering:
* VMM pool crash fixes with granularity alignment
* Multi-GPU model switching scheduler improvements
* Silent inference failure resolution
- Updated recent updates section for v1.4.0 release
- Enhanced technical details with multi-GPU optimization specs
These improvements enable robust production use of Tesla K80 hardware
for LLM inference with seamless model switching capabilities.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Resolves two critical issues preventing robust model switching:
1. Scheduler deadlock: Fixed improper loop control flow that prevented
model unloading from triggering after conflict detection. Added proper
multi-GPU conflict detection and unload sequencing.
2. Silent inference failures: Changed critical cudaSetDevice() calls from
graceful error handling back to CUDA_CHECK to prevent models from
appearing to load successfully but failing silently during inference.
Result: Robust Tesla K80 dual-GPU model switching with self-healing
recovery instead of requiring system reboots.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fix CUDA_ERROR_INVALID_VALUE from cuMemAddressReserve by aligning max_pool_size to GPU granularity (see the sizing sketch below)
- Set dynamic max_pool_size based on 90% of actual GPU memory instead of static 32GB
- Add memory availability check before allocation to prevent OOM
- Tested on Tesla K80 dual GPU setup with successful model loading and chat completions
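A minimal sketch of the sizing logic from the first two bullets, written as plain Go arithmetic. In the real backend the granularity and total-memory values come from the CUDA driver (cuMemGetAllocationGranularity and the device queries); the function and variable names here are illustrative.

```go
// maxPoolSize derives the VMM pool limit from the memory actually present
// on the device (90% of it, rather than a static 32GB) and rounds the
// result down to a multiple of the GPU's allocation granularity so that
// cuMemAddressReserve does not reject it with CUDA_ERROR_INVALID_VALUE.
func maxPoolSize(totalGPUMem, granularity uint64) uint64 {
	pool := totalGPUMem / 10 * 9            // 90% of real device memory
	return pool / granularity * granularity // align to the GPU granularity
}
```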
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Successfully synced with upstream ollama/ollama main branch while maintaining:
- CUDA Compute Capability 3.7 support for Tesla K80 GPUs
- CUDA 11 build configuration with architecture 37
- BF16 compatibility fallback for older GPUs
New features from upstream:
- gpt-oss model support (tested working on Tesla K80)
- Various performance improvements and bug fixes
- Updated model architectures and optimizations
All Tesla K80 optimizations and documentation preserved.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add runtime check for BF16 support, which requires Compute Capability 8.0+.
Tesla K80 and other CC 3.7 GPUs will fall back to FP16/FP32 operations.
This ensures the upstream BF16 optimizations work on newer GPUs while
maintaining compatibility with legacy hardware.
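Roughly, the runtime gate looks like the sketch below; the compute capability value comes from the existing GPU discovery, and the helper name is illustrative rather than the exact symbol added in the backend.

```go
// supportsBF16 reports whether a CUDA device can take the BF16 code paths.
// BF16 tensor operations require compute capability 8.0 (Ampere) or newer;
// older parts such as the Tesla K80 (CC 3.7) use the FP16/FP32 fallback.
func supportsBF16(computeMajor int) bool {
	return computeMajor >= 8
}
```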
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add -j$(nproc) flag to cmake build in ollama37.Dockerfile
- Use all available CPU cores for faster compilation
- Add sync-upstream.md documentation for future maintenance
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Added support for new gpt-oss model from upstream
- Preserved CUDA Compute Capability 3.7 (Tesla K80) support
- Kept CUDA 11 configuration alongside CUDA 12
- Maintained all documentation specific to ollama37 fork
- Integrated new tool parsing improvements
- Added new backend methods and patches from upstream
gpt-oss works best with a context length of at least 8k. However,
for GPUs with a limited amount of VRAM, there is a significant
performance hit from this increased context. In these cases, we
switch to the Ollama default of 4k.
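As a sketch of that policy (the VRAM cutoff below is an assumption for illustration, not the exact threshold used):

```go
// defaultGPTOSSContext picks a default context length for gpt-oss: at
// least 8k where VRAM allows, otherwise the usual Ollama default of 4k.
func defaultGPTOSSContext(freeVRAM uint64) int {
	const lowVRAM = 10 << 30 // assumed "limited VRAM" cutoff, illustrative only
	if freeVRAM < lowVRAM {
		return 4096
	}
	return 8192
}
```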
We were not passing along thinking when content was nil (as opposed
to an empty string).
Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and passing an empty string are distinct.
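A simplified sketch of why nil matters here; the struct shape is illustrative, not the exact compat-layer types.

```go
// message mirrors an OpenAI-style chat message: Content is typed loosely,
// so a request that omits "content" decodes to nil, which is distinct from
// a request that sends "content": "".
type message struct {
	Role     string `json:"role"`
	Content  any    `json:"content,omitempty"`
	Thinking string `json:"thinking,omitempty"`
}

// thinkingAndContent forwards Thinking in both the nil and empty-string
// cases; the earlier code only did so when Content was non-nil.
func thinkingAndContent(m message) (thinking, content string) {
	if s, ok := m.Content.(string); ok {
		content = s
	}
	return m.Thinking, content
}
```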
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id`, we inspect previous messages
and look for a matching tool call ID and grab its name.
Issue: https://github.com/ollama/ollama/issues/11704
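Sketched in Go with simplified types (the real conversion lives in the OpenAI compat layer):

```go
type toolCall struct {
	ID       string
	Function struct{ Name string }
}

type chatMessage struct {
	Role       string
	Name       string     // legacy OpenAI completions field
	ToolCallID string     // reference back to an earlier assistant tool call
	ToolCalls  []toolCall // present on earlier assistant messages
}

// toolNameFor resolves the function name for a tool-response message:
// prefer an explicit legacy Name, otherwise walk earlier messages for the
// assistant tool call whose ID matches and reuse its function name.
func toolNameFor(msg chatMessage, history []chatMessage) string {
	if msg.Name != "" {
		return msg.Name
	}
	for _, prev := range history {
		for _, tc := range prev.ToolCalls {
			if tc.ID == msg.ToolCallID {
				return tc.Function.Name
			}
		}
	}
	return ""
}
```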
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together.
Fixes: https://github.com/ollama/ollama/issues/11704
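In sketch form, the conversion now copies both fields instead of treating them as mutually exclusive (types simplified for illustration):

```go
type apiToolCall struct{ ID, Name, Arguments string }

type apiMessage struct {
	Role      string
	Content   string
	ToolCalls []apiToolCall
}

// convertAssistant keeps content and tool calls together; the earlier code
// assumed content was never present alongside tool calls and dropped it.
func convertAssistant(content string, calls []apiToolCall) apiMessage {
	return apiMessage{
		Role:      "assistant",
		Content:   content, // optional when calls exist, but not forbidden
		ToolCalls: calls,
	}
}
```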
As far as we know, gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
assumed that types were always defined via a `type` field.
anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in Go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.
We should keep building out our function definition support to more
fully support the parts of JSON Schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., Zed's
Ollama integration with gpt-oss). Probably the most urgent gap is proper
array support.
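A sketch of that recursive conversion; the helper really is named `toTypeScriptType()` per the description above, but the schema coverage shown here is simplified.

```go
import "strings"

// toTypeScriptType renders a JSON Schema fragment as a TypeScript type
// string, recursing into anyOf so union schemas become "a | b" instead of
// assuming a single top-level "type" field.
func toTypeScriptType(schema map[string]any) string {
	if anyOf, ok := schema["anyOf"].([]any); ok {
		parts := make([]string, 0, len(anyOf))
		for _, sub := range anyOf {
			if m, ok := sub.(map[string]any); ok {
				parts = append(parts, toTypeScriptType(m))
			}
		}
		return strings.Join(parts, " | ")
	}
	t, _ := schema["type"].(string)
	switch t {
	case "string", "boolean":
		return t
	case "number", "integer":
		return "number"
	case "array":
		if items, ok := schema["items"].(map[string]any); ok {
			return toTypeScriptType(items) + "[]"
		}
		return "any[]"
	default:
		return "any"
	}
}
```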
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.
The model definition does not call flash attention, so it works
regardless of the setting but the cache will pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.
Fixes: #11671
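In effect the load path now settles flash attention first and lets the KV cache type follow from it; a simplified sketch with illustrative names:

```go
// resolveKVCacheType picks the KV cache type after the flash attention flag
// is final. Quantized KV caches depend on the flash attention kernel, so
// when flash attention is off (as for gpt-oss today) the cache falls back
// to f16 instead of silently keeping a quantized type.
func resolveKVCacheType(requested string, flashAttention bool) string {
	if !flashAttention && requested != "f16" {
		return "f16"
	}
	return requested
}
```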
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal (a block-decode sketch follows this change list).
* Unit tests for MXFP4 support
This exercises various operations and shapes on both CPU and GPU (if detected
on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macOS versions
cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.
* server: Minimum context length for gptoss
This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
* ggml: Multiply by numParallel for gptoss sliding window
When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
* gpt-oss integration
includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
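For reference, one MX block is 32 FP4 (E2M1) elements sharing a single E8M0 scale byte; a rough Go decode of that layout follows. The nibble ordering and block struct in the actual CPU/CUDA/Metal kernels may differ, so treat this purely as a readability aid.

```go
import "math"

// e2m1Values are the magnitudes representable by a 4-bit E2M1 element.
var e2m1Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeMXFP4Block expands one 32-element block: 16 packed bytes of 4-bit
// elements plus a shared E8M0 scale (value 2^(e-127)).
func decodeMXFP4Block(scale byte, packed [16]byte) [32]float32 {
	s := float32(math.Exp2(float64(int(scale) - 127)))
	var out [32]float32
	for i, b := range packed {
		for j, nib := range [2]byte{b & 0x0F, b >> 4} {
			v := e2m1Values[nib&0x7]
			if nib&0x8 != 0 {
				v = -v
			}
			out[2*i+j] = v * s
		}
	}
	return out
}
```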
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
There is a bug when using sliding window attention where we run
out of KV cache slots. This is likely due to not correctly removing
all of the entries as they slide out of range. This adds additional
logging when this occurs to track down the source.
Bug #10127
Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.
This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.
This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
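One rough way to picture the check (variable names and the exact inequality are illustrative of the idea, not the cache implementation):

```go
// canResume reports whether generation can pick up from matchedPrefix
// tokens of an existing cache entry. Positions older than
// cachedLen-retained have been evicted, and the next token still needs its
// full attention window, so the divergence point has to land inside the
// retained region.
func canResume(cachedLen, matchedPrefix, window, retained int) bool {
	return matchedPrefix-window >= cachedLen-retained
}
```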
* Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though Ollama has a slightly different model graph
than llama.cpp, which requires different workaround
checks.
* Remove residual check by reshaping differently in gemma3n model
This should make the heuristics more robust
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
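A sketch of the batched shift; the real code builds per-layer RoPE graphs, so the callback here stands in for that.

```go
// shiftInBatches applies the RoPE shift in batch-sized chunks and skips any
// chunk whose offsets are all zero (the deleted half of the context, which
// needs no re-rotation). Working in chunks also keeps each shift graph no
// larger than a normal forward-pass batch.
func shiftInBatches(offsets []int32, batchSize int, applyRoPE func(start, end int)) {
	for start := 0; start < len(offsets); start += batchSize {
		end := min(start+batchSize, len(offsets))
		needsShift := false
		for _, off := range offsets[start:end] {
			if off != 0 {
				needsShift = true
				break
			}
		}
		if !needsShift {
			continue // all-zero chunk: nothing to rotate
		}
		applyRoPE(start, end)
	}
}
```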
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.
By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a significant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.
In theory, defrag could have the same issue since it also creates a
compute graph outside of the forward pass; however, since it only
performs copies, it does not require any working space.
Update README.md and CLAUDE.md to correctly reference Gemma3n model
support that was added in version 1.3.0, replacing generic "Gemma 3"
references with the specific "Gemma3n" model name.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>