Mirror of https://github.com/dogkeeper886/ollama37.git, synced 2025-12-10 07:46:59 +00:00
KV cache quantization has a dependency on the flash attention kernel. We currently cannot use flash attention with gpt-oss because it requires additional operations. The model definition does not call flash attention, so it works regardless of the setting, but the cache will still pick up the quantization type.

This updates the flash attention setting earlier in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
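A minimal sketch of the ordering problem, not ollama's actual code: the helper name `resolveKVCacheType` and the cache type strings are assumptions for illustration. The point is that the flash attention decision has to be resolved before the KV cache type is derived, otherwise a quantized cache type can be chosen even though the kernel it depends on is disabled.

```go
package main

import "fmt"

// resolveKVCacheType picks the KV cache type after the flash attention
// decision has been made. Quantized cache types (e.g. "q8_0") depend on
// the flash attention kernel, so without it we fall back to "f16".
// Hypothetical helper; not the real ollama API.
func resolveKVCacheType(flashAttention bool, requested string) string {
	if !flashAttention || requested == "" {
		return "f16"
	}
	return requested
}

func main() {
	// gpt-oss needs additional attention operations, so flash attention is
	// disabled for it. Deciding this before deriving the cache type keeps a
	// quantized cache from being silently applied.
	flashAttention := false // would normally come from model capabilities and user settings
	fmt.Println(resolveKVCacheType(flashAttention, "q8_0")) // prints "f16"
}
```

The actual change follows the same idea: the flash attention setting is finalized earlier in the loading flow, so settings derived afterwards, such as the cache quantization type, see the corrected value.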