ollama37/llama/runner/runner.go at 9ab62eb96f32d892720265e502796590d6a5dd72

mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-10 15:57:04 +00:00

Jesse Gross 08a832b482 llama: Ensure KV cache is fully defragmented.

Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not be able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.

In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.
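The second change, running bounded passes until nothing is left to move, can be sketched as follows. This is a minimal Go illustration under stated assumptions, not the actual patch: the cache type, defragPass, defragFully, and isCompact are hypothetical, and maxMoves here simply caps how many cells one pass may relocate.

```go
package main

import "fmt"

// cache is a toy stand-in for a KV cache: true marks an occupied cell.
type cache []bool

// defragPass moves occupied cells from the tail into free holes near the
// front, stopping after at most maxMoves moves. It returns the number of
// moves it performed.
func defragPass(c cache, maxMoves int) int {
	moves := 0
	lo, hi := 0, len(c)-1
	for moves < maxMoves {
		for lo < hi && c[lo] {
			lo++ // skip cells that are already occupied
		}
		for lo < hi && !c[hi] {
			hi-- // skip free cells at the tail
		}
		if lo >= hi {
			break // nothing left to move
		}
		c[lo], c[hi] = true, false
		moves++
	}
	return moves
}

// defragFully repeats bounded passes until a pass finds nothing to move,
// and reports how many passes did real work.
func defragFully(c cache, maxMoves int) int {
	passes := 0
	for defragPass(c, maxMoves) > 0 {
		passes++
	}
	return passes
}

// isCompact reports whether no occupied cell follows a free one.
func isCompact(c cache) bool {
	seenFree := false
	for _, u := range c {
		if !u {
			seenFree = true
		} else if seenFree {
			return false
		}
	}
	return true
}

func main() {
	c := cache{true, false, true, false, true, false, true}
	passes := defragFully(c, 1) // a single bounded pass would not be enough
	fmt.Println(passes, isCompact(c))
}
```

With a limit of one move per pass, a single call to defragPass leaves this cache fragmented; looping until a pass makes no moves is what guarantees a compact result regardless of the limit.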

Fixes #7949

2024-12-17 14:01:19 -08:00

26 KiB
