llama: Ensure KV cache is fully defragmented.

Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not being able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.

In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.

Fixes #7949
This commit is contained in:
Jesse Gross
2024-12-12 14:48:52 -08:00
committed by Jesse Gross
parent 2ddc32d5c5
commit 08a832b482
3 changed files with 289 additions and 61 deletions

View File

@@ -433,14 +433,7 @@ func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch)
err := s.lc.Decode(batch)
if err != nil {
if errors.Is(err, llama.ErrKvCacheFull) {
slog.Debug("defragmenting kv cache")
s.cache.lc.KvCacheDefrag()
err = s.lc.Decode(batch)
}
if err != nil {
return fmt.Errorf("failed to decode batch: %w", err)
}
return fmt.Errorf("failed to decode batch: %w", err)
}
if crossAttention {