mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-13 01:07:12 +00:00
The Llama engine always places vision projectors on the first GPU if one exists. However, the Ollama engine groups the projector with the output layer, which means the projector is only offloaded if all other layers are offloaded. The memory estimation code always assumes the former layout; this change makes it use the correct layout for the engine in use.

This addresses two impacts of the current behavior:
- In multi-GPU setups, we can crash with OOM errors when we try to allocate memory on a full GPU while another still has space.
- If the vision projector is large, it may prevent us from offloading anything when we could have fit some of the text layers.
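
The difference between the two layouts can be illustrated with a minimal Go sketch. This is not the actual estimation code: `Engine`, `Estimate`, and `EstimateLayout` are hypothetical names, all text layers are assumed to be the same size, placement is a simple greedy fill, and searching the GPUs from last to first for the output group is an assumption made for illustration only.

```go
package main

// Engine selects which layout the estimate should model (hypothetical type).
type Engine int

const (
	LlamaEngine  Engine = iota // projector pinned to the first GPU
	OllamaEngine               // projector grouped with the output layer
)

// Estimate is the result of the sketch (hypothetical type).
type Estimate struct {
	TextLayers      []int // number of text layers assigned to each GPU
	ProjectorGPU    int   // GPU holding the projector, or -1 if not offloaded
	OutputOffloaded bool  // whether the output layer fit on a GPU
}

// EstimateLayout greedily assigns numLayers text layers of layerSize bytes to
// GPUs with the given free memory, placing the projector according to engine.
func EstimateLayout(engine Engine, free []uint64, numLayers int,
	layerSize, outputSize, projectorSize uint64) Estimate {

	est := Estimate{TextLayers: make([]int, len(free)), ProjectorGPU: -1}
	if len(free) == 0 {
		return est
	}
	remaining := append([]uint64(nil), free...) // do not modify the caller's slice

	// tail is what must fit on one GPU once every text layer is offloaded.
	tail := outputSize
	if engine == LlamaEngine {
		// Projector goes on GPU 0 before anything else. If it does not fit,
		// this sketch reports that nothing can be offloaded.
		if remaining[0] < projectorSize {
			return est
		}
		remaining[0] -= projectorSize
		est.ProjectorGPU = 0
	} else {
		// Projector travels with the output layer.
		tail += projectorSize
	}

	// Fill GPUs in order with text layers.
	placed := 0
	for g := range remaining {
		for placed < numLayers && remaining[g] >= layerSize {
			remaining[g] -= layerSize
			est.TextLayers[g]++
			placed++
		}
	}

	// The output group is only offloaded if all text layers were offloaded.
	if placed == numLayers {
		for g := len(remaining) - 1; g >= 0; g-- {
			if remaining[g] >= tail {
				est.OutputOffloaded = true
				if engine == OllamaEngine {
					est.ProjectorGPU = g
				}
				break
			}
		}
	}
	return est
}
```

With two GPUs that have 4 and 10 units free, four text layers of size 2, an output layer of size 1, and a projector of size 6, the Llama layout in this sketch offloads nothing because the projector does not fit on the first GPU. The Ollama layout places two text layers on each GPU and leaves the projector and output layer on the CPU, which corresponds to the second impact described above.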