Offload layers to GPU based on new model size estimates (#1850)

* select layers based on estimated model memory usage

* always account for scratch vram

* dont load +1 layers

* better estmation for graph alloc

* Update gpu/gpu_darwin.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

* add overhead for cuda memory

* Update llm/llm.go

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* fix build error on linux

* address comments

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
This commit is contained in:
Jeffrey Morgan
2024-01-08 16:42:00 -05:00
committed by GitHub
parent 7e8f7c8358
commit 08f1e18965
10 changed files with 161 additions and 154 deletions

View File

@@ -78,7 +78,11 @@ type model interface {
ModelFamily() string
ModelType() string
FileType() string
NumLayers() int64
NumLayers() uint32
NumGQA() uint32
NumEmbed() uint32
NumHead() uint32
NumHeadKv() uint32
}
type container interface {