mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-10 07:46:59 +00:00

Files

Jesse Gross fe623c2cf4 ollamarunner: Multi-modal worst case graph

We currently preallocate compute graph memory for the worst case
batch of text tokens. This adds support for doing the same for
images.

Note that image models are more complicated than text models in
how they process their inputs so there may be cases where this
approach isn't completely generic for all models. It covers all
currently supported models though.

2025-05-15 13:46:20 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| common | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |
| llamarunner | ollamarunner: Base cached tokens on current prompt | 2025-05-15 13:46:20 -07:00 |
| ollamarunner | ollamarunner: Multi-modal worst case graph | 2025-05-15 13:46:20 -07:00 |
| README.md | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |
| runner.go | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
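
The same two requests can also be issued from Go. The following is a minimal sketch, not part of the runner itself: it assumes the runner is already listening on `localhost:8080` as in the curl examples, and it prints each response body verbatim because this README does not document the response format. The names `encodePrompt` and `postPrompt` are hypothetical helpers introduced for illustration, and the five-minute client timeout is an arbitrary choice.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// promptRequest mirrors the JSON body used in the curl examples: {"prompt": "..."}.
type promptRequest struct {
	Prompt string `json:"prompt"`
}

// encodePrompt (hypothetical helper) builds the JSON request body.
func encodePrompt(prompt string) ([]byte, error) {
	return json.Marshal(promptRequest{Prompt: prompt})
}

// postPrompt (hypothetical helper) POSTs the prompt to baseURL+endpoint and
// returns the raw response body. The body is treated as opaque bytes because
// the response schema is not described in this README.
func postPrompt(client *http.Client, baseURL, endpoint, prompt string) ([]byte, error) {
	payload, err := encodePrompt(prompt)
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	resp, err := client.Post(baseURL+endpoint, "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("post %s: %w", endpoint, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read response: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s returned %s: %s", endpoint, resp.Status, body)
	}
	return body, nil
}

func main() {
	// Assumption: the runner was started separately and listens on this address.
	const baseURL = "http://localhost:8080"
	client := &http.Client{Timeout: 5 * time.Minute}

	requests := []struct{ endpoint, prompt string }{
		{"/completion", "hi"},
		{"/embedding", "turn me into an embedding"},
	}
	for _, r := range requests {
		body, err := postPrompt(client, baseURL, r.endpoint, r.prompt)
		if err != nil {
			fmt.Fprintln(os.Stderr, "request failed:", err)
			continue
		}
		fmt.Printf("%s response:\n%s\n", r.endpoint, body)
	}
}
```

If the runner is not running, each request reports a connection error on stderr and the program moves on to the next endpoint.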