mirror of
https://github.com/dogkeeper886/ollama37.git
synced 2025-12-12 00:37:04 +00:00
Currently, if an error occurs during the prep stages (such as tokenizing) of a single request, it will only affect that request. However, if an error happens during decoding, it can take down the entire runner. Instead, it's better to drop the tokens that triggered the error and try to keep going. However, we also need to stop when we run out of tokens, otherwise, this just causes an infinite loop. This is likely the cause of at least some of the hanging issues that have been reported. Bug #7573
runner
Note: this is a work in progress
A minimial runner for loading a model and running inference via a http web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings