1. Fix binary path resolution using symlink (docker/runtime/Dockerfile)
- Build binary to source directory (./ollama)
- Create symlink from /usr/local/bin/ollama to /usr/local/src/ollama37/ollama
- Allows ml/path.go to resolve libraries via filepath.EvalSymlinks()
- Fixes "total vram=0 B" issue without requiring -w flag
2. Add comprehensive logging for model loading phases (llm/server.go)
- Log runner subprocess startup and readiness
- Log each memory allocation phase (FIT, ALLOC, COMMIT)
- Log layer allocation adjustments during convergence
- Log when model weights are being loaded (slowest phase)
- Log progress during waitUntilRunnerLaunched (every 1s)
- Improves visibility during 1-2 minute first-time model loads
3. Fix flash attention compute capability check (ml/device.go)
- Changed DriverMajor to ComputeMajor for correct capability detection
- Flash attention requires compute capability >= 7.0, not driver version
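A minimal sketch of the corrected check in item 3; the struct and field names here are assumptions based on the description above, not the actual ml/device.go definitions:

```go
package ml

// deviceInfo is a trimmed-down stand-in for the real GPU info struct.
type deviceInfo struct {
	DriverMajor  int // driver release, e.g. 470; not a compute capability
	ComputeMajor int // SM major version, e.g. 3 for Kepler, 7 for Volta/Turing
	ComputeMinor int
}

// flashAttentionSupported applies the >= 7.0 compute-capability requirement.
// The previous code compared DriverMajor, which reports the driver release
// rather than the GPU's SM version.
func flashAttentionSupported(d deviceInfo) bool {
	return d.ComputeMajor >= 7
}
```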
These changes improve user experience during model loading by providing
clear feedback at each stage, especially during the slow COMMIT phase
where GGUF weights are loaded and CUDA kernels compile.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit fixes the issue where large models (>10B parameters) fail to
load due to underestimated compute buffer memory requirements, causing
allocation failures when the model should use multiple GPUs.
Problem:
- deepseek-r1:14b (14B, qwen2 architecture) failed with "failed to allocate
compute buffers" error
- System has 2×Tesla K80 GPUs (24GB total) but tried to fit 12GB model in
1×11GB GPU
- Root cause: Memory estimation underestimated compute buffers by 3-4×
(estimated 916 MB, actual requirement ~3-4 GB)
Solution:
1. Added model-family-specific batch size defaults (llm/memory.go)
- Different architectures have different optimal batch sizes
- deepseek2: 2048/256, qwen2: 512/512, llama: 512/512, etc.
- Ensures accurate memory estimation based on architecture (see the sketch after this list)
2. Updated server to use architecture-specific batch sizes (llm/server.go)
- Detects model architecture from GGUF metadata
- Uses family defaults when user doesn't specify
- Ensures consistency between estimation and allocation
3. Applied 3.5× safety margin to compute buffer estimates (llm/memory.go)
- Accounts for temporary tensors not captured in GraphSize formulas
- Conservative approach prevents allocation failures
- Documented with detailed analysis of underestimation causes
4. Implemented measurement API for future use (llama-context.cpp, llama.go)
- C++ function to measure actual memory requirements
- Go wrapper for integration into GPU selection
- Foundation for future measurement-based approach
- Currently unused but documented for future improvement
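A minimal sketch, under assumed names, of how items 1 and 3 above could fit together in llm/memory.go; reading the "2048/256"-style values as (batch, ubatch) pairs is an assumption, not taken from the actual code:

```go
package llm

// batchDefaults pairs a logical batch size with a micro-batch size.
type batchDefaults struct {
	Batch  int
	UBatch int
}

// familyBatchDefaults is a hypothetical lookup keyed by the architecture name
// reported in the GGUF metadata.
var familyBatchDefaults = map[string]batchDefaults{
	"deepseek2": {Batch: 2048, UBatch: 256},
	"qwen2":     {Batch: 512, UBatch: 512},
	"llama":     {Batch: 512, UBatch: 512},
}

// defaultBatchSizes returns the architecture-specific defaults, falling back
// to 512/512 for unknown families or when the user specifies nothing.
func defaultBatchSizes(arch string) batchDefaults {
	if d, ok := familyBatchDefaults[arch]; ok {
		return d
	}
	return batchDefaults{Batch: 512, UBatch: 512}
}

// computeBufferSafetyMargin is the 3.5× factor from item 3, covering temporary
// tensors that the GraphSize formulas do not capture.
const computeBufferSafetyMargin = 3.5

// paddedComputeBuffer applies the safety margin to an estimated compute
// buffer size in bytes.
func paddedComputeBuffer(estimated uint64) uint64 {
	return uint64(float64(estimated) * computeBufferSafetyMargin)
}
```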
Results:
- deepseek-r1:14b now loads successfully using both GPUs
- Proper distribution: 25 layers on GPU0, 24 layers on GPU1
- Total memory: 16.2 GB across 2×11 GB GPUs (8.4 + 7.8 GB)
- Compute buffers: 3.1 GB per GPU (with safety margin applied)
- All other models continue to work correctly
Comprehensive documentation added to all modified code explaining:
- Problem analysis with real examples
- Solution rationale and trade-offs
- Future improvement paths
Tested with: deepseek-r1:14b, deepseek-r1:8b, gemma3:4b, gpt-oss
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit represents a complete rework after pulling the latest changes from
official ollama/ollama repository and re-applying Tesla K80 compatibility patches.
## Key Changes
### CUDA Compute Capability 3.7 Support (Tesla K80)
- Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt
- Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset
- Using 37-virtual (PTX with JIT compilation) for maximum compatibility
### Legacy Toolchain Compatibility
- **NVIDIA Driver**: 470.256.02 (last version supporting Kepler/K80)
- **CUDA Version**: 11.4.4 (last CUDA 11.x supporting compute 3.7)
- **GCC Version**: 10.5.0 (required by CUDA 11.4 host_config.h)
### CPU Architecture Trade-offs
Due to GCC 10.5 limitation, sacrificed newer CPU optimizations:
- Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+)
- Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA
- Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility)
### Build System Updates
- Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7
- Added -Wno-deprecated-gpu-targets flag to suppress warnings
- Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI
### Upstream Sync
Merged latest llama.cpp changes including:
- Enhanced KV cache management with ISWA and hybrid memory support
- Improved multi-modal support (mtmd framework)
- New model architectures (Gemma3, Llama4, Qwen3, etc.)
- GPU backend improvements for CUDA, Metal, and ROCm
- Updated quantization support and GGUF format handling
### Documentation
- Updated CLAUDE.md with comprehensive build instructions
- Documented toolchain constraints and CPU architecture trade-offs
- Removed outdated CI/CD workflows (tesla-k80-*.yml)
- Cleaned up temporary development artifacts
## Rationale
This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in
official Ollama due to legacy driver/CUDA requirements. The toolchain constraint
creates a rigid dependency chain:
- K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI
We accept the loss of cutting-edge CPU optimizations to enable running modern
LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add Gemma3n model support with text generation capabilities
- Add new CUDA mean operations for improved performance
- Add macOS documentation and performance tests
- Update LLAMA patches for ROCm/CUDA compatibility
- Fix various model conversion and processing issues
- Update CI workflows and build configurations
- Add library model tests and Shakespeare test data
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
"POST predict" basically means that the runner has crashed, which
can have many reasons. However, many people think this is a specific
error and either report only this message or group together unrelated
bugs. This replaces it with a more friendly and helpful message.
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them into two
steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
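A hypothetical shape for the split, with assumed names rather than the real ml.Backend API, just to make the two steps concrete:

```go
package ml

// Backend is a trimmed-down, illustrative interface for the split described
// above; it is not the real ml.Backend definition.
type Backend interface {
	// Load reads tensor data into the memory that was already allocated when
	// the backend was constructed. Keeping this as a second step lets the
	// caller decide when to pay for the slow I/O, report progress, or abort.
	Load(progress func(fraction float32)) error
}

// A caller would first construct the backend (enumerating tensors and
// allocating memory), then load the data:
//
//	b, err := NewBackend(modelPath) // fast: metadata and allocation
//	if err != nil {
//		return err
//	}
//	return b.Load(reportProgress)   // slow: tensor data
```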
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second causes the loss of the new
reloading runner reference, triggering the leak.
The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
This reduces the size of our Windows installer payloads by ~256M by dropping
support for nvidia drivers older than Feb 2023. Hardware support is unchanged.
Linux default bundle sizes are reduced by ~600M to 1G.
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options.
For all search path env vars, make sure our dirs are first to avoid
potentially finding other incompatible libraries on the user's system.
Also fixes a minor build script glitch for Windows ROCm.
This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an error state; it now says "not responding", which better
reflects the actual state. Runners now have slog wiring to report more details
about the runner, including the PID.
No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.
Llama3.2-vision is also inconsistent between self-attention and
cross-attention layers - at the moment, we calculate the correct total
size and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.
This allows memory estimation to calculate the per-layer KV cache size
and take it into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.
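A sketch of the per-layer estimate; the names and the sizing formula are illustrative assumptions, not the actual llm/memory.go code:

```go
package llm

// layerKVBytes estimates the KV cache size for a single layer. Sliding-window
// layers (as in Gemma3) only cache windowSize positions instead of the full
// context, which is what makes per-layer sizes uneven.
func layerKVBytes(contextLen, windowSize, kvHeads, headDim, bytesPerElem int, slidingWindow bool) uint64 {
	positions := contextLen
	if slidingWindow && windowSize < contextLen {
		positions = windowSize
	}
	// K and V each hold positions * kvHeads * headDim elements.
	return 2 * uint64(positions) * uint64(kvHeads) * uint64(headDim) * uint64(bytesPerElem)
}
```

Layer placement can then sum each layer's weights plus its own KV estimate against a GPU's free memory, instead of charging every layer the worst-case size.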
Fixes #9730
Fixes #9890
This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).
We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
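A minimal sketch of that rule, using an assumed tokenizer interface rather than the real runner types:

```go
package runner

// Tokenizer is a stand-in for the real tokenizer; addSpecial controls whether
// special tokens (e.g. BOS) are prepended.
type Tokenizer interface {
	Tokenize(text string, addSpecial bool) ([]int32, error)
}

// tokenizePieces tokenizes the text segments surrounding images, adding the
// special tokens only for the first piece.
func tokenizePieces(tok Tokenizer, pieces []string) ([][]int32, error) {
	out := make([][]int32, 0, len(pieces))
	for i, p := range pieces {
		toks, err := tok.Tokenize(p, i == 0) // special tokens on the first piece only
		if err != nil {
			return nil, err
		}
		out = append(out, toks)
	}
	return out, nil
}
```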
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
provides a better approach to #9088 that will attempt to
evaluate symlinks (important for macOS where 'ollama' is
often a symlink), but use the result of os.Executable()
as a fallback in scenarios where filepath.EvalSymlinks
fails due to permission errors or other issues.
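A sketch of that fallback logic (illustrative names, not the exact upstream code):

```go
package main

import (
	"log/slog"
	"os"
	"path/filepath"
)

// executablePath resolves the running binary, following symlinks when it can
// (macOS installs often expose 'ollama' as a symlink) and falling back to the
// raw os.Executable result when EvalSymlinks fails, e.g. on permission errors.
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	resolved, rerr := filepath.EvalSymlinks(exe)
	if rerr != nil {
		slog.Debug("falling back to unresolved executable path", "error", rerr)
		return exe, nil
	}
	return resolved, nil
}
```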
In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors that prevent models from running. It also doesn't handle long
paths on Windows well, which likewise causes errors. This change
removes the filepath.EvalSymlinks call when accessing os.Executable()
altogether.
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.
In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models
Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:
Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve
Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
feat: add new Ollama engine using ggml through cgo
This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.
- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
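Condensed, illustrative versions of the three interfaces; the method signatures here are assumptions, not the definitions in model/model.go and ml/backend.go:

```go
package sketch

// Tensor stands in for ml.Tensor: a handle to tensor data plus the operations
// (matmul, add, softmax, ...) a model needs to express its forward pass.
type Tensor interface {
	Shape() []int
}

// Backend stands in for ml.Backend: it loads a pretrained model onto the
// target hardware and hands out the loaded tensors by name.
type Backend interface {
	Get(name string) Tensor
}

// Context would wrap the compute graph being built for one forward pass.
type Context interface{}

// Model stands in for model.Model: each architecture (llama, mllama, ...)
// implements Forward to produce next-token logits for a batch of inputs.
type Model interface {
	Forward(ctx Context, inputs []int32) (Tensor, error)
}
```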
This is the first implementation of the new engine. Follow up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner
This ensures the GPU specific environment variables are set properly
* explicitly set CXX compiler for HIP
* Update build_windows.ps1
This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unnecessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Changes in #8002 introduced fixes for bugs that mangled JSON Schemas.
They also fixed a bug where the server would silently fail when clients
requested invalid formats. Unfortunately, they also introduced a bug
where the server would reject requests with an empty format, which
should be allowed.
The change in #8127 updated the code to allow the empty format, but also
reintroduced the regression where the server would silently fail when
the format was set, but invalid.
This commit fixes both regressions. The server does not reject the empty
format, but it does reject invalid formats. It also adds tests to help
us catch regressions in the future.
Also, the updated code provides a more detailed error message when a
client sends a non-empty, but invalid format, echoing the invalid format
in the response.
This commit also takes the opportunity to remove superfluous linter
checks.
Previously we decoded and re-encoded JSON schemas during validation,
which served no purpose since json.RawMessage already validates JSON
syntax. Worse, the re-encoding lost field ordering from the original
schema, which affects inference quality during step-by-step reasoning.
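A small illustration of the difference, with assumed helper names rather than the server's actual code:

```go
package server

import (
	"encoding/json"
	"fmt"
)

// normalizeSchema shows why decode/re-encode mangles a schema: unmarshalling
// into a map discards key order, so re-encoded properties come back sorted.
// Validating the raw bytes and passing them through preserves the client's
// original field order.
func normalizeSchema(raw json.RawMessage) (json.RawMessage, error) {
	// Lossy approach:
	//   var m map[string]any
	//   _ = json.Unmarshal(raw, &m)
	//   raw, _ = json.Marshal(m) // "step2" may now precede "step1"
	//
	// Order-preserving approach: validate only, keep the original bytes.
	if !json.Valid(raw) {
		return nil, fmt.Errorf("invalid JSON schema: %s", raw)
	}
	return raw, nil
}
```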
While fixing this ordering issue by using json.RawMessage directly,
testing revealed that schema_to_grammar (from llama.cpp) also fails to
preserve field order during grammar generation. This appears to be the
root cause of inference degradation.
This change prevents us from mangling the user's original schema order,
but we still need to address the ordering issue in schema_to_grammar.
That will be a separate change.
Updates #7978
* llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners, more tech debt can be removed and we can stop building
the "default" runner via make and always rely on the builtin.
* build: Make target improvements
Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash. If the user builds a customized set, we omit the naming
scheme and don't check for compatibility. This avoids checking
requirements at runtime, so that logic has been removed as well. This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
If the user checks out the repo in a path that contains spaces, make gets
really confused, so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
Users get confused by "Failed to acquire semaphore" error="context canceled"
messages in the logs, which are actually clients giving up. While there could be
a legitimate hang bug in the system, sometimes this is just short client timeouts
with an overloaded system, so this should help users understand what's going on
better.
Many model crashes are masked behind "An existing connection was forcibly closed by the remote host"
This captures that common error message and wires in any detected errors from the log.
This also adds the deepseek context shift error to the known errors we capture.
Llama.cpp sometimes returns NULL to report an error. We should
explicitly check for this and convert it to a Go
error rather than putting NULL in our data structures and waiting
for it to blow up later.
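A minimal cgo sketch of that pattern, with a placeholder C function standing in for the real llama.cpp call:

```go
package llama

/*
// Placeholder standing in for a llama.cpp call that reports errors via NULL.
static void *do_work(void) { return 0; }
*/
import "C"

import (
	"errors"
	"unsafe"
)

// doWork checks the C return value immediately and converts NULL into a Go
// error, instead of storing a nil pointer and crashing later.
func doWork() (unsafe.Pointer, error) {
	p := C.do_work()
	if p == nil {
		return nil, errors.New("do_work failed (returned NULL)")
	}
	return p, nil
}
```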