ollama37

mirror of https://github.com/dogkeeper886/ollama37.git synced 2025-12-18 11:47:07 +00:00

Author	SHA1	Message	Date
Shang Chieh Tseng	ef14fb5b26	Sync with upstream ollama/ollama and restore Tesla K80 (compute 3.7) support This commit represents a complete rework after pulling the latest changes from official ollama/ollama repository and re-applying Tesla K80 compatibility patches. ## Key Changes ### CUDA Compute Capability 3.7 Support (Tesla K80) - Added sm_37 (compute 3.7) to CMAKE_CUDA_ARCHITECTURES in CMakeLists.txt - Updated CMakePresets.json to include compute 3.7 in "CUDA 11" preset - Using 37-virtual (PTX with JIT compilation) for maximum compatibility ### Legacy Toolchain Compatibility - NVIDIA Driver: 470.256.02 (last version supporting Kepler/K80) - CUDA Version: 11.4.4 (last CUDA 11.x supporting compute 3.7) - GCC Version: 10.5.0 (required by CUDA 11.4 host_config.h) ### CPU Architecture Trade-offs Due to GCC 10.5 limitation, sacrificed newer CPU optimizations: - Alderlake CPU variant enabled WITHOUT AVX_VNNI (requires GCC 11+) - Still supports: SSE4.2, AVX, F16C, AVX2, BMI2, FMA - Performance impact: ~3-7% on newer CPUs (acceptable for K80 compatibility) ### Build System Updates - Modified ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt for compute 3.7 - Added -Wno-deprecated-gpu-targets flag to suppress warnings - Updated ml/backend/ggml/ggml/src/CMakeLists.txt for Alderlake without AVX_VNNI ### Upstream Sync Merged latest llama.cpp changes including: - Enhanced KV cache management with ISWA and hybrid memory support - Improved multi-modal support (mtmd framework) - New model architectures (Gemma3, Llama4, Qwen3, etc.) - GPU backend improvements for CUDA, Metal, and ROCm - Updated quantization support and GGUF format handling ### Documentation - Updated CLAUDE.md with comprehensive build instructions - Documented toolchain constraints and CPU architecture trade-offs - Removed outdated CI/CD workflows (tesla-k80-*.yml) - Cleaned up temporary development artifacts ## Rationale This fork maintains Tesla K80 GPU support (compute 3.7) which was dropped in official Ollama due to legacy driver/CUDA requirements. The toolchain constraint creates a deadlock: - K80 → Driver 470 → CUDA 11.4 → GCC 10 → No AVX_VNNI We accept the loss of cutting-edge CPU optimizations to enable running modern LLMs on legacy but still capable Tesla K80 hardware (12GB VRAM per GPU). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 14:03:05 +08:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363) * Move quantization logic to GGML via new backend This moves the model aware logic to Go code and calls GGMLs quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Jesse Gross	f66216e399	ggml: Support heterogeneous KV cache layer sizes in memory estimation Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self attention and cross attention layers - at moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average. This allows memory estimation to calculate per-layer KV cache size and take this account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension. Fixes #9730 Fixes #9890	2025-03-26 13:16:03 -07:00
Parth Sareen	314573bfe8	config: allow setting context length through env var (#8938) * envconfig: allow setting context length through env var	2025-02-24 13:26:35 -08:00
Michael Yang	58245413f4	next ollama runner (#7913) feat: add new Ollama engine using ggml through cgo This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this. - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go` - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go` - `ml.Tensor` defines the interface for a tensor and tensor operations This is the first implementation of the new engine. Follow up PRs will implement more features: - non-greedy sampling (#8410) - integration with Ollama and KV caching (#8301) - more model support (#9080) with more coming soon Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-02-13 16:31:21 -08:00
Stefan Weil	abfdc4710f	all: fix typos in documentation, code, and comments (#7021)	2024-12-10 12:58:06 -08:00
Sam	1bdab9fdb1	llm: introduce k/v context quantization (vRAM improvements) (#6279)	2024-12-03 15:57:19 -08:00
Daniel Hiltgen	05cd82ef94	Rename gpu package discover (#7143) Cleaning up go package naming	2024-10-16 17:45:00 -07:00
Michael Yang	77903ab8b4	llama3.1	2024-08-21 11:49:31 -07:00
Michael Yang	b732beba6a	lint	2024-08-01 17:06:06 -07:00
Michael Yang	df993fa37b	comments	2024-07-31 15:58:55 -07:00
Michael Yang	5e9db9fb0b	refactor convert	2024-07-31 15:58:33 -07:00
Michael Yang	35b89b2eab	rfc: dynamic environ lookup	2024-07-22 11:25:30 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00

16 Commits
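
The memory-estimation commits above (f66216e399 and 6fd04ca922) describe why placing layers by an averaged KV cache size can overfill a GPU, and why a layer cannot be split across GPUs. The following is a minimal sketch of that idea, not the code from the repository: the function names `place_layers`, `averaged`, and `usage`, the greedy in-order placement, and the size units are all assumptions made for illustration.

```python
from typing import List, Optional


def place_layers(layer_sizes: List[int], gpu_free: List[int]) -> Optional[List[int]]:
    """Greedily assign layers, in order, to GPUs (hypothetical helper).

    A layer is never split across GPUs. Once a GPU cannot hold the next
    layer, placement moves on to the following GPU and does not go back.
    Returns one GPU index per layer, or None if a layer fits nowhere.
    """
    remaining = list(gpu_free)
    assignment: List[int] = []
    gpu = 0
    for size in layer_sizes:
        while gpu < len(remaining) and remaining[gpu] < size:
            gpu += 1
        if gpu == len(remaining):
            return None
        remaining[gpu] -= size
        assignment.append(gpu)
    return assignment


def averaged(layer_sizes: List[int]) -> List[int]:
    """Replace every layer size with the (integer) mean, keeping the count."""
    avg = sum(layer_sizes) // len(layer_sizes)
    return [avg] * len(layer_sizes)


def usage(assignment: List[int], true_sizes: List[int], n_gpus: int) -> List[int]:
    """Memory actually consumed on each GPU under a given assignment."""
    used = [0] * n_gpus
    for gpu, size in zip(assignment, true_sizes):
        used[gpu] += size
    return used


# Illustrative numbers only: five small sliding-window layers and one
# full-context layer (arbitrary units), on two GPUs with 4 and 8 units free.
sizes = [1, 1, 1, 1, 1, 7]
free = [4, 8]

plan_avg = place_layers(averaged(sizes), free)   # planned with averaged sizes
plan_exact = place_layers(sizes, free)           # planned with per-layer sizes

print("averaged plan :", plan_avg, "actual usage:", usage(plan_avg, sizes, len(free)))
print("per-layer plan:", plan_exact, "actual usage:", usage(plan_exact, sizes, len(free)))
```

In this example the averaged plan puts four layers, including the large one, on the second GPU, so its real usage (10 units) exceeds the 8 units free. The per-layer plan stays within both limits.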