Commit Graph

9 Commits

Author SHA1 Message Date
Shang Chieh Tseng
296d537a2c Update CLAUDE.md: Document Phase 2 CC 3.7 graph correction
Added Phase 2 documentation for single-GPU optimization:
- CC 3.7 graph correction factor (85% of estimate)
- gemma3:12b now loads on single GPU
- Improved from 11.9 GiB → 11.0 GiB estimation
- Validated with 10.0 GiB actual usage, 94% GPU utilization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-30 00:16:38 +08:00
Shang Chieh Tseng
d04ea50ced Fix gpt-oss model architecture to match GGUF tensor format
The gpt-oss model architecture code expected fused tensors (attn_qkv,
ffn_gate_up_exps) but the actual GGUF files contain separate tensors
(attn_q/k/v, ffn_gate_exps/up_exps), causing nil pointer panics during
model loading.

Changes:
- model/models/gptoss/model.go: Updated AttentionBlock to use separate
  Query/Key/Value fields instead of fused QKV, modified Forward() to
  compute projections separately
- model/models/gptoss/model.go: Updated MLPBlock to use separate Gate/Up
  fields instead of fused GateUp, simplified Forward() logic
- fs/ggml/type.go: Reorganized MXFP4 tensor type constant ordering
- ml/backend/ggml/ggml/include/ggml.h: Moved GGML_TYPE_MXFP4 to end of
  enum to match GGUF file format specification
- ml/backend/ggml/ggml/src/ggml.c: Updated type name array to match
  reordered enum
- CLAUDE.md: Documented gpt-oss model compatibility fix

Result: gpt-oss:20b model now loads and runs successfully on Tesla K80,
all 25 layers offload to GPU correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 23:34:03 +08:00
Shang Chieh Tseng
241a03402e Optimize GPU memory estimation for single-GPU preference on Tesla K80
Implemented multi-GPU memory optimization to reduce unnecessary model splits
across dual Tesla K80 GPUs by fixing graph memory overestimation.

Changes:
1. Per-GPU graph allocation strategy
   - Secondary GPUs: 190 MiB (empirically measured)
   - Primary GPU: Full 1.3 GiB graph allocation
   - Applied during layer distribution, not just final allocation

2. Reverse-order layer distribution
   - Prefer loading all layers on last GPU (GPU 1) first
   - Only use secondary GPUs when primary is full
   - Changed from round-robin to reverse-order (j-1 instead of i%j)

Results:
 gemma3:4b: Single GPU (no split, was already working)
 gemma3:12b: 1,48 layer split (improved from 25,24 split)
   - GPU 0: 1 layer, 610 MiB (down from 4156 MiB)
   - GPU 1: 48 layers, 9857 MiB (primary)
   - Total actual: 10.5 GiB (fits in single K80's 11.2 GiB)

Memory estimate reduced from 13.0 GiB → 11.9 GiB, enabling more models
to run on single GPU with better performance (no cross-GPU overhead).

Files modified:
- llm/memory.go: Core allocation logic (lines 230-288)
- llm/CLAUDE.md: Detailed implementation guide
- CLAUDE.md: Project status and results summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 19:58:20 +08:00
Shang Chieh Tseng
5077ab3fb4 Document Phase 9 completion: Fix CUDA backend loading for CC 3.7
Phase 9 successfully resolved runtime loading issues where CUDA backend
failed to load due to undefined Flash Attention symbols.

Solution:
- Disabled flash attention helper functions (lines 126-274 in fattn.cu)
- Simplified ggml_cuda_flash_attn_ext() to abort immediately for CC 3.7
- Added GGML_UNUSED macros to prevent compiler warnings
- Added ggml_backend_cuda_score() function for backend selection

Testing Results:
 CUDA backend loads without undefined symbol errors
 GPU layers offload correctly (e.g., 35/35 for gemma3:4b)
 Fast GPU inference confirmed working

Flash Attention is not supported on CC 3.7 (requires Volta/Tensor Cores).
If attempted, gracefully aborts with clear error message.

All 9 phases of CC 3.7-only optimization now complete and tested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 17:44:36 +08:00
Shang Chieh Tseng
771044bead Complete CC 3.7-only CUDA optimization for Tesla K80 support
Simplify CUDA backend to exclusively support Compute Capability 3.7 (Kepler/Tesla K80).
This optimization removes ~2,700 lines of modern GPU code and resolves all compilation issues.

Changes:
- Remove tensor core files (mma.cuh, fattn-wmma-f16.*, fattn-mma-f16.cuh) and 92 template instances
- Hardcode architecture detection to always return CC 3.7 (370) in common.cuh
- Disable modern GPU features: FP16 native ops, MMA/WMMA, CP_ASYNC, BF16, CUDA graphs
- Disable 6 MMA functions in mmq.cuh while preserving DP4A functions for CC 3.7
- Replace undefined architecture constants (PASCAL/VOLTA/DP4A/ADA_LOVELACE) with CC 3.7 equivalents
- Set CMAKE_CUDA_ARCHITECTURES to "37" only in CMakeLists.txt and CMakePresets.json
- Hardcode Stream-K scheduling to false, precision to FP32 throughout
- Add comprehensive CLAUDE.md documentation with complete optimization history

Build configuration now compiles only for architecture 37, resulting in 80-85% smaller
binaries and 5-6x faster build times. All removed code paths were unreachable on CC 3.7
hardware, ensuring no performance degradation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 15:21:08 +08:00
Shang Chieh Tseng
135b799b13 Update command. 2025-10-29 14:21:03 +08:00
Shang Chieh Tseng
f337f53408 docs: update documentation to reflect Gemma3n support in v1.3.0
Update README.md and CLAUDE.md to correctly reference Gemma3n model
support that was added in version 1.3.0, replacing generic "Gemma 3"
references with the specific "Gemma3n" model name.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-20 09:47:05 +08:00
Shang Chieh Tseng
7c029749bc docs: restructure README and create comprehensive manual build guide
- Restructure README.md for better readability and organization
- Reduce README word count by 75% while maintaining key information
- Move detailed installation guides to docs/manual-build.md
- Add Tesla K80-specific build instructions and optimizations
- Update CLAUDE.md with new documentation structure and references
- Improve title formatting with emoji and clear tagline

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-20 09:11:43 +08:00
Shang Chieh Tseng
cbcbc9ae07 Add support for new models and fix GitHub issues
- Add Gemma3n model support with text generation capabilities
- Add new CUDA mean operations for improved performance
- Add macOS documentation and performance tests
- Update LLAMA patches for ROCm/CUDA compatibility
- Fix various model conversion and processing issues
- Update CI workflows and build configurations
- Add library model tests and Shakespeare test data

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-20 00:12:36 +08:00