IBM granite/granitemoe architecture support (#6760)

* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
df270ef745

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet include granite MoE support, but that can come in a
follow-up PR.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This header is where the ggml_cgraph struct is defined. It is included in
many of the .c files to complete the forward declaration in ggml.h. It seems
that with the subset of code vendored here, the include was somehow lost (or
reordered) when building, so adding this include to llama.cpp fixes the
missing definition.
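
As a rough illustration of the failure mode (hypothetical type names, not the
actual ggml headers): a forward declaration is enough to pass pointers around,
but any translation unit that touches struct members needs the full definition,
which for ggml_cgraph lives in ggml-impl.h rather than ggml.h.

    // minimal C++ sketch; "cgraph" is a stand-in, not the real ggml type
    struct cgraph;                    // what ggml.h provides: a forward declaration

    struct cgraph { int n_nodes; };   // what ggml-impl.h provides: the definition

    int node_count(const cgraph * g) {
        return g->n_nodes;            // compiles only because the definition is visible
    }

    int main() {
        cgraph g{3};
        return node_count(&g) == 3 ? 0 : 1;
    }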

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here mirror the big llama.cpp sampling refactor:
https://github.com/ggerganov/llama.cpp/pull/9294

Sampling is now split into a base interface (llama_sampler) and the
generation-side implementation (gpt_sampler), and the code here follows that
split. Since the sampling.h/sampling.cpp code uses C++ STL headers, the
sampling_ext.[h|cpp] wrapper is maintained to give Go access to a pure-C
interface.
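
As a sketch of the wrapper pattern only (made-up names, not the real
sampling_ext API): the C++ sampler is hidden behind an opaque handle and a
small extern "C" surface, so cgo can call it without ever seeing STL types.

    // hypothetical example of the opaque-handle pattern, not ollama's actual API
    #include <string>
    #include <vector>

    struct demo_sampler {                         // stand-in for the C++ gpt_sampler
        std::vector<float> logits;
        std::string        grammar;
    };

    extern "C" {

    typedef struct demo_sampler * demo_sampler_t; // opaque handle for the Go side

    demo_sampler_t demo_sampler_init(void)             { return new demo_sampler(); }
    void           demo_sampler_free(demo_sampler_t s) { delete s; }
    int            demo_sampler_sample(demo_sampler_t s) {
        // a real wrapper would drive the llama_sampler chain here
        return s->logits.empty() ? -1 : 0;
    }

    } // extern "C"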

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface. In the interest
of doing no harm, though, this keeps the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used or needed previously, so it should be fine to skip
plumbing it through for now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop tokens),
then we should send them back before ending the sequence. Otherwise, we can
lose tokens at the end of a response.

Fixes #6707
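
As a rough sketch of the idea only (written in C++ with made-up names; the
actual runner is Go code): pieces held back while checking for a possible stop
sequence must be emitted once the sequence finishes, or the tail of the reply
is silently dropped.

    // illustrative buffering/flushing logic, not the runner's real implementation
    #include <iostream>
    #include <string>
    #include <vector>

    static void finish_sequence(std::vector<std::string> & pending) {
        for (const auto & piece : pending) {   // flush anything still held back
            std::cout << piece;
        }
        pending.clear();
        // ...only now report the sequence as finished
    }

    int main() {
        std::vector<std::string> pending;
        std::cout << "hello";
        pending.push_back(" wor");             // held back: could begin a stop token
        pending.push_back("ld");
        finish_sequence(pending);              // without this, output ends at "hello"
        std::cout << "\n";
        return 0;
    }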

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This patch was only needed because of an earlier mistake in the embeddings
patch, which dereferenced the pointer directly instead of using the wrapper
API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Author:    Gabe Goodhart
Date:      2024-10-17 12:59:52 -06:00
Committer: GitHub
Commit:    f2890a4494 (parent: 05cd82ef94)
Stats:     263 changed files with 14255 additions and 10867 deletions

@@ -1,8 +1,21 @@
diff --git a/ggml/include/ggml-cuda.h b/ggml/include/ggml-cuda.h
index 71bb6dcf..08be0895 100644
--- a/ggml/include/ggml-cuda.h
+++ b/ggml/include/ggml-cuda.h
@@ -34,6 +34,8 @@ GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_typ
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
+GGML_API GGML_CALL int ggml_backend_cuda_reg_devices();
+
GGML_API GGML_CALL int ggml_backend_cuda_get_device_count(void);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_description(int device, char * description, size_t description_size);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_memory(int device, size_t * free, size_t * total);
diff --git a/ggml/src/ggml-backend.c b/ggml/src/ggml-backend.c
index 9e35ce98..179be840 100644
index ba280e06..d5c3fe49 100644
--- a/ggml/src/ggml-backend.c
+++ b/ggml/src/ggml-backend.c
@@ -87,7 +87,12 @@ void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
@@ -83,7 +83,12 @@ void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
if (buffer->iface.free_buffer != NULL) {
buffer->iface.free_buffer(buffer);
}
@@ -16,10 +29,10 @@ index 9e35ce98..179be840 100644
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-cuda.cu b/ggml/src/ggml-cuda.cu
index 04b6e528..43b12bdf 100644
index 6efdab14..809d6ab1 100644
--- a/ggml/src/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda.cu
@@ -392,6 +392,10 @@ GGML_CALL static bool ggml_backend_buffer_is_cuda(ggml_backend_buffer_t buffer)
@@ -469,6 +469,10 @@ GGML_CALL static bool ggml_backend_buffer_is_cuda(ggml_backend_buffer_t buffer)
GGML_CALL static void ggml_backend_cuda_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
delete ctx;
@@ -30,7 +43,7 @@ index 04b6e528..43b12bdf 100644
}
GGML_CALL static void * ggml_backend_cuda_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -3028,8 +3032,6 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cuda_init(const char * params,
@@ -3204,8 +3208,6 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cuda_init(const char * params,
GGML_UNUSED(params);
}
@@ -39,16 +52,3 @@ index 04b6e528..43b12bdf 100644
GGML_CALL int ggml_backend_cuda_reg_devices() {
int device_count = ggml_backend_cuda_get_device_count();
//int device_count = 1; // DEBUG: some tools require delaying CUDA initialization
diff --git a/ggml/include/ggml-cuda.h b/ggml/include/ggml-cuda.h
index 5eb4af40..50b91009 100644
--- a/ggml/include/ggml-cuda.h
+++ b/ggml/include/ggml-cuda.h
@@ -31,6 +31,8 @@ GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_typ
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
+GGML_API GGML_CALL int ggml_backend_cuda_reg_devices();
+
GGML_API GGML_CALL int ggml_backend_cuda_get_device_count(void);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_description(int device, char * description, size_t description_size);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_memory(int device, size_t * free, size_t * total);

@@ -1,8 +1,8 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index 88355971..dd7d41ed 100644
index 4c0a1bb6..800dfb95 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -6083,16 +6083,7 @@ static void llm_load_vocab(
@@ -6287,16 +6287,7 @@ static void llm_load_vocab(
if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
vocab.tokenizer_add_space_prefix = false;
vocab.tokenizer_clean_spaces = true;
@@ -20,9 +20,9 @@ index 88355971..dd7d41ed 100644
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (
tokenizer_pre == "llama3" ||
@@ -6188,7 +6179,8 @@ static void llm_load_vocab(
tokenizer_pre == "exaone") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_EXAONE;
@@ -6398,7 +6389,8 @@ static void llm_load_vocab(
vocab.tokenizer_add_bos = true;
vocab.tokenizer_clean_spaces = false;
} else {
- throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
+ LLAMA_LOG_WARN("%s: missing or unrecognized pre-tokenizer type, using: 'default'\n", __func__);

@@ -1,15 +1,15 @@
diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index 0207b787..b5e9884b 100644
index 9da08fe2..3a433703 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -1396,27 +1396,23 @@ static enum ggml_status ggml_metal_graph_compute(
// to the matrix-vector kernel
int ne11_mm_min = 1;
@@ -1720,27 +1720,23 @@ static void ggml_metal_encode_node(
// to the matrix-vector kernel
int ne11_mm_min = 1;
-#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
- if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
- switch (src0t) {
- case GGML_TYPE_F16: ne11_mm_min = 2; break;

@@ -1,8 +1,8 @@
diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index b56c3604..400d43f4 100644
index 3a433703..829c5e39 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -377,8 +377,8 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
@@ -392,8 +392,8 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
#if GGML_METAL_EMBED_LIBRARY
GGML_METAL_LOG_INFO("%s: using embedded metal library\n", __func__);

@@ -1,31 +1,31 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index 88355971..d7db689b 100644
index 4c0a1bb6..17e5bc2a 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -15906,7 +15906,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
@@ -16928,7 +16928,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
const auto n_embd = hparams.n_embd;
// TODO: use a per-batch flag for logits presence instead
- const bool has_logits = !cparams.embeddings;
+ const bool has_logits = cparams.causal_attn;
const bool has_embd = cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE);
const size_t logits_size = has_logits ? n_vocab*n_outputs_max : 0;
@@ -16175,20 +16175,23 @@ static int llama_decode_internal(
@@ -17200,20 +17200,23 @@ static int llama_decode_internal(
// no output
res = nullptr;
embd = nullptr;
- } else if (cparams.embeddings) {
- res = nullptr; // do not extract logits for embedding case
- embd = nullptr;
- for (int i = ggml_graph_n_nodes(gf) - 1; i >= 0; --i) {
+ }
+
+ if (cparams.embeddings) {
for (int i = gf->n_nodes - 1; i >= 0; --i) {
- if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
- embd = gf->nodes[i];
+ embd = gf->nodes[i];
+ if (strcmp(embd->name, "result_embd_pooled") == 0) {
+ for (int i = ggml_graph_n_nodes(gf) - 1; i >= 0; --i) {
+ embd = ggml_graph_node(gf, i);
if (strcmp(ggml_graph_node(gf, i)->name, "result_embd_pooled") == 0) {
- embd = ggml_graph_node(gf, i);
break;
}
}
@@ -39,5 +39,5 @@ index 88355971..d7db689b 100644
+ res = nullptr; // do not extract logits when not needed
+ }
// LLAMA_LOG_INFO("graph build time: %.3f ms (%d nodes, %d leafs)\n", (ggml_time_us() - t_start_us)/1000.0, gf->n_nodes, gf->n_leafs);
ggml_backend_sched_alloc_graph(lctx.sched, gf);

@@ -1,10 +1,10 @@
diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index dcc65f02..1a990306 100644
index 14e02c8d..6e849d8e 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -66,6 +66,19 @@
#include <cinttypes>
#include <limits>
@@ -44,6 +44,19 @@
#define LOG_ERR(...) do { fprintf(stderr, __VA_ARGS__); } while (0)
#define LOG_DBG(...) do { fprintf(stderr, __VA_ARGS__); } while (0)
+#if defined(_WIN32)
+#define WIN32_LEAN_AND_MEAN
@@ -22,10 +22,11 @@ index dcc65f02..1a990306 100644
//#define CLIP_DEBUG_FUNCTIONS
// RGB uint8 image
@@ -1248,7 +1261,29 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
@@ -1225,8 +1238,29 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
gguf_free(ctx);
return nullptr;
}
-
+#ifdef _WIN32
+ int wlen = MultiByteToWideChar(CP_UTF8, 0, fname, -1, NULL, 0);
+ if (!wlen) {
@@ -50,9 +51,9 @@ index dcc65f02..1a990306 100644
auto fin = std::ifstream(fname, std::ios::binary);
+#endif
if (!fin) {
LOG_TEE("cannot open model file for loading tensors\n");
LOG_ERR("cannot open model file for loading tensors\n");
clip_free(new_clip);
@@ -1288,7 +1323,11 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
@@ -1266,7 +1300,11 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
ggml_backend_tensor_set(cur, read_buf.data(), 0, num_bytes);
}
}

@@ -1,34 +1,34 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index f79bd782..b7771f53 100644
index bdad28b3..1fe6189a 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -213,6 +213,7 @@ enum llm_arch {
LLM_ARCH_NEMOTRON,
LLM_ARCH_EXAONE,
LLM_ARCH_RWKV6,
@@ -217,6 +217,7 @@ enum llm_arch {
LLM_ARCH_GRANITE,
LLM_ARCH_GRANITE_MOE,
LLM_ARCH_CHAMELEON,
+ LLM_ARCH_SOLAR,
LLM_ARCH_UNKNOWN,
};
@@ -261,6 +262,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_NEMOTRON, "nemotron" },
{ LLM_ARCH_EXAONE, "exaone" },
{ LLM_ARCH_RWKV6, "rwkv6" },
@@ -270,6 +271,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_GRANITE, "granite" },
{ LLM_ARCH_GRANITE_MOE, "granitemoe" },
{ LLM_ARCH_CHAMELEON, "chameleon" },
+ { LLM_ARCH_SOLAR, "solar" },
{ LLM_ARCH_UNKNOWN, "(unknown)" },
};
@@ -314,6 +316,7 @@ enum llm_kv {
LLM_KV_ATTENTION_KV_LORA_RANK,
@@ -327,6 +329,7 @@ enum llm_kv {
LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,
LLM_KV_ATTENTION_SLIDING_WINDOW,
LLM_KV_ATTENTION_SCALE,
+ LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION,
LLM_KV_ROPE_DIMENSION_COUNT,
LLM_KV_ROPE_FREQ_BASE,
@@ -405,19 +408,20 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_TIME_MIX_EXTRA_DIM, "%s.time_mix_extra_dim" },
{ LLM_KV_TIME_DECAY_EXTRA_DIM, "%s.time_decay_extra_dim" },
@@ -421,20 +424,21 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_RESIDUAL_SCALE, "%s.residual_scale" },
{ LLM_KV_EMBEDDING_SCALE, "%s.embedding_scale" },
- { LLM_KV_ATTENTION_HEAD_COUNT, "%s.attention.head_count" },
- { LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
@@ -43,6 +43,7 @@ index f79bd782..b7771f53 100644
- { LLM_KV_ATTENTION_KV_LORA_RANK, "%s.attention.kv_lora_rank" },
- { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
- { LLM_KV_ATTENTION_SLIDING_WINDOW, "%s.attention.sliding_window" },
- { LLM_KV_ATTENTION_SCALE, "%s.attention.scale" },
+ { LLM_KV_ATTENTION_HEAD_COUNT, "%s.attention.head_count" },
+ { LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
+ { LLM_KV_ATTENTION_MAX_ALIBI_BIAS, "%s.attention.max_alibi_bias" },
@@ -56,20 +57,21 @@ index f79bd782..b7771f53 100644
+ { LLM_KV_ATTENTION_KV_LORA_RANK, "%s.attention.kv_lora_rank" },
+ { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
+ { LLM_KV_ATTENTION_SLIDING_WINDOW, "%s.attention.sliding_window" },
+ { LLM_KV_ATTENTION_SCALE, "%s.attention.scale" },
+ { LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION, "%s.attention.block_skip_connection.%d" },
{ LLM_KV_ROPE_DIMENSION_COUNT, "%s.rope.dimension_count" },
{ LLM_KV_ROPE_FREQ_BASE, "%s.rope.freq_base" },
@@ -589,6 +593,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_FFN_DOWN,
LLM_TENSOR_ENC_FFN_UP,
@@ -608,6 +612,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_OUTPUT_NORM,
LLM_TENSOR_CLS,
LLM_TENSOR_CLS_OUT,
+ LLM_TENSOR_BSKCN_TV,
};
static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
@@ -1408,6 +1413,24 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
{ LLM_TENSOR_CHANNEL_MIX_RECEPTANCE, "blk.%d.channel_mix_receptance" },
@@ -1527,6 +1532,24 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
},
},
+ {
@@ -93,7 +95,7 @@ index f79bd782..b7771f53 100644
{
LLM_ARCH_UNKNOWN,
{
@@ -2237,6 +2260,7 @@ enum e_model {
@@ -2360,6 +2383,7 @@ enum e_model {
MODEL_15B,
MODEL_16B,
MODEL_20B,
@@ -101,7 +103,7 @@ index f79bd782..b7771f53 100644
MODEL_30B,
MODEL_34B,
MODEL_35B,
@@ -2284,6 +2308,8 @@ struct llama_hparams {
@@ -2409,6 +2433,8 @@ struct llama_hparams {
std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;
@@ -110,7 +112,7 @@ index f79bd782..b7771f53 100644
uint32_t n_layer_dense_lead = 0;
uint32_t n_lora_q = 0;
uint32_t n_lora_kv = 0;
@@ -2349,6 +2375,7 @@ struct llama_hparams {
@@ -2479,6 +2505,7 @@ struct llama_hparams {
if (this->n_head_arr != other.n_head_arr) return true;
if (this->n_head_kv_arr != other.n_head_kv_arr) return true;
if (this->n_ff_arr != other.n_ff_arr) return true;
@@ -118,7 +120,7 @@ index f79bd782..b7771f53 100644
if (this->n_rel_attn_bkts != other.n_rel_attn_bkts) return true;
if (this->n_layer_dense_lead != other.n_layer_dense_lead) return true;
@@ -2455,6 +2482,14 @@ struct llama_hparams {
@@ -2588,6 +2615,14 @@ struct llama_hparams {
return ssm_d_state * ssm_d_inner;
}
}
@@ -133,7 +135,7 @@ index f79bd782..b7771f53 100644
};
static_assert(std::is_trivially_copyable<llama_hparams>::value, "llama_hparams must be trivially copyable");
@@ -2635,6 +2670,8 @@ struct llama_layer {
@@ -2769,6 +2804,8 @@ struct llama_layer {
struct ggml_tensor * ffn_gate_scale;
struct ggml_tensor * ffn_up_scale;
struct ggml_tensor * ffn_down_scale;
@@ -142,9 +144,9 @@ index f79bd782..b7771f53 100644
};
// very similar to llama_batch,
@@ -5937,6 +5974,21 @@ static void llm_load_hparams(
@@ -6134,6 +6171,21 @@ static void llm_load_hparams(
default: model.type = e_model::MODEL_UNKNOWN;
}
}
} break;
+ case LLM_ARCH_SOLAR:
+ {
@@ -164,10 +166,15 @@ index f79bd782..b7771f53 100644
default: (void)0;
}
@@ -8420,6 +8472,38 @@ static bool llm_load_tensors(
}
@@ -8831,6 +8883,38 @@ static bool llm_load_tensors(
} break;
layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
+ layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
+ layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
+ layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
+ }
+ } break;
+ case LLM_ARCH_SOLAR:
+ {
+ model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
@@ -195,15 +202,10 @@ index f79bd782..b7771f53 100644
+
+ layer.bskcn_tv = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_BSKCN_TV, "weight"), {2}, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
+
+ layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
+ layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
+ layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
+ }
+ } break;
default:
throw std::runtime_error("unknown architecture");
}
@@ -15173,6 +15257,158 @@ struct llm_build_context {
layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
@@ -16179,6 +16263,158 @@ struct llm_build_context {
return gf;
}
@@ -362,9 +364,9 @@ index f79bd782..b7771f53 100644
};
static struct ggml_cgraph * llama_build_graph_defrag(llama_context & lctx, const std::vector<uint32_t> & ids) {
@@ -15423,6 +15659,10 @@ static struct ggml_cgraph * llama_build_graph(
@@ -16443,6 +16679,10 @@ static struct ggml_cgraph * llama_build_graph(
{
result = llm.build_rwkv6();
result = llm.build_chameleon();
} break;
+ case LLM_ARCH_SOLAR:
+ {
@@ -373,14 +375,11 @@ index f79bd782..b7771f53 100644
default:
GGML_ABORT("fatal error");
}
@@ -18503,6 +18743,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_ARCTIC:
case LLM_ARCH_DEEPSEEK2:
case LLM_ARCH_CHATGLM:
@@ -19589,6 +19829,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_GRANITE:
case LLM_ARCH_GRANITE_MOE:
case LLM_ARCH_CHAMELEON:
+ case LLM_ARCH_SOLAR:
return LLAMA_ROPE_TYPE_NORM;
// the pairs of head values are offset by n_rot/2
--
2.46.0

@@ -1,14 +1,14 @@
diff --git a/ggml/src/ggml-blas.cpp b/ggml/src/ggml-blas.cpp
index 71373173..1309c451 100644
index 6d99c6be..8e1ab99d 100644
--- a/ggml/src/ggml-blas.cpp
+++ b/ggml/src/ggml-blas.cpp
@@ -1,3 +1,5 @@
+#ifdef GGML_USE_BLAS
+
#include "ggml-impl.h"
#include "ggml-blas.h"
#include "ggml-backend-impl.h"
@@ -365,3 +367,5 @@ void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads)
@@ -366,3 +368,5 @@ void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads)
ggml_backend_blas_context * ctx = (ggml_backend_blas_context *)backend_blas->context;
ctx->n_threads = n_threads;
}