IBM granite/granitemoe architecture support (#6760)

* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
df270ef745

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet include granite MoE support, but that can come in a
follow-up PR.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This header is where the ggml_cgraph struct is defined. It is included in
many of the .c files to complete the forward declaration in ggml.h. It seems
that with the subset of code vendored here, the include was somehow lost (or
reordered) when building, so adding this include to llama.cpp fixes the
missing definition.
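
As a rough illustration of the failure mode (hypothetical type names, not the
actual ggml headers): a forward declaration is enough to pass pointers around,
but any translation unit that touches struct members needs the full definition,
which for ggml_cgraph lives in ggml-impl.h rather than ggml.h.

    // minimal C++ sketch; "cgraph" is a stand-in, not the real ggml type
    struct cgraph;                    // what ggml.h provides: a forward declaration

    struct cgraph { int n_nodes; };   // what ggml-impl.h provides: the definition

    int node_count(const cgraph * g) {
        return g->n_nodes;            // compiles only because the definition is visible
    }

    int main() {
        cgraph g{3};
        return node_count(&g) == 3 ? 0 : 1;
    }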

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here mirror the big llama.cpp sampling refactor:
https://github.com/ggerganov/llama.cpp/pull/9294

Sampling is now split into a base interface (llama_sampler) and the
generation-side implementation (gpt_sampler), and the code here follows that
split. Since the sampling.h/sampling.cpp code uses C++ STL headers, the
sampling_ext.[h|cpp] wrapper is maintained to give Go access to a pure-C
interface.
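
As a sketch of the wrapper pattern only (made-up names, not the real
sampling_ext API): the C++ sampler is hidden behind an opaque handle and a
small extern "C" surface, so cgo can call it without ever seeing STL types.

    // hypothetical example of the opaque-handle pattern, not ollama's actual API
    #include <string>
    #include <vector>

    struct demo_sampler {                         // stand-in for the C++ gpt_sampler
        std::vector<float> logits;
        std::string        grammar;
    };

    extern "C" {

    typedef struct demo_sampler * demo_sampler_t; // opaque handle for the Go side

    demo_sampler_t demo_sampler_init(void)             { return new demo_sampler(); }
    void           demo_sampler_free(demo_sampler_t s) { delete s; }
    int            demo_sampler_sample(demo_sampler_t s) {
        // a real wrapper would drive the llama_sampler chain here
        return s->logits.empty() ? -1 : 0;
    }

    } // extern "C"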

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface. In the interest
of doing no harm, though, this keeps the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used or needed previously, so it should be fine to skip
plumbing it through for now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop tokens),
then we should send them back before ending the sequence. Otherwise, we can
lose tokens at the end of a response.

Fixes #6707
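
As a rough sketch of the idea only (written in C++ with made-up names; the
actual runner is Go code): pieces held back while checking for a possible stop
sequence must be emitted once the sequence finishes, or the tail of the reply
is silently dropped.

    // illustrative buffering/flushing logic, not the runner's real implementation
    #include <iostream>
    #include <string>
    #include <vector>

    static void finish_sequence(std::vector<std::string> & pending) {
        for (const auto & piece : pending) {   // flush anything still held back
            std::cout << piece;
        }
        pending.clear();
        // ...only now report the sequence as finished
    }

    int main() {
        std::vector<std::string> pending;
        std::cout << "hello";
        pending.push_back(" wor");             // held back: could begin a stop token
        pending.push_back("ld");
        finish_sequence(pending);              // without this, output ends at "hello"
        std::cout << "\n";
        return 0;
    }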

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This patch was only needed because of an earlier mistake in the embeddings
patch, which dereferenced the pointer directly instead of using the wrapper
API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Author:    Gabe Goodhart
Date:      2024-10-17 12:59:52 -06:00
Committer: GitHub
Commit:    f2890a4494 (parent: 05cd82ef94)
Stats:     263 changed files with 14255 additions and 10867 deletions

@@ -1,8 +1,21 @@
diff --git a/ggml/include/ggml-cuda.h b/ggml/include/ggml-cuda.h
index 71bb6dcf..08be0895 100644
--- a/ggml/include/ggml-cuda.h
+++ b/ggml/include/ggml-cuda.h
@@ -34,6 +34,8 @@ GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_typ
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
+GGML_API GGML_CALL int ggml_backend_cuda_reg_devices();
+
GGML_API GGML_CALL int ggml_backend_cuda_get_device_count(void);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_description(int device, char * description, size_t description_size);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_memory(int device, size_t * free, size_t * total);
diff --git a/ggml/src/ggml-backend.c b/ggml/src/ggml-backend.c
index 9e35ce98..179be840 100644
index ba280e06..d5c3fe49 100644
--- a/ggml/src/ggml-backend.c
+++ b/ggml/src/ggml-backend.c
@@ -87,7 +87,12 @@ void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
@@ -83,7 +83,12 @@ void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
if (buffer->iface.free_buffer != NULL) {
buffer->iface.free_buffer(buffer);
}
@@ -16,10 +29,10 @@ index 9e35ce98..179be840 100644
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-cuda.cu b/ggml/src/ggml-cuda.cu
index 04b6e528..43b12bdf 100644
index 6efdab14..809d6ab1 100644
--- a/ggml/src/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda.cu
@@ -392,6 +392,10 @@ GGML_CALL static bool ggml_backend_buffer_is_cuda(ggml_backend_buffer_t buffer)
@@ -469,6 +469,10 @@ GGML_CALL static bool ggml_backend_buffer_is_cuda(ggml_backend_buffer_t buffer)
GGML_CALL static void ggml_backend_cuda_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
delete ctx;
@@ -30,7 +43,7 @@ index 04b6e528..43b12bdf 100644
}
GGML_CALL static void * ggml_backend_cuda_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -3028,8 +3032,6 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cuda_init(const char * params,
@@ -3204,8 +3208,6 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cuda_init(const char * params,
GGML_UNUSED(params);
}
@@ -39,16 +52,3 @@ index 04b6e528..43b12bdf 100644
GGML_CALL int ggml_backend_cuda_reg_devices() {
int device_count = ggml_backend_cuda_get_device_count();
//int device_count = 1; // DEBUG: some tools require delaying CUDA initialization
diff --git a/ggml/include/ggml-cuda.h b/ggml/include/ggml-cuda.h
index 5eb4af40..50b91009 100644
--- a/ggml/include/ggml-cuda.h
+++ b/ggml/include/ggml-cuda.h
@@ -31,6 +31,8 @@ GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_typ
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_API GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
+GGML_API GGML_CALL int ggml_backend_cuda_reg_devices();
+
GGML_API GGML_CALL int ggml_backend_cuda_get_device_count(void);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_description(int device, char * description, size_t description_size);
GGML_API GGML_CALL void ggml_backend_cuda_get_device_memory(int device, size_t * free, size_t * total);

@@ -1,8 +1,8 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index 88355971..dd7d41ed 100644
index 4c0a1bb6..800dfb95 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -6083,16 +6083,7 @@ static void llm_load_vocab(
@@ -6287,16 +6287,7 @@ static void llm_load_vocab(
if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
vocab.tokenizer_add_space_prefix = false;
vocab.tokenizer_clean_spaces = true;
@@ -20,9 +20,9 @@ index 88355971..dd7d41ed 100644
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (
tokenizer_pre == "llama3" ||
@@ -6188,7 +6179,8 @@ static void llm_load_vocab(
tokenizer_pre == "exaone") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_EXAONE;
@@ -6398,7 +6389,8 @@ static void llm_load_vocab(
vocab.tokenizer_add_bos = true;
vocab.tokenizer_clean_spaces = false;
} else {
- throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
+ LLAMA_LOG_WARN("%s: missing or unrecognized pre-tokenizer type, using: 'default'\n", __func__);

@@ -1,15 +1,15 @@
diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index 0207b787..b5e9884b 100644
index 9da08fe2..3a433703 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -1396,27 +1396,23 @@ static enum ggml_status ggml_metal_graph_compute(
// to the matrix-vector kernel
int ne11_mm_min = 1;
@@ -1720,27 +1720,23 @@ static void ggml_metal_encode_node(
// to the matrix-vector kernel
int ne11_mm_min = 1;
-#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
- if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
- switch (src0t) {
- case GGML_TYPE_F16: ne11_mm_min = 2; break;

@@ -1,8 +1,8 @@
diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index b56c3604..400d43f4 100644
index 3a433703..829c5e39 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -377,8 +377,8 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
@@ -392,8 +392,8 @@ static void ggml_metal_log(enum ggml_log_level level, const char * format, ...){
#if GGML_METAL_EMBED_LIBRARY
GGML_METAL_LOG_INFO("%s: using embedded metal library\n", __func__);

@@ -1,31 +1,31 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index 88355971..d7db689b 100644
index 4c0a1bb6..17e5bc2a 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -15906,7 +15906,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
@@ -16928,7 +16928,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
const auto n_embd = hparams.n_embd;
// TODO: use a per-batch flag for logits presence instead
- const bool has_logits = !cparams.embeddings;
+ const bool has_logits = cparams.causal_attn;
const bool has_embd = cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE);
const size_t logits_size = has_logits ? n_vocab*n_outputs_max : 0;
@@ -16175,20 +16175,23 @@ static int llama_decode_internal(
@@ -17200,20 +17200,23 @@ static int llama_decode_internal(
// no output
res = nullptr;
embd = nullptr;
- } else if (cparams.embeddings) {
- res = nullptr; // do not extract logits for embedding case
- embd = nullptr;
- for (int i = ggml_graph_n_nodes(gf) - 1; i >= 0; --i) {
+ }
+
+ if (cparams.embeddings) {
for (int i = gf->n_nodes - 1; i >= 0; --i) {
- if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
- embd = gf->nodes[i];
+ embd = gf->nodes[i];
+ if (strcmp(embd->name, "result_embd_pooled") == 0) {
+ for (int i = ggml_graph_n_nodes(gf) - 1; i >= 0; --i) {
+ embd = ggml_graph_node(gf, i);
if (strcmp(ggml_graph_node(gf, i)->name, "result_embd_pooled") == 0) {
- embd = ggml_graph_node(gf, i);
break;
}
}
@@ -39,5 +39,5 @@ index 88355971..d7db689b 100644
+ res = nullptr; // do not extract logits when not needed
+ }
// LLAMA_LOG_INFO("graph build time: %.3f ms (%d nodes, %d leafs)\n", (ggml_time_us() - t_start_us)/1000.0, gf->n_nodes, gf->n_leafs);
ggml_backend_sched_alloc_graph(lctx.sched, gf);

@@ -1,10 +1,10 @@
diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index dcc65f02..1a990306 100644
index 14e02c8d..6e849d8e 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -66,6 +66,19 @@
#include <cinttypes>
#include <limits>
@@ -44,6 +44,19 @@
#define LOG_ERR(...) do { fprintf(stderr, __VA_ARGS__); } while (0)
#define LOG_DBG(...) do { fprintf(stderr, __VA_ARGS__); } while (0)
+#if defined(_WIN32)
+#define WIN32_LEAN_AND_MEAN
@@ -22,10 +22,11 @@ index dcc65f02..1a990306 100644
//#define CLIP_DEBUG_FUNCTIONS
// RGB uint8 image
@@ -1248,7 +1261,29 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
@@ -1225,8 +1238,29 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
gguf_free(ctx);
return nullptr;
}
-
+#ifdef _WIN32
+ int wlen = MultiByteToWideChar(CP_UTF8, 0, fname, -1, NULL, 0);
+ if (!wlen) {
@@ -50,9 +51,9 @@ index dcc65f02..1a990306 100644
auto fin = std::ifstream(fname, std::ios::binary);
+#endif
if (!fin) {
LOG_TEE("cannot open model file for loading tensors\n");
LOG_ERR("cannot open model file for loading tensors\n");
clip_free(new_clip);
@@ -1288,7 +1323,11 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
@@ -1266,7 +1300,11 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
ggml_backend_tensor_set(cur, read_buf.data(), 0, num_bytes);
}
}

@@ -1,34 +1,34 @@
diff --git a/src/llama.cpp b/src/llama.cpp
index f79bd782..b7771f53 100644
index bdad28b3..1fe6189a 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -213,6 +213,7 @@ enum llm_arch {
LLM_ARCH_NEMOTRON,
LLM_ARCH_EXAONE,
LLM_ARCH_RWKV6,
@@ -217,6 +217,7 @@ enum llm_arch {
LLM_ARCH_GRANITE,
LLM_ARCH_GRANITE_MOE,
LLM_ARCH_CHAMELEON,
+ LLM_ARCH_SOLAR,
LLM_ARCH_UNKNOWN,
};
@@ -261,6 +262,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_NEMOTRON, "nemotron" },
{ LLM_ARCH_EXAONE, "exaone" },
{ LLM_ARCH_RWKV6, "rwkv6" },
@@ -270,6 +271,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_GRANITE, "granite" },
{ LLM_ARCH_GRANITE_MOE, "granitemoe" },
{ LLM_ARCH_CHAMELEON, "chameleon" },
+ { LLM_ARCH_SOLAR, "solar" },
{ LLM_ARCH_UNKNOWN, "(unknown)" },
};
@@ -314,6 +316,7 @@ enum llm_kv {
LLM_KV_ATTENTION_KV_LORA_RANK,
@@ -327,6 +329,7 @@ enum llm_kv {
LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,
LLM_KV_ATTENTION_SLIDING_WINDOW,
LLM_KV_ATTENTION_SCALE,
+ LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION,
LLM_KV_ROPE_DIMENSION_COUNT,
LLM_KV_ROPE_FREQ_BASE,
@@ -405,19 +408,20 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_TIME_MIX_EXTRA_DIM, "%s.time_mix_extra_dim" },
{ LLM_KV_TIME_DECAY_EXTRA_DIM, "%s.time_decay_extra_dim" },
@@ -421,20 +424,21 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_RESIDUAL_SCALE, "%s.residual_scale" },
{ LLM_KV_EMBEDDING_SCALE, "%s.embedding_scale" },
- { LLM_KV_ATTENTION_HEAD_COUNT, "%s.attention.head_count" },
- { LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
@@ -43,6 +43,7 @@ index f79bd782..b7771f53 100644
- { LLM_KV_ATTENTION_KV_LORA_RANK, "%s.attention.kv_lora_rank" },
- { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
- { LLM_KV_ATTENTION_SLIDING_WINDOW, "%s.attention.sliding_window" },
- { LLM_KV_ATTENTION_SCALE, "%s.attention.scale" },
+ { LLM_KV_ATTENTION_HEAD_COUNT, "%s.attention.head_count" },
+ { LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
+ { LLM_KV_ATTENTION_MAX_ALIBI_BIAS, "%s.attention.max_alibi_bias" },
@@ -56,20 +57,21 @@ index f79bd782..b7771f53 100644
+ { LLM_KV_ATTENTION_KV_LORA_RANK, "%s.attention.kv_lora_rank" },
+ { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
+ { LLM_KV_ATTENTION_SLIDING_WINDOW, "%s.attention.sliding_window" },
+ { LLM_KV_ATTENTION_SCALE, "%s.attention.scale" },
+ { LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION, "%s.attention.block_skip_connection.%d" },
{ LLM_KV_ROPE_DIMENSION_COUNT, "%s.rope.dimension_count" },
{ LLM_KV_ROPE_FREQ_BASE, "%s.rope.freq_base" },
@@ -589,6 +593,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_FFN_DOWN,
LLM_TENSOR_ENC_FFN_UP,
@@ -608,6 +612,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_OUTPUT_NORM,
LLM_TENSOR_CLS,
LLM_TENSOR_CLS_OUT,
+ LLM_TENSOR_BSKCN_TV,
};
static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
@@ -1408,6 +1413,24 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
{ LLM_TENSOR_CHANNEL_MIX_RECEPTANCE, "blk.%d.channel_mix_receptance" },
@@ -1527,6 +1532,24 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
},
},
+ {
@@ -93,7 +95,7 @@ index f79bd782..b7771f53 100644
{
LLM_ARCH_UNKNOWN,
{
@@ -2237,6 +2260,7 @@ enum e_model {
@@ -2360,6 +2383,7 @@ enum e_model {
MODEL_15B,
MODEL_16B,
MODEL_20B,
@@ -101,7 +103,7 @@ index f79bd782..b7771f53 100644
MODEL_30B,
MODEL_34B,
MODEL_35B,
@@ -2284,6 +2308,8 @@ struct llama_hparams {
@@ -2409,6 +2433,8 @@ struct llama_hparams {
std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;
@@ -110,7 +112,7 @@ index f79bd782..b7771f53 100644
uint32_t n_layer_dense_lead = 0;
uint32_t n_lora_q = 0;
uint32_t n_lora_kv = 0;
@@ -2349,6 +2375,7 @@ struct llama_hparams {
@@ -2479,6 +2505,7 @@ struct llama_hparams {
if (this->n_head_arr != other.n_head_arr) return true;
if (this->n_head_kv_arr != other.n_head_kv_arr) return true;
if (this->n_ff_arr != other.n_ff_arr) return true;
@@ -118,7 +120,7 @@ index f79bd782..b7771f53 100644
if (this->n_rel_attn_bkts != other.n_rel_attn_bkts) return true;
if (this->n_layer_dense_lead != other.n_layer_dense_lead) return true;
@@ -2455,6 +2482,14 @@ struct llama_hparams {
@@ -2588,6 +2615,14 @@ struct llama_hparams {
return ssm_d_state * ssm_d_inner;
}
}
@@ -133,7 +135,7 @@ index f79bd782..b7771f53 100644
};
static_assert(std::is_trivially_copyable<llama_hparams>::value, "llama_hparams must be trivially copyable");
@@ -2635,6 +2670,8 @@ struct llama_layer {
@@ -2769,6 +2804,8 @@ struct llama_layer {
struct ggml_tensor * ffn_gate_scale;
struct ggml_tensor * ffn_up_scale;
struct ggml_tensor * ffn_down_scale;
@@ -142,9 +144,9 @@ index f79bd782..b7771f53 100644
};
// very similar to llama_batch,
@@ -5937,6 +5974,21 @@ static void llm_load_hparams(
@@ -6134,6 +6171,21 @@ static void llm_load_hparams(
default: model.type = e_model::MODEL_UNKNOWN;
}
}
} break;
+ case LLM_ARCH_SOLAR:
+ {
@@ -164,10 +166,15 @@ index f79bd782..b7771f53 100644
default: (void)0;
}
@@ -8420,6 +8472,38 @@ static bool llm_load_tensors(
}
@@ -8831,6 +8883,38 @@ static bool llm_load_tensors(
} break;
layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
+ layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
+ layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
+ layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
+ }
+ } break;
+ case LLM_ARCH_SOLAR:
+ {
+ model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
@@ -195,15 +202,10 @@ index f79bd782..b7771f53 100644
+
+ layer.bskcn_tv = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_BSKCN_TV, "weight"), {2}, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
+
+ layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
+ layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
+ layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
+ }
+ } break;
default:
throw std::runtime_error("unknown architecture");
}
@@ -15173,6 +15257,158 @@ struct llm_build_context {
layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
@@ -16179,6 +16263,158 @@ struct llm_build_context {
return gf;
}
@@ -362,9 +364,9 @@ index f79bd782..b7771f53 100644
};
static struct ggml_cgraph * llama_build_graph_defrag(llama_context & lctx, const std::vector<uint32_t> & ids) {
@@ -15423,6 +15659,10 @@ static struct ggml_cgraph * llama_build_graph(
@@ -16443,6 +16679,10 @@ static struct ggml_cgraph * llama_build_graph(
{
result = llm.build_rwkv6();
result = llm.build_chameleon();
} break;
+ case LLM_ARCH_SOLAR:
+ {
@@ -373,14 +375,11 @@ index f79bd782..b7771f53 100644
default:
GGML_ABORT("fatal error");
}
@@ -18503,6 +18743,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_ARCTIC:
case LLM_ARCH_DEEPSEEK2:
case LLM_ARCH_CHATGLM:
@@ -19589,6 +19829,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
case LLM_ARCH_GRANITE:
case LLM_ARCH_GRANITE_MOE:
case LLM_ARCH_CHAMELEON:
+ case LLM_ARCH_SOLAR:
return LLAMA_ROPE_TYPE_NORM;
// the pairs of head values are offset by n_rot/2
--
2.46.0

@@ -1,14 +1,14 @@
diff --git a/ggml/src/ggml-blas.cpp b/ggml/src/ggml-blas.cpp
index 71373173..1309c451 100644
index 6d99c6be..8e1ab99d 100644
--- a/ggml/src/ggml-blas.cpp
+++ b/ggml/src/ggml-blas.cpp
@@ -1,3 +1,5 @@
+#ifdef GGML_USE_BLAS
+
#include "ggml-impl.h"
#include "ggml-blas.h"
#include "ggml-backend-impl.h"
@@ -365,3 +367,5 @@ void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads)
@@ -366,3 +368,5 @@ void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads)
ggml_backend_blas_context * ctx = (ggml_backend_blas_context *)backend_blas->context;
ctx->n_threads = n_threads;
}