Merging tensors of larger models #1

Closed
kir-gadjello opened this issue Mar 10, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@kir-gadjello
Contributor

> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook.

It shouldn't be hard to merge tensors with my https://github.com/kir-gadjello/zipslicer library, but it's pure Python! If you want to keep the project pure C++ you might want to write a standalone gist script that uses zipslicer to unpack weight shards into binary files.
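If it helps, here is a minimal sketch of what such a standalone unpacking script could look like. It uses plain `torch.load` for clarity (zipslicer could stand in to read tensors lazily instead of pulling a whole shard into RAM), and the output file naming is only illustrative:

```python
import sys
import torch

# usage: python unpack_shards.py <model_dir> <out_dir>
model_dir, out_dir = sys.argv[1], sys.argv[2]

part = 0
while True:
    fname = f"{model_dir}/consolidated.{part:02d}.pth"
    try:
        shard = torch.load(fname, map_location="cpu")
    except FileNotFoundError:
        break
    for name, tensor in shard.items():
        # one raw file per (tensor, part); a C++ merger could mmap and concatenate these
        tensor.to(torch.float32).numpy().tofile(f"{out_dir}/{name}.{part:02d}.bin")
    del shard
    part += 1
```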

@ggerganov
Member

Thanks! The bigger problem now is that I am out of disk space, haha!
Anyway, I'll try to figure something out later.

@theontho

Leave a tip jar to get @ggerganov a bigger SSD and/or MacBook :D

@eous

eous commented Mar 11, 2023

It's kinda pointless now, but I was able to merge the 30B and 65B models with this core bit of hackery added to the convert script.

```
+    # Note: this hunk assumes the surrounding convert script has already loaded
+    # part 0 into `model` and loops `i` over the remaining shard files.
+    fname_model = sys.argv[1] + "/consolidated." + str(i).zfill(2) + ".pth"
+    model_i = torch.load(fname_model, map_location="cpu")
+    
+    # Since the models are split, we need to append the tensors changing the shape/size
+    for k, v in model_i.items():
+        if k in model:
+            if model[k].dtype != v.dtype:
+                print("ERROR: Tensor types do not match: ", model[k].dtype, " vs ", v.dtype)
+                sys.exit(1)
+            elif len(model[k].shape) == 1:
+                print("Skipping tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                continue
+            elif k == "output.weight":
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=0)
+                print("New shape: ", model[k].shape)                
+                continue
+            elif "tok_embeddings" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+                continue
+            elif "attention.wo" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+                continue
+            elif "feed_forward.w2" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+            else:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype, " with shape: ", model[k].shape)
+                model[k] = torch.cat((model[k], v), dim=0)
+                print("New shape: ", model[k].shape)
+        else:
+            print("Adding tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+            model[k] = v
+    del model_i
```
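For what it's worth, the concat dimensions above line up with how the Meta checkpoints are sharded for model parallelism: wq/wk/wv, w1/w3 and `output.weight` are split across parts along dim 0, while `attention.wo`, `feed_forward.w2` and `tok_embeddings` are split along dim 1, and the 1-D norm weights are replicated in every part, so skipping them is safe.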

@ggerganov
Member

Fixed with 007a8f6

On startup, we go through all the parts and merge them dynamically in the ggml buffers.
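As a rough illustration of that merge-at-load idea, here is a NumPy sketch. It is not the C++ code in 007a8f6; it only mirrors the shape logic, with split dimensions chosen as in the script above:

```python
import numpy as np

def split_dim(name):
    # dim 1 for tok_embeddings, attention.wo and feed_forward.w2, dim 0 otherwise,
    # matching the convention used in the convert-script snippet above
    dim1_names = ("tok_embeddings", "attention.wo", "feed_forward.w2")
    return 1 if any(s in name for s in dim1_names) else 0

def merge_parts(parts):
    """Merge a list of per-part state dicts into full-size arrays."""
    merged = {}
    for name, first in parts[0].items():
        if first.ndim == 1:
            # norms etc. are replicated across parts; keep one copy
            merged[name] = first
            continue
        dim = split_dim(name)
        shards = [p[name] for p in parts]
        shape = list(first.shape)
        shape[dim] = sum(s.shape[dim] for s in shards)
        buf = np.empty(shape, dtype=first.dtype)  # stands in for the ggml buffer
        off = 0
        for s in shards:
            sl = [slice(None)] * buf.ndim
            sl[dim] = slice(off, off + s.shape[dim])
            buf[tuple(sl)] = s  # copy this part's slice at its offset
            off += s.shape[dim]
        merged[name] = buf
    return merged
```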

gjmulder added the enhancement label Mar 15, 2023
nemtos pushed a commit to nemtos/llama.cpp that referenced this issue Apr 9, 2023

Update command for downloading the weights to use `curl`

`curl` is preinstalled on macOS and the new command is equivalent to the `wget` version but avoids having to install `wget`.
This should save people some time.
mqy added a commit to mqy/llama.cpp that referenced this issue May 26, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 26, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 29, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 31, 2023
broken change: delete original profile ggml-org#1 from q_f32 profiles
syoyo pushed a commit to syoyo/llama.cpp that referenced this issue May 31, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue Jun 4, 2023
broken change: delete original profile ggml-org#1 from q_f32 profiles
rooprob pushed a commit to rooprob/llama.cpp that referenced this issue Aug 2, 2023
funnbot pushed a commit to funnbot/llama.cpp that referenced this issue Aug 8, 2023
* kquants_iter for hipblas and add gfx803
* Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
* remove dmmv_f16 for now
HanClinto pushed a commit to HanClinto/llama.cpp that referenced this issue Jun 10, 2024
Oliver-Y added a commit to Oliver-Y/llama.cpp that referenced this issue Jul 23, 2024
* a Chinese word formed of 3 Chinese characters where the first 2 are not a word

* tokenizer-fix

* E5 Pretokenizer bugfix

* whitespace fix

* remove extra wpm

---------

Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
Co-authored-by: Oliver Ye <OliverY@MacBook-Pro.local>
cunnie added a commit to cunnie/llama.cpp that referenced this issue Aug 3, 2024
When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on the zero-length
vector `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size to 1;
otherwise it is set to the result of `max_element()`, as before.

Fixes the following crash, seen when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
ggerganov added a commit that referenced this issue Aug 4, 2024
ggerganov pushed a commit that referenced this issue Aug 6, 2024
slaren mentioned this issue Aug 15, 2024
jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this issue Aug 30, 2024
jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this issue Aug 30, 2024
ykhrustalev referenced this issue in ykhrustalev/llama.cpp Sep 26, 2024

* Fixed a bug where debug code was included in the release, resulting in an undefined function error.

* Change the path of the QNN library when building in termux environment

* Revert "Change the path of the QNN library when building in termux environment"

This reverts commit c6e26a3.

* Changed so that GGML_QNN_DEFAULT_LIB_SEARCH_PATH can be set from command line arguments