Merging tensors of larger models #1

Closed
kir-gadjello opened this issue Mar 10, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@kir-gadjello
Contributor

> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook.

It shouldn't be hard to merge tensors with my https://github.com/kir-gadjello/zipslicer library, but it's pure Python! If you want to keep the project pure C++ you might want to write a standalone gist script that uses zipslicer to unpack weight shards into binary files.
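If it helps, here is a minimal sketch of what such a standalone unpacking script could look like. It uses plain `torch.load` for clarity (zipslicer could stand in to read tensors lazily instead of pulling a whole shard into RAM), and the output file naming is only illustrative:

```python
import sys
import torch

# usage: python unpack_shards.py <model_dir> <out_dir>
model_dir, out_dir = sys.argv[1], sys.argv[2]

part = 0
while True:
    fname = f"{model_dir}/consolidated.{part:02d}.pth"
    try:
        shard = torch.load(fname, map_location="cpu")
    except FileNotFoundError:
        break
    for name, tensor in shard.items():
        # one raw file per (tensor, part); a C++ merger could mmap and concatenate these
        tensor.to(torch.float32).numpy().tofile(f"{out_dir}/{name}.{part:02d}.bin")
    del shard
    part += 1
```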

@ggerganov
Member

Thanks! The bigger problem now is that I am out of disk space, haha!
Anyway, I'll try to figure something out later.

@theontho

Leave a tip jar to get @ggerganov a bigger SSD and/or MacBook :D

@eous

eous commented Mar 11, 2023

It's kinda pointless now, but I was able to merge the 30B and 65B models with this core bit of hackery added to the convert script.

```
+    # Note: this hunk assumes the surrounding convert script has already loaded
+    # part 0 into `model` and loops `i` over the remaining shard files.
+    fname_model = sys.argv[1] + "/consolidated." + str(i).zfill(2) + ".pth"
+    model_i = torch.load(fname_model, map_location="cpu")
+    
+    # Since the models are split, we need to append the tensors changing the shape/size
+    for k, v in model_i.items():
+        if k in model:
+            if model[k].dtype != v.dtype:
+                print("ERROR: Tensor types do not match: ", model[k].dtype, " vs ", v.dtype)
+                sys.exit(1)
+            elif len(model[k].shape) == 1:
+                print("Skipping tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                continue
+            elif k == "output.weight":
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=0)
+                print("New shape: ", model[k].shape)                
+                continue
+            elif "tok_embeddings" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+                continue
+            elif "attention.wo" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+                continue
+            elif "feed_forward.w2" in k:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+                model[k] = torch.cat((model[k], v), dim=1)
+                print("New shape: ", model[k].shape)
+            else:
+                print("Concatenating tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype, " with shape: ", model[k].shape)
+                model[k] = torch.cat((model[k], v), dim=0)
+                print("New shape: ", model[k].shape)
+        else:
+            print("Adding tensor: " + k + " with shape: ", v.shape, " and type: ", v.dtype)
+            model[k] = v
+    del model_i
```
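For what it's worth, the concat dimensions above line up with how the Meta checkpoints are sharded for model parallelism: wq/wk/wv, w1/w3 and `output.weight` are split across parts along dim 0, while `attention.wo`, `feed_forward.w2` and `tok_embeddings` are split along dim 1, and the 1-D norm weights are replicated in every part, so skipping them is safe.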

@ggerganov
Member

Fixed with 007a8f6

On startup, we go through all the parts and merge them dynamically in the ggml buffers.
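As a rough illustration of that merge-at-load idea, here is a NumPy sketch. It is not the C++ code in 007a8f6; it only mirrors the shape logic, with split dimensions chosen as in the script above:

```python
import numpy as np

def split_dim(name):
    # dim 1 for tok_embeddings, attention.wo and feed_forward.w2, dim 0 otherwise,
    # matching the convention used in the convert-script snippet above
    dim1_names = ("tok_embeddings", "attention.wo", "feed_forward.w2")
    return 1 if any(s in name for s in dim1_names) else 0

def merge_parts(parts):
    """Merge a list of per-part state dicts into full-size arrays."""
    merged = {}
    for name, first in parts[0].items():
        if first.ndim == 1:
            # norms etc. are replicated across parts; keep one copy
            merged[name] = first
            continue
        dim = split_dim(name)
        shards = [p[name] for p in parts]
        shape = list(first.shape)
        shape[dim] = sum(s.shape[dim] for s in shards)
        buf = np.empty(shape, dtype=first.dtype)  # stands in for the ggml buffer
        off = 0
        for s in shards:
            sl = [slice(None)] * buf.ndim
            sl[dim] = slice(off, off + s.shape[dim])
            buf[tuple(sl)] = s  # copy this part's slice at its offset
            off += s.shape[dim]
        merged[name] = buf
    return merged
```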

gjmulder added the enhancement label Mar 15, 2023
nemtos pushed a commit to nemtos/llama.cpp that referenced this issue Apr 9, 2023

Update command for downloading the weights to use `curl`

`curl` is preinstalled on macOS and the new command is equivalent to the `wget` version but avoids having to install `wget`.
This should save people some time.
mqy added a commit to mqy/llama.cpp that referenced this issue May 26, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 26, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 29, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue May 31, 2023
broken change: delete original profile ggml-org#1 from q_f32 profiles
syoyo pushed a commit to syoyo/llama.cpp that referenced this issue May 31, 2023
mqy added a commit to mqy/llama.cpp that referenced this issue Jun 4, 2023
broken change: delete original profile ggml-org#1 from q_f32 profiles
rooprob pushed a commit to rooprob/llama.cpp that referenced this issue Aug 2, 2023
funnbot pushed a commit to funnbot/llama.cpp that referenced this issue Aug 8, 2023
* kquants_iter for hipblas and add gfx803
* Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
* remove dmmv_f16 for now
HanClinto pushed a commit to HanClinto/llama.cpp that referenced this issue Jun 10, 2024
Oliver-Y added a commit to Oliver-Y/llama.cpp that referenced this issue Jul 23, 2024
* a Chinese word formed of 3 Chinese characters where the first 2 are not a word

* tokenizer-fix

* E5 Pretokenizer bugfix

* whitespace fix

* remove extra wpm

---------

Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
Co-authored-by: Oliver Ye <OliverY@MacBook-Pro.local>
cunnie added a commit to cunnie/llama.cpp that referenced this issue Aug 3, 2024
When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on the zero-length
vector `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size to 1;
otherwise it is set to the result of `max_element()`, as before.

Fixes the following crash, seen when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
ggerganov added a commit that referenced this issue Aug 4, 2024
ggerganov pushed a commit that referenced this issue Aug 6, 2024
slaren mentioned this issue Aug 15, 2024
jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this issue Aug 30, 2024
jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this issue Aug 30, 2024
ykhrustalev referenced this issue in ykhrustalev/llama.cpp Sep 26, 2024

* Fixed a bug where debug code was included in the release, resulting in an undefined function error.

* Change the path of the QNN library when building in termux environment

* Revert "Change the path of the QNN library when building in termux environment"

This reverts commit c6e26a3.

* Changed so that GGML_QNN_DEFAULT_LIB_SEARCH_PATH can be set from command line arguments