
How to use .safetensors model? #688


Closed
lambda-science opened this issue Apr 1, 2023 · 5 comments

@lambda-science

I downloaded the model alpaca-30b-lora-int4 from https://huggingface.co/elinas/alpaca-30b-lora-int4/tree/main
The model is a .safetensors file in GPTQ format, I think.
I need to convert it to a GGML .bin file, so I used the script provided in llama.cpp with the command python convert-gptq-to-ggml.py models/30B/alpaca-30b-4bit.safetensors models/30B/tokenizer.model models/30B/alpaca-30b-4bit.bin
But I get the following error:

Traceback (most recent call last):
  File "/big/meyer/expe/llama.cpp/convert-gptq-to-ggml.py", line 21, in <module>
    model = torch.load(fname_model, map_location="cpu")
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1035, in _legacy_load
    raise RuntimeError("Invalid magic number; corrupt file?")
RuntimeError: Invalid magic number; corrupt file?

How can I use .safetensors models with llama.cpp?
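For context: the traceback above comes from torch.load, which only understands pickle-based .pt/.bin checkpoints, while .safetensors is a different, pickle-free format, so the "invalid magic number" error is expected. A minimal sketch of reading the file's tensors directly with the safetensors library (assuming it is installed, and reusing the path from the command above) would be:

# Sketch: read a .safetensors checkpoint with the safetensors library instead of
# torch.load, which expects a pickle-based file and fails on this format.
from safetensors.torch import load_file

# Path reused from the convert-gptq-to-ggml.py command above; adjust as needed.
state_dict = load_file("models/30B/alpaca-30b-4bit.safetensors", device="cpu")

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)

This only loads the raw GPTQ tensors; it does not produce a GGML .bin on its own.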

@comex
Contributor

comex commented Apr 1, 2023

My conversion script (#545) will support this soon.

@ghost

ghost commented Apr 6, 2023

I thought I'd give it a spin on some safetensors models:

$ python llama.cpp.convert-script/convert.py --outtype q4_1 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Error: Input uses the newer GPTQ-for-LLaMa format (using g_idx), which is not yet natively supported by GGML.  For now you can still convert this model by passing `--outtype f16` to dequantize, but that will result in a much larger output file for no quality benefit.
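For anyone checking whether their file is in this newer format: the distinction comes down to the tensor names stored in the .safetensors file. A small sketch (assuming the safetensors package is installed, reusing the path from the command above) that lists the keys and looks for the g_idx tensors the error mentions:

# Sketch: list tensor names in a GPTQ .safetensors file and check for the
# g_idx tensors that mark the newer GPTQ-for-LLaMa layout.
from safetensors import safe_open

path = "llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors"  # path from the command above
with safe_open(path, framework="pt") as f:
    keys = list(f.keys())

print(any(k.endswith(".g_idx") for k in keys))        # True -> newer format, per the error above
print(sorted(k for k in keys if ".scales" in k)[:3])  # a few of the per-group scale tensors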

$ python llama.cpp.convert-script/convert.py --outtype f16 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g-f16.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Writing vocab...
[1/291] Writing tensor tok_embeddings.weight, size 32000 x 4096...
[2/291] Writing tensor norm.weight, size 4096...
[3/291] Writing tensor output.weight, size 32000 x 4096...
Traceback (most recent call last):
  File "llama.cpp.convert-script/convert.py", line 1053, in <module>
    main()
  File "llama.cpp.convert-script/convert.py", line 1049, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "llama.cpp.convert-script/convert.py", line 870, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "llama.cpp.convert-script/convert.py", line 794, in bounded_parallel_map
    result = futures.pop(0).result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "llama.cpp.convert-script/convert.py", line 867, in do_item
    return lazy_tensor.load().to_ggml().ndarray
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 446, in load
    return self.load().astype(data_type)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 525, in load
    return lazy_tensor.load().permute(n_head)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 576, in load
    return GPTQForLLaMaQuantizedTensor(model, namebase)
  File "llama.cpp.convert-script/convert.py", line 316, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "llama.cpp.convert-script/convert.py", line 261, in load_unquantized
    assert tensor.ndarray.dtype == expected_dtype, (tensor.ndarray.dtype, expected_dtype)
AssertionError: (dtype('float16'), <class 'numpy.float32'>)
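The assertion means load_unquantized expected the GPTQ scales tensor to be stored as float32, but this particular checkpoint saved it as float16. Purely as an illustration of the mismatch, with made-up values (not presented as the fix that eventually went into convert.py):

# Illustration only: the "*.scales" tensors in this checkpoint are float16,
# while the converter asserts float32 before using them.
import numpy as np

scales = np.array([0.012, 0.034, 0.056], dtype=np.float16)  # made-up scale values

print(scales.dtype == np.float32)     # False -> this is what trips the assertion
scales32 = scales.astype(np.float32)  # upcasting float16 -> float32 is lossless
print(scales32.dtype)                 # float32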

If I weren't afraid of causing even more trouble with my clumsy efforts, I'd have a stab at fixing it myself.

@comex
Contributor

comex commented Apr 6, 2023

I’ll take a look.

@hughobrien

#545 worked great for this, thanks @comex

@prusnak
Collaborator

prusnak commented Apr 14, 2023

Try the new convert.py script that is now in master.
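For reference, an invocation sketch using the flags shown earlier in this thread and the original poster's paths (the output filename here is made up, and a GPTQ file that uses g_idx would still need --outtype f16, per the error message above):

$ python convert.py --outtype f16 --outfile models/30B/alpaca-30b-4bit-f16.bin --vocab-dir models/30B models/30B/alpaca-30b-4bit.safetensors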

prusnak closed this as not planned on Apr 14, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023