
How to use .safetensors model? #688


Closed
lambda-science opened this issue Apr 1, 2023 · 5 comments

@lambda-science

I downloaded the model alpaca-30b-lora-int4 from https://huggingface.co/elinas/alpaca-30b-lora-int4/tree/main
The model is a .safetensors file in GPTQ format, I think.
I need to convert it to a GGML .bin file, so I used the script provided in llama.cpp with the command python convert-gptq-to-ggml.py models/30B/alpaca-30b-4bit.safetensors models/30B/tokenizer.model models/30B/alpaca-30b-4bit.bin
But I get the following error:

Traceback (most recent call last):
  File "/big/meyer/expe/llama.cpp/convert-gptq-to-ggml.py", line 21, in <module>
    model = torch.load(fname_model, map_location="cpu")
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1035, in _legacy_load
    raise RuntimeError("Invalid magic number; corrupt file?")
RuntimeError: Invalid magic number; corrupt file?

How can I use .safetensors models with llama.cpp?
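For context: the traceback above comes from torch.load, which only understands pickle-based .pt/.bin checkpoints, while .safetensors is a different, pickle-free format, so the "invalid magic number" error is expected. A minimal sketch of reading the file's tensors directly with the safetensors library (assuming it is installed, and reusing the path from the command above) would be:

# Sketch: read a .safetensors checkpoint with the safetensors library instead of
# torch.load, which expects a pickle-based file and fails on this format.
from safetensors.torch import load_file

# Path reused from the convert-gptq-to-ggml.py command above; adjust as needed.
state_dict = load_file("models/30B/alpaca-30b-4bit.safetensors", device="cpu")

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)

This only loads the raw GPTQ tensors; it does not produce a GGML .bin on its own.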

@comex
Contributor

comex commented Apr 1, 2023

My conversion script (#545) will support this soon.

@ghost

ghost commented Apr 6, 2023

I thought I'd give it a spin on some safetensors models:

$ python llama.cpp.convert-script/convert.py --outtype q4_1 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Error: Input uses the newer GPTQ-for-LLaMa format (using g_idx), which is not yet natively supported by GGML.  For now you can still convert this model by passing `--outtype f16` to dequantize, but that will result in a much larger output file for no quality benefit.
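For anyone checking whether their file is in this newer format: the distinction comes down to the tensor names stored in the .safetensors file. A small sketch (assuming the safetensors package is installed, reusing the path from the command above) that lists the keys and looks for the g_idx tensors the error mentions:

# Sketch: list tensor names in a GPTQ .safetensors file and check for the
# g_idx tensors that mark the newer GPTQ-for-LLaMa layout.
from safetensors import safe_open

path = "llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors"  # path from the command above
with safe_open(path, framework="pt") as f:
    keys = list(f.keys())

print(any(k.endswith(".g_idx") for k in keys))        # True -> newer format, per the error above
print(sorted(k for k in keys if ".scales" in k)[:3])  # a few of the per-group scale tensors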

$ python llama.cpp.convert-script/convert.py --outtype f16 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g-f16.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Writing vocab...
[1/291] Writing tensor tok_embeddings.weight, size 32000 x 4096...
[2/291] Writing tensor norm.weight, size 4096...
[3/291] Writing tensor output.weight, size 32000 x 4096...
Traceback (most recent call last):
  File "llama.cpp.convert-script/convert.py", line 1053, in <module>
    main()
  File "llama.cpp.convert-script/convert.py", line 1049, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "llama.cpp.convert-script/convert.py", line 870, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "llama.cpp.convert-script/convert.py", line 794, in bounded_parallel_map
    result = futures.pop(0).result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "llama.cpp.convert-script/convert.py", line 867, in do_item
    return lazy_tensor.load().to_ggml().ndarray
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 446, in load
    return self.load().astype(data_type)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 525, in load
    return lazy_tensor.load().permute(n_head)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 576, in load
    return GPTQForLLaMaQuantizedTensor(model, namebase)
  File "llama.cpp.convert-script/convert.py", line 316, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "llama.cpp.convert-script/convert.py", line 261, in load_unquantized
    assert tensor.ndarray.dtype == expected_dtype, (tensor.ndarray.dtype, expected_dtype)
AssertionError: (dtype('float16'), <class 'numpy.float32'>)
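The assertion means load_unquantized expected the GPTQ scales tensor to be stored as float32, but this particular checkpoint saved it as float16. Purely as an illustration of the mismatch, with made-up values (not presented as the fix that eventually went into convert.py):

# Illustration only: the "*.scales" tensors in this checkpoint are float16,
# while the converter asserts float32 before using them.
import numpy as np

scales = np.array([0.012, 0.034, 0.056], dtype=np.float16)  # made-up scale values

print(scales.dtype == np.float32)     # False -> this is what trips the assertion
scales32 = scales.astype(np.float32)  # upcasting float16 -> float32 is lossless
print(scales32.dtype)                 # float32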

If I weren't afraid of causing even more trouble with my clumsy efforts, I'd have a stab at fixing it myself.

@comex
Contributor

comex commented Apr 6, 2023

I’ll take a look.

@hughobrien

#545 worked great for this, thanks @comex

@prusnak
Collaborator

prusnak commented Apr 14, 2023

Try the new convert.py script that is now in master.
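For reference, an invocation sketch using the flags shown earlier in this thread and the original poster's paths (the output filename here is made up, and a GPTQ file that uses g_idx would still need --outtype f16, per the error message above):

$ python convert.py --outtype f16 --outfile models/30B/alpaca-30b-4bit-f16.bin --vocab-dir models/30B models/30B/alpaca-30b-4bit.safetensors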

prusnak closed this as not planned on Apr 14, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023