[hack] Quantization w/o intermediate f32 converted model #7371
ochafik started this conversation in Show and tell (Replies: 1 comment)
-
I'm kinda surprised that no one responded to this. I think you could use it to do the conversion faster by using multiple threads. I don't really code, so I have no idea specifically how, but could you create a sparse output file and map the memory for each tensor in the output file per thread? Something like that.
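A minimal Python sketch of that idea, assuming a hypothetical `quantize_tensor()` helper and a `plan` object that already knows each tensor's f32 data and its byte offset in the output file (a real speedup would also need the quantization itself to run outside the Python GIL, e.g. in native code):

```python
# Sketch of the suggestion above, not real llama.cpp code:
# pre-size the output file, then quantize tensors from several threads,
# each thread writing its result at that tensor's known offset.
# `plan` and `quantize_tensor` are hypothetical placeholders.
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

def quantize_parallel(plan, out_path, total_size, n_threads=8):
    # Truncating the freshly created file to its final size makes it sparse
    # on most Unix filesystems: blocks are only allocated when written.
    with open(out_path, "wb") as f:
        f.truncate(total_size)

    fd = os.open(out_path, os.O_RDWR)
    try:
        mm = mmap.mmap(fd, total_size)  # shared, writable mapping

        def worker(entry):
            # entry.offset: where this tensor's data starts in the output GGUF
            # entry.f32: its unquantized data (hypothetical fields)
            blob = quantize_tensor(entry.f32, entry.qtype)  # hypothetical
            mm[entry.offset:entry.offset + len(blob)] = blob

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(worker, plan.tensors))
        mm.flush()
        mm.close()
    finally:
        os.close(fd)
```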
-
Hi all!
I was trying to convert & quantize files by myself (couldn't find a memory-mappable Mixtral 8x7b since #6387) and realized I didn't have enough disk space left 😓. Even the recently added direct Q8 quantization (#7234) eats lots of disk (besides, I wanted a Q4 w/o needless loss of quality).
So I did a (Unix-only) dirty hack that has `convert.py` quantize the model on the fly (using subprocess calls to a lightly modified `./quantize`): see this branch.

Here's how it works (a rough sketch of the flow follows the list):
- First, `convert.py` writes a `temp-empty-f32.gguf` that has all the KVs and the tensor metadata, but no tensor data (tensor infos have bogus data offsets).
- Then it calls `./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K`: this writes everything to `out-Q6_K.gguf` except the actual quantized tensors (left as zeroes).
- Finally, for each tensor, it writes that single tensor to a `temp-single-f32.gguf` file (which needlessly also contains all KVs) and calls `./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K`. That `--single-tensor` mode just memory-maps the output GGUF in writable mode and writes the quantized data of just that one tensor.
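If you don't want to dig through the branch, here's a rough sketch of that flow. `write_skeleton_gguf()` and `write_single_tensor_gguf()` are hypothetical stand-ins for the GGUF-writing code in `convert.py`, and `--skeleton` / `--single-tensor` are the flags added to the modified `quantize`:

```python
# Rough sketch of the flow described above, not the actual code from the branch.
import subprocess

def convert_and_quantize(model, out_path, qtype="Q6_K", tmp_dir="/tmp"):
    skeleton = f"{tmp_dir}/temp-empty-f32.gguf"
    single = f"{tmp_dir}/temp-single-f32.gguf"

    # 1. All KVs + tensor metadata, but no tensor data.
    write_skeleton_gguf(model, skeleton)                    # hypothetical helper

    # 2. Lay out the whole output file; quantized tensor data stays zeroed.
    subprocess.check_call(["./quantize", "--skeleton", skeleton, out_path, qtype])

    # 3. Quantize one tensor at a time into the output file.
    for name, tensor in model.tensors.items():              # hypothetical model API
        write_single_tensor_gguf(name, tensor, single)      # hypothetical helper
        subprocess.check_call(
            ["./quantize", "--single-tensor", name, single, out_path, qtype])
```

The point is that only one tensor's worth of f32 data ever sits on disk at a time, on top of the final quantized file.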
So, this is way too dirty to be mergeable in any form, but the quantization could probably be done properly, straight from `convert.py` (using something like the ggml Python bindings). I've got half of it working but not sure how useful it is in the grander scheme of things.

One more thing: if you're wary of wearing out your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.
On Mac, the following creates a RAM-backed 4GB volume at `/Volumes/RAM Disk` (see this gist):
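From memory that's roughly `diskutil erasevolume HFS+ 'RAM Disk' $(hdiutil attach -nomount ram://8388608)` (8388608 sectors of 512 bytes, so about 4 GB), but double-check against the gist.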
"temp-single-f32.gguf"
with"/Volumes/RAM Disk/temp-single-f32.gguf"
inconvert.py
and you're good to go.Beta Was this translation helpful? Give feedback.