[hack] Quantization w/o intermediate f32 converted model #7371
ochafik started this conversation in Show and tell (Replies: 1 comment)
-
I'm kinda surprised that no one responded to this. I think you could use it to do the conversion faster by using multiple threads. I don't really code, so I have no idea specifically how, but could you create a sparse output file and map the memory for each tensor in the output file per thread? Something like that.
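A minimal Python sketch of that idea, assuming a hypothetical `quantize_tensor()` helper and a `plan` object that already knows each tensor's f32 data and its byte offset in the output file (a real speedup would also need the quantization itself to run outside the Python GIL, e.g. in native code):

```python
# Sketch of the suggestion above, not real llama.cpp code:
# pre-size the output file, then quantize tensors from several threads,
# each thread writing its result at that tensor's known offset.
# `plan` and `quantize_tensor` are hypothetical placeholders.
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

def quantize_parallel(plan, out_path, total_size, n_threads=8):
    # Truncating the freshly created file to its final size makes it sparse
    # on most Unix filesystems: blocks are only allocated when written.
    with open(out_path, "wb") as f:
        f.truncate(total_size)

    fd = os.open(out_path, os.O_RDWR)
    try:
        mm = mmap.mmap(fd, total_size)  # shared, writable mapping

        def worker(entry):
            # entry.offset: where this tensor's data starts in the output GGUF
            # entry.f32: its unquantized data (hypothetical fields)
            blob = quantize_tensor(entry.f32, entry.qtype)  # hypothetical
            mm[entry.offset:entry.offset + len(blob)] = blob

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(worker, plan.tensors))
        mm.flush()
        mm.close()
    finally:
        os.close(fd)
```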
-
Hi all!
I was trying to convert & quantize files by myself (couldn't find a memory-mappable Mixtral 8x7b since #6387) and realized I didn't have enough disk space left 😓. Even the recently added direct Q8 quantization (#7234) eats lots of disk (besides, I wanted a Q4 w/o needless loss of quality).
So I did a (Unix-only) dirty hack that has `convert.py` quantize the model on the fly (using subprocess calls to a lightly modified `./quantize`): see this branch.

Here's how it works (a rough sketch of the flow follows the list):
- First, `convert.py` writes a `temp-empty-f32.gguf` that has all the KVs and the tensor metadata, but no tensor data (tensor infos have bogus data offsets).
- Then it calls `./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K`: this writes everything to `out-Q6_K.gguf` except the actual quantized tensors (left as zeroes).
- Finally, for each tensor, it writes that single tensor to a `temp-single-f32.gguf` file (which needlessly also contains all KVs) and calls `./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K`. That `--single-tensor` mode just memory-maps the output GGUF in writable mode and writes the quantized data of just that one tensor.
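If you don't want to dig through the branch, here's a rough sketch of that flow. `write_skeleton_gguf()` and `write_single_tensor_gguf()` are hypothetical stand-ins for the GGUF-writing code in `convert.py`, and `--skeleton` / `--single-tensor` are the flags added to the modified `quantize`:

```python
# Rough sketch of the flow described above, not the actual code from the branch.
import subprocess

def convert_and_quantize(model, out_path, qtype="Q6_K", tmp_dir="/tmp"):
    skeleton = f"{tmp_dir}/temp-empty-f32.gguf"
    single = f"{tmp_dir}/temp-single-f32.gguf"

    # 1. All KVs + tensor metadata, but no tensor data.
    write_skeleton_gguf(model, skeleton)                    # hypothetical helper

    # 2. Lay out the whole output file; quantized tensor data stays zeroed.
    subprocess.check_call(["./quantize", "--skeleton", skeleton, out_path, qtype])

    # 3. Quantize one tensor at a time into the output file.
    for name, tensor in model.tensors.items():              # hypothetical model API
        write_single_tensor_gguf(name, tensor, single)      # hypothetical helper
        subprocess.check_call(
            ["./quantize", "--single-tensor", name, single, out_path, qtype])
```

The point is that only one tensor's worth of f32 data ever sits on disk at a time, on top of the final quantized file.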
So, this is way too dirty to be mergeable in any form, but the quantization could probably be done properly, straight from `convert.py` (using something like the ggml Python bindings). I've got half of it working but not sure how useful it is in the grander scheme of things.

One more thing: if you're wary of wearing out your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.
On Mac, the following creates a RAM-backed 4GB volume at `/Volumes/RAM Disk` (see this gist):
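From memory that's roughly `diskutil erasevolume HFS+ 'RAM Disk' $(hdiutil attach -nomount ram://8388608)` (8388608 sectors of 512 bytes, so about 4 GB), but double-check against the gist.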
"temp-single-f32.gguf"
with"/Volumes/RAM Disk/temp-single-f32.gguf"
inconvert.py
and you're good to go.Beta Was this translation helpful? Give feedback.