quantize: Handle user-defined quantization levels for additional tensors #12511

Merged
@ggerganov merged 35 commits into ggml-org:master from EAddario:quantize on Apr 13, 2025

Conversation

@EAddario (Contributor) commented Mar 22, 2025

This PR adds the ability to quantize other tensors, beyond the token embedding and output tensors. It handles most of the supported architectures, except Mamba, RWKV6, RWKV6QWEN2 and T5, to avoid having too many command options, but these can be added as well if maintainers request it.

For full background on the PR, please see: Squeezing Tensor Bits: the quest for smaller LLMs

@EAddario changed the title from "Handle user-defined quantization levels for additional tensors" to "quantize: Handle user-defined quantization levels for additional tensors" on Mar 22, 2025
@max-krasnyansky (Collaborator)

How about we add a more generic --tensor-type tensor_name_pattern=type?
@slaren has PR #11397, which overrides the backend mapping per tensor.
Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers if needed.
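
For illustration, the proposed syntax might look something like this (the tensor patterns, types and file names are hypothetical examples, not from the PR):

./build/bin/llama-quantize \
    --tensor-type "ffn_down=q5_k" \
    --tensor-type "blk\.2[0-9]\.attn_v=q6_k" \
    model-F16.gguf model-custom.gguf Q4_K_M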

@EAddario (Contributor, Author)

That's an excellent idea! It'll allow adding all supported tensor types (50+) without creating a mess of parameters. Plus, it will give me something to do over the weekend 😆

@jukofyork (Collaborator)

How about we add a more generic --tensor-type tensor_name_pattern=type? @slaren has PR #11397, which overrides the backend mapping per tensor. Let's make this one similar (same patterns, etc.). That way we'll be able to override specific layers if needed.

Yeah, I think this is definitely the way to go - the regex support of that PR gives really good flexibility.

@jukofyork (Collaborator)

I've also modified llama-imatrix to display importance score statistics, which will help in deciding which tensors/layers to quantize. Will create a PR when I get some free time.

This sounds interesting - look forward to seeing it!

@@ -244,6 +247,103 @@ static ggml_type parse_ggml_type(const char * arg) {
return GGML_TYPE_COUNT;
}

// Allowed tensors for arbitrary quantization with --tensor-type option
static const std::vector<std::string> ALLOWED_TENSOR_TYPE = {
Review comment (Collaborator)

I'm rethinking this; maybe we can simplify this functionality by adding just two flags:

  • --dump-mapping to get the list of tensors and the target quantized type. The user can then modify the target quant directly.
  • --mapping FILE so the user can specify the custom mapping file generated in the step above (a rough sketch follows below).
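
A rough sketch of how that two-step flow could look (both flags and the mapping-file format are hypothetical, not implemented in this PR):

# Step 1: dump the default tensor -> quant type mapping into an editable file
./build/bin/llama-quantize --dump-mapping model-F16.gguf > mapping.txt
# edit mapping.txt, e.g. change "blk.0.ffn_down.weight q4_K" to "blk.0.ffn_down.weight q6_K"

# Step 2: quantize using the edited mapping
./build/bin/llama-quantize --mapping mapping.txt model-F16.gguf model-custom.gguf Q4_K_M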

Review comment (Contributor)

I think it makes sense to only allow certain tensors to be quantized, otherwise users will lobotomize their model and then complain that llama.cpp is broken

@EAddario (Contributor, Author)

Agree with @ddh0, although I can see how down the line something similar to what @ngxson is suggesting may be useful: I'm testing the layer-wise quant using the modified llama-imatrix for guidance, and whilst I'm getting some really encouraging results (I'll publish the full model in my HF repo over the weekend), the process is overly manual and the regexes can get unwieldy (e.g. --tensor-type "(1[3-9]|2[0-9]|30)\.attn_v=q6_k" --tensor-type "([0-9]|[1-2][0-9]|30|31)\.ffn_down=q3_k" --tensor-type "(10|1[3-9]|2[0-9]|30)\.attn_q=q5_k" ...).

I think it would be nice to have a way to AutoMagically generate optimum regexes to be fed into llama-quantize!
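
For reference, here is how flags like those fit into a full invocation (a sketch only: the imatrix file, model paths and Q4_K_M base type are illustrative placeholders):

./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type "(1[3-9]|2[0-9]|30)\.attn_v=q6_k" \
    --tensor-type "([0-9]|[1-2][0-9]|30|31)\.ffn_down=q3_k" \
    --tensor-type "(10|1[3-9]|2[0-9]|30)\.attn_q=q5_k" \
    DeepSeek-R1-Distill-Llama-8B-F16.gguf \
    DeepSeek-R1-Distill-Llama-8B-LWQ.gguf \
    Q4_K_M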

@joseph777111 commented Apr 5, 2025

I think it makes sense to only allow certain tensors to be quantized, otherwise users will lobotomize their model and then complain that llama.cpp is broken

This is a case of full granular control vs. guided hand-holding to competence, and we can have the best of both worlds. What we need is a brief, informative How-To Guide that introduces and explains the concepts of per-tensor and per-layer quantization, and then gives concrete examples on which users can base their quantization decisions. Adding to this: the guide should further educate the user on which tensors/weights are good targets for quantization (Embeddings, ATTN_K, ATTN_Q, ATTN_V, ATTN_Output, FFN_Down, FFN_Gate, FFN_Up, and Output) and which are more likely not to be (FFN_NORM, etc.) and why. And then, for the more dense among us: a brief disclaimer stating that any and all modifications users make to their custom quantized models are their own business and responsibility, thereby waiving ggml-org or any of you from liability. 🤔

Just because some choose not to read and learn doesn't mean we should have to suffer the loss of "power-user" features because those who aren't paying attention will lobotomize their quantized models. This is all a fun game of trial-and-error and experimentation. If users have made it this far, they will have to learn.

@ddh0 (Contributor) commented Apr 3, 2025

Currently quantize.cpp is missing an #include <algorithm>, which causes the build to fail on Linux CUDA:

/home/dylan/Documents/AI/llama.cpp/examples/quantize/quantize.cpp: In function ‘bool string_parse_tensor_type(const char*, std::vector<tensor_quantization>&)’:
/home/dylan/Documents/AI/llama.cpp/examples/quantize/quantize.cpp:318:10: error: ‘transform’ is not a member of ‘std’
  318 |     std::transform(tn.begin(), tn.end(), tn.begin(), tolower);
      |          ^~~~~~~~~
gmake[2]: *** [examples/quantize/CMakeFiles/llama-quantize.dir/build.make:79: examples/quantize/CMakeFiles/llama-quantize.dir/quantize.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:4067: examples/quantize/CMakeFiles/llama-quantize.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....

Adding the missing include lets me build successfully.

@ddh0 (Contributor) commented Apr 3, 2025

I started a discussion thread if anyone's interested, so we don't clog this PR: #12741

@EAddario (Contributor, Author) commented Apr 3, 2025

I'll add results of my weekend testing there as well

@ubergarm commented Apr 4, 2025

FWIW, I've been doing this with ik_llama.cpp's llama-quantize --custom-q feature with good success. Just in case there is any desire to keep this PR's syntax compatible (or not).

Specifying exact quants per tensor becomes more important now that -ot is merged and the MLA PR is in the works here. This allows trading off quality and performance specific to target hardware configurations (e.g. how much VRAM to leave available for the MLA kv-cache when using -ot exps=CPU, etc.).

I have a couple custom quants up on huggingface ubergarm/DeepSeek-V3-0324-GGUF that use this technique.

Here is an example bash script recipe for an experimental CPU-only speed blend:

CPU-only quant performance blend V3-0324 recipe

NOTE: mainline llama.cpp doesn't have all of these quant types, but you can see how regex tensor<->quant mappings via --custom-q allow easy testing and maintenance of recipe scripts.

#!/usr/bin/env bash

# CPU only inference blend

# Notes:
# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2765210993
# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2768567062
custom="
# Token embedding and output tensors
# note token_embd cannot be repacked quant type
token_embd\.weight=iq6_k
output\.weight=iq5_k_r4
output_norm\.weight=iq5_k_r4

# First 3 dense layers (0-2)
blk\.[0-2]\.attn_k_b.*=q6_0_r4
blk\.[0-2]\.attn_.*=iq5_k_r4
blk\.[0-2]\..*=iq5_k_r4

# All attention, norm weights, and bias tensors for MoE layers (3-60)
# Except blk.*.attn_k_b.weight, which is not divisible by 256 and has no iq6_k, so go with q6_0_r4 for a CPU-only speed boost
blk\.[3-9]\.attn_k_b.*=q6_0_r4
blk\.[1-5][0-9]\.attn_k_b.*=q6_0_r4
blk\.60\.attn_k_b.*=q6_0_r4

blk\.[3-9]\.attn_.*=iq5_k_r4
blk\.[1-5][0-9]\.attn_.*=iq5_k_r4
blk\.60\.attn_.*=iq5_k_r4

blk\.[3-9]\.ffn_norm\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_k_r4
blk\.60\.ffn_norm\.weight=iq5_k_r4

blk\.[3-9]\.exp_probs_b\.bias=iq5_k_r4
blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_k_r4
blk\.60\.exp_probs_b\.bias=iq5_k_r4

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_k_r4
blk\.60\.ffn_down_shexp\.weight=iq5_k_r4

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
blk\.60\.ffn_(gate|up)_shexp\.weight=iq5_k_r4

# Routed Experts (3-60)
# First 16 layers are more sensitive so keep larger
blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[1][0-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[2-5][0-9]\.ffn_down_exps\.weight=iq4_k_r4
blk\.60\.ffn_down_exps\.weight=iq4_k_r4

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[1][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[2-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq3_k_r4
"
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
    --token-embedding-type iq6_k \
    --output-tensor-type iq5_k_r4 \
    --custom-q "$custom" \
    /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K.gguf \
    IQ3_K \
    24

@David-AU-github

Super stoked about this, especially the option to adjust quantization for the "shared expert" weights.
It's playtime.

@EAddario (Contributor, Author) commented Apr 7, 2025

TL;DR: A combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful for generating custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.

More info here

Test results

| Model | Naive (GB) | TWQ (GB) | Reduction | LWQ (GB) | Reduction | Naive 𝜌PPL | TWQ 𝜌PPL | LWQ 𝜌PPL |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B-IQ3_M | 3.78 | 3.48 | 7.9% | 3.67 | 2.9% | 93.64% | 91.75% | 94.84% |
| DeepSeek-R1-Distill-Llama-8B-IQ3_S | 3.68 | 3.24 | 12.0% | 3.56 | 3.3% | 93.71% | 91.50% | 93.48% |
| DeepSeek-R1-Distill-Llama-8B-IQ4_NL | 4.68 | 4.3 | 8.1% | 4.4 | 6.0% | 98.82% | 96.44% | 95.87% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_L | 4.32 | 3.45 | 20.1% | 3.88 | 10.2% | 97.25% | 92.60% | 94.83% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_M | 4.02 | 3.37 | 16.2% | 3.57 | 11.2% | 96.92% | 91.45% | 94.63% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_S | 3.66 | 3.28 | 10.4% | 3.43 | 6.3% | 94.59% | 90.73% | 92.46% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_M | 4.92 | 4.44 | 9.8% | 4.41 | 10.4% | 98.85% | 98.02% | 98.04% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_S | 4.69 | 4.31 | 8.1% | 4.33 | 7.7% | 99.01% | 97.97% | 97.57% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_M | 5.73 | 5.35 | 6.6% | 5.38 | 6.1% | 99.09% | 98.83% | 98.95% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_S | 5.6 | 5.19 | 7.3% | 5.3 | 5.4% | 99.00% | 98.82% | 98.82% |
| DeepSeek-R1-Distill-Llama-8B-Q6_K | 6.6 | 6.17 | 6.5% | 6.51 | 1.4% | 99.47% | 98.91% | 99.18% |
| DeepSeek-R1-Distill-Llama-8B-Q8_0 | 8.54 | 7.84 | 8.2% | 7.47 | 12.5% | 99.93% | 98.99% | 98.97% |

@David-AU-github commented Apr 7, 2025

TL;DR: A combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful to generate custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.

More info here


@EAddario
Just a heads up:
Distill (and preview models) are sensitive in layers 0-7 and 28-31.
You could give these more bits, and lower the other layers to maintain or augment function.

Comment on lines +379 to 381
void * kv_overrides; // pointer to vector containing overrides
void * tensor_types; // pointer to vector containing tensor types
} llama_model_quantize_params;
@ggerganov (Member)

This changes the public interface, so add a comment in #9289.

Note that passing C++ objects here is not correct and we eventually have to fix this API to not do that. It hasn't become a problem yet because the quantization functions are likely not used frequently by 3rd party applications.

@EAddario If you are interested, you can give it a shot in another PR and fix these structs to become C compatible.

@EAddario (Contributor, Author)

Thanks @ggerganov, happy to

@slaren (Member) left a review comment

This is a bit too hacky for my preference, but I suppose if people are already creating custom mixes by modifying the code, it is better to at least have a tool to do it.

I would prefer if the allowed-tensor check were removed; it doesn't really work as a reliable check, and it will prevent some legitimate uses.

@EAddario (Contributor, Author)

Thanks for approving, @slaren. Is there any particular use case you have in mind that it would prevent? Maybe I can work it into the logic.

@EAddario (Contributor, Author)

Got a better-quality LWQ mix using the stats from the modified llama-imatrix. More info here

Test results

| Model | Naive (GB) | TWQ (GB) | Reduction | LWQ (GB) | Reduction | Naive 𝜌PPL | TWQ 𝜌PPL | LWQ 𝜌PPL |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B-IQ3_M | 3.78 | 3.48 | 7.9% | 3.69 | 2.5% | 93.64% | 91.75% | 94.24% |
| DeepSeek-R1-Distill-Llama-8B-IQ3_S | 3.68 | 3.24 | 12.0% | 3.43 | 6.8% | 93.71% | 91.50% | 92.97% |
| DeepSeek-R1-Distill-Llama-8B-IQ4_NL | 4.68 | 4.30 | 8.1% | 4.39 | 6.1% | 98.82% | 96.44% | 96.12% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_L | 4.32 | 3.45 | 20.1% | 3.76 | 13.0% | 97.25% | 92.60% | 94.79% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_M | 4.02 | 3.37 | 16.2% | 3.56 | 11.3% | 96.92% | 91.45% | 94.45% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_S | 3.66 | 3.28 | 10.4% | 3.31 | 9.7% | 94.59% | 90.73% | 92.23% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_M | 4.92 | 4.44 | 9.8% | 4.41 | 10.5% | 98.85% | 98.02% | 98.03% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_S | 4.69 | 4.31 | 8.1% | 4.28 | 8.8% | 99.01% | 97.97% | 97.72% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_M | 5.73 | 5.35 | 6.6% | 5.38 | 6.2% | 99.09% | 98.83% | 98.94% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_S | 5.60 | 5.19 | 7.3% | 5.24 | 6.4% | 99.00% | 98.82% | 98.85% |
| DeepSeek-R1-Distill-Llama-8B-Q6_K | 6.60 | 6.17 | 6.5% | 6.57 | 0.5% | 99.47% | 98.91% | 99.19% |
| DeepSeek-R1-Distill-Llama-8B-Q8_0 | 8.54 | 7.84 | 8.2% | 7.73 | 9.4% | 99.93% | 98.99% | 99.26% |

@slaren (Member) commented Apr 12, 2025

Any particular use case you have in mind it will prevent? Maybe I can work it into the logic.

For example, using ffn as the pattern to set the type of all ffn tensors, or attn of all attention tensors, without having to specify each one individually.

@EAddario (Contributor, Author) commented Apr 13, 2025

I see what you mean. The choice of approach was a trade-off between ensuring the program continues to work exactly as before (backwards compatibility), not duplicating options that are already available (--pure, --output-tensor-type and --token-embedding-type), and adding new capabilities in a way that's consistent with the existing error-checking logic.

By restricting the tensors, users won't be able to do things that are clearly not useful, like trying to quantize norms, lerps, ropes, etc., but you're right that users wanting to quantize all attn tensors would need to pass three options (--tensor-type attn_q=q4_k --tensor-type attn_k=q4_k --tensor-type attn_v=q4_k) instead of just one (--tensor-type attn=q4_k).

Once the changes are merged, I'll open a new PR to address this within the tensor-checking logic, to avoid matching instances like attn_norm, ffn_norm, etc., plus implement @ggerganov's recommendation to make the struct C compatible.

@ggerganov merged commit 71e90e8 into ggml-org:master on Apr 13, 2025
51 checks passed
@acbits commented Apr 13, 2025

By restricting the tensors, users won't be able to do things that are clearly not useful, like trying to quantize norms, lerps, ropes, etc., but you're right that users wanting to quantize all attn tensors would need to pass three options (--tensor-type attn_q=q4_k --tensor-type attn_k=q4_k --tensor-type attn_v=q4_k) instead of just one (--tensor-type attn=q4_k).

Late to this conversation, but isn't this case already handled by a regex that uses grouping?

--tensor-type 'attn_(q|k)=q4_k' could be applied to both attn_q and attn_k?

@EAddario (Contributor, Author)

Not quite, @acbits. For the reasons described above, the program requires the full tensor name, with the regex applying only to the characters preceding it (e.g. the layer-number prefix).

I'll improve this behaviour in the next PR.
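
A minimal sketch of the distinction being described (model file names and the Q4_K_M base type are illustrative placeholders):

# Accepted: regex only on the prefix (here, layer numbers), with the full tensor name at the end
./build/bin/llama-quantize \
    --tensor-type "(1[3-9]|2[0-9]|30)\.attn_v=q6_k" \
    --tensor-type "attn_q=q4_k" \
    model-F16.gguf model-custom.gguf Q4_K_M

# Not accepted in this PR: regex grouping inside the tensor name itself
# --tensor-type "attn_(q|k)=q4_k"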

@EAddario deleted the quantize branch on April 14, 2025 at 07:27
@joseph777111

@EAddario Congrats! 🚀
