imatrix: add option to display importance score statistics for a given imatrix file #12718
base: master
Conversation
Nice idea, seems like something we discussed last time? @bartowski1182 Btw, is it possible to show the importance scores from an existing imatrix file @EAddario?
Thank you @ngxson. Yes, it will process any imatrix file produced by llama-imatrix, but it is restricted to a single file (it does not handle multiple --in-file arguments).
Isn't this just related to the hidden state norms getting larger as you move through the different layers? If so, then it won't really account for the accumulation of errors caused by an early layer on the final output?
Not sure if I'm understanding the comment correctly, @jukofyork, but the logic I'm using to identify the most influential tensors/layers is simply to average the importance scores (IS) for each, add those averages together, and then compute each one's individual contribution to the total. The logic llama-imatrix uses to calculate the IS is to square the value of the corresponding activation during inference, keep a running total of how many times that particular value has been updated, and then save the average when inference has finished. This only applies to 2d or larger tensors, so it will ignore norms (1d), but since errors influence which weights get updated (and how frequently), the IS does account for errors, albeit indirectly. Make sense?
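The averaging-and-contribution logic described above can be sketched as follows. This is a minimal illustration, not the actual llama-imatrix code; the tensor names and score values are hypothetical stand-ins for the per-tensor mean squared activations an imatrix file would contain.

```python
import numpy as np

# Hypothetical per-tensor mean importance scores, standing in for the
# averaged squared activations stored in an imatrix file.
mean_scores = {
    "blk.0.attn_q.weight": 0.8,
    "blk.0.ffn_down.weight": 2.4,
    "blk.1.attn_q.weight": 1.2,
    "blk.1.ffn_down.weight": 3.6,
}

# Sum the per-tensor averages, then express each as a % of the total.
total = sum(mean_scores.values())
contributions = {name: 100.0 * s / total for name, s in mean_scores.items()}

# Sort from most to least influential, as the proposed report does.
for name, pct in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```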
I think the mean squared activations (which would be their variance, assuming a mean of 0) cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations. The goal here is to find which layers need more precision, right? I'm not sure the mean squared activations really are what you're looking for. There might be other measures, like skewness and kurtosis, which may be useful, but I'm not sure taking only the activations into account is the right way to get the insights you seek.

What I'd like to try eventually would be to use a simultaneous quantization algorithm to try multiple bit-widths at once in a reasonable amount of time, so that the errors can be compared per tensor to help with the choice of quantization type. I still think it can be useful to have some way to visualize what is in an imatrix file.

In the paper you link (https://arxiv.org/pdf/2406.17415), the closest thing to what you propose would be the LIM (layer input modification) score, which is calculated as follows (Section 3.1), where $X_\ell$ is the input of layer $\ell$ and $Y_\ell$ is its output:

$$\mathrm{LIM}(\ell) = -\frac{X_\ell \cdot Y_\ell}{\lVert X_\ell \rVert \, \lVert Y_\ell \rVert}$$
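The LIM score from the paper is just the negative cosine similarity between a layer's input and output, which is straightforward to compute. A minimal sketch (the function name and example vectors are mine, not from the paper or the PR):

```python
import numpy as np

def lim_score(layer_input, layer_output):
    """LIM (layer input modification) score: negative cosine similarity
    between a layer's input and output. A layer that changes its input
    more (lower similarity) gets a higher score, i.e. is considered
    more important to keep at higher precision."""
    x = np.asarray(layer_input, dtype=float).ravel()
    y = np.asarray(layer_output, dtype=float).ravel()
    return -float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Identical input and output: similarity 1, so LIM = -1 (least important).
print(lim_score([1.0, 0.0], [1.0, 0.0]))   # -1.0
# Output opposite to input: similarity -1, so LIM = 1 (most important).
print(lim_score([1.0, 0.0], [-1.0, 0.0]))  # 1.0
```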
Very clear now, thanks @compilade. You're correct, I'm using the mean squared activations, averaged per tensor/layer, to identify which tensors/layers produce large magnitude activations. I had a quick look at your PRs. I definitely like the idea of storing imatrix data in GGUF format and can appreciate how it would improve the generation of these types of stats. #12557 is quite intriguing, but truth be told I haven't had a chance to fully digest it (there's a lot going on!), but I would love to see it merged, especially if it improves ternary quants.
Had a chance to think about this more thoroughly and now I get the implications of @jukofyork's and @compilade's comments. I agree my current approach is not really identifying influence but rather score "growth". Back to the drawing board 😆
I can help you with this, but it will need a fair bit of compute to calculate. I've not got time to explain fully, but basically:
You will likely have to transform the loss measure somehow.
Assuming Finite-Differences is too costly to perform, you can use a stochastic approximation (FDSA) or its extension SPSA to estimate the gradients using whatever compute you can muster up.
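For reference, the SPSA idea mentioned above perturbs all parameters simultaneously with a random ±1 vector, so each gradient estimate needs only two loss evaluations regardless of dimensionality (FDSA would need two per dimension). A minimal sketch, with function names and defaults of my own choosing:

```python
import numpy as np

def spsa_gradient(f, theta, c=1e-3, rng=None):
    """One SPSA gradient estimate of f at theta.

    Perturbs every dimension at once with a Rademacher (+/-1) vector,
    so only two evaluations of f are needed, however large theta is."""
    rng = rng if rng is not None else np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    f_plus = f(theta + c * delta)
    f_minus = f(theta - c * delta)
    return (f_plus - f_minus) / (2.0 * c * delta)

def spsa_gradient_avg(f, theta, n_samples=8, c=1e-3, seed=0):
    """Average several SPSA estimates to reduce the noise of the
    approximation (each single estimate is unbiased but noisy)."""
    rng = np.random.default_rng(seed)
    return np.mean(
        [spsa_gradient(f, theta, c, rng) for _ in range(n_samples)], axis=0
    )
```

For a smooth loss the estimate converges to the true gradient as the number of samples grows; in practice you trade off sample count against the cost of each loss evaluation.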
I've edited the post above quite a lot, so it should hopefully make more sense (in case you're reading from the email notification).
Thank you, now I know what I'm doing over the weekend 😁 On a serious note, much appreciated @jukofyork. Plenty of food for thought. I'll give it proper consideration.
No problem, and just remember the most important thing is to figure out exactly what you are optimising first! There are actually a lot of compelling options for this, each with their own reasons for and against... and all have different costs to compute.
A new `--show-statistics` option generates a report highlighting which tensors/layers contribute the most in a model, sorted from highest influence to lowest. The process computes the average score per tensor/layer, calculates each one's % contribution, and exits immediately after completion. This PR can be used along with *quantize: Handle user-defined quantization levels for additional tensors* to do layer-wise quantization similar, but not quite identical, to the process described in *Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels*.
Output example:

```
llama-imatrix --in-file imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat --show-statistics
```