imatrix: add option to display importance score statistics for a given imatrix file #12718
base: master
Conversation
Nice idea, seems like something we discussed last time? @bartowski1182 Btw, is it possible to show the importance scores from an existing imatrix file @EAddario?
Thank you @ngxson. Yes, it will process any imatrix file produced by llama-imatrix, but it is restricted to a single file (it does not handle multiple --in-file arguments).
Isn't this just related to the hidden state norms getting larger as you move through the different layers? If so, then it won't really account for the accumulation of errors caused by an early layer on the final output?
Not sure if I'm understanding the comment correctly, @jukofyork, but the logic I'm using to identify the most influential tensors/layers is simply to average the importance scores (IS) for each, add those averages together, and then compute each one's individual contribution to the total. The logic llama-imatrix uses to calculate the IS is to square the value of the corresponding activation during inference, keep a running total of how many times that particular value has been updated, and then save the average when inference has finished. This only applies to 2d or larger tensors, so it will ignore norms (1d), but since errors influence which weights get updated (and how frequently), the IS does account for errors, albeit indirectly. Make sense?
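The averaging-and-contribution logic described above can be sketched as follows. This is a minimal illustration, not the actual llama-imatrix code; the tensor names and score values are hypothetical stand-ins for the per-tensor mean squared activations an imatrix file would contain.

```python
import numpy as np

# Hypothetical per-tensor mean importance scores, standing in for the
# averaged squared activations stored in an imatrix file.
mean_scores = {
    "blk.0.attn_q.weight": 0.8,
    "blk.0.ffn_down.weight": 2.4,
    "blk.1.attn_q.weight": 1.2,
    "blk.1.ffn_down.weight": 3.6,
}

# Sum the per-tensor averages, then express each as a % of the total.
total = sum(mean_scores.values())
contributions = {name: 100.0 * s / total for name, s in mean_scores.items()}

# Sort from most to least influential, as the proposed report does.
for name, pct in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```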
I think the mean squared activations (which would be their variance, assuming a mean of 0) cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations. The goal here is to find which layers need more precision, right? I'm not sure the mean squared activations really are what you're looking for. There might be other measures, like skewness and kurtosis, which may be useful, but I'm not sure taking only the activations into account is the right way to get the insights you seek.

What I'd like to try eventually would be to use a simultaneous quantization algorithm to try multiple bit-widths at once in a reasonable amount of time, so that the errors can be compared per tensor to help with the choice of quantization type. I still think it can be useful to have some way to visualize what is in an imatrix file.

In the paper you link (https://arxiv.org/pdf/2406.17415), the closest thing to what you propose would be the LIM (layer input modification) score, which is calculated as follows (Section 3.1), where $X_\ell$ is the input of layer $\ell$ and $Y_\ell$ is its output:

$$\mathrm{LIM}(\ell) = -\frac{X_\ell \cdot Y_\ell}{\lVert X_\ell \rVert \, \lVert Y_\ell \rVert}$$
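The LIM score from the paper is just the negative cosine similarity between a layer's input and output, which is straightforward to compute. A minimal sketch (the function name and example vectors are mine, not from the paper or the PR):

```python
import numpy as np

def lim_score(layer_input, layer_output):
    """LIM (layer input modification) score: negative cosine similarity
    between a layer's input and output. A layer that changes its input
    more (lower similarity) gets a higher score, i.e. is considered
    more important to keep at higher precision."""
    x = np.asarray(layer_input, dtype=float).ravel()
    y = np.asarray(layer_output, dtype=float).ravel()
    return -float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Identical input and output: similarity 1, so LIM = -1 (least important).
print(lim_score([1.0, 0.0], [1.0, 0.0]))   # -1.0
# Output opposite to input: similarity -1, so LIM = 1 (most important).
print(lim_score([1.0, 0.0], [-1.0, 0.0]))  # 1.0
```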
Very clear now, thanks @compilade. You're correct, I'm using the mean squared activations, averaged per tensor/layer, to identify which tensors/layers produce large magnitude activations. I had a quick look at your PRs. I definitely like the idea of storing imatrix data in GGUF format and can appreciate how it would improve the generation of these types of stats. #12557 is quite intriguing, but truth be told I haven't had a chance to fully digest it (there's a lot going on!), but I would love to see it merged, especially if it improves ternary quants.
Had a chance to think about this more thoroughly and now I get the implications of @jukofyork's and @compilade's comments. I agree my current approach is not really identifying influence but rather score "growth". Back to the drawing board 😆
I can help you with this, but it will need a fair bit of compute to calculate. I've not got time to explain fully, but basically:
You will likely have to transform the loss measure somehow.
Assuming Finite-Differences is too costly to perform, you can use a stochastic approximation (FDSA) or its extension SPSA to estimate the gradients using whatever compute you can muster up.
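For reference, the SPSA idea mentioned above perturbs all parameters simultaneously with a random ±1 vector, so each gradient estimate needs only two loss evaluations regardless of dimensionality (FDSA would need two per dimension). A minimal sketch, with function names and defaults of my own choosing:

```python
import numpy as np

def spsa_gradient(f, theta, c=1e-3, rng=None):
    """One SPSA gradient estimate of f at theta.

    Perturbs every dimension at once with a Rademacher (+/-1) vector,
    so only two evaluations of f are needed, however large theta is."""
    rng = rng if rng is not None else np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    f_plus = f(theta + c * delta)
    f_minus = f(theta - c * delta)
    return (f_plus - f_minus) / (2.0 * c * delta)

def spsa_gradient_avg(f, theta, n_samples=8, c=1e-3, seed=0):
    """Average several SPSA estimates to reduce the noise of the
    approximation (each single estimate is unbiased but noisy)."""
    rng = np.random.default_rng(seed)
    return np.mean(
        [spsa_gradient(f, theta, c, rng) for _ in range(n_samples)], axis=0
    )
```

For a smooth loss the estimate converges to the true gradient as the number of samples grows; in practice you trade off sample count against the cost of each loss evaluation.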
I've edited the post above quite a lot, so it should hopefully make more sense (in case you're reading from the email notification).
Thank you, now I know what I'm doing over the weekend 😁 On a serious note, much appreciated @jukofyork. Plenty of food for thought. I'll give it proper consideration.
No problem, and just remember the most important thing is to figure out exactly what you are optimising first! There are actually a lot of compelling options for this, each with their own reasons for and against... and all have different costs to compute.
A new `--show-statistics` option generates a report highlighting which tensors/layers contribute the most in a model, sorted from highest influence to lowest. The process computes the average score per tensor/layer, calculates each one's % contribution, and exits immediately after completion. This PR can be used along with *quantize: Handle user-defined quantization levels for additional tensors* to do layer-wise quantization similar, but not quite identical, to the process described in *Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels*.
Output example:

```
llama-imatrix --in-file imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat --show-statistics
```