ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) #1179

ggerganov · 2023-04-25T18:45:36Z

8-bit integer quantization support

Perplexity: 5.9563

main: seed = 1682448271
llama.cpp: loading model from ../models/7B/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403851.11 KB
llama_model_load_internal: mem required  = 9022.32 MB (+ 1026.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.16 seconds per pass - ETA 23 minutes
[1]4.2384,[2]4.7355,[3]5.5889,[4]6.1733,[5]6.3007,[6]6.2664,[7]6.4584,[8]6.5534,[9]6.8821,[10]7.1241,[11]7.3343,[12]7.3544,[13]7.2708,[14]7.3211,[15]7.5611,[16]7.1921,[17]7.0824,[18]7.0290,[19]6.6805,[20]6.6714,[21]6.5799,[22]6.4063,[23]6.3776,[24]6.2865,[25]6.2832,[26]6.1237,[27]5.9534,[28]5.8555,[29]5.7683,[30]5.6163,[31]5.5870,[32]5.6074,[33]5.5519,[34]5.5823,[35]5.6052,[36]5.6417,[37]5.6421,[38]5.6530,[39]5.6856,[40]5.7357,[41]5.7442,[42]5.7817,[43]5.7443,[44]5.8007,[45]5.8034,[46]5.7780,[47]5.7980,[48]5.7735,[49]5.7746,[50]5.7359,[51]5.7319,[52]5.7225,[53]5.7672,[54]5.7516,[55]5.7301,[56]5.7587,[57]5.7784,[58]5.7973,[59]5.8144,[60]5.8550,[61]5.8479,[62]5.9050,[63]5.9359,[64]5.9492,[65]5.9908,[66]5.9990,[67]6.0162,[68]6.0307,[69]6.0542,[70]6.0843,[71]6.1051,[72]6.1363,[73]6.1946,[74]6.1988,[75]6.2121,[76]6.2241,[77]6.2353,[78]6.2208,[79]6.2482,[80]6.2417,[81]6.2527,[82]6.2567,[83]6.2067,[84]6.1891,[85]6.1764,[86]6.1553,[87]6.0913,[88]6.0665,[89]6.0472,[90]6.0334,[91]6.0560,[92]6.0502,[93]6.0510,[94]6.0486,[95]6.0757,[96]6.0756,[97]6.0701,[98]6.0644,[99]6.0516,[100]6.0506,[101]6.0743,[102]6.0695,[103]6.0895,[104]6.0968,[105]6.0968,[106]6.1132,[107]6.1125,[108]6.1257,[109]6.1208,[110]6.1173,[111]6.1394,[112]6.1595,[113]6.1616,[114]6.1578,[115]6.1637,[116]6.1548,[117]6.1596,[118]6.1877,[119]6.2092,[120]6.2432,[121]6.2577,[122]6.2818,[123]6.3180,[124]6.3350,[125]6.3260,[126]6.3639,[127]6.3994,[128]6.4290,[129]6.4143,[130]6.4225,[131]6.4188,[132]6.4116,[133]6.3988,[134]6.4088,[135]6.4048,[136]6.3944,[137]6.3872,[138]6.3698,[139]6.3597,[140]6.3562,[141]6.3275,[142]6.3241,[143]6.2945,[144]6.2745,[145]6.2656,[146]6.2541,[147]6.2576,[148]6.2580,[149]6.2528,[150]6.2486,[151]6.2506,[152]6.2410,[153]6.2254,[154]6.2170,[155]6.2237,[156]6.2190,[157]6.2355,[158]6.2396,[159]6.2440,[160]6.2466,[161]6.2584,[162]6.2308,[163]6.2197,[164]6.1968,[165]6.1670,[166]6.1406,[167]6.1045,[168]6.0748,[169]6.0614,[170]6.0508,[171]6.0248,[172]6.0083,[173]5.9923,[174]5.9631,[175]5.9419,[176]5.9
307,[177]5.9112,[178]5.8890,[179]5.8725,[180]5.8632,[181]5.8422,[182]5.8248,[183]5.8115,[184]5.8107,[185]5.8036,[186]5.8047,[187]5.8108,[188]5.8071,[189]5.8239,[190]5.8247,[191]5.8453,[192]5.8610,[193]5.8772,[194]5.8879,[195]5.9087,[196]5.9240,[197]5.9445,[198]5.9593,[199]5.9623,[200]5.9671,[201]5.9619,[202]5.9801,[203]5.9872,[204]5.9855,[205]5.9956,[206]6.0024,[207]5.9986,[208]6.0069,[209]6.0108,[210]6.0159,[211]6.0265,[212]6.0334,[213]6.0436,[214]6.0458,[215]6.0483,[216]6.0622,[217]6.0800,[218]6.0927,[219]6.0925,[220]6.0890,[221]6.0840,[222]6.0818,[223]6.0725,[224]6.0655,[225]6.0617,[226]6.0818,[227]6.0896,[228]6.0948,[229]6.1008,[230]6.0976,[231]6.1139,[232]6.1026,[233]6.0866,[234]6.0722,[235]6.0523,[236]6.0459,[237]6.0366,[238]6.0393,[239]6.0249,[240]6.0151,[241]6.0169,[242]6.0206,[243]6.0189,[244]6.0079,[245]6.0050,[246]5.9942,[247]5.9829,[248]5.9759,[249]5.9735,[250]5.9781,[251]5.9713,[252]5.9681,[253]5.9587,[254]5.9536,[255]5.9429,[256]5.9255,[257]5.9139,[258]5.9060,[259]5.9038,[260]5.8959,[261]5.8918,[262]5.8864,[263]5.8813,[264]5.8592,[265]5.8587,[266]5.8569,[267]5.8505,[268]5.8591,[269]5.8572,[270]5.8583,[271]5.8658,[272]5.8691,[273]5.8694,[274]5.8719,[275]5.8799,[276]5.8858,[277]5.9012,[278]5.9110,[279]5.9202,[280]5.9230,[281]5.9326,[282]5.9383,[283]5.9527,[284]5.9605,[285]5.9688,[286]5.9823,[287]5.9818,[288]5.9875,[289]5.9795,[290]5.9642,[291]5.9497,[292]5.9353,[293]5.9224,[294]5.9246,[295]5.9238,[296]5.9284,[297]5.9271,[298]5.9300,[299]5.9276,[300]5.9171,[301]5.9172,[302]5.9096,[303]5.9013,[304]5.8931,[305]5.8897,[306]5.8775,[307]5.8797,[308]5.8827,[309]5.8674,[310]5.8621,[311]5.8559,[312]5.8581,[313]5.8526,[314]5.8510,[315]5.8357,[316]5.8306,[317]5.8148,[318]5.7952,[319]5.8067,[320]5.8187,[321]5.8231,[322]5.8192,[323]5.8126,[324]5.8099,[325]5.8199,[326]5.8201,[327]5.8222,[328]5.8260,[329]5.8318,[330]5.8344,[331]5.8465,[332]5.8437,[333]5.8504,[334]5.8451,[335]5.8393,[336]5.8430,[337]5.8409,[338]5.8402,[339]5.8352,[340]5.8311,[341]5.8389,[342]5.8417,[343
]5.8464,[344]5.8465,[345]5.8470,[346]5.8446,[347]5.8487,[348]5.8520,[349]5.8543,[350]5.8511,[351]5.8520,[352]5.8520,[353]5.8463,[354]5.8464,[355]5.8514,[356]5.8544,[357]5.8510,[358]5.8598,[359]5.8624,[360]5.8591,[361]5.8587,[362]5.8656,[363]5.8765,[364]5.8824,[365]5.8875,[366]5.8888,[367]5.8972,[368]5.8949,[369]5.8958,[370]5.8972,[371]5.8921,[372]5.8968,[373]5.9013,[374]5.8998,[375]5.9000,[376]5.9065,[377]5.9022,[378]5.9049,[379]5.9106,[380]5.9029,[381]5.8996,[382]5.8946,[383]5.8940,[384]5.8935,[385]5.8925,[386]5.8920,[387]5.8919,[388]5.8883,[389]5.8833,[390]5.8766,[391]5.8692,[392]5.8654,[393]5.8638,[394]5.8663,[395]5.8651,[396]5.8581,[397]5.8649,[398]5.8686,[399]5.8762,[400]5.8764,[401]5.8777,[402]5.8787,[403]5.8806,[404]5.8870,[405]5.8776,[406]5.8744,[407]5.8740,[408]5.8757,[409]5.8869,[410]5.8976,[411]5.9087,[412]5.9241,[413]5.9349,[414]5.9423,[415]5.9477,[416]5.9553,[417]5.9671,[418]5.9705,[419]5.9771,[420]5.9857,[421]5.9970,[422]6.0010,[423]6.0080,[424]6.0184,[425]6.0268,[426]6.0331,[427]6.0375,[428]6.0456,[429]6.0505,[430]6.0586,[431]6.0723,[432]6.0760,[433]6.0753,[434]6.0713,[435]6.0722,[436]6.0747,[437]6.0841,[438]6.0914,[439]6.0883,[440]6.0875,[441]6.0826,[442]6.0811,[443]6.0824,[444]6.0829,[445]6.0811,[446]6.0834,[447]6.0863,[448]6.0904,[449]6.0881,[450]6.0889,[451]6.0850,[452]6.0716,[453]6.0632,[454]6.0576,[455]6.0586,[456]6.0632,[457]6.0651,[458]6.0630,[459]6.0636,[460]6.0720,[461]6.0693,[462]6.0679,[463]6.0717,[464]6.0706,[465]6.0680,[466]6.0604,[467]6.0606,[468]6.0603,[469]6.0623,[470]6.0627,[471]6.0580,[472]6.0623,[473]6.0572,[474]6.0583,[475]6.0523,[476]6.0539,[477]6.0468,[478]6.0457,[479]6.0512,[480]6.0556,[481]6.0573,[482]6.0530,[483]6.0489,[484]6.0509,[485]6.0488,[486]6.0431,[487]6.0428,[488]6.0405,[489]6.0359,[490]6.0336,[491]6.0307,[492]6.0252,[493]6.0226,[494]6.0209,[495]6.0204,[496]6.0166,[497]6.0111,[498]6.0094,[499]6.0053,[500]5.9962,[501]5.9897,[502]5.9899,[503]5.9893,[504]5.9808,[505]5.9829,[506]5.9837,[507]5.9780,[508]5.9741,[509]5.9735,
[510]5.9769,[511]5.9815,[512]5.9849,[513]5.9869,[514]5.9930,[515]5.9877,[516]5.9867,[517]5.9878,[518]5.9875,[519]5.9904,[520]5.9929,[521]5.9941,[522]5.9967,[523]5.9974,[524]6.0030,[525]6.0061,[526]6.0070,[527]6.0088,[528]6.0038,[529]6.0043,[530]5.9995,[531]5.9984,[532]6.0029,[533]6.0052,[534]6.0036,[535]6.0057,[536]6.0004,[537]5.9984,[538]6.0033,[539]6.0044,[540]6.0080,[541]6.0083,[542]6.0094,[543]6.0110,[544]6.0121,[545]6.0102,[546]6.0110,[547]6.0070,[548]6.0024,[549]6.0025,[550]5.9996,[551]5.9963,[552]5.9941,[553]5.9906,[554]5.9886,[555]5.9856,[556]5.9852,[557]5.9875,[558]5.9837,[559]5.9834,[560]5.9833,[561]5.9835,[562]5.9814,[563]5.9810,[564]5.9853,[565]5.9873,[566]5.9871,[567]5.9850,[568]5.9856,[569]5.9843,[570]5.9871,[571]5.9876,[572]5.9886,[573]5.9886,[574]5.9851,[575]5.9844,[576]5.9843,[577]5.9829,[578]5.9811,[579]5.9816,[580]5.9753,[581]5.9717,[582]5.9707,[583]5.9715,[584]5.9718,[585]5.9644,[586]5.9577,[587]5.9583,[588]5.9631,[589]5.9682,[590]5.9712,[591]5.9733,[592]5.9721,[593]5.9690,[594]5.9700,[595]5.9677,[596]5.9709,[597]5.9689,[598]5.9660,[599]5.9681,[600]5.9676,[601]5.9661,[602]5.9670,[603]5.9697,[604]5.9705,[605]5.9739,[606]5.9759,[607]5.9742,[608]5.9710,[609]5.9718,[610]5.9753,[611]5.9736,[612]5.9762,[613]5.9727,[614]5.9678,[615]5.9608,[616]5.9635,[617]5.9577,[618]5.9530,[619]5.9478,[620]5.9345,[621]5.9280,[622]5.9264,[623]5.9280,[624]5.9285,[625]5.9287,[626]5.9276,[627]5.9298,[628]5.9299,[629]5.9295,[630]5.9327,[631]5.9382,[632]5.9438,[633]5.9424,[634]5.9458,[635]5.9464,[636]5.9431,[637]5.9396,[638]5.9420,[639]5.9390,[640]5.9399,[641]5.9401,[642]5.9466,[643]5.9487,[644]5.9499,[645]5.9481,[646]5.9520,[647]5.9480,[648]5.9489,[649]5.9492,[650]5.9529,[651]5.9581,[652]5.9592,[653]5.9631,[654]5.9569,[655]5.9563,
llama_print_timings:        load time =  5233.52 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 1351323.55 ms / 335360 tokens (    4.03 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 1385072.75 ms

sw · 2023-04-25T18:57:35Z

For AVX2/AVX/scalar, we might want to keep ggml_vec_dot_q4_0_q8_0 and ggml_vec_dot_q4_2_q8_0, so as not to waste cycles and memory for s0 and s1, which aren't used.

I'm actually surprised that they're worth using on ARM NEON, as the alternative is simply subtracting 8 from the Q4 quants.
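The trade-off raised here can be sketched in scalar C. This is a minimal illustration, not the ggml code: the block layouts, the nibble order, the single `s` field (standing in for the `s0`/`s1` sums) and every identifier below are assumptions made for the example. It shows that subtracting 8 from each Q4 quant gives the same result as multiplying the unsigned quants and correcting with a precomputed `d * sum(qs)`.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* values per block (assumed) */

/* Illustrative layouts only; names and fields are assumptions, not copied from ggml.c. */
typedef struct { float d; uint8_t qs[QK / 2]; } blk_q4;         /* 4-bit quants, stored with an offset of 8 */
typedef struct { float d; int8_t  qs[QK]; } blk_q8_plain;       /* Q8_0-style: scale + quants */
typedef struct { float d; float s; int8_t qs[QK]; } blk_q8_sum; /* Q8_1-style: also stores d * sum(qs) */

/* Unpack the i-th 4-bit quant (assumed order: even index in the low nibble). */
static int q4_at(const blk_q4 *x, int i) {
    assert(i >= 0 && i < QK);
    uint8_t b = x->qs[i / 2];
    return (i % 2 == 0) ? (b & 0x0F) : (b >> 4);
}

/* Variant A: subtract 8 from every Q4 quant; no stored sum is needed. */
static float dot_sub8(const blk_q4 *x, const blk_q8_plain *y) {
    int acc = 0;
    for (int i = 0; i < QK; i++) acc += (q4_at(x, i) - 8) * y->qs[i];
    return x->d * y->d * (float)acc;
}

/* Variant B: multiply the unsigned quants, then correct with the stored sum. */
static float dot_with_sum(const blk_q4 *x, const blk_q8_sum *y) {
    int acc = 0;
    for (int i = 0; i < QK; i++) acc += q4_at(x, i) * y->qs[i];
    return x->d * y->d * (float)acc - 8.0f * x->d * y->s;
}

/* Builds one block pair with arbitrary test data and returns |A - B|. */
static float demo_variant_difference(void) {
    blk_q4 x = { .d = 0.25f };
    blk_q8_plain y0 = { .d = 0.5f };
    blk_q8_sum   y1 = { .d = 0.5f };
    int sum = 0;
    for (int i = 0; i < QK; i++) {
        int q8 = (i * 7) % 31 - 15; /* arbitrary values in [-15, 15] */
        y0.qs[i] = (int8_t)q8;
        y1.qs[i] = (int8_t)q8;
        sum += q8;
    }
    for (int i = 0; i < QK / 2; i++) {
        x.qs[i] = (uint8_t)(((i * 3) % 16) | (((i * 5 + 1) % 16) << 4));
    }
    y1.s = y1.d * (float)sum;
    float diff = dot_sub8(&x, &y0) - dot_with_sum(&x, &y1);
    return diff < 0.0f ? -diff : diff;
}
```

Variant A needs no extra storage per block, which is the point made above about not spending cycles and memory on sums that go unused.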

ggerganov · 2023-04-25T19:11:07Z

@sw There is no noticeable difference between the two. Still, I changed it to use Q8_0 as suggested.

sw · 2023-04-25T19:26:44Z

I guess it's not finished? You're using block_q8_1 in ggml_vec_dot_q4_0_q8_0; it just happens to work but doesn't do what it should. Maybe we need a field in quantize_fns to indicate the quantization type for the dot product, which can then be used instead of hard-coding GGML_TYPE_SIZE[GGML_TYPE_Q8_1] etc.
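One way to picture the suggested field is a per-type table that records which quantization type the other operand of the dot product must use, so that buffer sizes are looked up rather than hard-coded. The sketch below is hypothetical: the enum, the table contents, the block sizes and the helper `scratch_bytes` are invented for illustration and do not reproduce `quantize_fns` or `GGML_TYPE_SIZE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down type enum; the real one has more entries. */
enum qtype { QT_Q4_0, QT_Q4_1, QT_Q8_0, QT_Q8_1, QT_COUNT };

/* Bytes per 32-value block (illustrative numbers, assuming float scales). */
static const size_t QT_BLOCK_SIZE[QT_COUNT] = {
    [QT_Q4_0] = sizeof(float) + 16,
    [QT_Q4_1] = 2 * sizeof(float) + 16,
    [QT_Q8_0] = sizeof(float) + 32,
    [QT_Q8_1] = 3 * sizeof(float) + 32,
};

/* Per-type entry: only the field discussed here is shown. */
typedef struct {
    const char *name;
    enum qtype  vec_dot_type; /* type the other operand is quantized to */
} quant_fns;

/* The mapping is an assumption for the example. */
static const quant_fns QUANT_FNS[QT_COUNT] = {
    [QT_Q4_0] = { "q4_0", QT_Q8_0 },
    [QT_Q4_1] = { "q4_1", QT_Q8_1 },
    [QT_Q8_0] = { "q8_0", QT_Q8_0 },
    [QT_Q8_1] = { "q8_1", QT_Q8_1 },
};

/* Scratch bytes needed to quantize a row of n floats for a dot product with `weights`. */
static size_t scratch_bytes(enum qtype weights, size_t n) {
    assert((int)weights >= 0 && (int)weights < QT_COUNT && n % 32 == 0);
    enum qtype t = QUANT_FNS[weights].vec_dot_type;
    return (n / 32) * QT_BLOCK_SIZE[t];
}
```

With a table like this, a caller asks for `scratch_bytes(type, n)` and never names `Q8_0` or `Q8_1` directly, so switching a weight type to a different dot-product partner is a one-line change in the table.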

ggerganov · 2023-04-25T19:51:13Z

Wow - this is difficult 😄 I keep messing up something

ggml.h

ggml.c

sw · 2023-04-25T20:34:22Z

Looks good now; I think it's very slightly slower for Q4_0 and Q4_2 because we're now missing the SIMD optimizations for quantize_row_q8_0.
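For context, a scalar (non-SIMD) row quantizer of this kind can be sketched as follows. It is an illustration under stated assumptions, not the function from ggml.c: the block layout, the name `quantize_row_q8_ref` and the rounding mode (half away from zero) are chosen for the example. The scale is taken as `amax / 127`, so the quants fall in [-127, 127], i.e. 255 of the 256 representable `int8_t` values.

```c
#include <assert.h>
#include <stdint.h>

#define QK8 32 /* values per block (assumed) */

/* Illustrative layout: one float scale followed by 32 signed 8-bit quants. */
typedef struct { float d; int8_t qs[QK8]; } blk_q8;

/* Quantize k floats (k must be a multiple of QK8) into k / QK8 blocks. */
static void quantize_row_q8_ref(const float *x, blk_q8 *y, int k) {
    assert(k % QK8 == 0);
    for (int b = 0; b < k / QK8; b++) {
        /* Largest magnitude in the block determines the scale. */
        float amax = 0.0f;
        for (int i = 0; i < QK8; i++) {
            float v = x[b * QK8 + i];
            float a = v < 0.0f ? -v : v;
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;              /* symmetric range [-127, 127] */
        const float id = d != 0.0f ? 1.0f / d : 0.0f; /* avoid dividing by zero for all-zero blocks */
        y[b].d = d;
        for (int i = 0; i < QK8; i++) {
            float v = x[b * QK8 + i] * id;
            /* Round to nearest, ties away from zero, without needing libm. */
            y[b].qs[i] = (int8_t)(v + (v >= 0.0f ? 0.5f : -0.5f));
        }
    }
}
```

A SIMD version would compute the same per-block maximum and scaled rounding with vector instructions; the scalar loop is the part that is slower when those paths are missing.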

ggerganov · 2023-04-25T20:39:46Z

Ok, will merge now and we can finish the AVX stuff from master

ggerganov added 2 commits April 25, 2023 21:44

ggml : add Q8_0 quantization format (rename the old one to Q8_1)

f83c321

tests : fix test-quantize-fns

79cfdf5

ggerganov added 2 commits April 25, 2023 22:03

ggml : finalize Q8_0 implementation

d8bf720

ggml : use q4_0_q8_0 and q4_2_q8_0

6496b79

ggerganov added 2 commits April 25, 2023 22:14

ggml : fix Q8_0 dot product bug (ARM)

88618ab

ggml : Q8_0 unroll x2

6e0f0b6

ggerganov added the "generation quality" (Quality of model output) label Apr 25, 2023

ggerganov self-assigned this Apr 25, 2023

ggml : fix bug - using wrong block type

46fc696

ggerganov force-pushed the q8_0 branch from 0cdbc28 to 46fc696 Compare April 25, 2023 19:32

ggml : extend quantize_fns_t with "vec_dot_type"

91bfa51

sw reviewed Apr 25, 2023

ggml.h (resolved)

ggml : fix Q8_0 to use 255 values out of 256

4ddb983

sw reviewed Apr 25, 2023

ggml.c (outdated, resolved)

ggml : fix assert using wrong QK4_2 instead of QK4_3

e8c3731

ggerganov merged commit 7a32fcb into master Apr 25, 2023

ggerganov deleted the q8_0 branch April 25, 2023 20:40

sw mentioned this pull request Apr 26, 2023

Continuous layouts for quantization q4_0c #1073

Closed

4 tasks

mofosyne added the "Tensor Encoding Scheme" (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) and "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) labels May 25, 2024

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
