
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) #1179


Merged
@ggerganov merged 10 commits into master from q8_0 on Apr 25, 2023

Conversation

@ggerganov (Member) commented Apr 25, 2023

8-bit integer quantization support

Perplexity: 5.9563
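
For reference, here is a minimal scalar sketch of the new format, assuming the conventions ggml used at the time (QK8_0 = 32, one fp32 scale per block); the actual definitions are in the PR diff. Unlike the renamed Q8_1, the block carries no precomputed sums:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// Q8_0: one fp32 scale per block of 32 signed 8-bit quants,
// no precomputed block sums (Q8_1 additionally keeps s0/s1 for Q4_1)
typedef struct {
    float  d;          // delta (scale)
    int8_t qs[QK8_0];  // quants
} block_q8_0;

static void quantize_row_q8_0_ref(const float * x, block_q8_0 * y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max over the block
        for (int j = 0; j < QK8_0; j++) {
            const float v = fabsf(x[i*QK8_0 + j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;      // map [-amax, amax] onto [-127, 127]
        const float id = d ? 1.0f/d : 0.0f;  // guard all-zero blocks

        y[i].d = d;
        for (int j = 0; j < QK8_0; j++) {
            y[i].qs[j] = (int8_t) roundf(x[i*QK8_0 + j] * id);
        }
    }
}
```

At 36 bytes per 32 weights, this costs 9 bits per weight.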

main: seed = 1682448271
llama.cpp: loading model from ../models/7B/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403851.11 KB
llama_model_load_internal: mem required  = 9022.32 MB (+ 1026.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.16 seconds per pass - ETA 23 minutes
[1]4.2384,[2]4.7355,[3]5.5889,[4]6.1733,[5]6.3007, ... ,[651]5.9581,[652]5.9592,[653]5.9631,[654]5.9569,[655]5.9563
llama_print_timings:        load time =  5233.52 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 1351323.55 ms / 335360 tokens (    4.03 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 1385072.75 ms

@sw (Contributor) commented Apr 25, 2023

For AVX2/AVX/scalar, we might want to keep ggml_vec_dot_q4_0_q8_0 and ggml_vec_dot_q4_2_q8_0, so as not to waste cycles and memory for s0 and s1, which aren't used.

I'm actually surprised that they're worth using on ARM NEON, as the alternative is simply subtracting 8 from the Q4 quants.
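
To make the comparison concrete, here is a scalar sketch of the Q4_0 × Q8_0 dot product being discussed, assuming the interleaved nibble layout Q4_0 used at the time (two consecutive quants per byte) and the block_q8_0 sketch above; the real kernels in ggml.c are the SIMD versions:

```c
#define QK4_0 32

typedef struct {
    float   d;            // delta (scale)
    uint8_t qs[QK4_0/2];  // nibbles: quants 2j (low) and 2j+1 (high) per byte
} block_q4_0;

static void vec_dot_q4_0_q8_0_ref(int n, float * s,
                                  const block_q4_0 * x, const block_q8_0 * y) {
    const int nb = n / QK4_0;
    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int sumi = 0; // integer accumulator for one block
        for (int j = 0; j < QK4_0/2; j++) {
            const uint8_t b  = x[i].qs[j];
            const int     v0 = (b & 0x0F) - 8; // "subtracting 8 from the Q4 quants"
            const int     v1 = (b >>   4) - 8;
            sumi += v0 * y[i].qs[2*j + 0]
                  + v1 * y[i].qs[2*j + 1];
        }
        sumf += x[i].d * y[i].d * (float) sumi; // one scale product per block
    }
    *s = sumf;
}
```

Since Q8_0 carries no block sums, nothing here touches s0/s1, which is exactly why the Q8_1 variants waste cycles for Q4_0/Q4_2.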

@ggerganov (Member, Author) commented

@sw there is no noticeable difference between the two. Still, changed to use Q8_0 as suggested.

@ggerganov added the "generation quality" (Quality of model output) label on Apr 25, 2023
@ggerganov self-assigned this on Apr 25, 2023
@sw (Contributor) commented Apr 25, 2023

I guess it's not finished? You're using block_q8_1 in ggml_vec_dot_q4_0_q8_0; it just happens to work but doesn't do what it should. Maybe we need a field in quantize_fns to indicate the quantization type for the dot product, which can then be used instead of hard-coding GGML_TYPE_SIZE[GGML_TYPE_Q8_1] etc.
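
A hypothetical sketch of that suggestion, using roughly the quantize_fns_t shape ggml had at the time (the typedef names are approximate, and enum ggml_type is assumed from ggml.h): each entry advertises the type its dot-product kernel expects for the second operand, so call sites can look the size up generically.

```c
typedef void (*dequantize_row_q_t)(const void * x, float * y, int k);
typedef void (*quantize_row_q_t)  (const float * x, void * y, int k);
typedef void (*vec_dot_q_t)       (int n, float * s, const void * x, const void * y);

typedef struct {
    dequantize_row_q_t dequantize_row_q;
    quantize_row_q_t   quantize_row_q;
    quantize_row_q_t   quantize_row_q_dot; // how to quantize the activations
    vec_dot_q_t        vec_dot_q;
    enum ggml_type     vec_dot_type;       // e.g. GGML_TYPE_Q8_0 for Q4_0/Q4_2,
                                           //      GGML_TYPE_Q8_1 for Q4_1
} quantize_fns_t;

// call sites would then use
//     GGML_TYPE_SIZE[quantize_fns[type].vec_dot_type]
// instead of hard-coding GGML_TYPE_SIZE[GGML_TYPE_Q8_1]
```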

@ggerganov (Member, Author) commented

Wow - this is difficult 😄 I keep messing something up

@sw (Contributor) commented Apr 25, 2023

Looks good now; I think it's very slightly slower for Q4_0 and Q4_2 because we're now missing the SIMD optimizations for quantize_row_q8_0.
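
For context, here is a rough sketch of what the NEON quantize_row_q8_0 path added in this PR looks like (reusing the block_q8_0 sketch above; see the diff for the exact code); it is this kind of SIMD routine that is still missing on the AVX side:

```c
#include <arm_neon.h>

static void quantize_row_q8_0_neon(const float * x, block_q8_0 * y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        float32x4_t srcv[8];
        float32x4_t amaxv = vdupq_n_f32(0.0f);

        // find the absolute max of the 32 floats, 4 lanes at a time
        for (int l = 0; l < 8; l++) {
            srcv[l] = vld1q_f32(x + i*QK8_0 + 4*l);
            amaxv   = vmaxq_f32(amaxv, vabsq_f32(srcv[l]));
        }
        const float amax = vmaxvq_f32(amaxv); // horizontal max (AArch64)

        const float d  = amax / 127.0f;
        const float id = d ? 1.0f/d : 0.0f;
        y[i].d = d;

        // scale, round to nearest, then narrow to int8
        for (int l = 0; l < 8; l++) {
            const float32x4_t v  = vmulq_n_f32(srcv[l], id);
            const int32x4_t   vi = vcvtnq_s32_f32(v);
            y[i].qs[4*l + 0] = (int8_t) vgetq_lane_s32(vi, 0);
            y[i].qs[4*l + 1] = (int8_t) vgetq_lane_s32(vi, 1);
            y[i].qs[4*l + 2] = (int8_t) vgetq_lane_s32(vi, 2);
            y[i].qs[4*l + 3] = (int8_t) vgetq_lane_s32(vi, 3);
        }
    }
}
```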

@ggerganov (Member, Author) commented

OK, will merge now and we can finish the AVX stuff on master

@ggerganov merged commit 7a32fcb into master on Apr 25, 2023
@ggerganov deleted the q8_0 branch on April 25, 2023 at 20:40
@mofosyne added the "Tensor Encoding Scheme" (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) and "Review Complexity: High" (generally requires in-depth knowledge of LLMs or GPUs) labels on May 25, 2024