ggml : alternative Q4_3 format + implementation #1108

Closed · wants to merge 2 commits

Conversation

ggerganov (Member) commented Apr 21, 2023

#define QK4_3 32
typedef struct {
    ggml_fp16_t d0;        // delta
    ggml_fp16_t d1;        // delta
    ggml_fp16_t m;         // min
    uint8_t qs[QK4_3 / 2]; // nibbles / quants
} block_q4_3;
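
Each block_q4_3 block therefore holds 2 + 2 + 2 + 16 = 22 bytes for 32 weights, i.e. 5.5 bits per weight, versus 24 bytes (6 bits per weight) per 32 weights when each half of 16 carries its own fp16 min as on master. For illustration, here is a minimal dequantization sketch for this layout; it is not the kernel from this PR. The function name is made up, it relies on the struct above and the ggml_fp16_to_fp32() helper from ggml.h, and the assignment of d0/d1 to the two halves and the nibble packing are assumptions.

// Minimal dequantization sketch for the layout above (not this PR's actual
// kernel). Assumptions: d0 scales quants 0..15 and d1 scales quants 16..31,
// the single min m is shared by both halves, and nibbles are packed
// low/high in consecutive pairs as in the other Q4 formats.
static void dequantize_row_q4_3_sketch(const block_q4_3 * restrict x, float * restrict y, int k) {
    const int nb = k / QK4_3;

    for (int i = 0; i < nb; i++) {
        const float d0 = ggml_fp16_to_fp32(x[i].d0);
        const float d1 = ggml_fp16_to_fp32(x[i].d1);
        const float m  = ggml_fp16_to_fp32(x[i].m);

        for (int j = 0; j < QK4_3/2; j++) {
            // bytes 0..7 hold quants 0..15 (delta d0), bytes 8..15 hold quants 16..31 (delta d1)
            const float   d  = j < QK4_3/4 ? d0 : d1;
            const uint8_t vi = x[i].qs[j];

            y[i*QK4_3 + 2*j + 0] = (vi & 0x0F)*d + m; // low nibble
            y[i*QK4_3 + 2*j + 1] = (vi >>   4)*d + m; // high nibble
        }
    }
}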

Running a perplexity test to see how much we lose from having a single min factor in the structure instead of two.

llama_print_timings:      sample time =    56.68 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   448.06 ms /     8 tokens (   56.01 ms per token)
llama_print_timings:        eval time =  3177.30 ms /    63 runs   (   50.43 ms per run)
llama_print_timings:       total time =  3691.84 ms

Perplexity results

Final ppl is 6.1000 (chunk [655])

$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_3.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_3a.txt 
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
main
quantize
quantize-stats
perplexity
embedding
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
ggml.c:1120:13: warning: unused function 'quantize_row_q4_2_reference' [-Wunused-function]
static void quantize_row_q4_2_reference(const float * restrict x, block_q4_2 * restrict y, int k) {
            ^
ggml.c:3243:20: warning: unused function 'ggml_vec_silu_f16' [-Wunused-function]
inline static void ggml_vec_silu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
                   ^
ggml.c:3693:19: warning: unused function 'ggml_up64' [-Wunused-function]
static inline int ggml_up64(int n) {
                  ^
3 warnings generated.
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1682099926
llama.cpp: loading model from ./models/7B/ggml-model-q4_3.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 6 (mostly Q4_3)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6210.95 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
16.06 seconds per pass - ETA 2 hours 55 minutes
[1]4.3839,[2]4.8155,[3]5.6837,[4]6.3070,[5]6.4175,[6]6.3773,[7]6.5560,[8]6.6590,[9]6.9950,[10]7.2612,[11]7.4741,[12]7.5144,[13]7.4453,[14]7.4987,[15]7.7433,[16]7.3598,[17]7.2445,[18]7.2006,[19]6.8344,[20]6.8218,[21]6.7264,[22]6.5580,[23]6.5299,[24]6.4332,[25]6.4411,[26]6.2826,[27]6.1132,[28]6.0182,[29]5.9295,[30]5.7711,[31]5.7404,[32]5.7573,[33]5.7005,[34]5.7339,[35]5.7540,[36]5.7940,[37]5.7947,[38]5.8015,[39]5.8337,[40]5.8829,[41]5.8941,[42]5.9340,[43]5.8936,[44]5.9497,[45]5.9529,[46]5.9245,[47]5.9448,[48]5.9186,[49]5.9190,[50]5.8771,[51]5.8728,[52]5.8612,[53]5.9091,[54]5.8899,[55]5.8689,[56]5.8954,[57]5.9146,[58]5.9330,[59]5.9531,[60]5.9950,[61]5.9845,[62]6.0411,[63]6.0713,[64]6.0859,[65]6.1292,[66]6.1371,[67]6.1552,[68]6.1700,[69]6.1951,[70]6.2242,[71]6.2461,[72]6.2765,[73]6.3322,[74]6.3376,[75]6.3525,[76]6.3656,[77]6.3777,[78]6.3647,[79]6.3915,[80]6.3858,[81]6.4033,[82]6.4088,[83]6.3563,[84]6.3395,[85]6.3286,[86]6.3075,[87]6.2495,[88]6.2262,[89]6.2066,[90]6.1908,[91]6.2149,[92]6.2090,[93]6.2097,[94]6.2072,[95]6.2342,[96]6.2327,[97]6.2272,[98]6.2205,[99]6.2071,[100]6.2056,[101]6.2292,[102]6.2239,[103]6.2428,[104]6.2500,[105]6.2491,[106]6.2660,[107]6.2664,[108]6.2775,[109]6.2710,[110]6.2665,[111]6.2872,[112]6.3072,[113]6.3101,[114]6.3066,[115]6.3120,[116]6.3029,[117]6.3088,[118]6.3367,[119]6.3586,[120]6.3934,[121]6.4090,[122]6.4324,[123]6.4690,[124]6.4858,[125]6.4767,[126]6.5167,[127]6.5534,[128]6.5835,[129]6.5674,[130]6.5765,[131]6.5711,[132]6.5629,[133]6.5503,[134]6.5595,[135]6.5564,[136]6.5448,[137]6.5380,[138]6.5209,[139]6.5096,[140]6.5057,[141]6.4785,[142]6.4748,[143]6.4463,[144]6.4260,[145]6.4186,[146]6.4070,[147]6.4104,[148]6.4099,[149]6.4049,[150]6.4012,[151]6.4034,[152]6.3927,[153]6.3761,[154]6.3673,[155]6.3736,[156]6.3687,[157]6.3854,[158]6.3896,[159]6.3939,[160]6.3958,[161]6.4078,[162]6.3798,[163]6.3676,[164]6.3447,[165]6.3134,[166]6.2866,[167]6.2487,[168]6.2177,[169]6.2039,[170]6.1920,[171]6.1660,[172]6.1485,[173]6.1315,[174]6.1019,[175]6.0799,[176]6.0673,[177]6.0469,[178]6.0237,[179]6.0072,[180]5.9974,[181]5.9754,[182]5.9578,[183]5.9440,[184]5.9428,[185]5.9353,[186]5.9358,[187]5.9424,[188]5.9387,[189]5.9560,[190]5.9572,[191]5.9784,[192]5.9937,[193]6.0107,[194]6.0227,[195]6.0437,[196]6.0597,[197]6.0800,[198]6.0951,[199]6.0981,[200]6.1034,[201]6.0990,[202]6.1175,[203]6.1243,[204]6.1236,[205]6.1342,[206]6.1410,[207]6.1371,[208]6.1457,[209]6.1502,[210]6.1550,[211]6.1649,[212]6.1731,[213]6.1830,[214]6.1861,[215]6.1885,[216]6.2025,[217]6.2197,[218]6.2329,[219]6.2329,[220]6.2288,[221]6.2233,[222]6.2215,[223]6.2120,[224]6.2054,[225]6.2014,[226]6.2217,[227]6.2310,[228]6.2368,[229]6.2430,[230]6.2398,[231]6.2558,[232]6.2438,[233]6.2272,[234]6.2123,[235]6.1945,[236]6.1882,[237]6.1782,[238]6.1805,[239]6.1658,[240]6.1550,[241]6.1569,[242]6.1603,[243]6.1584,[244]6.1476,[245]6.1445,[246]6.1338,[247]6.1223,[248]6.1152,[249]6.1120,[250]6.1169,[251]6.1104,[252]6.1069,[253]6.0976,[254]6.0929,[255]6.0812,[256]6.0634,[257]6.0514,[258]6.0435,[259]6.0413,[260]6.0331,[261]6.0292,[262]6.0236,[263]6.0177,[264]5.9977,[265]5.9975,[266]5.9955,[267]5.9890,[268]5.9979,[269]5.9965,[270]5.9965,[271]6.0041,[272]6.0077,[273]6.0077,[274]6.0099,[275]6.0183,[276]6.0239,[277]6.0396,[278]6.0493,[279]6.0584,[280]6.0613,[281]6.0714,[282]6.0771,[283]6.0921,[284]6.1004,[285]6.1085,[286]6.1211,[287]6.1209,[288]6.1266,[289]6.1185,[290]6.1028,[291]6.0872,[292]6.0719,[293]6.0590,[294]6.0605,[295]6.0590,[296]6.0637,[297]6.0621,[298]6.0652,[299]6.0626,[300]6.0518,[301]6.0513,[302]6.0435,[303]6.0344,[304]6.0256,[305]6.0220,[30
6]6.0097,[307]6.0115,[308]6.0141,[309]5.9983,[310]5.9933,[311]5.9865,[312]5.9891,[313]5.9836,[314]5.9821,[315]5.9666,[316]5.9616,[317]5.9453,[318]5.9253,[319]5.9365,[320]5.9486,[321]5.9529,[322]5.9489,[323]5.9424,[324]5.9396,[325]5.9502,[326]5.9503,[327]5.9525,[328]5.9562,[329]5.9618,[330]5.9649,[331]5.9771,[332]5.9743,[333]5.9811,[334]5.9758,[335]5.9695,[336]5.9731,[337]5.9707,[338]5.9701,[339]5.9650,[340]5.9611,[341]5.9691,[342]5.9720,[343]5.9765,[344]5.9768,[345]5.9771,[346]5.9744,[347]5.9780,[348]5.9815,[349]5.9837,[350]5.9806,[351]5.9810,[352]5.9812,[353]5.9752,[354]5.9763,[355]5.9817,[356]5.9848,[357]5.9816,[358]5.9908,[359]5.9932,[360]5.9904,[361]5.9902,[362]5.9967,[363]6.0077,[364]6.0140,[365]6.0196,[366]6.0213,[367]6.0296,[368]6.0266,[369]6.0275,[370]6.0291,[371]6.0236,[372]6.0284,[373]6.0328,[374]6.0311,[375]6.0310,[376]6.0376,[377]6.0329,[378]6.0353,[379]6.0414,[380]6.0339,[381]6.0305,[382]6.0255,[383]6.0248,[384]6.0244,[385]6.0234,[386]6.0232,[387]6.0228,[388]6.0195,[389]6.0145,[390]6.0079,[391]6.0002,[392]5.9962,[393]5.9946,[394]5.9976,[395]5.9963,[396]5.9888,[397]5.9953,[398]5.9993,[399]6.0073,[400]6.0070,[401]6.0083,[402]6.0095,[403]6.0115,[404]6.0179,[405]6.0087,[406]6.0059,[407]6.0057,[408]6.0076,[409]6.0189,[410]6.0300,[411]6.0413,[412]6.0570,[413]6.0680,[414]6.0760,[415]6.0814,[416]6.0896,[417]6.1016,[418]6.1051,[419]6.1121,[420]6.1210,[421]6.1323,[422]6.1362,[423]6.1432,[424]6.1538,[425]6.1626,[426]6.1691,[427]6.1738,[428]6.1819,[429]6.1872,[430]6.1953,[431]6.2089,[432]6.2129,[433]6.2120,[434]6.2076,[435]6.2087,[436]6.2112,[437]6.2209,[438]6.2284,[439]6.2252,[440]6.2241,[441]6.2192,[442]6.2177,[443]6.2189,[444]6.2197,[445]6.2178,[446]6.2199,[447]6.2233,[448]6.2273,[449]6.2249,[450]6.2257,[451]6.2215,[452]6.2090,[453]6.2008,[454]6.1952,[455]6.1961,[456]6.2011,[457]6.2031,[458]6.2011,[459]6.2017,[460]6.2103,[461]6.2077,[462]6.2066,[463]6.2104,[464]6.2094,[465]6.2069,[466]6.1995,[467]6.2004,[468]6.2002,[469]6.2022,[470]6.2028,[471]6.1983,[472]6.2034,[473]6.1980,[474]6.1995,[475]6.1935,[476]6.1954,[477]6.1887,[478]6.1879,[479]6.1941,[480]6.1984,[481]6.2003,[482]6.1962,[483]6.1922,[484]6.1943,[485]6.1926,[486]6.1871,[487]6.1873,[488]6.1849,[489]6.1802,[490]6.1781,[491]6.1754,[492]6.1699,[493]6.1671,[494]6.1653,[495]6.1650,[496]6.1612,[497]6.1558,[498]6.1540,[499]6.1495,[500]6.1402,[501]6.1338,[502]6.1341,[503]6.1334,[504]6.1246,[505]6.1267,[506]6.1276,[507]6.1223,[508]6.1184,[509]6.1178,[510]6.1210,[511]6.1258,[512]6.1294,[513]6.1313,[514]6.1376,[515]6.1321,[516]6.1312,[517]6.1321,[518]6.1316,[519]6.1346,[520]6.1369,[521]6.1381,[522]6.1409,[523]6.1416,[524]6.1472,[525]6.1506,[526]6.1515,[527]6.1533,[528]6.1483,[529]6.1489,[530]6.1436,[531]6.1421,[532]6.1471,[533]6.1495,[534]6.1478,[535]6.1498,[536]6.1444,[537]6.1421,[538]6.1472,[539]6.1481,[540]6.1517,[541]6.1522,[542]6.1534,[543]6.1550,[544]6.1559,[545]6.1540,[546]6.1550,[547]6.1509,[548]6.1457,[549]6.1457,[550]6.1431,[551]6.1396,[552]6.1374,[553]6.1336,[554]6.1315,[555]6.1284,[556]6.1279,[557]6.1303,[558]6.1267,[559]6.1264,[560]6.1266,[561]6.1270,[562]6.1246,[563]6.1242,[564]6.1284,[565]6.1305,[566]6.1303,[567]6.1283,[568]6.1290,[569]6.1274,[570]6.1303,[571]6.1306,[572]6.1316,[573]6.1313,[574]6.1279,[575]6.1274,[576]6.1271,[577]6.1255,[578]6.1237,[579]6.1240,[580]6.1178,[581]6.1142,[582]6.1134,[583]6.1143,[584]6.1145,[585]6.1070,[586]6.1001,[587]6.1007,[588]6.1054,[589]6.1108,[590]6.1134,[591]6.1154,[592]6.1144,[593]6.1114,[594]6.1123,[595]6.1101,[596]6.1136,[597]6.1114,[598]6.1085,[599]6.1107,[600]6.1103,[601]6.1091,[602]6
.1105,[603]6.1132,[604]6.1141,[605]6.1175,[606]6.1199,[607]6.1184,[608]6.1152,[609]6.1159,[610]6.1196,[611]6.1178,[612]6.1203,[613]6.1166,[614]6.1118,[615]6.1043,[616]6.1070,[617]6.1010,[618]6.0962,[619]6.0906,[620]6.0768,[621]6.0701,[622]6.0685,[623]6.0701,[624]6.0705,[625]6.0706,[626]6.0695,[627]6.0718,[628]6.0719,[629]6.0716,[630]6.0750,[631]6.0806,[632]6.0864,[633]6.0850,[634]6.0884,[635]6.0890,[636]6.0859,[637]6.0825,[638]6.0850,[639]6.0820,[640]6.0830,[641]6.0831,[642]6.0895,[643]6.0918,[644]6.0928,[645]6.0909,[646]6.0952,[647]6.0913,[648]6.0924,[649]6.0926,[650]6.0963,[651]6.1018,[652]6.1029,[653]6.1068,[654]6.1006,[655]6.1000,

llama_print_timings:        load time = 16626.65 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 5703556.31 ms / 335360 tokens (   17.01 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 5738455.43 ms

real	95m38.709s
user	138m57.455s
sys	4m20.864s

For comparison, with the approach on master, we get: 6.0617

This way we always use the same type of instructions across all quantizations.
This one achieves ~50 ms / token on M1 Pro.
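
For completeness, a rough sketch of how a reference quantizer for this layout could look, under the same assumptions as the dequantization sketch above: the min m is taken over the whole block of 32 values and each half of 16 values gets its own Q4_1-style delta. The function name and details are illustrative; the actual quantize routine in this PR may differ.

// Rough reference-quantizer sketch for the layout above (assumptions as in
// the dequantization sketch; not this PR's actual code): one shared min over
// all 32 values, one Q4_1-style delta per half of 16 values.
static void quantize_row_q4_3_sketch(const float * restrict x, block_q4_3 * restrict y, int k) {
    const int nb = k / QK4_3;

    for (int i = 0; i < nb; i++) {
        const float * xb = x + i*QK4_3;

        // shared min over the whole block
        float min = xb[0];
        for (int l = 1; l < QK4_3; l++) {
            if (xb[l] < min) min = xb[l];
        }

        // per-half max -> per-half delta d = (max - min)/15
        float max0 = xb[0], max1 = xb[QK4_3/2];
        for (int l = 1; l < QK4_3/2; l++) {
            if (xb[l]           > max0) max0 = xb[l];
            if (xb[QK4_3/2 + l] > max1) max1 = xb[QK4_3/2 + l];
        }

        const float d0 = (max0 - min)/15.0f;
        const float d1 = (max1 - min)/15.0f;

        const float id0 = d0 != 0.0f ? 1.0f/d0 : 0.0f;
        const float id1 = d1 != 0.0f ? 1.0f/d1 : 0.0f;

        y[i].d0 = ggml_fp32_to_fp16(d0);
        y[i].d1 = ggml_fp32_to_fp16(d1);
        y[i].m  = ggml_fp32_to_fp16(min);

        for (int l = 0; l < QK4_3/2; l++) {
            const float id = l < QK4_3/4 ? id0 : id1;

            int v0 = (int)((xb[2*l + 0] - min)*id + 0.5f);
            int v1 = (int)((xb[2*l + 1] - min)*id + 0.5f);

            // clamp to the 4-bit range and pack two quants per byte
            if (v0 > 15) v0 = 15;
            if (v1 > 15) v1 = 15;

            y[i].qs[l] = (uint8_t)(v0 | (v1 << 4));
        }
    }
}
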
ggerganov (Member, Author) commented:

#1109 looks more promising
