Does SHARK LLM support q4/q8 matrix multiplication? #713

Open

rednoah91 opened this issue Jun 3, 2024 · 1 comment
rednoah91 commented Jun 3, 2024

Hi,
I followed the instructions here to compile a llama model into a .vmfb.
I specified the quantization as 4 bits and the precision as f16, and I got MLIR like:

%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%7, %8, %9 : tensor<2048x44x128xi4>, tensor<2048x44xf16>, tensor<2048x44xf16>) outs(%14 : tensor<2048x44x128xf16>) {
        ^bb0(%in: i4 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:10), %in_0: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:19), %in_1: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:33), %out: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:47)):
          %19 = arith.extui %in : i4 to i32 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17195:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %20 = arith.uitofp %19 : i32 to f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17196:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %21 = arith.subf %20, %in_1 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17197:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %22 = arith.mulf %21, %in_0 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17198:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          linalg.yield %22 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17199:7 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
        } -> tensor<2048x44x128xf16> loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17193:12 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))

It seems the int4 weights are dequantized to f16 and the computation (the matmul) is done in f16.
Does the quantization support quantizing the f16 activations to q4/q8 and computing the matmul in q4/q8, like what llama.cpp does on CPU (the E approach in this article)?

Thanks.
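For illustration, here is a minimal NumPy sketch of the two schemes being contrasted: (1) weight-only int4 quantization, where the weights are dequantized to f16 and the matmul runs in f16 (what the IR above does), versus (2) quantizing the activations as well and running the matmul in integers with an int32 accumulator. The shapes, the per-tensor scale, and the zero point are toy values made up for the example; they are not SHARK's or llama.cpp's actual layout (the tensor<2048x44xf16> operands above suggest per-group scales and zero points).

# Illustrative sketch only -- not SHARK's lowering or llama.cpp's kernels.
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: activation (M x K) in f16, weight (K x N) stored as unsigned 4-bit codes.
M, K, N = 4, 64, 32
act_f16 = rng.standard_normal((M, K)).astype(np.float16)
w_int4 = rng.integers(0, 16, size=(K, N), dtype=np.int32)   # 4-bit codes in 0..15
scale = np.float16(0.05)                                     # toy per-tensor scale
zero_point = np.float16(8.0)                                 # toy zero point

# (1) Weight-only quantization: dequantize int4 -> f16, then matmul entirely in f16.
w_f16 = (w_int4.astype(np.float16) - zero_point) * scale
out_weight_only = act_f16 @ w_f16                            # f16 compute

# (2) Fully quantized compute (W4/W8 with int8 activations): quantize the f16
#     activations at run time, multiply in integers, accumulate in int32, and
#     dequantize the result at the end.
act_scale = np.abs(act_f16).max() / 127.0
act_i8 = np.clip(np.round(act_f16 / act_scale), -128, 127).astype(np.int8)
w_i8 = (w_int4 - 8).astype(np.int8)                          # recenter 4-bit codes
acc_i32 = act_i8.astype(np.int32) @ w_i8.astype(np.int32)    # integer matmul
out_quantized = acc_i32.astype(np.float32) * (act_scale * float(scale))

# The two results agree up to activation-quantization error.
print(np.abs(out_weight_only.astype(np.float32) - out_quantized).max())

The difference the question is asking about is scheme (2): the activations have to be quantized at run time and the accumulation happens in integer arithmetic, rather than dequantizing the weights and staying in f16 throughout.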

@vivekkhandelwal1 (Contributor) commented

Hi @monorimet @AmosLewis @zjgarvey, do you have any info about this query?
