Does SHARK LLM support q4/q8 matrix multiplication? #713

Open

rednoah91 opened this issue Jun 3, 2024 · 1 comment
rednoah91 commented Jun 3, 2024

Hi,
I followed the instructions here to compile a llama model into a .vmfb.
I specified the quantization as 4 bits and the precision as f16, and I got MLIR like:

%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%7, %8, %9 : tensor<2048x44x128xi4>, tensor<2048x44xf16>, tensor<2048x44xf16>) outs(%14 : tensor<2048x44x128xf16>) {
        ^bb0(%in: i4 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:10), %in_0: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:19), %in_1: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:33), %out: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:47)):
          %19 = arith.extui %in : i4 to i32 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17195:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %20 = arith.uitofp %19 : i32 to f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17196:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %21 = arith.subf %20, %in_1 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17197:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %22 = arith.mulf %21, %in_0 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17198:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          linalg.yield %22 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17199:7 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
        } -> tensor<2048x44x128xf16> loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17193:12 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))

It seems the int4 weights are dequantized to f16 and the computation (the matmul) is done in f16.
Does the quantization support quantizing the f16 activations to q4/q8 and computing the matmul in q4/q8, like what llama.cpp does on CPU (the E approach in this article)?

Thanks.
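For illustration, here is a minimal NumPy sketch of the two schemes being contrasted: (1) weight-only int4 quantization, where the weights are dequantized to f16 and the matmul runs in f16 (what the IR above does), versus (2) quantizing the activations as well and running the matmul in integers with an int32 accumulator. The shapes, the per-tensor scale, and the zero point are toy values made up for the example; they are not SHARK's or llama.cpp's actual layout (the tensor<2048x44xf16> operands above suggest per-group scales and zero points).

# Illustrative sketch only -- not SHARK's lowering or llama.cpp's kernels.
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: activation (M x K) in f16, weight (K x N) stored as unsigned 4-bit codes.
M, K, N = 4, 64, 32
act_f16 = rng.standard_normal((M, K)).astype(np.float16)
w_int4 = rng.integers(0, 16, size=(K, N), dtype=np.int32)   # 4-bit codes in 0..15
scale = np.float16(0.05)                                     # toy per-tensor scale
zero_point = np.float16(8.0)                                 # toy zero point

# (1) Weight-only quantization: dequantize int4 -> f16, then matmul entirely in f16.
w_f16 = (w_int4.astype(np.float16) - zero_point) * scale
out_weight_only = act_f16 @ w_f16                            # f16 compute

# (2) Fully quantized compute (W4/W8 with int8 activations): quantize the f16
#     activations at run time, multiply in integers, accumulate in int32, and
#     dequantize the result at the end.
act_scale = np.abs(act_f16).max() / 127.0
act_i8 = np.clip(np.round(act_f16 / act_scale), -128, 127).astype(np.int8)
w_i8 = (w_int4 - 8).astype(np.int8)                          # recenter 4-bit codes
acc_i32 = act_i8.astype(np.int32) @ w_i8.astype(np.int32)    # integer matmul
out_quantized = acc_i32.astype(np.float32) * (act_scale * float(scale))

# The two results agree up to activation-quantization error.
print(np.abs(out_weight_only.astype(np.float32) - out_quantized).max())

The difference the question is asking about is scheme (2): the activations have to be quantized at run time and the accumulation happens in integer arithmetic, rather than dequantizing the weights and staying in f16 throughout.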

@vivekkhandelwal1 (Contributor) commented

Hi @monorimet @AmosLewis @zjgarvey, do you have any info about this query?
