
INT8 quantization support #45

Open
casper-hansen opened this issue Sep 11, 2023 · 3 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@casper-hansen (Owner) commented Sep 11, 2023

The motivation for INT8 is to preserve more accuracy while still gaining some inference speed. I experimented with implementing dequantization for INT8, but it ultimately needs more work before it will be usable.

Edit: Implement SmoothQuant instead. Here is a fork of SmoothQuant that supports LLaMA models; integrate it into AutoAWQ: https://github.com/AniZpZ/smoothquant/tree/llama-dev

__device__ uint2 dequantize_s8_to_fp16x2(uint32_t const& source)
{
    // Adapted from https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h#L54
    // The four int8 values packed in `source` become four fp16 values,
    // returned as two fp16x2 registers (uint2).
    uint2 result;

    uint32_t*      h   = reinterpret_cast<uint32_t*>(&result);
    uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);

    // Casper: Original was 0x64646464 = {1124, 1124}
    // Optimize to 0x64806480 = {1152, 1152} because 1152 is divisible by 8, 16, 32, 64, 128
    // NOTE: Test out {1280, 1280} since it's also divisible by 256
    static constexpr uint32_t mask_for_elt_01     = 0x5250;
    static constexpr uint32_t mask_for_elt_23     = 0x5351;
    static constexpr uint32_t start_byte_for_fp16 = 0x64806480;
    asm volatile("prmt.b32 %0,%1,%2,%3;\n" : "=r"(h[0]) : "r"(i8s), "n"(start_byte_for_fp16), "n"(mask_for_elt_01));
    asm volatile("prmt.b32 %0,%1,%2,%3;\n" : "=r"(h[1]) : "r"(i8s), "n"(start_byte_for_fp16), "n"(mask_for_elt_23));

    // Lastly, subtract 1152 from each constructed half using fp16 math to recover the signed integer as fp16.
    // Casper: 0x64806480 = {1152, 1152}
    static constexpr uint32_t I8s_TO_F16s_MAGIC_NUM = 0x64806480;
    asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[0]) : "r"(h[0]), "r"(I8s_TO_F16s_MAGIC_NUM));
    asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[1]) : "r"(h[1]), "r"(I8s_TO_F16s_MAGIC_NUM));

    return result;
}
@casper-hansen casper-hansen added enhancement New feature or request help wanted Extra attention is needed labels Sep 11, 2023
@casper-hansen casper-hansen mentioned this issue Sep 11, 2023
@yunfeng-scale commented:

How would you compare this with 8-bit bitsandbytes? I think bitsandbytes has minimal performance loss.

@casper-hansen (Owner, Author) commented:

> How would you compare this with 8-bit bitsandbytes? I think bitsandbytes has minimal performance loss.

It is not implemented yet, so I cannot speak to it.

@casper-hansen (Owner, Author) commented:

#71 is working on INT8 support; there are still things left to implement.
