Naive Run Compressed Pt. 2 #62

Merged (14 commits merged into main, Aug 30, 2024)

Conversation


@Satrat Satrat commented Aug 6, 2024

SUMMARY:
Follow-up PR to neuralmagic/compressed-tensors#109; enables loading compressed models into SparseAutoModel. Each quantized layer is decompressed on the forward pass (see the sketch after the list below).

  • Adds a run_compressed argument to SparseAutoModel
  • Removes structure initialization in QuantizationModifier; it is no longer needed since this now happens on model load
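
The decompress-on-forward behavior can be pictured with a short sketch. The PackedWeight container and DecompressOnForwardLinear wrapper below are hypothetical stand-ins for whatever compressed-tensors stores for a quantized layer; this is not the real API, only an illustration of the pattern.

from dataclasses import dataclass
import torch

@dataclass
class PackedWeight:
    values: torch.Tensor   # quantized values, e.g. int8 / fp8
    scale: torch.Tensor    # per-channel or per-tensor quantization scale

class DecompressOnForwardLinear(torch.nn.Module):
    """Keeps the weight in compressed form and only materializes a dense
    fp16 weight for the duration of each forward call."""

    def __init__(self, packed: PackedWeight, bias: torch.Tensor | None = None):
        super().__init__()
        self.packed = packed
        self.bias = bias

    def decompress(self) -> torch.Tensor:
        # Reconstruct a dense fp16 weight from the packed values and scales
        return self.packed.values.to(torch.float16) * self.packed.scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.decompress()  # dense copy lives only for this call
        return torch.nn.functional.linear(x, weight, self.bias)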

TEST PLAN:
Manual example below; integration tests will follow once the compressed-tensors branch merges.

from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
import torch

model_dir = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-compressed"
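# run_compressed=True keeps the quantized weights in their compressed format;
# each layer is decompressed on the forward pass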
model = SparseAutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto", run_compressed=True)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
sample_input = ["I love 8 bit quantization because"]
inputs = tokenizer(sample_input, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_length=50)
outputs = tokenizer.batch_decode(generated_ids)
print(outputs)

Output: ["<|begin_of_text|>I love 8 bit quantization because it's a great way to reduce the precision of floating point numbers and make them more compact. It's also a great way to make them more robust against noise and quantization errors.\n\nHere's a simple example of how you can implement 8 bit quantization in Python:\n```\nimport numpy as np\n\ndef quantize(x, bits=8):\n x_min = np.min(x)\n x_max = np.max(x)\n scale = "]

Runtime: ~26 sec on two A4000s, vs. ~65 sec for the non-compressed version

@Satrat Satrat changed the title [Don't Merge Yet] Naive Run Compressed Pt. 2 Naive Run Compressed Pt. 2 Aug 7, 2024
@Satrat Satrat marked this pull request as ready for review August 12, 2024 15:53
bfineran previously approved these changes Aug 19, 2024
@Satrat Satrat merged commit 8187914 into main Aug 30, 2024
7 checks passed
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024
* small fixes

* initial commit

* bug fixes

* cleanup

* clarity comments

* clean up compression classes

* fixing zero point issues

* comment for hack

* update quant check

* cleanup fp8 dtypes

* cleanup

* clean up observer

* dtype fix

* docstrings

* fixes after rebase

* test fixes

* style

* get rid of broken segment

* fix broken code