support autoTP with weight only quantization in DS inference path #4750
base: master
Conversation
@ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124
It would be better to detect custom kernel existence by checking attributes of the loaded ops and calling the custom kernel accordingly, so any accelerator that implements these kernels can be plugged in.
ds_output = pipe(query, **inf_kwargs)
#print(local_rank, "baseline", bs_output)
print(local_rank, "deepspeed", ds_output)
Hi @ftian1, I have run this test, but the result I got is 'deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]'. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.
@baodii may I know which device you are running on? cuda or cpu?
@ftian1 Is the usage of WoQ with AutoTP similar to its usage with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
@loadams I have resolved the merge conflicts. Please check.
Signed-off-by: Feng Tian <feng.tian@intel.com>
This PR makes weight-only quantization work with autoTP.
Sample code is shown below.
This way, users can enable WOQ on multiple cards.
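The PR's actual sample code is not included above, so here is a hedged sketch of what combining AutoTP with weight-only quantization might look like. `deepspeed.init_inference` with `mp_size` and `replace_with_kernel_inject=False` is the standard AutoTP entry point; the model name and the `weight_quantization` config keys are assumptions modeled on `test_intX_quantization.py`, not the PR's exact sample.

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Assumed model; any HF causal LM supported by AutoTP would do.
world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.bfloat16
)

# Weight-only quantization config; the key names below are assumptions
# based on the tests in test_intX_quantization.py.
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            "fc": {"num_bits": 4, "group_size": 32, "group_dim": 1, "symmetric": False},
        }
    }
}

# AutoTP path: shard the model across the launched ranks without
# kernel injection, with WOQ applied via the quantization config.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
    quant=ds_config,  # assumed wiring; see deepspeed/inference/quantization for the actual path
)
```

Run under the `deepspeed` launcher (e.g. `deepspeed --num_gpus 2 sample.py`) so `WORLD_SIZE` is set and the quantized weights are sharded across cards.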