[Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA #10450
Conversation
jeejeelee commented Nov 19, 2024 (edited by github-actions bot)
- Add comments for better understanding
- Add some LoRA tests to the distributed tests (refer to [Bugfix] Fix fully sharded LoRA bug #10352 (comment))
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
tests/lora/test_chatglm3_tp.py (Outdated)
    output2 = do_sample(llm, chatglm3_lora_files, lora_id=2)
    for i in range(len(expected_lora_output)):
        assert output2[i] == expected_lora_output[i]
    cleanup_dist_env_and_memory()
@DarkLight1337 I want to test TP with this model, but I can't even get it to pass locally. Could you help me check what might be wrong with my implementation? Thanks.
I'm not familiar with the layer implementation of MergedColumnParallelLinear. Maybe you can print out the slice indices and see if they make sense. Make sure there is no accidental overlap between left_weight and right_weight.
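For intuition, here is a hedged sketch of that check. The helper below is hypothetical (not the vLLM implementation): it computes the per-rank row slices of a LoRA B weight for a merged column-parallel layer and verifies that the left and right sub-slices come from disjoint row ranges.

```python
# Hypothetical sketch, not vLLM's actual code: compute the per-rank row
# slices of a merged column-parallel weight and check for overlap.
def shard_slices(output_sizes, tp_rank, tp_size):
    """Return one (start, end) row slice per merged sub-matrix."""
    slices = []
    offset = 0
    for size in output_sizes:
        shard = size // tp_size
        start = offset + tp_rank * shard
        slices.append((start, start + shard))
        offset += size
    return slices

# Two merged projections of 8 rows each, tensor-parallel size 2.
for rank in range(2):
    (l0, l1), (r0, r1) = shard_slices([8, 8], rank, 2)
    # left_weight and right_weight must not share any rows
    assert l1 <= r0, f"rank {rank}: left/right slices overlap"
    print(f"rank {rank}: left rows [{l0}:{l1}), right rows [{r0}:{r1})")
```

Printing the slices this way per rank makes an accidental overlap immediately visible.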
For this test script, even with 4 GPUs, the test still fails. The main issue is that the test keeps hanging. Outside of the unit tests, I've verified that TP=1/2/4 works correctly.
How are you running the tests?
Does it only fail for the TP4 test?
Just run `pytest test_chatglm3_tp.py`
You might have to run the tests one at a time. @youkaichao may have more insights regarding this.
Okay, I will try
        return lora_b

    def slice_bias(self, bias: torch.Tensor) -> torch.Tensor:
        # TODO: Fix the slicing logic of bias.
Maybe it's because you haven't implemented this yet?
No, I don't plan to address the bias slicing logic in this PR. I have doubts about the bias implementation, as I haven't fully understood it yet.
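For reference, one plausible shape of that logic, written as an assumption about how it might work rather than this PR's code: in a column-parallel layer the bias is partitioned along the output dimension the same way as the weight rows, so each tensor-parallel rank would keep one contiguous chunk.

```python
# Hypothetical sketch, not the PR's implementation: slice a 1-D bias the
# same way the output rows are sharded in a column-parallel layer.
def slice_bias(bias, tp_rank, tp_size):
    shard = len(bias) // tp_size
    start = tp_rank * shard
    return bias[start:start + shard]

print(slice_bias(list(range(8)), 1, 2))  # rank 1 keeps [4, 5, 6, 7]
```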
If it's about the test hanging, maybe some of the workers failed to call all_gather.
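As a loose illustration of that failure mode, here is a stdlib analogy (hypothetical, not vLLM's collective ops): a collective such as all_gather behaves like a barrier, so if one worker never enters the call, the workers that did enter block indefinitely, which shows up as a hanging test.

```python
import threading

# A collective op acts like a barrier: every worker must enter the call.
barrier = threading.Barrier(parties=2)

def worker():
    barrier.wait(timeout=2.0)  # both workers arrive, so this returns

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all workers reached the collective")

# Simulate one worker skipping the collective (e.g. it crashed earlier):
lonely = threading.Barrier(parties=2)
try:
    lonely.wait(timeout=0.5)  # only one participant ever arrives
except threading.BrokenBarrierError:
    print("detected a worker that never called the collective")
```

In the real distributed case there is no timeout by default, so the missing call manifests as an indefinite hang rather than an exception.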
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
@DarkLight1337 I can now complete multi-GPU unit tests locally. Could you tell me where I should add multi-GPU LoRA tests? Is it correct here?
Yeah, you should update that file.
Thanks, could you please look at this PR again?
Can you make these tests run with the existing 4 GPU tests?
Done, please take a look again. Thanks for your hard work!
Otherwise looks good, thanks for the fix!
Head branch was pushed to by a user without write access
#10581 should fix the CI failure, please merge from main again.
Can you check whether the examples test failure is related to this PR?
I'm out right now; I'll check later.
It seems the example tests triggered in CI are not related to this PR. Could you tell me which example test failed? I can try to reproduce it locally.
See here for detailed logs: https://buildkite.com/vllm/ci-aws/builds/11689#01935755-0382-4a69-8f86-413cdb8a12c0
Ah, I actually already looked at it, but I couldn't figure out which example failed; that's why I'm asking you.
It seems to be related to https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml#L192, is that right?
Seems like it.
I can reproduce this failure on this PR branch; I'm investigating the root cause now.
I can also reproduce this failure on the latest main branch.
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>