
[Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA #10450

Merged: 25 commits merged into vllm-project:main from jeejeelee:fix-merged-linear-lora on Nov 24, 2024

Conversation

jeejeelee (Collaborator) commented Nov 19, 2024:

jeejeelee marked this pull request as draft November 19, 2024 14:15
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

jeejeelee marked this pull request as ready for review November 22, 2024 05:12
output2 = do_sample(llm, chatglm3_lora_files, lora_id=2)
for i in range(len(expected_lora_output)):
assert output2[i] == expected_lora_output[i]
cleanup_dist_env_and_memory()
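(For context, a hedged sketch of the do_sample helper the excerpt above relies on. The signature matches the excerpt; the body, prompts, and sampling parameters are assumptions based on vLLM's usual LoRA test pattern, not the PR's exact code:)

# Hedged reconstruction -- not the PR's exact helper. Assumes vLLM's usual
# LoRA test pattern: generate greedily with a LoRARequest and return raw text.
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

def do_sample(llm, lora_path: str, lora_id: int) -> list[str]:
    prompts = ["..."]  # the real test's prompts are elided here
    sampling_params = SamplingParams(temperature=0, max_tokens=32)
    outputs = llm.generate(
        prompts,
        sampling_params,
        # in this sketch, lora_id doubles as the name and the integer id
        lora_request=LoRARequest(str(lora_id), lora_id, lora_path),
    )
    return [output.outputs[0].text for output in outputs]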
jeejeelee (Collaborator, Author) commented:
@DarkLight1337 I want to test TP with this model, but I can't even get it to pass locally. Could you help me check what might be wrong with my implementation? Thanks.

DarkLight1337 (Member) commented Nov 22, 2024:

I'm not familiar with the layer implementation of MergedColumnParallelLinear. Maybe you can print out the slice indices and see if they make sense. Make sure there is no accidental overlap between left_weight and right_weight.
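(For illustration, a minimal sketch of that check. The tp_rank/tp_size values, the output size, and the variable names are assumptions, not the PR's code; the point is that each half of the merged lora_b must be narrowed independently per TP rank:)

import torch

# Printing the column ranges per rank makes any accidental overlap
# between the two merged halves obvious.
tp_rank, tp_size = 1, 4        # example values (assumption)
output_size = 11008            # per-sub-matrix output dim (assumption)
shard_size = output_size // tp_size
start = tp_rank * shard_size
end = start + shard_size

lora_b = torch.randn(16, 2 * output_size)  # rank-16 LoRA B holding two merged halves
left_weight = lora_b[:, start:end]
right_weight = lora_b[:, output_size + start:output_size + end]
print(f"left cols  [{start}, {end})")
print(f"right cols [{output_size + start}, {output_size + end})")
assert end <= output_size  # the left slice must never spill into the right half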

jeejeelee (Collaborator, Author) commented:

For this test script, even with 4 GPUs the test still fails; the main issue is that it keeps hanging. Setting the unit tests aside, I've verified that TP=1/2/4 works correctly.

DarkLight1337 (Member) commented:

How are you running the tests?

DarkLight1337 (Member) commented:

Does it only fail for the TP4 test?

jeejeelee (Collaborator, Author) commented:

Just run

pytest  test_chatglm3_tp.py

DarkLight1337 (Member) commented Nov 22, 2024:

You might have to run the tests one at a time. @youkaichao may have more insights regarding this.
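(For example, pytest's -k filter selects a single test per invocation; the test name below is hypothetical:)

pytest test_chatglm3_tp.py -k test_chatglm3_lora_tp4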

jeejeelee (Collaborator, Author) commented:

Okay, I will try.

return lora_b

def slice_bias(self, bias: torch.Tensor) -> torch.Tensor:
# TODO: Fix the slicing logic of bias.
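(The PR deliberately leaves this as a TODO. Purely for intuition, a column-parallel bias shard would look roughly like the following sketch; slice_bias_sketch and the tp_rank/tp_size plumbing are hypothetical, not the PR's implementation:)

import torch

def slice_bias_sketch(bias: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    # For a column-parallel layer the bias lives on the output dimension,
    # so each TP rank keeps its own contiguous shard. Illustration only.
    shard_size = bias.shape[0] // tp_size
    start = tp_rank * shard_size
    return bias[start:start + shard_size]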
DarkLight1337 (Member) commented:
Maybe it's because you haven't implemented this yet?

jeejeelee (Collaborator, Author) commented:
No, I don't plan to address the bias slicing logic in this PR. I have doubts about the bias implementation, as I haven't fully understood it yet.

DarkLight1337 (Member) commented Nov 22, 2024:

If it's about the test hanging, maybe some of the workers failed to call all_gather.
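(A minimal, illustrative sketch of that failure mode; this is standalone torch.distributed code, not vLLM's. Collectives block until every rank in the group calls them, so a worker that stalls before the call leaves the others hanging:)

import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    if rank == 1:
        time.sleep(3600)  # simulate a worker stuck before the collective
        return
    out = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(out, torch.ones(1))  # rank 0 blocks here indefinitely

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # the script hangs rather than failing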

jeejeelee and others added 5 commits November 22, 2024 13:25
jeejeelee (Collaborator, Author) commented:

@DarkLight1337 I can now complete the multi-GPU unit tests locally. Could you tell me where I should add multi-GPU LoRA tests? Is here the right place?

DarkLight1337 (Member) commented:

Yeah, you should update that file.

mergify bot added the ci/build label Nov 22, 2024
jeejeelee (Collaborator, Author) commented:

> Yeah, you should update that file.

Thanks. Could you please look at this PR again?

DarkLight1337 (Member) commented:

Can you make these tests run with the existing 4 GPU tests?

jeejeelee (Collaborator, Author) commented:

> Can you make these tests run with the existing 4 GPU tests?

Done. Please take another look; thanks for your hard work.

DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 22, 2024
DarkLight1337 (Member) left a review comment:

Otherwise looks good, thanks for the fix!

DarkLight1337 enabled auto-merge (squash) November 22, 2024 15:26
auto-merge was automatically disabled November 22, 2024 16:24 (head branch was pushed to by a user without write access)

DarkLight1337 (Member) commented:

#10581 should fix the CI failure; please merge from main again.

DarkLight1337 (Member) commented:

Can you check whether the examples test failure is related to this PR?

jeejeelee (Collaborator, Author) commented:

I'm out right now; I'll check later.

jeejeelee (Collaborator, Author) commented Nov 23, 2024:

> Can you check whether the examples test failure is related to this PR?

The example tests triggered in CI don't seem related to this PR. Could you tell me which example test failed? I can try to reproduce it locally.

DarkLight1337 (Member) commented:

See here for detailed logs: https://buildkite.com/vllm/ci-aws/builds/11689#01935755-0382-4a69-8f86-413cdb8a12c0

jeejeelee (Collaborator, Author) commented:

Ah, I had actually already looked at it, but I couldn't figure out which example failed; that's why I'm asking you.

jeejeelee (Collaborator, Author) commented:

DarkLight1337 (Member) commented:

Seems like it.

jeejeelee (Collaborator, Author) commented:

> Seems like it.

I can reproduce this failure on this PR branch; I am investigating the root cause now.

jeejeelee (Collaborator, Author) commented:

I can also reproduce this failure on the latest main branch.

youkaichao merged commit 1700c54 into vllm-project:main Nov 24, 2024 (64 of 71 checks passed).
jeejeelee deleted the fix-merged-linear-lora branch November 24, 2024 01:59.
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 28, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
anko-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Feb 12, 2025
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
3 participants