
[Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA #10450

Merged: 25 commits merged into vllm-project:main from jeejeelee:fix-merged-linear-lora on Nov 24, 2024

Conversation

jeejeelee (Collaborator) commented Nov 19, 2024:

jeejeelee marked this pull request as draft November 19, 2024 14:15
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

jeejeelee marked this pull request as ready for review November 22, 2024 05:12
output2 = do_sample(llm, chatglm3_lora_files, lora_id=2)
for i in range(len(expected_lora_output)):
assert output2[i] == expected_lora_output[i]
cleanup_dist_env_and_memory()
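(For context, a hedged sketch of the do_sample helper the excerpt above relies on. The signature matches the excerpt; the body, prompts, and sampling parameters are assumptions based on vLLM's usual LoRA test pattern, not the PR's exact code:)

# Hedged reconstruction -- not the PR's exact helper. Assumes vLLM's usual
# LoRA test pattern: generate greedily with a LoRARequest and return raw text.
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

def do_sample(llm, lora_path: str, lora_id: int) -> list[str]:
    prompts = ["..."]  # the real test's prompts are elided here
    sampling_params = SamplingParams(temperature=0, max_tokens=32)
    outputs = llm.generate(
        prompts,
        sampling_params,
        # in this sketch, lora_id doubles as the name and the integer id
        lora_request=LoRARequest(str(lora_id), lora_id, lora_path),
    )
    return [output.outputs[0].text for output in outputs]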
jeejeelee (Collaborator, Author) commented:
@DarkLight1337 I want to test TP with this model, but I can't even get it to pass locally. Could you help me check what might be wrong with my implementation? Thanks.

DarkLight1337 (Member) commented Nov 22, 2024:

I'm not familiar with the layer implementation of MergedColumnParallelLinear. Maybe you can print out the slice indices and see if they make sense. Make sure there is no accidental overlap between left_weight and right_weight.
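(For illustration, a minimal sketch of that check. The tp_rank/tp_size values, the output size, and the variable names are assumptions, not the PR's code; the point is that each half of the merged lora_b must be narrowed independently per TP rank:)

import torch

# Printing the column ranges per rank makes any accidental overlap
# between the two merged halves obvious.
tp_rank, tp_size = 1, 4        # example values (assumption)
output_size = 11008            # per-sub-matrix output dim (assumption)
shard_size = output_size // tp_size
start = tp_rank * shard_size
end = start + shard_size

lora_b = torch.randn(16, 2 * output_size)  # rank-16 LoRA B holding two merged halves
left_weight = lora_b[:, start:end]
right_weight = lora_b[:, output_size + start:output_size + end]
print(f"left cols  [{start}, {end})")
print(f"right cols [{output_size + start}, {output_size + end})")
assert end <= output_size  # the left slice must never spill into the right half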

jeejeelee (Collaborator, Author) commented:

For this test script, even with 4 GPUs the test still fails; the main issue is that it keeps hanging. Setting the unit tests aside, I've verified that TP=1/2/4 works correctly.

DarkLight1337 (Member) commented:

How are you running the tests?

DarkLight1337 (Member) commented:

Does it only fail for the TP4 test?

jeejeelee (Collaborator, Author) commented:

Just run

pytest  test_chatglm3_tp.py

DarkLight1337 (Member) commented Nov 22, 2024:

You might have to run the tests one at a time. @youkaichao may have more insights regarding this.
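(For example, pytest's -k filter selects a single test per invocation; the test name below is hypothetical:)

pytest test_chatglm3_tp.py -k test_chatglm3_lora_tp4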

jeejeelee (Collaborator, Author) commented:

Okay, I will try.

return lora_b

def slice_bias(self, bias: torch.Tensor) -> torch.Tensor:
# TODO: Fix the slicing logic of bias.
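(The PR deliberately leaves this as a TODO. Purely for intuition, a column-parallel bias shard would look roughly like the following sketch; slice_bias_sketch and the tp_rank/tp_size plumbing are hypothetical, not the PR's implementation:)

import torch

def slice_bias_sketch(bias: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    # For a column-parallel layer the bias lives on the output dimension,
    # so each TP rank keeps its own contiguous shard. Illustration only.
    shard_size = bias.shape[0] // tp_size
    start = tp_rank * shard_size
    return bias[start:start + shard_size]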
DarkLight1337 (Member) commented:
Maybe it's because you haven't implemented this yet?

jeejeelee (Collaborator, Author) commented:
No, I don't plan to address the bias slicing logic in this PR. I have doubts about the bias implementation, as I haven't fully understood it yet.

DarkLight1337 (Member) commented Nov 22, 2024:

If it's about the test hanging, maybe some of the workers failed to call all_gather.
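(A minimal, illustrative sketch of that failure mode; this is standalone torch.distributed code, not vLLM's. Collectives block until every rank in the group calls them, so a worker that stalls before the call leaves the others hanging:)

import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    if rank == 1:
        time.sleep(3600)  # simulate a worker stuck before the collective
        return
    out = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(out, torch.ones(1))  # rank 0 blocks here indefinitely

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # the script hangs rather than failing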

jeejeelee and others added 5 commits November 22, 2024 13:25
jeejeelee (Collaborator, Author) commented:

@DarkLight1337 I can now complete the multi-GPU unit tests locally. Could you tell me where I should add multi-GPU LoRA tests? Is here the right place?

DarkLight1337 (Member) commented:

Yeah, you should update that file.

mergify bot added the ci/build label Nov 22, 2024
jeejeelee (Collaborator, Author) commented:

> Yeah, you should update that file.

Thanks. Could you please look at this PR again?

DarkLight1337 (Member) commented:

Can you make these tests run with the existing 4 GPU tests?

jeejeelee (Collaborator, Author) commented:

> Can you make these tests run with the existing 4 GPU tests?

Done. Please take another look; thanks for your hard work.

DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 22, 2024
DarkLight1337 (Member) left a review comment:

Otherwise looks good, thanks for the fix!

DarkLight1337 enabled auto-merge (squash) November 22, 2024 15:26
auto-merge was automatically disabled November 22, 2024 16:24 (head branch was pushed to by a user without write access)

DarkLight1337 (Member) commented:

#10581 should fix the CI failure; please merge from main again.

DarkLight1337 (Member) commented:

Can you check whether the examples test failure is related to this PR?

jeejeelee (Collaborator, Author) commented:

I'm out right now; I'll check later.

jeejeelee (Collaborator, Author) commented Nov 23, 2024:

> Can you check whether the examples test failure is related to this PR?

The example tests triggered in CI don't seem related to this PR. Could you tell me which example test failed? I can try to reproduce it locally.

DarkLight1337 (Member) commented:

See here for detailed logs: https://buildkite.com/vllm/ci-aws/builds/11689#01935755-0382-4a69-8f86-413cdb8a12c0

jeejeelee (Collaborator, Author) commented:

Ah, I had actually already looked at it, but I couldn't figure out which example failed; that's why I'm asking you.

jeejeelee (Collaborator, Author) commented:

DarkLight1337 (Member) commented:

Seems like it.

jeejeelee (Collaborator, Author) commented:

> Seems like it.

I can reproduce this failure on this PR branch; I am investigating the root cause now.

jeejeelee (Collaborator, Author) commented:

I can also reproduce this failure on the latest main branch.

youkaichao merged commit 1700c54 into vllm-project:main Nov 24, 2024 (64 of 71 checks passed).
jeejeelee deleted the fix-merged-linear-lora branch November 24, 2024 01:59.
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 28, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
anko-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Feb 12, 2025
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
3 participants