
[BOUNTY - $500] Pipeline Parallel Inference #4

Open

AlexCheema opened this issue Jul 15, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@AlexCheema
Contributor

AlexCheema commented Jul 15, 2024

Prerequisite: #1

Motivation: exo should use device resources as efficiently as possible. Current implementation underutilises available resources.

What: See https://pytorch.org/docs/stable/pipeline.html

Reward: $500 Bounty paid out with USDC on Ethereum, email alex@exolabs.net.
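For background, a minimal sketch of the idea (illustrative only; the stage workers and queue layout are assumptions, not exo's actual API): split the model's layers into stages on different devices and stream micro-batches through them, so every stage is busy on some micro-batch instead of idling while the others compute.

```python
# Minimal sketch of pipeline-parallel inference (hypothetical names,
# not exo's actual API). Each stage runs in its own worker and passes
# activations downstream, so micro-batches overlap across stages.
from queue import Queue
from threading import Thread

def stage_worker(layers, inbox, outbox):
    """One pipeline stage: pull activations, apply this stage's layers,
    push the result to the next stage."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: no more micro-batches
            outbox.put(None)
            break
        idx, x = item
        for layer in layers:
            x = layer(x)
        outbox.put((idx, x))

def pipeline_infer(stages, micro_batches):
    """Stream micro-batches through all stages concurrently."""
    queues = [Queue() for _ in range(len(stages) + 1)]
    threads = [Thread(target=stage_worker, args=(s, queues[i], queues[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for idx, mb in enumerate(micro_batches):
        queues[0].put((idx, mb))  # later micro-batches enter stage 0
    queues[0].put(None)           # while earlier ones are downstream
    results = {}
    while (item := queues[-1].get()) is not None:
        idx, out = item
        results[idx] = out
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]
```

With two stages, micro-batch 2 can enter stage 0 while micro-batch 1 is still in stage 1, which is where the utilisation win over naive sequential sharding comes from.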

@AlexCheema AlexCheema changed the title [BOUNTY] Pipeline Parallel Inference [BOUNTY - $500] Pipeline Parallel Inference Jul 15, 2024
@Myestery

I'd like to work on this

@AlexCheema
Contributor Author

> I'd like to work on this

That would be excellent! I can help here and on Discord with any questions / issues you have.

@AlexCheema AlexCheema added the enhancement New feature or request label Jul 18, 2024
@the-alex-b
Contributor

Hi there,

I was taking a look at what it would take to make this work and did some testing. I found that when you start two chat sessions and run inference at the same time, they interfere with each other and tokens from the two sessions bleed into each other. See the last two messages:

[screenshot: two chat sessions running concurrently; tokens from each session bleed into the other's output]

The one on the left hangs after a while; the right one finishes but is also gibberish. Does this reproduce on your end? I think fixing session isolation might need to precede pipeline parallelism.

@AlexCheema
Contributor Author

AlexCheema commented Jul 18, 2024

@the-alex-b Very interesting - you're totally right, we should fix session isolation first. This makes sense, since both sessions would share the same KV caches (inference is stateful).
What we really need is the ability to create multiple instances of the same model that only hold the weights in memory once.

This can still be part of the same bounty.
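To make that direction concrete, here is a rough sketch (all names are hypothetical stand-ins, not exo's actual code): the weights are loaded once and shared, while each session keeps its own KV cache keyed by a session id, so concurrent chats can no longer bleed into each other.

```python
# Rough sketch of session isolation: one shared set of weights,
# one KV cache per session. All names are hypothetical.

def forward(weights, token, cache):
    """Stand-in for the real transformer step: a real implementation
    would append K/V tensors to the cache and return logits."""
    cache = cache + [token]
    logits = [w * len(cache) for w in weights]  # dummy logits
    return logits, cache

class SessionedModel:
    def __init__(self, weights):
        self.weights = weights    # held in memory exactly once
        self.kv_caches = {}       # session_id -> that session's KV cache

    def step(self, session_id, token):
        # Per-session caches keep the stateful part isolated,
        # while the weights stay shared across all sessions.
        cache = self.kv_caches.get(session_id, [])
        logits, cache = forward(self.weights, token, cache)
        self.kv_caches[session_id] = cache
        return logits

model = SessionedModel(weights=[0.1, 0.2])
model.step("session-a", 1)
model.step("session-b", 7)        # does not touch session-a's cache
```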

This was referenced Jul 23, 2024
@varshith15 varshith15 mentioned this issue Aug 12, 2024
@pranav4501
Contributor

Hi @AlexCheema,
Can I work on session isolation?

@AlexCheema
Contributor Author

> Hi @AlexCheema, can I work on session isolation?

Hey @pranav4501
I think @varshith15 is already working on that so best to check with him if you can contribute.

Could you also DM me on Discord so we can find a good task for you? I can update the bounties with something you'd be interested in working on, as there aren't many left now!

@pranav4501
Contributor

pranav4501 commented Aug 20, 2024

Hi @AlexCheema,
I DM'd you on Discord. I will also take a look at the Stable Diffusion bounty.

@moosh3

moosh3 commented Nov 10, 2024

Hello, can we update the GSheet to denote this is taken (if it is, which it seems to be)? cc @AlexCheema [apologies for the pings]

AlexCheema pushed a commit that referenced this issue Dec 6, 2024
@FrostyTheSouthernSnowman

> Prerequisite: #1
>
> Motivation: exo should use device resources as efficiently as possible. Current implementation underutilises available resources.
>
> What: See https://pytorch.org/docs/stable/pipeline.html
>
> Reward: $500 Bounty paid out with USDC on Ethereum, email alex@exolabs.net.

That PyTorch page is giving me a 404. Is the idea here to be able to process multiple separate requests at once, or to have a batch API that accepts multiple requests in one API call?
