
Add rfc for Routing Agent #310

Merged: 29 commits into opea-project:main from barad, Apr 4, 2025

Conversation

haim-barad (Contributor)

RFC for Routing Agent

@haim-barad (Contributor, Author)

Assigned as a feature in #308; please approve the PR.

@mkbhanda (Collaborator) commented Mar 19, 2025

@haim-barad thank you for your proposal. Would you kindly add an alternatives section considering any existing open source projects in this space? Perhaps OPEA can reuse one instead of building from scratch; or, if you have noticed missing features, you might contribute them to a project we could reuse. Some options are listed at https://github.com/Not-Diamond/awesome-ai-model-routing#intelligent-ai-model-routing. Please also consider support for discovering inference endpoints and obtaining metadata about them (perhaps from a model card) to help determine the best match for an incoming request: cost, access latency, query specificity (math-based, healthcare, finance), etc. See https://docs.withmartian.com/martian-model-router, which even mentions migrating away from inference endpoints with degraded performance and integrating new models.

The more detail you can provide upfront at a high level, the more folks can chime in before coding begins.

@haim-barad (Contributor, Author)

Our code is already 95% ready and is based on the open source RouteLLM framework. We also plan to incorporate a semantic router and future features to help route between:

  • Need for retrieval from a data source ("to RAG or not to RAG")
  • CAG vs. RAG (i.e., CAG is appropriate under some conditions)
  • etc.

Our sources cite the work of RouteLLM and others, as appropriately incorporated into our routing agent.
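
For concreteness, here is a rough sketch of how routing looks with a RouteLLM-style controller (the model names and threshold value are illustrative examples, not our final choices; see the RouteLLM docs for the exact API):

```python
# Illustrative sketch following RouteLLM's documented Controller usage;
# model names and the threshold are examples, not our final OPEA choices.
from routellm.controller import Controller

client = Controller(
    routers=["mf"],  # matrix factorization router
    strong_model="gpt-4-1106-preview",
    weak_model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The suffix on the model name is a calibrated cost threshold that controls
# what fraction of traffic the matrix factorization router sends to the
# strong model.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```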

@yinghu5 requested a review from joshuayao March 26, 2025 07:38
@yinghu5 (Collaborator) commented Mar 26, 2025

@haim-barad @mkbhanda @ftian1 thank you very much for addressing the problem. Please help review the RFC. Thank you!

@yinghu5 added the "A0 need to scrub" label Mar 27, 2025
@haim-barad (Contributor, Author)

Is there a reason this PR is still awaiting review? Do we need to submit our code simultaneously (it is ready for a first release)?

@mkbhanda (Collaborator) commented Apr 1, 2025

Our code is already 95% ready and is based on the open source RouteLLM framework. We also plan to incorporate a semantic router and future features to help route between:

  • Need for retrieval from a data source ("to RAG or not to RAG")
  • CAG vs. RAG (i.e., CAG is appropriate under some conditions)
  • etc.

Our sources cite the work of RouteLLM and others, as appropriately incorporated into our routing agent.

@haim-barad what you mention in this conversation is missing from the RFC. Please add it, and I shall approve.

@mkbhanda (Collaborator) commented Apr 1, 2025

There is a DCO issue too.

@mkbhanda (Collaborator) commented Apr 1, 2025

Submitting the code is not necessary.

@haim-barad (Contributor, Author)

A link to RouteLLM has been added to the RFC, and the commits are signed off for DCO. Does it take time for the DCO check to recognize the sign-off?

haim-barad and others added 11 commits April 1, 2025 10:01

  • RFC updates (Signed-off-by: Haim Barad <haim.barad@intel.com>)
  • Create and update index.rst; added "Moving from OpenAI to Opensource using OPEA" blog post (Signed-off-by: chrisahsiong23 <chris.ah-siong@intel.com>, Haim Barad <haim.barad@intel.com>)
  • …ct#332 (Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>, Haim Barad <haim.barad@intel.com>)
  • …ject#321, fixes opea-project#181 (Signed-off-by: Nariman Piroozan, Pallavi Jaini, Soila Kavulya, Shifani Rajabose, Haim Barad; Co-authored-by: Soumyadip Ghosh, Malini Bhandaru)
  • Documentation updates (Signed-off-by: Katherine Druckman <katherine.druckman@intel.com>, Haim Barad <haim.barad@intel.com>; Co-authored-by: Malini Bhandaru)
  • Fix the URL for add_vectorDB.md; minor formatting update to CONTRIBUTING.md (Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>, Haim Barad <haim.barad@intel.com>)
  • (Signed-off-by: Yu Wang <yu.wang6@amd.com>, Haim Barad <haim.barad@intel.com>)
  • [RFC] OPEA Inference Microservices (OIM), with review fixes and OIM operator diagram (Signed-off-by: Sakari Poussa <sakari.poussa@intel.com>, Haim Barad <haim.barad@intel.com>)
xiguiw and others added 3 commits April 1, 2025 10:01

  • Revert "A brief introduction of OPEA in first part" (9342278), "Fefine the format" (fd1e2f2), and "Add build_chatbot_blog" (0126778) (Signed-off-by: Haim Barad <haim.barad@intel.com>)
  • doc: Add emeritus code owners page; remove lines (Signed-off-by: Wang,Le3 <le3.wang@intel.com>, Haim Barad <haim.barad@intel.com>)
  • (Signed-off-by: Haim Barad <haim.barad@intel.com>)
@haim-barad (Contributor, Author)

The DCO check is successful. Please accept.

@eero-t (Contributor) left a comment

What types of input is this particular solution intended for? I.e., which OPEA apps could benefit from it: ChatQnA, AudioQnA, VisualQnA, DocSum...?

@haim-barad (Contributor, Author)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

@eero-t (Contributor) left a comment

it supports text-based prompts and makes decisions based on prompt complexity.

Maybe the RFC could mention that it is for text-based LLM prompts?

@haim-barad force-pushed the barad branch 2 times, most recently from 3cfae1e to c48592a on April 1, 2025 13:54
Signed-off-by: Haim Barad <haim.barad@intel.com>
Signed-off-by: Haim Barad <haim.barad@intel.com>
@haim-barad (Contributor, Author)

The RFC now mentions text-based inputs. Please approve. (I still need 2 more approvals.)

@ashahba (Collaborator) left a comment

LGTM!

@louie-tsai (Contributor) left a comment

Looks good. Looking forward to the PRs.

@poussa (Member) commented Apr 1, 2025

How does this cooperate with K8s routing solutions such as the Gateway API for LLMs and service-level load balancers (e.g., here)? Is this an additional solution on top of those mentioned above, or a replacement?

It is also unclear what use case this solution is solving.

@eero-t (Contributor) commented Apr 1, 2025

The use case is cost optimization: getting better latency with a weaker model/HW by using a cheaper-to-run model when it is deemed sufficient for a given prompt.

However, that implies a potential risk of low or inconsistent reply quality, would probably require better testing than OPEA currently has (real-life prompts), and otherwise smells a bit like premature optimization.

There are, IMHO, other improvements that should be made to OPEA first, both to improve service latency and utilization of the already available HW, and to improve error handling under stress. Adding this kind of routing unconditionally would complicate fixing those.

I don't think that is necessarily a blocker for merging the RFC, though. Merging the implementation of the RFC can be delayed until OPEA is otherwise in good shape performance- and deployment-wise.

@haim-barad (Contributor, Author) commented Apr 1, 2025

The use case is cost optimization: getting better latency with a weaker model/HW by using a cheaper-to-run model when it is deemed sufficient for a given prompt.

However, that implies a potential risk of low or inconsistent reply quality, would probably require better testing than OPEA currently has (real-life prompts), and otherwise smells a bit like premature optimization.

There are, IMHO, other improvements that should be made to OPEA first, both to improve service latency and utilization of the already available HW, and to improve error handling under stress. Adding this kind of routing unconditionally would complicate fixing those.

I don't think that is necessarily a blocker for merging the RFC, though. Merging the implementation of the RFC can be delayed until OPEA is otherwise in good shape performance- and deployment-wise.

I actually have a different take on the optimization:

  1. I look at it as a way to increase capacity in the data center. In fact, there is a lot of interest in running the cheaper models AND the router on an AI PC and then going to the data center when warranted. Or, smaller K8s pods (e.g., Xeon only) can run the weaker models while a range of larger pods (e.g., 8 Gaudis) runs the stronger models. Lots of flexibility.
  2. Risk (i.e., quality) is something that can be measured. The researchers who developed the matrix factorization model quoted 95% accuracy while saving 85% of the computation. Clearly mileage will vary, and the threshold can be adjusted per user requirements. How can we speak for the customers? Some will allow some degradation if the performance benefits warrant it; otherwise, we would disallow quantization for the same reasons. On the other hand, some customers might be very sensitive to quality and accept a more modest performance boost by choosing a more conservative threshold. I believe in giving the customer the tools to make the decision that is best for them.
  3. Regarding "other optimizations": I agree, but I view that argument as orthogonal. Routing selects the appropriate model, and the models themselves can undergo many other optimizations (e.g., model-based ones such as quantization, or dynamic execution such as speculative sampling); the router can then route to LLMs with a full set of optimizations.
  4. Routing would not complicate debugging. Turn it off, or change the threshold to an extreme value to force all queries to a desired target model (if you still want the router agent in the loop); the branching is essentially eliminated, as the sketch below shows.
  5. No argument about optimizing the models themselves so that latency and throughput are improved.

However, I will say that we have developed the features of this router with simplicity in mind. The router is a simple classifier making a decision. It is not nearly as complex as an LLM, and while it does provide benefit, it is not an essential part of the workflow when developing/debugging. Testing of the router can be done independently.
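
To illustrate point 4, here is a minimal sketch of the decision logic (all names are placeholders, not our actual implementation). The router reduces to a single threshold comparison, so pushing the threshold past either end of the score range forces every query to one target:

```python
from dataclasses import dataclass

@dataclass
class RouterConfig:
    # Score at or above which the strong model is chosen; scores lie in [0, 1].
    threshold: float

def route(complexity_score: float, cfg: RouterConfig) -> str:
    """Pick a model pool given the classifier's complexity score for a prompt.

    complexity_score estimates the probability that the weak model is NOT
    good enough for this prompt.
    """
    return "strong" if complexity_score >= cfg.threshold else "weak"

balanced = RouterConfig(threshold=0.5)      # normal operation: traffic splits
all_weak = RouterConfig(threshold=1.01)     # nothing clears the bar: always weak
all_strong = RouterConfig(threshold=-0.01)  # everything clears the bar: always strong
```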

I like the discussion though...

@eero-t (Contributor) commented Apr 1, 2025

Thanks, that was a great response!

I hadn't even considered that it could transparently route queries to another cluster; that's a good edge use case.

(A higher-level alternative, e.g., to using PoCL remote with the oneAPI driver for AI workloads.)

Risk (i.e., quality) is something that can be measured. The researchers who developed the matrix factorization model quoted 95% accuracy while saving 85% of the computation.

Those results depend a lot on what it was tested on, i.e., how much the training prompts differ from the quality-testing prompts. Are those sets available for evaluation?

Regarding "other optimizations" - I agree - but I view that argument as orthogonal

LLM routing could interfere with LLM scale-up routing optimizations like prefix caching: https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/

But if it can be turned off when it does not help, that's OK.

@haim-barad (Contributor, Author)

MT-Bench was used for the accuracy and performance claim. Yes, it is available. But even better would be using quality tools on the customer's actual data: query both models during a testing phase and determine when the weaker model gave a good-enough answer.
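
A sketch of that testing-phase idea (the helper names are hypothetical; judge_score stands in for whatever quality tool or LLM-as-judge is preferred):

```python
def calibrate(prompts, weak_llm, strong_llm, judge_score, good_enough=0.9):
    """Query both models on held-out customer prompts and record, per prompt,
    whether the weak model's answer was good enough relative to the strong one.
    The labels can then drive the choice of routing threshold."""
    labels = []
    for prompt in prompts:
        weak_answer = weak_llm(prompt)
        strong_answer = strong_llm(prompt)
        # Relative scoring: is the weak answer within tolerance of the strong one?
        ratio = judge_score(prompt, weak_answer) / max(judge_score(prompt, strong_answer), 1e-9)
        labels.append(ratio >= good_enough)
    return labels, sum(labels) / len(labels)  # per-prompt labels, weak-OK rate
```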

Currently, we have the matrix factorization model with only the embedding layer updated to use Hugging Face embeddings; this removes the OpenAI dependency and makes it more broadly useful. Additionally, we have a method to fully train our own matrix factorization model on customer data (a future feature), giving even higher-quality decision making.
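
A sketch of what that embedding swap looks like (the embedding model name and weight files are hypothetical; the real parameters come from the trained matrix factorization router):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A local Hugging Face embedding model replaces the OpenAI embeddings API.
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example model

# Hypothetical files holding the trained matrix factorization parameters.
W = np.load("mf_projection.npy")     # maps embedding space -> router latent space
v = np.load("mf_strong_vector.npy")  # latent direction for "needs strong model"

def strong_model_probability(prompt: str) -> float:
    """Estimated probability that the strong model is needed for this prompt."""
    emb = embedder.encode(prompt, normalize_embeddings=True)
    return float(1.0 / (1.0 + np.exp(-(emb @ W @ v))))  # sigmoid of the MF score
```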

I see the merging is blocked; how does this conversation get resolved? I like the back and forth, but I really want to understand whether something is actually blocking the merge.

@ashahba (Collaborator) commented Apr 2, 2025

MT-Bench was used for the accuracy and performance claim. Yes, it is available. But even better would be using quality tools on the customer's actual data: query both models during a testing phase and determine when the weaker model gave a good-enough answer.

Currently, we have the matrix factorization model with only the embedding layer updated to use Hugging Face embeddings; this removes the OpenAI dependency and makes it more broadly useful. Additionally, we have a method to fully train our own matrix factorization model on customer data (a future feature), giving even higher-quality decision making.

I see the merging is blocked; how does this conversation get resolved? I like the back and forth, but I really want to understand whether something is actually blocking the merge.

Agreed!
A healthy conversation to nail down the problem you are trying to solve is always welcome, but at some point we need to find common ground and agree either that the PR is ready to be merged or that it still has too many unknowns.

Currently, all you need is one more gatekeeper to approve your PR, and once that is in place, we can merge it.
But we are getting there 😄

@yinghu5 added this to the v1.5 milestone Apr 2, 2025
@eero-t (Contributor) commented Apr 2, 2025

I see the merging is blocked; how does this conversation get resolved?

I'm not a gatekeeper in this project myself; I'm just reviewing it. That, together with comments from other non-gatekeepers, is just input for the required 2 gatekeeper approvals, i.e., from people with write access (the shield icon in the reviewer list?).

@lkk12014402 (Collaborator)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

Hi @haim-barad, does the routing agent use the GenAIComps agent component? And where do you want to put the routing agent: GenAIComps or GenAIExamples?

@haim-barad (Contributor, Author)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

Hi @haim-barad, does the routing agent use the GenAIComps agent component? And where do you want to put the routing agent: GenAIComps or GenAIExamples?

We plan for the router code to go into GenAIComps and for some examples (Jupyter notebooks) to go into GenAIExamples.

@mkbhanda merged commit e2040df into opea-project:main Apr 4, 2025
4 checks passed
@haim-barad deleted the barad branch April 4, 2025 13:48
Labels: A0 need to scrub