
Add rfc for Routing Agent #310

Merged: 29 commits into opea-project:main from barad, Apr 4, 2025

Conversation

haim-barad (Contributor)

RFC for Routing Agent

@haim-barad (Contributor, Author)

Assigned as a feature in #308; please approve the PR.

@mkbhanda (Collaborator) commented Mar 19, 2025

@haim-barad thank you for your proposal. Would you kindly add an alternatives section considering any existing open source projects in this space? Perhaps OPEA can reuse one instead of building from scratch; or, if you have noticed missing features, you might contribute them to a project we could reuse. Some options are listed at https://github.com/Not-Diamond/awesome-ai-model-routing#intelligent-ai-model-routing. Please also consider support for discovering inference endpoints and obtaining metadata about them (perhaps from a model card) to help determine the best match for an incoming request: cost, access latency, query specificity (math-based, healthcare, finance), etc. See https://docs.withmartian.com/martian-model-router, which even mentions migrating away from inference endpoints with degraded performance and integrating new models.

The more detail you can provide upfront at a high level, the more folks can chime in before coding begins.

@haim-barad (Contributor, Author)

Our code is already 95% ready and is based on the open source RouteLLM framework. We also plan to incorporate a semantic router and future features to help route between:

  • Need for retrieval from a data source ("to RAG or not to RAG")
  • CAG vs. RAG (i.e., CAG is appropriate under some conditions)
  • etc.

Our sources cite the work of RouteLLM and others, as appropriately incorporated into our routing agent.
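
For concreteness, here is a rough sketch of how routing looks with a RouteLLM-style controller (the model names and threshold value are illustrative examples, not our final choices; see the RouteLLM docs for the exact API):

```python
# Illustrative sketch following RouteLLM's documented Controller usage;
# model names and the threshold are examples, not our final OPEA choices.
from routellm.controller import Controller

client = Controller(
    routers=["mf"],  # matrix factorization router
    strong_model="gpt-4-1106-preview",
    weak_model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The suffix on the model name is a calibrated cost threshold that controls
# what fraction of traffic the matrix factorization router sends to the
# strong model.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```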

@yinghu5 requested a review from joshuayao March 26, 2025 07:38
@yinghu5 (Collaborator) commented Mar 26, 2025

@haim-barad @mkbhanda @ftian1 thank you very much for addressing the problem. Please help review the RFC. Thank you!

@yinghu5 added the "A0 need to scrub" label Mar 27, 2025
@haim-barad (Contributor, Author)

Is there a reason this PR is still awaiting review? Do we need to submit our code simultaneously (it is ready for a first release)?

@mkbhanda (Collaborator) commented Apr 1, 2025

Our code is already 95% ready and is based on the open source RouteLLM framework. We also plan to incorporate a semantic router and future features to help route between:

  • Need for retrieval from a data source ("to RAG or not to RAG")
  • CAG vs. RAG (i.e., CAG is appropriate under some conditions)
  • etc.

Our sources cite the work of RouteLLM and others, as appropriately incorporated into our routing agent.

@haim-barad what you mention in this conversation is missing from the RFC. Please add it, and I shall approve.

@mkbhanda (Collaborator) commented Apr 1, 2025

There is a DCO issue too.

@mkbhanda (Collaborator) commented Apr 1, 2025

Submitting the code is not necessary.

@haim-barad (Contributor, Author)

A link to RouteLLM has been added to the RFC, and the commits are signed off for DCO. Does it take time for the DCO check to recognize the sign-off?

haim-barad and others added 11 commits April 1, 2025 10:01

  • RFC updates (Signed-off-by: Haim Barad <haim.barad@intel.com>)
  • Create and update index.rst; added "Moving from OpenAI to Opensource using OPEA" blog post (Signed-off-by: chrisahsiong23 <chris.ah-siong@intel.com>, Haim Barad <haim.barad@intel.com>)
  • …ct#332 (Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>, Haim Barad <haim.barad@intel.com>)
  • …ject#321, fixes opea-project#181 (Signed-off-by: Nariman Piroozan, Pallavi Jaini, Soila Kavulya, Shifani Rajabose, Haim Barad; Co-authored-by: Soumyadip Ghosh, Malini Bhandaru)
  • Documentation updates (Signed-off-by: Katherine Druckman <katherine.druckman@intel.com>, Haim Barad <haim.barad@intel.com>; Co-authored-by: Malini Bhandaru)
  • Fix the URL for add_vectorDB.md; minor formatting update to CONTRIBUTING.md (Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>, Haim Barad <haim.barad@intel.com>)
  • (Signed-off-by: Yu Wang <yu.wang6@amd.com>, Haim Barad <haim.barad@intel.com>)
  • [RFC] OPEA Inference Microservices (OIM), with review fixes and OIM operator diagram (Signed-off-by: Sakari Poussa <sakari.poussa@intel.com>, Haim Barad <haim.barad@intel.com>)
xiguiw and others added 3 commits April 1, 2025 10:01

  • Revert "A brief introduction of OPEA in first part" (9342278), "Fefine the format" (fd1e2f2), and "Add build_chatbot_blog" (0126778) (Signed-off-by: Haim Barad <haim.barad@intel.com>)
  • doc: Add emeritus code owners page; remove lines (Signed-off-by: Wang,Le3 <le3.wang@intel.com>, Haim Barad <haim.barad@intel.com>)
  • (Signed-off-by: Haim Barad <haim.barad@intel.com>)
@haim-barad (Contributor, Author)

The DCO check is successful. Please accept.

@eero-t (Contributor) left a comment

What types of input is this particular solution intended for? I.e., which OPEA apps could benefit from it: ChatQnA, AudioQnA, VisualQnA, DocSum...?

@haim-barad (Contributor, Author)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

@eero-t (Contributor) left a comment

it supports text-based prompts and makes decisions based on prompt complexity.

Maybe the RFC could mention that it is for text-based LLM prompts?

@haim-barad force-pushed the barad branch 2 times, most recently from 3cfae1e to c48592a on April 1, 2025 13:54
Signed-off-by: Haim Barad <haim.barad@intel.com>
Signed-off-by: Haim Barad <haim.barad@intel.com>
@haim-barad (Contributor, Author)

The RFC now mentions text-based inputs. Please approve. (I still need 2 more approvals.)

@ashahba (Collaborator) left a comment

LGTM!

@louie-tsai (Contributor) left a comment

Looks good. Looking forward to the PRs.

@poussa (Member) commented Apr 1, 2025

How does this cooperate with K8s routing solutions such as the Gateway API for LLMs and service-level load balancers (e.g., here)? Is this an additional solution on top of those mentioned above, or a replacement?

It is also unclear what use case this solution is solving.

@eero-t (Contributor) commented Apr 1, 2025

The use case is cost optimization: getting better latency with a weaker model/HW by using a cheaper-to-run model when it is deemed sufficient for a given prompt.

However, that implies a potential risk of low or inconsistent reply quality, would probably require better testing than OPEA currently has (real-life prompts), and otherwise smells a bit like premature optimization.

There are, IMHO, other improvements that should be made to OPEA first, both to improve service latency and utilization of the already available HW, and to improve error handling under stress. Adding this kind of routing unconditionally would complicate fixing those.

I don't think that is necessarily a blocker for merging the RFC, though. Merging the implementation of the RFC can be delayed until OPEA is otherwise in good shape performance- and deployment-wise.

@haim-barad (Contributor, Author) commented Apr 1, 2025

The use case is cost optimization: getting better latency with a weaker model/HW by using a cheaper-to-run model when it is deemed sufficient for a given prompt.

However, that implies a potential risk of low or inconsistent reply quality, would probably require better testing than OPEA currently has (real-life prompts), and otherwise smells a bit like premature optimization.

There are, IMHO, other improvements that should be made to OPEA first, both to improve service latency and utilization of the already available HW, and to improve error handling under stress. Adding this kind of routing unconditionally would complicate fixing those.

I don't think that is necessarily a blocker for merging the RFC, though. Merging the implementation of the RFC can be delayed until OPEA is otherwise in good shape performance- and deployment-wise.

I actually have a different take on the optimization:

  1. I look at it as a way to increase capacity in the data center. In fact, there is a lot of interest in running the cheaper models AND the router on an AI PC and then going to the data center when warranted. Or, smaller K8s pods (e.g., Xeon only) can run the weaker models while a range of larger pods (e.g., 8 Gaudis) runs the stronger models. Lots of flexibility.
  2. Risk (i.e., quality) is something that can be measured. The researchers who developed the matrix factorization model quoted 95% accuracy while saving 85% of the computation. Clearly mileage will vary, and the threshold can be adjusted per user requirements. How can we speak for the customers? Some will allow some degradation if the performance benefits warrant it; otherwise, we would disallow quantization for the same reasons. On the other hand, some customers might be very sensitive to quality and accept a more modest performance boost by choosing a more conservative threshold. I believe in giving the customer the tools to make the decision that is best for them.
  3. Regarding "other optimizations": I agree, but I view that argument as orthogonal. Routing selects the appropriate model, and the models themselves can undergo many other optimizations (e.g., model-based ones such as quantization, or dynamic execution such as speculative sampling); the router can then route to LLMs with a full set of optimizations.
  4. Routing would not complicate debugging. Turn it off, or change the threshold to an extreme value to force all queries to a desired target model (if you still want the router agent in the loop); the branching is essentially eliminated, as the sketch below shows.
  5. No argument about optimizing the models themselves so that latency and throughput are improved.

However, I will say that we have developed the features of this router with simplicity in mind. The router is a simple classifier making a decision. It is not nearly as complex as an LLM, and while it does provide benefit, it is not an essential part of the workflow when developing/debugging. Testing of the router can be done independently.
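
To illustrate point 4, here is a minimal sketch of the decision logic (all names are placeholders, not our actual implementation). The router reduces to a single threshold comparison, so pushing the threshold past either end of the score range forces every query to one target:

```python
from dataclasses import dataclass

@dataclass
class RouterConfig:
    # Score at or above which the strong model is chosen; scores lie in [0, 1].
    threshold: float

def route(complexity_score: float, cfg: RouterConfig) -> str:
    """Pick a model pool given the classifier's complexity score for a prompt.

    complexity_score estimates the probability that the weak model is NOT
    good enough for this prompt.
    """
    return "strong" if complexity_score >= cfg.threshold else "weak"

balanced = RouterConfig(threshold=0.5)      # normal operation: traffic splits
all_weak = RouterConfig(threshold=1.01)     # nothing clears the bar: always weak
all_strong = RouterConfig(threshold=-0.01)  # everything clears the bar: always strong
```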

I like the discussion though...

@eero-t (Contributor) commented Apr 1, 2025

Thanks, that was a great response!

I hadn't even considered that it could transparently route queries to another cluster; that's a good edge use case.

(A higher-level alternative, e.g., to using PoCL remote with the oneAPI driver for AI workloads.)

Risk (i.e., quality) is something that can be measured. The researchers who developed the matrix factorization model quoted 95% accuracy while saving 85% of the computation.

Those results depend a lot on what it was tested on, i.e., how much the training prompts differ from the quality-testing prompts. Are those sets available for evaluation?

Regarding "other optimizations" - I agree - but I view that argument as orthogonal

LLM routing could interfere with LLM scale-up routing optimizations like prefix caching: https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/

But if it can be turned off when it does not help, that's OK.

@haim-barad (Contributor, Author)

MT-Bench was used for the accuracy and performance claim. Yes, it is available. But even better would be using quality tools on the customer's actual data: query both models during a testing phase and determine when the weaker model gave a good-enough answer.
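
A sketch of that testing-phase idea (the helper names are hypothetical; judge_score stands in for whatever quality tool or LLM-as-judge is preferred):

```python
def calibrate(prompts, weak_llm, strong_llm, judge_score, good_enough=0.9):
    """Query both models on held-out customer prompts and record, per prompt,
    whether the weak model's answer was good enough relative to the strong one.
    The labels can then drive the choice of routing threshold."""
    labels = []
    for prompt in prompts:
        weak_answer = weak_llm(prompt)
        strong_answer = strong_llm(prompt)
        # Relative scoring: is the weak answer within tolerance of the strong one?
        ratio = judge_score(prompt, weak_answer) / max(judge_score(prompt, strong_answer), 1e-9)
        labels.append(ratio >= good_enough)
    return labels, sum(labels) / len(labels)  # per-prompt labels, weak-OK rate
```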

Currently, we have the matrix factorization model with only the embedding layer updated to use Hugging Face embeddings; this removes the OpenAI dependency and makes it more broadly useful. Additionally, we have a method to fully train our own matrix factorization model on customer data (a future feature), giving even higher-quality decision making.
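
A sketch of what that embedding swap looks like (the embedding model name and weight files are hypothetical; the real parameters come from the trained matrix factorization router):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A local Hugging Face embedding model replaces the OpenAI embeddings API.
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example model

# Hypothetical files holding the trained matrix factorization parameters.
W = np.load("mf_projection.npy")     # maps embedding space -> router latent space
v = np.load("mf_strong_vector.npy")  # latent direction for "needs strong model"

def strong_model_probability(prompt: str) -> float:
    """Estimated probability that the strong model is needed for this prompt."""
    emb = embedder.encode(prompt, normalize_embeddings=True)
    return float(1.0 / (1.0 + np.exp(-(emb @ W @ v))))  # sigmoid of the MF score
```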

I see the merging is blocked; how does this conversation get resolved? I like the back and forth, but I really want to understand whether something is actually blocking the merge.

@ashahba (Collaborator) commented Apr 2, 2025

MT-Bench was used for the accuracy and performance claim. Yes, it is available. But even better would be using quality tools on the customer's actual data: query both models during a testing phase and determine when the weaker model gave a good-enough answer.

Currently, we have the matrix factorization model with only the embedding layer updated to use Hugging Face embeddings; this removes the OpenAI dependency and makes it more broadly useful. Additionally, we have a method to fully train our own matrix factorization model on customer data (a future feature), giving even higher-quality decision making.

I see the merging is blocked; how does this conversation get resolved? I like the back and forth, but I really want to understand whether something is actually blocking the merge.

Agreed!
A healthy conversation to nail down the problem you are trying to solve is always welcome, but at some point we need to find common ground and agree either that the PR is ready to be merged or that it still has too many unknowns.

Currently, all you need is one more gatekeeper to approve your PR, and once that is in place, we can merge it.
But we are getting there 😄

@yinghu5 added this to the v1.5 milestone Apr 2, 2025
@eero-t (Contributor) commented Apr 2, 2025

I see the merging is blocked; how does this conversation get resolved?

I'm not a gatekeeper in this project myself; I'm just reviewing it. That, together with comments from other non-gatekeepers, is just input for the required 2 gatekeeper approvals, i.e., from people with write access (the shield icon in the reviewer list?).

@lkk12014402 (Collaborator)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

Hi @haim-barad, does the routing agent use the GenAIComps agent component? And where do you want to put the routing agent: GenAIComps or GenAIExamples?

@haim-barad (Contributor, Author)

The router is a decision maker (a classifier). Currently, it supports text-based prompts and makes decisions based on prompt complexity. We expect it to be used initially with chat-based apps, but there is no inherent limit; depending on how the model is constructed, we expect it to be useful in many scenarios.

Hi @haim-barad, does the routing agent use the GenAIComps agent component? And where do you want to put the routing agent: GenAIComps or GenAIExamples?

We plan for the router code to go into GenAIComps and for some examples (Jupyter notebooks) to go into GenAIExamples.

@mkbhanda merged commit e2040df into opea-project:main Apr 4, 2025
4 checks passed
@haim-barad deleted the barad branch April 4, 2025 13:48
Labels: A0 need to scrub