Support NomicBert MoE #596

Open

wants to merge 10 commits into base: main

Conversation

@kozistr (Contributor) commented Apr 19, 2025

What does this PR do?

This PR unlocks more NomicBert configurations and adds support for the MoE layer.

Fixes #502

text-embeddings-router --model-id ../nomic-embed-text-v2-moe/ --pooling cls --port 8080 --dtype float32 --auto-truncate
2025-04-19T11:04:57.198528Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../nom**-*****-****-**-moe/", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: Some(Cls), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-04-19T11:04:57.668115Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 512
2025-04-19T11:04:57.668932Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-04-19T11:04:59.262288Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-04-19T11:04:59.283614Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:261: Starting NomicBert model on Cpu
2025-04-19T11:05:01.610862Z  WARN text_embeddings_router: router/src/lib.rs:263: Backend does not support a batch size > 4
2025-04-19T11:05:01.610892Z  WARN text_embeddings_router: router/src/lib.rs:264: forcing `max_batch_requests=4`
2025-04-19T11:05:01.613807Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-04-19T11:05:01.613839Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-04-19T11:05:02.949753Z  INFO embed{total_time="212.307122ms" tokenization_time="227.718µs" queue_time="252.451µs" inference_time="211.745289ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success
text-embeddings-router --model-id ../nomic-embed-text-v2-moe/ --pooling cls --port 8080 --dtype float16 --auto-truncate
2025-04-23T17:36:35.410931Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../nom**-*****-****-**-moe/", revision: None, tokenization_workers: None, dtype: Some(Float16), pooling: Some(Cls), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "r-kozistr-grant-org-tei-y9hhjvnh-79677-p4ky6", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-04-23T17:36:35.914484Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 512
2025-04-23T17:36:35.914721Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-04-23T17:36:37.808727Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-04-23T17:36:38.248782Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:403: Starting FlashNomicBert model on Cuda(CudaDevice(DeviceId(1)))
2025-04-23T17:36:38.769658Z  INFO text_embeddings_router: router/src/lib.rs:253: Warming up model
2025-04-23T17:36:39.034483Z  WARN text_embeddings_router: router/src/lib.rs:313: Invalid hostname, defaulting to 0.0.0.0
2025-04-23T17:36:39.035971Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-04-23T17:36:39.035991Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-04-23T17:36:51.852696Z  INFO embed{total_time="15.000519ms" tokenization_time="212.254µs" queue_time="238.905µs" inference_time="14.486828ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success

I've also checked the output with this script in CPU and CUDA environments (it will probably work on MPS as well, but it'd be better to verify).
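
For reference, not the verification script mentioned above but just a minimal smoke test against a running router, the /embed route can be exercised like this:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'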

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @alvarobartt

Comment on lines 413 to 420
    match () {
        _ if use_moe => Ok(Self::MoE(NomicMoELayer::load(vb, config)?)),
        _ if config.activation_function == HiddenAct::Gelu => {
            Ok(Self::Mlp(NomicBertMLP::load(vb, config)?))
        }
        _ => Ok(Self::GatedMLP(NomicBertGatedMLP::load(vb, config)?)),
    }
}
Collaborator:

This seems wrong.

An if/else sequence seems more idiomatic than an empty match here.

Having activation_function decide the type of MLP layer seems wrong; can't we find a better config option for it?

Contributor Author (@kozistr), Apr 22, 2025:

Surprisingly, NomicBert decides the type of MLP layer by the activation function. Here is the code, and I haven't yet found any configuration option that cleanly distinguishes between them.

Well, for now, the only way I can think of is to identify the type of MLP layer by the key names in the weights, instead of relying on the activation function name (e.g. if there's an fc11.weight, use NomicBertGatedMLP). However, I don't think that's a great approach either.

I'll think about it further. If you have any good ideas, please feel free to let me know!
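
For illustration only, here is a minimal sketch of that key-based detection idea. It assumes candle_nn::VarBuilder::contains_tensor is available, reuses the PR's NomicBertMLP / NomicBertGatedMLP / NomicMoELayer loaders, and the MlpVariant name is made up:

use candle::Result;
use candle_nn::VarBuilder;

// Hypothetical sketch: pick the MLP variant from the checkpoint keys instead of the
// activation function. The Nomic* types come from this PR; MlpVariant is invented here.
fn load_mlp(vb: VarBuilder, config: &NomicConfig, use_moe: bool) -> Result<MlpVariant> {
    if use_moe {
        Ok(MlpVariant::MoE(NomicMoELayer::load(vb, config)?))
    } else if vb.contains_tensor("fc11.weight") {
        // Gated MLPs store separate gate/up projections (fc11/fc12) in the checkpoint.
        Ok(MlpVariant::GatedMLP(NomicBertGatedMLP::load(vb, config)?))
    } else {
        Ok(MlpVariant::Mlp(NomicBertMLP::load(vb, config)?))
    }
}

An if/else chain like this would also address the comment above about the empty match.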

Contributor Author:

Maybe it'd be better to open a PR against the nomic-ai models to add a new configuration field for the type of MLP layer.

Comment on lines 142 to 143
fc11,
fc12,
Collaborator:

The fusing of gate_up into a single tensor is done on purpose, as it's faster to do a single matmul. Can you fuse them again?

Contributor Author:

sure! I'll work on it too!

Contributor Author:

I've changed fc11 and fc12 into a single linear layer fc1.

9581226
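
For reference, a minimal sketch of how the fused projection's forward pass could look (not the exact PR code; the struct name, field names, gate/up ordering inside fc1, and the activation are assumptions):

use candle::{D, Result, Tensor};
use candle_nn::{Linear, Module};

// Hypothetical fused gated MLP: fc1 holds both the gate (former fc11) and up (former
// fc12) projections, so a single matmul produces both halves at once.
struct FusedGatedMlp {
    fc1: Linear, // output dim = 2 * intermediate_size
    fc2: Linear, // projects back to hidden_size
}

impl FusedGatedMlp {
    fn forward(&self, hidden: &Tensor) -> Result<Tensor> {
        let gate_up = self.fc1.forward(hidden)?;
        // Split the fused output back into its two halves (ordering is an assumption).
        let chunks = gate_up.chunk(2, D::Minus1)?;
        let (gate, up) = (&chunks[0], &chunks[1]);
        let activated = gate.silu()?; // the actual activation depends on the config
        self.fc2.forward(&(activated * up)?)
    }
}

The speed-up comes from doing one matmul over the concatenated fc1 weight instead of two separate fc11/fc12 matmuls.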

let activated_gate = match self.activation {
    HiddenAct::Gelu => gate.gelu()?,
    HiddenAct::Swiglu => gate.silu()?,
    _ => candle_nn::ops::sigmoid(&gate)?,
Collaborator:

This seems wrong: HiddenAct::Sigmoid should be the explicit case, and we should either handle or panic on other activations.

Given the simplicity of activations, we could definitely abstract it away into a layer that is common to all models.

Contributor Author:

You're right! It'd be better to handle or panic for the other cases. I'll fix it too.

we could definitely abstract it away into a layer that is common to all models.

I agree with you. We could refactor this part too.

Contributor Author:

I've just implemented a forward() method on the HiddenAct enum that applies the activation to the given input: 188d098

Would this kind of approach be good?
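
For context, a minimal sketch of that kind of helper; the variant set here is an assumption and the real HiddenAct enum in this repo may differ:

use candle::{Result, Tensor};

// Hypothetical variant set; the actual enum may contain more/other activations.
pub enum HiddenAct {
    Gelu,
    Silu,
    Swiglu,
    Relu,
}

impl HiddenAct {
    // Apply the activation to the given tensor.
    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        match self {
            Self::Gelu => x.gelu(),
            Self::Silu | Self::Swiglu => x.silu(),
            Self::Relu => x.relu(),
        }
    }
}

Callers can then invoke activation.forward(&hidden) instead of re-matching on the enum in every MLP implementation.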

Comment on lines +383 to +384
let router = NomicRouter::load(vb.pp("router"), config)?;
let experts = NomicExperts::load(vb.pp("experts"), config)?;
Collaborator:

This is a great first approach, so let's merge as is; however, MoE implemented like this is quite slow. Ideally we should think about implementing some real MoE kernels at some point.

Just for pointers, here is a start: https://huggingface.co/kernels-community/moe/tree/main (these are the MoE kernels extracted from the vLLM project, which is quite nested). It should be a simpler starting point to have a single kernel/layer, which we could then add here: https://github.com/huggingface/candle-extensions.

There are actually quite a few MoE kernel variants, and for embedding models we need to benchmark how impactful having a kernel is (on LLMs it's quite significant, something up to 20% of the overall runtime IIRC).
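
To make the performance point concrete, here is a schematic, plain-Rust illustration of naive per-token routing (not the PR's candle code): every token triggers a separate call for each selected expert, which is exactly the dispatch overhead a fused/grouped MoE kernel avoids.

// Schematic only: plain Rust with Vec<f32>, not the PR's candle implementation.
fn naive_moe_forward(
    hidden: &[Vec<f32>],                         // [num_tokens][hidden_size]
    router_logits: &[Vec<f32>],                  // [num_tokens][num_experts]
    experts: &[Box<dyn Fn(&[f32]) -> Vec<f32>>], // one closure per expert MLP
    top_k: usize,
) -> Vec<Vec<f32>> {
    hidden
        .iter()
        .zip(router_logits)
        .map(|(h, logits)| {
            // Softmax over the router logits for this token.
            let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

            // Sort expert indices by routing probability and keep the top_k.
            let mut order: Vec<usize> = (0..probs.len()).collect();
            order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

            // Weighted sum of the selected experts' outputs: one expert call per token.
            let mut out = vec![0.0f32; h.len()];
            for &e in order.iter().take(top_k) {
                for (o, v) in out.iter_mut().zip(experts[e](h.as_slice())) {
                    *o += probs[e] * v;
                }
            }
            out
        })
        .collect()
}

A grouped implementation instead gathers all tokens assigned to each expert and runs one batched matmul per expert (or a single fused kernel, as in the vLLM-derived kernels linked above), which removes that per-token dispatch overhead.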

Contributor Author:

I also agree that having a real MoE kernel here would be super beneficial.

(on LLMs it's quite significant, something up to 20% of the overall runtime IIRC)

wow, that's huge

Contributor Author:

I've fixed a shape issue with FlashNomicBert: 1bd78b6

@kozistr (Contributor Author) commented Apr 22, 2025

@Narsil thanks for taking the time to review! I'll get back to you after addressing your review comments (1. a better way to decide the type of MLP, 2. fusing the linear layers).

@Narsil (Collaborator) commented Apr 23, 2025

Could you add a test too if possible btw?

@kozistr (Contributor Author) commented Apr 23, 2025

Could you add a test too if possible btw?

sure! added efe0033

-- updated

I'll add a test for FlashNomicBert too today

@kozistr requested a review from @Narsil, April 23, 2025 12:33
Successfully merging this pull request may close these issues:

  • Support for nomic-ai/nomic-embed-text-v2-moe (#502)