Support NomicBert MoE #596

Open

wants to merge 10 commits into base: main

Conversation

@kozistr (Contributor) commented Apr 19, 2025

What does this PR do?

This PR unlocks more NomicBert configurations and adds support for the MoE layer.

Fixes #502

text-embeddings-router --model-id ../nomic-embed-text-v2-moe/ --pooling cls --port 8080 --dtype float32 --auto-truncate
2025-04-19T11:04:57.198528Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../nom**-*****-****-**-moe/", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: Some(Cls), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-04-19T11:04:57.668115Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 512
2025-04-19T11:04:57.668932Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-04-19T11:04:59.262288Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-04-19T11:04:59.283614Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:261: Starting NomicBert model on Cpu
2025-04-19T11:05:01.610862Z  WARN text_embeddings_router: router/src/lib.rs:263: Backend does not support a batch size > 4
2025-04-19T11:05:01.610892Z  WARN text_embeddings_router: router/src/lib.rs:264: forcing `max_batch_requests=4`
2025-04-19T11:05:01.613807Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-04-19T11:05:01.613839Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-04-19T11:05:02.949753Z  INFO embed{total_time="212.307122ms" tokenization_time="227.718µs" queue_time="252.451µs" inference_time="211.745289ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success
text-embeddings-router --model-id ../nomic-embed-text-v2-moe/ --pooling cls --port 8080 --dtype float16 --auto-truncate
2025-04-23T17:36:35.410931Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../nom**-*****-****-**-moe/", revision: None, tokenization_workers: None, dtype: Some(Float16), pooling: Some(Cls), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "r-kozistr-grant-org-tei-y9hhjvnh-79677-p4ky6", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-04-23T17:36:35.914484Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 512
2025-04-23T17:36:35.914721Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-04-23T17:36:37.808727Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-04-23T17:36:38.248782Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:403: Starting FlashNomicBert model on Cuda(CudaDevice(DeviceId(1)))
2025-04-23T17:36:38.769658Z  INFO text_embeddings_router: router/src/lib.rs:253: Warming up model
2025-04-23T17:36:39.034483Z  WARN text_embeddings_router: router/src/lib.rs:313: Invalid hostname, defaulting to 0.0.0.0
2025-04-23T17:36:39.035971Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-04-23T17:36:39.035991Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-04-23T17:36:51.852696Z  INFO embed{total_time="15.000519ms" tokenization_time="212.254µs" queue_time="238.905µs" inference_time="14.486828ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success

I've also checked the output with this script in CPU and CUDA environments (it will probably work on MPS as well, but it'd be better to verify).
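
For reference, not the verification script mentioned above but just a minimal smoke test against a running router, the /embed route can be exercised like this:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'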

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @alvarobartt

Comment on lines 413 to 420
    match () {
        _ if use_moe => Ok(Self::MoE(NomicMoELayer::load(vb, config)?)),
        _ if config.activation_function == HiddenAct::Gelu => {
            Ok(Self::Mlp(NomicBertMLP::load(vb, config)?))
        }
        _ => Ok(Self::GatedMLP(NomicBertGatedMLP::load(vb, config)?)),
    }
}
Collaborator:

This seems wrong.

An if/else sequence seems more idiomatic than an empty match here.

Having activation_function decide the type of MLP layer seems wrong; can't we find a better config option for it?

Contributor Author (@kozistr), Apr 22, 2025:

Surprisingly, NomicBert decides the type of MLP layer by the activation function. Here is the code, and I haven't yet found any configuration option that cleanly distinguishes between them.

Well, for now, the only way I can think of is to identify the type of MLP layer by the key names in the weights, instead of relying on the activation function name (e.g. if there's an fc11.weight, use NomicBertGatedMLP). However, I don't think that's a great approach either.

I'll think about it further. If you have any good ideas, please feel free to let me know!
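
For illustration only, here is a minimal sketch of that key-based detection idea. It assumes candle_nn::VarBuilder::contains_tensor is available, reuses the PR's NomicBertMLP / NomicBertGatedMLP / NomicMoELayer loaders, and the MlpVariant name is made up:

use candle::Result;
use candle_nn::VarBuilder;

// Hypothetical sketch: pick the MLP variant from the checkpoint keys instead of the
// activation function. The Nomic* types come from this PR; MlpVariant is invented here.
fn load_mlp(vb: VarBuilder, config: &NomicConfig, use_moe: bool) -> Result<MlpVariant> {
    if use_moe {
        Ok(MlpVariant::MoE(NomicMoELayer::load(vb, config)?))
    } else if vb.contains_tensor("fc11.weight") {
        // Gated MLPs store separate gate/up projections (fc11/fc12) in the checkpoint.
        Ok(MlpVariant::GatedMLP(NomicBertGatedMLP::load(vb, config)?))
    } else {
        Ok(MlpVariant::Mlp(NomicBertMLP::load(vb, config)?))
    }
}

An if/else chain like this would also address the comment above about the empty match.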

Contributor Author:

Maybe it'd be better to open a PR against the nomic-ai models to add a new configuration field for the type of MLP layer.

Comment on lines 142 to 143
fc11,
fc12,
Collaborator:

The fusing of gate_up into a single tensor is done on purpose, as it's faster to do a single matmul. Can you fuse them again?

Contributor Author:

sure! I'll work on it too!

Contributor Author:

I've changed fc11 and fc12 into a single linear layer fc1.

9581226
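
For reference, a minimal sketch of how the fused projection's forward pass could look (not the exact PR code; the struct name, field names, gate/up ordering inside fc1, and the activation are assumptions):

use candle::{D, Result, Tensor};
use candle_nn::{Linear, Module};

// Hypothetical fused gated MLP: fc1 holds both the gate (former fc11) and up (former
// fc12) projections, so a single matmul produces both halves at once.
struct FusedGatedMlp {
    fc1: Linear, // output dim = 2 * intermediate_size
    fc2: Linear, // projects back to hidden_size
}

impl FusedGatedMlp {
    fn forward(&self, hidden: &Tensor) -> Result<Tensor> {
        let gate_up = self.fc1.forward(hidden)?;
        // Split the fused output back into its two halves (ordering is an assumption).
        let chunks = gate_up.chunk(2, D::Minus1)?;
        let (gate, up) = (&chunks[0], &chunks[1]);
        let activated = gate.silu()?; // the actual activation depends on the config
        self.fc2.forward(&(activated * up)?)
    }
}

The speed-up comes from doing one matmul over the concatenated fc1 weight instead of two separate fc11/fc12 matmuls.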

let activated_gate = match self.activation {
    HiddenAct::Gelu => gate.gelu()?,
    HiddenAct::Swiglu => gate.silu()?,
    _ => candle_nn::ops::sigmoid(&gate)?,
Collaborator:

This seems wrong: HiddenAct::Sigmoid should be the explicit case, and we should either handle or panic on other activations.

Given the simplicity of activations, we could definitely abstract it away into a layer that is common to all models.

Contributor Author:

You're right! It'd be better to handle or panic for the other cases. I'll fix it too.

we could definitely abstract it away into a layer that is common to all models.

I agree with you. We could refactor this part too.

Contributor Author:

I've just implemented a forward() method on the HiddenAct enum that applies the activation to the given input: 188d098

Would this kind of approach be good?
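
For context, a minimal sketch of that kind of helper; the variant set here is an assumption and the real HiddenAct enum in this repo may differ:

use candle::{Result, Tensor};

// Hypothetical variant set; the actual enum may contain more/other activations.
pub enum HiddenAct {
    Gelu,
    Silu,
    Swiglu,
    Relu,
}

impl HiddenAct {
    // Apply the activation to the given tensor.
    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        match self {
            Self::Gelu => x.gelu(),
            Self::Silu | Self::Swiglu => x.silu(),
            Self::Relu => x.relu(),
        }
    }
}

Callers can then invoke activation.forward(&hidden) instead of re-matching on the enum in every MLP implementation.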

Comment on lines +383 to +384
let router = NomicRouter::load(vb.pp("router"), config)?;
let experts = NomicExperts::load(vb.pp("experts"), config)?;
Collaborator:

This is a great first approach, so let's merge as is; however, MoE implemented like this is quite slow. Ideally we should think about implementing some real MoE kernels at some point.

Just for pointers, here is a start: https://huggingface.co/kernels-community/moe/tree/main (these are the MoE kernels extracted from the vLLM project, which is quite nested). It should be a simpler starting point to have a single kernel/layer, which we could then add here: https://github.com/huggingface/candle-extensions.

There are actually quite a few MoE kernel variants, and for embedding models we need to benchmark how impactful having a kernel is (on LLMs it's quite significant, something up to 20% of the overall runtime IIRC).
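
To make the performance point concrete, here is a schematic, plain-Rust illustration of naive per-token routing (not the PR's candle code): every token triggers a separate call for each selected expert, which is exactly the dispatch overhead a fused/grouped MoE kernel avoids.

// Schematic only: plain Rust with Vec<f32>, not the PR's candle implementation.
fn naive_moe_forward(
    hidden: &[Vec<f32>],                         // [num_tokens][hidden_size]
    router_logits: &[Vec<f32>],                  // [num_tokens][num_experts]
    experts: &[Box<dyn Fn(&[f32]) -> Vec<f32>>], // one closure per expert MLP
    top_k: usize,
) -> Vec<Vec<f32>> {
    hidden
        .iter()
        .zip(router_logits)
        .map(|(h, logits)| {
            // Softmax over the router logits for this token.
            let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

            // Sort expert indices by routing probability and keep the top_k.
            let mut order: Vec<usize> = (0..probs.len()).collect();
            order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

            // Weighted sum of the selected experts' outputs: one expert call per token.
            let mut out = vec![0.0f32; h.len()];
            for &e in order.iter().take(top_k) {
                for (o, v) in out.iter_mut().zip(experts[e](h.as_slice())) {
                    *o += probs[e] * v;
                }
            }
            out
        })
        .collect()
}

A grouped implementation instead gathers all tokens assigned to each expert and runs one batched matmul per expert (or a single fused kernel, as in the vLLM-derived kernels linked above), which removes that per-token dispatch overhead.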

Contributor Author:

I also agree that having a real MoE kernel here would be super beneficial.

(on LLMs it's quite significant, something up to 20% of the overall runtime IIRC)

wow, that's huge

Contributor Author:

I've fixed a shape issue with FlashNomicBert: 1bd78b6

@kozistr (Contributor Author) commented Apr 22, 2025

@Narsil thanks for taking the time to review! I'll get back to you after addressing your review comments (1. a better way to decide the type of MLP, 2. fusing the linear layers).

@Narsil (Collaborator) commented Apr 23, 2025

Could you add a test too if possible btw?

@kozistr (Contributor Author) commented Apr 23, 2025

Could you add a test too if possible btw?

sure! added efe0033

-- updated

I'll add a test for FlashNomicBert too today

@kozistr requested a review from @Narsil, April 23, 2025 12:33
Successfully merging this pull request may close these issues:

  • Support for nomic-ai/nomic-embed-text-v2-moe (#502)