Support NomicBert MoE #596
base: main
Conversation
backends/candle/src/models/nomic.rs
Outdated
match () {
    _ if use_moe => Ok(Self::MoE(NomicMoELayer::load(vb, config)?)),
    _ if config.activation_function == HiddenAct::Gelu => {
        Ok(Self::Mlp(NomicBertMLP::load(vb, config)?))
    }
    _ => Ok(Self::GatedMLP(NomicBertGatedMLP::load(vb, config)?)),
}
This seems wrong.
An if/else sequence seems more idiomatic than an empty match here.
Also, having activation_function decide the type of MLP layer seems wrong; can't we find some better config option for it?
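For reference, a hedged sketch of that if/else form, assuming it lives in the same impl block as the diff above (the `load_mlp` name and `use_moe` parameter are illustrative, not the PR's exact signature):

```rust
// Illustrative sketch only: same variants and loaders as the diff above,
// rewritten as an if/else chain instead of an empty match.
fn load_mlp(vb: VarBuilder, config: &NomicConfig, use_moe: bool) -> Result<Self> {
    if use_moe {
        Ok(Self::MoE(NomicMoELayer::load(vb, config)?))
    } else if config.activation_function == HiddenAct::Gelu {
        Ok(Self::Mlp(NomicBertMLP::load(vb, config)?))
    } else {
        Ok(Self::GatedMLP(NomicBertGatedMLP::load(vb, config)?))
    }
}
```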
Surprisingly, NomicBert decides the type of MLP layer by activation function. Here is the code, and I haven't yet found any good configuration option to distinguish between them.
For now, the only way I can think of is to identify the type of MLP layer by the key names in the weights instead of relying on the activation function name (e.g. if there's an `fc11.weight` key, use `NomicBertGatedMLP`). However, I don't think that's a good approach either.
I'll think about it further. If you have any good ideas, please feel free to let me know!
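A hedged sketch of that weight-key heuristic, assuming `vb` is already scoped to this layer's MLP weights (`contains_tensor` is candle's VarBuilder lookup):

```rust
// Illustrative sketch: probe the checkpoint for the gated-MLP-specific tensor
// key instead of relying on the activation function name. Assumes `vb` is the
// VarBuilder already pointing at this layer's MLP weights.
let layer = if vb.contains_tensor("fc11.weight") {
    Self::GatedMLP(NomicBertGatedMLP::load(vb, config)?)
} else {
    Self::Mlp(NomicBertMLP::load(vb, config)?)
};
```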
Maybe it'd be better to make a PR to the nomic-ai models to add a new configuration option for the type of MLP layer.
backends/candle/src/models/nomic.rs
Outdated
fc11,
fc12,
The fusing of gate_up into a single tensor is done on purpose, as it's faster to do a single matmul. Can you fuse them again?
sure! I'll work on it too!
I've changed `fc11` and `fc12` into a single linear layer `fc1`.
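For the record, a hedged sketch of what such a fusion can look like in candle (function names and shape comments are illustrative, not the PR's exact code):

```rust
use candle::{Result, Tensor, D};

// Illustrative sketch: concatenate the fc11 and fc12 weights along the output
// dimension once at load time, so the forward pass needs a single matmul
// whose result is then split back into the gate and up halves.
fn fuse_gate_up(fc11_w: &Tensor, fc12_w: &Tensor) -> Result<Tensor> {
    // Both weights are (intermediate_size, hidden_size); the fused fc1
    // weight is (2 * intermediate_size, hidden_size).
    Tensor::cat(&[fc11_w, fc12_w], 0)
}

fn gate_up(hidden: &Tensor, fc1_w: &Tensor, intermediate_size: usize) -> Result<(Tensor, Tensor)> {
    // One matmul against the fused weight, then split the last dimension.
    let fused = hidden.broadcast_matmul(&fc1_w.t()?)?;
    let gate = fused.narrow(D::Minus1, 0, intermediate_size)?;
    let up = fused.narrow(D::Minus1, intermediate_size, intermediate_size)?;
    Ok((gate, up))
}
```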
backends/candle/src/models/nomic.rs
Outdated
let activated_gate = match self.activation {
    HiddenAct::Gelu => gate.gelu()?,
    HiddenAct::Swiglu => gate.silu()?,
    _ => candle_nn::ops::sigmoid(&gate)?,
This seems wrong: HiddenAct::Sigmoid should be the explicit catch-all case, and we should handle, or panic on, other activations rather than silently falling back to sigmoid.
Given the simplicity of activations, we could definitely abstract them away into a layer that is common to all models.
You're right! It'd be better to handle or panic for the other cases. I'll fix that too.

> we could definitely abstract it away into a layer that is common to all models.

I agree with you. We could refactor this part too.
I've just implemented a `forward()` method on the `HiddenAct` enum that applies the activation to the given input: 188d098
Would this kind of approach be good?
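A hedged sketch of what such a `forward()` method could look like; the variant set mirrors the diff above, and everything else is illustrative rather than the commit's exact code:

```rust
use candle::{Result, Tensor};

// Illustrative sketch: one activation entry point shared by all models,
// matching the Gelu/Swiglu/Sigmoid cases shown in the diff above.
#[derive(Debug, Clone, PartialEq)]
pub enum HiddenAct {
    Gelu,
    Swiglu,
    Sigmoid,
}

impl HiddenAct {
    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        match self {
            Self::Gelu => x.gelu(),
            Self::Swiglu => x.silu(),
            Self::Sigmoid => candle_nn::ops::sigmoid(x),
        }
    }
}
```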
let router = NomicRouter::load(vb.pp("router"), config)?;
let experts = NomicExperts::load(vb.pp("experts"), config)?;
This is a great first approach, so let's merge as is. However, MoE implemented like this is quite slow; ideally we should think about implementing some real MoE kernels at some point.
Just for pointers, here is a start: https://huggingface.co/kernels-community/moe/tree/main (these are the MoE kernels extracted from the vLLM project, which is quite nested). It should be a simpler starting point to have a single kernel and layer, which we could then add here: https://github.com/huggingface/candle-extensions
There are actually quite a few MoE kernel variants, and for embedding models we need to benchmark how impactful having a kernel is (on LLMs it's quite significant, up to 20% of the overall runtime iirc).
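To illustrate why the naive form is slow, here is a hedged, plain-Rust sketch (not the PR's candle code) of per-token top-1 dispatch: each expert closure runs on one token at a time, so the expert matmuls never batch, which is exactly what a fused MoE kernel fixes:

```rust
// Illustrative sketch only: naive top-1 MoE routing over plain vectors.
fn naive_moe_forward<E: Fn(&[f32]) -> Vec<f32>>(
    tokens: &[Vec<f32>],        // per-token hidden states
    router_scores: &[Vec<f32>], // per-token router scores, one per expert
    experts: &[E],              // each expert's MLP as a closure
) -> Vec<Vec<f32>> {
    tokens
        .iter()
        .zip(router_scores)
        .map(|(hidden, scores)| {
            // Top-1 routing: index of the highest-scoring expert.
            let best = scores
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
                .map(|(i, _)| i)
                .unwrap();
            // Per-token expert call: tiny, unbatched work per step.
            experts[best](hidden)
        })
        .collect()
}
```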
I also agree that having a real MoE kernel here would be super beneficial.

> on LLMs it's quite significant, up to 20% of the overall runtime iirc

wow, that's huge
I've fixed a shape issue with FlashNomicBert: 1bd78b6
@Narsil thanks for taking the time to review! I'll get back to you after working on your reviews (1. a better way to decide the type of MLP, 2. fusing the linear layers).
Could you add a test too if possible btw?
sure! added efe0033 -- updated: I'll add a test for FlashNomicBert too today
What does this PR do?
This PR unlocks more NomicBert configurations and the MoE layer.
Fixes #502
I've also checked the output with this script in CPU/CUDA environments (it will perhaps work on MPS too, but it'd be better to verify).
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil @alvarobartt