
[New Model]: Add Cohere2 Model #11357

Closed

CXIAAAAA opened this issue Dec 20, 2024 · 6 comments
Labels: new model

Comments

CXIAAAAA commented Dec 20, 2024

🚀 The feature, motivation and pitch

Recently, Cohere released the Command R7B model on Hugging Face, and I would like to contribute a vLLM implementation of it. @simon-mo

PR: #11358

The model also uses interleaved attention like Gemma 2 and Mistral, so KV cache optimization is needed; I saw that this is also on the roadmap (#9464).

Alternatives

No response

Additional context

I have integrated the model, verified that it works with all the benchmark scripts, and would like to add a feature branch for review.

CXIAAAAA added the feature request label on Dec 20, 2024
DarkLight1337 changed the title from [Feature]: Add Cohere2 Model to [New Model]: Add Cohere2 Model on Dec 20, 2024
DarkLight1337 added the new model label and removed the feature request label on Dec 20, 2024
DarkLight1337 (Member) commented Dec 20, 2024

This has already been added in #11203, and is in the latest released version of vLLM. Thanks for the offer though!

CXIAAAAA (Author) commented Dec 20, 2024

@DarkLight1337 I think there are some issues with the implementation:

  1. The attention pattern is interleaved sliding/sliding/sliding/full. When a global-attention layer passes sliding_window as None while also passing in the cache_config, execution falls into the branch
    sliding_window = cache_config.sliding_window
    which sets the sliding window back to 4096. So instead of [4096, 4096, 4096, None], the current implementation ends up with [4096, 4096, 4096, 4096]. You could add a print near
    self.kv_cache_dtype = kv_cache_dtype
    to double-check; see also the sketch after this list.
  2. I opened [Bugfix] Fix sliding window in cohere2 model #11358 to fix this.

So I think Gemma 2 may have the same implementation issue.
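A minimal sketch of the failure mode described above, with names modeled loosely on vLLM's Attention layer (illustrative only, not the actual source):

```python
# Illustrative sketch of the fallback bug (not the actual vLLM source).
# Cohere2 interleaves three sliding-window layers with one global layer
# (s, s, s, f), so the intended per-layer windows are [4096, 4096, 4096, None].

class FakeCacheConfig:
    sliding_window = 4096  # model-level default picked up from the HF config


def resolve_sliding_window(sliding_window, cache_config):
    # Bug: a global-attention layer passes sliding_window=None, but when a
    # cache_config is also supplied, the None is silently replaced by the
    # model-level default, so the layer is no longer global.
    if sliding_window is None and cache_config is not None:
        sliding_window = cache_config.sliding_window
    return sliding_window


intended = [4096, 4096, 4096, None]
resolved = [resolve_sliding_window(w, FakeCacheConfig()) for w in intended]
print(resolved)  # -> [4096, 4096, 4096, 4096]: the global layer is lost
```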

DarkLight1337 (Member) commented Dec 20, 2024

Thanks for the report! I'll ask @simon-mo to take a look since he added this.

CXIAAAAA (Author) commented:

@DarkLight1337 A follow-up thought: to gate long-context correctness of the implementation, we could add a needle-in-a-haystack test, along the lines of the sketch below. I would be able to help with that as well.
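For example, a minimal needle check could look like this (a hypothetical sketch: the prompt construction and passphrase are placeholders, and only vLLM's public LLM/SamplingParams API is assumed):

```python
# Hypothetical needle-in-a-haystack sanity check; not an existing vLLM test.
from vllm import LLM, SamplingParams

llm = LLM(model="CohereForAI/c4ai-command-r7b-12-2024")

needle = "The secret passphrase is BLUE-MARBLE-42."
filler = "Nothing of interest happens in this sentence. " * 2000
# Bury the needle far enough back that it falls outside the 4096-token
# sliding window, so only the global-attention layers can recover it.
prompt = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]
prompt += "\nWhat is the secret passphrase?"

out = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=32))
assert "BLUE-MARBLE-42" in out[0].outputs[0].text
```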

simon-mo (Collaborator) commented:

Actually, I think @youkaichao added this.

youkaichao (Member) commented:

@CXIAAAAA thanks for the report! The sliding window support for Cohere2 is indeed broken, and I opened PR #11583 to fix it.

Gemma 2 works fine because cache_config.sliding_window is None for models with interleaved sliding windows.
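In other words (a minimal sketch, reusing the fallback shown earlier): when the cache-level default is None, the fallback is a no-op, so the per-layer pattern survives:

```python
# Sketch only: with cache_config.sliding_window left as None (the Gemma 2
# case), the fallback cannot overwrite a layer's None, so interleaved
# patterns pass through unchanged.
def resolve_sliding_window(layer_window, cache_level_window):
    return cache_level_window if layer_window is None else layer_window

print([resolve_sliding_window(w, None) for w in [4096, 4096, 4096, None]])
# -> [4096, 4096, 4096, None]  (the global layer is preserved)
```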
