
feat: add OpenVINO Model Server as a Backend #1722

Closed
fakezeta opened this issue Feb 18, 2024 · 0 comments
Labels
enhancement roadmap

Comments

@fakezeta
Collaborator

Is your feature request related to a problem? Please describe.
From my benchmarks, OpenVINO performance on iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.

With SYCL on an iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO INT8 I can easily reach 15-16 tokens/s.
I don't know what the performance is on Arc or NPU, since I don't have the hardware to test.

This could be an effective solution for computers with an iGPU.

I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/

It should be compatible with the torch, ONNX, and OpenVINO model formats.

Describe the solution you'd like

This could be implemented with the Optimum-Intel library or with the OpenVINO Model Server over gRPC.
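
For reference, a minimal sketch of the Optimum-Intel route, assuming `optimum[openvino]` is installed. It loads the test model linked above; the `device="GPU"` argument targeting the iGPU is an assumption and may need adjusting (e.g. `"CPU"`) depending on the setup:

```python
# Minimal sketch: run an OpenVINO-quantized model via Optimum-Intel.
# Assumes `pip install optimum[openvino]`; device="GPU" (the iGPU) is
# an assumption and may need to be "CPU" or another OpenVINO device.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Optimum-Intel keeps the familiar transformers `generate` interface, which would make it the lighter integration path compared with running a separate OpenVINO Model Server process and talking to it over gRPC.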
