MPT support in llama.cpp #3417
Conversation
…odified with deltas from ggml/examples/mpt
quantize warns because it is looking for attn_k and not attn_qkv:
Now fixed as well.
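For context on that warning: the quantizer does per-layer bookkeeping keyed on tensor names, and MPT fuses Q/K/V into a single attn_qkv tensor, so a check that only looks for attn_k finds nothing. Below is a minimal sketch of the idea; the helper and tensor names are illustrative, not the actual llama.cpp quantize code.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: count per-layer attention tensors by name, the way a
// quantizer might sanity-check a model. Not the actual llama.cpp routine.
static int count_tensors_containing(const std::vector<std::string> & names, const std::string & needle) {
    int n = 0;
    for (const auto & name : names) {
        if (name.find(needle) != std::string::npos) {
            n++;
        }
    }
    return n;
}

int main() {
    // MPT-style tensor names: attention is fused into a single attn_qkv tensor per layer.
    std::vector<std::string> names = {
        "blk.0.attn_qkv.weight", "blk.0.attn_output.weight",
        "blk.1.attn_qkv.weight", "blk.1.attn_output.weight",
    };

    const int n_layer = 2;

    // Looking only for "attn_k" finds nothing and would trigger a warning ...
    if (count_tensors_containing(names, "attn_k.weight") != n_layer) {
        printf("warning: expected %d attn_k tensors, found %d\n",
               n_layer, count_tensors_containing(names, "attn_k.weight"));
    }

    // ... whereas accepting the fused "attn_qkv" name matches once per layer.
    printf("attn_qkv tensors: %d\n", count_tensors_containing(names, "attn_qkv.weight"));
    return 0;
}
```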
…rom metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
…T_KEY macro instead of duplicate code
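Both of those changes concern how optional metadata is read: the clamp value should come from the GGUF key-value store, and an absent key, rather than a 0.0 sentinel, should mean "no clamping". A rough sketch of that pattern using the public gguf_* API follows; the key name and helper are assumptions for illustration, not the PR's macro-based code.

```cpp
#include "ggml.h"   // the gguf_* API was declared here at the time of this PR

// Sketch: read an optional float key from GGUF metadata.
// Returns true and writes *out only if the key is present; a missing key
// means "no clamping" instead of relying on 0.0 as a sentinel value.
static bool read_optional_f32(const struct gguf_context * ctx, const char * key, float * out) {
    const int key_id = gguf_find_key(ctx, key);
    if (key_id < 0) {
        return false; // key absent: leave clamping disabled
    }
    *out = gguf_get_val_f32(ctx, key_id);
    return true;
}

// Usage sketch (the key name follows the "<arch>.attention.clamp_kqv" pattern
// from the GGUF spec; treat it as an assumption here):
//
//   float clamp_kqv = 0.0f;
//   bool  has_clamp = read_optional_f32(ctx, "mpt.attention.clamp_kqv", &clamp_kqv);
```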
…nd rope_shift from build_mpt
Note that this PR does not yet include the modifications to the convert script proposed in #3252 and referred to in #3417 (comment). Since this PR is based on a pre-merge commit of #3252, it may be easier to add this change after the merge.
…nvert-gptneox-hf-to-gguf.py in pr:3252
@cebtenzzre Thanks for the merge. If anyone can give this a quick try and confirm it works, we should merge.
Works for me. The PR is now almost the same as my own previous private merge attempt. The disable-n_past-assertion changes to ggml_compute_forward_alibi_f16 and ggml_compute_forward_alibi_f32 could be made syntactically more consistent, but AFAICS they are functionally equivalent. So not a showstopper for merging into master.
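For readers wondering why those two functions are touched at all: MPT uses ALiBi instead of RoPE, so each attention head adds a linear, position-dependent bias to its scores rather than rotating Q/K. A small illustrative sketch of the standard ALiBi slope/bias computation (not the ggml kernel itself, and assuming a power-of-two head count):

```cpp
#include <cmath>
#include <cstdio>

// Illustrative sketch of ALiBi (Attention with Linear Biases), which MPT uses
// in place of RoPE. Each head gets a fixed slope and adds slope * -(distance)
// to its attention scores. max_bias = 8 matches the common default.
int main() {
    const int   n_head   = 8;
    const float max_bias = 8.0f;

    for (int h = 0; h < n_head; h++) {
        // Geometric sequence of slopes: 2^(-max_bias * (h+1) / n_head)
        const float slope = powf(2.0f, -max_bias * (h + 1) / n_head);

        // Bias added to the score of query position i attending to key position j (j <= i):
        const int   i    = 5, j = 2;
        const float bias = -slope * (i - j);

        printf("head %d: slope = %.4f, bias(i=5, j=2) = %.4f\n", h, slope, bias);
    }
    return 0;
}
```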
…g hparams["vocab_size"]
Tested this, works fine for me. The test failure in test-tokenizer-1-bpe is due to added tokens. I'll fix this in a future PR.
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggml-org#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggml-org#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggml-org#3592)
  ci : check if there is enough VRAM (ggml-org#3596)
  server : add completion mode (no chat) (ggml-org#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggml-org#3588)
  main : fix session loading bug (ggml-org#3400)
  server : add parameter -tb N, --threads-batch N (ggml-org#3584)
  common : fix mirostat state when using multiple sequences (ggml-org#3543)
  batched : add bench tool (ggml-org#3545)
  examples : add batched.swift + improve CI for swift (ggml-org#3562)
  Add MPT model to supported models in README.md (ggml-org#3574)
  Minor improvements in GPT2 tokenizer (ggml-org#3567)
  readme : add bloom (ggml-org#3570)
  llm : add bloom models (ggml-org#3553)
  swift : improvements and fixes (ggml-org#3564)
  llm : add MPT support (ggml-org#3417)
  infill. : fix tokenization (ggml-org#3508)
  ...
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit f5f9121)
I converted the mpt-7b-chat and the mpt-7b-storywriter. The conversion and quantization complete successfully and produce the .gguf files. However, the files don't work for me. When running main with them, I get an error.
For reference, here is the full output:
I have already successfully converted a bunch of falcon models that work fine, but the mpt conversion script does not work for me.
Here is a hexdump of the beginning of the files:
And in comparison, the openbuddy falcon conversion, which works fine:
What I notice is that after […]. In contrast, the actual falcon model has a […].
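When comparing such hexdumps, the first thing to check is the GGUF header: a valid file starts with the 4-byte ASCII magic "GGUF" followed by a little-endian uint32 format version, before any key-value metadata. A small stand-alone sketch that reads and prints those fields (the file name is a placeholder):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch: print the GGUF magic and version from the start of a file, i.e. the
// same bytes one would inspect at the top of a hexdump. Path is a placeholder.
int main() {
    const char * path = "model.gguf"; // hypothetical file name
    FILE * f = fopen(path, "rb");
    if (!f) {
        fprintf(stderr, "failed to open %s\n", path);
        return 1;
    }

    char     magic[4] = {0};
    uint32_t version  = 0;
    if (fread(magic, 1, 4, f) != 4 || fread(&version, sizeof(version), 1, f) != 1) {
        fprintf(stderr, "file too short to be a GGUF file\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    // A valid file starts with the ASCII bytes 'G' 'G' 'U' 'F'.
    printf("magic   : %.4s (%s)\n", magic, memcmp(magic, "GGUF", 4) == 0 ? "ok" : "mismatch");
    printf("version : %u\n", version);
    return 0;
}
```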
As per #1333 (comment)
Some comments regarding this initial implementation: