Add Huggingface tokenizer support #189
Conversation
Thank you so much for this work @DOGEwbx. I've been waiting for DeepSeek Coder support for a long time. This is very helpful.
Is creating EXL2 quants with this also possible?
@CyberTimon Thanks for your interest in our work. I haven't tested EXL2 quant files, but since all the modifications are on the tokenizer side, I don't expect any problems with that specific data format.
At the very least, this is going to take some time to review. Transformers is a massive dependency to include just to support one model (Falcon still wouldn't work as there are other architectural differences). As for remote code, my guess would be that 90% of users are unaware of the risks involved, so it should at least be opt-in. I'll need a moment to think about it, to test that this doesn't break functionality like constrained sampling, and to make sure there really isn't a better way.
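For reference, the opt-in pattern described above is how Transformers itself handles repo-shipped Python: it only runs when the caller explicitly passes `trust_remote_code=True`. A minimal sketch (the model id below is a hypothetical placeholder, not a real repo):

```python
# Illustration of the opt-in pattern for remote code in Hugging Face Transformers.
# "some-org/model-with-custom-tokenizer" is a placeholder.
from transformers import AutoTokenizer

# Default behaviour: custom code shipped with the repo is refused, so nothing
# from the repo is executed locally.
tokenizer = AutoTokenizer.from_pretrained("some-org/model-with-custom-tokenizer")

# Explicit opt-in: the user acknowledges that Python code from the repo will run.
tokenizer = AutoTokenizer.from_pretrained(
    "some-org/model-with-custom-tokenizer",
    trust_remote_code=True,
)
```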
Thanks for your reply.
Is there any specific way to use the fork? With
6.7B shows very similar behaviour, but most of the time it results in an invisible output loop in the chat example. I get the same behaviour no matter what prompt format I use (I also tested the DeepSeek instruct format). Maybe I'm just doing something wrong; I'd appreciate help.
@SinanAkkoyun The model seems to use a linear RoPE scaling factor of 4. I've been able to get coherent output out of the 1.3B model at least, using that. @DOGEwbx The Tokenizers library seems like a more reasonable dependency, especially if it's optional. It largely mirrors Transformers, so it should be possible to adapt it to the code in this PR. There are still a few things I need to sort out and verify: how control symbols are encoded, optional BOS/EOS tokens, whether the vocabulary is preprocessed correctly, how UTF-8 characters are emitted, and so on. I'll get to that in a few hours. It's definitely not a trivial issue; I see over on the llama.cpp repo that a whole bunch of people have been working on it for some weeks now. As for remote code, the issue is that with the option enabled,
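For anyone unfamiliar with the term: linear RoPE scaling just divides the position index by the factor before the rotary angles are computed. A generic sketch of the math (not ExLlama's actual config interface, which may expose this differently):

```python
# Minimal sketch of linear RoPE (rotary position embedding) scaling.
# Positions are divided by the scaling factor before the rotary angles are
# computed, which stretches the effective context window.
def rope_angles(position, dim, base=10000.0, linear_scale=4.0):
    """Rotary angles for one position, with linear scaling applied."""
    scaled_pos = position / linear_scale  # factor 4, as noted above
    return [scaled_pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# With a factor of 4, position 4096 yields the same angles as position 1024
# would without scaling, so a 4x longer context maps onto familiar angles.
assert rope_angles(4096, dim=8) == rope_angles(1024, dim=8, linear_scale=1.0)
```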
@turboderp Thank you, 6.7B is working coherently :)
@turboderp However, I can't seem to get 1.3B to output coherent responses. What params did you use? EXL2 GPTQ:
Or is this just due to 4-bit quantization? The bf16 model responds with great answers for its 1.3B size.
I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.
There. I rewrote it to use the Tokenizers library instead, as an optional dependency, and it seems to run okay now. It consistently encodes and decodes the same as an HF AutoTokenizer. Encoding seems to work correctly during quantization as well. I also added a workaround for the Tokenizers bug where some added tokens would decode incorrectly. I still need to test it with some of the other models that lack a SentencePiece tokenizer model.
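For context, loading an HF tokenizer through the standalone Tokenizers library (rather than all of Transformers) looks roughly like this; the path is a placeholder, and the try/except is just one way to keep the dependency optional:

```python
# Sketch of using the Tokenizers library as an optional dependency.
# "/path/to/model/tokenizer.json" is a placeholder.
try:
    from tokenizers import Tokenizer
except ImportError:
    Tokenizer = None  # models that need an HF tokenizer simply become unsupported

if Tokenizer is not None:
    tok = Tokenizer.from_file("/path/to/model/tokenizer.json")
    ids = tok.encode("def fibonacci(n):").ids
    print(ids)               # token ids, as a plain list of ints
    print(tok.decode(ids))   # should round-trip back to the original string
```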
Thank you |
Yes, that's what puzzled me; the FP16 model ran perfectly fine and conquered most basic coding tasks easily.
Thank you so much! |
Add logic to decide whether to use the Hugging Face tokenizer or the SentencePiece tokenizer.
This adds support for models that use a Hugging Face tokenizer, such as Falcon and DeepSeek Coder.
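A rough sketch of what such a decision could look like, under the common convention that SentencePiece models ship a `tokenizer.model` file while HF tokenizers ship `tokenizer.json`; the function name and structure are illustrative, not the PR's actual code:

```python
import os

def load_tokenizer(model_dir):
    """Pick a tokenizer backend based on which file the model directory ships."""
    sp_path = os.path.join(model_dir, "tokenizer.model")
    hf_path = os.path.join(model_dir, "tokenizer.json")
    if os.path.exists(sp_path):
        # SentencePiece model present: use it, as before.
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=sp_path)
    if os.path.exists(hf_path):
        # Fall back to the optional Tokenizers dependency (Falcon, DeepSeek Coder, ...).
        from tokenizers import Tokenizer
        return Tokenizer.from_file(hf_path)
    raise FileNotFoundError(f"No tokenizer.model or tokenizer.json in {model_dir}")
```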