Add more tokenizer tests #3742
Conversation
I'm testing for codepoint
Restricting fixes
After applying your change to the test, starcoder passes too.
Interesting, I see.
Hm.
Pay no attention to neox; it's not relevant here (it uses an old model that was failing on the map illegal access).
Should I also include your patch restricting tokenizer tests to Unicode planes in this PR?
Yes, please.
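The exchange above is about restricting brute-force tokenizer tests to the lower Unicode planes and skipping the surrogate range (which is not valid in UTF-8). A minimal sketch of that idea follows; the helper names `codepoint_to_utf8` and `test_codepoints` are hypothetical, not the functions actually used in the PR:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode a single Unicode codepoint as UTF-8 (sketch, no validation).
std::string codepoint_to_utf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back((char) cp);
    } else if (cp < 0x800) {
        out.push_back((char) (0xC0 | (cp >> 6)));
        out.push_back((char) (0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back((char) (0xE0 | (cp >> 12)));
        out.push_back((char) (0x80 | ((cp >> 6) & 0x3F)));
        out.push_back((char) (0x80 | (cp & 0x3F)));
    } else {
        out.push_back((char) (0xF0 | (cp >> 18)));
        out.push_back((char) (0x80 | ((cp >> 12) & 0x3F)));
        out.push_back((char) (0x80 | ((cp >> 6) & 0x3F)));
        out.push_back((char) (0x80 | (cp & 0x3F)));
    }
    return out;
}

// Enumerate candidate test codepoints restricted to the first
// (max_plane + 1) Unicode planes, skipping the UTF-16 surrogate
// range U+D800..U+DFFF, which must not appear in UTF-8 input.
std::vector<uint32_t> test_codepoints(uint32_t max_plane) {
    std::vector<uint32_t> cps;
    const uint32_t max_cp = (max_plane + 1) * 0x10000;
    for (uint32_t cp = 1; cp < max_cp; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue; // skip surrogates
        cps.push_back(cp);
    }
    return cps;
}
```

Each returned codepoint would then be encoded with `codepoint_to_utf8` and round-tripped through tokenize/detokenize in the test.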
Nicely done!
@Galunid: sorry for mixing it up when using the GitHub online editor (will never try that again ;). May I ask you to repair the mess (I don't know enough about git to fix it on your branch)? Thanks again.
Force-pushed from efd3c22 to 1244b00
@goerch Done, no worries ;)
OK, I'll wait until tomorrow morning (+10 hours) before merging. |
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggml-org#3812)
  llama : correctly report GGUFv3 format (ggml-org#3818)
  simple : fix batch handling (ggml-org#3803)
  cuda : improve text-generation and batched decoding performance (ggml-org#3776)
  server : do not release slot on image input (ggml-org#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggml-org#3584) (ggml-org#3768)
  server : do not block system prompt update (ggml-org#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggml-org#3765)
  cmake : add missed dependencies (ggml-org#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggml-org#3749)
  Add more tokenizer tests (ggml-org#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggml-org#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)"
  issues : separate bug and enhancement template + no default title (ggml-org#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggml-org#3746)
  llama : remove token functions with `context` args in favor of `model` (ggml-org#3720)
  Fix baichuan convert script not detecing model (ggml-org#3739)
  make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)
  ...
Conversion scripts used 96981f3

Issues:
- Persimmon script doesn't allow for `--vocab-only`
- GPT-NeoX tokenizer fails with `std::unordered_map` illegal access seen in other gpt2-tokenizer-based models. I applied the fix from Missing tokenizer tests #3730 (comment) and the test passed. I didn't include the passing version here, only the failing one; let me know which one you want @goerch
- Refact fails with `byte not found in vocab` (you can see in CI)
- Starcoder fails with `byte not found in vocab` (you can see in CI)

Models used:
closes #3730
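For context on the failures listed above: an "illegal access" on a `std::unordered_map` typically comes from looking up a key that may be absent (e.g. via unchecked iterator use or `at()` on a missing byte token). Guarding the lookup with `find()` turns the crash into a reportable "byte not found in vocab" condition. This is a hedged sketch of that general pattern only, not the actual fix from #3730; `byte_to_token` and the `<0xNN>` byte-token naming are illustrative assumptions, not llama.cpp's real API:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// Hypothetical byte-to-token lookup. Bytes are assumed to be stored in the
// vocab under names like "<0x41>". Using find() instead of an unchecked
// lookup lets a missing byte be reported instead of crashing the test.
int byte_to_token(const std::unordered_map<std::string, int> & vocab, uint8_t byte) {
    char buf[8];
    snprintf(buf, sizeof(buf), "<0x%02X>", (unsigned) byte);
    auto it = vocab.find(buf);
    if (it == vocab.end()) {
        return -1; // caller reports "byte not found in vocab"
    }
    return it->second;
}
```

A tokenizer test can then assert on the sentinel instead of relying on the map lookup never failing.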