
Gpt2 tokenizer does not support different vocab_size #148

Closed
rektomar opened this issue Aug 16, 2023 · 1 comment
rektomar commented Aug 16, 2023

I want to reduce vocab_size for the gpt2 tokenizer, but the loaded tokenizer still has the full vocabulary size.

using Transformers.HuggingFace

config = hgf"gpt2:config"
vocab_size = 2053
new_config = HuggingFace.HGFConfig(config, vocab_size=vocab_size, bos_token_id=vocab_size-1, eos_token_id=vocab_size-1)
te = HuggingFace.load_tokenizer("gpt2"; config=new_config)

julia> te.vocab
Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0)
chengchingwen (Owner) commented

vocab_size only controls the size of the model's embedding table; it does not affect the tokenizer. It is also unclear what the correct behavior should be when a smaller (or larger) value is specified.

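For illustration, a minimal sketch of what vocab_size does affect, under the assumption that HuggingFace.load_model accepts the modified config through a config keyword (that call form is an assumption here, not something stated in this thread):

using Transformers.HuggingFace

config = hgf"gpt2:config"
small_config = HuggingFace.HGFConfig(config; vocab_size = 2053)

# Assumed call form: loading the model under the modified config would give a
# token embedding table with 2053 rows instead of 50257, while the tokenizer
# keeps the full GPT-2 vocabulary.
model = HuggingFace.load_model("gpt2"; config = small_config)
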
A workaround for reducing the vocab size is to create a new tokenizer with a smaller vocabulary by directly copying a subset of the original vocabulary, but this would cause most tokens to become the unknown token (a sketch follows below). Personally, the better way would be to construct/train your own tokenizer with a smaller vocabulary size.
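
A hedged sketch of that workaround, built on the TextEncodeBase.jl Vocab type shown in the printout above; the field names (vocab.list, vocab.unk) and the constructor used here are assumptions, not a tested recipe:

using Transformers.HuggingFace
using TextEncodeBase: Vocab, lookup

te = HuggingFace.load_tokenizer("gpt2")

# Copy the first 2053 entries of the full 50257-token list into a new Vocab,
# reusing the original unknown token.
small_list = collect(te.vocab.list)[1:2053]
small_vocab = Vocab(small_list, te.vocab.unk)

# lookup(Int, small_vocab, token) returns small_vocab.unki for any token
# outside the copied subset, which is why most text would map to <unk>.

Note that this only builds a reduced Vocab; wiring it into a full GPT-2 text encoder (pre-tokenization, BPE merges, post-processing) is left out here and would require reconstructing the encoder around it.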
