`vocab_size` only changes the size of the model's embedding table; it does not affect the tokenizer. It's also unclear what the correct behavior should be when a smaller (or larger) value is specified.
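For example, a minimal sketch (not from this thread, and assuming a randomly initialized GPT-2 and an illustrative target size of 8000) of what changing `vocab_size` actually does:

```python
from transformers import GPT2Config, GPT2LMHeadModel, AutoTokenizer

# Shrinking vocab_size only shrinks the embedding table of the model.
config = GPT2Config(vocab_size=8000)
model = GPT2LMHeadModel(config)  # randomly initialized weights
print(model.get_input_embeddings().weight.shape)  # torch.Size([8000, 768])

# The tokenizer is untouched and still produces ids from the full vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257

# Any token id >= 8000 from this tokenizer would index out of range in the
# model above, which is why changing vocab_size alone is not enough.
```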
A workaround for reducing the vocab size is to create a new tokenizer by directly copying a subset of the original vocabulary, but most tokens would then map to the unknown token. Personally, I think the better way is to construct/train your own tokenizer with a smaller vocabulary, as in the sketch below.
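Something like this (a minimal sketch; the toy corpus and `vocab_size=8000` are placeholders, substitute your own data and target size) trains a new tokenizer with the same pipeline as GPT-2 but a smaller vocabulary:

```python
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(base_tokenizer.vocab_size)  # 50257 -- the full GPT-2 vocabulary

# Any iterator over batches of raw text works; here a toy in-memory corpus.
corpus = [
    "a small example sentence",
    "another line of training text",
]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Learns a new BPE vocabulary from the corpus, keeping GPT-2's
# pre-tokenization and special tokens.
small_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=8000
)
print(small_tokenizer.vocab_size)  # at most 8000
small_tokenizer.save_pretrained("gpt2-small-vocab")
```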
I want to reduce `vocab_size` for the gpt2 tokenizer, but the tokenizer still keeps its full vocabulary size.