The Tokenizer takes the token ‘pricerange’ as ‘[UNK]’? #1

Open
HunYuanfeng opened this issue Sep 12, 2020 · 1 comment

HunYuanfeng commented Sep 12, 2020

For example:
tokens:
['i', 'am', 'looking', 'for', 'a', 'restaurant', 'in', 'the', '[restaurant_area]', '.', 'postcode', 'type', 'phone', 'food', 'pricerange', 'address', 'area', 'name', 'id', 'reference']

input_ids:
[8, 35, 51, 15, 12, 45, 18, 9, 67, 6, 89, 117, 68, 88, 3, 82, 70, 346, 281, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tokenizer.convert_id_to_tokens(input_ids):
i am looking for a restaurant in the [restaurant_area] . postcode type phone food [UNK] address area name id reference.

The Tokenizer maps the token ‘pricerange’ to ‘[UNK]’, so the training code might not work.
Is this expected? Is there a bug in the source code?
I tried to examine the issue with:
tokenizer = Tokenizer(vocab, ivocab, False)
print(tokenizer.vocab_len) # 3130
print(tokenizer.get_word_id('pricerange')) # 3
print(tokenizer.get_word(3)) # [UNK]
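
The same check can be run over a whole token list to see every token that falls back to [UNK]. This is only a minimal sketch: it uses a plain dict as a stand-in for the repository's vocabulary, and `find_unknown_tokens` is a hypothetical helper, not part of the Tokenizer class. The ids are the ones shown in the example above.

```python
# Hypothetical stand-in for the repo's vocabulary: a plain token -> id dict.
# The ids are copied from the example above; id 3 is [UNK].
vocab = {"[UNK]": 3, "food": 88, "address": 82, "area": 70}
UNK_ID = vocab["[UNK]"]


def find_unknown_tokens(tokens, vocab, unk_id):
    """Return the tokens that are absent from vocab and would map to unk_id."""
    return [t for t in tokens if t != "[UNK]" and vocab.get(t, unk_id) == unk_id]


tokens = ["food", "pricerange", "address", "area"]
unknown = find_unknown_tokens(tokens, vocab, UNK_ID)
print(unknown)  # ['pricerange']
```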

@fasterbuild

@HunYuanfeng According to the "data/vocab.json" file, 'pricerange' should probably be replaced with "[restaurant_pricerange]" in your sentence above.
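
One way to find such a replacement is to search the vocabulary for bracketed slot tokens that contain the unknown word. The sketch below is an assumption-laden illustration: `slot_candidates` is a hypothetical helper, and the token list is a hand-written excerpt rather than the real contents of "data/vocab.json".

```python
def slot_candidates(word, vocab_tokens):
    """Return bracketed tokens (e.g. "[restaurant_pricerange]") that contain word."""
    return sorted(
        t for t in vocab_tokens
        if t.startswith("[") and t.endswith("]") and word in t
    )


# Hypothetical excerpt of the vocabulary keys; the real list would come
# from data/vocab.json, whose exact layout is assumed here.
vocab_tokens = ["[UNK]", "[restaurant_area]", "[restaurant_pricerange]"]

candidates = slot_candidates("pricerange", vocab_tokens)
print(candidates)  # ['[restaurant_pricerange]']
```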
