The Tokenizer takes the token ‘pricerange’ as ‘[UNK]’? #1

HunYuanfeng · 2020-09-12T04:02:39Z

For example:
tokens:
['i', 'am', 'looking', 'for', 'a', 'restaurant', 'in', 'the', '[restaurant_area]', '.', 'postcode', 'type', 'phone', 'food', 'pricerange', 'address', 'area', 'name', 'id', 'reference']

input_ids:
[8, 35, 51, 15, 12, 45, 18, 9, 67, 6, 89, 117, 68, 88, 3, 82, 70, 346, 281, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tokenizer.convert_id_to_tokens(input_ids):
i am looking for a restaurant in the [restaurant_area] . postcode type phone food [UNK] address area name id reference.

The Tokenizer maps the token ‘pricerange’ to ‘[UNK]’, so the training code might not work.
Is this expected? Is there a bug in the source code?
I tried to examine the issue with:
tokenizer = Tokenizer(vocab, ivocab, False)
print(tokenizer.vocab_len) # 3130
print(tokenizer.get_word_id('pricerange')) # 3
print(tokenizer.get_word(3)) # [UNK]

fasterbuild · 2020-09-27T09:26:08Z

@HunYuanfeng According to the "data/vocab.json" file, 'pricerange' should probably be replaced with "[restaurant_pricerange]" in your sentence above.
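
One way to check this is to list the tokens that are missing from the vocabulary and see whether a bracketed slot entry exists for them. The snippet below is only a rough sketch: `find_unknown_tokens` and `toy_vocab` are made-up names, the id 100 is invented, and it assumes the vocabulary is a plain token-to-id mapping, which may not match the actual layout of "data/vocab.json".

```python
import re


def find_unknown_tokens(tokens, vocab):
    """Return (token, candidates) pairs for every token missing from vocab.

    candidates are vocabulary entries of the form "[<domain>_<token>]",
    e.g. "[restaurant_pricerange]" for "pricerange".
    """
    report = []
    for tok in tokens:
        if tok in vocab:
            continue
        # Match bracketed slot entries that end with "_<token>]".
        pattern = re.compile(r"^\[\w+_" + re.escape(tok) + r"\]$")
        candidates = sorted(v for v in vocab if pattern.match(v))
        report.append((tok, candidates))
    return report


# Toy vocabulary for illustration only (assumption: token -> id mapping).
# The real vocabulary would come from data/vocab.json; the id 100 is invented.
toy_vocab = {
    "[UNK]": 3,
    "[restaurant_area]": 67,
    "food": 88,
    "address": 82,
    "[restaurant_pricerange]": 100,
}

tokens = ["food", "pricerange", "address"]
print(find_unknown_tokens(tokens, toy_vocab))
# [('pricerange', ['[restaurant_pricerange]'])]
```

If a token shows up with an empty candidate list, it is simply absent from the vocabulary and would need to be handled some other way.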
