
[Badcase]: Why are the token IDs so strange when " 那明天呢?" is encoded with the Qwen2.5-3B tokenizer? #1149

Open
4 tasks done
dongyang-mt opened this issue Dec 26, 2024 · 2 comments

Comments

@dongyang-mt

Model Series

Qwen2.5

What are the models used?

Qwen/Qwen2.5-3B-Instruct

What is the scenario where the problem happened?

transformers

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu 22.04
Python: Python 3.10
GPUs: 8 x NVIDIA A100
NVIDIA driver: 535 (from nvidia-smi)
CUDA compiler: 12.1 (from nvcc -V)
PyTorch: 2.2.1+cu121 (from python -c "import torch; print(torch.__version__)")

Description

Steps to reproduce

import torch
from transformers import AutoTokenizer

# Load the tokenizer
model_name_or_path = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Example text
text = " 那明天呢?"

# Inspect each token
tokens = tokenizer.tokenize(text)
print("Tokenized Text:", tokens)

# Convert to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)

Encoded Text: " 那明天呢?"
Tokenized Text: ['Ġé', 'Ĥ', '£', 'æĺİ天', 'åij¢', '?']
Token IDs: [18137, 224, 96, 104807, 101036, 30]
Decoded Text: 那明天呢?
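
For reference, a minimal sketch that inspects the split more closely (it reuses tokenizer, tokens, and token_ids from the script above). Qwen2.5 uses a byte-level BPE vocabulary, so the leading space plus the multi-byte character 那 appears to be carried as raw UTF-8 byte pieces across the first three tokens, which is why the token strings look odd when printed:

# Minimal sketch (reusing tokenizer, tokens, token_ids from the script above):
# decode each token ID on its own; the byte-level pieces of " 那" may show up
# as the replacement character "�" until they are decoded together.
for tok, tid in zip(tokens, token_ids):
    print(repr(tok), tid, repr(tokenizer.decode([tid])))

# Decoding the full ID list restores the original string; repr() makes any
# leading space visible.
print(repr(tokenizer.decode(token_ids)))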


@1006076811

May I ask which tool this token calculator is?

@dongyang-mt
Author

May I ask which tool this token calculator is?

It is tokenizer = AutoTokenizer.from_pretrained(model_name_or_path).
Web-based token calculator link: https://dashscope.console.aliyun.com/tokenizer
The two give consistent results.
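
For anyone comparing the two, a small stand-alone sketch (model ID assumed from the issue above) that reproduces the local count to check against the DashScope page:

from transformers import AutoTokenizer

# Stand-alone check: encode the same text locally and compare the ID count
# with the web-based token calculator linked above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
text = " 那明天呢?"
ids = tokenizer.encode(text, add_special_tokens=False)
print(len(ids), ids)  # the issue above reports 6 IDs: [18137, 224, 96, 104807, 101036, 30]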
