
[Badcase]: Why are the token IDs so strange when " 那明天呢?" is encoded with the Qwen2.5-3B tokenizer? #1149

Open
4 tasks done
dongyang-mt opened this issue Dec 26, 2024 · 2 comments

Comments

@dongyang-mt

Model Series

Qwen2.5

What are the models used?

Qwen/Qwen2.5-3B-Instruct

What is the scenario where the problem happened?

transformers

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu 22.04
Python: Python 3.10
GPUs: 8 x NVIDIA A100
NVIDIA driver: 535 (from nvidia-smi)
CUDA compiler: 12.1 (from nvcc -V)
PyTorch: 2.2.1+cu121 (from python -c "import torch; print(torch.__version__)")

Description

Steps to reproduce

import torch
from transformers import AutoTokenizer

# Load the tokenizer
model_name_or_path = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Example text
text = " 那明天呢?"

# Inspect each token
tokens = tokenizer.tokenize(text)
print("Tokenized Text:", tokens)

# Convert to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)

Encoded Text: " 那明天呢?"
Tokenized Text: ['Ġé', 'Ĥ', '£', 'æĺİ天', 'åij¢', '?']
Token IDs: [18137, 224, 96, 104807, 101036, 30]
Decoded Text: 那明天呢?
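
For reference, a minimal sketch that inspects the split more closely (it reuses tokenizer, tokens, and token_ids from the script above). Qwen2.5 uses a byte-level BPE vocabulary, so the leading space plus the multi-byte character 那 appears to be carried as raw UTF-8 byte pieces across the first three tokens, which is why the token strings look odd when printed:

# Minimal sketch (reusing tokenizer, tokens, token_ids from the script above):
# decode each token ID on its own; the byte-level pieces of " 那" may show up
# as the replacement character "�" until they are decoded together.
for tok, tid in zip(tokens, token_ids):
    print(repr(tok), tid, repr(tokenizer.decode([tid])))

# Decoding the full ID list restores the original string; repr() makes any
# leading space visible.
print(repr(tokenizer.decode(token_ids)))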


@1006076811

May I ask which tool this token calculator is?

@dongyang-mt
Author

May I ask which tool this token calculator is?

It is tokenizer = AutoTokenizer.from_pretrained(model_name_or_path).
Web-based token calculator link: https://dashscope.console.aliyun.com/tokenizer
The two give consistent results.
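
For anyone comparing the two, a small stand-alone sketch (model ID assumed from the issue above) that reproduces the local count to check against the DashScope page:

from transformers import AutoTokenizer

# Stand-alone check: encode the same text locally and compare the ID count
# with the web-based token calculator linked above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
text = " 那明天呢?"
ids = tokenizer.encode(text, add_special_tokens=False)
print(len(ids), ids)  # the issue above reports 6 IDs: [18137, 224, 96, 104807, 101036, 30]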
