
Hello author, a question: the tokenizer vocab size does not match the model's embedding layer #39

Open
zhangzai666 opened this issue Apr 1, 2023 · 2 comments

Comments

@zhangzai666

Hello author, thank you for sharing the model. I asked you earlier about how to do pretraining.
I found that after loading the model, the embedding layer has 32128 rows, but the loaded tokenizer has a vocab size of 32228. The difference is the extra_0 to extra_100 sentinel tokens, which are required for pretraining. So how can I run pretraining on top of this model you shared, whose embedding size is only 32128?
[Screenshot: tokenizer vocab size]
[Screenshot: model embedding size]
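For reference, a minimal sketch of how to reproduce the mismatch, assuming the checkpoint is loaded with the standard Hugging Face `transformers` T5 classes (the checkpoint path below is a placeholder, not confirmed by this thread):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder checkpoint path; substitute the actual ChatYuan checkpoint being discussed.
checkpoint = "ClueAI/ChatYuan-large-v2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# len(tokenizer) counts the base vocabulary plus added tokens such as <extra_id_*>,
# while the embedding matrix only has as many rows as the model was trained with.
print("tokenizer size (incl. added tokens):", len(tokenizer))
print("embedding rows:", model.get_input_embeddings().weight.shape[0])
```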

@joytianya
Collaborator

This has been fixed; please reload the model and tokenizer.

@zhangzai666
Author

> This has been fixed; please reload the model and tokenizer.

Hello, thank you for your reply.
I just tried loading ChatYuan V2. It looks like the fix was to set the number of extra_id tokens to 0 when loading the vocabulary, so the tokenizer's vocab_size decreased by 100. But T5 pretraining needs extra_0 to extra_100, doesn't it? Shouldn't the model's embedding layer instead be enlarged to 32228 to accommodate these 100 mask (sentinel) tokens?
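One possible way to do what this comment suggests (enlarging the embedding instead of dropping the sentinel tokens) is sketched below, assuming a standard Hugging Face T5 setup. The checkpoint path and the `extra_ids=100` argument are assumptions, not something confirmed in this thread; depending on how the released tokenizer config was saved, the `extra_ids` argument may be unnecessary or may need adjusting.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "ClueAI/ChatYuan-large-v2"  # placeholder path

# Re-add the 100 sentinel tokens (<extra_id_0> ... <extra_id_99>) used by T5 span corruption.
tokenizer = T5Tokenizer.from_pretrained(checkpoint, extra_ids=100)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Grow the input embedding (and the LM head, if tied) to match the tokenizer; the new rows
# are randomly initialized and get trained during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)
```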
