Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 1 #9407

Merged
merged 15 commits into from
Nov 27, 2024

Conversation

yinfan98
Copy link
Contributor

PR types

Others

PR changes

Models

Description

为Bloom提供tokenizer fast支持,顺便想问一下。是对test里每个def我都要添加一个fast的测试吗~感谢🙏

Copy link

paddle-bot bot commented Nov 11, 2024

Thanks for your contribution!

@yinfan98 yinfan98 changed the title 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 1 Nov 11, 2024
@DrownFish19
Copy link
Collaborator

  • 如讨论结果,如果存在对应测试case,需要对齐添加。
  • 基础测试函数已经在common中添加,基本已经覆盖所有fast类型tokenizer,如果存在特殊case,可以在test_tokenizer.py中重写函数以跳过。

Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 52.96%. Comparing base (1ba3bef) to head (8907b0c).
Report is 14 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/bloom/tokenizer_fast.py 78.78% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9407      +/-   ##
===========================================
- Coverage    53.08%   52.96%   -0.12%     
===========================================
  Files          687      689       +2     
  Lines       109472   109412      -60     
===========================================
- Hits         58114    57952     -162     
- Misses       51358    51460     +102     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yinfan98
Copy link
Contributor Author

cc:@DrownFish19 麻烦再帮我看下pr吧,感谢

@@ -0,0 +1,131 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

辛苦在这里增加一下HuggingFace的Copyright

("bloom", "BloomTokenizer"),
(
"bloom",
("BloomTokenizer", "BloomTokenizerFast" if is_tokenizers_available() else None),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议这里换行一下,格式统一

"LlamaTokenizer": LlamaConverter,
"BertTokenizer": BertConverter,
}
SLOW_TO_FAST_CONVERTERS = {"LlamaTokenizer": LlamaConverter, "BertTokenizer": BertConverter}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的convert是可以通用吗?后续可以验证一下

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该没有新加bloom的,因为我看在hf上bloom只有fast,没有convert的流程

Copy link
Collaborator

@DrownFish19 DrownFish19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DrownFish19 DrownFish19 merged commit a9a6b80 into PaddlePaddle:develop Nov 27, 2024
10 of 12 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants