
tiny tweak to allow BatchEncoding.token_to_char when token doesn't correspond to chars #15901

Merged
merged 5 commits into huggingface:main on Apr 21, 2022

Conversation

garyhlai
Contributor

@garyhlai garyhlai commented Mar 2, 2022

Tagging @n1t0 @thomwolf @sgugger but this PR should be extremely quick to review for anyone.

Problem

BatchEncoding.token_to_chars is supposed to return the character span of a token in the original string; however, for tokens such as <s>, </s>, or [CLS] that don't correspond to any characters in the original string, it currently raises an error: TypeError: type object argument after * must be an iterable, not NoneType.

Run the following snippet to replicate:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "He is an absolutely amazing software developer"
tokenized_text = tokenizer(text)
tokenized_text.token_to_chars(0)  # raises TypeError: token 0 is the special [CLS] token
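The TypeError comes from star-unpacking None: the backing fast tokenizer has no character span to report for a special token, and the lookup result gets splatted into a CharSpan constructor. A minimal stand-alone illustration (CharSpan here is a stand-in for the transformers named tuple, so no transformers install is needed):

```python
from typing import NamedTuple

class CharSpan(NamedTuple):
    """Stand-in for transformers' CharSpan (start/end character indices)."""
    start: int
    end: int

# For special tokens like [CLS], the underlying fast tokenizer has no
# character span to report, so the span lookup yields None.
span_indices = None

try:
    CharSpan(*span_indices)  # star-unpacking None is what blows up
except TypeError as e:
    print(e)  # "... argument after * must be an iterable, not NoneType"
```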

Fix

The solution is to return None instead of raising an error for tokens not corresponding to any chars in the original string.
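A rough sketch of the changed logic (not the actual diff), assuming the underlying span lookup returns None for special tokens and CharSpan is the usual start/end named tuple:

```python
from typing import NamedTuple, Optional, Tuple

class CharSpan(NamedTuple):
    """Stand-in for transformers' CharSpan named tuple."""
    start: int
    end: int

def token_to_chars(span_indices: Optional[Tuple[int, int]]) -> Optional[CharSpan]:
    # Explicit value check: propagate None for tokens (e.g. [CLS]) that
    # have no character span, instead of attempting CharSpan(*None).
    if span_indices is None:
        return None
    return CharSpan(*span_indices)

print(token_to_chars(None))    # None
print(token_to_chars((0, 2)))  # CharSpan(start=0, end=2)
```

Callers then check for None rather than wrapping every lookup in a try/except.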

P.S.

I am lost as to why run_tests_torch failed on the step "if [ -f test_list.txt ]; then python -m pytest -n 3 --dist=loadfile -s --make-reports=tests_torch $(cat test_list.txt) | tee tests_output.txt fi". Some help would be appreciated.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@garyhlai
Contributor Author

Can I follow up on this? Would love to get some feedback -- even if it's about the specific reasons why this PR is not up to standard. Thanks!

Collaborator

@sgugger sgugger left a comment


LGTM, thanks for the ping! Sorry I didn't pick this PR earlier. @LysandreJik do you want to have a look too?

Review thread on src/transformers/tokenization_utils_base.py (outdated, resolved)
Member

@LysandreJik LysandreJik left a comment


This looks good to me too!

@LysandreJik
Member

Sorry for taking so long to review, could you just rebase on main so that the test suite passes?

cc @SaulLu for knowledge

@garyhlai
Contributor Author

Done! @sgugger @LysandreJik

@sgugger sgugger merged commit daf520b into huggingface:main Apr 21, 2022
@sgugger
Collaborator

sgugger commented Apr 21, 2022

Thanks again for your contribution!

Narsil pushed a commit to Narsil/transformers that referenced this pull request Apr 21, 2022
…rrespond to chars (huggingface#15901)

* tweak to allow BatchEncoding.char_to_token(0)

* update docstring

* remote trailing whitespace

* make fixup

* make value checking for span_indices explicit

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…rrespond to chars (huggingface#15901)

* tweak to allow BatchEncoding.char_to_token(0)

* update docstring

* remote trailing whitespace

* make fixup

* make value checking for span_indices explicit

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>