Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

SPLIT PR: add user defined symbols and control symbols #31305

Merged
merged 9 commits into from
Jun 21, 2024

Conversation

itazap
Copy link
Contributor

@itazap itazap commented Jun 7, 2024

Fixes portion of #30824 ->

  • adds user_defined_symbols and control_symbols from proto so that they are not split during encoding / decoding.
  • tests gemma
  • removed same logic from GemmaConverter as it is handled in general SpmConverter class
  • updates common test (!!!) bc fast != slow since fast an read from SPM converter and get user_defined_symbols and control_symbols. @ArthurZucker thoughts on this?
  • copied test from common to camember and rembert tests
  • updated deberta v2 tests to not use '.' in test cases' texts because it is a user added token for fast tokenizers, so the spacing around it differs from slow.
  • add_special_tokens and add_tokens has same effect for now - but fix is already merged in Tokenizers by @ArthurZucker remove enforcement of non special when adding tokens tokenizers#1521

TODO in new PR:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap requested a review from ArthurZucker June 7, 2024 10:17
@itazap itazap changed the title PR SPLIT: moving origina changes for adding user defined symbols PR SPLIT: moving original changes for adding user defined symbols Jun 12, 2024
@ArthurZucker ArthurZucker removed their request for review June 12, 2024 15:13
@itazap itazap changed the title PR SPLIT: moving original changes for adding user defined symbols PR SPLIT: add user defined symbols and control symbols Jun 13, 2024
@itazap itazap requested a review from ArthurZucker June 13, 2024 08:39
@itazap itazap marked this pull request as ready for review June 13, 2024 13:28
@itazap itazap force-pushed the user_defined_symbols_add_30824 branch 6 times, most recently from 97d0d35 to befadd1 Compare June 17, 2024 08:57
@itazap itazap force-pushed the user_defined_symbols_add_30824 branch from befadd1 to 904121e Compare June 17, 2024 09:00
@itazap itazap changed the title PR SPLIT: add user defined symbols and control symbols SPLIT PR: add user defined symbols and control symbols Jun 18, 2024
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think we should check the type of the defined symbols see this comment offline:

            if piece.type == 4:
                tokens_to_add += [piece.piece]

regarding trainer_spec.user_defined_symbols not giving the same results

AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
]

tokenizer.add_tokens(user_defined_symbols)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Related fix to only add once: huggingface/tokenizers@f2ec3b2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe using map will be faster? but anyway it's good enough fro what we want !

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change the encoded sequence here?

…e instead of trainer_spec for user_defined_symbols
@itazap itazap requested a review from ArthurZucker June 20, 2024 15:05
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! Last nit and good to go

# Add user defined symbols
user_defined_symbols = [
AddedToken(token, normalized=False, special=False)
for token in [p.piece for p in self.proto.pieces if p.type == 4]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment that references the documentation of sentencepiece as to why we use 4 (arbitrary number)

AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
]

tokenizer.add_tokens(user_defined_symbols + control_symbols)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this will work with the release of tokenizers, so fine for me! Good job

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also propagate to gemma! (only has this for now:

  user_defined_symbols = [
            AddedToken(token, normalized=True, special=False) for token in proto.trainer_spec.user_defined_symbols
        ]
        tokenizer.add_tokens(user_defined_symbols)
       

@itazap itazap merged commit 1e79ead into main Jun 21, 2024
24 checks passed
@itazap itazap deleted the user_defined_symbols_add_30824 branch June 21, 2024 08:48
itazap added a commit that referenced this pull request Jun 21, 2024
* PR SPLIT: moving origina changes for adding user defined symbols

* adding gemma test and generalizing gemma converter

* ruff

* update common test

* update serialization test

* deberta v2 tests updates as rust version adds '.' as a user added token, so a space is not added

* removing commented lines

* applying feedback - user only added_tokens to add and check piece.type instead of trainer_spec for user_defined_symbols

* add comment referencing sentencepiece
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants