Fix displaying text and tokens in Python containing invalid UTF-8 encodings for byte BPE models #501

Closed
wants to merge 4 commits into from

csukuangfj
Collaborator

Byte BPE models use UTF-8 encoded bytes to represent some words, so a predicted token sequence may not decode to valid UTF-8, which causes UTF-8 decoding errors in Python. This PR fixes that by ignoring invalid UTF-8 encodings.
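For illustration, here is a minimal sketch of the failure and of the "ignore invalid bytes" approach in plain Python. It is not the code from this PR: the helper `decode_lenient` and the sample byte strings are hypothetical, and the only library behavior assumed is the standard `errors="ignore"` handler of `bytes.decode`.

```python
# Simulate a byte BPE prediction that ends in the middle of a character.
# Each of the two characters in "你好" takes 3 bytes in UTF-8.
valid = "你好".encode("utf-8")
truncated = valid[:-1]  # the second character is now incomplete


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, silently dropping bytes that do not form a valid character.

    Hypothetical helper for this example only.
    """
    return data.decode("utf-8", errors="ignore")


# Strict decoding (the default) raises on the incomplete trailing bytes.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding keeps the characters that are complete.
lenient = decode_lenient(truncated)
print(strict_failed, lenient)
```

One trade-off of `errors="ignore"` is that the dropped bytes leave no trace in the output. `errors="replace"` would insert U+FFFD instead, which makes the loss visible in displayed text.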
