tokenize by script boundaries - only #327

Open
mediabuff opened this issue Mar 8, 2024 · 0 comments

I am trying to tokenize multilingual (or rather, multi-script) strings into components, where each component contains only one script (as defined by Unicode). I tried using -segment_alphabet_change, but this also breaks at spaces.
For example, the following string

the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति

should break into 4 tokens:

"the root" "कृ " "in the sense of frequency; e.g." "चर्करीति, चर्कर्ति, बोभवीति बोभोति"
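
To make the requested behavior concrete, here is a rough Python sketch of splitting only at script changes, outside of the tokenizer itself. It is not an existing option of this project. The function names `script_of` and `split_by_script` are made up for illustration, and the script detection is an approximation: it takes the first word of the Unicode character name (e.g. LATIN, DEVANAGARI) instead of the real Unicode Script property, which the Python standard library does not expose. Spaces, punctuation, digits and generic combining marks are treated as script-neutral and stay attached to the run before them.

```python
import unicodedata


def script_of(ch):
    """Approximate the script of ch; return None for script-neutral characters.

    Assumption: the first word of the Unicode character name stands in for
    the Script property. This is a heuristic, not the official property.
    """
    # Only letters (L*) and marks (M*) are treated as carrying a script.
    if unicodedata.category(ch)[0] not in ("L", "M"):
        return None
    name = unicodedata.name(ch, "")
    first = name.split(" ", 1)[0] if name else ""
    # Unnamed characters and generic combining marks count as neutral.
    if not first or first == "COMBINING":
        return None
    return first


def split_by_script(text):
    """Split text into runs, starting a new run only when the script changes."""
    tokens = []
    current = []
    current_script = None
    for ch in text:
        s = script_of(ch)
        # A character of a different script closes the current run.
        if s is not None and current_script is not None and s != current_script:
            tokens.append("".join(current))
            current = []
        if s is not None:
            current_script = s
        # Neutral characters simply extend whatever run is open.
        current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    sample = "the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति"
    for token in split_by_script(sample):
        # Strip the whitespace that neutral characters leave at run edges.
        print(repr(token.strip()))
```

Because neutral characters stay with the preceding run, tokens can carry trailing whitespace, hence the `strip()` in the usage part. For exact results, the heuristic in `script_of` would need to be replaced by a lookup of the real Script property, for example via a library that supports it.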
