Token to piece (#1) #4106

Open · cmp-nct wants to merge 4 commits into master

Conversation

cmp-nct (Contributor) commented Nov 17, 2023

This adds the option to un-tokenize special tokens, something I recently needed and could find no way to do through the exposed functions. Special tokens like EOS ("</s>" or "<|end-of-text|>") would return an empty string instead.

This adds a boolean that defaults to false; when set to true, special tokens are treated like normal tokens.

Example:

auto eos_token = llama_token_eos(model);
std::string eos_token_str = llama_token_to_piece(ctx, eos_token, true);

Please take a second, closer look; it appears to work fine on my end, but this is a central API function.

* Update common.h

* Update common.cpp

* Update llama.h

* Update llama.cpp

* Update llama.h

* Update common.cpp

* Update llama.cpp
@@ -550,7 +550,8 @@ extern "C" {
               const struct llama_model * model,
                              llama_token token,
                                   char * buf,
-                                     int length);
+                                     int length,
+                                    bool print_all_types);
KerfuffleV2 (Collaborator) commented:
I skimmed through the patch. From a logic standpoint, it doesn't look like your change will break anything. However, I think this is a breaking API change, since you can't set a default value there.

I don't know whether that's a serious problem, but if you wanted to avoid it, you could add a new API function instead and have the original llama_token_to_piece call it with your new flag set to false. That way callers would opt into the new behavior by using the new function.
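A minimal sketch of that wrapper approach (the llama_token_to_piece_special name is illustrative, not from the patch):

// llama.h: keep the old declaration unchanged, add an opt-in variant.
LLAMA_API int llama_token_to_piece(
        const struct llama_model * model,
                       llama_token token,
                            char * buf,
                               int length);

LLAMA_API int llama_token_to_piece_special(
        const struct llama_model * model,
                       llama_token token,
                            char * buf,
                               int length,
                              bool print_all_types);

// llama.cpp: the old function delegates with the flag off, so existing
// callers see no behavior change.
int llama_token_to_piece(const struct llama_model * model, llama_token token,
                         char * buf, int length) {
    return llama_token_to_piece_special(model, token, buf, length, /*print_all_types =*/ false);
}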

The other thing I noticed: the lack of a space after the comma in code like result.size(),special) doesn't conform to the existing style. Also, print_all_types=false) should probably have spaces around the equals sign.

cmp-nct (Author) commented:

You are right about the spaces; I typed the changes manually on GitHub from what I did in my local IDE and didn't think about that.

I considered creating just a _special version of the function, but all the other API functions use "bool special" or a similar boolean to switch special-token behavior, so a separate function would also break the code style a bit. I can change it that way, of course.

KerfuffleV2 (Collaborator) commented:

I'm not really qualified to say what's worth breaking the API over, so I was just suggesting a way to avoid it if you wanted to.

cmp-nct (Author) commented Nov 20, 2023

I'm not sure what to make of the Swift check that failed.

Regarding the API: it might even be worth changing the bool flag to an OR-able flag that can combine multiple llama_token_type enum values. That way you could tune exactly which token types are included.

I believe something like that is unavoidable if user-defined tokens are to be printable.
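A rough sketch of that OR-flag idea; the bitmask names below are hypothetical, only loosely based on the llama_token_type categories, and nothing here is in the patch:

// Hypothetical render mask; one bit per token-type category.
enum llama_token_render {
    LLAMA_RENDER_NORMAL       = 1 << 0,
    LLAMA_RENDER_CONTROL      = 1 << 1,  // e.g. BOS/EOS
    LLAMA_RENDER_USER_DEFINED = 1 << 2,
    LLAMA_RENDER_ALL          = LLAMA_RENDER_NORMAL | LLAMA_RENDER_CONTROL | LLAMA_RENDER_USER_DEFINED,
};

// Same shape as llama_token_to_piece, but the mask picks which
// token types are rendered instead of a single bool.
int llama_token_to_piece_ex(const struct llama_model * model,
                            llama_token token, char * buf, int length,
                            unsigned int render_mask);

// The caller selects exactly the categories it wants:
// llama_token_to_piece_ex(model, tok, buf, len,
//                         LLAMA_RENDER_NORMAL | LLAMA_RENDER_USER_DEFINED);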

KerfuffleV2 (Collaborator) commented:

> changing the bool flag to an OR-able flag

If you're going that far, you might as well switch to passing a settings struct. Then you can support multiple parameters, including ones that aren't boolean, and it's also more future-proof.
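A sketch of that settings-struct variant, reusing the hypothetical llama_token_to_piece_ex name from above (struct and field names are made up for illustration):

// Hypothetical settings struct; fields can be appended later without
// breaking the function signature again.
struct llama_token_to_piece_params {
    bool render_control_tokens;   // e.g. BOS/EOS
    bool render_user_defined;     // user-defined special tokens
    int  max_bytes;               // example of a non-boolean option
};

int llama_token_to_piece_ex(const struct llama_model * model,
                            llama_token token, char * buf, int length,
                            const struct llama_token_to_piece_params * params);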
