Token to piece (#1) #4106

Open · cmp-nct wants to merge 4 commits into master

Conversation

cmp-nct (Contributor) commented Nov 17, 2023

This adds the option to un-tokenize special tokens, something I recently needed and could find no way to do through the exposed functions. Special tokens like EOS ("</s>" or "<|end-of-text|>") would return an empty string instead.

This adds a boolean that defaults to false; when set to true, special tokens are treated like normal tokens.

Example:

auto eos_token = llama_token_eos(model);
std::string eos_token_str = llama_token_to_piece(ctx, eos_token, true);

Please take a second, closer look; it appears to work fine on my end, but this is a central API function.

* Update common.h

* Update common.cpp

* Update llama.h

* Update llama.cpp

* Update llama.h

* Update common.cpp

* Update llama.cpp
@@ -550,7 +550,8 @@ extern "C" {
               const struct llama_model * model,
                              llama_token token,
                                   char * buf,
-                                     int length);
+                                     int length,
+                                    bool print_all_types);
KerfuffleV2 (Collaborator) commented:
I skimmed through the patch. From a logic standpoint, it doesn't look like your change will break anything. However, I think this is a breaking API change, since you can't set a default value there.

I don't know whether that's a serious problem, but if you wanted to avoid it, you could add a new API function instead and have the original llama_token_to_piece call it with your new flag set to false. That way callers would opt into the new behavior by using the new function.
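A minimal sketch of that wrapper approach (the llama_token_to_piece_special name is illustrative, not from the patch):

// llama.h: keep the old declaration unchanged, add an opt-in variant.
LLAMA_API int llama_token_to_piece(
        const struct llama_model * model,
                       llama_token token,
                            char * buf,
                               int length);

LLAMA_API int llama_token_to_piece_special(
        const struct llama_model * model,
                       llama_token token,
                            char * buf,
                               int length,
                              bool print_all_types);

// llama.cpp: the old function delegates with the flag off, so existing
// callers see no behavior change.
int llama_token_to_piece(const struct llama_model * model, llama_token token,
                         char * buf, int length) {
    return llama_token_to_piece_special(model, token, buf, length, /*print_all_types =*/ false);
}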

The other thing I noticed: the lack of a space after the comma in code like result.size(),special) doesn't conform to the existing style. Also, print_all_types=false) should probably have spaces around the equals sign.

cmp-nct (Author) commented:

You are right about the spaces; I typed the changes manually on GitHub from what I did in my local IDE and didn't think about that.

I considered creating just a _special version of the function, but all the other API functions use "bool special" or a similar boolean to switch special-token behavior, so a separate function would also break the code style a bit. I can change it that way, of course.

KerfuffleV2 (Collaborator) commented:

I'm not really qualified to say what's worth breaking the API over, so I was just suggesting a way to avoid it if you wanted to.

cmp-nct (Author) commented Nov 20, 2023

I'm not sure what to make of the Swift check that failed.

Regarding the API: it might even be worth changing the bool flag to an OR-able flag that can combine multiple llama_token_type enum values. That way you could tune exactly which token types are included.

I believe something like that is unavoidable if user-defined tokens are to be printable.
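A rough sketch of that OR-flag idea; the bitmask names below are hypothetical, only loosely based on the llama_token_type categories, and nothing here is in the patch:

// Hypothetical render mask; one bit per token-type category.
enum llama_token_render {
    LLAMA_RENDER_NORMAL       = 1 << 0,
    LLAMA_RENDER_CONTROL      = 1 << 1,  // e.g. BOS/EOS
    LLAMA_RENDER_USER_DEFINED = 1 << 2,
    LLAMA_RENDER_ALL          = LLAMA_RENDER_NORMAL | LLAMA_RENDER_CONTROL | LLAMA_RENDER_USER_DEFINED,
};

// Same shape as llama_token_to_piece, but the mask picks which
// token types are rendered instead of a single bool.
int llama_token_to_piece_ex(const struct llama_model * model,
                            llama_token token, char * buf, int length,
                            unsigned int render_mask);

// The caller selects exactly the categories it wants:
// llama_token_to_piece_ex(model, tok, buf, len,
//                         LLAMA_RENDER_NORMAL | LLAMA_RENDER_USER_DEFINED);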

KerfuffleV2 (Collaborator) commented:

> changing the bool flag to an OR-able flag

If you're going that far, you might as well switch to passing a settings struct. Then you can support multiple parameters, including ones that aren't boolean, and it's also more future-proof.
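A sketch of that settings-struct variant, reusing the hypothetical llama_token_to_piece_ex name from above (struct and field names are made up for illustration):

// Hypothetical settings struct; fields can be appended later without
// breaking the function signature again.
struct llama_token_to_piece_params {
    bool render_control_tokens;   // e.g. BOS/EOS
    bool render_user_defined;     // user-defined special tokens
    int  max_bytes;               // example of a non-boolean option
};

int llama_token_to_piece_ex(const struct llama_model * model,
                            llama_token token, char * buf, int length,
                            const struct llama_token_to_piece_params * params);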
