
tinyBPE - Trainable tokenizer based on Byte-Pair Encoding similar to GPT-2 and GPT-4

For an LLM to understand text, it needs a translator between text and numbers, called a tokenizer. For LLMs, Byte-Pair Encoding (BPE) is the most widely used algorithm: it avoids both very small character-level tokenizers and very large word-level / n-gram-level tokenizers.

The algorithm is inspired by the following references.

File and Code Descriptions

  • tinyBPE/base.py contains the helper functions get_pair_counts, merge_pairs, replace_control_chars, and render_tokens, plus the base class BaseBPETokenizer for all tokenizers, with to_local and from_local functions (see the sketch after this list).

  • tinyBPE/bytelevel.py contains ByteLevelBPETokenizer, a basic tokenizer whose base splitting is at the byte level; the merges are then performed on those byte sequences.

  • tinyBPE/regexBPE.py implements the RegexBPETokenizer class, which uses regular expressions for the initial splitting to optimize how tokens are split.
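
To make the roles of these helpers concrete, here is a minimal sketch of a single BPE merge step. The function bodies below are hypothetical stand-ins approximating what get_pair_counts and merge_pairs in tinyBPE/base.py likely do; the actual signatures in the repository may differ.

from collections import Counter

def get_pair_counts(ids):
    # Count each adjacent token pair, e.g. [1, 2, 2, 3] -> {(1, 2): 1, (2, 2): 1, (2, 3): 1}
    return Counter(zip(ids, ids[1:]))

def merge_pairs(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with the single token `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # byte-level base tokens, ids 0-255
counts = get_pair_counts(ids)
top_pair = max(counts, key=counts.get)     # most frequent adjacent pair
ids = merge_pairs(ids, top_pair, 256)      # 256 is the first id beyond the raw bytes

Training repeats this pick-and-merge loop until the vocabulary reaches the requested vocab_size.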

Documentation

  1. Training & Inference

>>> from tinyBPE import ByteLevelBPETokenizer, RegexBPETokenizer
>>> tokenizer = ByteLevelBPETokenizer()
>>> text = """VERY_LONG_TEXT"""
>>> tokenizer.train(text, vocab_size=4096, verbose=True)
>>> tokenizer.to_local("tokenizer")
>>> # to_local generates tokenizer.tbpe, which is used for loading; tokenizer.vocab is a lossy version meant only for human inspection
>>> tokens = tokenizer.encode("VERY_LONG_TEXT")
>>> tokenizer.decode(tokens)
"VERY_LONG_TEXT"
  2. Inference with Special Tokens

>>> from tinyBPE import RegexBPETokenizer
>>> tokenizer = RegexBPETokenizer()
>>> sp_tokens = {
...     "<|startoftext|>": 256,
...     "<|endoftext|>": 257,
...     "<|midprompt|>": 258
... }
>>> text = "<|startoftext|> this is a new hello random text <|endoftext|>"
>>> tokenizer.add_special_tokens(sp_tokens)
>>> tokens = tokenizer.encode(text, consider_special_tokens="ALL")
>>> tokenizer.decode(tokens)
"<|startoftext|> this is a new hello random text <|endoftext|>"

My Contributions:

  1. Consolidated repetitive function calls into a single call.
  2. Updated GPT2_SPLIT_PATTERN and GPT4_SPLIT_PATTERN in regexBPE.py to handle multiple languages (see the sketch below).
  3. Updated the to_local and from_local functions to remove the dependency on the order of merges in the .tbpe file.
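
As an illustration of regex-based pre-splitting across languages, the sketch below uses a GPT-4-style pre-tokenization pattern common in similar BPE implementations; the patterns actually shipped in regexBPE.py may differ. It relies on the third-party regex module, because the standard re module does not support \p{L}/\p{N} classes or possessive quantifiers.

import regex  # third-party module: pip install regex

# GPT-4-style split pattern (illustrative; tinyBPE's updated patterns may differ)
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

# \p{L} matches letters from any script, so non-Latin text is chunked sensibly too
print(regex.findall(GPT4_SPLIT_PATTERN, "Hello world! こんにちは 123"))
# ['Hello', ' world', '!', ' こんにちは', ' ', '123']

Because \p{L} covers letters from every script, words in non-Latin languages are kept as their own chunks instead of being fused with surrounding punctuation or whitespace.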
