
Proposal: CompressedIndex, a new Index Structure (#185) #187

Open
agourdel wants to merge 4 commits into main from opt/new-index-struct

Conversation


@agourdel agourdel commented Mar 1, 2025

This pull request is linked to proposal #185.

Hi guys,

I had an idea, so I went through with it to compare against the current behavior.
Here: https://github.com/agourdel/outlines-core/tree/opt/new-index-struct you can find a version of outlines-core with a new index structure called CompressedIndex.
The details of the structure can be found in ./src/index/compressed_index.README (thanks LLM), but the main idea is this: a hashmap is expensive in memory and slow to access when it has to store a lot of transitions and a lot of states. So, what if we stored every state's allowed tokens in a vector of bitmasks (token_masks)?

I started by adding an abstraction layer called IndexVariant between the Guide object and the Index objects, to allow Guides backed by different kinds of index structures.

// ./src/index/mod.rs
pub enum IndexVariant {
    Standard(Index),
    Compressed(CompressedIndex),
}

-----------
// ./src/python_bindings/mod.rs

#[pyclass(name = "Index", module = "outlines_core.outlines_core_rs")]
#[derive(Clone, Debug, PartialEq, Encode, Decode)]
pub struct PyIndex(Arc<IndexVariant>);
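
For context, here is a minimal sketch of how the Guide side could dispatch over the enum. The method names (initial_state, is_final_state) are illustrative assumptions, not necessarily the real outlines-core API:

// Dispatch sketch only; illustrative method names, not the exact API.
impl IndexVariant {
    pub fn initial_state(&self) -> StateId {
        match self {
            IndexVariant::Standard(index) => index.initial_state(),
            IndexVariant::Compressed(index) => index.initial_state(),
        }
    }

    pub fn is_final_state(&self, state: StateId) -> bool {
        match self {
            IndexVariant::Standard(index) => index.is_final_state(state),
            IndexVariant::Compressed(index) => index.is_final_state(state),
        }
    }
}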

Then I built the CompressedIndex structure. I coded it with a standard Index as the constructor parameter because I was lazy, but it could/should be instantiated from a regex and a vocabulary, the way the standard Index is.

pub struct CompressedIndex {
    initial_state: StateId,
    final_states: HashSet<StateId>,

    pub state_to_index: HashMap<StateId, usize>,
    pub state_offsets: Vec<usize>,
    pub next_states: Vec<StateId>,

    pub token_masks: Vec<Vec<u64>>, // one bitmask per state, one bit per vocabulary token

    eos_token_id: TokenId,
    vocab_size: usize,
    transitions: HashMap<StateId, HashMap<TokenId, StateId>>, // Useless but needed to be IndexBehavior Compliant
}
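
To make the token_masks idea concrete, here is a minimal sketch (my own illustration, not code from this PR) of how an allowed-token check against a per-state bitmask row could look:

// Sketch only: each state owns one bitmask row with one bit per vocabulary
// token, so checking a token is two index operations and a bit test.
fn is_token_allowed(token_masks: &[Vec<u64>], state_index: usize, token_id: u32) -> bool {
    let word = (token_id / 64) as usize;
    let bit = token_id % 64;
    token_masks[state_index][word] & (1u64 << bit) != 0
}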

Then I benchmarked a few regex use cases, as follows, with the GPT-2 vocabulary:

let regexes = vec![
        (
            "email", r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        ),
        ("simple_phone", r"\+?[1-9][0-9]{7,14}"),
        (
            "complex_phone", r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}",
        ),
        ("permissive_any", r".{255}$"),
        ("permissive_words", r"[a-zA-Z]{100}"),
        (
            "schema_simple",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}"#,
        ),
        (
            "schema_simple_phone",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}, "complexe_phone": {"type": "string", "pattern": "\\+?\\d{1,4}?[-. ]?\\(\\d{1,3}\\)?[-. ]?\\d{1,4}[-. ]?\\d{1,4}[-. ]?\\d{1,9}"}}, "required": ["name", "age", "complexe_phone"]}"#,
        ),
        (
            "schema_complexe",
            r###"{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Schema for a recording",
  "type": "object",
  "definitions": {
    "artist": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "functions": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["id", "name", "functions"]
    }
  },
  "properties": {
    "id": {"type": "number"},
    "work": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "composer": {"$ref": "#/definitions/artist"}
      }
    },
    "recording_artists": {
      "type": "array",
      "items": {"$ref": "#/definitions/artist"}
    }
  },
  "required": ["id", "work", "recording_artists"]
}"###,
        ),
    ];

So, five regexes and three JSON structures.

First of all, I wanted to know the size of each index in memory for each regex.
(These measurements are taken in Rust-land.)
[Image: memory footprint of the standard Index vs. the CompressedIndex for each regex]

As you can see, the results are a bit meandering, but after investigation it turns out that the determining factor is the transitions/states ratio:
the higher it is, the more memory the CompressedIndex saves. So the bigger the vocabulary, the bigger the savings,
with a break-even point around 1,200 transitions per state (a rough estimate of why is sketched below).
(The benchmark used is ./src/index/test_bench_memory.rs)
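
For a rough intuition about that break-even point (my own back-of-the-envelope estimate, not part of the benchmark): a bitmask row costs about vocab_size / 8 bytes per state no matter how many transitions leave that state, while the hashmap cost grows with the number of transitions, so the mask starts winning once the per-state transition count exceeds (vocab_size / 8) divided by the bytes a hashmap entry costs.

// Rough estimate only; 50,257 is the nominal GPT-2 vocabulary size.
fn mask_bytes_per_state(vocab_size: usize) -> usize {
    vocab_size.div_ceil(64) * 8 // one u64 word per 64 tokens
}

fn main() {
    println!("GPT-2: {} bytes/state", mask_bytes_per_state(50_257)); // ~6.3 KB
    println!("Llama-3.1: {} bytes/state", mask_bytes_per_state(128_257)); // ~16 KB
}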

After that, I decided to make them compete on computing performance.
In Python-land, for each regex I created one Guide with the standard Index and one Guide with the CompressedIndex, then made both take exactly the same random path through the DFA for that regex.

The times displayed correspond only to the time needed to perform the advances; each advance is one iteration.
The technical difference between the two indexes is that with the standard one, each iteration ends with a list of allowed token ids, while with the compressed one it ends with a bitmask over the entire vocabulary, where bit == 1 means the token is allowed. This run uses GPT-2. (A sketch of the difference follows.)
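
To illustrate the measured difference, here is a minimal sketch (illustrative signatures, not the actual Guide API): per advance, the standard path materializes a fresh Vec of allowed token ids, while the compressed path only hands back the precomputed bitmask row of the current state.

use std::collections::HashMap;

// Standard index: one allocation plus one push per allowed token, per advance.
fn allowed_tokens_standard(state_transitions: &HashMap<u32, u32>) -> Vec<u32> {
    state_transitions.keys().copied().collect()
}

// Compressed index: a borrow of the precomputed mask row, O(1) per advance.
fn allowed_mask_compressed(token_masks: &[Vec<u64>], state_index: usize) -> &[u64] {
    &token_masks[state_index]
}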

[Image: compressed_performance, per-advance timings with the GPT-2 vocabulary]

(The benchmark used is ./benchmarks/bench_index_variant.py)

As you can see, the per-mask performance of the CompressedIndex is constant, whatever regex is used. And the deeper the path you take through the DFA (i.e. the more tokens generated), the more the speed-up ratio favors the CompressedIndex.

I ran the same benchmark with unsloth/Llama-3.1-8B-Instruct (128,257 tokens):

[Image: compressed_performance_128k, per-advance timings with the Llama-3.1-8B-Instruct vocabulary]

It all comes down to the size of the inference output (the number of generated tokens).
The regex plus the vocabulary size give us a transitions/states ratio.
The transitions/states ratio plus the vocabulary size give us an equilibrium: a number of generated tokens at which the standard Index and the CompressedIndex have the same performance.
Then, the further you go beyond that equilibrium, the better the CompressedIndex performs.
(This is why there are two "schema_complexe" lines, one with 283 generated tokens and one with 2,601 generated tokens, both following the same DFA.)

So, in conclusion, I think the CompressedIndex, or something like it, should be considered as a possible improvement for the future of outlines-core (for the public repo at least).

If I had to venture further, I would say we could get even better performance by no longer transforming the given input structure into a single regex/single DFA, but instead building an acyclic graph where only some nodes are DFAs for "local" regexes. There is no path dependency between a sub-regex for a phone number and a sub-regex for an email inside one JSON structure, yet by treating them as a single regex we create a combinatorial explosion at instantiation time: C(R1xR2) instead of just C(R1)+C(R2).

What do you think ?

@agourdel force-pushed the opt/new-index-struct branch from 159912b to 36d8a50 on March 2, 2025, 22:11
@agourdel (Author) commented Mar 3, 2025

I made a mistake.
I was returning the mask by value (a copy) between Python-land and Rust-land, and this is why the measurements from Python looked poor compared to Rust
(we can't pass a reference from Rust to Python, only from Python to Rust).
So I changed the API: we now create the mask on the Python side and pass it as a reference to Rust, which fills it in place (a sketch follows).
Now the CompressedIndex is significantly faster than the standard Index, no matter the vocabulary size, the regex, or the number of generated tokens.
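
A minimal sketch of the idea on the Rust side (illustrative signature, not the exact binding code): the caller supplies a preallocated buffer and Rust fills it in place, so only a reference crosses the Python/Rust boundary on each advance.

// Sketch only: `out` is a buffer owned by the Python side (e.g. exposed through
// the bindings) that Rust overwrites in place instead of returning a fresh copy.
fn write_allowed_mask(token_masks: &[Vec<u64>], state_index: usize, out: &mut [u64]) {
    out.copy_from_slice(&token_masks[state_index]);
}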

Here are the new results for GPT-2 and 52,257 tokens:

[Image: updated GPT-2 results]

Here are the new results for unsloth/Llama-3.1-8B-Instruct and 128,256 tokens:

[Image: updated Llama-3.1-8B-Instruct results]
