
Proposal: CompressedIndex, a new Index Structure (#185) #187

Open
agourdel wants to merge 4 commits into main from opt/new-index-struct

Conversation


@agourdel agourdel commented Mar 1, 2025

This pull request is linked to proposal #185.

Hi guys,

I had an idea, so I went through with it to compare against the current behavior.
Here: https://github.com/agourdel/outlines-core/tree/opt/new-index-struct you can find a version of outlines-core with a new index structure called CompressedIndex.
The details of the structure can be found in ./src/index/compressed_index.README (thanks LLM), but the main idea is this: a hashmap is expensive in memory and slow to access when it has to store a lot of transitions and a lot of states. So, what if we stored every state's allowed tokens in a vector of bitmasks (token_masks)?

I started by adding an abstraction layer called IndexVariant between the Guide object and the Index objects, to allow Guides backed by different kinds of index structures.

// ./src/index/mod.rs
pub enum IndexVariant {
    Standard(Index),
    Compressed(CompressedIndex),
}

-----------
// ./src/python_bindings/mod.rs

#[pyclass(name = "Index", module = "outlines_core.outlines_core_rs")]
#[derive(Clone, Debug, PartialEq, Encode, Decode)]
pub struct PyIndex(Arc<IndexVariant>);
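
For context, here is a minimal sketch of how the Guide side could dispatch over the enum. The method names (initial_state, is_final_state) are illustrative assumptions, not necessarily the real outlines-core API:

// Dispatch sketch only; illustrative method names, not the exact API.
impl IndexVariant {
    pub fn initial_state(&self) -> StateId {
        match self {
            IndexVariant::Standard(index) => index.initial_state(),
            IndexVariant::Compressed(index) => index.initial_state(),
        }
    }

    pub fn is_final_state(&self, state: StateId) -> bool {
        match self {
            IndexVariant::Standard(index) => index.is_final_state(state),
            IndexVariant::Compressed(index) => index.is_final_state(state),
        }
    }
}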

Then I built the CompressedIndex structure. I coded it with a standard Index as the constructor parameter because I was lazy, but it could/should be instantiated from a regex and a vocabulary, the way the standard Index is.

pub struct CompressedIndex {
    initial_state: StateId,
    final_states: HashSet<StateId>,

    pub state_to_index: HashMap<StateId, usize>,
    pub state_offsets: Vec<usize>,
    pub next_states: Vec<StateId>,

    pub token_masks: Vec<Vec<u64>>, // one bitmask per state, one bit per vocabulary token

    eos_token_id: TokenId,
    vocab_size: usize,
    transitions: HashMap<StateId, HashMap<TokenId, StateId>>, // Useless but needed to be IndexBehavior Compliant
}
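
To make the token_masks idea concrete, here is a minimal sketch (my own illustration, not code from this PR) of how an allowed-token check against a per-state bitmask row could look:

// Sketch only: each state owns one bitmask row with one bit per vocabulary
// token, so checking a token is two index operations and a bit test.
fn is_token_allowed(token_masks: &[Vec<u64>], state_index: usize, token_id: u32) -> bool {
    let word = (token_id / 64) as usize;
    let bit = token_id % 64;
    token_masks[state_index][word] & (1u64 << bit) != 0
}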

Then I benchmarked a few regex use cases, as follows, with the GPT-2 vocabulary:

let regexes = vec![
        (
            "email", r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        ),
        ("simple_phone", r"\+?[1-9][0-9]{7,14}"),
        (
            "complex_phone", r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}",
        ),
        ("permissive_any", r".{255}$"),
        ("permissive_words", r"[a-zA-Z]{100}"),
        (
            "schema_simple",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}"#,
        ),
        (
            "schema_simple_phone",
            r#"{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}, "complexe_phone": {"type": "string", "pattern": "\\+?\\d{1,4}?[-. ]?\\(\\d{1,3}\\)?[-. ]?\\d{1,4}[-. ]?\\d{1,4}[-. ]?\\d{1,9}"}}, "required": ["name", "age", "complexe_phone"]}"#,
        ),
        (
            "schema_complexe",
            r###"{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Schema for a recording",
  "type": "object",
  "definitions": {
    "artist": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "functions": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["id", "name", "functions"]
    }
  },
  "properties": {
    "id": {"type": "number"},
    "work": {
      "type": "object",
      "properties": {
        "id": {"type": "number"},
        "name": {"type": "string"},
        "composer": {"$ref": "#/definitions/artist"}
      }
    },
    "recording_artists": {
      "type": "array",
      "items": {"$ref": "#/definitions/artist"}
    }
  },
  "required": ["id", "work", "recording_artists"]
}"###,
        ),
    ];

So, five regexes and three JSON structures.

First of all, I wanted to know the size of each index in memory for each regex.
(These measurements are taken in Rust-land.)
[Image: memory footprint of the standard Index vs. the CompressedIndex for each regex]

As you can see, the results are a bit meandering, but after investigation it turns out that the determining factor is the transitions/states ratio:
the higher it is, the more memory the CompressedIndex saves. So the bigger the vocabulary, the bigger the savings,
with a break-even point around 1,200 transitions per state (a rough estimate of why is sketched below).
(The benchmark used is ./src/index/test_bench_memory.rs)
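
For a rough intuition about that break-even point (my own back-of-the-envelope estimate, not part of the benchmark): a bitmask row costs about vocab_size / 8 bytes per state no matter how many transitions leave that state, while the hashmap cost grows with the number of transitions, so the mask starts winning once the per-state transition count exceeds (vocab_size / 8) divided by the bytes a hashmap entry costs.

// Rough estimate only; 50,257 is the nominal GPT-2 vocabulary size.
fn mask_bytes_per_state(vocab_size: usize) -> usize {
    vocab_size.div_ceil(64) * 8 // one u64 word per 64 tokens
}

fn main() {
    println!("GPT-2: {} bytes/state", mask_bytes_per_state(50_257)); // ~6.3 KB
    println!("Llama-3.1: {} bytes/state", mask_bytes_per_state(128_257)); // ~16 KB
}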

After that, I decided to make them compete on computing performance.
In Python-land, for each regex I created one Guide with the standard Index and one Guide with the CompressedIndex, then made both take exactly the same random path through the DFA for that regex.

The times displayed correspond only to the time needed to perform the advances; each advance is one iteration.
The technical difference between the two indexes is that with the standard one, each iteration ends with a list of allowed token ids, while with the compressed one it ends with a bitmask over the entire vocabulary, where bit == 1 means the token is allowed. This run uses GPT-2. (A sketch of the difference follows.)
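
To illustrate the measured difference, here is a minimal sketch (illustrative signatures, not the actual Guide API): per advance, the standard path materializes a fresh Vec of allowed token ids, while the compressed path only hands back the precomputed bitmask row of the current state.

use std::collections::HashMap;

// Standard index: one allocation plus one push per allowed token, per advance.
fn allowed_tokens_standard(state_transitions: &HashMap<u32, u32>) -> Vec<u32> {
    state_transitions.keys().copied().collect()
}

// Compressed index: a borrow of the precomputed mask row, O(1) per advance.
fn allowed_mask_compressed(token_masks: &[Vec<u64>], state_index: usize) -> &[u64] {
    &token_masks[state_index]
}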

[Image: compressed_performance, per-advance timings with the GPT-2 vocabulary]

(The benchmark used is ./benchmarks/bench_index_variant.py)

As you can see, the per-mask performance of the CompressedIndex is constant, whatever regex is used. And the deeper the path you take through the DFA (i.e. the more tokens generated), the more the speed-up ratio favors the CompressedIndex.

I ran the same benchmark with unsloth/Llama-3.1-8B-Instruct (128,257 tokens):

[Image: compressed_performance_128k, per-advance timings with the Llama-3.1-8B-Instruct vocabulary]

It all comes down to the size of the inference output (the number of generated tokens).
The regex plus the vocabulary size give us a transitions/states ratio.
The transitions/states ratio plus the vocabulary size give us an equilibrium: a number of generated tokens at which the standard Index and the CompressedIndex have the same performance.
Then, the further you go beyond that equilibrium, the better the CompressedIndex performs.
(This is why there are two "schema_complexe" lines, one with 283 generated tokens and one with 2,601 generated tokens, both following the same DFA.)

So, in conclusion, I think the CompressedIndex, or something like it, should be considered as a possible improvement for the future of outlines-core (for the public repo at least).

If I had to venture further, I would say we could get even better performance by no longer transforming the given input structure into a single regex/single DFA, but instead building an acyclic graph where only some nodes are DFAs for "local" regexes. There is no path dependency between a sub-regex for a phone number and a sub-regex for an email inside one JSON structure, yet by treating them as a single regex we create a combinatorial explosion at instantiation time: C(R1xR2) instead of just C(R1)+C(R2).

What do you think ?

@agourdel force-pushed the opt/new-index-struct branch from 159912b to 36d8a50 on March 2, 2025, 22:11
@agourdel (Author) commented Mar 3, 2025

I made a mistake.
I was returning the mask by value (a copy) between Python-land and Rust-land, and this is why the measurements from Python looked poor compared to Rust
(we can't pass a reference from Rust to Python, only from Python to Rust).
So I changed the API: we now create the mask on the Python side and pass it as a reference to Rust, which fills it in place (a sketch follows).
Now the CompressedIndex is significantly faster than the standard Index, no matter the vocabulary size, the regex, or the number of generated tokens.
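
A minimal sketch of the idea on the Rust side (illustrative signature, not the exact binding code): the caller supplies a preallocated buffer and Rust fills it in place, so only a reference crosses the Python/Rust boundary on each advance.

// Sketch only: `out` is a buffer owned by the Python side (e.g. exposed through
// the bindings) that Rust overwrites in place instead of returning a fresh copy.
fn write_allowed_mask(token_masks: &[Vec<u64>], state_index: usize, out: &mut [u64]) {
    out.copy_from_slice(&token_masks[state_index]);
}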

Here are the new results for GPT-2 and 52,257 tokens:

[Image: updated GPT-2 results]

Here are the new results for unsloth/Llama-3.1-8B-Instruct and 128,256 tokens:

[Image: updated Llama-3.1-8B-Instruct results]
