[FIX] Error: "Initial token count exceeds token limit" #1941

Open
DvdNss opened this issue May 22, 2024 · 4 comments

Comments

DvdNss commented May 22, 2024

Hey there,

I've seen several issues reported regarding the error mentioned above, so I wanted to share the fix I found.

SPECS:

  • private-gpt version: 0.5.0
  • LLM used: Mistral 7B Instruct v0.2 OR Mistral 7B Instruct v0.1
  • Store type: Postgres DB

ERROR ENCOUNTERED:
When questioning the LLM about very long documents, it returns an empty result along with an error message on the Gradio UI: "Initial token count exceeds token limit."

ROOT CAUSE:
From examining the code of both private-gpt and llama_index (cache), it appears that llama_index does not account for sliding-window attention (Mistral used this mechanism in their models last year but has since stopped). Also, note that the memory buffer allocated to your context is based on the context_window parameter in your settings-xxx.yaml file: if you set your context_window to 1000 and pass a context of size 1001 to the buffer, it won't work.

# ~/.cache/pypoetry/virtualenvs/private-gpt-{your-cache-id-here}/lib/python3.11/site-packages/llama_index/core/context.py
# line 79
memory = memory or ChatMemoryBuffer.from_defaults(
    chat_history=chat_history,
    token_limit=llm.metadata.context_window - 256,  # <-- this is the context_window value from your settings-xxx.yaml file
)

EDIT: Confirmed, sliding-window attention is not supported in llama_index, see ggml-org/llama.cpp#3377

SOLUTION:

  • If you are using Mistral 7B Instruct v0.1:
    This LLM uses a sliding-window attention mechanism, where the context window repeats (or 'slides') across the context. According to the paper, this model's sliding-window size is 4096. However, its actual context window size is 8192 (see screenshot below). Therefore, the fix is to increase the context_window value in your settings-xxx.yaml file to 8192. Note that the theoretical attention span of this model is 131K according to the paper, so you can increase this value further, but inference will be slower and results will degrade as the context size increases.

[Screenshot referenced above: Mistral 7B Instruct v0.1 model card showing a context window of 8192]

  • If you are using Mistral 7B Instruct v0.2 (the default with PGPT 0.5.0 and local setup):
    This LLM does not use a sliding-window attention mechanism. In this case, simply increase the context_window value in your settings-xxx.yaml file to 32,000 (see https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2, it should be 32,768 = 2^15, but I couldn't find the paper with the exact scientific value, so I went with 32,000).

If you are trying to pass a context larger than the LLM's maximum, I'm afraid the only options are to split the documents OR to increase the context_window size even further; in the worst case you get unexpected errors, and in the best case the model simply ignores whatever exceeds its maximum attention span.
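
To make the fix concrete, here is a minimal sketch of the relevant part of a settings-xxx.yaml file. It assumes the standard private-gpt llm section; keep the rest of your existing llm settings as they are and treat the exact layout as illustrative.

# settings-xxx.yaml (sketch of the llm section only)
llm:
  # Mistral 7B Instruct v0.2: no sliding window, so expose the full ~32k context
  context_window: 32000
  # Mistral 7B Instruct v0.1 instead: sliding window of 4096, actual context window of 8192
  # context_window: 8192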

Hope this helps.

@marikan114

Thank you for the simple solution. I also used the following to get an early indication of the new token limit.

#1701 (comment)

@anamariaUIC

@DvdNss Thank you so much for this post. Can you please let me know if the max_new_tokens: value has to match the context_window value? I'm using Mistral 7B Instruct v0.2.



DvdNss commented Jul 10, 2024

@anamariaUIC AFAIK max_new_tokens defines the max number of tokens in the model's output, so it doesn't have to match the context_window value.
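
As a rough illustration of how the two settings relate (the values here are made up for the example, not a recommendation):

# settings-xxx.yaml (illustrative values only)
llm:
  context_window: 32000   # total token budget: roughly, the prompt (documents + question + chat history) and the generated answer must fit here together
  max_new_tokens: 1024    # caps only the length of the generated answer; does not need to match context_window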


anamariaUIC commented Jul 10, 2024

@DvdNss thank you so much. Which values would you recommend for
max_new_tokens:
context_window:

when querying CSV files?

Right now I have it set to:
max_new_tokens: 8000
context_window: 13000

And I am getting very poor results. Even the most basic questions, like how many rows or columns are in the file, can't be answered, and basic summary statistics questions are answered completely wrong. Any advice?

