[FIX] Error: "Initial token count exceeds token limit" #1941

Open
DvdNss opened this issue May 22, 2024 · 4 comments

Comments

DvdNss commented May 22, 2024

Hey there,

I've seen several issues reported regarding the error mentioned above, so I wanted to share the fix I found.

SPECS:

  • private-gpt version: 0.5.0
  • LLM used: Mistral 7B Instruct v0.2 OR Mistral 7B Instruct v0.1
  • Store type: Postgres DB

ERROR ENCOUNTERED:
When questioning the LLM about very long documents, it returns an empty result along with an error message on the Gradio UI: "Initial token count exceeds token limit."

ROOT CAUSE:
From examining the code of both private-gpt and llama_index (cache), it appears that llama_index does not account for sliding-window attention (Mistral used this mechanism in their models last year but has since stopped). Also, note that the memory buffer allocated to your context is based on the context_window parameter in your settings-xxx.yaml file: if you set your context_window to 1000 and pass a context of size 1001 to the buffer, it won't work.

# ~/.cache/pypoetry/virtualenvs/private-gpt-{your-cache-id-here}/lib/python3.11/site-packages/llama_index/core/context.py
# line 79
memory = memory or ChatMemoryBuffer.from_defaults(
    chat_history=chat_history,
    token_limit=llm.metadata.context_window - 256,  # <-- this is the context_window value from your settings-xxx.yaml file
)

EDIT: Confirmed, sliding-window attention is not supported in llama_index, see ggml-org/llama.cpp#3377

SOLUTION:

  • If you are using Mistral 7B Instruct v0.1:
    This LLM uses a sliding-window attention mechanism, where the context window repeats (or 'slides') across the context. According to the paper, this model's sliding-window size is 4096. However, its actual context window size is 8192 (see screenshot below). Therefore, the fix is to increase the context_window value in your settings-xxx.yaml file to 8192. Note that the theoretical attention span of this model is 131K according to the paper, so you can increase this value further, but inference will be slower and results will degrade as the context size increases.

[Screenshot referenced above: Mistral 7B Instruct v0.1 model card showing a context window of 8192]

  • If you are using Mistral 7B Instruct v0.2 (the default with PGPT 0.5.0 and local setup):
    This LLM does not use a sliding-window attention mechanism. In this case, simply increase the context_window value in your settings-xxx.yaml file to 32,000 (see https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2, it should be 32,768 = 2^15, but I couldn't find the paper with the exact scientific value, so I went with 32,000).

If you are trying to pass a context larger than the LLM's maximum, I'm afraid the only options are to split the documents OR to increase the context_window size even further; in the worst case you get unexpected errors, and in the best case the model simply ignores whatever exceeds its maximum attention span.
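
To make the fix concrete, here is a minimal sketch of the relevant part of a settings-xxx.yaml file. It assumes the standard private-gpt llm section; keep the rest of your existing llm settings as they are and treat the exact layout as illustrative.

# settings-xxx.yaml (sketch of the llm section only)
llm:
  # Mistral 7B Instruct v0.2: no sliding window, so expose the full ~32k context
  context_window: 32000
  # Mistral 7B Instruct v0.1 instead: sliding window of 4096, actual context window of 8192
  # context_window: 8192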

Hope this helps.

@marikan114

Thank you for the simple solution. I also used the following to get an early indication of the new token limit.

#1701 (comment)

@anamariaUIC

@DvdNss Thank you so much for this post. Can you please let me know if the max_new_tokens: value has to match the context_window value? I'm using Mistral 7B Instruct v0.2.



DvdNss commented Jul 10, 2024

@anamariaUIC AFAIK max_new_tokens defines the max number of tokens in the model's output, so it doesn't have to match the context_window value.
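
As a rough illustration of how the two settings relate (the values here are made up for the example, not a recommendation):

# settings-xxx.yaml (illustrative values only)
llm:
  context_window: 32000   # total token budget: roughly, the prompt (documents + question + chat history) and the generated answer must fit here together
  max_new_tokens: 1024    # caps only the length of the generated answer; does not need to match context_window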


anamariaUIC commented Jul 10, 2024

@DvdNss thank you so much. Which values would you recommend for
max_new_tokens:
context_window:

when querying CSV files?

Right now I have it set to:
max_new_tokens: 8000
context_window: 13000

And I am getting very poor results. Even the most basic questions, like how many rows or columns are in the file, can't be answered, and basic summary statistics questions are answered completely wrong. Any advice?

