How to figure out rate limits? #120

Calamari · 2023-11-02T12:49:14Z

Calamari
Nov 2, 2023

For OpenAI they specify rate limits here. They add fields to the header to show, how many tokens are still left. To build something that respects those limits and retries after the limit hast been reset, it would be great to have those in the response some.
I quickly searched in the code but could not find anything. Is there currently a way to handle this?

brainlid · 2023-11-03T04:11:26Z

brainlid
Nov 3, 2023
Maintainer

Hi @Calamari, no, there is currently is not a way to surface the rate information or the current token usage. The token limit one is a general think that applies to all LLMs and should be implemented. The token limit is per model too.

I haven't looked into this much at this point.

0 replies

Calamari · 2023-11-03T10:16:51Z

Calamari
Nov 3, 2023
Author

I am not quite sure about other models as OpenAI's but wouldn't it be a relatively easy solution to add a struct containing either the real response or the headers of that response as fourth element of the result tuple of the LLMChains run method?
I maybe can conjure up a PR if that is way that you think is viable.

0 replies

brainlid · 2023-11-03T12:45:30Z

brainlid
Nov 3, 2023
Maintainer

Here's what I mean by the model limits varying. Heads up: I'm mostly talking out loud here as I think through it too.

Here's the details on ChatGPT's 3.5 models:

https://platform.openai.com/docs/models/gpt-3-5

Notice they are 4K or 16K with one legacy of 8K.

Then the ChatGPT 4 models:

https://platform.openai.com/docs/models/gpt-4

Are 8K or 32K.

The cost for using the larger limit models is higher too.

The idea of LangChain is to abstract away some of the differences between different models so a config change swaps us to a different model. So the information about the token limit is relative. It's more about "how many tokens do I have left?"

I'd like to review how this is managed in the JS or Python LangChain too, since they've had more time to think about it and what's actually helpful.

0 replies

Calamari · 2023-11-03T12:52:11Z

Calamari
Nov 3, 2023
Author

I was also thinking along the lines of the meta info of "how many tokens are left and when do they reset". At least for ChatGPT they say, they provide those info as header parameters in the response. And I think that would make sense to somehow pass this through to the caller as well, so they can put in some form of rate limiting in place. As far as I can see, right now, if you make a call that brings you over the limit, you don't even get to know when it would reset.

0 replies

brainlid · 2023-11-03T13:02:42Z

brainlid
Nov 3, 2023
Maintainer

I looked into the JS version and they don't have anything documented at least. The Python version docs are much more complete here and I like their approach.

https://python.langchain.com/docs/modules/model_io/models/llms/token_usage_tracking

The caller can provide a callback to get that information. In an Elixir world, passing in an anonymous callback function could be all that's needed. Then after a call to the LLM, the callback fires with the information in a struct format.

Here's the example of the Python result information:

    Tokens Used: 42
        Prompt Tokens: 4
        Completion Tokens: 38
    Successful Requests: 1
    Total Cost (USD): $0.00084

0 replies

brainlid · 2023-11-03T13:04:08Z

brainlid
Nov 3, 2023
Maintainer

But this still doesn't tell me what I want to know, which is, "given the model I'm using, how many tokens do I have left?"

That becomes left up to me, the caller, to figure out.

0 replies

Calamari · 2023-11-03T13:21:14Z

Calamari
Nov 3, 2023
Author

A callback sounds like a nice idea. Looking at the API docs at least for OpenAI this information about, how many tokens are left is returned in the header as x-ratelimit-remaining-tokens and also the x-ratelimit-reset-requests so one can know when to schedule the next call, when the tokens that are left are too low. Those are infos we can present though that callback.

0 replies

brainlid · 2023-11-03T13:39:24Z

brainlid
Nov 3, 2023
Maintainer

There are two different types of limits being talked about here.

The max tokens allowed for the conversation (what the callback talks about)
The ratelimit tokens (what your API link talks about)

The ratelimit tokens are separate and focus on the number of tokens-per-minute that the callers's account is allowed to make. That count and limit resets based on time. The first one is a fixed count based on the conversation size.

The time-based rate limits are something a server might want to track and force their own limits on their users across requests. The size-based limits are hard limits and those force the need to summarize or start new conversations.

0 replies

Calamari · 2023-11-06T12:10:27Z

Calamari
Nov 6, 2023
Author

Yes, I am currently interested in the time-based rate limits. Since there is currently no way to track them, so the server cannot throttle anything or doesn't know when to retry.

0 replies

krrishdholakia · 2023-11-27T17:19:00Z

krrishdholakia
Nov 27, 2023

Hey @Calamari, I'm the maintainer of LiteLLM our Router (used for load balancing across different openai/azure/etc. endpoints) uses time-based limits as way to timeout + retry requests - https://github.com/BerriAI/litellm/blob/9b5f52ae635594aeba3cb6f2a3f81dd3da03e169/litellm/router.py#L190

Let me know how our implementation can be improved. Attaching sample code for quick start below.

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "gpt-3.5-turbo", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = await router.acompletion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)

0 replies

brainlid · 2024-01-22T23:38:24Z

brainlid
Jan 22, 2024
Maintainer

@Calamari Just a quick follow-up. I'm not sure how to best support this feature. I'm also thinking of Bumblebee based LLMs. I've been in talks with that team about getting token counts from that as well. Just letting you know that I'm tracking with it but not actively working on implementing it myself at this time.

0 replies

krrishdholakia · 2024-05-24T13:17:18Z

krrishdholakia
May 24, 2024

so the server cannot throttle anything

If you're trying to set rate limits wouldn't it make more sense to setup a proxy which can track the rate limits per deployment across all the calls in the project? @Calamari @brainlid

0 replies

brainlid · 2024-05-24T14:58:16Z

brainlid
May 24, 2024
Maintainer

Related to discussion #103

0 replies

brainlid · 2024-06-06T01:21:22Z

brainlid
Jun 6, 2024
Maintainer

As I explained in #103, the next version introduces a callback system that will now make it easy to expose this information.

0 replies

brainlid · 2024-06-08T04:42:39Z

brainlid
Jun 8, 2024
Maintainer

@Calamari I merged #133.

This adds support for ratelimit response information in the Callbacks. I've added it for OpenAI and Anthropic. It should be very easy to add it for additional services as well following the same pattern.

Hopefully this addresses your need!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to figure out rate limits? #120

{{title}}

Replies: 15 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

How to figure out rate limits? #120

Calamari Nov 2, 2023

Replies: 15 comments

brainlid Nov 3, 2023 Maintainer

Calamari Nov 3, 2023 Author

brainlid Nov 3, 2023 Maintainer

Calamari Nov 3, 2023 Author

brainlid Nov 3, 2023 Maintainer

brainlid Nov 3, 2023 Maintainer

Calamari Nov 3, 2023 Author

brainlid Nov 3, 2023 Maintainer

Calamari Nov 6, 2023 Author

krrishdholakia Nov 27, 2023

brainlid Jan 22, 2024 Maintainer

krrishdholakia May 24, 2024

brainlid May 24, 2024 Maintainer

brainlid Jun 6, 2024 Maintainer

brainlid Jun 8, 2024 Maintainer

Calamari
Nov 2, 2023

brainlid
Nov 3, 2023
Maintainer

Calamari
Nov 3, 2023
Author

brainlid
Nov 3, 2023
Maintainer

Calamari
Nov 3, 2023
Author

brainlid
Nov 3, 2023
Maintainer

brainlid
Nov 3, 2023
Maintainer

Calamari
Nov 3, 2023
Author

brainlid
Nov 3, 2023
Maintainer

Calamari
Nov 6, 2023
Author

krrishdholakia
Nov 27, 2023

brainlid
Jan 22, 2024
Maintainer

krrishdholakia
May 24, 2024

brainlid
May 24, 2024
Maintainer

brainlid
Jun 6, 2024
Maintainer

brainlid
Jun 8, 2024
Maintainer