Prompt Caching #234
base: main
Conversation
@crmne As I don't have an Anthropic key, I'll need you to generate the VCR cassettes for that provider. Hoping everything just works, but let me know if not.

@tpaulshippy this would be great to have! Would you be willing to enable it on all providers? I'll do a proper review when I can.

My five minutes of research indicates that at least OpenAI and Gemini take the approach of automatically caching for you based on the size and structure of your request. So the only support I think we'd really need for those two is to populate the cached token counts on the response messages. Unless we want to try to support explicit caching on the Gemini API, but that looks complex and not as commonly needed. Do you know of other providers that require payload changes for prompt caching?
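For the auto-caching providers, "populating the cached token counts" mostly means digging the right field out of each provider's usage payload. A minimal sketch, assuming the field names documented by OpenAI (`prompt_tokens_details.cached_tokens`) and Gemini (`cachedContentTokenCount`); the helper name is illustrative, not part of the PR:

```ruby
# Illustrative helper: pull the cached-token count out of a provider's
# usage payload. Field names follow each provider's public API docs;
# the method itself is a sketch, not the PR's actual code.
def cached_tokens_from(provider, usage)
  case provider
  when :openai
    # OpenAI nests the count under prompt_tokens_details
    usage.dig('prompt_tokens_details', 'cached_tokens') || 0
  when :gemini
    # Gemini reports it directly in usageMetadata
    usage['cachedContentTokenCount'] || 0
  else
    0
  end
end

openai_usage = {
  'prompt_tokens' => 2006,
  'prompt_tokens_details' => { 'cached_tokens' => 1920 }
}
cached_tokens_from(:openai, openai_usage) # => 1920
```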
```ruby
def with_cache_control(hash, cache: false)
  return hash unless cache

  hash.merge(cache_control: { type: 'ephemeral' })
end
```
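For illustration, here is how that helper behaves; the helper is reproduced so the example runs standalone:

```ruby
# Reproduced from the snippet above so this example is self-contained.
def with_cache_control(hash, cache: false)
  return hash unless cache

  hash.merge(cache_control: { type: 'ephemeral' })
end

block = { type: 'text', text: 'You are a helpful assistant.' }
with_cache_control(block)
# => { type: 'text', text: 'You are a helpful assistant.' } (unchanged)
with_cache_control(block, cache: true)
# => same hash plus cache_control: { type: 'ephemeral' }
```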
Realizing this might cause errors on older models that do not support caching. If it does, we could raise here, or just let the API validation handle it. I'm torn on whether the capabilities check complexity is worth it as these models are probably so rarely used.
Scratch that. I decided to stop being a cheapskate and just pay Anthropic their $5.

Looking to implement this in our project, and now I'm wondering if it should be opt-out rather than opt-in. If you are using unique prompts every time, I guess it adds some cost to cache them, but my guess is that in most applications prompts will get repeated, especially system prompts.
Thank you for this feature @tpaulshippy, however there are several improvements I'd like you to make before we merge this.
On top of the ones noted in the comments, the most important one: I'd like to have prompt caching implemented in all providers.
Plus, I have not fully checked the logic in providers/anthropic, but the patch seems a bit heavy-handed with the amount of changes needed at first glance. Were all the changes necessary, or could it be done in a simpler manner?
Did you see this? Is the request to populate the cached token counts on the response messages for OpenAI and Gemini?

Thank you for pointing that out, I had missed it. I think it would certainly be a nice addition to RubyLLM to have all providers offer almost the same level of caching support.

Ok, we have a bit of a naming issue. Anthropic, OpenAI, and Gemini each report cached token usage under different property names. My reading of the docs indicates that the OpenAI and Gemini values correspond pretty closely with Anthropic's cache-read value. What should we call these properties in the Message?
For the naming, let's go with:
This keeps it consistent with our existing naming. Can you update the Message properties to use these names? Thanks Paul!
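For illustration, a Message carrying cached-token counts alongside the usual input/output counts might look like this. The attribute names `cached_tokens` and `cache_creation_tokens` are assumptions for the sketch, not necessarily the names that were chosen:

```ruby
# Sketch only: attribute names are assumptions, not the merged API.
Message = Struct.new(:role, :content, :input_tokens, :output_tokens,
                     :cached_tokens, :cache_creation_tokens,
                     keyword_init: true)

msg = Message.new(role: :assistant, content: 'hi',
                  input_tokens: 2006, output_tokens: 300,
                  cached_tokens: 1920, cache_creation_tokens: 0)
msg.cached_tokens # => 1920
```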
Had to play with the length/complexity of the prompts. Finally got them to cache. Should be ready to go.

This will be huge for me - possible to get a review and merge? Thanks!

I rolled this out 2 weeks ago. So far the numbers show that I am paying 55% less than I would have without this.
Paul, this PR is 90% VCR cassettes, but it's still a monster and I feel like the actual code is overengineered. Passing `cache_control` through 7 layers? No thanks.
However, how about embracing the fact that there may be moments we want to augment each message?
```ruby
# Just add message-level params
chat.add_message(
  role: :user,
  content: "huge doc",
  params: { cache_control: { type: 'ephemeral' } }
)
```

That's it. Add `params` to Message.

We can then even add:

```ruby
chat.with_message_params(cache_control: { type: 'ephemeral' })
```

in order to apply it to every message. That's it.
Then adding a provider agnostic implementation of cache control becomes trivial.
That could work for system and user messages. How about tools? I still think there will be some complexity in the Anthropic provider that may be unavoidable.
One complication is that these params have to be nested under the content array for Anthropic. So it would actually need to be something like this:
Right?
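For context, Anthropic's documented request shape puts `cache_control` on a content block inside the content array, not at the message level. A sketch of the two shapes:

```ruby
# Message-level placement (what a naive with_message_params would
# produce) is NOT what the Anthropic API expects:
#   { role: 'user', content: 'huge doc', cache_control: { type: 'ephemeral' } }

# Content-block placement, per Anthropic's prompt caching docs:
message = {
  role: 'user',
  content: [
    {
      type: 'text',
      text: 'huge doc',
      cache_control: { type: 'ephemeral' }
    }
  ]
}

message[:content].first[:cache_control] # => { type: 'ephemeral' }
```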
I don't think adding the params to all messages with chat.with_message_params will generally work as you suggest. Couple of reasons:
Would like to explore making this a per-message option, and maybe something like a with_tools_params option to make this more provider-agnostic.
Thanks for the feedback @crmne
```ruby
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')

# Enable caching for different types of content
chat.cache_prompts(
```
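The arguments to `cache_prompts` were cut off above. Based on the PR description (caching of system prompts, user prompts, and tool definitions), the call presumably takes a flag per content type. A runnable stand-in, where the keyword names and the stub class are assumptions, not the gem's real API:

```ruby
# Minimal stand-in for the chat object so the intended call shape runs
# without the real gem; keyword names are inferred from the PR
# description (system prompts, user prompts, tool definitions).
class FakeChat
  attr_reader :cache_settings

  def initialize
    @cache_settings = {}
  end

  # Record which prompt types should carry cache_control markers.
  def cache_prompts(system: false, user: false, tools: false)
    @cache_settings = { system: system, user: user, tools: tools }
    self
  end
end

chat = FakeChat.new
chat.cache_prompts(system: true, tools: true)
chat.cache_settings # => { system: true, user: false, tools: true }
```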
The more I think about it, the more I'm thinking it is odd to have a provider-specific feature enabled like this. I wish Anthropic was just more like the others in this area.

Looks like to make this work, deep_merge will have to be enhanced to support arrays. Currently it just overrides the whole array rather than merging the hashes within. Will need something like this:
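The snippet that followed was lost. A sketch of what an array-aware deep_merge could look like (illustrative, not the PR's actual patch): hashes merge recursively, and arrays merge element-wise instead of the override array winning wholesale.

```ruby
# Sketch of a deep_merge that also merges hashes positionally inside
# arrays, instead of letting the override array replace the base array.
def deep_merge(base, override)
  base.merge(override) do |_key, old_val, new_val|
    if old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge(old_val, new_val)
    elsif old_val.is_a?(Array) && new_val.is_a?(Array)
      # Zip-merge element-wise; extra elements from the longer array survive.
      length = [old_val.length, new_val.length].max
      Array.new(length) do |i|
        a = old_val[i]
        b = new_val[i]
        a.is_a?(Hash) && b.is_a?(Hash) ? deep_merge(a, b) : (b.nil? ? a : b)
      end
    else
      new_val
    end
  end
end

payload = { content: [{ type: 'text', text: 'huge doc' }] }
extra   = { content: [{ cache_control: { type: 'ephemeral' } }] }
deep_merge(payload, extra)
# => { content: [{ type: 'text', text: 'huge doc',
#                  cache_control: { type: 'ephemeral' } }] }
```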
What this does
Supports prompt caching configuration in both the Anthropic and Bedrock providers for Claude models that support it, and reports cached token counts for OpenAI and Gemini, which cache automatically.
Caching system prompts:
Caching user prompts:
Caching tool definitions:
Type of change

Scope check

Quality check
- Ran `overcommit --install` and all hooks pass
- No manual edits to auto-generated files (`models.json`, `aliases.json`)

API changes
Related issues
Resolves #13