-
Notifications
You must be signed in to change notification settings - Fork 869
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add per namespace burst limit control #2067
Conversation
* Add per namespace burst limit control * Default per namespace burst limit to be 4800, 2x of rate limit * Update logic accordingly
@@ -32,66 +32,116 @@ type ( | |||
// RateFn returns a float64 as the RPS | |||
RateFn func() float64 | |||
|
|||
// BurstFn returns a int as the RPS | |||
// BurstFn returns an int as the burst / bucket size | |||
BurstFn func() int | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove these 2 fn, since we are using interface to define the methods.
common/quotas/dynamic.go
Outdated
Load() int | ||
Store(burst int) | ||
BurstFn() BurstFn | ||
DynamicRateBurst interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to MutableRateLimiter?
// for incoming traffic | ||
func NewDefaultIncomingDynamicRateLimiter( | ||
func NewDefaultIncomingRateLimiter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove Default from name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see below
// for outgoing traffic | ||
func NewDefaultOutgoingDynamicRateLimiter( | ||
func NewDefaultOutgoingRateLimiter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove Default from name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default indicate the ratio, either butst = 1 * rate * 1 second
or butst = 2 * rate * 1 second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name let me think that this creates a default implementation of RateLimiter for outgoing calls. The name should be NewOutgoingRateLimiterWithDefaultRatio. I'm not suggesting that you have to change this name in this PR, just state what confuses me as a reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think the DynamicRateBurst should be MutableRateBurst. But it is your call.
common/quotas/dynamic.go
Outdated
|
||
return &DynamicBurstImpl{ | ||
func NewDynamicRateBurst( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamic in the name really give impression that this is based on dynamic config, but it is just like mutable that can be set by code. There is another NewDynamicRateLimiter() which as the name suggested is based on dynamic config. Wouldn't MutableRateBurst more appropriate here?
// for outgoing traffic | ||
func NewDefaultOutgoingDynamicRateLimiter( | ||
func NewDefaultOutgoingRateLimiter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name let me think that this creates a default implementation of RateLimiter for outgoing calls. The name should be NewOutgoingRateLimiterWithDefaultRatio. I'm not suggesting that you have to change this name in this PR, just state what confuses me as a reader.
<!-- Describe what has changed in this PR --> **What changed?** This PR changes the behavior of `RateLimitInterceptorProvider`. Instead of providing rate limiters to `NewRequestToRateLimiter`, we provide `RateBurst` functions. <!-- Tell your future self why have you made these changes --> **Why?** Because this was causing a bug: _Impact:_ These dynamic configs are never updated after the server starts: - `frontend.rps` - `frontend.globalRPS` - `frontend.rps.namespaceReplicationInducingAPIs` Additionally, frontend.globalRPS is rendered completely unusable because it calculates the number of members to be 0 on startup, so it defaults to frontend.rps to avoid a division-by-zero when calculating the effective per-isntance limit: _Root cause:_ We construct a quota calculator [here](https://github.com/temporalio/temporal/blob/c7847968b309eb71edbe1da6f6cfa4d1c0b21341/service/frontend/fx.go#L305) . Then, we convert that to a RateFn and make 3 different DynamicRateLimiterImpl objects from it. We then pass all of those objects to NewRequestToRateLimiter. However, that function actually just accepts RateBurst objects, not RateLimiters. This is important because, since they’re just RateBurst objects, only their Rate and Burst methods may be used. However, those are also the only two methods in DynamicRateLimiterImpl which don’t call maybeRefresh. As a result, none of [these rate limiters](https://github.com/temporalio/temporal/blob/c7847968b309eb71edbe1da6f6cfa4d1c0b21341/service/frontend/fx.go#L314-L317) get updated dynamically. This bug was originally introduced in this PR [Add per namespace burst limit control by wxing1292 · Pull Request #2067 · temporalio/temporal](#2067), specifically this commit: [Add per namespace burst limit control (#2067) · af181b4](af181b4) . After that, it was repeated by two more configs which were added below it using the same strategy. So, for 2 years (since Oct 26, 2021), frontend.rps has not been updating dynamically. _Evidence:_ Here’s a commit which demonstrates the problem: [Demonstrate that maybeRefresh is never called for these dynamic rate … · 9578140](9578140) Solution: There are a couple of ways to fix it, but the most obvious is to simply not wrap the rate burst functions in a dynamic rate limiter to begin with because it’s actually doing the exact opposite. It’s taking what would be dynamically-updated rate limits and making them constant. <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> **How did you test it?** Testing this relies on time. To avoid a large refactor propagating clock.TimeSource, I wrote a test with sleeps in it. However, I don't want to commit that because it also requires us changing the refresh interval of the dynamic rate limiter. Instead, I'll just provide a link to the commit containing the changes here: 0e4b4c2 . This commit adds a test which verifies that, if we update the rps, the rate limiter itself also updates its rate. I also tested this without this change and verified that the test fails. <!-- Assuming the worst case, what can be broken when deploying this change to production? --> **Potential risks** <!-- Is this PR a hotfix candidate or require that a notification be sent to the broader community? (Yes/No) --> **Is hotfix candidate?**
<!-- Describe what has changed in this PR --> **What changed?** This PR changes the behavior of `RateLimitInterceptorProvider`. Instead of providing rate limiters to `NewRequestToRateLimiter`, we provide `RateBurst` functions. <!-- Tell your future self why have you made these changes --> **Why?** Because this was causing a bug: _Impact:_ These dynamic configs are never updated after the server starts: - `frontend.rps` - `frontend.globalRPS` - `frontend.rps.namespaceReplicationInducingAPIs` Additionally, frontend.globalRPS is rendered completely unusable because it calculates the number of members to be 0 on startup, so it defaults to frontend.rps to avoid a division-by-zero when calculating the effective per-isntance limit: _Root cause:_ We construct a quota calculator [here](https://github.com/temporalio/temporal/blob/c7847968b309eb71edbe1da6f6cfa4d1c0b21341/service/frontend/fx.go#L305) . Then, we convert that to a RateFn and make 3 different DynamicRateLimiterImpl objects from it. We then pass all of those objects to NewRequestToRateLimiter. However, that function actually just accepts RateBurst objects, not RateLimiters. This is important because, since they’re just RateBurst objects, only their Rate and Burst methods may be used. However, those are also the only two methods in DynamicRateLimiterImpl which don’t call maybeRefresh. As a result, none of [these rate limiters](https://github.com/temporalio/temporal/blob/c7847968b309eb71edbe1da6f6cfa4d1c0b21341/service/frontend/fx.go#L314-L317) get updated dynamically. This bug was originally introduced in this PR [Add per namespace burst limit control by wxing1292 · Pull Request #2067 · temporalio/temporal](#2067), specifically this commit: [Add per namespace burst limit control (#2067) · af181b4](af181b4) . After that, it was repeated by two more configs which were added below it using the same strategy. So, for 2 years (since Oct 26, 2021), frontend.rps has not been updating dynamically. _Evidence:_ Here’s a commit which demonstrates the problem: [Demonstrate that maybeRefresh is never called for these dynamic rate … · 9578140](9578140) Solution: There are a couple of ways to fix it, but the most obvious is to simply not wrap the rate burst functions in a dynamic rate limiter to begin with because it’s actually doing the exact opposite. It’s taking what would be dynamically-updated rate limits and making them constant. <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> **How did you test it?** Testing this relies on time. To avoid a large refactor propagating clock.TimeSource, I wrote a test with sleeps in it. However, I don't want to commit that because it also requires us changing the refresh interval of the dynamic rate limiter. Instead, I'll just provide a link to the commit containing the changes here: 0e4b4c2 . This commit adds a test which verifies that, if we update the rps, the rate limiter itself also updates its rate. I also tested this without this change and verified that the test fails. <!-- Assuming the worst case, what can be broken when deploying this change to production? --> **Potential risks** <!-- Is this PR a hotfix candidate or require that a notification be sent to the broader community? (Yes/No) --> **Is hotfix candidate?**
What changed?
Why?
More configs
How did you test it?
Existing tests
Potential risks
N/A
Is hotfix candidate?
No