insufficient capacity API possibly errors not propagating into logs and events #2154
Comments
Hey @joemiller, thank you for this feedback. Would you mind sharing your version of the EBS CSI Driver, and any full controller logs that you still have? The error you show is a retry rate-limit error, which might have obscured the underlying EBS error. In either case, yes, you should have seen a clear error message pointing to insufficient volume capacity in the controller logs, and we will work to fix this. Thanks!
@AndrewSirenko Which class of errors are considered retryable by our SDK client? The official docs describe the default retry behavior, but I'm thinking it would make more sense to error out without retrying at the SDK level and let the exponential backoff mechanism implemented in the provisioner do the needful. I would love to hear your thoughts on this.
We use the defaults set by the EC2 SDK team for which errors are considered terminal. This info can be found in the retryer package's documentation and code; as of v1.35.0 we trust those defaults set by the EC2 team. However, let's confirm the version of the driver, because our retry mechanisms have gone through three changes (SDK v1, SDK v2, adaptive retry). Regardless of retries or terminal errors, real EBS error messages should be propagated to the controller logs, which according to the issue might not have occurred. Here is the link to the Go AWS SDK v2 retry module docs: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/aws/retry (The retryable errors should be clearly listed in the AWS Go SDK v2 repository as well if you grep for the error. I can do this once I'm back at my workstation.)
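For illustration, a minimal sketch (assuming aws-sdk-go-v2 and its config package; this is not the driver's actual wiring) of selecting the standard vs. adaptive retry mode mentioned above:

```go
// Sketch: choosing the SDK retry mode when building an EC2 client.
// RetryModeStandard uses the plain SDK defaults; RetryModeAdaptive layers
// client-side throttling on top of them.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		// Switch to aws.RetryModeStandard to drop the adaptive throttling.
		config.WithRetryMode(aws.RetryModeAdaptive),
		config.WithRetryMaxAttempts(5),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = ec2.NewFromConfig(cfg)
}
```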
I'm not sure I follow the logic for why retrying at the SDK level would be a problem. Regardless, I don't think this is super relevant to the issue reported (just pretend the error that occurred is one we definitely want to retry). I think one of two things (or maybe both) likely happened:
@joemiller, if you could confirm which version of the EBS CSI driver you're using, as @AndrewSirenko asked above, that would be great. We revamped the retry logic a few months ago, which should hopefully reduce the spam of rate-limit-exceeded errors by doing client-side throttling when we detect them.
Perfect, this answers my question @AndrewSirenko. -- Thanks for your input as well, @ConnorJC3:
You're right that retrying is appropriate. My concern isn't about whether to retry, but rather at which level the retry should occur.
I generally agree that we shouldn't deviate from default behavior without good reason. Our use case is arguably different from the average SDK consumer's, though: the average SDK consumer isn't working with multiple layers of retry mechanisms. If we returned the error without retrying at the SDK level, the external provisioner would already retry indefinitely using a backoff strategy, so we wouldn't be hammering the API. That said, I'm not advocating for deviating from default behavior or implementing a custom retrier to make an exception for this class of error. If we think or know that it's fruitful to retry at the SDK level because an ICE event can be resolved before the RPC times out, then I'm cool with the existing logic.
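Purely to illustrate the alternative being discussed (not a proposal), a sketch, assuming aws-sdk-go-v2, of disabling SDK-level retries entirely so the external-provisioner's backoff becomes the only retry layer:

```go
// Sketch: making every SDK attempt terminal so errors surface immediately
// and the CSI sidecar's exponential backoff handles all retrying.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		// aws.NopRetryer never retries; each API call fails on first error.
		config.WithRetryer(func() aws.Retryer { return aws.NopRetryer{} }),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = ec2.NewFromConfig(cfg)
}
```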
+1, it's plausible that the adaptive retrier swallows the initial error.
On 2024-08-30, starting at roughly 1900 UTC and ending many hours later, the use1-az2 AZ was unable to provision new io2 volumes. The CSI driver logs and the events posted back to pending pods were generic, and it was not clear what the error was from them; they looked fairly generic to our responding engineers. It was not until someone thought to try creating a volume manually via the web console that they saw the error message:
"There is currently insufficient capacity to serve the request. Try again at a later time or in a different availability zone"
I don't know if this error was being returned from the AWS API to the CSI driver or not. If so, it would be ideal to have this error propagated back to the pod/pvc events.
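A hedged sketch of what that propagation could look like (this is not the driver's actual code, and the error code string used here is an assumption for illustration): unwrap the EC2 API error and return its code and message in the gRPC status from CreateVolume, so the external-provisioner records it on the PVC as an event.

```go
// Sketch: surfacing the underlying EC2 API error in the CreateVolume
// gRPC status so sidecars can emit it on the PVC/pod events.
package main

import (
	"errors"
	"fmt"

	"github.com/aws/smithy-go"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// wrapCreateVolumeError is a hypothetical helper that converts an SDK error
// into a gRPC status carrying the EC2 error code and message.
func wrapCreateVolumeError(err error) error {
	var apiErr smithy.APIError
	if errors.As(err, &apiErr) {
		code := codes.Internal
		// "InsufficientVolumeCapacity" is an assumed code for the ICE case.
		if apiErr.ErrorCode() == "InsufficientVolumeCapacity" {
			code = codes.ResourceExhausted
		}
		return status.Error(code, fmt.Sprintf("CreateVolume failed: %s: %s",
			apiErr.ErrorCode(), apiErr.ErrorMessage()))
	}
	return status.Error(codes.Internal, err.Error())
}

func main() {
	// Usage with a stand-in error; in the driver this would wrap the error
	// returned by the EC2 CreateVolume call.
	fmt.Println(wrapCreateVolumeError(errors.New("some EC2 failure")))
}
```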