aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error #15891
Comments
Hi, @naseemkullah Thanks for reporting this and suggesting a solution. I presume your hypothesis is that, in some cases, the invalidation happens very fast and the waiter gets created after the invalidation has completed, causing it to wait until the timeout is reached. Is that fair? Also, how easily can you reproduce this issue? Race conditions are usually tricky to test. I would like to get some assurance that the swap will actually fix the issue. |
Hi @otaviomacedo,
Yep, that's right.
Not easily 😞. In fact, it is an intermittent issue that I've observed at the end of our CI/CD pipeline (during deployment) every now and then (rough estimate: 1 in 50). I'm afraid I cannot provide more assurance than the reasoning above. If you don't see any potential issues arising from reversing the order that I may not have thought of, I'll be happy to submit this potential fix. Cheers! |
I think the risk involved in this change is quite low. Please submit the PR and I'll be happy to review it. |
After reading up on the waiter, it appears that it uses a polling mechanism; furthermore, the ID of the invalidation request needs to be passed into it, so all seems well on that front. Not sure why I see these timeouts occasionally 👻 ... but my hypothesis no longer holds, closing. Thanks! edit: re-opened since this is still an issue |
|
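(For context, a minimal sketch of how the boto3 InvalidationCompleted waiter appears to work, assuming the handler uses boto3; the IDs below are placeholders, not values from this issue. The waiter polls GetInvalidation for the given invalidation ID until its status is "Completed", so an invalidation that has already finished should be detected on the first poll regardless of when the waiter object is created.)

# Sketch only: how the boto3 InvalidationCompleted waiter polls.
# distribution_id and invalidation_id are placeholders, not values from this issue.
import boto3

cloudfront = boto3.client('cloudfront')

distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder
invalidation_id = 'I2J0I21PCUYOIK'    # placeholder

# The waiter repeatedly calls GetInvalidation until Invalidation.Status == 'Completed',
# so constructing it before or after create_invalidation() should not change behavior.
waiter = cloudfront.get_waiter('invalidation_completed')
waiter.wait(DistributionId=distribution_id, Id=invalidation_id)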
Reopening because additional customers have been impacted by this issue. @naseemkullah are you still running into this issue? From another customer experiencing the issue:
Customer's code:

new s3deploy.BucketDeployment(this, 'DeployWithInvalidation', {
sources: [s3deploy.Source.asset(`../packages/dist`)],
destinationBucket: bucket,
distribution,
distributionPaths: [`/*`],
retainOnDelete: false,
prune: false,
});
|
@peterwoodworth yes occasionally! I was a little quick to close it once my proposed solution fell through, thanks for reopening. |
In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it's not clear to me what else we can do. In any case, contributions are welcome! |
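(If memory serves, the boto3 InvalidationCompleted waiter defaults to roughly 20-second polls with 30 attempts, i.e. about 10 minutes, which matches the timeout reported here. Below is a hedged sketch, not something the construct currently exposes, of how the wait window could in principle be stretched toward the 15-minute Lambda cap by overriding WaiterConfig in a custom handler; all names are placeholders.)

# Sketch only: extending the waiter's polling window inside a custom handler.
import boto3

cloudfront = boto3.client('cloudfront')

distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder
paths = ['/*']                        # placeholder

invalidation = cloudfront.create_invalidation(
    DistributionId=distribution_id,
    InvalidationBatch={
        'Paths': {'Quantity': len(paths), 'Items': paths},
        'CallerReference': 'deploy-12345',  # placeholder; must be unique per request
    },
)

waiter = cloudfront.get_waiter('invalidation_completed')
waiter.wait(
    DistributionId=distribution_id,
    Id=invalidation['Invalidation']['Id'],
    # Default is roughly Delay=20, MaxAttempts=30 (~10 min); 40 attempts is ~13 min,
    # still under the 15-minute Lambda limit mentioned above.
    WaiterConfig={'Delay': 20, 'MaxAttempts': 40},
)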
It has happened twice in recent days; next time it occurs I will try to confirm this. IIRC, the first time this happened I checked and saw that the invalidation event had occurred almost immediately, yet the waiter did not see that (that's why I thought it might be a race condition). Will confirm though! |
Noticed the same with a client I support over the last few weeks, and it makes us rethink using the BucketDeployment construct overall. I will check any new occurrences and confirm the actual behavior of CloudFront in the background. |
Confirming that in my case the invalidation occurs when it should, but the waiter just never gets the memo and fails the deployment after 10 minutes. |
I can also confirm this issue occurs with some regularity for me... I have a script that deploys the same stack to 29 different accounts - with a deploy I just did, I had 3 of 29 fail with |
Started to see this problem when using s3 bucket deployments with CDK |
We started seeing this on 4/19/23, and it is still happening today (4/20). |
This is happening to us frequently now also. |
This is happening more frequently now |
Indeed the cache invalidates on the CloudFront side almost instantly. But the deploy fails and rolls back (the rollback also takes effect on CloudFront immediately, and yet the rollback fails). |
Seeing this now also |
Same here... |
Encountered the same issue. Some action log timestamps: it took 10 minutes for the CDK stack to fail, and the invalidation was created 1 minute after the failure.
|
It seems there's currently a problem in AWS CloudFront; I get the same timeout errors. |
We are also encountering this intermittently in one of our CDK stacks and have noticed it happen more frequently in the last few weeks. When it occurs, the stack initiates a rollback - sometimes this fails (and requires manual intervention) and sometimes the rollback succeeds. Any update/workaround would be appreciated. |
Started seeing this regularly today as well |
After not encountering this problem for a while, we're now also having this issue again. Luckily, it happened in our dev account, but I'm hesitant about deploying it to production. |
We are also experiencing this issue. The Lambda successfully uploads all the files to S3; however, it does not complete and results in a timeout error. The other strange thing: for the latest Lambda invocations that have timed out, I don't see any cache invalidation in the CloudFront distribution. These are the Lambda logs:
|
Just hit this. CDK deployment. RequestId: 552880ea-f37b-4b8b-8cc8-3772e52e4cd3 |
Still happening in 2024.... |
Same here. What can you guys propose to prevent such issues in production pipelines? |
Add retry logic.
|
Hey man. What do you mean? Where? The CloudFormation deployment fails with the state "UPDATE_ROLLBACK_FAILED". All we can do is wait and then do "continue update rollback" in the UI. (I guess there must be an API command for that.) Why doesn't AWS add the retry? We are using the standard CDK lib - its core building blocks must just work, right? We enjoy having the cloud and application code in the same language in the same monorepo (TypeScript + CDK + Nx in our case), but such problems make us think about migrating to Terraform. |
My team chose to swap back to Terraform because CloudFormation is... not great. But we would basically run the CDK stack twice if it failed the first time. It wasn't a good experience for us. Mind you, this issue is not unique to CDK. This is an issue with CloudFront at the end of the day.
|
We no longer experience this issue after increasing the memory limit of the bucket deployment:

new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048
})

The default memory limit is 128. (docs) |
I'm pretty sure that's coincidental. First of all, it is VERY random; it can easily be 30-40 deployments between occurrences, then suddenly happen multiple times within a few days. Second, the issue seems to be the CloudFront API itself timing out or taking so long that the bucket deployment times out. The only pattern I've seen is that it seems to happen more often if we deploy at the end of the day (CET). |
Somewhere in the last 2 years the devs said this was an issue internal to CloudFront and they were working with that team on it. That was a long time ago. Abandon all hope, ye who enter here. |
Still facing this issue in June 2024. |
I have also seen the issue. I see 2 completed invalidations: the 1st at the time the deploy starts and the 2nd at the time of the rollback (which succeeds). |
I've been getting notifications for this issue since 2021; I wouldn't hold your breath - perhaps implement retry logic. |
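(For anyone who wants retry logic without touching the construct, here is a sketch of a wrapper script. cdk deploy and aws cloudformation continue-update-rollback are real commands; the stack name and attempt counts are placeholders, and in practice you would also want to poll the stack status until the rollback finishes before redeploying.)

# Sketch only: retry a cdk deploy, resuming a stuck rollback between attempts.
import subprocess
import time

STACK_NAME = 'MyWebsiteStack'  # placeholder
MAX_ATTEMPTS = 3

for attempt in range(1, MAX_ATTEMPTS + 1):
    result = subprocess.run(
        ['npx', 'cdk', 'deploy', STACK_NAME, '--require-approval', 'never']
    )
    if result.returncode == 0:
        break
    # If the stack is stuck in UPDATE_ROLLBACK_FAILED, resume the rollback
    # (the API equivalent of "continue update rollback" in the console).
    subprocess.run(
        ['aws', 'cloudformation', 'continue-update-rollback',
         '--stack-name', STACK_NAME]
    )
    time.sleep(60)  # crude pause; ideally poll the stack status instead
else:
    raise SystemExit(f'cdk deploy failed after {MAX_ATTEMPTS} attempts')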
aws-cdk/packages/@aws-cdk/aws-s3-deployment/lib/lambda/index.py
Lines 150 to 163 in beb01b5
I've come across a deployment where CloudFront was invalidated but the Lambda timed out with
cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded
I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.
edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid, see #15891 (comment)