aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error #15891
Comments
Hi, @naseemkullah Thanks for reporting this and suggesting a solution. I presume your hypothesis is that, in some cases, the invalidation happens very fast and the waiter gets created after the invalidation has completed, causing it to wait until the timeout is reached. Is that fair? Also, how easily can you reproduce this issue? Race conditions are usually tricky to test. I would like to get some assurance that the swap will actually fix the issue. |
Hi @otaviomacedo,
Yep, that's right.
Not easily 😞. In fact, it is an intermittent issue that I've observed at the end of our CI/CD pipeline (during deployment) every now and then (rough estimate: 1 in 50). I'm afraid I cannot provide more assurance than the reasoning above. If you don't see any potential issues arising from reversing the order that I may not have thought of, I'll be happy to submit this potential fix. Cheers! |
I think the risk involved in this change is quite low. Please submit the PR and I'll be happy to review it. |
After reading up on the waiter, it appears that it uses a polling mechanism; furthermore, the ID of the invalidation request needs to be passed into it, so all seems well on that front. Not sure why I see these timeouts occasionally 👻 ... but my hypothesis no longer holds, closing. Thanks! edit: re-opened since this is still an issue |
|
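(For context, a minimal sketch of how the boto3 InvalidationCompleted waiter appears to work, assuming the handler uses boto3; the IDs below are placeholders, not values from this issue. The waiter polls GetInvalidation for the given invalidation ID until its status is "Completed", so an invalidation that has already finished should be detected on the first poll regardless of when the waiter object is created.)

# Sketch only: how the boto3 InvalidationCompleted waiter polls.
# distribution_id and invalidation_id are placeholders, not values from this issue.
import boto3

cloudfront = boto3.client('cloudfront')

distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder
invalidation_id = 'I2J0I21PCUYOIK'    # placeholder

# The waiter repeatedly calls GetInvalidation until Invalidation.Status == 'Completed',
# so constructing it before or after create_invalidation() should not change behavior.
waiter = cloudfront.get_waiter('invalidation_completed')
waiter.wait(DistributionId=distribution_id, Id=invalidation_id)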
Reopening because additional customers have been impacted by this issue. @naseemkullah are you still running into this issue? From another customer experiencing the issue:
Customer's code:

new s3deploy.BucketDeployment(this, 'DeployWithInvalidation', {
sources: [s3deploy.Source.asset(`../packages/dist`)],
destinationBucket: bucket,
distribution,
distributionPaths: [`/*`],
retainOnDelete: false,
prune: false,
});
|
@peterwoodworth yes occasionally! I was a little quick to close it once my proposed solution fell through, thanks for reopening. |
In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it's not clear to me what else we can do. In any case, contributions are welcome! |
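(If memory serves, the boto3 InvalidationCompleted waiter defaults to roughly 20-second polls with 30 attempts, i.e. about 10 minutes, which matches the timeout reported here. Below is a hedged sketch, not something the construct currently exposes, of how the wait window could in principle be stretched toward the 15-minute Lambda cap by overriding WaiterConfig in a custom handler; all names are placeholders.)

# Sketch only: extending the waiter's polling window inside a custom handler.
import boto3

cloudfront = boto3.client('cloudfront')

distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder
paths = ['/*']                        # placeholder

invalidation = cloudfront.create_invalidation(
    DistributionId=distribution_id,
    InvalidationBatch={
        'Paths': {'Quantity': len(paths), 'Items': paths},
        'CallerReference': 'deploy-12345',  # placeholder; must be unique per request
    },
)

waiter = cloudfront.get_waiter('invalidation_completed')
waiter.wait(
    DistributionId=distribution_id,
    Id=invalidation['Invalidation']['Id'],
    # Default is roughly Delay=20, MaxAttempts=30 (~10 min); 40 attempts is ~13 min,
    # still under the 15-minute Lambda limit mentioned above.
    WaiterConfig={'Delay': 20, 'MaxAttempts': 40},
)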
It has happened twice in recent days; next time it occurs I will try to confirm this. IIRC, the first time this happened I checked and saw that the invalidation event had occurred almost immediately, yet the waiter did not see that (that's why I thought it might be a race condition). Will confirm though! |
Noticed the same with a client I support over the last few weeks, and it makes us rethink using the BucketDeployment construct overall. I will check any new occurrences and confirm the actual behavior of CloudFront in the background. |
Confirming that in my case the invalidation occurs when it should, but the waiter just never gets the memo and fails the deployment after 10 minutes. |
I can also confirm this issue occurs with some regularity for me... I have a script that deploys the same stack to 29 different accounts - with a deploy I just did, I had 3 of 29 fail with |
Started to see this problem when using s3 bucket deployments with CDK |
We started seeing this on 4/19/23, and it is still happening today (4/20). |
This is happening to us frequently now also. |
This is happening more frequently now |
Indeed the cache invalidates on the CloudFront side almost instantly. But the deploy fails and rolls back (the rollback also takes effect on CloudFront immediately, and yet the rollback fails). |
Seeing this now also |
Same here... |
Encountered the same issue. Some action log timestamps: it took 10 minutes for the CDK stack to fail, and the invalidation was created 1 minute after the failure.
|
It seems there's currently a problem in AWS CloudFront; I get the same timeout errors. |
We are also encountering this intermittently in one of our CDK stacks and have noticed it happen more frequently in the last few weeks. When it occurs, the stack initiates a rollback - sometimes this fails (and requires manual intervention) and sometimes the rollback succeeds. Any update/workaround would be appreciated. |
Started seeing this regularly today as well |
After not encountering this problem for a while, we're now also having this issue again. Luckily, it happened in our dev account, but I'm hesitant about deploying it to production. |
We are also experiencing this issue. The Lambda successfully uploads all the files to S3; however, it does not complete and results in a timeout error. The other strange thing: for the latest Lambda invocations that have timed out, I don't see any cache invalidation in the CloudFront distribution. These are the Lambda logs:
|
Just hit this. CDK deployment. RequestId: 552880ea-f37b-4b8b-8cc8-3772e52e4cd3 |
Still happening in 2024.... |
Same here. What can you guys propose to prevent such issues in production pipelines? |
Add retry logic.
|
Hey man. What do you mean? Where? The CloudFormation deployment fails with the state "UPDATE_ROLLBACK_FAILED". All we can do is wait and then do "continue update rollback" in the UI. (I guess there must be an API command for that.) Why doesn't AWS add the retry? We are using the standard CDK lib - its core building blocks must just work, right? We enjoy having the cloud and application code in the same language in the same monorepo (TypeScript + CDK + Nx in our case), but such problems make us think about migrating to Terraform. |
My team chose to swap back to Terraform because CloudFormation is... not great. But we would basically run the CDK stack twice if it failed the first time. It wasn't a good experience for us. Mind you, this issue is not unique to CDK. This is an issue with CloudFront at the end of the day.
|
We no longer experience this issue after increasing the memory limit of the bucket deployment:

new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048
})

The default memory limit is 128. (docs) |
I'm pretty sure that's coincidental. First of all, it is VERY random; it can easily be 30-40 deployments between occurrences, then suddenly happen multiple times within a few days. Second, the issue seems to be the CloudFront API itself timing out or taking so long that the bucket deployment times out. The only pattern I've seen is that it seems to happen more often if we deploy at the end of the day (CET). |
Somewhere in the last 2 years the devs said this was an issue internal to CloudFront and they were working with that team on it. That was a long time ago. Abandon all hope, ye who enter here. |
Still facing this issue in June 2024. |
I have also seen the issue. I see 2 completed invalidations: the 1st at the time the deploy starts and the 2nd at the time of the rollback (which succeeds). |
I've been getting notifications for this issue since 2021; I wouldn't hold your breath - perhaps implement retry logic. |
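(For anyone who wants retry logic without touching the construct, here is a sketch of a wrapper script. cdk deploy and aws cloudformation continue-update-rollback are real commands; the stack name and attempt counts are placeholders, and in practice you would also want to poll the stack status until the rollback finishes before redeploying.)

# Sketch only: retry a cdk deploy, resuming a stuck rollback between attempts.
import subprocess
import time

STACK_NAME = 'MyWebsiteStack'  # placeholder
MAX_ATTEMPTS = 3

for attempt in range(1, MAX_ATTEMPTS + 1):
    result = subprocess.run(
        ['npx', 'cdk', 'deploy', STACK_NAME, '--require-approval', 'never']
    )
    if result.returncode == 0:
        break
    # If the stack is stuck in UPDATE_ROLLBACK_FAILED, resume the rollback
    # (the API equivalent of "continue update rollback" in the console).
    subprocess.run(
        ['aws', 'cloudformation', 'continue-update-rollback',
         '--stack-name', STACK_NAME]
    )
    time.sleep(60)  # crude pause; ideally poll the stack status instead
else:
    raise SystemExit(f'cdk deploy failed after {MAX_ATTEMPTS} attempts')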
aws-cdk/packages/@aws-cdk/aws-s3-deployment/lib/lambda/index.py
Lines 150 to 163 in beb01b5
I've come across a deployment where CloudFront was invalidated but the Lambda timed out with
cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded
I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.
edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid, see #15891 (comment)