Context deadline exceeded on VMs with large disks #1134

Closed
morganhowarth-fd opened this issue Jul 13, 2020 · 11 comments
Labels
acknowledged Status: Issue or Pull Request Acknowledged bug Type: Bug size/s Relative Sizing: Small

Comments

@morganhowarth-fd

Terraform Version

0.12.28

vSphere Provider Version

v1.18.1

Affected Resource(s)

vsphere_virtual_machine

Terraform Configuration Files

https://gist.github.com/morganhowarth-fd/aed994cf9c2ff1c155deb86e02ce2104

Expected Behavior

Terraform should be able to create the VM with a large disk without failing.

Actual Behavior

We have multiple database servers with 200GB thick-provisioned eager-zeroed secondary disks which deploy fine, and a few with 400GB thick-provisioned eager-zeroed secondary disks which have an issue as described below.

When creating a server from a template with a 400GB+ thick-provisioned eager-zeroed disk, the clone takes some time. After exactly 5 minutes, while the template is still cloning, another VMware job appears which deletes the virtual machine, and Terraform fails with the message:

There was an error performing post-clone changes to virtual machine "foo": error reconfiguring virtual machine: Post https://VCENTER_SERVER/sdk: context deadline exceeded

It looks like the exact same issue as #641 which apparently was fixed by #792 but we still have the issue.

I've tried setting wait_for_guest_net_timeout to a higher value, and likewise vim_keep_alive, but to no avail.

A workaround for us is to deploy the server with a smaller thick-provisioned disk and then increase it to the desired size.

Steps to Reproduce

YMMV, you may need a larger disk if your clone job completes faster than 5 minutes.

  1. terraform apply with the TF config for a server with a large (400GB) thick-provisioned eager-zeroed secondary disk.
  2. Wait 5 minutes.
  3. Observe vCenter with a task in the queue to delete the virtual machine you're deploying.
  4. Watch TF fail after the clone job completes with:

There was an error performing post-clone changes to virtual machine "foo": error reconfiguring virtual machine: Post https://VCENTER_SERVER/sdk: context deadline exceeded

Important Factoids

We're running ESXi/vCenter 6.7 on hyper-converged infrastructure. Large storage writes, especially with thick-provisioned eager-zeroed disks, take a while due to storage replication across the nodes.

References

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@morganhowarth-fd morganhowarth-fd added the bug Type: Bug label Jul 13, 2020
@bill-rich bill-rich added size/s Relative Sizing: Small acknowledged Status: Issue or Pull Request Acknowledged labels Jul 14, 2020
@gosarami

gosarami commented Jan 27, 2021

I am experiencing the same problem.

From my research, I suspect this happens because the plugin always uses provider.DefaultAPITimeout for the context in the following code that customizes the VM (sorry if I'm wrong).

if err := virtualmachine.Customize(vm, custSpec); err != nil {

ctx, cancel := context.WithTimeout(context.Background(), provider.DefaultAPITimeout)

To solve this, I think it needs to be modified to allow passing a timeout parameter such as api_timeout as an argument.

Please let me know your opinion.

@greeneg

greeneg commented Apr 14, 2021

This is also occurring on 0.13.5.

We have VMs that by policy need to be built with thick, eager-zeroed VMDKs. When a Windows SQL instance is being built, we have a number of disks added, along with the main OS drive, which takes a while for vCenter to build out. On average, most of the SQL instances have around 3.5TB of disk overall, which can take a fair amount of time to complete.

For now, we're building the instances as thin provisioned to work around this; however, this is a violation of our internal policy that we would like resolved at the provisioning stage.

@Panplumousse

Panplumousse commented Apr 15, 2021

Hello,
We use version 1.25.0 of the vsphere provider and are experiencing exactly the same problem.
After 5 minutes of reconfiguring the virtual machine, one more vSphere job appears and deletes the VM.

Terraform version:
0.11.11
We added the api_timeout option to the vsphere provider, but nothing changed.

With the error:
There was an error performing post-clone changes to virtual machine "foo": error reconfiguring virtual machine: Post https://VCENTER_SERVER/sdk: context deadline exceeded

@greeneg

greeneg commented May 11, 2021

This bug is impacting the following issue entries:

#1238
#1335
#641
#790
#1401
#1387

@KenzoB73

KenzoB73 commented Aug 9, 2021

This issue is over a year old. Is there any update on this?

@CollinLeishman

This is also affecting me. Any help or update on this would be very much appreciated!

@CollinLeishman

@KenzoB73 Is this still affecting you?

@KenzoB73

KenzoB73 commented Dec 1, 2021 via email

@tenthirtyam
Collaborator

tenthirtyam commented Dec 2, 2021

v2.0.0 added the api_timeout option to the provider configuration via #1405.

api_timeout - (Optional) Sets the number of minutes to wait for operations to complete. The default timeout is 5 minutes. Currently it will override the timeout for all VM creation operations.

Example:

terraform {
  required_providers {
    vsphere = {
      source  = "hashicorp/vsphere"
      version = ">= 2.0.0"
    }
  }
  required_version = ">= 1.0.0"
}

provider "vsphere" {
  vsphere_server       = "sfo-m01-vc01.rainpole.io"
  user                 = "svc-terraform-vsphere@rainpole.io"
  password             = "***********"
  allow_unverified_ssl = false
  api_timeout          = 30 // Example. Default 5.
}

@morganhowarth-fd - have you tried this version or higher with this provider configuration?

Ryan

@tenthirtyam
Collaborator

Resolved in #1405 with the introduction of the defaultAPITimeout configuration for the provider.

Marking this issue as closed. If this issue reappears with the latest version of the provider, please create a new issue linking back to this one for added context.

Ryan Johnson
Staff II Solutions Architect
Cloud Infrastructure Business Group, VMware

@github-actions

github-actions bot commented Mar 8, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 8, 2022