Issue with stop/start container on WS2k19 #1822

Open
dardelean opened this issue Jun 21, 2023 · 2 comments

Comments

@dardelean

Containers (process or Hyper-V isolation) fail to start after being stopped, and fail to restart, on WS2k19. The issue is easy to reproduce on a standard WS2k19 deployment with nerdctl and containerd (v1.7.0-339-g87dbdd2ca). This is the latest version of containerd as of today (07.06.2023), but the issue also reproduces on older versions.

The specific error is:
errors: failed to create shim task: hcs::CreateComputeSystem 7741aa979c8a1ef17659b625d73418b28421be780e848e12d82edd5c6b76312e: The requested operation for attach namespace failed.: unknown"
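For anyone triaging logs for the same failure, here is a minimal sketch of pulling the HCS operation, container ID, and message out of an error line like the one above. The regular expression is an assumption inferred from this single quoted line, and the helper names `parse_hcs_error` and `is_attach_namespace_failure` are hypothetical, not part of containerd, hcsshim, or nerdctl; other HCS errors may be formatted differently.

```python
import re

# Pattern inferred from the one error line quoted in this report (assumption):
#   hcs::<Operation> <hex container id>: <message>.: unknown
HCS_ERROR = re.compile(
    r"hcs::(?P<op>\w+) (?P<id>[0-9a-f]+): (?P<msg>.+?)\.?: unknown"
)


def parse_hcs_error(line):
    """Return the operation, container ID, and message, or None if no match."""
    match = HCS_ERROR.search(line)
    if match is None:
        return None
    return {
        "operation": match.group("op"),
        "container_id": match.group("id"),
        "message": match.group("msg"),
    }


def is_attach_namespace_failure(line):
    """True if the line looks like the attach-namespace failure reported here."""
    parsed = parse_hcs_error(line)
    return parsed is not None and "attach namespace failed" in parsed["message"]


if __name__ == "__main__":
    sample = (
        "errors: failed to create shim task: hcs::CreateComputeSystem "
        "7741aa979c8a1ef17659b625d73418b28421be780e848e12d82edd5c6b76312e: "
        "The requested operation for attach namespace failed.: unknown"
    )
    print(parse_hcs_error(sample))
    print(is_attach_namespace_failure(sample))
```

Filtering on the message rather than on the container ID makes it possible to count how often this particular failure occurs across many containers in a log.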

This is how the Cirrus CI uses WS2k19:
https://github.com/containerd/nerdctl/blob/main/.cirrus.yml#L26

It uses an image built on top of "windows-2019-core-for-containers":
https://github.com/cirruslabs/vm-images/blob/master/googlecompute/windows_images.json#L8

And this is how the image is configured:
https://github.com/containerd/nerdctl/blob/main/hack/configure-windows-ci.ps1

We saw that if we remove the endpoint while the container is stopped, the container starts successfully, but it then has no network endpoint. We suspect the issue lies there. containerd and the shim send correct information to HCS: while debugging, we compared the Go structures with those from a WS2k22 deployment, which works. One thing we did not understand was the endpoint states, for example state 4 (seen after the container failed to start).

@acobaugh

I'm seeing the exact same thing with:

  • datadog agent: gcr.io/datadoghq/agent:7.43.1
  • containerd 1.6.6
  • EKS 1.24 v1.24.13-eks-0a21954
  • Host AMI: Windows_Server-2019-English-Core-EKS_Optimized-1.24-2023.06.14, Windows Server 2019 Datacenter 10.0.17763.4499

I did not see this on dockerd and EKS 1.23.

Every once in a while, a container will start up just fine.

Other containers start fine on these hosts; it just seems to be this datadog agent image that consistently fails to start with this error.

@jterry75
Contributor

AttachNamespace is a networking failure. @kevpar - Could you add the right people for that? I don't remember whether networking should be here or on WinContainers.


3 participants