gVisor on GCP with gVNIC has long epoll_wait() delays when sending HTTP data #9816
Comments
I am able to repro with docker using the …

Running with the default runtime is fast:
I tried upgrading to the latest gVNIC driver release, 1.3.4, but saw no improvement.
Thanks for the detailed reports! I was able to repro on my own machine. I'll let you know when I've found a root cause.
Hi, it seems like we aren't handling the gVNIC hardware properly when setting up our fdbased link endpoints. A temporary workaround is to pass … to runsc.
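(For reference, runsc flags are usually supplied through `runtimeArgs` in Docker's `/etc/docker/daemon.json`. The specific flag named in this comment was lost in this copy, so the entry below is only a placeholder, not the actual flag:)

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--<workaround-flag>"]
    }
  }
}
```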
Hi @manninglucas — just curious if you had an update on this? We haven't passed the … yet, and a lot of ML workloads have large payloads.
Awesome, thank you!
…gVNIC). This change is mostly just plumbing configuration through the network stack. This change will ensure that future unrecognized host network drivers will not break GSO. It also makes it easier to add support for future network drivers. Fixes issue #9816 PiperOrigin-RevId: 592335983
Hi @ekzhang, we've been trying to get this to work, but unfortunately we suspect getting gVNIC to work properly with gVisor will require specific hardware integrations that will take longer than originally anticipated. In the meantime you can try using our software GSO.
Thank you! We'll try using the software GSO option.
Thank you! Will try soon. |
Hi @manninglucas — the latency looks better, but I can still reproduce the issue in about half of the runs of the script from the first comment of this issue. It does not reproduce with runc. Half the time I get latencies of 0.4 seconds, which matches runc; the other half I get latencies between 4 and 9 seconds.
Here are the latencies for 20 consecutive runs of the reproduction script at the beginning of the issue, from an n2-standard-8 GCP instance in us-central1 (this is on …):

Here are the timings for runc, same parameters, PUT 10 MB:

And here are the timings with …:
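(For reference, a sketch of how such a series could be collected; this is not the author's script, and the `repro` image tag is hypothetical:)

```sh
# Run the repro 20 times; main.py prints the elapsed time of each PUT.
for i in $(seq 1 20); do
  docker run --runtime=runsc --rm repro
done
```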
@ekzhang internal testing showed that adding some extra padding to the maximum GSO packet size resolves the issue. Let me know if you still see any slowdowns and I can reopen.
Thank you for the internal testing! I'll try it out and confirm on our end. |
Confirmed, thank you very much :)
Very interesting comment!
Description
gVisor containers have unusually slow outbound network performance on newer Google Compute Engine machines, taking an average of 40 seconds to send 10 MiB of data over an HTTP request (roughly 2 Mbit/s); it usually takes under 1 second.
This only happens when using user-space asynchronous I/O libraries (which drive the socket with `sendto` and `epoll_wait`), and only on GCP instances with the gVNIC network driver. When testing on Ubuntu 20.04 on GCP with the legacy VirtIO driver, the issue does not occur. It is also not reproducible when using synchronous I/O (the `write` system call or the `curl` command), when running the code outside of gVisor, or when running gVisor on AWS instances. From my tests, the network delays only happen with the combination of:

- gVisor (runsc) as the container runtime,
- a GCP instance using the gVNIC network driver, and
- asynchronous (epoll-based) socket I/O on the sending side.

Steps to reproduce
We have been reproducing this by sending a PUT request to a presigned PutObject URL for Google Cloud Storage.
First, create an instance in GCP Compute Engine. We can reproduce with an `n2-standard-8` instance with the "gVNIC" NIC type selected in the advanced networking options; it is not reproducible with the VirtIO NIC type. Set the boot disk to Ubuntu 20.04.

Create a Docker image with the following Dockerfile:
And for `main.py`, use this program (filling in `url` with some external URL that will accept the request):
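(The original program was also lost in this copy; here is a minimal sketch consistent with the description — async I/O, a 10 MiB PUT, and a printed timing — with `httpx` again assumed:)

```python
# Hedged reconstruction -- the original main.py was lost.
import asyncio
import time

import httpx  # assumed async HTTP client

# Fill in with a presigned PutObject URL (e.g. for Google Cloud Storage).
url = "https://storage.googleapis.com/..."

async def main() -> None:
    data = b"\0" * (10 * 1024 * 1024)  # 10 MiB payload
    async with httpx.AsyncClient() as client:
        start = time.monotonic()
        resp = await client.put(url, content=data)
        print(f"status={resp.status_code} elapsed={time.monotonic() - start:.2f}s")

asyncio.run(main())
```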
Then use gVisor as a container runtime and run `python main.py`, and compare against runc:
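(A sketch of the comparison; the image tag `repro` is arbitrary, not from the original issue:)

```sh
docker build -t repro .
docker run --runtime=runsc --rm repro   # gVisor: slow on gVNIC instances
docker run --rm repro                   # default runtime (runc): fast
```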
When running the same file on the host directly (not inside a container runtime), the transfer is also fast.
runsc version
docker version (if using docker)
uname
Linux eric-temp-testbed3-dec13 5.15.0-1047-gcp #55~20.04.1-Ubuntu SMP Wed Nov 15 11:38:25 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
The relevant part of the debug logs, when running with strace, shows 5-6 second gaps on some `epoll_wait()` system calls:

Debug logs with strace
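(For reference, runsc can produce such logs with its debugging flags; these are usually set via `runtimeArgs` in `/etc/docker/daemon.json`, shown here on a bare invocation for illustration. The container ID and log path are examples:)

```sh
runsc --debug --strace --debug-log=/tmp/runsc/ run mycontainer
```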