Use large MTU by default for pasta-backed rootless custom networks #23883

Open
dgibson opened this issue Sep 6, 2024 · 6 comments
Labels
jira, kind/feature (Categorizes issue or PR as related to a new feature), network (Networking related issue or feature), pasta (pasta(1) bugs or features)

Comments

dgibson (Collaborator) commented Sep 6, 2024

Feature request description

By default pasta uses an MTU of 65520 bytes in the containers it backs. This is an important strategy to improve TCP throughput and reduce CPU load by cutting the number of system calls. pasta is able to coalesce individual TCP packets, allowing it to take advantage of the large local MTU even if the full path has a lower MTU (which will be typical across the internet).

However, when pasta is used for a rootless custom network, a Linux bridge sits between pasta and the container(s) it's supporting. Unless overridden in the podman configuration, this bridge will have the default MTU of 1500, negating pasta's performance strategy.

Increasing the MTU of the custom network (e.g. with podman network create -o mtu=65520) can significantly improve performance in some situations.
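For illustration, something along these lines raises the MTU on a custom network (the network name is just an example, and the interface inside the container should then report the larger MTU):

```console
# create a rootless custom network with a large MTU
$ podman network create -o mtu=65520 bignet

# check the interface MTU from inside a container on that network
$ podman run --rm --network bignet alpine ip link show eth0
```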

Suggest potential solution

When creating a custom network which uses pasta for external connectivity, podman should default to configuring an MTU of 65520.

This won't help in all cases: if traffic is not coming directly from the container, but from (for example) a tunnel running in the container, the TCP MSS will still be constrained by the tunnel's MTU. Nonetheless a different default will help the common case of TCP traffic originating directly in the container.

Have you considered any alternatives?

The end user can, of course, manually set a large MTU, but that's extra inconvenience.

While we do, of course, endeavour to keep pasta's performance good even with smaller MTUs, the large-MTU strategy is an important tool that it seems unwise to discard.

Additional context

This limitation came to light amidst discussion of a number of issues occurring in this ticket.

dgibson added the kind/feature label Sep 6, 2024
dgibson (Collaborator, Author) commented Sep 6, 2024

/cc @Luap99

Luap99 (Member) commented Sep 6, 2024

While injecting mtu=65520 as an option is easy, it will break consumers that inspect the networks and expect certain options to be (or not be) there... Doing something like this, we broke docker-compose in the past, as it then thinks it always has to recreate the network. I think the podman ansible roles behave similarly, so just adding the option there by default is not good.

Netavark itself doesn't set an MTU unless it was set in the network config file, and the kernel defaults to 1500 AFAIK. This already causes problems for some users (#20009). Therefore adding a generic default_mtu option sounds reasonable to me to solve this, and then we could have rootless default to 65520.
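For reference, when the option is set explicitly today it ends up in the on-disk network config (and therefore in podman network inspect) roughly like this, trimmed to the relevant fields with illustrative values:

```json
{
  "name": "bignet",
  "driver": "bridge",
  "options": {
    "mtu": "65520"
  }
}
```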

But if we do not want to add it to the network config used by inspect/on disk, then we either have to add a new option to netavark to send the default MTU there, or, before sending the network config to netavark, copy the config, add the mtu option, and then send it to netavark in c/common/libnetwork. As the latter doesn't require netavark changes, it seems like the easier option.
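A very rough sketch of that second variant, with made-up types standing in for the real c/common/libnetwork ones, just to show where the default would be injected without ever touching the stored config:

```go
// Package mtudefault is an illustrative sketch only; the real
// c/common/libnetwork types and call sites differ.
package mtudefault

// network is a stand-in for the real network config type.
type network struct {
	Name    string
	Driver  string
	Options map[string]string
}

// withDefaultMTU returns a copy of the config with defaultMTU injected,
// unless the user already set one. The on-disk config (and therefore
// `podman network inspect`) stays untouched.
func withDefaultMTU(nw network, defaultMTU string) network {
	out := nw
	out.Options = make(map[string]string, len(nw.Options)+1)
	for k, v := range nw.Options {
		out.Options[k] = v
	}
	if _, ok := out.Options["mtu"]; !ok {
		out.Options["mtu"] = defaultMTU
	}
	return out
}
```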

Also, there is the question of slirp4netns support. Our slirp4netns code also defaults to 65520, so I guess we would not have to differentiate and could just use the same default there. And we have to consider backwards compatibility as well: if a user today has MTU 1500 configured for pasta or for slirp4netns and we then default to a higher MTU for the bridge networks all of a sudden, it may negatively impact them.
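For context (assuming I have the syntax right), this is roughly how a user can pin a per-container MTU today: slirp4netns takes an mtu= network option, while pasta takes its own -m flag through the pass-through arguments after "pasta:".

```console
# slirp4netns: mtu= is a network option understood by podman
$ podman run --rm --network slirp4netns:mtu=1500 alpine ip link show

# pasta: arguments after "pasta:" are passed to the pasta binary
$ podman run --rm --network pasta:-m,1500 alpine ip link show
```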


Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

Luap99 added the network label Sep 6, 2024
sbrivio-rh (Collaborator) commented

> Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

Third table ("pasta: connections/traffic via tap") here: https://passt.top/passt/about/#performance_1. Those figures are taken, however, with CPU-bound pasta, at much higher transfer rates than what you'd find in common use cases.

The effect on reported CPU load with lower transfer rates is more dramatic than that because CPU load scales much less than linearly. If there's less data available to transfer at a time, we'll use (many) more CPU cycles per byte. But I don't have hard numbers here, yet.

dgibson (Collaborator, Author) commented Sep 9, 2024

/cc @Luap99

> While injecting mtu=65520 as an option is easy, it will break consumers that inspect the networks and expect certain options to be (or not be) there... Doing something like this, we broke docker-compose in the past, as it then thinks it always has to recreate the network. I think the podman ansible roles behave similarly, so just adding the option there by default is not good.

Heh.

> Netavark itself doesn't set an MTU unless it was set in the network config file, and the kernel defaults to 1500 AFAIK.

Kernel defaults will depend on the exact interface types, but typically it will be 1500, yes.

> This already causes problems for some users (#20009). Therefore adding a generic default_mtu option sounds reasonable to me to solve this, and then we could have rootless default to 65520.

Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.

> But if we do not want to add it to the network config used by inspect/on disk, then we either have to add a new option to netavark to send the default MTU there, or, before sending the network config to netavark, copy the config, add the mtu option, and then send it to netavark in c/common/libnetwork. As the latter doesn't require netavark changes, it seems like the easier option.

> Also, there is the question of slirp4netns support. Our slirp4netns code also defaults to 65520, so I guess we would not have to differentiate and could just use the same default there.

I'm guessing you mean it defaults to that MTU with slirp4netns itself? Presumably slirp4netns combined with a custom network will hit the same issue as I'm describing here.

> And we have to consider backwards compatibility as well: if a user today has MTU 1500 configured for pasta or for slirp4netns and we then default to a higher MTU for the bridge networks all of a sudden, it may negatively impact them.

Hard to see how, but yes, that's possible in principle.

> Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

It's only a single data point, but a real user reports a fairly noticeable difference here.

Luap99 (Member) commented Sep 9, 2024

> Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.

Basically it comes down to not showing the mtu option when you do podman network inspect. Ansible and docker-compose will only recreate resources when something was changed in the config files (ansible calls this idempotency). So if we always add mtu to the options and show it in inspect, then the next time the tool runs it thinks the user changed settings (because mtu is not in their config) and has to recreate the network, and thus all containers depending on it, which is not wanted. And for docker-compose at least, we cannot even tell the tool to handle this special case, as they only target docker. So podman must behave like docker at the compat API level.

With the default_mtu option in containers.conf, I would not add the option into the actual network config file JSON, thus avoiding the problem of it showing in inspect.
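Roughly like this in containers.conf; the [network] table exists today, while default_mtu is the proposed key and does not exist yet:

```toml
# containers.conf (default_mtu is a proposed, not yet implemented key)
[network]
# Default MTU applied to networks that don't set one explicitly;
# never written into the per-network config JSON, so it does not
# show up in `podman network inspect`.
default_mtu = 65520
```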

dgibson (Collaborator, Author) commented Sep 11, 2024

> Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.

> Basically it comes down to not showing the mtu option when you do podman network inspect. Ansible and docker-compose will only recreate resources when something was changed in the config files (ansible calls this idempotency). So if we always add mtu to the options and show it in inspect, then the next time the tool runs it thinks the user changed settings (because mtu is not in their config) and has to recreate the network, and thus all containers depending on it, which is not wanted. And for docker-compose at least, we cannot even tell the tool to handle this special case, as they only target docker. So podman must behave like docker at the compat API level.

> With the default_mtu option in containers.conf, I would not add the option into the actual network config file JSON, thus avoiding the problem of it showing in inspect.

Ok. Seems like a perfectly reasonable approach from my point of view.
