
Internal macvlan network doesn't work in swarm #2418

Closed
lemrouch opened this issue Jul 16, 2019 · 18 comments · Fixed by moby/moby#40579

@lemrouch
Contributor

When I create an internal macvlan network, it uses a dummy interface despite the -o parent setting.
This means containers connected to such a network can communicate with each other only if they are running on the same node.
I think an internal macvlan network should be able to use the interface defined in the --config-only network.
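
A quick way to see the symptom on a node (just an illustrative check; as far as I can tell libnetwork names its dummy links with a dm- prefix, and the parent here is only an example):

ip -o link show type dummy
# on an affected node this lists a dm-<id> dummy device, and the containers'
# macvlan interfaces are created on top of it instead of the configured
# parent (e.g. enp1s0.50)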

@chiragtayal

@lemrouch: Can you please provide more information, such as the commands you executed?

@lemrouch
Contributor Author

lemrouch commented Aug 13, 2019

Hi,
this might be tricky, since the result of such commands depends on whether you have the patches from #2411 and #2414 applied.

My goal:
We have our network segmented across several VLANs.
I was asked to give a service running on swarm access to a legacy system on one of those VLANs, while only some containers should be allowed to reach it.
The documentation suggests this should be doable with an internal macvlan network.
If you want to use macvlan in swarm you have to use the config-only/config-from setup, so the commands should look like:

  • on node A:
    docker network create --config-only -d macvlan --subnet 10.20.30.0/24 -o parent=enp1s0.50 --ip-range 10.20.30.160/27 config_net
  • on node B:
    docker network create --config-only -d macvlan --subnet 10.20.30.0/24 -o parent=enp1s0.50 --ip-range 10.20.30.192/27 config_net
  • on node C:
    docker network create --config-only -d macvlan --subnet 10.20.30.0/24 -o parent=enp1s0.50 --ip-range 10.20.30.224/27 config_net
  • on swarm manager:
    docker network create -d macvlan --scope swarm --internal --attachable --config-from config_net final_net

There was just one problem - it didn't work.
I found out that such a network was not in fact internal, which was fixed by the two PRs mentioned above.
If you run those commands on a patched system the network really will be internal, which hits the hard check that forces the network to use a dummy interface.

@chiragtayal

@lemrouch: I am seeing the --config-from network using the parent interface from the config-only network. Let me know if I am misunderstanding your query.

docker@mgr:~$ docker network create --config-only --subnet 192.168.99.0/24 --gateway=192.168.99.1 --ip-range=192.168.99.200/32 -o parent=eth1 -d macvlan net

docker@mgr:~$ docker network create --scope swarm --config-from net -d macvlan --internal mnet

docker@mgr:~$ docker service create --network mnet praqma/network-multitool

docker@mgr:~$ docker network inspect mnet
[
    {
        "Name": "mnet",
        "Id": "tzj7982ylx89dor84h3pxs0mk",
        "Created": "2019-08-20T23:07:48.903599258Z",
        "Scope": "swarm",
        "Driver": "macvlan",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "192.168.99.0/24",
                    "IPRange": "192.168.99.200/32",
                    "Gateway": "192.168.99.1"
                }
            ]
        },
        "Internal": true,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": "net"
        },
        "ConfigOnly": false,
        "Containers": {
            "7cc3a6dead4fd7b75ee4fc9e732442b281fc1b0b396edb7dc596d1fa2619b51e": {
                "Name": "quirky_thompson.1.kqz0whc5en1e5jzugoelah0wn",
                "EndpointID": "19ba0a0b385f8ca5f7efc4f280c5c2d8dcc77f1ad4eccabff0f57ff2c1705b93",
                "MacAddress": "02:42:c0:a8:63:c8",
                "IPv4Address": "192.168.99.200/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "parent": "eth1"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "082aca3b1d22",
                "IP": "192.168.99.100"
            }
        ]
    }
]

@lemrouch
Contributor Author

To prevent any possible misunderstanding I have to ask first: do you have the patch from #2414 applied?
If not, docker network inspect will lie to you and the network is not in fact internal.

@chiragtayal

chiragtayal commented Aug 21, 2019

@lemrouch: I do have the patches for #2414 and #2411.
I see what you are saying; let me check internally what the expected behavior is and I will provide the correct answer.

@lemrouch
Contributor Author

And the correct answer is..?

@lemrouch
Contributor Author

Ping. Anybody out there?

@suwang48404
Contributor

@lemrouch
sorry for the delay.

Some background on swarm networking, so we speak on the same wavelength.

  1. all containers created using "docker service create" use the swarm default overlay network "ingress" for inbound requests, and this is independent of the "--network" flag of "docker service create"
  2. the "--network" flag specifies which network to attach the container to for its outbound requests

Typically, a container in swarm mode with "--network A_MAC_VLAN_NETWORK" has the following interfaces:

sudo ip -d add
29: eth1@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:10 brd ff:ff:ff:ff:ff:ff link-netnsid 1 promiscuity 0
    veth
    inet 10.255.0.16/16 brd 10.255.255.255 scope global eth1
       valid_lft forever preferred_lft forever
33: eth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default
    link/ether 02:42:0a:14:1e:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0
    macvlan mode bridge
    inet 10.20.30.161/24 brd 10.20.30.255 scope global eth0
       valid_lft forever preferred_lft forever

sudo nsenter -t 16614 -n ip route
default via 10.20.30.160 dev eth0
10.20.30.0/24 dev eth0 proto kernel scope link src 10.20.30.161
10.255.0.0/16 dev eth1 proto kernel scope link src 10.255.0.16

where eth1 is the veth pair attached to the swarm ingress overlay network, and eth0 is the macvlan interface that has direct access to the outside of the host.

Say this container also provides an HTTP service on port 80; then HTTP requests arrive at, and are replied to on, eth1 (the veth pair), and the macvlan interface is not in play.

If this container itself requires an external service, say ping www.google.com, then, as dictated by the routing table above, it goes through the eth0 (macvlan) interface directly.
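
(To double-check which interface outbound traffic takes, one can ask the routing table from the container's namespace directly; same container PID as above, output shape only indicative:)

sudo nsenter -t 16614 -n ip route get 8.8.8.8
# should resolve to "... via 10.20.30.160 dev eth0 ...", i.e. the macvlan interface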

Thus to summarize,

  1. any container on swarm, i.e. created via "docker service create", is always accessible from the external world via its published ports
  2. the "--network" flag indicates which network to attach to in order to communicate with the outside world from within the container
  3. the "--internal" flag in "docker network" has local scope, not cluster scope. Thus, containers attached to an internal network (macvlan/overlay) via some interfaces can only communicate via those interfaces if they are on the same node.

Hope this partially helps you determine your requirements.

@lemrouch
Contributor Author

lemrouch commented Dec 9, 2019

Just to be clear about what I am trying to achieve: my goal is to allow containers running on swarm to access a legacy server which runs without any external connectivity at all, in its own subnet in a separate VLAN.

There is just one small problem with your background point 2).
Containers can be attached to multiple networks, and as I mentioned in #2406 the documentation says:
When a container is connected to multiple networks, its external connectivity is provided via the first non-internal network, in lexical order.
Sadly this is true only for overlay networks right now. Any internal MACVLAN network changes the default gateway, as the driver ignores that parameter completely. My patches are simply trying to fix this bug.
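
(A quick way to see whether the documented behavior holds for a given container — a sketch only, with the container name as a placeholder and the same nsenter trick as above:)

nsenter -t $(docker inspect --format '{{.State.Pid}}' some_container) -n ip route show default
# expected: a default route via the first non-internal network; with the bug,
# the default route instead points at the internal MACVLAN network's imagined gateway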

I'm afraid your summary 3) is also not correct.
Let's see what the documentation says:
If you want to create an externally isolated overlay network, you can specify the --internal option.
An overlay network definitely does not have local scope, and it even works as expected: every container connected to such a network can see the others even when they don't share a node, and the default gateway is not changed.

I was able to join the VXLAN of such an internal overlay network, but it's quite difficult to maintain, as docker manages the ARP tables on the nodes internally.
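
(For the record, joining an external host to the overlay's VXLAN looks roughly like this — a sketch only; the VNI, underlay device, addresses and MACs are placeholders that have to be dug out of docker and then kept in sync by hand, which is exactly the maintenance problem:)

ip link add vxlan0 type vxlan id 4097 dev enp1s0 dstport 4789
ip addr add 10.0.1.250/24 dev vxlan0
ip link set vxlan0 up
# flood and ARP entries must be maintained manually for every node/container
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 10.64.10.51
ip neigh replace 10.0.1.2 lladdr 02:42:0a:00:01:02 dev vxlan0 nud permanent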

The MACVLAN approach is a little more complicated to set up, as one has to use the config-from networks, but it's easy to maintain.

@lemrouch
Contributor Author

lemrouch commented Jan 2, 2020

Happy New Year!
Anyone here?

@arkodg
Contributor

arkodg commented Jan 2, 2020

Hi @lemrouch, I'm not sure how an internal network would work for macvlan networks. From my understanding, internal means limiting traffic to east-west / disallowing north-south.

This is taken care of in overlay networks by not connecting the container endpoints to the docker_gwbridge - https://github.com/docker/libnetwork/blob/feeff4f0a3fd2a2bb19cf67c826082c66ffaaed9/default_gateway.go#L127

This also makes sense for bridge drivers, which can apply iptables policies to achieve this:
https://github.com/docker/libnetwork/blob/feeff4f0a3fd2a2bb19cf67c826082c66ffaaed9/drivers/bridge/setup_ip_tables.go#L338
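
(Roughly, the rules installed for an internal bridge network have this shape — an illustrative sketch, not the exact chain names or rule order; <bridge> and <subnet> stand for the network's bridge interface and subnet:)

iptables -I DOCKER-ISOLATION-STAGE-1 -i <bridge> ! -d <subnet> -j DROP
iptables -I DOCKER-ISOLATION-STAGE-1 -o <bridge> ! -s <subnet> -j DROP
# anything entering or leaving the bridge that isn't addressed within the
# network's own subnet gets dropped, which cuts off north-south traffic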

@lemrouch
Contributor Author

lemrouch commented Jan 3, 2020

My original task:
I have a swarm which will run several containers.
This swarm is connected to the internet and to one legacy system which is separated in its own network segment.
Only some of the containers are supposed to have access to the legacy system.

Therefore I need some kind of swarm network which will not mess with the default gateway and will allow a non-swarm server to join it.

I can create an internal overlay network, which gives me almost exactly what I need, but it's quite complicated to connect a non-swarm server to the overlay network's VXLAN and to explain to each container where to find the legacy system inside it.

If I go the MACVLAN way it's a little more complicated to set up, but the legacy network segment can be connected to the underlay VLAN just fine and each container needs just a simple static route to the legacy system's subnet.
But here the bug comes into play: if I create any MACVLAN network, it just imagines its own default gateway, which is preferred over all other networks, and all traffic goes into this black hole.

From my point of view it's just fine to have traffic limited east-west.

@arkodg
Contributor

arkodg commented Jan 3, 2020

Thanks for the clarification, so the issue is that this statement doesn't hold true for you - "When a container is connected to multiple networks, its external connectivity is provided via the first non-internal network, in lexical order." In your case your containers have 2 endpoints (VXLAN and MACVLAN) and you want north-south traffic to egress via the VXLAN endpoint, so you're attempting to make the MACVLAN network internal?

Sharing some docker commands, and using net=host to mimic the legacy server, might help me and anyone else interested in this issue understand the problem better.

@lemrouch
Contributor Author

lemrouch commented Jan 6, 2020

Yes!
I'm sorry it was unclear. I have to fix several bugs at once to solve my problem.

The default gateway problem is described in #2406.

This issue is about the underlay device. The problem is that when I finally got an internal MACVLAN network working, it was using just a dummy interface instead of the real VLAN device. I have no idea who wrote such code, or why. It might be OK for a single node but not for swarm.
The fix is in #2419.

@arkodg
Contributor

arkodg commented Jan 6, 2020

@lemrouch
Regarding #2419, I think the authors intended internal (for macvlan drivers) to mean internal to the local node, based on this comment: #964 (comment). cc: @nerdalert @mavenugo

@lemrouch
Contributor Author

lemrouch commented Jan 7, 2020

Frankly, this makes no sense.
I hope at this point you agree that internal means just isolated = no external connectivity = no default gateway.
Why any MACVLAN network without external connectivity should not be an actual MACVLAN is beyond me.
The dummy interface makes sense only in the case where the parent interface is not set at all, but even in that case there should be a check for a config-from network which might set it to a real device.
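
(The check I have in mind, in shell terms — the network name is taken from the earlier example and is purely illustrative; on a correct implementation the two commands below should agree:)

docker network inspect --format '{{index .Options "parent"}}' final_net
# -> the parent requested via the config-only network, e.g. enp1s0.50
ip -o link show type dummy
# -> on a node where the hard check kicks in, a dm-<id> dummy link shows up
#    here and is used as the parent instead of the VLAN device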

I think @chiragtayal agreed and went on a mission to fix this but got lost in the woods in the process.

@maaraneasi

Is there any chance that this annoying bug will get fixed?
Thanks!

@lemrouch
Contributor Author

lemrouch commented Feb 20, 2020

Working example as requested in #2419

Let's say we have a legacy server (apex5) running PostgreSQL. It has an interface in VLAN 49 and PostgreSQL is listening there.

root@apex5:~# ip a s enp1s0.49
3: enp1s0.49@enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:62:62:87 brd ff:ff:ff:ff:ff:ff
    inet 10.64.49.128/24 brd 10.64.49.255 scope global enp1s0.49
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe62:6287/64 scope link
       valid_lft forever preferred_lft forever

root@apex5:~# netstat -lnptu | grep 10.64.49.128
tcp        0      0 10.64.49.128:5432       0.0.0.0:*               LISTEN      13052/postgres

Next to it are two docker swarm nodes (apex1, apex2) which have interfaces in the same VLAN 49:

root@apex1:~# ip a s enp1s0.49
3: enp1s0.49@enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:2a:ab:45 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe2a:ab45/64 scope link
       valid_lft forever preferred_lft forever

root@apex2:~# ip a s enp1s0.49
3: enp1s0.49@enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:4b:96:38 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe4b:9638/64 scope link
       valid_lft forever preferred_lft forever

Let's prepare the config-only networks for our MACVLAN network, where the IP ranges don't collide with each other or with the legacy system:

root@apex1:~# docker network create --config-only --subnet 10.64.49.0/24 -o parent=enp1s0.49 --ip-range 10.64.49.0/27 private_db_net49
root@apex2:~# docker network create --config-only --subnet 10.64.49.0/24 -o parent=enp1s0.49 --ip-range 10.64.49.32/27 private_db_net49

Create the MACVLAN network with the config-from parameter:

root@apex1:~# docker network create -d macvlan --scope swarm --internal --attachable --config-from private_db_net49 db_net

Inspect network:

root@apex1:~# docker inspect db_net
[
    {
        "Name": "db_net",
        "Id": "ri5690n3emnwtyw69d5cphna8",
        "Created": "2020-02-20T14:22:30.764770052Z",
        "Scope": "swarm",
        "Driver": "macvlan",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "",
            "Options": null,
            "Config": []
        },
        "Internal": true,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": "private_db_net49"
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": null,
        "Labels": null
    }
]

Run a container for testing:

root@apex1:~# docker run -it --rm --network=db_net --name pg_client postgres:11 /bin/bash

Inspect the container:

root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ip r s
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 

Connect the client to our internal MACVLAN network:

root@apex1:~# docker network connect db_net pg_client

Inspect the network again. It should now have "parent": "enp1s0.49":

[
    {
        "Name": "db_net",
        "Id": "ri5690n3emnwtyw69d5cphna8",
        "Created": "2020-02-20T14:57:46.059003428Z",
        "Scope": "swarm",
        "Driver": "macvlan",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.64.49.0/24",
                    "IPRange": "10.64.49.0/27"
                }
            ]
        },
        "Internal": true,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": "private_db_net49"
        },
        "ConfigOnly": false,
        "Containers": {
            "9de45ad12c9722ee93a61c0e36be2363cf50a3aac37e64e1921702fab46d82b0": {
                "Name": "pg_client",
                "EndpointID": "701ae023f03895a6b78e79e27c660ff6d6bcfd264ce52c669c30413213d70dc7",
                "MacAddress": "02:42:0a:40:31:02",
                "IPv4Address": "10.64.49.2/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "parent": "enp1s0.49"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "b744b237683f",
                "IP": "10.64.10.51"
            }
        ]
    }
]

Inspect the container again. The default gateway should not be changed by the new interface:

root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
17: eth1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:0a:40:31:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.64.49.2/24 brd 10.64.49.255 scope global eth1
       valid_lft forever preferred_lft forever
root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ip r s
default via 172.17.0.1 dev eth0 
10.64.49.0/24 dev eth1 proto kernel scope link src 10.64.49.2 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 

Test connectivity from the container to the legacy system over the MACVLAN network (the pg_hba.conf error below shows the TCP connection reached the server; PostgreSQL just rejects the unauthorized client):

root@9de45ad12c97:/# psql -U just_test -h 10.64.49.128
psql: FATAL:  no pg_hba.conf entry for host "10.64.49.2", user "just_test", database "just_test", SSL on
FATAL:  no pg_hba.conf entry for host "10.64.49.2", user "just_test", database "just_test", SSL off

Run and test a container on the 2nd node:

root@apex2:~# docker run -it --rm --name pg_client postgres:11 /bin/bash
..
root@apex2:~# docker network connect db_net pg_client
..
root@72cae331dfac:/# psql -U just_test -h 10.64.49.128
psql: FATAL:  no pg_hba.conf entry for host "10.64.49.33", user "just_test", database "just_test", SSL on
FATAL:  no pg_hba.conf entry for host "10.64.49.33", user "just_test", database "just_test", SSL off

Test external connectivity (via the default route, which still points at the non-MACVLAN network):

root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=7.72 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 7.721/7.721/7.721/0.000 ms

Test connectivity between nodes over MACVLAN network:

root@apex1:~# nsenter --net=/proc/`docker inspect --format '{{.State.Pid}}' pg_client`/ns/net ping -c 1 10.64.49.33
PING 10.64.49.33 (10.64.49.33) 56(84) bytes of data.
64 bytes from 10.64.49.33: icmp_seq=1 ttl=64 time=0.513 ms

--- 10.64.49.33 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.513/0.513/0.513/0.000 ms
