Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

evpn: scope mac-mobility handling to MAC-VRF of the route #2805

Merged
merged 2 commits into from
Jun 10, 2024

Conversation

Tuetuopay
Copy link
Contributor

@Tuetuopay Tuetuopay commented May 3, 2024

The mac mobility code used lookup tables to find the routes to the MAC address. It only considered the MAC itself, and not the associated MAC-VRF. This led to the complexity still being way too high when one MAC is present in all MAC-VRFs, like a virtual router or an anycast gateway.

This patch now considers the MAC-VRF for a route by looking up its route targets, and considers them all if multiple are present. This way, the lookup for routes that will participate in mac-mobility will be limited to the MAC-VRF. This allows to not blow up the CPU when a MAC address is present in a lot of MAC-VRFs.

The biggest behavioral change is now how mac-mobility is sharded. For example, it can now have the same MAC in multiple MAC-VRFs on different PEs. This is quite useful in highly multi-tenant environments where control on the MAC addresses is nigh impossible. This is a net fix of the behavior where MAC-VRFs cannot affect each other. For reference, FRRouting does scope Mac Mobility to the MAC-VRF.

But the main reason this was done is to fix the quadratic complexity that comes when there are many times the same MAC in many different MAC-VRFs, by not needing to iterate through all the routes for other unrelated MAC-VRFs.

A quick benchmark was done by adding type-2 routes through the gRPC API. For 20k routes, on my laptop (Ryzen 5 PRO 4650U):

  • without this patch
    • clean run: 228 seconds (3min 48s), starting around 1200 routes/s and decreasing (non linearily) to less than 50 routes/s
    • rerun a second time: 424 seconds (7min 4s), stuck at around 50 routes/s
  • with this patch, regardless of the whether the process is empty or not: 7.7s seconds, at 2700 routes/s the whole time

It is to be noted that, without the patch, the whole route ingestion will be throttled by type-2 route ingestion.

Note: this PR is both a performance and a correctness fix. It is built atop #2804 which is another correctness fix, hence the two commits. Please keep the comments for the first commit in the other pull request

@fujita
Copy link
Member

fujita commented May 16, 2024

Looks like the evpn test failed.

@Tuetuopay
Copy link
Contributor Author

Tuetuopay commented May 21, 2024

Looks like the evpn test failed.

@fujita indeed. Since you're here I would like for your input on the implementation. I am not sure about the behavior when one of the RTs of the route is for an IP-VRF and not just a MAC-VRF (i.e. a route when doing type-5 routing for e.g. a router's mac).

I will fix the test, thanks!

@Tuetuopay
Copy link
Contributor Author

@fujita the evpn tests have been fixed. there was indeed a logic error in my patch due to some place I missed accessing the destinations directly (TableManager directly writing to Table's Destinations without going through table, which is bad isolation). I moved accesses to Destination to Table itself.

The graceful restart tests do fail which I don't understand since I did not touch this part of the code, especially since it's on an address family not affected by the patch (ipv4/6 unicast vs evpn). Those appear to be flaky since when running them locally they don't fail more often than they do. But they still do occasionally.

This test ensures that a MAC can be announced by different peers as long
as it is in different MAC-VRFs, not applying MAC-mobility algorithms.

It does so by adding two additional VRFs for the new route to land in,
and by cleaning up a bit the RDs and RTs used by the existing tests.
The mac mobility code used lookup tables to find the routes to the MAC
address. It only considered the MAC itself, and not the associated
MAC-VRF. This led to the complexity still being way too high when one
MAC is present in all MAC-VRFs, like a virtual router or an anycast
gateway.

This patch now considers the MAC-VRF for a route by looking up its route
targets, and considers them all if multiple are present. This way, the
lookup for routes that will participate in mac-mobility will be limited
to the MAC-VRF. This allows to not blow up the CPU when a MAC address is
present in a lot of MAC-VRFs.

The biggest behavioral change is now how mac-mobility is sharded. For
example, it can now have the same MAC in multiple MAC-VRFs on different
PEs. This is quite useful in highly multi-tenant environments where
control on the MAC addresses is nigh impossible. This is a net fix of
the behavior where MAC-VRFs cannot affect each other. For reference,
FRRouting does scope Mac Mobility to the MAC-VRF.

But the main reason this was done is to fix the quadratic complexity
that comes when there are many times the same MAC in many different
MAC-VRFs, by not needing to iterate through all the routes for other
unrelated MAC-VRFs.

A quick benchmark was done by adding type-2 routes through the gRPC API.
For 20k routes, on my laptop (Ryzen 5 PRO 4650U):

- without this patch
  * clean run: 228 seconds (3min 48s), starting around 1200 routes/s and
    decreasing (non linearily) to less than 50 routes/s
  * rerun a second time: 424 seconds (7min 4s), stuck at around 50
    routes/s
- with this patch, regardless of the whether the process is empty or
  not: 7.7s seconds, at 2700 routes/s the whole time

It is to be noted that, without the patch, the whole route ingestion
will be throttled by type-2 route ingestion.
@Tuetuopay
Copy link
Contributor Author

rebased

@fujita fujita merged commit 7d2d020 into osrg:master Jun 10, 2024
39 checks passed
@fujita
Copy link
Member

fujita commented Jun 10, 2024

Thanks!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants