Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Quadratic runtime when sending many requests #87

Closed
MrMage opened this issue Apr 8, 2019 · 14 comments
Closed

Quadratic runtime when sending many requests #87

MrMage opened this issue Apr 8, 2019 · 14 comments

Comments

@MrMage
Copy link
Contributor

MrMage commented Apr 8, 2019

Using NIOHTTP2 version 1.0.1.

Caveat: We might be doing something wrong in SwiftGRPC and/or this might be fixed by @Lukasa's recent changes, but I figured you guys might want to take a look.

When I run https://github.com/grpc/grpc-swift/blob/da1a59c919555367abfd6e2ffb285a175ffa11c0/Tests/SwiftGRPCNIOTests/NIOFunctionalTests.swift#L79 in Release mode with numberOfRequests = 2_000:

Test Suite 'Selected tests' started at 2019-04-08 14:11:22.901
Test Suite 'SwiftGRPCNIOTests.xctest' started at 2019-04-08 14:11:22.901
Test Suite 'NIOFunctionalTestsInsecureTransport' started at 2019-04-08 14:11:22.901
Test Case '-[SwiftGRPCNIOTests.NIOFunctionalTestsInsecureTransport testUnaryLotsOfRequests]' started.
1000 requests sent so far, elapsed time: 0.32004
total time to send 2000 requests: 0.852957
total time to receive 2000 responses: 2.479385
Test Case '-[SwiftGRPCNIOTests.NIOFunctionalTestsInsecureTransport testUnaryLotsOfRequests]' passed (1.644 seconds).
Test Suite 'NIOFunctionalTestsInsecureTransport' passed at 2019-04-08 14:11:24.545.
	 Executed 1 test, with 0 failures (0 unexpected) in 1.644 (1.644) seconds
Test Suite 'SwiftGRPCNIOTests.xctest' passed at 2019-04-08 14:11:24.545.
	 Executed 1 test, with 0 failures (0 unexpected) in 1.644 (1.644) seconds
Test Suite 'Selected tests' passed at 2019-04-08 14:11:24.546.
	 Executed 1 test, with 0 failures (0 unexpected) in 1.644 (1.645) seconds
Program ended with exit code: 0

When run with numberOfRequests = 4_000:

Test Suite 'Selected tests' started at 2019-04-08 14:11:36.312
Test Suite 'SwiftGRPCNIOTests.xctest' started at 2019-04-08 14:11:36.312
Test Suite 'NIOFunctionalTestsInsecureTransport' started at 2019-04-08 14:11:36.312
Test Case '-[SwiftGRPCNIOTests.NIOFunctionalTestsInsecureTransport testUnaryLotsOfRequests]' started.
1000 requests sent so far, elapsed time: 0.345793
2000 requests sent so far, elapsed time: 0.896414
3000 requests sent so far, elapsed time: 2.113964
total time to send 4000 requests: 3.534887
total time to receive 4000 responses: 7.418182
Test Case '-[SwiftGRPCNIOTests.NIOFunctionalTestsInsecureTransport testUnaryLotsOfRequests]' passed (4.771 seconds).
Test Suite 'NIOFunctionalTestsInsecureTransport' passed at 2019-04-08 14:11:41.084.
	 Executed 1 test, with 0 failures (0 unexpected) in 4.771 (4.772) seconds
Test Suite 'SwiftGRPCNIOTests.xctest' passed at 2019-04-08 14:11:41.085.
	 Executed 1 test, with 0 failures (0 unexpected) in 4.771 (4.772) seconds
Test Suite 'Selected tests' passed at 2019-04-08 14:11:41.085.
	 Executed 1 test, with 0 failures (0 unexpected) in 4.771 (4.773) seconds
Program ended with exit code: 0

Partial Instruments screenshots:

Screen Shot 2019-04-08 at 14 19 04

Screen Shot 2019-04-08 at 14 19 14

@Lukasa
Copy link
Contributor

Lukasa commented Apr 8, 2019

Huh, weird. I'll investigate.

As a quick note, I don't think my recent changes will have affected that, as they imposed a linear cost on each header block encoded, and header block encoding is always going to be linear-time anyway. So we need to investigate further and find where the quadratic operation count is.

@Lukasa
Copy link
Contributor

Lukasa commented Apr 9, 2019

Hrm, this is a nasty one.

Regardless of whether or not the HTTP/2 stack has any quadratic runtime in it, we're not getting a chance to exercise it. This is because we're hitting a quadratic runtime problem in NIO's scheduler first. We'll need to resolve that before we go any further.

The issue is fundamentally to do with the way our Heap is implemented. Specifically, we're hitting a weird edgecase which appears to cause Swift to emit unnecessary ARC traffic, and that unnecessary ARC traffic leads to an unnecessary CoW of the backing array storage. Tracking that down took quite a while, but the result is that a loop spinning on any of the SelectableEventLoop functions that modifies the task queue (execute, submit, schedule) exhibits quadratic behaviour.

While we wait for the above Swift bug to be fixed I'm also going to rewrite the Heap. The weird edge case was hit because of our use of static funcs, and those were only there to allow us to feel more confident that we hadn't misimplented the algorithms. We clearly haven't, and we also wrote a bunch of tests, so I'm going to just pull those functions down into instance methods that Swift is more capable of seeing through, which should resolve the problem nicely.

I am not confident this is the end of the quadratic behaviour this test will see. In particular, I'm sure there is a quadratic loop in the HTTP/2 code that I've already found that this test should hit. With that one I'm less confident whether we should fix it or let it sit there, but that's a discussion for another day: first, let's fix the quadratic behaviour we know about.

@Lukasa
Copy link
Contributor

Lukasa commented Apr 9, 2019

See apple/swift-nio#960 for the fix for Heap.

@MrMage
Copy link
Contributor Author

MrMage commented Apr 10, 2019

Wow, impressive detective work! Looking forward to seeing what the performance of this test will be once all the quadratic runtime issues are fixed.

@Lukasa
Copy link
Contributor

Lukasa commented Apr 10, 2019

Incidentally, there's another major performance gain you can have here that works today.

Right now the test spins in a loop on client.get, which ends up calling EventLoop.execute to swap onto the event loop thread. The effect of that is that you are acquiring and releasing the event loop task queue lock many times in a tight loop. This lock will simultaneously be being contended on by the event loop thread itself, so this lock acquisition pattern ends up being quite slow.

In the pre-Heap-fix code I changed this by moving the loop execution onto the event loop thread. This eliminated the cross-thread traffic, and in one case dropped the runtime for preparing 20k requests from 13 seconds to 2 seconds.

Put another way, the cross-thread communication still utterly dominates the quadratic runtime of the heap logic if the iteration count gets large enough.

@Lukasa
Copy link
Contributor

Lukasa commented Apr 15, 2019

@MrMage Is this problem still outstanding for you?

@MrMage
Copy link
Contributor Author

MrMage commented Apr 16, 2019

@Lukasa I haven't tried recently; my understanding was that there were other causes of quadratic behavior that have yet to be fixed. Is that still the case or should I re-run our tests?

@Lukasa
Copy link
Contributor

Lukasa commented Apr 16, 2019

So there's one possible source of quadratic behaviour in your code, which is that calls to flush from the child channels will lead to a loop over the number of streams that are being buffered to avoid violating SETTINGS_MAX_CONCURRENT_STREAMS. As you attempt to create a bunch of streams all at once, and flush each of them, that buffer will be sized O(n), and will be flushed O(n) times, leading to a quadratic loop.

However, I don't believe that quadratic behaviour is likely to be a performance problem in real programs. Your test wildly exceeds SETTINGS_MAX_CONCURRENT_STREAMS (which is usually set to 100), and so is an absolute worst-possible-case for that code. It's also profoundly difficult to fix within the semantics of NIO's flushing model: I end up needing to maintain a separate data structure that indicates which streams need to be flushed, and in the common-case (where there are only a handful of streams buffered at most) that's more expensive than the current model.

So I'm interested in seeing whether you're still having trouble. If your test runs in an acceptable amount of time then I'm going to consider that quadratic behaviour low-urgency to address.

@MrMage
Copy link
Contributor Author

MrMage commented Apr 16, 2019

@Lukasa thank you for the elaboration! I can confirm that sending requests is now very fast, even from "off" the event loop (I guess acquiring 2000 locks isn't a major bottleneck yet). However, waiting for all requests to be received takes three times longer for each doubling of the request count:

1000 requests sent so far, elapsed time: 0.085628
total time to send 2000 requests: 0.12969
total time to receive 2000 responses: 1.721965
1000 requests sent so far, elapsed time: 0.110867
2000 requests sent so far, elapsed time: 0.176551
3000 requests sent so far, elapsed time: 0.232632
total time to send 4000 requests: 0.291121
total time to receive 4000 responses: 4.914795
1000 requests sent so far, elapsed time: 0.090735
2000 requests sent so far, elapsed time: 0.151104
3000 requests sent so far, elapsed time: 0.2133
4000 requests sent so far, elapsed time: 0.269586
5000 requests sent so far, elapsed time: 0.312878
6000 requests sent so far, elapsed time: 0.355842
7000 requests sent so far, elapsed time: 0.410365
total time to send 8000 requests: 0.453256
total time to receive 8000 responses: 15.107122

Looking at the Instruments traces, it does appear that flushing is the culprit here, and is a bottleneck, mostly on the client:

Screen Shot 2019-04-16 at 10 21 55

Screen Shot 2019-04-16 at 10 22 39

Screen Shot 2019-04-16 at 10 24 32

I assume this is the behavior you are describing; given that it only occurs in a fairly "extreme" test right now, there's probably no point in trying to fix this.

@MrMage
Copy link
Contributor Author

MrMage commented Apr 16, 2019

Follow-up: I can confirm that injecting a

Thread.sleep(forTimeInterval: 0.00025)

inside the "send" loop completely eliminates the quadratic-runtime issue. However, that runtime appears to not be very relevant for 2000 requests, anyway; it only becomes apparent when sending 4000 or more requests.

MrMage added a commit to Timing-GmbH/grpc-swift that referenced this issue Apr 16, 2019
@Lukasa
Copy link
Contributor

Lukasa commented Apr 16, 2019

Ok, for now I think we can accept that if you're exceeding SETTINGS_MAX_CONCURRENT_STREAMS by 40x then you may begin to see performance problems and want to rearchitect your application. We can revisit this decision down the road if it becomes worthwhile.

Let's call this done for now. Thanks for the report, it led to a number of useful fixes!

@glbrntt
Copy link
Contributor

glbrntt commented Apr 16, 2019

Follow-up: I can confirm that injecting a

Thread.sleep(forTimeInterval: 0.00025)

inside the "send" loop completely eliminates the quadratic-runtime issue. However, that runtime appears to not be very relevant for 2000 requests, anyway; it only becomes apparent when sending 4000 or more requests.

Sorry, why does this work?

@MrMage
Copy link
Contributor Author

MrMage commented Apr 16, 2019

Sorry, why does this work?

The test runtime explodes because each flush call needs to iterate over all open streams. Throttling the number of send operations in flight reduces the amount of streams that are still open and thus need to be iterated over.

MrMage added a commit to grpc/grpc-swift that referenced this issue Apr 16, 2019
* Avoid quadratic runtime in `testUnaryLotsOfRequests`.

See apple/swift-nio-http2#87 (comment).

* Decrease a few test timeouts (so tests pass in Release builds as well) and improve test error reporting.

* Delay spinning up an event loop group for each test until the test is actually being set up.

This avoid prematurely spinning up dozens of event loop threads when the test cases get allocated.
@glbrntt
Copy link
Contributor

glbrntt commented Apr 16, 2019

The test runtime explodes because each flush call needs to iterate over all open streams. Throttling the number of send operations in flight reduces the amount of streams that are still open and thus need to be iterated over.

Ahh of course, thanks!

IceRocky pushed a commit to IceRocky/grpc-swift that referenced this issue May 28, 2024
* Avoid quadratic runtime in `testUnaryLotsOfRequests`.

See apple/swift-nio-http2#87 (comment).

* Decrease a few test timeouts (so tests pass in Release builds as well) and improve test error reporting.

* Delay spinning up an event loop group for each test until the test is actually being set up.

This avoid prematurely spinning up dozens of event loop threads when the test cases get allocated.
teskobif7 added a commit to teskobif7/grpc-swift that referenced this issue Aug 14, 2024
* Avoid quadratic runtime in `testUnaryLotsOfRequests`.

See apple/swift-nio-http2#87 (comment).

* Decrease a few test timeouts (so tests pass in Release builds as well) and improve test error reporting.

* Delay spinning up an event loop group for each test until the test is actually being set up.

This avoid prematurely spinning up dozens of event loop threads when the test cases get allocated.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants