
[LibOS] Use static buffer for first thread executing sendfile() #1615

Merged
merged 1 commit into master from dimakuv/sendfile-static-buf-insteadof-malloc on Oct 25, 2023

Conversation


@dimakuv dimakuv commented Oct 23, 2023

Description of the changes

This performance optimization uses a statically allocated buffer instead of allocating on the heap in the sendfile() syscall. To prevent data races on a single static buffer, this optimization is applied only to the first thread. This optimization is useful for sendfile-heavy workloads, e.g. Nginx in HTTP mode (and with config sendfile=on).

The approach was suggested privately by @mkow.

Closes #1600. See there for preliminary discussions and approaches that don't work well.

Thanks to @aneessahib and @TejaswineeL for doing initial exploration.
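The core trick described above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual Gramine code: the helper names `acquire_buf`/`release_buf` and the exact memory orders are my assumptions; only the idea (one static 64KB buffer claimed via an atomic flag, with a heap fallback for all other threads) comes from this PR.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SENDFILE_BUF_SIZE (64 * 1024)

/* Hypothetical names; the actual Gramine identifiers may differ. */
static char g_sendfile_buf[SENDFILE_BUF_SIZE];
static bool g_sendfile_buf_in_use = false;

/* Claim the static buffer if no other thread currently holds it;
 * otherwise fall back to a heap allocation. Returns NULL only if the
 * fallback malloc() fails. */
static char* acquire_buf(bool* used_static) {
    bool expected = false;
    if (__atomic_compare_exchange_n(&g_sendfile_buf_in_use, &expected, /*desired=*/true,
                                    /*weak=*/false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        *used_static = true;
        return g_sendfile_buf;
    }
    *used_static = false;
    return malloc(SENDFILE_BUF_SIZE);
}

static void release_buf(char* buf, bool used_static) {
    if (used_static)
        __atomic_store_n(&g_sendfile_buf_in_use, false, __ATOMIC_RELEASE);
    else
        free(buf);
}
```

Only one thread at a time gets the allocation-free static buffer; concurrent callers transparently fall back to malloc(), so correctness never depends on which thread arrives first.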

How to test this PR?

Use Nginx in HTTP mode. The sendfile=on option is already set in our CI-Examples config file.

Manual results below. Take them with a grain of salt, as I tested on my single developer machine without setting up a proper experiment environment. Still, they give an idea of how the optimization helps:

HOW I RAN

  • First terminal:
CI-Examples/nginx$ make start-gramine-server [SGX=1]
  • Second terminal:
CI-Examples/nginx$ LOOP=1 CONCURRENCY_LIST="256" ../common_tools/benchmark-http.sh http://127.0.0.1:8002

The above benchmark-http.sh script basically runs this:
wrk -c 300 -d 30 -t 256 -R 10000 http://127.0.0.1:8002/random/10K.1.html

WITHOUT PR

  • direct:
Run = 1 Concurrency = 256 Per thread Throughput (bytes) = 40.33, Latency (ms) = 1.00
Concurrency = 256: Per Thread Median Throughput (bytes) =    40.330, Latency (ms) =     1.000
  • SGX:
Run = 1 Concurrency = 256 Per thread Throughput (bytes) = 40.75, Latency (ms) = 3.24
Concurrency = 256: Per Thread Median Throughput (bytes) =    40.750, Latency (ms) =     3.240

WITH PR

  • direct (make start-gramine-server):
Run = 1 Concurrency = 256 Per thread Throughput (bytes) = 39.89, Latency (ms) = .773
Concurrency = 256: Per Thread Median Throughput (bytes) =    39.890, Latency (ms) =     0.773
  • SGX (make start-gramine-server SGX=1):
Run = 1 Concurrency = 256 Per thread Throughput (bytes) = 40.74, Latency (ms) = 2.80
Concurrency = 256: Per Thread Median Throughput (bytes) =    40.740, Latency (ms) =     2.800

CONCLUSIONS

We get roughly the same throughput, but latency improves by about 23% in direct mode (1.00 ms → 0.773 ms) and about 14% in SGX mode (3.24 ms → 2.80 ms). Not bad.



Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)


libos/src/sys/libos_file.c line 23 at r1 (raw file):

 * Read/write in 64KB chunks in the sendfile() syscall. This syscall also has an optimization of
 * using a statically allocated buffer instead of allocating on the heap (as our internal malloc()
 * has a heavy mutex that guards all memory-allocation operations). To prevent data races of

Actually, maybe I should drop the "has a heavy mutex ..." explanation, because I'm not sure it's correct in our case -- after all, Nginx is single-threaded, so there shouldn't really be any mutex contention on malloc/free. Probably the overhead comes just from our internal allocator, which is not super-optimized...

@dimakuv dimakuv force-pushed the dimakuv/sendfile-static-buf-insteadof-malloc branch from b9ea93d to 8b46db0 Compare October 23, 2023 12:48
Member

@mkow mkow left a comment


Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)


libos/src/sys/libos_file.c line 23 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, maybe I should drop the "has a heavy mutex ..." explanation, because I'm not sure it's correct in our case -- after all, Nginx is single-threaded, so there shouldn't really be any mutex contention on malloc/free. Probably the overhead comes just from our internal allocator, which is not super-optimized...

+1


libos/src/sys/libos_file.c line 458 at r2 (raw file):

    bool buf_in_use = false;
    if (__atomic_compare_exchange_n(&g_sendfile_buf_in_use, &buf_in_use, /*desired=*/true,
                                    /*weak=*/true, __ATOMIC_RELEASE, __ATOMIC_ACQUIRE)) {

Actually, you can simplify (+ probably slightly speed up) this by using __atomic_exchange_n() instead.
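The suggested `__atomic_exchange_n()` variant could look roughly like this (a sketch; the helper names and memory orders are my assumptions, not necessarily what the PR ended up using):

```c
#include <stdbool.h>

static bool g_sendfile_buf_in_use = false;

/* Atomically set the flag and return whether it was previously clear,
 * i.e., whether this thread just claimed the static buffer. A single
 * atomic RMW, with no expected/desired bookkeeping as in the
 * compare-exchange version. */
static bool try_claim_static_buf(void) {
    return !__atomic_exchange_n(&g_sendfile_buf_in_use, true, __ATOMIC_ACQUIRE);
}

static void release_static_buf(void) {
    __atomic_store_n(&g_sendfile_buf_in_use, false, __ATOMIC_RELEASE);
}
```

Unlike compare-exchange, a "failed" exchange harmlessly overwrites `true` with `true`, so no retry loop or expected-value handling is needed.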

Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)


libos/src/sys/libos_file.c line 23 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

+1

Done.


libos/src/sys/libos_file.c line 458 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Actually, you can simplify (+ probably slightly speed up) this by using __atomic_exchange_n() instead.

Done.

According to my quick-and-dirty measurements, there is no change in performance (in comparison to the previous approach with __atomic_compare_exchange_n).

Member

@mkow mkow left a comment


Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners

Author

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners

a discussion (no related file):
@aneessahib and @TejaswineeL: could you verify that this PR improves performance on your Nginx workload? The effect should be the same as in the original PR #1600. I will block until you ack/report the result.


Contributor

@kailun-qin kailun-qin left a comment


Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, "fixup! " found in commit messages' one-liners

Contributor

@TejaswineeL TejaswineeL left a comment


Reviewable status: all files reviewed, 1 unresolved discussion, "fixup! " found in commit messages' one-liners (waiting on @aneessahib)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

@aneessahib and @TejaswineeL: could you verify that this PR improves performance on your Nginx workload? The effect should be the same as in the original PR #1600. I will block until you ack/report the result.

@dimakuv
Yes, the changes give the expected performance results for Nginx (the same as with the thread-unsafe static-buffer approach). We are good to go with this change for Nginx.

Thanks


Author

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners

a discussion (no related file):

Previously, TejaswineeL (Tejaswinee) wrote…

@dimakuv
Yes, the changes give the expected performance results for Nginx (the same as with the thread-unsafe static-buffer approach). We are good to go with this change for Nginx.

Thanks

@TejaswineeL Awesome, thanks for running the experiment!


This performance optimization uses a statically allocated buffer instead
of allocating on the heap in the `sendfile()` syscall. To prevent data
races on a single static buffer, this optimization is applied only to
the first thread. This optimization is useful for sendfile-heavy
workloads, e.g. Nginx in HTTP mode (and with config `sendfile=on`).

Co-authored-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
@dimakuv dimakuv force-pushed the dimakuv/sendfile-static-buf-insteadof-malloc branch from 7a8c86f to 53a0178 Compare October 25, 2023 11:39
Member

@mkow mkow left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dimakuv dimakuv merged commit 53a0178 into master Oct 25, 2023
@dimakuv dimakuv deleted the dimakuv/sendfile-static-buf-insteadof-malloc branch October 25, 2023 13:59