-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Multiple Pals could return spuriously from wake on address #739
Conversation
Co-authored-by: Alexander Nadeau <wareya@gmail.com>
This commit checks that wait_on_address has not returned spuriously.
The code in the Pal for wake on address was incorrectly assuming the operation returning success meant it had actually changed. The specification allows for spurious wake ups. This change makes the Linux Pal recheck for a change.
The underlying OS APIs for all of these (Linux, macOS, Windows, FreeBSD, and so on) are all permitted to spuriously wake. The C++ notify and wake APIs are not allowed to spuriously wake, but are then vulnerable to ABA issues (they wrap the underlying APIs and then check on return). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. I do know spurious wake up but I missed it when writing the loop logic.
@@ -336,8 +336,7 @@ namespace snmalloc | |||
{ | |||
while (addr.load(stl::memory_order_relaxed) == expected) | |||
{ | |||
if (::WaitOnAddress(&addr, &expected, sizeof(T), INFINITE)) | |||
break; | |||
::WaitOnAddress(&addr, &expected, sizeof(T), INFINITE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WaitOnAddress is guaranteed to return when the address is signaled, but it is also allowed to return for other reasons. For this reason, after WaitOnAddress returns the caller should compare the new value with the original undesired value to confirm that the value has actually changed. For example, the following circumstances can result in waking the thread early:
- Low memory conditions
- A previous wake on the same address was abandoned
- Executing code on a checked build of the operating system
import matplotlib.pyplot as plt
import subprocess
import time as tt
subprocess.run(["gh", "pr", "checkout", '739'], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel", '--target', 'perf-startup-fast'], check=True)
time = []
tt.sleep(1)
for i in range(1, 1000):
out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
out = out.stdout.decode("utf-8")
for line in out.split("\n"):
if line.startswith("Taken: "):
time.append(float(line.split()[1]))
break
subprocess.run(["git", "checkout", "upstream/main"], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel", '--target', 'perf-startup-fast'], check=True)
time2 = []
tt.sleep(1)
for i in range(1, 1000):
out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
out = out.stdout.decode("utf-8")
for line in out.split("\n"):
if line.startswith("Taken: "):
time2.append(float(line.split()[1]))
break
plt.boxplot([time, time2], labels=["pr-739", "main"], vert=True, patch_artist=True, showmeans=True)
plt.yscale('log')
plt.show() On FreeBSD-14.2, performance impact seems negligible. |
Thanks @davidchisnall I have made all the Pal uses loop. Our use case is not prone to ABA, because only the waiting thread could move it back to the bad state. These are thankfully all used point to point. I have expanded the PR description, and hopefully have all the platforms correct. The test failed on Windows before adding this, which is a good sign for the test. |
Looks ok just a quick note, I think openbsd needs this update too, wdyt ? |
Sorry, it was on my machine, but I hadn't pushed the last commit. Thanks for looking so quickly. |
This fixes a bug in multiple Pals that could return from wake on address spuriously.
The Linux man page (https://www.man7.org/linux/man-pages/man2/futex.2.html#RETURN_VALUE)
says:
But the code could return without rechecking the the value had actually changed.
There is a similar comment on Windows. I have made all Pals recheck.
I have also added defensive code to detect if this occurs in the combining lock, and hard
fail the allocator.