Optimization: Removed unnecessary divides #306
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #306 +/- ##
==========================================
- Coverage 77.34% 77.19% -0.15%
==========================================
Files 9 9
Lines 971 978 +7
==========================================
+ Hits 751 755 +4
- Misses 220 223 +3
Continue to review full report at Codecov.
I think the only thing that failed was coverage, because it doesn't test for lanes=3 (or in general, lane counts that aren't 2**N).
Do you have some performance numbers from before/after to show that this is worth doing? I would personally hoist the divisionless divisions into a set of inline functions, to improve readability. As it stands, it looks like magic.
BTW, I think … Edit: struck that out because of the data-dependent branching introduced with this PR as it currently is. Getting rid of the divides is desirable, but not along with the introduction of that other issue.
src/opt.c (outdated):

```c
ref_lane = (pseudo_rand >> 32) - (((pseudo_rand >> 32) * lanes_reciprocal) >> 32) * instance->lanes;
if (ref_lane >= instance->lanes) {
    ref_lane -= instance->lanes;
}
```
As I understand, when running with data-dependent addressing the `ref_lane >= instance->lanes` check makes things worse by also introducing data-dependent branching. Can we avoid that?
Edit: technically, this check and subtraction can be replaced e.g. with signed subtraction, arithmetic right shift to produce a mask, AND, and subtraction. However, perhaps this is worth a comment on why this condition is even possible (is it?) and why at most one subtraction is enough.
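A minimal sketch of that branchless replacement (the helper name `cond_sub` is hypothetical, not the PR's code; it assumes `ref_lane < 2 * lanes`, per the proof below, and `lanes < 2**31`):

```c
#include <stdint.h>

/* Branchless equivalent of: if (ref_lane >= lanes) ref_lane -= lanes;
   Assumes ref_lane < 2 * lanes and lanes < 2^31. */
static inline uint32_t cond_sub(uint32_t ref_lane, uint32_t lanes)
{
    uint32_t d = ref_lane - lanes;             /* wraps when ref_lane < lanes */
    uint32_t m = (uint32_t)((int32_t)d >> 31); /* all-ones iff it wrapped */
    return d + (lanes & m);                    /* add lanes back only then */
}
```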
why at most one subtraction is enough
When you do `reciprocal = floor(1 / n * 2**32)` and `floor(x * reciprocal / 2**32)` (i.e. `floor(x * floor(1 / n * 2**32) / 2**32)`), where `x` is a 32-bit integer:

```
x / n  =  x * (1 / n * 2**32) / 2**32
       >= x * floor(1 / n * 2**32) / 2**32   // the thing
       >  x * (1 / n * 2**32 - 1) / 2**32
       =  (x / n * 2**32 - x) / 2**32
       =  x / n - x / 2**32
       >  x / n - 1                          // note x < 2**32, thus x / 2**32 < 1
```

so

```
x / n - 1 < x * floor(1 / n * 2**32) / 2**32 <= x / n
```

thus

```
floor(x / n) - 1 <= floor(x * floor(1 / n * 2**32) / 2**32) <= floor(x / n)
```

The estimated quotient is exact or one too small, so the estimated remainder `x - floor(x * reciprocal / 2**32) * n` is either `x mod n` or `x mod n + n`, and at most one subtraction of `n` is enough.
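A quick spot-check of that bound (a sketch; the divisor n = 3 and the stride are arbitrary illustrative choices, not from the PR):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* With rec = floor(2^32 / n), the estimate floor(x * rec / 2^32)
       should equal floor(x / n) or be exactly one less. */
    uint32_t n = 3;
    uint32_t rec = (uint32_t)((1ull << 32) / n);
    for (uint64_t x = 0; x < (1ull << 32); x += 12345) {
        uint32_t q_est = (uint32_t)((x * rec) >> 32);
        uint32_t q = (uint32_t)(x / n);
        assert(q_est == q || q_est + 1 == q); /* underestimates by at most 1 */
    }
    puts("OK");
    return 0;
}
```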
Sorry I've been a lazy piece of shit. I don't know what's wrong with me. I need to like do stuff. Anyway, hopefully I didn't fuck up that proof, since I'm quite drunk but sort of sobered up in the middle of it. (I've proved it to myself a few times, but I think this is the first time I've been like, oh yeah, that's like a real proof.)
The change made in the "Removed unnecessary divides" commit is also a reminder that we have several checks of the form "0 == a && 0 == b" (and that commit adds one more of those). Checks like this can also be written as "0 == (a | b)", so that the code would typically compile to only one conditional branch instead of two. This is generally beneficial, except where the "0 == a" condition is almost always false (thus, almost always triggering short-circuit evaluation and not reaching the second check).
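For illustration, the transformation in question (a sketch; the function names are made up, not the repository's code):

```c
#include <stdint.h>
#include <stdio.h>

/* Both forms test "a and b are both zero": (a | b) is zero exactly
   when every bit of both operands is zero, so the second form lets
   the compiler emit a single test-and-branch instead of two. */
static int both_zero_two_tests(uint32_t a, uint32_t b)
{
    return 0 == a && 0 == b;
}

static int both_zero_one_test(uint32_t a, uint32_t b)
{
    return 0 == (a | b);
}

int main(void)
{
    printf("%d %d\n", both_zero_two_tests(0, 0), both_zero_one_test(0, 5));
    return 0;
}
```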
When I wrote this I was thinking about that, but it's either public (if it's doing data-independent lookups) or it doesn't matter (if it's doing data-dependent lookups).

This would be better, but it shouldn't matter too much, since this only runs 4 times per pass over memory. Also, I was going for fewer edits.
Right.
This depends on whether you're only thinking in coarse categories such as "side-channel safe vs. unsafe" or also consider real-world nuance within the "unsafe" category. In the real world, different kinds of side-channels might be (un)usable in different circumstances and to different extents. I wouldn't say that with data dependent lookups it completely doesn't matter that we also have data dependent division or data dependent branching. Ideally, in that flavor we'd have only the data dependent lookups and neither of the other two side-channel leaks. So to me this PR would make most sense if it removes a side-channel leak (division) without introducing another (data dependent branching). This can in fact be done with only a little bit more code.
Also moved `instance->lanes` onto the stack and fixed formatting.
Yeah, you're right. OK, I removed the if statement. I kind of want to fix the … Oh right, I found another divide: it's at the end of `argon2_index_alpha` (lines 254 to 255 in 60b31a3).
Another reason to get rid of these divides is the recently disclosed AMD Zen1 DIV bug: https://www.openwall.com/lists/oss-security/2023/09/25/3
I guess with SMT enabled it's possible to infer the other thread's DIV argument ranges anyway due to the instruction's data-dependent timings, but this bug makes it more direct (just trigger and return from #DE in the other hw thread?) and exact (but I'm not sure whether it's maybe only for the quotient or also for the remainder? Someone would need to test).
@Sc00bz FWIW, the remainder may also be computed more directly, not via subtraction, see: openwall/john#5246 (comment) "Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that."
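A minimal sketch of that fractional-part idea, with illustrative names (`x`, `n`, `rec` are assumptions here, and `rec` is a 32-bit fixed-point reciprocal); as the next comment notes, with only 32 fractional bits this does not hold across the full 32-bit dividend range:

```c
#include <stdint.h>

/* frac / 2^32 approximates the fractional part of x / n, so
   frac * n / 2^32 approximates x mod n. NOT exact for all 32-bit
   x and n at this precision (see the next comment). */
static uint32_t rem_via_fraction(uint32_t x, uint32_t n, uint32_t rec)
{
    uint32_t frac = (uint32_t)(x * (uint64_t)rec); /* low 32 bits of x*rec */
    return (uint32_t)(((uint64_t)frac * n) >> 32);
}
```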
So far, I failed to generalize this approach to arbitrary divisors across the full 32-bit dividend range. So I proceeded to test @Sc00bz's approach with subtraction and a fix-up some more. It passes my testing, except for divisor = 1, which this PR's code handles specially anyway (along with all powers of 2). It appears possible to get it to work for divisor = 1 as well by simply changing 0x100000000 to 0xffffffff when computing the reciprocal, so maybe this should be done. I guess it can also be a simpler instruction on some architectures (no 64-bit input). I am testing with this program:

```c
#include <stdio.h>
#include <stdint.h>
int main(void)
{
uint32_t j = 1;
do {
printf("Testing %u\n", j);
uint32_t rec = 0xffffffffU / j;
//uint32_t rec = (1ull << 32) / j;
//uint32_t rec = ((1ull << 32) + j - 1) / j;
//uint32_t rec = ((1ull << 33) + j - 1) / j;
uint32_t i = 0;
do {
//uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
//uint32_t res = (i * rec * (uint64_t)j) >> 32;
uint32_t res = i - ((i * (uint64_t)rec) >> 32) * j;
res += (j & (((uint64_t)res - j) >> 32)) - j;
//uint32_t res = i - ((i * (uint64_t)rec) >> 33) * j;
if (res != i % j) {
printf("%u %% %u = %u got %u\n", i, j, i % j, res);
break;
}
} while (++i);
} while (++j);
return 0;
}
```

Indeed, I can't practically test the full (almost) 64-bit space in this way, but can adjust it to test expected edge cases.
Lemire, Kaser, and Kurz have worked out the details of that approach: https://arxiv.org/abs/1902.01961
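For reference, a sketch of the paper's direct-remainder trick with 64 fractional bits (the names here are made up, and `unsigned __int128` assumes GCC/Clang):

```c
#include <stdint.h>
#include <stdio.h>

/* Lemire-Kaser-Kurz: rec = ceil(2^64 / n); the low 64 bits of
   x * rec are the fractional part of the fixed-point quotient,
   and scaling that by n recovers x mod n exactly for 32-bit x, n. */
static uint32_t fastmod(uint32_t x, uint64_t rec, uint32_t n)
{
    uint64_t frac = rec * x; /* wraps: keeps only the fractional part */
    return (uint32_t)(((unsigned __int128)frac * n) >> 64);
}

int main(void)
{
    uint32_t n = 3;
    uint64_t rec = UINT64_MAX / n + 1; /* ceil(2^64 / n) */
    printf("%u\n", fastmod(1234567890u, rec, n)); /* prints 0 */
    return 0;
}
```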
So I recently found out this is called Barrett reduction (see the Wikipedia article on it). It's more for a*b (mod n) where a<n and b<n, but it can be made to work in more general cases like this. You can think of it as computing an approximate quotient by multiplying by a fixed-point reciprocal and shifting.

But since the reciprocal is an approximation, the quotient is an approximation and is either correct or 1 less. In your test code, you only need to check the dividends right around each multiple of the divisor. Here's my test code. I didn't remove the nice binary numbers for the progress output:

```c
#include <stdio.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
typedef LARGE_INTEGER TIMER_TYPE;
#define TIMER_FUNC(t) QueryPerformanceCounter(&t)
inline double TIMER_DIFF(LARGE_INTEGER s, LARGE_INTEGER e)
{
LARGE_INTEGER f;
QueryPerformanceFrequency(&f);
return ((double) (e.QuadPart - s.QuadPart)) / f.QuadPart;
}
#else
#include <sys/time.h>
typedef struct timeval TIMER_TYPE;
#define TIMER_FUNC(t) gettimeofday(&t, NULL)
#define TIMER_DIFF(s,e) ((e.tv_sec - s.tv_sec) + (e.tv_usec - s.tv_usec) / (double) 1000000.0)
#endif
int test(uint32_t i, uint32_t j, uint32_t rec, uint32_t correctQuotient)
{
uint32_t quotient = (uint32_t) (((uint64_t) i * rec) >> 32) + 1; // correctQuotient or correctQuotient+1
uint32_t remainder = i - quotient * j; // i < quotient * j, otherwise the remainder is correct
uint32_t offByOne = ((int32_t) remainder >> 31); // 0 or 0xffffffff "-1"
quotient += offByOne; // quotient or quotient-1
remainder += j & offByOne; // remainder or remainder+j
if (quotient != correctQuotient)
{
printf("\nError %u / %u = %u got %u\n", i, j, correctQuotient, quotient);
return 1;
}
return 0;
}
int main()
{
TIMER_TYPE s, sFirst, e;
uint32_t rec;
TIMER_FUNC(s);
sFirst = s;
// j = 1
printf("Testing 1 to 1...");
rec = 0xffffffffU / 1;
for (uint32_t i = 1; i != 0; i++)
{
if (test(i, 1, rec, i))
{
return 1;
}
}
for (uint32_t j = 2; j < 0x1000000; j++)
{
if (((j - 1) & j) == 0)
{
TIMER_FUNC(e);
printf(" took %f\nTesting %u to %u...", TIMER_DIFF(s, e), j, 2 * j - 1);
s = e;
}
// rec = (uint32_t) (UINT64_C(0x100000000) / j);
rec = 0xffffffffU / j;
uint32_t correctQuotient = 0;
for (uint32_t nj = j; ; nj += j)
{
// n*j-1 overflowed
if (nj - 1 < nj - j) break;
if (test(nj - 1, j, rec, correctQuotient)) return 1;
correctQuotient++;
// n*j overflowed
if (nj == 0) break;
if (test(nj, j, rec, correctQuotient)) return 1;
}
}
TIMER_FUNC(e);
printf(" took %f\nTotal time %f\n", TIMER_DIFF(s, e), TIMER_DIFF(sFirst, e));
return 0;
}
```

Oh, by the way, with

```c
res += (j & (((uint64_t) res - j) >> 32)) - j;
```

you meant to do a signed right shift:

```c
res += (j & (((int32_t) res - j) >> 31)) - j;
```

OH LOL, you copied that from me... because ref_lane is a uint64_t even though it should be a uint32_t (or I guess size_t) and I didn't want to make more changes than necessary. Either works; it's just that your test code makes what I did look weirder. P.S. Using …
Just realized I should also test the max value.
Oh, thanks! Also Lemire's blog posts: https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ and https://lemire.me/blog/2019/02/20/more-fun-with-fast-remainders-when-the-divisor-is-a-constant/

So for the full 32-bit dividend range and arbitrary (even if not full 32-bit) divisors, we need wider than 32x32->64 multiplies to retain sufficient precision. (For …)

Edit: for those skimming these comments - the above is about my (non-)suggestion to compute the remainder more directly than is done in this PR. The code currently in this PR is likely correct as-is, and should probably be merged. I am not arguing against it.
Codecov Report

Attention: Patch coverage is ….

@@ Coverage Diff @@
## master #306 +/- ##
==========================================
- Coverage 77.34% 77.22% -0.13%
==========================================
Files 9 9
Lines 971 979 +8
==========================================
+ Hits 751 756 +5
- Misses 220 223 +3

View full report in Codecov by Sentry.
Just noticed this still had a modulo for data-independent addressing and fixed it.
Well, that one was non-leaking, and probably even faster than your changed code now, since ….
As to the previous changes here (not including today's added commit), if we'd be making them not only for performance but also for side-channel safety, then should we be making equivalent changes also to …?
Shoot, yeah, I was not thinking. I just searched for "%" and was like "huh, how did I miss that?"
@Sc00bz kindly submitted these changes via a PR to John the Ripper, and I've now run its benchmarks. Note that by "threads" I mean concurrent computation of multiple independent Argon2 instances. Also, in JtR we've moved memory (de)allocation out of the loop (by using the callbacks this implementation kindly provides).
To further confirm how costly the divides were, I tried making these changes to a code version from prior to divide removal:

```diff
+++ b/src/argon2_opt_plug.c
@@ -258,6 +258,16 @@ void argon2_fill_segment(const argon2_instance_t *instance,
/* 1.2.2 Computing the lane of the reference block */
ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+ __asm__ __volatile__("movq %1,%0" : "=g" (pseudo_rand) : "r" (pseudo_rand), "g" (ref_lane));
+ ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+ __asm__ __volatile__("movq %1,%0" : "=g" (pseudo_rand) : "r" (pseudo_rand), "g" (ref_lane));
+ ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+ __asm__ __volatile__("movq %1,%0" : "=g" (pseudo_rand) : "r" (pseudo_rand), "g" (ref_lane));
+ ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+ __asm__ __volatile__("movq %1,%0" : "=g" (pseudo_rand) : "r" (pseudo_rand), "g" (ref_lane));
+ ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+ __asm__ __volatile__("movq %1,%0" : "=g" (pseudo_rand) : "r" (pseudo_rand), "g" (ref_lane));
+ ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
if ((position.pass == 0) && (position.slice == 0)) {
            /* Can not reference other lanes yet */
```

This adds 5 more divides. I measured a reliable slowdown of 7% at 1 and 2 threads on i7-4770K. With the asm lines alone (which were needed to prevent the compiler from optimizing out the redundant repeated divides), there's no measurable slowdown. So almost all of the 7% in this test was in the divides. This means that one divide could cost something like 7/5 = 1.4% or more, if in this test the CPU core manages to run more than one of these in parallel.
Similarly to the above, this causes a 6% slowdown:

```diff
+++ b/src/argon2_core_plug.c
@@ -260,6 +260,21 @@ uint32_t argon2_index_alpha(const argon2_instance_t *instance,
/* 1.2.6. Computing absolute position */
absolute_position = (start_position + relative_position) %
instance->lane_length; /* absolute position */
+ __asm__ __volatile__("movq %1,%0" : "=g" (relative_position) : "r" (relative_position), "g" (absolute_position));
+ absolute_position = (start_position + relative_position) %
+ instance->lane_length; /* absolute position */
+ __asm__ __volatile__("movq %1,%0" : "=g" (relative_position) : "r" (relative_position), "g" (absolute_position));
+ absolute_position = (start_position + relative_position) %
+ instance->lane_length; /* absolute position */
+ __asm__ __volatile__("movq %1,%0" : "=g" (relative_position) : "r" (relative_position), "g" (absolute_position));
+ absolute_position = (start_position + relative_position) %
+ instance->lane_length; /* absolute position */
+ __asm__ __volatile__("movq %1,%0" : "=g" (relative_position) : "r" (relative_position), "g" (absolute_position));
+ absolute_position = (start_position + relative_position) %
+ instance->lane_length; /* absolute position */
+ __asm__ __volatile__("movq %1,%0" : "=g" (relative_position) : "r" (relative_position), "g" (absolute_position));
+ absolute_position = (start_position + relative_position) %
+ instance->lane_length; /* absolute position */
return absolute_position;
}
```

I am surprised the previous test (for lanes) caused so much slowdown, given that I was benchmarking with a lane count of 1. So it looks like the DIV instruction can be this slow even in trivial cases.