Rng::gen_range() is slowest with power-of-2 input sizes? #1145
I was just fiddling around with some RNG stuff when I noticed something that looks pretty odd:

[benchmark chart omitted]

These are time-per-call measurements for `rng.gen_range(0..n)` (you can see the value of `n` at the end of each name written on the vertical axis), sampled using Criterion. You'd think that power-of-2 range sizes would be the fastest to generate, since they can theoretically be done as just a single bitmask, with no rejections. But instead, they are the slowest!

I'm not deeply familiar with RNG algorithms, but I took a quick look through the code, and I wonder if this line from `sample_single_inclusive` is the culprit. Maybe it was intended to use `(range - 1).leading_zeros()` rather than `range.leading_zeros()`? I haven't grokked what the line is doing, but that would be something that affects exactly power-of-2 sizes differently than their neighbors.

To test this, I confirmed that `Uniform::new(0, n).sample(&mut rng)` does not have this pattern, and in fact is faster than `rng.gen_range(0..n)` on my computer in almost every case I tested (but especially much faster on power-of-2 sizes):

[benchmark chart omitted]

Is it possible that the `sample_single_inclusive` code is out of date compared to optimizations that have been made more recently in `sample_inclusive`?
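For concreteness, the two call patterns being compared look roughly like this (a minimal sketch assuming rand 0.8 and `n > 0`, not the crate's exact internals):

```rust
use rand::distributions::{Distribution, Uniform};
use rand::Rng;

fn compare(rng: &mut impl Rng, n: u32) -> (u32, u32) {
    // One-off path: builds its sampling parameters on every call
    // (internally via the sample_single family).
    let a = rng.gen_range(0..n);

    // Reusable-distribution path: precomputes the rejection zone once,
    // then samples cheaply.
    let dist = Uniform::new(0, n);
    let b = dist.sample(rng);

    (a, b)
}
```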
That's a keen observation, thanks! Definitely something we need to investigate. The results you are showing are from the benchmarks in the repository you linked? I would like to run them on my machine as well. It almost looks like […]

(About your experiments with generating only as many bits as necessary: this approach is indeed popular in some academic publications. We also have some discussion in #1014 and #1055. The problem is that it is tricky to make fast without compromising performance when generating […])
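For context, the "only as many bits as necessary" idea looks roughly like this (a sketch of the general technique, not rand's implementation): mask the raw output down to the smallest power of two covering the range, and reject anything outside it.

```rust
use rand::Rng;

/// Uniform sample from 0..=high via mask-and-reject.
/// A sketch of the bitmask technique, not rand's code.
fn bitmask_sample(rng: &mut impl Rng, high: u32) -> u32 {
    if high == 0 {
        return 0;
    }
    // Smallest all-ones mask that covers `high`.
    let mask = u32::MAX >> high.leading_zeros();
    loop {
        let v = rng.gen::<u32>() & mask;
        // When high == 2^k - 1 the mask equals high and nothing
        // is ever rejected -- exactly the power-of-2 case from
        // the plots above.
        if v <= high {
            return v;
        }
    }
}
```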
Yeah, you should be able to just pull and run […]. Reading those issues now, I bet they'll be interesting!
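(For anyone reproducing this, the benches have roughly this Criterion shape. This is a hypothetical sketch, not the actual bench file; the RNG choice and the `n` values are placeholders:)

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use rand::Rng;

fn bench_gen_range(c: &mut Criterion) {
    let mut rng = rand::thread_rng();
    let mut group = c.benchmark_group("gen_range");
    // Sizes straddling a power of two, so the slow case stands out.
    for n in [511u32, 512, 513] {
        group.bench_with_input(BenchmarkId::from_parameter(n), &n, |b, &n| {
            b.iter(|| rng.gen_range(0..n))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_gen_range);
criterion_main!(benches);
```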
Thanks! The results on my machine are consistent with yours.
This is an artifact of the `leading_zeros` threshold approximation. I believe computing the zone from `(range - 1).leading_zeros()`, as in the test below, should fix it. It passes a 16-bit bias test:

```rust
use rayon::prelude::*;

(1..=u16::max_value()).into_par_iter().for_each(|range| {
    // Proposed fix: derive the zone from (range - 1).leading_zeros().
    // wrapping_shl avoids a debug-mode shift panic when range == 1.
    let zone = range
        .wrapping_shl((range - 1).leading_zeros())
        .wrapping_sub(1);
    let mut map = vec![0u32; range as usize];
    for r in 0..=u16::max_value() {
        // Widening multiply: the high half is the candidate sample,
        // the low half decides acceptance.
        let m = r as u32 * range as u32;
        let h = (m >> 16) as u16;
        let l = m as u16;
        if l <= zone {
            map[h as usize] += 1;
        }
    }
    // Accepted counts must be uniform across every value in the range.
    assert!(map.iter().all(|&x| x == map[0]), "{:?}", range);
});
```
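(For readers following along: the zone above plugs into the widening-multiply rejection loop used for single samples, roughly like this. A simplified 16-bit sketch of the technique, not rand's exact code:)

```rust
use rand::Rng;

/// Uniform sample from low..=low + range - 1 (range >= 1) using a
/// widening multiply with an approximate rejection zone.
fn sample_single(rng: &mut impl Rng, low: u16, range: u16) -> u16 {
    let zone = range
        .wrapping_shl((range - 1).leading_zeros())
        .wrapping_sub(1);
    loop {
        let v: u16 = rng.gen();
        let m = v as u32 * range as u32;
        // Accept iff the low half lands inside the zone; the high half
        // is then an unbiased sample in 0..range.
        if (m as u16) <= zone {
            return low.wrapping_add((m >> 16) as u16);
        }
    }
}
```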
Hang on, it makes others worse. Like […]
Note also that O'Neill found that shortcutting the modulo-wide-mul method is actually fastest. See "All-Ranges Benchmark Results" for "Debiased Int Mult (t-opt/m-opt)" vs "Bitmask" (which should have similar speed to our method, perhaps even faster thanks to no multiplication).
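(The shortcut referenced there looks roughly like this. A sketch of Lemire's debiased integer multiplication as benchmarked by O'Neill, not rand's current code: the expensive modulo only runs when the low half is small enough to possibly bias the result:)

```rust
use rand::Rng;

/// Unbiased sample from 0..range (range >= 1) via widening multiply,
/// with the threshold shortcut: the modulo is skipped on the common path.
fn debiased_mult(rng: &mut impl Rng, range: u32) -> u32 {
    let mut m = rng.gen::<u32>() as u64 * range as u64;
    if (m as u32) < range {
        // Exact rejection threshold: 2^32 mod range, computed without
        // 64-bit arithmetic as (-range) mod range.
        let threshold = range.wrapping_neg() % range;
        while (m as u32) < threshold {
            m = rng.gen::<u32>() as u64 * range as u64;
        }
    }
    (m >> 32) as u32
}
```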
So, would @TheIronBorn like to make a PR? |
Sure, I'll look into it.
I ran @elidupree's benchmark for the new RNGs evaluated in #1196. Summary: all Canon variants do not display poor behaviour except for […]