Future-proofing 24/53bit precision for f32/f64 generation #416

sbarral · 2018-04-24T04:50:18Z

I know this is a contentious issue, sorry for bringing it again... Like others, I feel a bit uneasy with the choice to promote a Standard distribution that samples within the open (0, 1) interval. As a matter of fact, I am a bit warry of promoting any "one true way" to produce FP values.

Now I do agree that a fully open interval is adequate in most applications. OTOH, I have never seen a practical situation where a half-open interval would not be adequate too: it is often important to avoid singularity at one of the bound, typically with log(u)-type transforms, but I suspect that the need to avoid both 0 and 1 is very rare.

The main reason I dislike the (0, 1) open interval as the default choice, though, is that it implicitly bakes into the API a defficiency of the current transmute-based FP generation algorithm, namely the truncation to 23 bits (resp. 52 bits for f64) precision, instead of the 24/53 bits that one may expect based on the FP significand precision.
The problem is that, unlike with half-open [0, 1) or (0, 1] intervals, the generation of FP values in the (0, 1) open interval becomes AFAIK prohibitively expensive with 24/53 bits due to the need for a rejection step. So in a way, the current Standard distribution would become a commitment to an implementation detail of the current FP generation method.

For this reason, if it is really deemed important to promote a default FP generation distribution rather than just define e.g. Open01, OpenClosed01 and ClosedOpen01, I would then favor the widely-adopted convention of sampling within [0, 1) because (i) it is also adequate for most situations and (ii) it leaves open the possibility (today, or in the future) to efficiently generate FP values with a 24/53-bit precision.

Regarding point (ii) above, I made some preliminary investigations to assess the computational advantage of the current 52-bit transmute-based method over two 53-bit FP generation methods. The following u64->f64 conversion methods were benchmarked:

// The 52-bit transmute-based method we now use.
fn transmute52(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE: u64 = 1023 << FRACTION_WIDTH; // = transmute::<f64, u64>(1.0)
    
    let fraction: u64 = r >> (64 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE) };
    
    u - 1.0
}

// The conventional 53-bit conversion method.
fn direct53(r: u64) -> f64 {
    const PRECISION: i32 = 53;
    const SCALE: f64 = 1.0/((1u64 << PRECISION) as f64);
    
    SCALE * (r >> (64 - PRECISION)) as f64
}

// Home-grown, branch-less 53-bit transmute-based method -- faster methods may exist.
fn transmute53(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE_HALF: u64 = 1022 << FRACTION_WIDTH; // = transmute::<f64, u64>(0.5)
    const RIGHT_MASK: u64 = (1u64 << 63) - 1u64; // all bits set except MSB
    
    // Create a 000...00 or a 111...11 mask depending on the MSB.
    // The commented alternative is slightly slower but functionally equivalent to
    // direct53(), whereas the uncommented version inverts the role of the MSB.  
    let compl_mask: u64 = (r as i64 >> 63) as u64;
    //let compl_mask: u64 = (!r as i64 >> 63) as u64; 
    let fraction: u64 = (r & RIGHT_MASK) >> (63 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE_HALF) };
    // `c` is 0.0 or 0.5 depending on the MSB.
    let c = unsafe { mem::transmute::<u64, f64>(compl_mask & ONE_HALF) };
    
    u - c
}

The benchmark performs a simple sum of a large number of FP values produced by one of the above conversion function fed by the 64-bit output of XorShiftRng.
As always, the benchmark needs to be taken with a big grain of salt, especially that the methods are normally inlined so a lot depends on the actual code context. In order to assess robustness towards inlining, the benchmark was ran a second time with the above functions marked with #[inline(never)]. Also, I did not try any other CPU than mine (i5-7200). With this caveat, here are the computation times:

Method	`#[inline(always)]`	`#[inline(never)]`
`XorShiftRng` + `transmute52`	t0	t1 [1.77 × t0]
`XorShiftRng` + `direct53`	1.00 × t0	1.19 × t1
`XorShiftRng` + `transmute53`	1.14 × t0	0.99 × t1

Two surprises:

the inlined direct 53 bit method was exactly as fast as the transmute method,
the non-inlined version of transmute53 was very slightly but consistently faster, which I surmise is because c has a 50% chance to be 0.0 so u - c can evaluate fast.

In any case, these results seem to strongly question the purported advantage of the 52-bit transmute-based method, at least for modern CPUs. And even for old CPUs, I would expect the transmute53 version to be reasonably close to transmute52.

The text was updated successfully, but these errors were encountered:

pitdicker · 2018-04-24T07:56:34Z

@sbarral thank you for opening the issue.

It is a bit embarrasing to say, but I/we never noticed that a multiply-based approach can produce floating point values with one more bit of precision than the transmute-based approach. And that with all the discussion there has been around the conversion step...

As you maybe know, we had a problem of an exploding number of options to convert to a 0..1 range. With the transmute approach [0, 1), (0, 1) and (0, 1] have the same performance. With all things being equal, picking an open range as default made the most sense.

Options we have have/had until now:

transmute-based approach: 52 bits of precision
- [0, 1) natively
- (0, 1) natively
- (0, 1] natively
multiply-base approach: 53 bits of precision
- [0, 1) natively
- (0, 1) with rejection sampling
- (0, 1] with addition of EPSILON/2
~~32/64-bits precision~~: (scrapped as not worth it)
- (0, 1] natively
- [0, 1) with small non-uniformity near power of two boundaries
- (0, 1) with rejection sampling
full/high precision:
- [0, 1] natively
- [0, 1) with small non-uniformity near power of two boundaries
- (0, 1) with rejection sampling

I suppose the extra bit of precision that the multiply-method offers is an extra argument for bringing back something like ClosedOpen01. Besides helping with existing code, and to have a better change of replicating results from other libraries/languages. ClosedOpen01 would then is not only have the tiny difference of what happens at the bounds, but also that it uses a different method and provides one more bit of precision.

I do believe having a Standard distribution is good to have, even if it does not fill all needs. We provide something for all types. But you maybe mean that we shouldn't offer only one way, but offer multiple solutions.

Your transmute53 is quite clever, I would need to take a bit more time to really understand it. I liked playing with the different conversion methods, but am not really jumping at the moment to see another one... I think we should strive for the methods to get inlined, and then it does not really have benefit over direct53? But we would not have the advantage of reproducibility with other libraries.

I remember from my own benchmarks that the transmute method (with bitmasks and floating subtraction) wasfaster than the multiplication method (integer to float conversion and float multiplication). But I don't have things around to confirm at the moment, maybe it is also just CPU dependend.

@dhardy, @tspiteri What do you think of eventually having this selection:

Standard: (0, 1); 52 bits of precision (the number of mantissa bits)
ClosedOpen01: [0, 1); 53 bits of precision (i.e. including implicit bit)
Closed01: [0, 1]; as many bits of precision as can be represented

sbarral · 2018-04-24T09:03:29Z

Regarding direct53 vs transmute53: I would not draw conclusions on that single benchmark, more benchmarks on more architectures would be needed to make an informed decision. I do not consider differences within +/-20% to be meaningful, especially on a single benchmark.
I did not look at the inlined assembly (can't check it now) for direct53 but I would not be surprised if the compiler had used FMA to merge the multiply with the sum computed by the benchmark. In any case a lot depends on the inlining context, which is why I actually trust more the non-inlined benchmark to estimate robustness, and why I think transmute53 may turn out to be more robust across architectures and across different inlining contexts.

Just to be clear though, I didn't really mean to push for an immediate move away from the current transmute-based method, I merely meant to suggest that the computational efficiency of the 52-bit transmute based method should not be used as an argument in the discussion around Standard, even if it were a bit more (pun unintended) efficient today.

So the question for me is: assuming that the computational costs are comparable, should we prefer 52-bit (0,1) sampling or 53-bit [0,1) sampling? To me the latter seems like "the right thing to do" because 53 bits is what a user familiar with IEEE (but unfamiliar with our implementation details) would expect.

Regarding naming, I think that the naming should suggest the accuracy/complexity trade-off, so I would prefer HighPrecisionClosed01 to Closed01. And I would of course prefer Open01 to Standard... 😉

dhardy · 2018-04-24T09:45:48Z

rng.gen() is based on Standard

@vks already advocated not supporting any FP sampling in Standard and only having explicitly named distributions. I'm not really sure what the best option is.

If we have [0, 1) and (0, 1] half-open distributions, what would we name them?

vks · 2018-04-24T11:14:54Z

If we have [0, 1) and (0, 1] half-open distributions, what would we name them?

I suggest ClosedOpen01 and OpenClosed01.

sbarral · 2018-04-24T11:53:09Z

I suggest ClosedOpen01 and OpenClosed01.

Seconded, this looks unambiguous to me.

pitdicker · 2018-04-24T13:31:02Z

@vks already advocated not supporting any FP sampling in Standard and only having explicitly named distributions. I'm not really sure what the best option is.

Was removing the Rng::next_f* not agreed to be acceptable only because using gen::<f32>() was an easy alternative? Removing Standard for floating point would make the breakage worse. And it would also make the breakage because of removing the Rand trait worse. And didn't we drop the design with a separate float distribution (Uniform01) and a standard distribution (then called Default) because it didn't add much?

We are making quite a lot of braking changes the past couple of weeks. I hope that with the 0.5 release we can make a couple of things stable and say they are going to need very good reasons that have not been brought up before to be up for discussion.

I merely meant to suggest that the computational efficiency of the 52-bit transmute based method should not be used as an argument in the discussion around Standard, even if it were a bit more (pun unintended) efficient today.

Yes, definitely the choice should not solely be based on performance. But it is not something to ignore either. For example, we choose to not go with the high precision conversion because of performance.

So the question for me is: assuming that the computational costs are comparable, should we prefer 52-bit (0,1) sampling or 53-bit [0,1) sampling? To me the latter seems like "the right thing to do" because 53 bits is what a user familiar with IEEE (but unfamiliar with our implementation details) would expect.

I think only the high precision variant makes sense to someone who doesn't know the details. If you know some of the details and are familiar with IEEE, both 52-bit and 53-bit seems reasonable, it is the difference the implicit bit makes.

So I still think the practical concerns matter most: that the guarantee to not return some value is more interesting than that it might be generated.

dhardy · 2018-04-24T13:59:25Z

If we have several distributions like HighPrecisionClosedOpen01 then we need to think about convenient usage, because writing rng.sample(HighPrecisionClosedOpen01) everywhere is neither pretty nor easy to type.

We could try adding several convenience functions to Rng, but it gets a bit repetitive:

gen01_co
gen01_oc
gen01_cc_hp
...

Or we could potentially add a function like this (which would normally be inlined), though it would still be cumbersome to specify all parameters:

fn gen01<T>(&mut self, type: RangeType, precision: Precision) -> T {
    match (type, precision) {
        (RangeType::ClosedOpen, Precision::Standard) => self.sample(Range::new(0.0, 1.0)),
        (RangeType::ClosedOpen, Precision::Full) => self.sample(HighPrecision01),
        ...
    }
}

@pitdicker Whether or not Standard will still be able to sample floats is another question, but one we could not say no to without a convenient alternative.

clarfonthey · 2018-04-25T20:09:57Z

Why not make an enum for Open/Closed, one for the various precision types, and use those to generate all the options? I agree that rng.sample(HighPrecisionClosedOpen01) but rng.gen01(Closed, Open, High) is a lot nicer, and doesn't "bless" a particular interval.

Alternatively, I think gen01_precise and gen01 is also fine to avoid the second enum.

dhardy · 2018-04-26T10:06:14Z

Hmm, do we want to implement all 8 variants though? I suppose we can try and just use unimplemented!() if any don't work out.

There's also the question of how this relates to the Uniform range; as @pitdicker pointed out getting that right with every open/closed variant is hard.

dhardy · 2018-04-26T10:08:09Z

Since there's some pressure to get a 0.5 release out soon, we could also revert to the old behaviour here (gen() outputs [0, 1) and keep open/closed distributions), then not adjust this further until 0.6.

pitdicker · 2018-04-27T12:44:28Z

do we want to implement all 8 variants though?

I can imagine quite a few more variants than 8 😄.

But we really should make choices. Offering 8 ways to do basically the same thing is going to make Rand hard to use, without bringing many advantages. And it probably is not the most preferable choice, but as Distribution is a trait different conversation methods can live in an external crate just as well.

We also make choices when it comes to which RNGs to pick, with which constants, sizes etc. And we make choice in which algorithms to use certain distributions. Just have to make a choice that is somewhat sane...

I think I can agree to make [0, 1) the default again. In a way it is the most 'natural'. The RNG produces values in powers of two. An open distribution needs 2^n - 1 values, and a closed distribution 2^n + 1.

The only reason we can currently fit 2^n values in an open range is because it is shifted in such a way the interval between 0.0 and the first values is half that of the interval between all other values. It is a nice trick, but maybe a bit too weird to set as default.

Because our Closed distribution was rescaling the range and in a way just creating (smaller) gaps in other places, it was a bit dubious. I don't mind seeing it gone.

What do you think of this selection:

Standard: [0, 1); 53 bits of precision, half-open range, using multiply method.
Open01: (0, 1); 52 bits of precision, open range, faster, using transmute method.
HighPrecision01: [0, 1]; as many bits of precision as can be represented, closed range, slower.

dhardy · 2018-04-27T15:26:06Z

Standard as [0, 1) with 53-bits of precision makes sense, so long as this is close to current speed
Open01 as mentioned makes sense, except I don't know how often this is actually needed
OpenClosed01 as (0, 1] with 53-bits of precision would be a nice addition
Lets get back to the high-precision stuff later? The latest iteration is [0, 1) because of reasons that seem a reasonable choice to me.

sbarral · 2018-04-27T18:11:54Z

Just wondering: wouldn't it be possible, whichever the choice of Standard, to provide all of Open01, ClosedOpen01, OpenCLosed01 (and possibly HighPrecision01), i.e. to expose one "redundant" distribution?

My thinking is that it would somewhat release the pressure to keep behavioral backward compatibility for Standard in the future.
I expect that people using these distributions as low-level primitives to build other distributions will not mind trading the convenience of Standard for a guaranty of backward compatibility. So Standard could be accompanied by a disclaimer encouraging the use of its explicit sibling (say ClosedOpen01) for future-proofing.

dhardy · 2018-04-27T18:59:53Z

Totally possible; I did something similar before (with Standard and ClosedOpen01 in fact, but by different names). It got dropped but I'd happily see this reintroduced.

clarfonthey · 2018-04-27T23:50:20Z

Regarding the eight combinations, it'd only be six because OpenClosed can easily be converted to ClosedOpen. But I agree it'd be a bit odd in hindsight.

dhardy · 2018-05-08T07:26:37Z

The main issue (change back to [0, 1)) has been addressed. Is there anything else we need to do to close this issue?

dhardy mentioned this issue Apr 25, 2018

ChaCha: support reading counter and setting nonce #374

Merged

dhardy mentioned this issue Apr 27, 2018

Tracker: planned changes for 0.5 #232

Closed

33 tasks

This was referenced Apr 29, 2018

Switch the Standard distribution for floats to [0, 1) #420

Merged

possible issue with u64->f64 conversion in rand/float.rs #426

Closed

dhardy closed this as completed May 17, 2018

dhardy mentioned this issue Jun 5, 2018

Compromising accuracy and speed [distributions] #494

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Future-proofing 24/53bit precision for f32/f64 generation #416

Future-proofing 24/53bit precision for f32/f64 generation #416

sbarral commented Apr 24, 2018

pitdicker commented Apr 24, 2018 •

edited

Loading

Uh oh!

sbarral commented Apr 24, 2018 •

edited

Loading

Uh oh!

dhardy commented Apr 24, 2018

Uh oh!

vks commented Apr 24, 2018

Uh oh!

sbarral commented Apr 24, 2018

Uh oh!

pitdicker commented Apr 24, 2018

Uh oh!

dhardy commented Apr 24, 2018

Uh oh!

clarfonthey commented Apr 25, 2018 •

edited

Loading

Uh oh!

dhardy commented Apr 26, 2018

Uh oh!

dhardy commented Apr 26, 2018

Uh oh!

pitdicker commented Apr 27, 2018 •

edited

Loading

Uh oh!

dhardy commented Apr 27, 2018 •

edited

Loading

Uh oh!

sbarral commented Apr 27, 2018

Uh oh!

dhardy commented Apr 27, 2018

Uh oh!

clarfonthey commented Apr 27, 2018

Uh oh!

dhardy commented May 8, 2018

Uh oh!

Uh oh!

Future-proofing 24/53bit precision for f32/f64 generation #416

Future-proofing 24/53bit precision for f32/f64 generation #416

Comments

sbarral commented Apr 24, 2018

pitdicker commented Apr 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbarral commented Apr 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dhardy commented Apr 24, 2018

Uh oh!

vks commented Apr 24, 2018

Uh oh!

sbarral commented Apr 24, 2018

Uh oh!

pitdicker commented Apr 24, 2018

Uh oh!

dhardy commented Apr 24, 2018

Uh oh!

clarfonthey commented Apr 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dhardy commented Apr 26, 2018

Uh oh!

dhardy commented Apr 26, 2018

Uh oh!

pitdicker commented Apr 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dhardy commented Apr 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbarral commented Apr 27, 2018

Uh oh!

dhardy commented Apr 27, 2018

Uh oh!

clarfonthey commented Apr 27, 2018

Uh oh!

dhardy commented May 8, 2018

Uh oh!

pitdicker commented Apr 24, 2018 •

edited

Loading

sbarral commented Apr 24, 2018 •

edited

Loading

clarfonthey commented Apr 25, 2018 •

edited

Loading

pitdicker commented Apr 27, 2018 •

edited

Loading

dhardy commented Apr 27, 2018 •

edited

Loading