str and [u8] should hash the same #27108
This makes using a Tendril as a HashMap key a little more palatable. Sadly, str and [u8] hash differently at present, so we shouldn’t implement Borrow<str> for StrTendril. An alternative would be making the Hash implementations for Tendril<F> manual and making Tendril<UTF8> use the str Hash implementation and the rest use the [u8] one. See also rust-lang/rust#27108 which deals with fixing the underlying problem of the differing Hash implementations.
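For context on why the hashing mismatch blocks `Borrow<str>`: `HashMap` lookups through `Borrow` assume the borrowed and owned forms hash identically. Here is a minimal sketch of the failure mode; the `ByteString` type is a hypothetical stand-in for `StrTendril`, not from the thread:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for StrTendril: owns UTF-8 bytes but hashes as [u8].
struct ByteString(Vec<u8>);

impl Hash for ByteString {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state) // length-prefixed, like [u8]
    }
}

impl PartialEq for ByteString {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
}
impl Eq for ByteString {}

impl Borrow<str> for ByteString {
    fn borrow(&self) -> &str {
        std::str::from_utf8(&self.0).unwrap()
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(ByteString(b"key".to_vec()), 1);
    // The lookup hashes "key" as str (0xFF-terminated), but the stored key
    // was hashed as [u8] (length-prefixed), so the entry is almost
    // certainly missed.
    println!("{:?}", map.get("key")); // almost certainly None
}
```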
I'm all for this change as long as it's benchmarked to not cause a noticeable regression. (I agree with your assessment that it shouldn't)
Hashing a string of length n means hashing n + 1 bytes before this change and n + 8 bytes after. SipHash adds one byte and hashes in 8-byte blocks, so very short strings are bumped up to the next block size, using 2 * 2 + 4 = 8 SipHash rounds vs. the previous 2 * 1 + 4 = 6. I'm sure you can measure the difference. Also, this is a change to what the …
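To make those sizes concrete, here is a small sketch (not from the thread) using a hasher that merely counts the bytes fed to it; the exact numbers assume a 64-bit platform and the standard library's default forwarding of `write_usize`/`write_u8` to `write`:

```rust
use std::hash::{Hash, Hasher};

// Toy hasher that only counts how many bytes the Hash impls feed it.
struct CountingHasher {
    bytes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes += bytes.len();
    }
    fn finish(&self) -> u64 {
        self.bytes as u64
    }
}

fn bytes_fed<T: Hash + ?Sized>(value: &T) -> usize {
    let mut h = CountingHasher { bytes: 0 };
    value.hash(&mut h);
    h.bytes
}

fn main() {
    let s = "hello";
    // str: the contents plus a single 0xFF terminator byte.
    assert_eq!(bytes_fed(s), s.len() + 1);
    // [u8]: a usize length prefix (8 bytes on 64-bit) plus the contents.
    assert_eq!(bytes_fed(s.as_bytes()), s.len() + 8);
}
```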
This would be optimized if rust-lang/rfcs#1666 is accepted.
@arthurprs I don’t see how rust-lang/rfcs#1666 would have any bearing on this. All this should need is:

```diff
diff --git a/src/libcore/hash/mod.rs b/src/libcore/hash/mod.rs
index 051eb97..85427d0 100644
--- a/src/libcore/hash/mod.rs
+++ b/src/libcore/hash/mod.rs
@@ -328,8 +328,7 @@ mod impls {
     #[stable(feature = "rust1", since = "1.0.0")]
     impl Hash for str {
         fn hash<H: Hasher>(&self, state: &mut H) {
-            state.write(self.as_bytes());
-            state.write_u8(0xff)
+            self.as_bytes().hash(state)
         }
     }
```
Yes, but `&[u8].hash()` is internally the usize length prefix followed by the raw bytes (sketched below). And that …
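For reference, that slice behaviour can be reproduced by hand. A sketch, assuming `DefaultHasher`'s integer writes forward to `write` with native-endian bytes, as the `Hasher` trait defaults do:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    let data: &[u8] = b"hello";

    // What <[u8] as Hash>::hash does, spelled out: a usize length
    // prefix, then the raw bytes.
    let mut manual = DefaultHasher::new();
    manual.write_usize(data.len());
    manual.write(data);

    let mut via_impl = DefaultHasher::new();
    data.hash(&mut via_impl);

    assert_eq!(manual.finish(), via_impl.finish());
}
```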
I agree that it would be desirable for `str` and `[u8]` to hash the same way.
We can always make `&[u8]` hash to be similar to `&str` (appending 0xFF) and not the other way around if any performance regression is a concern. It's purely an implementation detail.
My preliminary benchmarking of hashing … Here’s one of my tests:

```rust
#![feature(test)]
extern crate test;

use std::collections::hash_map::DefaultHasher;
use std::hash::Hash;
use test::Bencher;

#[bench]
fn hash_str(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_str!("x.rs").hash(&mut hasher);
    });
}

#[bench]
fn hash_bytes(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_bytes!("x.rs").hash(&mut hasher);
    });
}
```

I had been thinking of specializing the …
@arthurprs I presume the length thing is to help with types with few possibilities, e.g. zero-sized types, so that a two-element …
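A sketch of that point about zero-sized types (my example, not from the thread): the elements contribute no bytes of their own, so only the length prefix keeps differently sized slices from hashing alike.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Zero-sized elements contribute no bytes of their own, so without
    // the length prefix both slices would feed the hasher nothing at all.
    let two: &[()] = &[(), ()];
    let three: &[()] = &[(), (), ()];
    assert_ne!(hash_of(two), hash_of(three));
}
```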
There's definitely a difference, but you need to use very small sequences, from 0 to 17 bytes. It'll also differ among hashers depending on the internal block size. I have plenty of test code around; I'll compile a test for this in a bit. @chris-morgan Precisely, they use different techniques to achieve the same thing (prefixing the usize len / appending a 0xFF).
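Such a test could look something like the following sketch (my guess at the shape, reusing the benchmark style from earlier in the thread): it hashes every prefix of a 17-byte string, covering the small-sequence range mentioned above.

```rust
#![feature(test)]
extern crate test;

use std::collections::hash_map::DefaultHasher;
use std::hash::Hash;
use test::{black_box, Bencher};

// Hash every prefix of a 17-byte string, covering the 0-to-17-byte
// range where the extra length-prefix bytes should matter most.
#[bench]
fn hash_short_strs(b: &mut Bencher) {
    let s = "abcdefghijklmnopq"; // 17 bytes
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        for len in 0..=s.len() {
            black_box(&s[..len]).hash(&mut hasher);
        }
    });
}
```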
FNV and other byte-at-a-time hashers will take a sizeable hit. SipHash has an internal block size of 8 and is more expensive overall, so it's less susceptible to the change.
Another real example of needing the length: tuples. They use no field separator themselves. We need to avoid hash-DoS with pairs of slices too.
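A quick illustration of the tuple point (my example, not from the thread): with no field separator, it is the slices' own length prefixes that keep these two pairs from feeding the hasher the same byte stream.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // Both pairs concatenate to "abc"; only the per-field length
    // prefixes keep their hash inputs distinct.
    let a: (&[u8], &[u8]) = (b"ab", b"c");
    let b: (&[u8], &[u8]) = (b"a", b"bc");

    let mut ha = DefaultHasher::new();
    a.hash(&mut ha);
    let mut hb = DefaultHasher::new();
    b.hash(&mut hb);
    assert_ne!(ha.finish(), hb.finish());
}
```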
We could make `[u8]` append a 0xFF like `str`; that would essentially guarantee no perf regressions. In theory this could be a breaking change, though.
I return to what I suggested: specialisation of the implementation on `[u8]`. I suppose it would also allow others to perform the same specialisation on their own integer newtype arrays if they really cared about it. I will also mention that … @arthurprs You’d need to do that for all …
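A hedged sketch of what that specialisation might look like. Since std's own `impl Hash for [T]` cannot be re-opened from outside the standard library, this uses a local stand-in trait, and it assumes the unstable (nightly-only) `specialization` feature:

```rust
#![feature(specialization)] // nightly-only, incomplete feature

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Local stand-in trait; std's `impl Hash for [T]` cannot be re-opened here.
trait MyHash {
    fn my_hash<H: Hasher>(&self, state: &mut H);
}

impl<T: Hash> MyHash for [T] {
    // Generic slices: usize length prefix, then the elements.
    default fn my_hash<H: Hasher>(&self, state: &mut H) {
        state.write_usize(self.len());
        for item in self {
            item.hash(state);
        }
    }
}

impl MyHash for [u8] {
    // Specialised byte slices: raw bytes plus a 0xFF terminator,
    // matching how str hashes today.
    fn my_hash<H: Hasher>(&self, state: &mut H) {
        state.write(self);
        state.write_u8(0xff);
    }
}

fn main() {
    let bytes: &[u8] = b"abc";
    let mut h = DefaultHasher::new();
    bytes.my_hash(&mut h); // takes the specialised [u8] path
    println!("{:x}", h.finish());
}
```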
I may be missing something here. Isn't that already done? Line 559 in b32267f, and the len prefix comes from Line 647 in b32267f.
@arthurprs I’m talking about the implementation of …
I feel we are talking about the same thing, though. What are you proposing exactly?
I was proposing specializing the actual … OK, so …
Exactly.
@arthurprs Appending 0xFF to `[u8]` is not protecting against manufactured collisions. Example:

```rust
use std::collections::HashSet;

fn main() {
    // The 257 (&[u8], &[u8]) tuples here all hash the same way
    // if hashing uses 0xFF-termination instead of a length prefix.
    let data = [0xFFu8; 256];
    let mut map = HashSet::new();
    for i in 0..data.len() + 1 {
        map.insert(data.split_at(i));
    }
}
```

The problem with this is that it gives a hash-function-independent way of generating arbitrarily many collisions, so it's the hash-DoS problem, which we want to protect against by default, or at least when using the default hasher.
My understanding is that the len prefix (slices) and 0xFF suffix (str) is added to prevent stuff like …
That's the same as my split example. It's crucial that the 0xFF byte never appears in a str's representation. And it doesn't, by the UTF-8 invariant.
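That invariant is easy to check (my example, not from the thread): 0xFF can never occur in well-formed UTF-8, since ASCII bytes are at most 0x7F, leading bytes at most 0xF4, and continuation bytes at most 0xBF.

```rust
fn main() {
    // 0xFF is never a valid byte in UTF-8: ASCII is at most 0x7F,
    // leading bytes at most 0xF4, continuation bytes at most 0xBF.
    assert!(std::str::from_utf8(&[0xFF]).is_err());
    assert!("héllo".as_bytes().iter().all(|&b| b != 0xFF));
}
```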
Triage: not aware of any changes here
`str` and `[u8]` hash differently and have done since the solution to #5257.

This is inconvenient in that it leads to things like the `StrTendril` type (from the `tendril` crate) hashing like a `[u8]` rather than like a `str`; one is thus unable to happily implement `Borrow<str>` on it in order to make `HashMap<StrTendril, _>` palatable.

`[u8]` gets its length prepended, while `str` gets 0xff appended. Sure, one u8 rather than one usize is theoretically cheaper to hash; but marginally so only, marginally so. I see no good reason why they should use different techniques, and so I suggest that `str` should be changed to use the hashing technique of `[u8]`. This will prevent potential nasty surprises and bring us back closer to the blissful land of “str is just [u8] with a UTF-8 guarantee”.

Hashes being internal matters, I do not believe this should be considered a breaking change, but it would still probably be a release-notes-worthy change as it could conceivably break eldritch codes.
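A quick demonstration of the current divergence (my example, not from the thread), using the deterministic `DefaultHasher::new()`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    let s = "abc";
    // Same bytes, different Hash impls: str appends 0xFF while [u8]
    // prepends its length, so the hasher sees different input streams.
    assert_ne!(hash_of(s), hash_of(s.as_bytes()));
}
```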