str and [u8] should hash the same #27108
This makes using a Tendril as a HashMap key a little more palatable. Sadly, str and [u8] hash differently at present, so we shouldn’t implement Borrow<str> for StrTendril. An alternative would be making the Hash implementations for Tendril<F> manual and making Tendril<UTF8> use the str Hash implementation and the rest use the [u8] one. See also rust-lang/rust#27108 which deals with fixing the underlying problem of the differing Hash implementations.
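For context on why the hashing mismatch blocks `Borrow<str>`: `HashMap` lookups through `Borrow` assume the borrowed and owned forms hash identically. Here is a minimal sketch of the failure mode; the `ByteString` type is a hypothetical stand-in for `StrTendril`, not from the thread:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for StrTendril: owns UTF-8 bytes but hashes as [u8].
struct ByteString(Vec<u8>);

impl Hash for ByteString {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state) // length-prefixed, like [u8]
    }
}

impl PartialEq for ByteString {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
}
impl Eq for ByteString {}

impl Borrow<str> for ByteString {
    fn borrow(&self) -> &str {
        std::str::from_utf8(&self.0).unwrap()
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(ByteString(b"key".to_vec()), 1);
    // The lookup hashes "key" as str (0xFF-terminated), but the stored key
    // was hashed as [u8] (length-prefixed), so the entry is almost
    // certainly missed.
    println!("{:?}", map.get("key")); // almost certainly None
}
```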
I'm all for this change as long as it's benchmarked to not cause a noticeable regression. (I agree with your assessment that it shouldn't)
Hashing a string of length n means hashing n + 1 bytes before this change and n + 8 bytes after. SipHash adds one byte and hashes in 8-byte blocks, so very short strings are bumped up to the next block size, using 2 * 2 + 4 = 8 SipHash rounds vs. the previous 2 * 1 + 4 = 6. I'm sure you can measure the difference. Also, this is a change to what the …
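To make those sizes concrete, here is a small sketch (not from the thread) using a hasher that merely counts the bytes fed to it; the exact numbers assume a 64-bit platform and the standard library's default forwarding of `write_usize`/`write_u8` to `write`:

```rust
use std::hash::{Hash, Hasher};

// Toy hasher that only counts how many bytes the Hash impls feed it.
struct CountingHasher {
    bytes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes += bytes.len();
    }
    fn finish(&self) -> u64 {
        self.bytes as u64
    }
}

fn bytes_fed<T: Hash + ?Sized>(value: &T) -> usize {
    let mut h = CountingHasher { bytes: 0 };
    value.hash(&mut h);
    h.bytes
}

fn main() {
    let s = "hello";
    // str: the contents plus a single 0xFF terminator byte.
    assert_eq!(bytes_fed(s), s.len() + 1);
    // [u8]: a usize length prefix (8 bytes on 64-bit) plus the contents.
    assert_eq!(bytes_fed(s.as_bytes()), s.len() + 8);
}
```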
This would be optimized if rust-lang/rfcs#1666 is accepted.
@arthurprs I don’t see how rust-lang/rfcs#1666 would have any bearing on this. All this should need is:

```diff
diff --git a/src/libcore/hash/mod.rs b/src/libcore/hash/mod.rs
index 051eb97..85427d0 100644
--- a/src/libcore/hash/mod.rs
+++ b/src/libcore/hash/mod.rs
@@ -328,8 +328,7 @@ mod impls {
     #[stable(feature = "rust1", since = "1.0.0")]
     impl Hash for str {
         fn hash<H: Hasher>(&self, state: &mut H) {
-            state.write(self.as_bytes());
-            state.write_u8(0xff)
+            self.as_bytes().hash(state)
         }
     }
```
Yes, but `&[u8].hash()` is internally the usize length prefix followed by the raw bytes (sketched below). And that …
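For reference, that slice behaviour can be reproduced by hand. A sketch, assuming `DefaultHasher`'s integer writes forward to `write` with native-endian bytes, as the `Hasher` trait defaults do:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    let data: &[u8] = b"hello";

    // What <[u8] as Hash>::hash does, spelled out: a usize length
    // prefix, then the raw bytes.
    let mut manual = DefaultHasher::new();
    manual.write_usize(data.len());
    manual.write(data);

    let mut via_impl = DefaultHasher::new();
    data.hash(&mut via_impl);

    assert_eq!(manual.finish(), via_impl.finish());
}
```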
I agree that it would be desirable for `str` and `[u8]` to hash the same way.
We can always make `&[u8]` hash to be similar to `&str` (appending 0xFF) and not the other way around if any performance regression is a concern. It's purely an implementation detail.
My preliminary benchmarking of hashing … Here’s one of my tests:

```rust
#![feature(test)]
extern crate test;

use std::collections::hash_map::DefaultHasher;
use std::hash::Hash;
use test::Bencher;

#[bench]
fn hash_str(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_str!("x.rs").hash(&mut hasher);
    });
}

#[bench]
fn hash_bytes(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_bytes!("x.rs").hash(&mut hasher);
    });
}
```

I had been thinking of specializing the …
@arthurprs I presume the length thing is to help with types with few possibilities, e.g. zero-sized types, so that a two-element …
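A sketch of that point about zero-sized types (my example, not from the thread): the elements contribute no bytes of their own, so only the length prefix keeps differently sized slices from hashing alike.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Zero-sized elements contribute no bytes of their own, so without
    // the length prefix both slices would feed the hasher nothing at all.
    let two: &[()] = &[(), ()];
    let three: &[()] = &[(), (), ()];
    assert_ne!(hash_of(two), hash_of(three));
}
```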
There's definitely a difference, but you need to use very small sequences, from 0 to 17 bytes. It'll also differ among hashers depending on the internal block size. I have plenty of test code around; I'll compile a test for this in a bit. @chris-morgan Precisely, they use different techniques to achieve the same thing (prefixing the usize len / appending a 0xFF).
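Such a test could look something like the following sketch (my guess at the shape, reusing the benchmark style from earlier in the thread): it hashes every prefix of a 17-byte string, covering the small-sequence range mentioned above.

```rust
#![feature(test)]
extern crate test;

use std::collections::hash_map::DefaultHasher;
use std::hash::Hash;
use test::{black_box, Bencher};

// Hash every prefix of a 17-byte string, covering the 0-to-17-byte
// range where the extra length-prefix bytes should matter most.
#[bench]
fn hash_short_strs(b: &mut Bencher) {
    let s = "abcdefghijklmnopq"; // 17 bytes
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        for len in 0..=s.len() {
            black_box(&s[..len]).hash(&mut hasher);
        }
    });
}
```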
FNV and other byte-at-a-time hashers will take a sizeable hit. SipHash has an internal block size of 8 and is more expensive overall, so it's less susceptible to the change.
Another real example of needing the length: tuples. They use no field separator themselves. We need to avoid hash-DoS with pairs of slices too.
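A quick illustration of the tuple point (my example, not from the thread): with no field separator, it is the slices' own length prefixes that keep these two pairs from feeding the hasher the same byte stream.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // Both pairs concatenate to "abc"; only the per-field length
    // prefixes keep their hash inputs distinct.
    let a: (&[u8], &[u8]) = (b"ab", b"c");
    let b: (&[u8], &[u8]) = (b"a", b"bc");

    let mut ha = DefaultHasher::new();
    a.hash(&mut ha);
    let mut hb = DefaultHasher::new();
    b.hash(&mut hb);
    assert_ne!(ha.finish(), hb.finish());
}
```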
We could make `[u8]` append a 0xFF like `str`; that would essentially guarantee no perf regressions. In theory this could be a breaking change, though.
I return to what I suggested: specialisation of the implementation on `[u8]`. I suppose it would also allow others to perform the same specialisation on their own integer newtype arrays if they really cared about it. I will also mention that … @arthurprs You’d need to do that for all …
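A hedged sketch of what that specialisation might look like. Since std's own `impl Hash for [T]` cannot be re-opened from outside the standard library, this uses a local stand-in trait, and it assumes the unstable (nightly-only) `specialization` feature:

```rust
#![feature(specialization)] // nightly-only, incomplete feature

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Local stand-in trait; std's `impl Hash for [T]` cannot be re-opened here.
trait MyHash {
    fn my_hash<H: Hasher>(&self, state: &mut H);
}

impl<T: Hash> MyHash for [T] {
    // Generic slices: usize length prefix, then the elements.
    default fn my_hash<H: Hasher>(&self, state: &mut H) {
        state.write_usize(self.len());
        for item in self {
            item.hash(state);
        }
    }
}

impl MyHash for [u8] {
    // Specialised byte slices: raw bytes plus a 0xFF terminator,
    // matching how str hashes today.
    fn my_hash<H: Hasher>(&self, state: &mut H) {
        state.write(self);
        state.write_u8(0xff);
    }
}

fn main() {
    let bytes: &[u8] = b"abc";
    let mut h = DefaultHasher::new();
    bytes.my_hash(&mut h); // takes the specialised [u8] path
    println!("{:x}", h.finish());
}
```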
I may be missing something here. Isn't that already done? Line 559 in b32267f, and the len prefix comes from Line 647 in b32267f.
@arthurprs I’m talking about the implementation of …
I feel we are talking about the same thing, though. What are you proposing exactly?
I was proposing specializing the actual … OK, so …
Exactly.
@arthurprs Appending 0xFF to `[u8]` is not protecting against manufactured collisions. Example:

```rust
use std::collections::HashSet;

fn main() {
    // The 257 (&[u8], &[u8]) tuples here all hash the same way
    // if hashing uses 0xFF-termination instead of a length prefix.
    let data = [0xFFu8; 256];
    let mut map = HashSet::new();
    for i in 0..data.len() + 1 {
        map.insert(data.split_at(i));
    }
}
```

The problem with this is that it gives a hash-function-independent way of generating arbitrarily many collisions, so it's the hash-DoS problem, which we want to protect against by default, or at least when using the default hasher.
My understanding is that the len prefix (slices) and 0xFF suffix (str) is added to prevent stuff like …
That's the same as my split example. It's crucial that the 0xFF byte never appears in a str's representation. And it doesn't, by the UTF-8 invariant.
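That invariant is easy to check (my example, not from the thread): 0xFF can never occur in well-formed UTF-8, since ASCII bytes are at most 0x7F, leading bytes at most 0xF4, and continuation bytes at most 0xBF.

```rust
fn main() {
    // 0xFF is never a valid byte in UTF-8: ASCII is at most 0x7F,
    // leading bytes at most 0xF4, continuation bytes at most 0xBF.
    assert!(std::str::from_utf8(&[0xFF]).is_err());
    assert!("héllo".as_bytes().iter().all(|&b| b != 0xFF));
}
```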
Triage: not aware of any changes here
`str` and `[u8]` hash differently and have done since the solution to #5257.

This is inconvenient in that it leads to things like the `StrTendril` type (from the `tendril` crate) hashing like a `[u8]` rather than like a `str`; one is thus unable to happily implement `Borrow<str>` on it in order to make `HashMap<StrTendril, _>` palatable.

`[u8]` gets its length prepended, while `str` gets 0xff appended. Sure, one u8 rather than one usize is theoretically cheaper to hash; but marginally so only, marginally so. I see no good reason why they should use different techniques, and so I suggest that `str` should be changed to use the hashing technique of `[u8]`. This will prevent potential nasty surprises and bring us back closer to the blissful land of “str is just [u8] with a UTF-8 guarantee”.

Hashes being internal matters, I do not believe this should be considered a breaking change, but it would still probably be a release-notes-worthy change as it could conceivably break eldritch codes.
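A quick demonstration of the current divergence (my example, not from the thread), using the deterministic `DefaultHasher::new()`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    let s = "abc";
    // Same bytes, different Hash impls: str appends 0xFF while [u8]
    // prepends its length, so the hasher sees different input streams.
    assert_ne!(hash_of(s), hash_of(s.as_bytes()));
}
```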